Probability

Now that we have gathered some data and extracted features from it in the form of variables, it is time to do inference, that is, make educated guesses with the data available. In order to do inference we first need to understand probability.

Consider probability as a measure of the likelihood that some event will occur, where $P(E)=0$ means that the event is impossible and $P(E)=1$ means that the event will occur with 100% certainty.

Probability can be determined theoretically or experimentally.

Theoretically

Consider a fair six-sided die. The theoretical probability of an event E is:

$$P(E)=\frac{\text{Possible ways of E}}{\text{Number of possible outcomes}}$$

Here the probability of rolling the fair six-sided die and getting the value 1 is

$$P(E=1)=\frac{1}{6}=16.7\%$$

On each roll only one value is possible (the die gives a single value at a time) and there are 6 possible values.

To make it easier to find theoretical probabilities, we may need to organize the data in tables or trees.

Experimentally

$$P(E)=\frac{\text{number of events occurred}}{\text{number of trials}}$$

Consider that we rolled the die 12 times with the results: 6, 3, 4, 1, 2, 2, 1, 3, 1, 5, 3, 5

$$P(E=1)=\frac{3}{12}=25\%$$

Here the experimental value differs from the theoretical one, but as we run more trials the experimental probability tends toward the theoretical value.
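
To see this convergence in practice, here is a minimal sketch (assuming Python with NumPy, which is used elsewhere in this book) that estimates $P(E=1)$ for an increasing number of rolls:

```python
import numpy as np

rng = np.random.default_rng(0)

# Theoretical probability of rolling a 1 with a fair six-sided die
p_theoretical = 1 / 6

# Estimate P(E=1) experimentally for an increasing number of rolls
for n_trials in [12, 120, 1200, 12000, 120000]:
    rolls = rng.integers(1, 7, size=n_trials)   # values in {1, ..., 6}
    p_experimental = np.mean(rolls == 1)        # fraction of rolls equal to 1
    print(f"{n_trials:>7} rolls: P(E=1) ~ {p_experimental:.4f} (theory: {p_theoretical:.4f})")
```

With only 12 rolls the estimate can be far off (as in the example above), but it approaches $1/6$ as the number of rolls grows.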

Some assumptions

Here are some basic rules to guide you:

  • Probability of A or B: $P(\text{A or B})=P(A)+P(B)-P(\text{A and B})$

  • Probability of A and B: $P(\text{A and B})=P(A \cap B)$, which equals $P(A) \cdot P(B)$ when A and B are independent

  • Probability of A not happening: $P(\text{not A})=1-P(A)$

  • The probabilities of all possible outcomes always sum to 1

Sometimes "OR" is substituted by the union symbol ∪\cup∪ and the "AND" is substituted by the intersection symbol ∩\cap∩

Conditional Probability

What is the probability that a die roll gives a value less than 4 (event B), given that the value is an odd number (event A)? In other words:

$$P(B|A)$$

As the die has 6 possible values (1, 2, 3, 4, 5, 6), the probability of getting a value less than 4 (1, 2, 3) is $P(B)=\frac{3}{6}=0.5$.

The probability of getting an odd number (1, 3, 5) is $P(A)=\frac{3}{6}=0.5$.

$$P(B|A)=\frac{P(A \cap B)}{P(A)}$$

In this case the outcomes that are both odd and less than 4 are 1 and 3, so:

$$P(A \cap B)=\frac{2}{6}=0.333$$

$$P(A)=0.5$$

$$P(B|A)=\frac{P(A \cap B)}{P(A)}=\frac{0.333}{0.5}=0.666$$
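
The same conditional probability can be approximated by simulation. A minimal sketch (assuming NumPy): roll the die many times, keep only the rolls where A happened (odd value), and measure how often B (value less than 4) holds among them:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)    # fair six-sided die

A = (rolls % 2 == 1)                        # event A: odd value
B = (rolls < 4)                             # event B: value less than 4

# P(B|A): among the rolls where A happened, how often did B also happen?
p_b_given_a = np.mean(B[A])
print(f"P(B|A) ~ {p_b_given_a:.3f} (theory: 2/3 ~ 0.667)")
```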

Dependence/Independence of events

It's important to determine whether events are dependent or not, because this affects how you calculate their probabilities.

  • Independent events: one event does not affect the likelihood that the next event occurs.

  • Dependent events: one event does affect the likelihood that the next event occurs.

Dependent event example:

You have a deck of cards; you shuffle it, draw one card and leave it out of the deck, then draw another card. What is the probability that both cards are jokers?

A standard deck has 52 cards, plus 2 jokers, so 54 cards in total.

So the probability of drawing a joker on the first draw (there are 2 jokers among the 54 cards) is:

$$P(\text{joker})=\frac{2}{54}\approx 0.037$$

Now leave that joker out of the deck, shuffle, and draw one card again. What is the probability of drawing another joker?

Now you need to take into account that one joker is already out of the deck:

$$P(\text{joker}_\text{again})=\frac{1}{53}\approx 0.019$$

The complete probability will be:

$$P(\text{joker and joker})_\text{dependent}=P(\text{joker}) \cdot P(\text{joker}_\text{again})=\frac{2}{54} \cdot \frac{1}{53}=\frac{1}{1431}$$

Independent event example:

Now we do the same experiment, but after drawing the first card we put it back in the deck. In this case, putting the card back makes the events independent.

Again, the probability of drawing a joker on the first draw (2 jokers among 54 cards) is:

$$P(\text{joker})=\frac{2}{54}\approx 0.037$$

Now we put the card back in the deck and shuffle. What is the probability of drawing a joker again?

$$P(\text{joker})=\frac{2}{54}\approx 0.037$$

The complete probability will be:

$$P(\text{joker and joker})_\text{independent}=P(\text{joker}) \cdot P(\text{joker})=\frac{2}{54} \cdot \frac{2}{54}=\frac{1}{729}$$
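
A minimal sketch (plain Python, standard library only) that simulates both experiments and checks the dependent and independent probabilities above. The deck is represented as 54 cards, 2 of which are jokers:

```python
import random

def p_two_jokers(with_replacement, n_trials=500_000, seed=0):
    """Estimate the probability that two drawn cards are both jokers."""
    rng = random.Random(seed)
    deck = ["joker"] * 2 + ["other"] * 52        # 54 cards, 2 jokers
    hits = 0
    for _ in range(n_trials):
        first = rng.choice(deck)
        if with_replacement:
            second = rng.choice(deck)            # card goes back: independent draws
        else:
            remaining = list(deck)
            remaining.remove(first)              # card stays out: dependent draws
            second = rng.choice(remaining)
        hits += (first == "joker" and second == "joker")
    return hits / n_trials

print("without replacement:", p_two_jokers(False), "theory:", 1 / 1431)
print("with replacement:   ", p_two_jokers(True),  "theory:", 1 / 729)
```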

Random Variables

A random variable is a numerical result of a stochastic system (random process), for example how many heads occur in a series of 20 coin flips. The name "variable" is a bit confusing: it is more like the output of a stochastic system. There are 2 types of random variables:

  • Discrete: for example $X = \text{number of heads in a series of coin flips}$

  • Continuous: for example $Y = \text{mass of a random animal at the zoo}$

Normally, any observation from a process that involves uncertainty is a random variable.

Example: X = Number of heads after 3 flips of a coin, calculate:

  • P(X=0)

  • P(X=1)

  • P(X=2)

  • P(X=3)

Enumerating all possible outcomes of a fair coin tossed 3 times:

HHH, HHT, HTH, HTT, THH, THT, TTH, TTT

In this case:

  • $P(X=0)=\frac{1}{8}$

  • $P(X=1)=\frac{3}{8}$

  • $P(X=2)=\frac{3}{8}$

  • $P(X=3)=\frac{1}{8}$

So, as mentioned before, X is the output of a probabilistic process. Now let's also draw the probability distribution of X.
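
This distribution can also be built programmatically. A minimal sketch (plain Python, standard library only) that enumerates all $2^3 = 8$ outcomes and counts the number of heads in each:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2^3 = 8 equally likely outcomes of 3 coin flips
outcomes = list(product("HT", repeat=3))

# X = number of heads in each outcome
counts = Counter(seq.count("H") for seq in outcomes)

# Probability mass function of X
pmf = {x: Fraction(n, len(outcomes)) for x, n in sorted(counts.items())}
print(pmf)   # {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}
```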

Probability Distribution

The probability distribution is a table, graph, or function that describes the probability of each possible outcome of a random process. Depending on the type of random variable, the probability distribution has a different name:

  • Probability mass function: if the variable is discrete.

  • Probability density function: if the variable is continuous.

All the values of a probability distribution are non-negative and sum (or integrate) to one.

The importance of the probability distribution is that you can easily infer information from it (a short sketch after this list computes these quantities):

  • Mode: the most probable value; the peak of the probability density function.

  • Mean or expected value: the weighted average of the possible values, using their probabilities as weights.

  • Median: the value such that the set of values smaller (or larger) than the median has a probability of one-half.
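
A minimal sketch (plain Python) that reads these three quantities off the probability mass function of X = number of heads in 3 coin flips, computed above:

```python
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}     # P(X=x) for X = heads in 3 flips

# Mode: the most probable value (here 1 and 2 tie; max returns the first)
mode = max(pmf, key=pmf.get)

# Mean (expected value): weighted average of the values
mean = sum(x * p for x, p in pmf.items())

# Median: smallest value where the cumulative probability reaches 0.5
cumulative, median = 0.0, None
for x in sorted(pmf):
    cumulative += pmf[x]
    if cumulative >= 0.5:
        median = x
        break

print(mode, mean, median)                   # 1 1.5 1
```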

The probability distribution also tells you something about the random process. For example, let X be the outcome of a fair die: in this case it's clear that all possible outcomes are equally probable (a uniform distribution).

Expected Value

The expected value is the average of a random variable: the sum of each possible value weighted by its probability (or an integral, for a continuous variable).

For a discrete random variable:

$$E[X]=\sum_{i=1}^{\infty} x_i \cdot p_i$$

For a continuous random variable:

$$E[X]=\int_{-\infty}^{\infty} x f(x)\, dx$$

Example: let X represent the outcome of a roll of a fair six-sided die; calculate the expected value (or expectation) of X.

$$E[X]=1 \cdot \frac{1}{6}+2 \cdot \frac{1}{6}+3 \cdot \frac{1}{6}+4 \cdot \frac{1}{6}+5 \cdot \frac{1}{6}+6 \cdot \frac{1}{6}=3.5$$

Another example: given the following probability distribution of a discrete random variable X, calculate E[X]:

| x    | 0    | 1    | 2    |
| ---- | ---- | ---- | ---- |
| p(x) | 0.16 | 0.48 | 0.36 |

$$E[X]=0 \cdot 0.16+1 \cdot 0.48+2 \cdot 0.36=1.2$$

Joint probability distribution

The joint probability distribution gives the probability of each combination of values of 2 or more random variables.

Things that you can get from joint probability distributions (a minimal sketch follows this list):

  • Check if the variables are independent and get the marginal probability function

  • Derive the joint distribution function

  • Derive the conditional probability function, conditional expectations, and conditional variance

  • Derive the joint expectation (expected value of the product of 2 random variables)
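
Here is a sketch (assuming NumPy) using a small hypothetical joint probability table for discrete variables X and Y, illustrating marginals, an independence check, a conditional distribution, and the joint expectation E[XY]:

```python
import numpy as np

# Hypothetical joint pmf P(X=x, Y=y): X in {0, 1} (rows), Y in {0, 1, 2} (columns)
joint = np.array([[0.10, 0.20, 0.10],
                  [0.20, 0.25, 0.15]])
assert np.isclose(joint.sum(), 1.0)

x_vals = np.array([0, 1])
y_vals = np.array([0, 1, 2])

# Marginal distributions: sum out the other variable
p_x = joint.sum(axis=1)                      # P(X=x)
p_y = joint.sum(axis=0)                      # P(Y=y)

# Independence check: X and Y independent iff P(X=x, Y=y) = P(X=x) P(Y=y) everywhere
independent = np.allclose(joint, np.outer(p_x, p_y))

# Conditional distribution P(Y=y | X=1): renormalize the X=1 row
p_y_given_x1 = joint[1] / p_x[1]

# Joint expectation E[XY]
e_xy = sum(x * y * joint[i, j]
           for i, x in enumerate(x_vals)
           for j, y in enumerate(y_vals))

print(p_x, p_y, independent, p_y_given_x1, e_xy)
```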
