Introduction to Probability and Statistics
This article introduces basic probability and statistics.
1. Introduction
Frequentist vs Bayesian interpretations
- Frequentists say that probability measures the frequency of various outcomes of an experiment. The frequentist approach has long been dominant in fields like biology, medicine, public health, and the social sciences.
- Bayesians say that probability is an abstract concept that measures a state of knowledge or a degree of belief in a given proposition. The Bayesian approach has enjoyed a resurgence in the era of powerful computers and big data.
Permutations and combinations
In general, the rule of product tells us that the number of permutations of k distinct elements chosen from a set of n is
P(n, k) = n!/(n − k)!
In combinations order does not matter: the number of ways to choose k elements from n is
C(n, k) = n!/(k! (n − k)!) = P(n, k)/k!
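The two counting formulas above can be checked with a short Python sketch; `math.comb` in the standard library computes combinations directly.

```python
import math

# Rule of product: permutations of k elements chosen from n is n!/(n-k)!
def permutations(n, k):
    return math.factorial(n) // math.factorial(n - k)

# Combinations ignore order: divide out the k! orderings of each selection.
def combinations(n, k):
    return permutations(n, k) // math.factorial(k)

print(permutations(5, 2))   # 20 ordered pairs from 5 items
print(combinations(5, 2))   # 10 unordered pairs
print(combinations(5, 2) == math.comb(5, 2))  # True: matches the stdlib
```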
2. Terminology
Experiment: a repeatable procedure with well-defined possible outcomes
Sample space: the set of all possible outcomes. We usually denote the sample space by Ω, sometimes by S.
Event: a subset of the sample space
Probability function: a function giving the probability for each outcome
Probability density: a continuous distribution of probabilities
Random variable: a random numerical outcome
A discrete sample space is one that is listable; it can be either finite or infinite.
The probability function
For a discrete sample space Ω = {ω₁, ω₂, ω₃, …}, a probability function P assigns to each outcome ω a number P(ω) called the probability of ω. P must satisfy two rules:
- 0 ≤ P(ω) ≤ 1 (probabilities are between 0 and 1)
- The sum of the probabilities of all possible outcomes is 1, that is, P(ω₁) + P(ω₂) + … = 1
The probability of an event E is the sum of the probabilities of all the outcomes in E. That is,
P(E) = Σ P(ω), summed over the outcomes ω in E
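As a small concrete example, here is the rule P(E) = Σ P(ω) applied to a fair six-sided die, using exact fractions.

```python
from fractions import Fraction

# pmf of a fair six-sided die: each outcome has probability 1/6
omega = {k: Fraction(1, 6) for k in range(1, 7)}

assert sum(omega.values()) == 1  # rule 2: probabilities sum to 1

# P(E) is the sum of the probabilities of the outcomes in E
def prob(event):
    return sum(omega[w] for w in event)

evens = {2, 4, 6}
print(prob(evens))  # 1/2
```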
3. Conditional probability
The conditional probability of A given B is defined by P(A|B) = P(A ∩ B)/P(B), provided P(B) ≠ 0.
Multiplication rule: P(A ∩ B) = P(A|B) · P(B)
Law of total probability
Suppose the sample space Ω is divided into disjoint events B₁, B₂, …, Bₙ. Then for any event A,
P(A) = P(A|B₁)P(B₁) + P(A|B₂)P(B₂) + … + P(A|Bₙ)P(Bₙ)
Independence
Two events are independent if knowledge that one occurred does not change the probability that the other occurred. I.e., A and B are independent if P(A|B) = P(A), or equivalently P(A ∩ B) = P(A) · P(B).
Bayes' Theorem
Bayes' theorem is a pillar of both probability and statistics. For two events A and B it says
P(B|A) = P(A|B) · P(B) / P(A)
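Bayes' theorem combined with the law of total probability can be sketched numerically. The numbers below are made up for illustration: a hypothetical test with 99% sensitivity, a 5% false-positive rate, and 1% prevalence.

```python
# Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A), with P(A) expanded by
# the law of total probability. All numbers are illustrative assumptions.
p_B = 0.01                 # P(B): prior probability of the condition
p_A_given_B = 0.99         # P(A|B): positive test given the condition
p_A_given_notB = 0.05      # P(A|~B): false-positive rate

# Law of total probability: P(A) = P(A|B)P(B) + P(A|~B)P(~B)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

p_B_given_A = p_A_given_B * p_B / p_A
print(round(p_B_given_A, 3))  # the posterior is far below the sensitivity
```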
4. Discrete Random Variables
Random variables
Def: Let Ω be a sample space. A discrete random variable is a function X: Ω → ℝ that takes a discrete set of values.
Def: the probability mass function (pmf) of a discrete random variable X is the function p(a) = P(X = a).
- We always have 0 ≤ p(a) ≤ 1.
- We allow a to be any number. If a is a value that X never takes, then p(a) = 0.
Def: The cumulative distribution function (cdf) of a random variable X is the function F(a) = P(X ≤ a).
Properties of the cdf
- F is non-decreasing
- 0 ≤ F(a) ≤ 1
- F(a) → 1 as a → ∞, and F(a) → 0 as a → −∞
Specific distributions
Bernoulli distributions
Model: The Bernoulli distribution models one trial in an experiment that can result in either success or failure. A random variable X ~ Bernoulli(p) takes the values 0 and 1, and
P(X = 1) = p, P(X = 0) = 1 − p
Binomial distributions
The binomial distribution binomial(n, p) models the number of successes in n independent Bernoulli(p) trials. Its pmf is
p(k) = P(X = k) = C(n, k) p^k (1 − p)^(n−k) for k = 0, 1, …, n
Geometric distributions
A geometric distribution models the number of tails before the first head in a sequence of coin flips (Bernoulli trials)
Formal definition: the random variable X ~ geometric(p) takes the values 0, 1, 2, 3, …; its pmf is given by
p(k) = P(X = k) = (1 − p)^k p
Uniform distribution
The uniform distribution uniform(N) models any situation where all the outcomes are equally likely: X takes the values 1, 2, 3, …, N, each with probability 1/N.
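The four pmfs above can be written out directly; a quick sanity check is that each one sums to 1 (the geometric series is truncated here, so its sum is only approximately 1). The parameter values are arbitrary choices for illustration.

```python
import math

p, n, N = 0.3, 10, 6   # illustrative parameter choices

# Bernoulli(p): P(X=1) = p, P(X=0) = 1-p
bernoulli = {0: 1 - p, 1: p}

# binomial(n, p): number of successes in n Bernoulli(p) trials
binomial = {k: math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

# geometric(p): number of failures before the first success (truncated series)
geometric = {k: (1 - p)**k * p for k in range(200)}

# uniform(N): values 1..N, each with probability 1/N
uniform = {k: 1 / N for k in range(1, N + 1)}

for pmf in (bernoulli, binomial, geometric, uniform):
    print(round(sum(pmf.values()), 6))  # each should be (close to) 1
```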
Expected value
Def: Suppose X is a discrete random variable that takes values x₁, x₂, … with probabilities p(x₁), p(x₂), …. The expected value of X is
E[X] = x₁ p(x₁) + x₂ p(x₂) + …
Algebraic properties of E[X]
1. If X and Y are random variables on a sample space Ω then E[X + Y] = E[X] + E[Y].
2. If a and b are constants then E[aX + b] = a E[X] + b.
Thus the expected value of a sum of random variables is the sum of their expected values; expectation is linear.
Proofs of the algebraic properties of E[X] follow from the definition by rearranging sums. For property 1:
E[X + Y] = Σ (xᵢ + yⱼ) P(X = xᵢ, Y = yⱼ), and splitting the sum into two pieces gives E[X] + E[Y].
A similar series computation gives the mean of a geometric(p) distribution:
E[X] = (1 − p)/p
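The geometric mean formula can be checked numerically by truncating the defining series E[X] = Σ k (1 − p)^k p; with p = 0.25 the formula gives (1 − p)/p = 3.

```python
# Numerical check that a geometric(p) random variable (number of failures
# before the first success) has mean (1-p)/p, by truncating the series.
p = 0.25
mean = sum(k * (1 - p)**k * p for k in range(10_000))
print(round(mean, 6))   # ~ 3.0
print((1 - p) / p)      # 3.0 exactly
```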
Expected values of functions of a random variable
If X is a discrete random variable taking values x₁, x₂, … and h is a function, then
E[h(X)] = h(x₁) p(x₁) + h(x₂) p(x₂) + …
5. Variance of Discrete Random Variables
Variance and standard deviation
Taking the mean as the center of a random variable's probability distribution, the variance is a measure of how much the probability mass is spread out around this center.
Definition: If X is a random variable with mean E[X] = μ, then the variance of X is defined by
Var(X) = E[(X − μ)^2]
The standard deviation σ of X is defined by σ = √Var(X).
If the relevant random variable is clear from context, then the variance and standard deviation are often denoted by σ^2 and σ.
The variance of a Bernoulli(p) random variable
Bernoulli random variables are fundamental. If X ~ Bernoulli(p), then
Var(X) = (0 − p)^2 (1 − p) + (1 − p)^2 p = p(1 − p)
Def: The discrete random variables X and Y are independent if P(X = a, Y = b) = P(X = a) P(Y = b) for all values a, b.
Properties of variance
The three most useful properties for computing variance are:
1. If X and Y are independent then Var(X + Y) = Var(X) + Var(Y).
2. For constants a and b, Var(aX + b) = a^2 Var(X).
3. Var(X) = E[X^2] − E[X]^2.
Variance of binomial(n, p)
Suppose X ~ binomial(n, p). Since X is the sum of n independent Bernoulli(p) random variables, property 1 gives Var(X) = n p(1 − p).
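The formula Var(X) = np(1 − p) can be verified directly from the binomial pmf using property 3, Var(X) = E[X^2] − E[X]^2. The parameters below are an arbitrary illustrative choice.

```python
import math

# Check Var(binomial(n, p)) = n*p*(1-p) directly from the pmf.
n, p = 12, 0.3
pmf = {k: math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

EX = sum(k * pk for k, pk in pmf.items())        # E[X]
EX2 = sum(k * k * pk for k, pk in pmf.items())   # E[X^2]
var = EX2 - EX**2                                # property 3

print(round(EX, 6))    # n*p = 3.6
print(round(var, 6))   # n*p*(1-p) = 2.52
```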
6. Continuous Random Variables
Continuous random variables and probability density functions
A continuous random variable takes a range of values, which may be finite or infinite in extent.
Def: A random variable X is continuous if there is a function f(x) such that for any c ≤ d we have
P(c ≤ X ≤ d) = ∫ from c to d of f(x) dx
The function f(x) is called the probability density function (pdf).
Unlike a pmf, a pdf is not itself a probability: f(x) can be greater than 1, and only its integrals give probabilities.
Cumulative distribution function
The cumulative distribution function (cdf) of a continuous random variable X is F(b) = P(X ≤ b) = ∫ from −∞ to b of f(x) dx.
Properties of cumulative distribution functions
- F is non-decreasing.
- 0 ≤ F(b) ≤ 1.
- F(b) → 1 as b → ∞ and F(b) → 0 as b → −∞.
- P(c ≤ X ≤ d) = F(d) − F(c).
- F′(x) = f(x).
Uniform distribution
- Parameters: a, b
- Range: [a, b]
- Notation: uniform(a, b) or U(a, b)
- Density: f(x) = 1/(b − a) for a ≤ x ≤ b
- Distribution: F(x) = (x − a)/(b − a) for a ≤ x ≤ b
- Models: All outcomes in the range have equal probability.
Exponential distribution
- Parameter: λ
- Range: [0, ∞)
- Notation: exponential(λ) or exp(λ)
- Density: f(x) = λ e^(−λx) for 0 ≤ x
- Distribution: F(x) = 1 − e^(−λx) for 0 ≤ x
- Right tail distribution: P(X > x) = e^(−λx)
- Models: The waiting time for a continuous process to change state.
Memorylessness
* The exponential distribution has the property that it is memoryless: P(X > s + t | X > s) = P(X > t).
* Proof: P(X > s + t | X > s) = P(X > s + t)/P(X > s) = e^(−λ(s+t))/e^(−λs) = e^(−λt) = P(X > t).
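The memoryless identity can be checked numerically from the tail formula P(X > x) = e^(−λx); the values of λ, s, and t below are arbitrary.

```python
import math

# Numerical check of memorylessness for the exponential distribution:
# P(X > s + t | X > s) = P(X > t), using the tail P(X > x) = e^{-lam*x}.
lam, s, t = 0.5, 2.0, 3.0

def tail(x):
    return math.exp(-lam * x)

lhs = tail(s + t) / tail(s)   # conditional tail probability
rhs = tail(t)
print(round(lhs, 10) == round(rhs, 10))  # True
```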
Normal distribution
- Parameters: μ, σ
- Range: (−∞, ∞)
- Notation: normal(μ, σ^2) or N(μ, σ^2)
- Density: f(x) = (1/(σ√(2π))) e^(−(x − μ)^2 / (2σ^2))
- Distribution: F(x) has no closed-form formula; it is computed numerically or from tables.
- Models: Measurement error, intelligence/ability, height, averages of lots of data.
The standard normal distribution
The standard normal distribution is N(0, 1). Its density is φ(z) = (1/√(2π)) e^(−z^2/2) and its cdf is denoted Φ(z).
Normal probabilities
To make approximations it is useful to remember the following rule of thumb for three approximate probabilities:
P(−1 ≤ Z ≤ 1) ≈ 0.68, P(−2 ≤ Z ≤ 2) ≈ 0.95, P(−3 ≤ Z ≤ 3) ≈ 0.997
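The rule of thumb can be verified with the standard library: Φ(z) can be written in terms of the error function, Φ(z) = (1 + erf(z/√2))/2.

```python
import math

# The standard normal cdf via the error function:
# Phi(z) = (1 + erf(z / sqrt(2))) / 2
def phi(z):
    return (1 + math.erf(z / math.sqrt(2))) / 2

# P(-k <= Z <= k) = Phi(k) - Phi(-k) for k = 1, 2, 3
for k in (1, 2, 3):
    print(k, round(phi(k) - phi(-k), 4))
# 1 0.6827, 2 0.9545, 3 0.9973
```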
Pareto distribution
- Parameters: m and α
- Range: [m, ∞)
- Notation: Pareto(m, α)
- Density: f(x) = α m^α / x^(α+1) for x ≥ m
- Distribution: F(x) = 1 − m^α/x^α, for x ≥ m.
- Tail distribution: P(X > x) = m^α/x^α, for x ≥ m.
- Models: The Pareto distribution models a power law, where the probability that an event occurs varies as a power of some attribute of the event. Many phenomena follow a power law, such as the sizes of meteors, income levels across a population, and population levels across cities.
Transformations of random variables
For continuous random variables, transforming the pdf is just a change of variables ('u-substitution') from calculus. For example, if Y = aX + b with a > 0, then F_Y(y) = F_X((y − b)/a), and differentiating gives f_Y(y) = f_X((y − b)/a)/a.
Expectation, variance and standard deviation for continuous random variables
These summary statistics have the same meaning for continuous random variables as in the discrete case:
* The expected value μ = E[X] is a measure of location or central tendency.
* The standard deviation σ is a measure of spread or scale.
Expected value of a continuous random variable
Def: Let X be a continuous random variable with range [a, b] and pdf f(x). The expected value of X is
E[X] = ∫ from a to b of x f(x) dx
Properties of E[X]
Expectation is linear, exactly as in the discrete case: E[X + Y] = E[X] + E[Y], and E[cX + d] = c E[X] + d for constants c and d.
Expectation of a function of X
If h is a function, then E[h(X)] = ∫ from a to b of h(x) f(x) dx.
Variance of a continuous random variable
Def: Let X be a continuous random variable with range [a, b], pdf f(x), and mean μ. The variance of X is
Var(X) = E[(X − μ)^2] = ∫ from a to b of (x − μ)^2 f(x) dx
Properties of variance
- If X and Y are independent then Var(X + Y) = Var(X) + Var(Y).
- For constants c and d, Var(cX + d) = c^2 Var(X).
- Theorem: Var(X) = E[X^2] − E[X]^2.
Quantiles
Def: The median of X is the value x for which P(X ≤ x) = 0.5, i.e. the point with equal probability on either side.
Def: The p-th quantile of X is the smallest value q_p such that F(q_p) = P(X ≤ q_p) = p.
In this notation, the median is q_0.5.
7. Central Limit Theorem and the Law of Large Numbers
The law of large numbers
Suppose X₁, X₂, …, Xₙ are independent random variables, all with the same mean μ and variance σ^2.
Let X̄_n = (X₁ + X₂ + … + Xₙ)/n be the sample average.
Note that E[X̄_n] = μ and Var(X̄_n) = σ^2/n.
- The law of large numbers (LoLN): As n grows, the probability that X̄_n is close to μ goes to 1.
- The central limit theorem (CLT): As n grows, the distribution of X̄_n converges to the normal distribution N(μ, σ^2/n).
Formal statement of the law of large numbers
Theorem: Suppose X₁, X₂, … are i.i.d. random variables with mean μ. For each n, let X̄_n be the average of the first n variables. Then for any a > 0,
P(|X̄_n − μ| < a) → 1 as n → ∞
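The law of large numbers is easy to see in simulation: averages of fair-die rolls (true mean μ = 3.5) concentrate around 3.5 as n grows. The sample sizes and seed are arbitrary choices.

```python
import random

# Simulation of the law of large numbers with fair-die rolls (mu = 3.5).
random.seed(0)

def sample_mean(n):
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, round(sample_mean(n), 3))
# the printed averages are random, but they should approach 3.5
```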
The Central Limit Theorem
Standardization
Given a random variable X with mean μ and standard deviation σ, its standardization is Z = (X − μ)/σ, which has mean 0 and standard deviation 1.
Statement of the Central Limit Theorem
Suppose X₁, X₂, …, Xₙ are i.i.d. random variables, each with mean μ and standard deviation σ. Let X̄_n = (X₁ + … + Xₙ)/n and S_n = X₁ + … + Xₙ.
The properties of mean and variance show E[X̄_n] = μ, Var(X̄_n) = σ^2/n, E[S_n] = nμ, and Var(S_n) = nσ^2.
Central Limit Theorem: For large n,
X̄_n ≈ N(μ, σ^2/n) and S_n ≈ N(nμ, nσ^2)
- In words, X̄_n is approximately a normal distribution with the same mean as X but a smaller variance, and S_n is approximately normal.
- Standardized: Z_n = (X̄_n − μ)/(σ/√n) = (S_n − nμ)/(σ√n) is approximately standard normal.
A precise statement of the CLT is that the cdf's of Z_n converge to Φ, the cdf of the standard normal.
Standard normal probabilities
To apply the CLT, we will want to have some normal probabilities. Let Z ~ N(0, 1). Then
P(|Z| < 1) ≈ 0.68, P(|Z| < 2) ≈ 0.95, P(|Z| < 3) ≈ 0.997
more precisely, P(|Z| < 1.96) ≈ 0.95, so 1.96 is often used for 95% probability.
How big does n have to be to apply the CLT?
Short answer: often, not that big.
8. Introduction to statistics
Statistics deals with data. Generally speaking, the goal of statistics is to make inferences based on data.
What is a statistic
Def: A statistic is anything that can be computed from the collected data.
In scientific experiments we start with a hypothesis and collect data
to test the hypothesis. We will often let H stand for the hypothesis and D for the data, so Bayes' theorem reads P(H|D) = P(D|H) P(H)/P(D).
The left-hand term P(H|D) is the probability our hypothesis is true given the data we collected.
9. Maximum Likelihood Estimates
We are often faced with the situation of having random data which we know (or believe) is drawn from a parametric model whose parameters we do not know.
Maximum Likelihood Estimates
There are many methods for estimating unknown parameters from data. We first consider the maximum likelihood estimate (MLE), which answers the question: For which parameter value does the observed data have the biggest probability?
The MLE is an example of a point estimate because it gives a single value for the unknown parameter.
Def: Given data, the maximum likelihood estimate (MLE) for the parameter p is the value of p that maximizes the likelihood P(data | p).
The MLE can be computed with calculus: differentiate the likelihood (or log likelihood) with respect to the parameter, set the derivative to zero, and solve.
Log likelihood
It is often easier to work with the natural log of the likelihood function. Since ln(x) is monotonically increasing, the log likelihood is maximized at the same parameter value as the likelihood itself.
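As a sketch, here is the MLE for the probability of heads from coin-flip data. For k heads in n tosses, calculus gives the closed form p̂ = k/n; the code checks that a grid search over the log likelihood lands on the same value. The data (7 heads in 10 tosses) is illustrative.

```python
import math

# MLE for the heads probability of a coin. The likelihood for k heads in
# n tosses is C(n,k) p^k (1-p)^(n-k); the constant C(n,k) does not affect
# the maximizer, so we can drop it from the log likelihood.
k, n = 7, 10

def log_lik(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_lik)
print(p_hat)  # 0.7, matching the closed form k/n
```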
Maximum likelihood for continuous distributions
For continuous distributions, we use the probability density function
to define the likelihood.
Properties of the MLE
The MLE behaves well under transformations. That is, if p̂ is the MLE for p and g is a one-to-one function, then g(p̂) is the MLE for g(p).
Furthermore, the MLE is asymptotically unbiased and has asymptotically minimal variance. Note that the MLE is itself a random variable, since the data is random and the MLE is computed from the data.
10. Bayesian Updating
Bayesian with discrete priors
Review of Bayes' theorem
If H and D are events, then Bayes' theorem says
P(H|D) = P(D|H) P(H) / P(D)
Some terminology
- Experiment: for example, pick a coin from the drawer at random, flip it, and record the result.
- Data: the result of our experiment. In this case the event D = "heads". We think of D as data that provides evidence for or against each hypothesis.
- Hypotheses: we are testing several hypotheses. For example, the coin is fair, or the coin is unfair with probability 0.7 of giving heads.
- Prior probability: the probability of each hypothesis prior to tossing the coin (collecting data).
- Likelihood: (This is the same likelihood we used for the MLE.) The likelihood function is P(D|H), i.e., the probability of the data assuming that the hypothesis is true. Most often we will consider the data as fixed and let the hypothesis vary, e.g. the probability of heads if the coin is fair.
- Posterior probability: the probability (posterior to the data) of each hypothesis given the data from tossing the coin: P(H|D). These posterior probabilities are what the problem asks us to find.
The Bayes numerator is the product of the prior and the likelihood: P(D|H) P(H).
The process of going from the prior probability P(H) to the posterior P(H|D) is called Bayesian updating.
We can express Bayes' theorem in words:
posterior = likelihood × prior / probability of the data
This leads to the most elegant form of Bayes' theorem in the context of Bayesian updating:
posterior ∝ likelihood × prior
Prior and posterior probability mass functions
Our standard notations will be:
* p(θ) is the prior pmf of the hypothesis θ
* p(θ|D) is the posterior pmf of θ given the data D
Probabilistic prediction
Probabilistic prediction simply means assigning a probability to each possible outcome of an experiment.
Def: Probabilities for the next toss computed using the prior, before we collect any data, are called prior predictive probabilities.
Def: Probabilities for the next toss computed after collecting data and updating the prior to the posterior are called posterior predictive probabilities.
Prior and posterior probabilities are for hypotheses. Prior predictive and posterior predictive probabilities are for data.
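The updating process can be sketched in a few lines for the two hypotheses in the text: the coin is fair (P(heads) = 0.5) or unfair (P(heads) = 0.7). The 50/50 prior over the hypotheses is an assumption for illustration.

```python
# Discrete Bayesian updating: posterior ∝ likelihood × prior.
hypotheses = {"fair": 0.5, "unfair": 0.7}   # P(heads | hypothesis)
prior = {"fair": 0.5, "unfair": 0.5}        # assumed 50/50 prior

def update(prior, data):
    # Bayes numerator = likelihood * prior; normalize to get the posterior.
    likelihood = {h: (p if data == "heads" else 1 - p)
                  for h, p in hypotheses.items()}
    numer = {h: likelihood[h] * prior[h] for h in prior}
    total = sum(numer.values())             # P(data), by total probability
    return {h: numer[h] / total for h in numer}

posterior = update(prior, "heads")
print({h: round(q, 3) for h, q in posterior.items()})
```

Seeing heads shifts belief toward the 0.7-heads coin, as expected.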
Odds
When comparing two events, it is common to phrase probability statements in terms of odds.
Def: The odds of event E against event E′ are the ratio of their probabilities, P(E)/P(E′). If E′ is unspecified, it is taken to be the complement of E, so the odds of E are O(E) = P(E)/P(Eᶜ).
For example, the odds of rolling a 4 with a fair die are (1/6)/(5/6) = 1/5, often written 1 : 5.
Conversion formulas
If P(E) = p, then the odds are O(E) = p/(1 − p). Conversely, if the odds of E are q, then P(E) = q/(1 + q).
Updating odds
We can update prior odds to posterior odds with Bayesian updating.
Bayes factors and strength of evidence
Def: For a hypothesis H and data D, the Bayes factor is the ratio of likelihoods:
Bayes factor = P(D|H) / P(D|Hᶜ)
posterior odds = Bayes factor × prior odds
We see that the Bayes factor tells us whether the data provides evidence for or against the hypothesis: a factor greater than 1 is evidence for H, and a factor less than 1 is evidence against.
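The odds conversions and the odds-updating rule can be sketched together; the numbers reuse the coin example (H = the coin has P(heads) = 0.7, against the fair coin), with even prior odds as an assumption.

```python
# Odds conversions and updating: posterior odds = Bayes factor * prior odds.
def odds(p):
    return p / (1 - p)

def prob(o):
    return o / (1 + o)

prior_odds = odds(0.5)              # even prior odds: 1.0
bayes_factor = 0.7 / 0.5            # P(heads|H) / P(heads|~H) = 1.4
posterior_odds = bayes_factor * prior_odds
print(posterior_odds, round(prob(posterior_odds), 3))
```

Converting the posterior odds back to a probability gives 1.4/2.4 ≈ 0.583, the same posterior the update table method produces.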
Continuous Priors
Notational conventions
We will often use the letter θ to stand for an unknown parameter or hypothesis, treating it as a random quantity with its own distribution.
We have two parallel notations for outcomes and probability:
1. (Big letters) Event A, probability function P(A).
2. (Little letters) Value x, pmf p(x) or pdf f(x).
These notations are related by P(X = x) = p(x) in the discrete case, and P(c ≤ X ≤ d) = ∫ from c to d of f(x) dx in the continuous case.
The law of total probability
Theorem: Law of total probability. Suppose we have a continuous parameter θ with pdf f(θ) on the range [a, b], and data x with likelihood p(x|θ). Then the total probability of the data is
p(x) = ∫ from a to b of p(x|θ) f(θ) dθ
Bayes' theorem for continuous probability densities
Theorem: Bayes' theorem. Use the same assumptions as in the law of total probability. Then the posterior pdf of θ given the data x is
f(θ|x) = p(x|θ) f(θ) / p(x) = p(x|θ) f(θ) / ∫ from a to b of p(x|θ) f(θ) dθ
Flat priors
One important prior is called a flat or uniform prior. A flat prior assumes that every hypothesis is equally probable.
11. Beta distributions
The beta distribution
The beta distribution beta(a, b) is a two-parameter family of distributions on the range [0, 1], with pdf
f(θ) = c θ^(a−1) (1 − θ)^(b−1)
where c is a normalizing constant.
Beta priors and posteriors for binomial random variables
If the probability of heads is θ with a beta(a, b) prior, and the data is one toss landing heads, the Bayesian update table is:

hypothesis | data | prior | likelihood | posterior
---|---|---|---|---
θ | heads | c₁ θ^(a−1) (1 − θ)^(b−1) | θ | c₂ θ^a (1 − θ)^(b−1)

So the posterior is a beta(a + 1, b) distribution.
Conjugate priors
The beta distribution is called a conjugate prior for the binomial distribution. This means that if the likelihood function is binomial, then a beta prior gives a beta posterior. In fact, the beta distribution is a conjugate prior for the Bernoulli and geometric distributions as well.
12. Conjugate priors
Conjugate priors are useful because they reduce Bayesian updating to modifying the parameters of the prior distribution (so-called hyperparameters) rather than computing integrals.
Def: Suppose we have data with likelihood function p(x|θ) depending on a parameter θ. A family of distributions is a conjugate prior for this likelihood if, whenever the prior is in the family, so is the posterior.
Beta distribution
We saw that the beta distribution is a conjugate prior for the binomial distribution. This means that if the likelihood function is binomial and the prior distribution is beta then the posterior is also beta.
More specifically, suppose that the likelihood follows a binomial(n, θ) distribution and the prior is beta(a, b). If the data is x successes in n trials, then the posterior is beta(a + x, b + n − x).
Here a and b are the hyperparameters; updating just adds the number of successes to a and the number of failures to b.
The beta distribution is a conjugate prior for a geometric likelihood as well.
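The beta-binomial update is just hyperparameter arithmetic, which is the whole point of conjugacy. The prior beta(2, 2) and the data (7 heads in 10 tosses) are illustrative choices.

```python
# Conjugate updating: with a beta(a, b) prior on the heads probability and
# a binomial likelihood, observing x successes in n trials gives a
# beta(a + x, b + n - x) posterior -- no integration needed.
def beta_binomial_update(a, b, x, n):
    return a + x, b + (n - x)

a, b = 2, 2          # prior hyperparameters (illustrative choice)
x, n = 7, 10         # observed 7 heads in 10 tosses
a_post, b_post = beta_binomial_update(a, b, x, n)
print(a_post, b_post)                        # 9 5
print(round(a_post / (a_post + b_post), 3))  # posterior mean a/(a+b) = 0.643
```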
Normal begets normal
The normal distribution is its own conjugate prior. In particular, if the likelihood function is normal with known variance, then a normal prior gives a normal posterior.
Normal-normal update formulas for n data points
Suppose the data x₁, …, xₙ is drawn from N(θ, σ^2) with σ^2 known, and the prior on θ is N(μ_prior, σ_prior^2). Then the posterior is N(μ_post, σ_post^2), where
a = 1/σ_prior^2, b = n/σ^2, μ_post = (a μ_prior + b x̄)/(a + b), σ_post^2 = 1/(a + b)
13. Choosing priors
When the prior is known there is no controversy on how to proceed. The art of statistics starts when the prior is not known with certainty. There are two main schools on how to proceed in this case: Bayesian and frequentist.
Recall that given data D and a hypothesis H, Bayes' theorem says P(H|D) = P(D|H) P(H)/P(D): the posterior is proportional to the likelihood times the prior.
- Bayesian: Bayesians make inferences using the posterior P(H|D), and therefore always need a prior P(H). If a prior is not known with certainty, the Bayesian must try to make a reasonable choice. There are many ways to do this, and reasonable people might make different choices. In general it is good practice to justify your choices and to explore a range of priors to see if they all point to the same conclusion.
  - Benefits:
    - The posterior probability P(H|D) for the hypothesis given the evidence is usually exactly what we'd like to know. The Bayesian can say something like "the parameter of interest has probability 0.95 of being between 0.49 and 0.51."
    - The assumptions that go into choosing the prior can be clearly spelled out.
  - Choosing a prior:
    - Uniform prior/flat prior: every hypothesis is equally probable.
    - Informed prior: a prior built from genuine prior information.
    - Rigid priors: too rigid a prior belief can overwhelm any amount of data.
- Frequentist: Very briefly, frequentists do not try to create a prior. Instead, they make inferences using only the likelihood P(D|H).
- More good data: It is always the case that more good data allows for stronger conclusions and lessens the influence of the prior. The emphasis should be as much on good data (quality) as on more data (quantity).
14. Probability intervals
Suppose we have a pmf p(θ) or pdf f(θ) describing our belief about the value of an unknown parameter θ.
Def: A p-probability interval for θ is an interval [a, b] with P(a ≤ θ ≤ b) = p.
- In the discrete case with pmf p(θ), this means Σ p(θⱼ) = p, summed over the θⱼ with a ≤ θⱼ ≤ b.
- In the continuous case with pdf f(θ), this means ∫ from a to b of f(θ) dθ = p.
- We may say 90%-probability interval to mean 0.9-probability interval. Probability intervals are also called credible intervals to contrast them with confidence intervals.
Notice that the p-probability interval for θ is not unique.
Symmetric probability intervals
The interval [q_(1−p)/2, q_(1+p)/2] is the symmetric p-probability interval: it leaves probability (1 − p)/2 in each tail.
Uses of probability intervals
Summarizing and communicating your beliefs
Constructing a prior using subjective probability intervals
Probability intervals are also useful when we do not have a pmf or pdf at hand. In this case, subjective probability intervals give us a method for constructing a reasonable prior for θ.
15. The Frequentist School of Statistics
Both schools of statistics start with probability. In particular, both know and love Bayes' theorem. Bayes' theorem is a complete recipe for updating our beliefs in the face of new data. In practice, different people will have different prior beliefs, but we would still like to make useful inferences from data. Bayesians and frequentists take fundamentally different approaches to this challenge.
* Bayesians require a prior, so they develop one from the best information they have.
* Without a known prior, frequentists draw inferences from just the likelihood function.
In short, Bayesians put probability distributions on everything (hypotheses and data), while frequentists put probability distributions on (random, repeatable, experimental) data given a hypothesis. For the frequentist, when dealing with data from an unknown distribution, only the likelihood has meaning. The prior and posterior do not.
- Point statistic: A point statistic is a single value computed from data. For example, the mean and the maximum are both point statistics.
- Interval statistic: An interval statistic is an interval computed from data. For example, the range from the minimum to maximum is an interval statistic.
- Set statistic: A set statistic is a set computed from data. For example, the set of observed outcomes from rolls of a die.
- Sampling distribution: The probability distribution of a statistic is called its sampling distribution.
- Point estimate: We can use statistics to make a point estimate of a parameter θ. For example, the sample mean x̄ is a point estimate of the mean μ of the underlying distribution.
16. Null Hypothesis Significance Testing
Introduction
Frequentist statistics is often applied in the framework of null hypothesis significance testing (NHST). Stated simply, this method asks if the data is well outside the region where we would expect to see it under the null hypothesis. If so, then we reject the null hypothesis in favor of a second hypothesis called the alternative hypothesis.
The computations done here all involve the likelihood function. There are two main differences between what we'll do here and what we did in Bayesian updating:
1. The evidence of the data will be considered purely through the likelihood function; it will not be weighted by our prior beliefs.
2. We will need a notion of extreme data, e.g. 95 out of 100 heads in a coin toss or a mayfly that lives for a month.
Significance testing
Ingredients
- H₀: the null hypothesis. This is the default assumption for the model generating the data.
- Hₐ: the alternative hypothesis. If we reject the null hypothesis, we accept this alternative as the best explanation for the data.
- X: the test statistic. We compute this from the data.
- Null distribution: the probability distribution of X assuming H₀.
- Rejection region: if X is in the rejection region we reject H₀ in favor of Hₐ.
- Non-rejection region: the complement of the rejection region.
Simple and composite hypotheses
Def: (simple hypothesis): A simple hypothesis is one for which we can specify its distribution completely. A typical simple hypothesis is that a parameter of interest takes a specific value.
Def: (composite hypothesis): If its distribution cannot be fully specified, we say that the hypothesis is composite. A typical composite hypothesis is that a parameter of interest lies in a range of values.
Types of error
There are two types of errors we can make. We can incorrectly reject the null hypothesis when it is true or we can incorrectly fail to reject it when it is false. These are unimaginatively labeled type I and type II errors.
Significance level and power
Significance level and power are used to quantify the quality of the significance test. Ideally a significance test would not make errors.
The two probabilities we focus on are:
- Significance level = P(reject H₀ | H₀ is true) = P(type I error)
- Power = P(reject H₀ | Hₐ is true) = 1 − P(type II error)
Ideally, a hypothesis test should have a small significance level (near 0) and a large power (near 1).
Critical values
Critical values are like quantiles except they refer to the probability to the right of the value instead of the left.
p-values
In practice people often specify the significance level and do the significance test using what are called p-values.
Def: The p-value is the probability, assuming the null hypothesis, of seeing data at least as extreme as the observed data. If the p-value is less than the significance level α, we reject H₀.
z-tests and t-tests
Many significance tests assume that the data are drawn from a normal distribution, so before using such a test you should examine the data to see if the normality assumption is reasonable.
z-test
- Data: we assume x₁, x₂, …, xₙ ~ N(μ, σ^2), where μ is unknown and σ is known.
- Null hypothesis: μ = μ₀ for some specific value μ₀.
- Test statistic: z = (x̄ − μ₀)/(σ/√n), the standardized mean.
- Null distribution: standard normal; φ(z) is the pdf of Z ~ N(0, 1).
- One-sided p-value (right side): p = P(Z ≥ z). One-sided p-value (left side): p = P(Z ≤ z). Two-sided p-value: p = P(|Z| ≥ |z|).
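The z-test recipe above fits in a few lines, reusing the erf-based standard normal cdf; the sample values, null mean, and σ below are made up for illustration.

```python
import math

# One-sample z-test sketch: data assumed N(mu, sigma^2) with sigma known.
def phi(z):                      # standard normal cdf via the error function
    return (1 + math.erf(z / math.sqrt(2))) / 2

data = [6.1, 5.8, 6.4, 6.0, 5.9, 6.3, 6.2, 5.7]   # illustrative data
mu0, sigma = 5.8, 0.5            # null hypothesis mean, assumed known sigma

n = len(data)
xbar = sum(data) / n
z = (xbar - mu0) / (sigma / math.sqrt(n))   # standardized mean

p_right = 1 - phi(z)             # one-sided p-value (right side)
p_two = 2 * (1 - phi(abs(z)))    # two-sided p-value
print(round(z, 3), round(p_right, 4), round(p_two, 4))
```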
The Student t distribution
The t distribution with n degrees of freedom is bell-shaped and symmetric about 0 like the standard normal, but with heavier tails; as n grows it approaches the standard normal.
One-sample t-test
For the z-test we assumed that the variance σ^2 was known. Often it is not, and the one-sample t-test replaces σ with the sample standard deviation s.
It's a theorem (not an assumption) that if the data is normal with mean μ₀, then the Studentized mean t = (x̄ − μ₀)/(s/√n) follows a t distribution with n − 1 degrees of freedom.
Two-sample t-test with equal variances
We next consider the case of comparing the means of two samples. For example, we might be interested in comparing the mean efficacies of two medical treatments.
- Data: We assume we have two sets of data drawn from normal distributions, x₁, …, xₙ ~ N(μ₁, σ^2) and y₁, …, yₘ ~ N(μ₂, σ^2), with unknown means and the same unknown variance.
17. Comparison of Frequentist and Bayesian Inference
- Bayesian inference
- uses probabilities for both hypotheses and data.
- depends on the prior and likelihood of observed data.
- requires one to know or construct a 'subjective prior'.
- dominated statistical practice before the 20th century.
- may be computationally intensive due to integration over many parameters.
- Frequentist inference (NHST)
- never uses or gives the probability of a hypothesis (no prior or posterior).
- depends on the likelihood P(D|H) for both observed and unobserved data.
- does not require a prior.
- dominated statistical practice during the 20th century.
- tends to be less computationally intensive.
18. Confidence intervals
Suppose we have a model (probability distribution) for observed data
with an unknown parameter. We have seen how NHST uses data to test the
hypothesis that the unknown parameter has a particular value.
Statisticians augment point estimates with confidence intervals. For example, to estimate an unknown mean μ we give both a point estimate x̄ and an interval around it that quantifies the uncertainty of the estimate.
Confidence intervals based on normal data
Interval statistics
Technically an interval statistic is nothing more than a pair of point statistics giving the lower and upper bounds of the interval. Our reason for emphasizing that the interval is a statistic is to highlight the following:
1. The interval is random: new random data will produce a new interval.
2. As frequentists, we are perfectly happy using it because it doesn't depend on the value of an unknown parameter or hypothesis.
3. As usual with frequentist statistics, we have to assume a certain hypothesis, e.g. a value of μ, before we can compute probabilities about the interval.
z confidence intervals for the mean
Throughout this section we will assume that we have normally distributed data: x₁, x₂, …, xₙ ~ N(μ, σ^2), with σ known.
Def: Suppose the data x₁, …, xₙ ~ N(μ, σ^2), with σ known. The (1 − α) confidence interval for μ is
[x̄ − z_(α/2) σ/√n, x̄ + z_(α/2) σ/√n]
where z_(α/2) is the right critical value: P(Z > z_(α/2)) = α/2.
Manipulating intervals: pivoting
Here is a quick summary of intervals around x̄ and μ. There is a symmetry: x̄ is within z_(α/2) σ/√n of μ exactly when μ is within z_(α/2) σ/√n of x̄. So the random interval x̄ ± z_(α/2) σ/√n contains μ exactly when x̄ lands in the fixed interval μ ± z_(α/2) σ/√n.
t confidence intervals for the mean
This will be nearly identical to normal confidence intervals. In this setting σ is not known, so we use the sample standard deviation s in its place.
Def: Suppose that x₁, …, xₙ ~ N(μ, σ^2), where σ is unknown. The (1 − α) confidence interval for μ is
[x̄ − t_(α/2) s/√n, x̄ + t_(α/2) s/√n]
where t_(α/2) is the right critical value P(T > t_(α/2)) = α/2 for the t distribution with n − 1 degrees of freedom.
Chi-square confidence intervals for the variance
Def: Suppose the data x₁, …, xₙ ~ N(μ, σ^2), with both μ and σ unknown. The (1 − α) confidence interval for the variance σ^2 is
[(n − 1)s^2 / c_(α/2), (n − 1)s^2 / c_(1−α/2)]
where c_(α/2) and c_(1−α/2) are the right critical values for the chi-square distribution with n − 1 degrees of freedom and s^2 is the sample variance.