Introduction to Probability and Statistics
This article introduces basic probability and statistics.
1. Introduction
Frequentist vs Bayesian interpretations
- Frequentists say that probability measures the frequency of various outcomes of an experiment. The frequentist approach has long been dominant in fields like biology, medicine, public health, and the social sciences.
- Bayesians say that probability is an abstract concept that measures a state of knowledge or a degree of belief in a given proposition. The Bayesian approach has enjoyed a resurgence in the era of powerful computers and big data.
Permutations and combinations
In general, the rule of product tells us that the number of permutations of k distinct elements chosen from a set of n is
P(n, k) = n!/(n − k)!
In combinations order does not matter: the number of ways to choose k elements from n is
C(n, k) = n!/(k! (n − k)!) = P(n, k)/k!
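The two counting formulas above can be checked with a short Python sketch; `math.comb` in the standard library computes combinations directly.

```python
import math

# Rule of product: permutations of k elements chosen from n is n!/(n-k)!
def permutations(n, k):
    return math.factorial(n) // math.factorial(n - k)

# Combinations ignore order: divide out the k! orderings of each selection.
def combinations(n, k):
    return permutations(n, k) // math.factorial(k)

print(permutations(5, 2))   # 20 ordered pairs from 5 items
print(combinations(5, 2))   # 10 unordered pairs
print(combinations(5, 2) == math.comb(5, 2))  # True: matches the stdlib
```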
2. Terminology
Experiment: a repeatable procedure with well-defined possible outcomes
Sample space: the set of all possible outcomes. We usually denote the sample space by Ω, sometimes by S.
Event: a subset of the sample space
Probability function: a function giving the probability for each outcome
Probability density: a continuous distribution of probabilities
Random variable: a random numerical outcome
A discrete sample space is one that is listable; it can be either finite or infinite.
The probability function
For a discrete sample space Ω = {ω₁, ω₂, ω₃, …}, a probability function P assigns to each outcome ω a number P(ω) called the probability of ω. P must satisfy two rules:
- 0 ≤ P(ω) ≤ 1 (probabilities are between 0 and 1)
- The sum of the probabilities of all possible outcomes is 1, that is, P(ω₁) + P(ω₂) + … = 1
The probability of an event E is the sum of the probabilities of all the outcomes in E. That is,
P(E) = Σ P(ω), summed over the outcomes ω in E
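As a small concrete example, here is the rule P(E) = Σ P(ω) applied to a fair six-sided die, using exact fractions.

```python
from fractions import Fraction

# pmf of a fair six-sided die: each outcome has probability 1/6
omega = {k: Fraction(1, 6) for k in range(1, 7)}

assert sum(omega.values()) == 1  # rule 2: probabilities sum to 1

# P(E) is the sum of the probabilities of the outcomes in E
def prob(event):
    return sum(omega[w] for w in event)

evens = {2, 4, 6}
print(prob(evens))  # 1/2
```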
3. Conditional probability
The conditional probability of A given B is defined by P(A|B) = P(A ∩ B)/P(B), provided P(B) ≠ 0.
Multiplication rule: P(A ∩ B) = P(A|B) · P(B)
Law of total probability
Suppose the sample space Ω is divided into disjoint events B₁, B₂, …, Bₙ. Then for any event A,
P(A) = P(A|B₁)P(B₁) + P(A|B₂)P(B₂) + … + P(A|Bₙ)P(Bₙ)
Independence
Two events are independent if knowledge that one occurred does not change the probability that the other occurred. I.e., A and B are independent if P(A|B) = P(A), or equivalently P(A ∩ B) = P(A) · P(B).
Bayes' Theorem
Bayes' theorem is a pillar of both probability and statistics. For two events A and B it says
P(B|A) = P(A|B) · P(B) / P(A)
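Bayes' theorem combined with the law of total probability can be sketched numerically. The numbers below are made up for illustration: a hypothetical test with 99% sensitivity, a 5% false-positive rate, and 1% prevalence.

```python
# Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A), with P(A) expanded by
# the law of total probability. All numbers are illustrative assumptions.
p_B = 0.01                 # P(B): prior probability of the condition
p_A_given_B = 0.99         # P(A|B): positive test given the condition
p_A_given_notB = 0.05      # P(A|~B): false-positive rate

# Law of total probability: P(A) = P(A|B)P(B) + P(A|~B)P(~B)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

p_B_given_A = p_A_given_B * p_B / p_A
print(round(p_B_given_A, 3))  # the posterior is far below the sensitivity
```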
4. Discrete Random Variables
Random variables
Def: Let Ω be a sample space. A discrete random variable is a function X: Ω → ℝ that takes a discrete set of values.
Def: the probability mass function (pmf) of a discrete random variable X is the function p(a) = P(X = a).
- We always have 0 ≤ p(a) ≤ 1.
- We allow a to be any number. If a is a value that X never takes, then p(a) = 0.
Def: The cumulative distribution function (cdf) of a random variable X is the function F(a) = P(X ≤ a).
Properties of the cdf
- F is non-decreasing
- 0 ≤ F(a) ≤ 1
- F(a) → 1 as a → ∞, and F(a) → 0 as a → −∞
Specific distributions
Bernoulli distributions
Model: The Bernoulli distribution models one trial in an experiment that can result in either success or failure. A random variable X ~ Bernoulli(p) takes the values 0 and 1, and
P(X = 1) = p, P(X = 0) = 1 − p
Binomial distributions
The binomial distribution binomial(n, p) models the number of successes in n independent Bernoulli(p) trials. Its pmf is
p(k) = P(X = k) = C(n, k) p^k (1 − p)^(n−k) for k = 0, 1, …, n
Geometric distributions
A geometric distribution models the number of tails before the first head in a sequence of coin flips (Bernoulli trials)
Formal definition: the random variable X ~ geometric(p) takes the values 0, 1, 2, 3, …; its pmf is given by
p(k) = P(X = k) = (1 − p)^k p
Uniform distribution
The uniform distribution uniform(N) models any situation where all the outcomes are equally likely: X takes the values 1, 2, 3, …, N, each with probability 1/N.
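The four pmfs above can be written out directly; a quick sanity check is that each one sums to 1 (the geometric series is truncated here, so its sum is only approximately 1). The parameter values are arbitrary choices for illustration.

```python
import math

p, n, N = 0.3, 10, 6   # illustrative parameter choices

# Bernoulli(p): P(X=1) = p, P(X=0) = 1-p
bernoulli = {0: 1 - p, 1: p}

# binomial(n, p): number of successes in n Bernoulli(p) trials
binomial = {k: math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

# geometric(p): number of failures before the first success (truncated series)
geometric = {k: (1 - p)**k * p for k in range(200)}

# uniform(N): values 1..N, each with probability 1/N
uniform = {k: 1 / N for k in range(1, N + 1)}

for pmf in (bernoulli, binomial, geometric, uniform):
    print(round(sum(pmf.values()), 6))  # each should be (close to) 1
```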
Expected value
Def: Suppose X is a discrete random variable that takes values x₁, x₂, … with probabilities p(x₁), p(x₂), …. The expected value of X is
E[X] = x₁ p(x₁) + x₂ p(x₂) + …
Algebraic properties of E[X]
1. If X and Y are random variables on a sample space Ω then E[X + Y] = E[X] + E[Y].
2. If a and b are constants then E[aX + b] = a E[X] + b.
Thus the expected value of a sum of random variables is the sum of their expected values; expectation is linear.
Proofs of the algebraic properties of E[X] follow from the definition by rearranging sums. For property 1:
E[X + Y] = Σ (xᵢ + yⱼ) P(X = xᵢ, Y = yⱼ), and splitting the sum into two pieces gives E[X] + E[Y].
A similar series computation gives the mean of a geometric(p) distribution:
E[X] = (1 − p)/p
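The geometric mean formula can be checked numerically by truncating the defining series E[X] = Σ k (1 − p)^k p; with p = 0.25 the formula gives (1 − p)/p = 3.

```python
# Numerical check that a geometric(p) random variable (number of failures
# before the first success) has mean (1-p)/p, by truncating the series.
p = 0.25
mean = sum(k * (1 - p)**k * p for k in range(10_000))
print(round(mean, 6))   # ~ 3.0
print((1 - p) / p)      # 3.0 exactly
```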
Expected values of functions of a random variable
If X is a discrete random variable taking values x₁, x₂, … and h is a function, then
E[h(X)] = h(x₁) p(x₁) + h(x₂) p(x₂) + …
5. Variance of Discrete Random Variables
Variance and standard deviation
Taking the mean as the center of a random variable's probability distribution, the variance is a measure of how much the probability mass is spread out around this center.
Definition: If X is a random variable with mean E[X] = μ, then the variance of X is defined by
Var(X) = E[(X − μ)^2]
The standard deviation σ of X is defined by σ = √Var(X).
If the relevant random variable is clear from context, then the variance and standard deviation are often denoted by σ^2 and σ.
The variance of a Bernoulli(p) random variable
Bernoulli random variables are fundamental. If X ~ Bernoulli(p), then
Var(X) = (0 − p)^2 (1 − p) + (1 − p)^2 p = p(1 − p)
Def: The discrete random variables X and Y are independent if P(X = a, Y = b) = P(X = a) P(Y = b) for all values a, b.
Properties of variance
The three most useful properties for computing variance are:
1. If X and Y are independent then Var(X + Y) = Var(X) + Var(Y).
2. For constants a and b, Var(aX + b) = a^2 Var(X).
3. Var(X) = E[X^2] − E[X]^2.
Variance of binomial(n, p)
Suppose X ~ binomial(n, p). Since X is the sum of n independent Bernoulli(p) random variables, property 1 gives Var(X) = n p(1 − p).
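The formula Var(X) = np(1 − p) can be verified directly from the binomial pmf using property 3, Var(X) = E[X^2] − E[X]^2. The parameters below are an arbitrary illustrative choice.

```python
import math

# Check Var(binomial(n, p)) = n*p*(1-p) directly from the pmf.
n, p = 12, 0.3
pmf = {k: math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

EX = sum(k * pk for k, pk in pmf.items())        # E[X]
EX2 = sum(k * k * pk for k, pk in pmf.items())   # E[X^2]
var = EX2 - EX**2                                # property 3

print(round(EX, 6))    # n*p = 3.6
print(round(var, 6))   # n*p*(1-p) = 2.52
```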
6. Continuous Random Variables
Continuous random variables and probability density functions
A continuous random variable takes a range of values, which may be finite or infinite in extent.
Def: A random variable X is continuous if there is a function f(x) such that for any c ≤ d we have
P(c ≤ X ≤ d) = ∫ from c to d of f(x) dx
The function f(x) is called the probability density function (pdf).
Unlike a pmf, a pdf is not itself a probability: f(x) can be greater than 1, and only its integrals give probabilities.
Cumulative distribution function
The cumulative distribution function (cdf) of a continuous random variable X is F(b) = P(X ≤ b) = ∫ from −∞ to b of f(x) dx.
Properties of cumulative distribution functions
- F is non-decreasing.
- 0 ≤ F(b) ≤ 1.
- F(b) → 1 as b → ∞ and F(b) → 0 as b → −∞.
- P(c ≤ X ≤ d) = F(d) − F(c).
- F′(x) = f(x).
Uniform distribution
- Parameters: a, b
- Range: [a, b]
- Notation: uniform(a, b) or U(a, b)
- Density: f(x) = 1/(b − a) for a ≤ x ≤ b
- Distribution: F(x) = (x − a)/(b − a) for a ≤ x ≤ b
- Models: All outcomes in the range have equal probability.
Exponential distribution
- Parameter: λ
- Range: [0, ∞)
- Notation: exponential(λ) or exp(λ)
- Density: f(x) = λ e^(−λx) for 0 ≤ x
- Distribution: F(x) = 1 − e^(−λx) for 0 ≤ x
- Right tail distribution: P(X > x) = e^(−λx)
- Models: The waiting time for a continuous process to change state.
Memorylessness
* The exponential distribution has the property that it is memoryless: P(X > s + t | X > s) = P(X > t).
* Proof: P(X > s + t | X > s) = P(X > s + t)/P(X > s) = e^(−λ(s+t))/e^(−λs) = e^(−λt) = P(X > t).
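The memoryless identity can be checked numerically from the tail formula P(X > x) = e^(−λx); the values of λ, s, and t below are arbitrary.

```python
import math

# Numerical check of memorylessness for the exponential distribution:
# P(X > s + t | X > s) = P(X > t), using the tail P(X > x) = e^{-lam*x}.
lam, s, t = 0.5, 2.0, 3.0

def tail(x):
    return math.exp(-lam * x)

lhs = tail(s + t) / tail(s)   # conditional tail probability
rhs = tail(t)
print(round(lhs, 10) == round(rhs, 10))  # True
```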
Normal distribution
- Parameters: μ, σ
- Range: (−∞, ∞)
- Notation: normal(μ, σ^2) or N(μ, σ^2)
- Density: f(x) = (1/(σ√(2π))) e^(−(x − μ)^2 / (2σ^2))
- Distribution: F(x) has no closed-form formula; it is computed numerically or from tables.
- Models: Measurement error, intelligence/ability, height, averages of lots of data.
The standard normal distribution
The standard normal distribution is N(0, 1). Its density is φ(z) = (1/√(2π)) e^(−z^2/2) and its cdf is denoted Φ(z).
Normal probabilities
To make approximations it is useful to remember the following rule of thumb for three approximate probabilities:
P(−1 ≤ Z ≤ 1) ≈ 0.68, P(−2 ≤ Z ≤ 2) ≈ 0.95, P(−3 ≤ Z ≤ 3) ≈ 0.997
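The rule of thumb can be verified with the standard library: Φ(z) can be written in terms of the error function, Φ(z) = (1 + erf(z/√2))/2.

```python
import math

# The standard normal cdf via the error function:
# Phi(z) = (1 + erf(z / sqrt(2))) / 2
def phi(z):
    return (1 + math.erf(z / math.sqrt(2))) / 2

# P(-k <= Z <= k) = Phi(k) - Phi(-k) for k = 1, 2, 3
for k in (1, 2, 3):
    print(k, round(phi(k) - phi(-k), 4))
# 1 0.6827, 2 0.9545, 3 0.9973
```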
Pareto distribution
- Parameters: m and α
- Range: [m, ∞)
- Notation: Pareto(m, α)
- Density: f(x) = α m^α / x^(α+1) for x ≥ m
- Distribution: F(x) = 1 − m^α/x^α, for x ≥ m.
- Tail distribution: P(X > x) = m^α/x^α, for x ≥ m.
- Models: The Pareto distribution models a power law, where the probability that an event occurs varies as a power of some attribute of the event. Many phenomena follow a power law, such as the sizes of meteors, income levels across a population, and population levels across cities.
Transformations of random variables
For continuous random variables, transforming the pdf is just a change of variables ('u-substitution') from calculus. For example, if Y = aX + b with a > 0, then F_Y(y) = F_X((y − b)/a), and differentiating gives f_Y(y) = f_X((y − b)/a)/a.
Expectation, variance and standard deviation for continuous random variables
These summary statistics have the same meaning for continuous random variables as in the discrete case:
* The expected value μ = E[X] is a measure of location or central tendency.
* The standard deviation σ is a measure of spread or scale.
Expected value of a continuous random variable
Def: Let X be a continuous random variable with range [a, b] and pdf f(x). The expected value of X is
E[X] = ∫ from a to b of x f(x) dx
Properties of E[X]
Expectation is linear, exactly as in the discrete case: E[X + Y] = E[X] + E[Y], and E[cX + d] = c E[X] + d for constants c and d.
Expectation of a function of X
If h is a function, then E[h(X)] = ∫ from a to b of h(x) f(x) dx.
Variance of a continuous random variable
Def: Let X be a continuous random variable with range [a, b], pdf f(x), and mean μ. The variance of X is
Var(X) = E[(X − μ)^2] = ∫ from a to b of (x − μ)^2 f(x) dx
Properties of variance
- If X and Y are independent then Var(X + Y) = Var(X) + Var(Y).
- For constants c and d, Var(cX + d) = c^2 Var(X).
- Theorem: Var(X) = E[X^2] − E[X]^2.
Quantiles
Def: The median of X is the value x for which P(X ≤ x) = 0.5, i.e. the point with equal probability on either side.
Def: The p-th quantile of X is the smallest value q_p such that F(q_p) = P(X ≤ q_p) = p.
In this notation, the median is q_0.5.
7. Central Limit Theorem and the Law of Large Numbers
The law of large numbers
Suppose X₁, X₂, …, Xₙ are independent random variables, all with the same mean μ and variance σ^2.
Let X̄_n = (X₁ + X₂ + … + Xₙ)/n be the sample average.
Note that E[X̄_n] = μ and Var(X̄_n) = σ^2/n.
- The law of large numbers (LoLN): As n grows, the probability that X̄_n is close to μ goes to 1.
- The central limit theorem (CLT): As n grows, the distribution of X̄_n converges to the normal distribution N(μ, σ^2/n).
Formal statement of the law of large numbers
Theorem: Suppose X₁, X₂, … are i.i.d. random variables with mean μ. For each n, let X̄_n be the average of the first n variables. Then for any a > 0,
P(|X̄_n − μ| < a) → 1 as n → ∞
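The law of large numbers is easy to see in simulation: averages of fair-die rolls (true mean μ = 3.5) concentrate around 3.5 as n grows. The sample sizes and seed are arbitrary choices.

```python
import random

# Simulation of the law of large numbers with fair-die rolls (mu = 3.5).
random.seed(0)

def sample_mean(n):
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, round(sample_mean(n), 3))
# the printed averages are random, but they should approach 3.5
```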
The Central Limit Theorem
Standardization
Given a random variable X with mean μ and standard deviation σ, its standardization is Z = (X − μ)/σ, which has mean 0 and standard deviation 1.
Statement of the Central Limit Theorem
Suppose X₁, X₂, …, Xₙ are i.i.d. random variables, each with mean μ and standard deviation σ. Let X̄_n = (X₁ + … + Xₙ)/n and S_n = X₁ + … + Xₙ.
The properties of mean and variance show E[X̄_n] = μ, Var(X̄_n) = σ^2/n, E[S_n] = nμ, and Var(S_n) = nσ^2.
Central Limit Theorem: For large n,
X̄_n ≈ N(μ, σ^2/n) and S_n ≈ N(nμ, nσ^2)
- In words, X̄_n is approximately a normal distribution with the same mean as X but a smaller variance, and S_n is approximately normal.
- Standardized: Z_n = (X̄_n − μ)/(σ/√n) = (S_n − nμ)/(σ√n) is approximately standard normal.
A precise statement of the CLT is that the cdf's of Z_n converge to Φ, the cdf of the standard normal.
Standard normal probabilities
To apply the CLT, we will want to have some normal probabilities. Let Z ~ N(0, 1). Then
P(|Z| < 1) ≈ 0.68, P(|Z| < 2) ≈ 0.95, P(|Z| < 3) ≈ 0.997
more precisely, P(|Z| < 1.96) ≈ 0.95, so 1.96 is often used for 95% probability.
How big does n have to be to apply the CLT?
Short answer: often, not that big.
8. Introduction to statistics
Statistics deals with data. Generally speaking, the goal of statistics is to make inferences based on data.
What is a statistic
Def: A statistic is anything that can be computed from the collected data.
In scientific experiments we start with a hypothesis and collect data
to test the hypothesis. We will often let H stand for the hypothesis and D for the data, so Bayes' theorem reads P(H|D) = P(D|H) P(H)/P(D).
The left-hand term P(H|D) is the probability our hypothesis is true given the data we collected.
9. Maximum Likelihood Estimates
We are often faced with the situation of having random data which we know (or believe) is drawn from a parametric model whose parameters we do not know.
Maximum Likelihood Estimates
There are many methods for estimating unknown parameters from data. We first consider the maximum likelihood estimate (MLE), which answers the question: For which parameter value does the observed data have the biggest probability?
The MLE is an example of a point estimate because it gives a single value for the unknown parameter.
Def: Given data, the maximum likelihood estimate (MLE) for the parameter p is the value of p that maximizes the likelihood P(data | p).
The MLE can be computed with calculus: differentiate the likelihood (or log likelihood) with respect to the parameter, set the derivative to zero, and solve.
Log likelihood
It is often easier to work with the natural log of the likelihood function. Since ln(x) is monotonically increasing, the log likelihood is maximized at the same parameter value as the likelihood itself.
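As a sketch, here is the MLE for the probability of heads from coin-flip data. For k heads in n tosses, calculus gives the closed form p̂ = k/n; the code checks that a grid search over the log likelihood lands on the same value. The data (7 heads in 10 tosses) is illustrative.

```python
import math

# MLE for the heads probability of a coin. The likelihood for k heads in
# n tosses is C(n,k) p^k (1-p)^(n-k); the constant C(n,k) does not affect
# the maximizer, so we can drop it from the log likelihood.
k, n = 7, 10

def log_lik(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_lik)
print(p_hat)  # 0.7, matching the closed form k/n
```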
Maximum likelihood for continuous distributions
For continuous distributions, we use the probability density function
to define the likelihood.
Properties of the MLE
The MLE behaves well under transformations. That is, if p̂ is the MLE for p and g is a one-to-one function, then g(p̂) is the MLE for g(p).
Furthermore, the MLE is asymptotically unbiased and has asymptotically minimal variance. Note that the MLE is itself a random variable, since the data is random and the MLE is computed from the data.
10. Bayesian Updating
Bayesian with discrete priors
Review of Bayes' theorem
If H and D are events, then Bayes' theorem says
P(H|D) = P(D|H) P(H) / P(D)
Some terminology
- Experiment: for example, pick a coin from the drawer at random, flip it, and record the result.
- Data: the result of our experiment. In this case the event D = "heads". We think of D as data that provides evidence for or against each hypothesis.
- Hypotheses: we are testing several hypotheses. For example, the coin is fair, or the coin is unfair with probability 0.7 of giving heads.
- Prior probability: the probability of each hypothesis prior to tossing the coin (collecting data).
- Likelihood: (This is the same likelihood we used for the MLE.) The likelihood function is P(D|H), i.e., the probability of the data assuming that the hypothesis is true. Most often we will consider the data as fixed and let the hypothesis vary, e.g. the probability of heads if the coin is fair.
- Posterior probability: the probability (posterior to the data) of each hypothesis given the data from tossing the coin: P(H|D). These posterior probabilities are what the problem asks us to find.
The Bayes numerator is the product of the prior and the likelihood: P(D|H) P(H).
The process of going from the prior probability P(H) to the posterior P(H|D) is called Bayesian updating.
We can express Bayes' theorem in words:
posterior = likelihood × prior / probability of the data
This leads to the most elegant form of Bayes' theorem in the context of Bayesian updating:
posterior ∝ likelihood × prior
Prior and posterior probability mass functions
Our standard notations will be:
* p(θ) is the prior pmf of the hypothesis θ
* p(θ|D) is the posterior pmf of θ given the data D
Probabilistic prediction
Probabilistic prediction simply means assigning a probability to each possible outcome of an experiment.
Def: Probabilities for the next toss computed using the prior, before we collect any data, are called prior predictive probabilities.
Def: Probabilities for the next toss computed after collecting data and updating the prior to the posterior are called posterior predictive probabilities.
Prior and posterior probabilities are for hypotheses. Prior predictive and posterior predictive probabilities are for data.
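The updating process can be sketched in a few lines for the two hypotheses in the text: the coin is fair (P(heads) = 0.5) or unfair (P(heads) = 0.7). The 50/50 prior over the hypotheses is an assumption for illustration.

```python
# Discrete Bayesian updating: posterior ∝ likelihood × prior.
hypotheses = {"fair": 0.5, "unfair": 0.7}   # P(heads | hypothesis)
prior = {"fair": 0.5, "unfair": 0.5}        # assumed 50/50 prior

def update(prior, data):
    # Bayes numerator = likelihood * prior; normalize to get the posterior.
    likelihood = {h: (p if data == "heads" else 1 - p)
                  for h, p in hypotheses.items()}
    numer = {h: likelihood[h] * prior[h] for h in prior}
    total = sum(numer.values())             # P(data), by total probability
    return {h: numer[h] / total for h in numer}

posterior = update(prior, "heads")
print({h: round(q, 3) for h, q in posterior.items()})
```

Seeing heads shifts belief toward the 0.7-heads coin, as expected.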
Odds
When comparing two events, it is common to phrase probability statements in terms of odds.
Def: The odds of event E against event E′ are the ratio of their probabilities, P(E)/P(E′). If E′ is unspecified, it is taken to be the complement of E, so the odds of E are O(E) = P(E)/P(Eᶜ).
For example, the odds of rolling a 4 with a fair die are (1/6)/(5/6) = 1/5, often written 1 : 5.
Conversion formulas
If P(E) = p, then the odds are O(E) = p/(1 − p). Conversely, if the odds of E are q, then P(E) = q/(1 + q).
Updating odds
We can update prior odds to posterior odds with Bayesian updating.
Bayes factors and strength of evidence
Def: For a hypothesis H and data D, the Bayes factor is the ratio of likelihoods:
Bayes factor = P(D|H) / P(D|Hᶜ)
posterior odds = Bayes factor × prior odds
We see that the Bayes factor tells us whether the data provides evidence for or against the hypothesis: a factor greater than 1 is evidence for H, and a factor less than 1 is evidence against.
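The odds conversions and the odds-updating rule can be sketched together; the numbers reuse the coin example (H = the coin has P(heads) = 0.7, against the fair coin), with even prior odds as an assumption.

```python
# Odds conversions and updating: posterior odds = Bayes factor * prior odds.
def odds(p):
    return p / (1 - p)

def prob(o):
    return o / (1 + o)

prior_odds = odds(0.5)              # even prior odds: 1.0
bayes_factor = 0.7 / 0.5            # P(heads|H) / P(heads|~H) = 1.4
posterior_odds = bayes_factor * prior_odds
print(posterior_odds, round(prob(posterior_odds), 3))
```

Converting the posterior odds back to a probability gives 1.4/2.4 ≈ 0.583, the same posterior the update table method produces.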
Continuous Priors
Notational conventions
We will often use the letter θ to stand for an unknown parameter or hypothesis, treating it as a random quantity with its own distribution.
We have two parallel notations for outcomes and probability:
1. (Big letters) Event A, probability function P(A).
2. (Little letters) Value x, pmf p(x) or pdf f(x).
These notations are related by P(X = x) = p(x) in the discrete case, and P(c ≤ X ≤ d) = ∫ from c to d of f(x) dx in the continuous case.
The law of total probability
Theorem: Law of total probability. Suppose we have a continuous parameter θ with pdf f(θ) on the range [a, b], and data x with likelihood p(x|θ). Then the total probability of the data is
p(x) = ∫ from a to b of p(x|θ) f(θ) dθ
Bayes' theorem for continuous probability densities
Theorem: Bayes' theorem. Use the same assumptions as in the law of total probability. Then the posterior pdf of θ given the data x is
f(θ|x) = p(x|θ) f(θ) / p(x) = p(x|θ) f(θ) / ∫ from a to b of p(x|θ) f(θ) dθ
Flat priors
One important prior is called a flat or uniform prior. A flat prior assumes that every hypothesis is equally probable.
11. Beta distributions
The beta distribution
The beta distribution beta(a, b) is a two-parameter family of distributions on the range [0, 1], with pdf
f(θ) = c θ^(a−1) (1 − θ)^(b−1)
where c is a normalizing constant.
Beta priors and posteriors for binomial random variables
If the probability of heads is θ with a beta(a, b) prior, and the data is one toss landing heads, the Bayesian update table is:

hypothesis | data | prior | likelihood | posterior
---|---|---|---|---
θ | heads | c₁ θ^(a−1) (1 − θ)^(b−1) | θ | c₂ θ^a (1 − θ)^(b−1)

So the posterior is a beta(a + 1, b) distribution.
Conjugate priors
The beta distribution is called a conjugate prior for the binomial distribution. This means that if the likelihood function is binomial, then a beta prior gives a beta posterior. In fact, the beta distribution is a conjugate prior for the Bernoulli and geometric distributions as well.
12. Conjugate priors
Conjugate priors are useful because they reduce Bayesian updating to modifying the parameters of the prior distribution (so-called hyperparameters) rather than computing integrals.
Def: Suppose we have data with likelihood function p(x|θ) depending on a parameter θ. A family of distributions is a conjugate prior for this likelihood if, whenever the prior is in the family, so is the posterior.
Beta distribution
We saw that the beta distribution is a conjugate prior for the binomial distribution. This means that if the likelihood function is binomial and the prior distribution is beta then the posterior is also beta.
More specifically, suppose that the likelihood follows a binomial(n, θ) distribution and the prior is beta(a, b). If the data is x successes in n trials, then the posterior is beta(a + x, b + n − x).
Here a and b are the hyperparameters; updating just adds the number of successes to a and the number of failures to b.
The beta distribution is a conjugate prior for a geometric likelihood as well.
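The beta-binomial update is just hyperparameter arithmetic, which is the whole point of conjugacy. The prior beta(2, 2) and the data (7 heads in 10 tosses) are illustrative choices.

```python
# Conjugate updating: with a beta(a, b) prior on the heads probability and
# a binomial likelihood, observing x successes in n trials gives a
# beta(a + x, b + n - x) posterior -- no integration needed.
def beta_binomial_update(a, b, x, n):
    return a + x, b + (n - x)

a, b = 2, 2          # prior hyperparameters (illustrative choice)
x, n = 7, 10         # observed 7 heads in 10 tosses
a_post, b_post = beta_binomial_update(a, b, x, n)
print(a_post, b_post)                        # 9 5
print(round(a_post / (a_post + b_post), 3))  # posterior mean a/(a+b) = 0.643
```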
Normal begets normal
The normal distribution is its own conjugate prior. In particular, if the likelihood function is normal with known variance, then a normal prior gives a normal posterior.
Normal-normal update formulas for n data points
Suppose the data x₁, …, xₙ is drawn from N(θ, σ^2) with σ^2 known, and the prior on θ is N(μ_prior, σ_prior^2). Then the posterior is N(μ_post, σ_post^2), where
a = 1/σ_prior^2, b = n/σ^2, μ_post = (a μ_prior + b x̄)/(a + b), σ_post^2 = 1/(a + b)
13. Choosing priors
When the prior is known there is no controversy on how to proceed. The art of statistics starts when the prior is not known with certainty. There are two main schools on how to proceed in this case: Bayesian and frequentist.
Recall that given data D and a hypothesis H, Bayes' theorem says P(H|D) = P(D|H) P(H)/P(D): the posterior is proportional to the likelihood times the prior.
- Bayesian: Bayesians make inferences using the posterior P(H|D), and therefore always need a prior P(H). If a prior is not known with certainty, the Bayesian must try to make a reasonable choice. There are many ways to do this, and reasonable people might make different choices. In general it is good practice to justify your choices and to explore a range of priors to see if they all point to the same conclusion.
  - Benefits:
    - The posterior probability P(H|D) for the hypothesis given the evidence is usually exactly what we'd like to know. The Bayesian can say something like "the parameter of interest has probability 0.95 of being between 0.49 and 0.51."
    - The assumptions that go into choosing the prior can be clearly spelled out.
  - Choosing a prior:
    - Uniform prior/flat prior: every hypothesis is equally probable.
    - Informed prior: a prior built from genuine prior information.
    - Rigid priors: too rigid a prior belief can overwhelm any amount of data.
- Frequentist: Very briefly, frequentists do not try to create a prior. Instead, they make inferences using only the likelihood P(D|H).
- More good data: It is always the case that more good data allows for stronger conclusions and lessens the influence of the prior. The emphasis should be as much on good data (quality) as on more data (quantity).
14. Probability intervals
Suppose we have a pmf p(θ) or pdf f(θ) describing our belief about the value of an unknown parameter θ.
Def: A p-probability interval for θ is an interval [a, b] with P(a ≤ θ ≤ b) = p.
- In the discrete case with pmf p(θ), this means Σ p(θⱼ) = p, summed over the θⱼ with a ≤ θⱼ ≤ b.
- In the continuous case with pdf f(θ), this means ∫ from a to b of f(θ) dθ = p.
- We may say 90%-probability interval to mean 0.9-probability interval. Probability intervals are also called credible intervals to contrast them with confidence intervals.
Notice that the p-probability interval for θ is not unique.
Symmetric probability intervals
The interval [q_(1−p)/2, q_(1+p)/2] is the symmetric p-probability interval: it leaves probability (1 − p)/2 in each tail.
Uses of probability intervals
Summarizing and communicating your beliefs
Constructing a prior using subjective probability intervals
Probability intervals are also useful when we do not have a pmf or pdf at hand. In this case, subjective probability intervals give us a method for constructing a reasonable prior for θ.
15. The Frequentist School of Statistics
Both schools of statistics start with probability. In particular, both know and love Bayes' theorem. Bayes' theorem is a complete recipe for updating our beliefs in the face of new data. In practice, different people will have different prior beliefs, but we would still like to make useful inferences from data. Bayesians and frequentists take fundamentally different approaches to this challenge.
* Bayesians require a prior, so they develop one from the best information they have.
* Without a known prior, frequentists draw inferences from just the likelihood function.
In short, Bayesians put probability distributions on everything (hypotheses and data), while frequentists put probability distributions on (random, repeatable, experimental) data given a hypothesis. For the frequentist, when dealing with data from an unknown distribution, only the likelihood has meaning. The prior and posterior do not.
- Point statistic: A point statistic is a single value computed from data. For example, the mean and the maximum are both point statistics.
- Interval statistic: An interval statistic is an interval computed from data. For example, the range from the minimum to maximum is an interval statistic.
- Set statistic: A set statistic is a set computed from data. For example, the set of observed outcomes from rolls of a die.
- Sampling distribution: The probability distribution of a statistic is called its sampling distribution.
- Point estimate: We can use statistics to make a point estimate of a parameter θ. For example, the sample mean x̄ is a point estimate of the mean μ of the underlying distribution.
16. Null Hypothesis Significance Testing
Introduction
Frequentist statistics is often applied in the framework of null hypothesis significance testing (NHST). Stated simply, this method asks if the data is well outside the region where we would expect to see it under the null hypothesis. If so, then we reject the null hypothesis in favor of a second hypothesis called the alternative hypothesis.
The computations done here all involve the likelihood function. There are two main differences between what we'll do here and what we did in Bayesian updating:
1. The evidence of the data will be considered purely through the likelihood function; it will not be weighted by our prior beliefs.
2. We will need a notion of extreme data, e.g. 95 out of 100 heads in a coin toss or a mayfly that lives for a month.
Significance testing
Ingredients
- H₀: the null hypothesis. This is the default assumption for the model generating the data.
- Hₐ: the alternative hypothesis. If we reject the null hypothesis, we accept this alternative as the best explanation for the data.
- X: the test statistic. We compute this from the data.
- Null distribution: the probability distribution of X assuming H₀.
- Rejection region: if X is in the rejection region we reject H₀ in favor of Hₐ.
- Non-rejection region: the complement of the rejection region.
Simple and composite hypotheses
Def: (simple hypothesis): A simple hypothesis is one for which we can specify its distribution completely. A typical simple hypothesis is that a parameter of interest takes a specific value.
Def: (composite hypothesis): If its distribution cannot be fully specified, we say that the hypothesis is composite. A typical composite hypothesis is that a parameter of interest lies in a range of values.
Types of error
There are two types of errors we can make. We can incorrectly reject the null hypothesis when it is true or we can incorrectly fail to reject it when it is false. These are unimaginatively labeled type I and type II errors.
Significance level and power
Significance level and power are used to quantify the quality of the significance test. Ideally a significance test would not make errors.
The two probabilities we focus on are:
- Significance level = P(reject H₀ | H₀ is true) = P(type I error)
- Power = P(reject H₀ | Hₐ is true) = 1 − P(type II error)
Ideally, a hypothesis test should have a small significance level (near 0) and a large power (near 1).
Critical values
Critical values are like quantiles except they refer to the probability to the right of the value instead of the left.
p-values
In practice people often specify the significance level and do the significance test using what are called p-values.
Def: The p-value is the probability, assuming the null hypothesis, of seeing data at least as extreme as the observed data. If the p-value is less than the significance level α, we reject H₀.
z-tests and t-tests
Many significance tests assume that the data are drawn from a normal distribution, so before using such a test you should examine the data to see if the normality assumption is reasonable.
z-test
- Data: we assume x₁, x₂, …, xₙ ~ N(μ, σ^2), where μ is unknown and σ is known.
- Null hypothesis: μ = μ₀ for some specific value μ₀.
- Test statistic: z = (x̄ − μ₀)/(σ/√n), the standardized mean.
- Null distribution: standard normal; φ(z) is the pdf of Z ~ N(0, 1).
- One-sided p-value (right side): p = P(Z ≥ z). One-sided p-value (left side): p = P(Z ≤ z). Two-sided p-value: p = P(|Z| ≥ |z|).
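The z-test recipe above fits in a few lines, reusing the erf-based standard normal cdf; the sample values, null mean, and σ below are made up for illustration.

```python
import math

# One-sample z-test sketch: data assumed N(mu, sigma^2) with sigma known.
def phi(z):                      # standard normal cdf via the error function
    return (1 + math.erf(z / math.sqrt(2))) / 2

data = [6.1, 5.8, 6.4, 6.0, 5.9, 6.3, 6.2, 5.7]   # illustrative data
mu0, sigma = 5.8, 0.5            # null hypothesis mean, assumed known sigma

n = len(data)
xbar = sum(data) / n
z = (xbar - mu0) / (sigma / math.sqrt(n))   # standardized mean

p_right = 1 - phi(z)             # one-sided p-value (right side)
p_two = 2 * (1 - phi(abs(z)))    # two-sided p-value
print(round(z, 3), round(p_right, 4), round(p_two, 4))
```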
The Student t distribution
The t distribution with n degrees of freedom is bell-shaped and symmetric about 0 like the standard normal, but with heavier tails; as n grows it approaches the standard normal.
One-sample t-test
For the z-test we assumed that the variance σ^2 was known. Often it is not, and the one-sample t-test replaces σ with the sample standard deviation s.
It's a theorem (not an assumption) that if the data is normal with mean μ₀, then the Studentized mean t = (x̄ − μ₀)/(s/√n) follows a t distribution with n − 1 degrees of freedom.
Two-sample t-test with equal variances
We next consider the case of comparing the means of two samples. For example, we might be interested in comparing the mean efficacies of two medical treatments.
- Data: We assume we have two sets of data drawn from normal distributions, x₁, …, xₙ ~ N(μ₁, σ^2) and y₁, …, yₘ ~ N(μ₂, σ^2), with unknown means and the same unknown variance.
17. Comparison of Frequentist and Bayesian Inference
- Bayesian inference
- uses probabilities for both hypotheses and data.
- depends on the prior and likelihood of observed data.
- requires one to know or construct a 'subjective prior'.
- dominated statistical practice before the 20th century.
- may be computationally intensive due to integration over many parameters.
- Frequentist inference (NHST)
- never uses or gives the probability of a hypothesis (no prior or posterior).
- depends on the likelihood P(D|H) for both observed and unobserved data.
- does not require a prior.
- dominated statistical practice during the 20th century.
- tends to be less computationally intensive.
18. Confidence intervals
Suppose we have a model (probability distribution) for observed data
with an unknown parameter. We have seen how NHST uses data to test the
hypothesis that the unknown parameter has a particular value.
Statisticians augment point estimates with confidence intervals. For example, to estimate an unknown mean μ we give both a point estimate x̄ and an interval around it that quantifies the uncertainty of the estimate.
Confidence intervals based on normal data
Interval statistics
Technically an interval statistic is nothing more than a pair of point statistics giving the lower and upper bounds of the interval. Our reason for emphasizing that the interval is a statistic is to highlight the following:
1. The interval is random: new random data will produce a new interval.
2. As frequentists, we are perfectly happy using it because it doesn't depend on the value of an unknown parameter or hypothesis.
3. As usual with frequentist statistics, we have to assume a certain hypothesis, e.g. a value of μ, before we can compute probabilities about the interval.
z confidence intervals for the mean
Throughout this section we will assume that we have normally distributed data: x₁, x₂, …, xₙ ~ N(μ, σ^2), with σ known.
Def: Suppose the data x₁, …, xₙ ~ N(μ, σ^2), with σ known. The (1 − α) confidence interval for μ is
[x̄ − z_(α/2) σ/√n, x̄ + z_(α/2) σ/√n]
where z_(α/2) is the right critical value: P(Z > z_(α/2)) = α/2.
Manipulating intervals: pivoting
Here is a quick summary of intervals around x̄ and μ. There is a symmetry: x̄ is within z_(α/2) σ/√n of μ exactly when μ is within z_(α/2) σ/√n of x̄. So the random interval x̄ ± z_(α/2) σ/√n contains μ exactly when x̄ lands in the fixed interval μ ± z_(α/2) σ/√n.
t confidence intervals for the mean
This will be nearly identical to normal confidence intervals. In this setting σ is not known, so we use the sample standard deviation s in its place.
Def: Suppose that x₁, …, xₙ ~ N(μ, σ^2), where σ is unknown. The (1 − α) confidence interval for μ is
[x̄ − t_(α/2) s/√n, x̄ + t_(α/2) s/√n]
where t_(α/2) is the right critical value P(T > t_(α/2)) = α/2 for the t distribution with n − 1 degrees of freedom.
Chi-square confidence intervals for the variance
Def: Suppose the data x₁, …, xₙ ~ N(μ, σ^2), with both μ and σ unknown. The (1 − α) confidence interval for the variance σ^2 is
[(n − 1)s^2 / c_(α/2), (n − 1)s^2 / c_(1−α/2)]
where c_(α/2) and c_(1−α/2) are the right critical values for the chi-square distribution with n − 1 degrees of freedom and s^2 is the sample variance.