
# Comprehensive Guide on Sample Variance

May 21, 2023


# Sample variance

The sample variance of a sample $(x_1,x_2,\cdots,x_n)$ is computed by:

$$s^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2$$

Where $n$ is the sample size and $\bar{x}$ is the sample mean. For the intuition behind this formula, please consult our guide on measures of spread.

Notice how we compute the average by dividing by $n-1$ instead of $n$. This is because dividing by $n-1$ makes the sample variance an unbiased estimator of the population variance - we give the proof below, but please consult our guide to understand what bias means.
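As a quick sanity check, Python's standard `statistics` module exposes both conventions: `variance(~)` divides by $n-1$, while `pvariance(~)` divides by $n$. The sample below is made up purely for illustration:

```python
import statistics

sample = [2.0, 4.0, 6.0, 8.0]   # a made-up sample for illustration

print(statistics.variance(sample))   # divides by n - 1: 6.666666666666667
print(statistics.pvariance(sample))  # divides by n: 5.0
```

Note how the $n-1$ version is always the larger of the two.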

## Computing the sample variance of a sample

Compute the sample variance of the following sample:

Solution. Here, the size of the sample is $n=4$. We first compute the sample mean:

Let's now compute the sample variance $s^2$ using the formula:

This means that, on average, the square of the difference between each point and the sample mean is around $6.67$. This interpretation is precise but quite awkward. Therefore, instead of quoting the sample variance of a single sample, we often compare the sample variance of two different samples to understand which sample is more spread out.
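To make the arithmetic concrete, here is the same recipe carried out step by step in plain Python. The sample values below are hypothetical, not the ones from the exercise:

```python
# Hypothetical sample of size n = 4 (values chosen for illustration)
sample = [2.0, 4.0, 6.0, 8.0]
n = len(sample)

x_bar = sum(sample) / n                          # sample mean
squared_diffs = [(x - x_bar) ** 2 for x in sample]
s2 = sum(squared_diffs) / (n - 1)                # divide by n - 1, not n

print(x_bar)  # 5.0
print(s2)     # 6.666666666666667
```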

## Intuition behind why we divide by n-1 instead of n

Although we will formally prove below that dividing by $n-1$ gives an unbiased estimator of the population variance, let's understand from another perspective why we should divide by $n-1$.

Ideally, our estimate of the population variance would be:

$$\frac{1}{n}\sum_{i=1}^n(x_i-\mu)^2 \label{eq:ohGzVCDYbDArl9d4nZX}$$

Where $\mu$ is the population mean. In fact, if the population mean is known, then the sample variance should be computed as above without dividing by $n-1$. However, in most cases, the population mean is unknown, so the best we can do is to replace $\mu$ with the sample mean $\bar{x}$ like so:

$$\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2$$

However, when we replace $\mu$ with $\bar{x}$, it turns out that we would, on average, underestimate the population variance. We will now mathematically prove this.

Let's focus on the sum of squared differences. Instead of the sample mean $\bar{x}$, let's replace that with a variable $t$ and consider the expression as a function of $t$ like so:

$$f(t)=\sum_{i=1}^n(x_i-t)^2$$

Using calculus, our goal is to show that $t=\bar{x}$ minimizes this function. Let's take the first derivative of $f(t)$ with respect to $t$ like so:

$$f'(t)=-2\sum_{i=1}^n(x_i-t)=-2\left(\sum_{i=1}^nx_i-nt\right)$$

Setting this equal to zero gives:

$$\sum_{i=1}^nx_i-nt=0\;\;\;\Longrightarrow\;\;\;t=\frac{1}{n}\sum_{i=1}^nx_i=\bar{x}$$

Let's also check the nature of this stationary point by referring to the second derivative:

$$f''(t)=2n$$

Since the sample size $n$ is positive, the second derivative is always positive. This means that the stationary point $t=\bar{x}$ is indeed a minimum! In other words, out of all the values $t$ can take, setting $t=\bar{x}$ will minimize the sum of squared differences:

$$\sum_{i=1}^n(x_i-\bar{x})^2\le\sum_{i=1}^n(x_i-t)^2$$

The population mean $\mu$ is some unknown constant, but we now know that:

$$\sum_{i=1}^n(x_i-\bar{x})^2\le\sum_{i=1}^n(x_i-\mu)^2 \label{eq:kUfz4YNwhBVtS8B1ZF0}$$

Even though we don't know what $\mu$ is, we know that the sum of squared differences when $t=\mu$ must be at least as large as the sum of squared differences when $t=\bar{x}$.
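This minimization property is easy to confirm numerically: evaluating $f(t)$ at $t=\bar{x}$ and at a few other values of $t$ shows that the sample mean always gives the smallest sum. The sample below is invented for the check:

```python
# Invented sample; f(t) is the sum of squared differences as a function of t
x = [2.0, 4.0, 6.0, 8.0]

def f(t):
    return sum((xi - t) ** 2 for xi in x)

x_bar = sum(x) / len(x)
for t in [x_bar - 1.0, x_bar - 0.1, x_bar + 0.1, x_bar + 1.0]:
    assert f(x_bar) < f(t)  # t = x_bar gives the smallest sum

print(f(x_bar))  # 20.0
```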

Let's divide both sides of \eqref{eq:kUfz4YNwhBVtS8B1ZF0} by $n$ to get:

$$\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2\le\frac{1}{n}\sum_{i=1}^n(x_i-\mu)^2 \label{eq:Vd8ISUnkMkIvhi6wExH}$$

The right-hand side is our ideal estimate \eqref{eq:ohGzVCDYbDArl9d4nZX} from earlier. To make this clear, let's write \eqref{eq:Vd8ISUnkMkIvhi6wExH} as:

$$\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2\le\underbrace{\frac{1}{n}\sum_{i=1}^n(x_i-\mu)^2}_{\text{ideal estimate}} \label{eq:mfxzwx5FHb6tVM1v3Zl}$$

This means that an estimate of the population variance using the left-hand side of \eqref{eq:mfxzwx5FHb6tVM1v3Zl} will generally be less than the ideal estimate. To compensate for this underestimation, we must make the left-hand side larger. One way of doing so is by dividing by a smaller amount, say $n-1$:

$$s^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2$$

Of course, this leads to more questions such as why we should divide specifically by $n-1$ instead of, say, $n-2$ or $n-3$, which all have the effect of making the left-hand side of \eqref{eq:mfxzwx5FHb6tVM1v3Zl} larger. The motivation behind this exercise is merely to understand that dividing by some number less than $n$ accounts for the underestimation. As for why we specifically divide by $n-1$, we prove mathematically below that dividing by $n-1$ adjusts our estimate exactly such that we neither underestimate nor overestimate.

# Properties of sample variance

## Unbiased estimator of the population variance

The sample variance $S^2$ is an unbiased estimator of the population variance $\sigma^2$, that is:

$$\mathbb{E}[S^2]=\sigma^2$$

Proof. We start off with the following algebraic manipulation, using the fact that $\sum_{i=1}^nX_i=n\bar{X}$:

$$\begin{align*}\sum_{i=1}^n(X_i-\bar{X})^2&=\sum_{i=1}^n\big(X_i^2-2X_i\bar{X}+\bar{X}^2\big)\\&=\sum_{i=1}^nX_i^2-2\bar{X}\sum_{i=1}^nX_i+n\bar{X}^2\\&=\sum_{i=1}^nX_i^2-2n\bar{X}^2+n\bar{X}^2\\&=\sum_{i=1}^nX_i^2-n\bar{X}^2\end{align*}$$

Multiplying both sides by $1/(n-1)$ gives:

$$\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2=\frac{1}{n-1}\left(\sum_{i=1}^nX_i^2-n\bar{X}^2\right)$$

The left-hand side is the formula for the sample variance $S^2$, so:

$$S^2=\frac{1}{n-1}\left(\sum_{i=1}^nX_i^2-n\bar{X}^2\right)$$

Now, let's take the expected value of both sides and use the property of linearity of expected values to simplify:

$$\mathbb{E}[S^2]=\frac{1}{n-1}\left(\sum_{i=1}^n\mathbb{E}[X_i^2]-n\,\mathbb{E}[\bar{X}^2]\right) \label{eq:MGFWQ0zdxObMW1zhXiV}$$

Now, from the property of variance, we know that:

$$\mathbb{V}(\bar{X})=\mathbb{E}[\bar{X}^2]-\big(\mathbb{E}[\bar{X}]\big)^2 \label{eq:ZQMklBf4CcDEfOxcVdJ}$$

We have previously derived the variance as well as the expected value of $\bar{X}$ to be:

$$\mathbb{V}(\bar{X})=\frac{\sigma^2}{n},\;\;\;\;\;\;\mathbb{E}[\bar{X}]=\mu$$

Substituting these values into \eqref{eq:ZQMklBf4CcDEfOxcVdJ} gives:

$$\mathbb{E}[\bar{X}^2]=\frac{\sigma^2}{n}+\mu^2 \label{eq:LlQymAMmsVKtv6MIqTc}$$

Once again, from the same property of variance, we have that:

$$\mathbb{V}(X_i)=\mathbb{E}[X_i^2]-\big(\mathbb{E}[X_i]\big)^2\;\;\;\Longrightarrow\;\;\;\mathbb{E}[X_i^2]=\sigma^2+\mu^2 \label{eq:OPC1YMGbDIHlCRGd6IJ}$$

Substituting \eqref{eq:LlQymAMmsVKtv6MIqTc} and \eqref{eq:OPC1YMGbDIHlCRGd6IJ} into \eqref{eq:MGFWQ0zdxObMW1zhXiV} gives:

$$\begin{align*}\mathbb{E}[S^2]&=\frac{1}{n-1}\left(\sum_{i=1}^n(\sigma^2+\mu^2)-n\left(\frac{\sigma^2}{n}+\mu^2\right)\right)\\&=\frac{1}{n-1}\big(n\sigma^2+n\mu^2-\sigma^2-n\mu^2\big)\\&=\frac{(n-1)\sigma^2}{n-1}\\&=\sigma^2\end{align*}$$

This proves that the sample variance $S^2$ is an unbiased estimator for the population variance $\sigma^2$.
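The unbiasedness can also be checked empirically: averaging the sample variance over many simulated samples should land close to the true $\sigma^2$, while dividing by $n$ instead falls short. All parameters below are arbitrary choices for the demo:

```python
import numpy as np

# Arbitrary demo parameters: population is N(0, sigma^2) with sigma^2 = 4
rng = np.random.default_rng(0)
sigma2, n, trials = 4.0, 5, 100_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n - 1
biased = samples.var(axis=1, ddof=0).mean()    # divide by n

print(unbiased)  # close to sigma^2 = 4.0
print(biased)    # close to (n - 1)/n * sigma^2 = 3.2
```

Note how the biased estimator averages to roughly $\frac{n-1}{n}\sigma^2$, exactly the shortfall that dividing by $n-1$ corrects.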

# Computing sample variance using Python

We can easily compute the sample variance using Python's `NumPy` library. By default, the `var(~)` method returns the following biased sample variance:

$$\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2$$

To compute the unbiased sample variance instead, supply the argument `ddof=1`:

```python
import numpy as np

x = np.array([2, 4, 6, 8])  # an illustrative sample
np.var(x, ddof=1)           # 6.666666666666667
```

Note that `ddof` stands for delta degrees of freedom, and `var(~)` divides by $n-\text{ddof}$, that is, it computes the following quantity:

$$\frac{1}{n-\text{ddof}}\sum_{i=1}^n(x_i-\bar{x})^2$$
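To see concretely that `ddof` controls the divisor, we can compare `np.var(~)` against the manually computed sum of squared deviations. The sample values here are again illustrative:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # illustrative sample
n = len(x)
ss = np.sum((x - x.mean()) ** 2)    # sum of squared deviations

assert np.isclose(np.var(x, ddof=0), ss / n)        # biased: divide by n
assert np.isclose(np.var(x, ddof=1), ss / (n - 1))  # unbiased: divide by n - 1
```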