# Comprehensive Guide on Sample Variance

Last updated: Aug 12, 2023

Tags: Probability and Statistics
Definition.

# Sample variance

The sample variance of a sample $(x_1,x_2,\cdots,x_n)$ is computed by:

$$s^2=\frac{1}{n-1}\sum^n_{i=1}(x_i-\bar{x})^2$$

Where $n$ is the sample size and $\bar{x}$ is the sample mean. For the intuition behind this formula, please consult our guide on measures of spread.

Notice how we compute the average by dividing by $n-1$ instead of $n$. This is because dividing by $n-1$ makes the sample variance an unbiased estimator of the population variance - we give the proof below, but please consult our guide on bias to understand what being unbiased means.

Example.

## Computing the sample variance of a sample

Compute the sample variance of the following sample:

$$(1,3,5,7)$$

Solution. Here, the size of the sample is $n=4$. We first start by computing the sample mean:

\begin{align*} \bar{x}&=\frac{1}{4}\sum^4_{i=1}x_i\\ &=\frac{1}{4}(1+3+5+7)\\ &=4 \end{align*}

Let's now compute the sample variance $s^2$ using the formula:

\begin{align*} s^2 &=\frac{1}{n-1}\sum^n_{i=1}(x_i-\bar{x})^2\\ &=\frac{1}{3}\sum^4_{i=1}(x_i-4)^2\\ &=\frac{1}{3}[(1-4)^2+(3-4)^2+(5-4)^2+(7-4)^2]\\ &=\frac{20}{3}\\ &\approx6.67\\ \end{align*}

This means that, on average, the square of the difference between each point and the sample mean is around $6.67$. This interpretation is precise but quite awkward. Therefore, instead of quoting the sample variance of a single sample, we often compare the sample variance of two different samples to understand which sample is more spread out.
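The worked example above can be reproduced with a short Python sketch (a plain translation of the formula; nothing here beyond the numbers already computed):

```python
# Worked example: sample variance of (1, 3, 5, 7)
sample = [1, 3, 5, 7]
n = len(sample)

# Sample mean
x_bar = sum(sample) / n

# Sample variance: divide the sum of squared deviations by n - 1
s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

print(x_bar)  # 4.0
print(s2)     # 6.666666666666667
```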

## Intuition behind why we divide by n-1 instead of n

Although we will formally prove below that dividing by $n-1$ gives us an unbiased estimator of the population variance, let's first understand from another perspective why we should divide by $n-1$.

Ideally, our estimate of the population variance would be:

$$\label{eq:ohGzVCDYbDArl9d4nZX} s^2=\frac{1}{n}\sum^n_{i=1}(x_i-\mu)^2$$

Where $\mu$ is the population mean. In fact, if the population mean is known, then the sample variance should be computed as above without dividing by $n-1$. However, in most cases, the population mean is unknown, so the best we can do is to replace $\mu$ with the sample mean $\bar{x}$ like so:

$$\label{eq:NrJOSeZL5qE9DxVIics} s^2=\frac{1}{n}\sum^n_{i=1}(x_i-\bar{x})^2$$

However, when we replace $\mu$ with $\bar{x}$, it turns out that we would, on average, underestimate the population variance. We will now mathematically prove this.

Let's focus on the sum of squared differences. Instead of the sample mean $\bar{x}$, let's replace that with a variable $t$ and consider the expression as a function of $t$ like so:

$$f(t)=\sum^n_{i=1}(x_i-t)^2$$

Using calculus, our goal is to show that $t=\bar{x}$ minimizes this function. Let's take the first derivative of $f(t)$ with respect to $t$ like so:

\begin{align*} f'(t)&=\frac{d}{dt}\sum_{i=1}^n(x_i-t)^2\\ &=\sum_{i=1}^n\frac{d}{dt}(x_i-t)^2\\ &=-2\sum_{i=1}^n(x_i-t) \end{align*}

Setting this equal to zero gives:

\begin{align*} -2\sum_{i=1}^n(x_i-t)&=0\\ \sum_{i=1}^n(x_i-t)&=0\\ \sum_{i=1}^nx_i-\sum_{i=1}^nt&=0\\ \Big(\sum_{i=1}^nx_i\Big)-nt&=0\\ t&=\frac{1}{n}\sum_{i=1}^nx_i\\ t&=\bar{x}\\ \end{align*}

Let's also check the nature of this stationary point by referring to the second derivative:

\begin{align*} f''(t)&=\frac{d}{dt}f'(t)\\ &=\frac{d}{dt}\Big(-2\sum_{i=1}^n(x_i-t)\Big)\\ &=-2\Big(\sum_{i=1}^n\frac{d}{dt}(x_i-t)\Big)\\ &=-2\Big(\sum_{i=1}^n-1\Big)\\ &=2n \\ \end{align*}

Since the sample size $n$ is positive, we have that the second derivative is always positive. This means that the stationary point $t=\bar{x}$ is indeed a minimum! In other words, out of all the values $t$ can take, setting $t=\bar{x}$ will minimize the sum of squared differences:

$$\sum^n_{i=1}(x_i-\bar{x})^2 \le \sum^n_{i=1}(x_i-t)^2$$
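As a quick numerical sanity check (illustrative only, not part of the proof), we can evaluate $f(t)$ at the sample mean and at a few other values of $t$, and confirm that the mean yields the smallest sum of squared differences:

```python
sample = [1, 3, 5, 7]
x_bar = sum(sample) / len(sample)   # 4.0

def f(t):
    # Sum of squared differences between each point and t
    return sum((x - t) ** 2 for x in sample)

# f is minimized at t = x_bar; any other t gives a value at least as large
for t in [0.0, 3.5, 4.5, 10.0]:
    assert f(x_bar) <= f(t)

print(f(x_bar))  # 20.0
```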

The population mean $\mu$ is some unknown constant, but we now know that:

$$\label{eq:kUfz4YNwhBVtS8B1ZF0} \sum^n_{i=1}(x_i-\bar{x})^2 \le \sum^n_{i=1}(x_i-\mu)^2$$

Even though we don't know what $\mu$ is, we know that the sum of squared differences when $t=\mu$ must be at least as large as the sum of squared differences when $t=\bar{x}$.

Let's divide both sides of \eqref{eq:kUfz4YNwhBVtS8B1ZF0} by $n$ to get:

$$\label{eq:Vd8ISUnkMkIvhi6wExH} \frac{1}{n}\sum^n_{i=1}(x_i-\bar{x})^2 \le \frac{1}{n}\sum^n_{i=1}(x_i-\mu)^2$$

The right-hand side is our ideal estimate \eqref{eq:ohGzVCDYbDArl9d4nZX} from earlier. To make this clear, let's write \eqref{eq:Vd8ISUnkMkIvhi6wExH} as:

$$\label{eq:mfxzwx5FHb6tVM1v3Zl} \frac{1}{n}\sum^n_{i=1}(x_i-\bar{x})^2 \le \text{ideal}$$

This means that the estimate of the population variance using the left-hand side of \eqref{eq:mfxzwx5FHb6tVM1v3Zl} will generally be less than the ideal estimate. To compensate for this underestimation, we must make the left-hand side larger. One way of doing so is to divide by a smaller number, say $n-1$:

$$\frac{1}{n-1}\sum^n_{i=1}(x_i-\bar{x})^2$$

Of course, this leads to more questions, such as why we should divide specifically by $n-1$ instead of, say, $n-2$ or $n-3$, all of which also make the left-hand side of \eqref{eq:mfxzwx5FHb6tVM1v3Zl} larger. The motivation behind this exercise is merely to understand that dividing by some number less than $n$ accounts for the underestimation. As for why we specifically divide by $n-1$, we prove mathematically below that dividing by $n-1$ adjusts our estimate exactly so that, on average, we neither underestimate nor overestimate.
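The underestimation can also be seen empirically. The sketch below (illustrative; the normal population, sample size, and trial count are arbitrary choices) repeatedly draws samples from a population with known variance $\sigma^2=4$ and compares the average of the divide-by-$n$ estimator against the divide-by-$(n-1)$ estimator:

```python
import random

random.seed(0)
mu, sigma2 = 0.0, 4.0        # population mean and variance
n, trials = 5, 100_000       # small samples, many repetitions

biased_sum = unbiased_sum = 0.0
for _ in range(trials):
    sample = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)
    biased_sum += ss / n          # divide by n: tends to underestimate
    unbiased_sum += ss / (n - 1)  # divide by n - 1: unbiased

biased_mean = biased_sum / trials
unbiased_mean = unbiased_sum / trials
print(biased_mean)    # close to (n-1)/n * sigma2 = 3.2, below the true 4
print(unbiased_mean)  # close to the true variance 4
```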

# Properties of sample variance

Theorem.

## Unbiased estimator of the population variance

The sample variance $S^2$ is an unbiased estimator for the population variance $\sigma^2$, that is:

$$\mathbb{E}(S^2)=\sigma^2$$

Proof. We start off with the following algebraic manipulation:

\begin{align*} \sum^n_{i=1}(X_i-\bar{X})^2 &=\sum^n_{i=1}(X_i^2-2X_i\bar{X}+\bar{X}^2)\\ &=\Big(\sum^n_{i=1}X_i^2\Big)-2\bar{X}\Big(\sum^n_{i=1}X_i\Big)+\Big(\sum^n_{i=1}\bar{X}^2\Big)\\ &=\Big(\sum^n_{i=1}X_i^2\Big)-2\bar{X}\sum^n_{i=1}\left(n\cdot\frac{X_i}{n}\right)+n\bar{X}^2\\ &=\Big(\sum^n_{i=1}X_i^2\Big)-2n\bar{X}\cdot\Big(\frac{1}{n}\sum^n_{i=1}X_i\Big)+n\bar{X}^2\\ &=\Big(\sum^n_{i=1}X_i^2\Big)-2n\bar{X}^2+n\bar{X}^2\\ &=-n\bar{X}^2+\sum^n_{i=1}X_i^2\\ \end{align*}

Multiplying both sides by $1/(n-1)$ gives:

$$\frac{1}{n-1} \sum^n_{i=1}(X_i-\bar{X})^2= \frac{1}{n-1}\Big(-n\bar{X}^2+\sum^n_{i=1}X_i^2\Big)$$

The left-hand side is the formula for the sample variance $S^2$ so:

$$S^2= \frac{1}{n-1}\Big(-n\bar{X}^2+\sum^n_{i=1}X_i^2\Big)$$
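The algebraic identity used here, $\sum_i(X_i-\bar{X})^2 = \big(\sum_i X_i^2\big) - n\bar{X}^2$, can be checked numerically on an arbitrary sample (the numbers below are made up for illustration):

```python
sample = [2.0, 3.5, 1.0, 6.0, 4.5]
n = len(sample)
x_bar = sum(sample) / n

# Left-hand side: sum of squared deviations from the mean
lhs = sum((x - x_bar) ** 2 for x in sample)
# Right-hand side: sum of squares minus n times the squared mean
rhs = sum(x ** 2 for x in sample) - n * x_bar ** 2

print(abs(lhs - rhs) < 1e-9)  # True
```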

Now, let's take the expected value of both sides and use the property of linearity of expected values to simplify:

$$\label{eq:MGFWQ0zdxObMW1zhXiV} \begin{aligned}[b] \mathbb{E}(S^2)&= \mathbb{E} \Big[\frac{1}{n-1}\Big(-n\bar{X}^2+\sum^n_{i=1}X_i^2\Big)\Big]\\ &= \frac{1}{n-1}\mathbb{E} \Big(-n\bar{X}^2+\sum^n_{i=1}X_i^2\Big)\\ &=\frac{1}{n-1}\Big[\mathbb{E} \Big(-n\bar{X}^2\Big)+\mathbb{E}\Big(\sum^n_{i=1}X_i^2\Big)\Big]\\ &= \frac{1}{n-1}\Big[-n\cdot\mathbb{E} \Big(\bar{X}^2\Big)+\sum^n_{i=1}\mathbb{E}(X_i^2)\Big]\\ \end{aligned}$$

Now, from the property of variance, we know that:

$$\label{eq:ZQMklBf4CcDEfOxcVdJ} \mathbb{E}(\bar{X}^2)= \mathbb{V}(\bar{X})+[\mathbb{E}(\bar{X})]^2$$

We have previously derived the variance as well as the expected value of $\bar{X}$ to be:

\begin{align*} \mathbb{V}(\bar{X})&=\frac{\sigma^2}{n}\\ \mathbb{E}(\bar{X})&=\mu \end{align*}
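These two facts about $\bar{X}$ can themselves be verified by simulation (an illustrative sketch; a uniform population with known $\mu=1/2$ and $\sigma^2=1/12$ is an arbitrary choice):

```python
import random

random.seed(1)
n, trials = 10, 200_000

# Collect the sample mean of many samples drawn from Uniform(0, 1)
means = []
for _ in range(trials):
    sample = [random.random() for _ in range(n)]
    means.append(sum(sample) / n)

m = sum(means) / trials                       # estimate of E(X_bar)
v = sum((x - m) ** 2 for x in means) / trials  # estimate of V(X_bar)

print(m)  # close to mu = 0.5
print(v)  # close to sigma^2 / n = (1/12)/10 ≈ 0.00833
```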

Substituting these values into \eqref{eq:ZQMklBf4CcDEfOxcVdJ} gives:

$$\label{eq:LlQymAMmsVKtv6MIqTc} \mathbb{E}(\bar{X}^2)=\frac{\sigma^2}{n}+\mu^2$$

Once again, from the same property of variance, we have that:

$$\label{eq:OPC1YMGbDIHlCRGd6IJ} \begin{aligned}[b] \mathbb{E}(X_i^2)&=\mathbb{V}(X_i)+[\mathbb{E}(X_i)]^2\\ &=\sigma^2+\mu^2 \end{aligned}$$

Substituting \eqref{eq:LlQymAMmsVKtv6MIqTc} and \eqref{eq:OPC1YMGbDIHlCRGd6IJ} into \eqref{eq:MGFWQ0zdxObMW1zhXiV} gives:

\begin{align*} \mathbb{E}(S^2)&= \frac{1}{n-1}\Big[-n\cdot \Big(\frac{\sigma^2}{n}+\mu^2\Big) +\sum^n_{i=1}(\sigma^2+\mu^2)\Big]\\ &=\frac{1}{n-1}\Big[-\sigma^2-n\mu^2 +n(\sigma^2+\mu^2)\Big]\\ &=\frac{1}{n-1}\Big(-\sigma^2-n\mu^2 +n\sigma^2+n\mu^2\Big)\\ &=\frac{1}{n-1}\Big(n\sigma^2-\sigma^2\Big)\\ &=\frac{1}{n-1}\Big[\sigma^2(n-1)\Big]\\ &=\sigma^2\\ \end{align*}

This proves that the sample variance $S^2$ is an unbiased estimator for the population variance $\sigma^2$.

# Computing sample variance using Python

We can easily compute the sample variance using Python's NumPy library. By default, the var(~) method returns the following biased sample variance:

$$s^2=\frac{1}{n}\sum^n_{i=1}(x_i-\bar{x})^2$$

To compute the unbiased sample variance instead, supply the argument ddof=1:

```python
import numpy as np
np.var([3,5,1,7], ddof=1)   # 6.666666666666667
```

Note that ddof stands for delta degrees of freedom: np.var(~) divides the sum of squared deviations by $n-\mathrm{ddof}$, that is:

$$\frac{1}{n\color{green}{-\mathrm{ddof}}}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2$$
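For instance, on the sample from the earlier worked example, ddof=0 (the default) divides by $n=4$ while ddof=1 divides by $n-1=3$:

```python
import numpy as np

sample = [1, 3, 5, 7]          # sum of squared deviations is 20
v0 = float(np.var(sample))           # divides by n = 4  -> biased
v1 = float(np.var(sample, ddof=1))   # divides by n - 1 = 3 -> unbiased

print(v0)  # 5.0
print(v1)  # 6.666666666666667
```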