Comprehensive Guide on Sample Variance

Last updated: Nov 5, 2022
Definition.

Sample variance

The sample variance of a sample $(x_1,x_2,\cdots,x_n)$ is computed by:

$$s^2=\frac{1}{n-1}\sum^n_{i=1}(x_i-\bar{x})^2$$

Where $n$ is the sample size and $\bar{x}$ is the sample mean. For the intuition behind this formula, please consult our guide on measures of spread.

Notice how we compute the average by dividing by $n-1$ instead of $n$. This is because dividing by $n-1$ makes the sample variance an unbiased estimator of the population variance - we give the proof below, but please consult our guide to understand what bias means.

Example.

Computing the sample variance of a sample

Compute the sample variance of the following sample:

$$(1,3,5,7)$$

Solution. Here, the size of the sample is $n=4$. We first start by computing the sample mean:

$$\begin{align*} \bar{x}&=\frac{1}{4}\sum^4_{i=1}x_i\\ &=\frac{1}{4}(1+3+5+7)\\ &=4 \end{align*}$$

Let's now compute the sample variance $s^2$ using the formula:

$$\begin{align*} s^2 &=\frac{1}{n-1}\sum^n_{i=1}(x_i-\bar{x})^2\\ &=\frac{1}{3}\sum^4_{i=1}(x_i-4)^2\\ &=\frac{1}{3}[(1-4)^2+(3-4)^2+(5-4)^2+(7-4)^2]\\ &=\frac{20}{3}\\ &\approx6.67\\ \end{align*}$$

This means that, roughly speaking, the squared difference between each point and the sample mean is around $6.67$ on average. This interpretation is awkward on its own, so instead of quoting the sample variance of a single sample, we often compare the sample variances of two different samples to understand which sample is more spread out.
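As a quick sanity check, here is a minimal Python sketch (standard library only; the variable names are ours) that reproduces this computation step by step:

sample = [1, 3, 5, 7]
n = len(sample)

# Sample mean
x_bar = sum(sample) / n  # 4.0

# Sample variance: sum of squared deviations divided by n-1
s_squared = sum((x - x_bar) ** 2 for x in sample) / (n - 1)
print(s_squared)  # 6.666666666666667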

Intuition behind why we divide by $n-1$ instead of $n$

Although we will formally prove below that dividing by $n-1$ gives an unbiased estimator of the population variance, let's first understand from another perspective why we should divide by $n-1$.

Ideally, our estimate of the population variance would be:

$$\begin{equation}\label{eq:ohGzVCDYbDArl9d4nZX} s^2=\frac{1}{n}\sum^n_{i=1}(x_i-\mu)^2 \end{equation}$$

Where $\mu$ is the population mean. In fact, if the population mean is known, then the sample variance should be computed as above without dividing by $n-1$. However, in most cases, the population mean is unknown, so the best we can do is to replace $\mu$ with the sample mean $\bar{x}$ like so:

$$\begin{equation}\label{eq:NrJOSeZL5qE9DxVIics} s^2=\frac{1}{n}\sum^n_{i=1}(x_i-\bar{x})^2 \end{equation}$$

However, when we replace $\mu$ with $\bar{x}$, it turns out that we would, on average, underestimate the population variance. We will now mathematically prove this.

Let's focus on the sum of squared differences. Instead of the sample mean $\bar{x}$, let's replace that with a variable $t$ and consider the expression as a function of $t$ like so:

$$f(t)=\sum^n_{i=1}(x_i-t)^2$$

Using calculus, our goal is to show that $t=\bar{x}$ minimizes this function. Let's take the first derivative of $f(t)$ with respect to $t$ like so:

$$\begin{align*} f'(t)&=\frac{d}{dt}\sum_{i=1}^n(x_i-t)^2\\ &=\sum_{i=1}^n\frac{d}{dt}(x_i-t)^2\\ &=-2\sum_{i=1}^n(x_i-t) \end{align*}$$

Setting this equal to zero gives:

$$\begin{align*} -2\sum_{i=1}^n(x_i-t)&=0\\ \sum_{i=1}^n(x_i-t)&=0\\ \sum_{i=1}^nx_i-\sum_{i=1}^nt&=0\\ \Big(\sum_{i=1}^nx_i\Big)-nt&=0\\ t&=\frac{1}{n}\sum_{i=1}^nx_i\\ t&=\bar{x}\\ \end{align*}$$

Let's also check the nature of this stationary point by referring to the second derivative:

$$\begin{align*} f''(t)&=\frac{d}{dt}f'(t)\\ &=\frac{d}{dt}\Big(-2\sum_{i=1}^n(x_i-t)\Big)\\ &=-2\Big(\sum_{i=1}^n\frac{d}{dt}(x_i-t)\Big)\\ &=-2\Big(\sum_{i=1}^n-1\Big)\\ &=2n \\ \end{align*}$$

Since the sample size $n$ is positive, we have that the second derivative is always positive. This means that the stationary point $t=\bar{x}$ is indeed a minimum! In other words, out of all the values $t$ can take, setting $t=\bar{x}$ will minimize the sum of squared differences:

$$\sum^n_{i=1}(x_i-\bar{x})^2 \le \sum^n_{i=1}(x_i-t)^2$$
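To make this concrete, here is a small numerical check on our earlier sample $(1,3,5,7)$, whose sample mean is $\bar{x}=4$. Evaluating $f(t)$ at the sample mean and at a few other values of $t$ confirms that the sum of squared differences is smallest at $t=\bar{x}$:

import numpy as np

x = np.array([1, 3, 5, 7])

def f(t):
    # Sum of squared differences about t
    return np.sum((x - t) ** 2)

print(f(4))    # 20   <- minimum, attained at t = x.mean()
print(f(3.5))  # 21.0
print(f(5))    # 24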

The population mean $\mu$ is some unknown constant, but we now know that:

$$\begin{equation}\label{eq:kUfz4YNwhBVtS8B1ZF0} \sum^n_{i=1}(x_i-\bar{x})^2 \le \sum^n_{i=1}(x_i-\mu)^2 \end{equation}$$

Even though we don't know what $\mu$ is, we know that the sum of squared differences when $t=\mu$ must be at least as large as the sum of squared differences when $t=\bar{x}$.

Let's divide both sides of \eqref{eq:kUfz4YNwhBVtS8B1ZF0} by $n$ to get:

$$\begin{equation}\label{eq:Vd8ISUnkMkIvhi6wExH} \frac{1}{n}\sum^n_{i=1}(x_i-\bar{x})^2 \le \frac{1}{n}\sum^n_{i=1}(x_i-\mu)^2 \end{equation}$$

The right-hand side is our ideal estimate \eqref{eq:ohGzVCDYbDArl9d4nZX} from earlier. To make this clear, let's write \eqref{eq:Vd8ISUnkMkIvhi6wExH} as:

$$\begin{equation}\label{eq:mfxzwx5FHb6tVM1v3Zl} \frac{1}{n}\sum^n_{i=1}(x_i-\bar{x})^2 \le \text{ideal} \end{equation}$$

This means that estimating the population variance using the left-hand side of \eqref{eq:mfxzwx5FHb6tVM1v3Zl} will generally give a value less than the ideal estimate. To compensate for this underestimation, we must make the left-hand side larger. One way of doing so is by dividing by a smaller amount, say $n-1$:

$$\frac{1}{n-1}\sum^n_{i=1}(x_i-\bar{x})^2$$

Of course, this leads to more questions such as why we should divide specifically by $n-1$ instead of, say, $n-2$ or $n-3$, both of which also make the left-hand side of \eqref{eq:mfxzwx5FHb6tVM1v3Zl} larger. The purpose of this exercise is merely to understand that dividing by some number less than $n$ compensates for the underestimation. As for why we specifically divide by $n-1$, we prove mathematically below that dividing by $n-1$ adjusts our estimate exactly so that we neither underestimate nor overestimate on average.
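To see this bias correction in action, here is a minimal simulation sketch. We assume, purely for illustration, a normal population with $\sigma^2=4$ and samples of size $n=5$ - the choice of distribution and constants is arbitrary. Averaged over many samples, dividing by $n$ falls short of $\sigma^2$, dividing by $n-1$ hits it, and dividing by $n-2$ overshoots:

import numpy as np

rng = np.random.default_rng(0)
sigma2, n, trials = 4.0, 5, 200_000

# Draw many samples of size n from a population with variance sigma2
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

# Sum of squared deviations about each sample's own mean
ss = np.sum((samples - samples.mean(axis=1, keepdims=True)) ** 2, axis=1)

print(np.mean(ss / n))        # ~3.2 (divide by n:   underestimates)
print(np.mean(ss / (n - 1)))  # ~4.0 (divide by n-1: unbiased)
print(np.mean(ss / (n - 2)))  # ~5.3 (divide by n-2: overestimates)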

Properties of sample variance

Theorem.

Unbiased estimator of the population variance

Given an i.i.d. sample $(X_1,X_2,\cdots,X_n)$ drawn from a population with mean $\mu$ and variance $\sigma^2$, the sample variance $S^2$ is an unbiased estimator for the population variance $\sigma^2$, that is:

$$\mathbb{E}(S^2)=\sigma^2$$

Proof. We start off with the following algebraic manipulation:

$$\begin{align*} \sum^n_{i=1}(X_i-\bar{X})^2 &=\sum^n_{i=1}(X_i^2-2X_i\bar{X}+\bar{X}^2)\\ &=\Big(\sum^n_{i=1}X_i^2\Big)-2\bar{X}\Big(\sum^n_{i=1}X_i\Big)+n\bar{X}^2\\ &=\Big(\sum^n_{i=1}X_i^2\Big)-2\bar{X}\cdot n\bar{X}+n\bar{X}^2\\ &=\Big(\sum^n_{i=1}X_i^2\Big)-2n\bar{X}^2+n\bar{X}^2\\ &=-n\bar{X}^2+\sum^n_{i=1}X_i^2\\ \end{align*}$$

Here, the second step uses $\sum^n_{i=1}\bar{X}^2=n\bar{X}^2$ (since $\bar{X}$ does not depend on $i$), and the third step uses $\sum^n_{i=1}X_i=n\bar{X}$, which follows from the definition of the sample mean.

Multiplying both sides by $1/(n-1)$ gives:

$$\frac{1}{n-1} \sum^n_{i=1}(X_i-\bar{X})^2= \frac{1}{n-1}\Big(-n\bar{X}^2+\sum^n_{i=1}X_i^2\Big)$$

The left-hand side is the formula for the sample variance $S^2$ so:

$$S^2= \frac{1}{n-1}\Big(-n\bar{X}^2+\sum^n_{i=1}X_i^2\Big)$$

Now, let's take the expected value of both sides and use the property of linearity of expected values to simplify:

$$\begin{equation}\label{eq:MGFWQ0zdxObMW1zhXiV} \begin{aligned}[b] \mathbb{E}(S^2)&= \mathbb{E} \Big[\frac{1}{n-1}\Big(-n\bar{X}^2+\sum^n_{i=1}X_i^2\Big)\Big]\\ &= \frac{1}{n-1}\mathbb{E} \Big(-n\bar{X}^2+\sum^n_{i=1}X_i^2\Big)\\ &=\frac{1}{n-1}\Big[\mathbb{E} \Big(-n\bar{X}^2\Big)+\mathbb{E}\Big(\sum^n_{i=1}X_i^2\Big)\Big]\\ &= \frac{1}{n-1}\Big[-n\cdot\mathbb{E} \Big(\bar{X}^2\Big)+\sum^n_{i=1}\mathbb{E}(X_i^2)\Big]\\ \end{aligned} \end{equation}$$

Now, from the property of variance, we know that:

$$\begin{equation}\label{eq:ZQMklBf4CcDEfOxcVdJ} \mathbb{E}(\bar{X}^2)= \mathbb{V}(\bar{X})+[\mathbb{E}(\bar{X})]^2 \end{equation}$$

We have previously derived the variance as well as the expected value of the sample mean $\bar{X}$ to be:

$$\begin{align*} \mathbb{V}(\bar{X})&=\frac{\sigma^2}{n}\\ \mathbb{E}(\bar{X})&=\mu \end{align*}$$

Substituting these values into \eqref{eq:ZQMklBf4CcDEfOxcVdJ} gives:

$$\begin{equation}\label{eq:LlQymAMmsVKtv6MIqTc} \mathbb{E}(\bar{X}^2)=\frac{\sigma^2}{n}+\mu^2 \end{equation}$$

Once again, from the same property of variance, we have that:

$$\begin{equation}\label{eq:OPC1YMGbDIHlCRGd6IJ} \begin{aligned}[b] \mathbb{E}(X_i^2)&=\mathbb{V}(X_i)+[\mathbb{E}(X_i)]^2\\ &=\sigma^2+\mu^2 \end{aligned} \end{equation}$$

Substituting \eqref{eq:LlQymAMmsVKtv6MIqTc} and \eqref{eq:OPC1YMGbDIHlCRGd6IJ} into \eqref{eq:MGFWQ0zdxObMW1zhXiV} gives:

$$\begin{align*} \mathbb{E}(S^2)&= \frac{1}{n-1}\Big[-n\cdot \Big(\frac{\sigma^2}{n}+\mu^2\Big) +\sum^n_{i=1}(\sigma^2+\mu^2)\Big]\\ &=\frac{1}{n-1}\Big[-\sigma^2-n\mu^2 +n(\sigma^2+\mu^2)\Big]\\ &=\frac{1}{n-1}\Big(-\sigma^2-n\mu^2 +n\sigma^2+n\mu^2\Big)\\ &=\frac{1}{n-1}\Big(n\sigma^2-\sigma^2\Big)\\ &=\frac{1}{n-1}\Big[\sigma^2(n-1)\Big]\\ &=\sigma^2\\ \end{align*}$$

This proves that the sample variance $S^2$ is an unbiased estimator for the population variance $\sigma^2$.
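The algebraic identity used at the start of the proof, namely $\sum^n_{i=1}(X_i-\bar{X})^2=\big(\sum^n_{i=1}X_i^2\big)-n\bar{X}^2$, is also easy to verify numerically. Here is a quick sketch on some arbitrary made-up data:

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n, x_bar = len(x), x.mean()

lhs = np.sum((x - x_bar) ** 2)         # sum of squared deviations
rhs = np.sum(x ** 2) - n * x_bar ** 2  # shortcut form from the proof

print(lhs, rhs)  # 32.0 32.0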

Computing sample variance using Python

We can easily compute the sample variance using Python's NumPy library. By default, the var(~) method returns the following biased sample variance:

$$s^2=\frac{1}{n}\sum^n_{i=1}(x_i-\bar{x})^2$$

To compute the unbiased sample variance instead, supply the argument ddof=1:

import numpy as np
# ddof=1 makes var(~) divide by n-1 instead of n
np.var([3,5,1,7], ddof=1)
6.666666666666667

Note that ddof stands for delta degrees of freedom and adjusts the divisor as follows:

$$\frac{1}{n\color{green}{-\mathrm{ddof}}}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2$$
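For instance, the default ddof=0 reproduces the biased divide-by-$n$ estimate, while ddof=1 gives the unbiased one:

import numpy as np

x = [1, 3, 5, 7]
print(np.var(x))          # 5.0 (divides by n)
print(np.var(x, ddof=1))  # 6.666666666666667 (divides by n-1)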