**Prob and Stats**

# Comprehensive Guide on Sample Covariance

Aug 12, 2023


# Sample covariance

If $\boldsymbol{x}=(x_1,x_2,\cdots,x_n)$ and $\boldsymbol{y}=(y_1,y_2,\cdots,y_n)$ are a pair of samples, then the sample covariance $s_{xy}$ between $\boldsymbol{x}$ and $\boldsymbol{y}$ is computed as:

$$s_{xy}=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$

Where:

- $n$ is the sample size.

- $\bar{x}$ is the sample mean of $\boldsymbol{x}$.

- $\bar{y}$ is the sample mean of $\boldsymbol{y}$.
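As a quick sketch, the formula translates directly into plain Python - here with a small made-up sample for illustration:

```python
def sample_covariance(x, y):
    """Compute s_xy = sum((x_i - xbar) * (y_i - ybar)) / (n - 1)."""
    n = len(x)
    xbar = sum(x) / n  # sample mean of x
    ybar = sum(y) / n  # sample mean of y
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

print(sample_covariance([1, 2, 4, 5], [1, 3, 6, 7]))  # 5.0
```

Note the $n-1$ divisor, whose purpose is explained later in this guide.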

# Intuition behind sample covariance

Consider the following 11 data points:

Here, each data point corresponds to an observation $(x_i,y_i)$ in a sample. Let's draw the sample mean of $\boldsymbol{x}$ and the sample mean of $\boldsymbol{y}$ below:

Recall the sample covariance formula:

$$s_{xy}=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$

Basically, the sample covariance involves taking the average (up to the $n-1$ divisor) of the products $(x_i-\bar{x})(y_i-\bar{y})$ over all points. We can visualize $(x_i-\bar{x})$ and $(y_i-\bar{y})$ like so:

Here, we've focused only on the 1st and 3rd quadrants:

- For the points in the 1st quadrant (top-right), both $(x_i-\bar{x})$ and $(y_i-\bar{y})$ are positive, so their product $(x_i-\bar{x})(y_i-\bar{y})$ will be positive.

- For the points in the 3rd quadrant (bottom-left), both $(x_i-\bar{x})$ and $(y_i-\bar{y})$ are negative, which means that their product will also be positive.

Let's now focus on the points in the 2nd and 4th quadrants:

Note the following:

- For the points in the 2nd quadrant (top-left), $(x_i-\bar{x})$ is negative while $(y_i-\bar{y})$ is positive. This means that their product is negative.

- For the points in the 4th quadrant (bottom-right), $(x_i-\bar{x})$ is positive while $(y_i-\bar{y})$ is negative, which means that their product is also negative.

To summarize, the sign of $(x_i-\bar{x})(y_i-\bar{y})$ will depend on where the point is located:

All the points in the green region contribute to making the sample covariance more positive, while all points in the red region contribute to making the sample covariance more negative.

So far, we've only looked at the sign of $(x_i-\bar{x})(y_i-\bar{y})$, so let's focus now on its magnitude. The product can be interpreted as the area of the rectangle where $(x_i-\bar{x})$ is the width and $(y_i-\bar{y})$ is the height. Below is an example:

Here, we've drawn only 4 rectangles because drawing all of them would clutter the diagram. We can see that the rectangle formed by the highlighted point in the 1st quadrant is large, which means that this point contributes strongly to making the sample covariance positive.
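This rectangle-area view is easy to compute directly. The sketch below uses a small illustrative sample (not the 11 points from the figure) and prints each signed area $(x_i-\bar{x})(y_i-\bar{y})$:

```python
x = [1, 2, 4, 5]
y = [1, 3, 6, 7]
xbar = sum(x) / len(x)  # 3.0
ybar = sum(y) / len(y)  # 4.25

# Signed area of each rectangle: width (x_i - xbar) times height (y_i - ybar).
areas = [(xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)]
print(areas)  # [6.5, 1.25, 1.75, 5.5]
```

Every product here is positive, meaning every point falls in a green region, so the sample covariance (their sum divided by $n-1$) is positive.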

As we can imagine, the sample covariance would be positive in this case: not only are there more points in the green region, but the rectangle areas of those points are generally larger as well. If the sample covariance is positive, we say that there is a positive association between $x$ and $y$, which means that as $x$ increases, $y$ tends to increase as well. Again, this should be intuitive from the diagram below:

We end up with a positive association when there are more points in the green region that are far away from the mean origin $(\bar{x},\bar{y})$. Whenever there is a positive association, the line of best fit through the data points will have a positive slope:

If a line with a positive slope fits the data points well, then we say that $x$ and $y$ have a positive linear relationship. In contrast, a negative association might look as follows:

Here, there are more points in the red region that are generally far away from the mean origin $(\bar{x},\bar{y})$, and thus the covariance is negative. Whenever there is a negative association, the line of best fit will have a negative slope.

Zero association might look as follows:

Here, there are roughly the same number of points in the green and red regions with comparable magnitudes, so the covariance is approximately zero. In these cases, the line of best fit will be approximately horizontal.

In some cases, our data points may not look linear at all:

The covariance in these cases is typically near zero as well.
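A classic case is a perfectly symmetric nonlinear relationship. In the sketch below (an illustrative example, not from the figure above), $y$ is fully determined by $x$, yet the covariance is exactly zero because the positive and negative rectangle areas cancel:

```python
import numpy as np

# A parabolic relationship, symmetric about the mean of x:
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x**2  # y depends on x, yet the *linear* association is zero

cov = np.cov(x, y)[0, 1]
print(cov)  # 0.0
```

This is why a near-zero covariance means "no *linear* association", not "no relationship at all".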

# Computing sample covariance by hand

Consider the following dataset:

| $x$ | $y$ |
|---|---|
| 1 | 1 |
| 2 | 3 |
| 4 | 6 |
| 5 | 7 |
Compute the sample covariance of $\boldsymbol{x}$ and $\boldsymbol{y}$.

Solution. We have $n=4$ pairs of samples. Let's start by computing the sample means $\bar{x}$ and $\bar{y}$, which are required when computing the covariance:

$$\bar{x}=\frac{1+2+4+5}{4}=3,\qquad\bar{y}=\frac{1+3+6+7}{4}=4.25$$

The sample covariance is computed as:

$$s_{xy}=\frac{1}{n-1}\sum_{i=1}^{4}(x_i-\bar{x})(y_i-\bar{y})=\frac{(-2)(-3.25)+(-1)(-1.25)+(1)(1.75)+(2)(2.75)}{3}=\frac{15}{3}=5$$

This means that $x$ and $y$ are positively associated as confirmed by the graph below:

We can see that as $x$ increases, $y$ increases as well - this is what a positive association is!

# Why sample covariance is not usually computed

The sample covariance tells us the association between two variables. However, the value that the sample covariance takes is not bounded and is heavily affected by the scale of the sample. For instance, consider some data points about people's weight and height. Let's draw two plots with different units:

*Left plot: small covariance. Right plot: high covariance (same data, larger units).*

The pattern of the data points is identical regardless of whether we choose kilograms or grams and meters or centimeters. However, notice how the covariance is much larger on the right because the scale is bigger. This is quite misleading: we might think that a higher covariance reflects a stronger positive association, but that is clearly not the case here - the culprit is the scale of the variables. It is therefore meaningless to compare covariances between different pairs of variables, because their scales might differ.
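We can see this scale-dependence numerically. The sketch below uses a small hypothetical weight/height sample (the values are assumptions for illustration); rescaling to grams and centimeters multiplies the covariance by $1000\times100=100{,}000$ without changing the pattern at all:

```python
import numpy as np

# Hypothetical weight/height sample (values chosen purely for illustration).
weight_kg = np.array([60.0, 70.0, 80.0, 90.0])
height_m  = np.array([1.60, 1.70, 1.75, 1.85])

# The same data re-expressed in grams and centimeters:
weight_g  = weight_kg * 1000
height_cm = height_m * 100

print(np.cov(weight_kg, height_m)[0, 1])   # small covariance
print(np.cov(weight_g, height_cm)[0, 1])   # 100,000 times larger - same data, same pattern
```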

We typically normalize the covariance so that the measure no longer depends on the scale of the variables. This normalized version of the covariance is called the correlation and, unlike the covariance, which is unbounded, the correlation is bounded between $-1$ and $1$. Please consult our comprehensive guide on correlation for the details!
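As a sketch of that normalization (dividing the covariance by the product of the two sample standard deviations), using the small dataset from the worked example:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
y = np.array([1.0, 3.0, 6.0, 7.0])

# Correlation = covariance normalized by the two sample standard deviations.
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r)  # always between -1 and 1

# Rescaling the variables leaves the correlation unchanged:
print(np.corrcoef(x * 1000, y * 100)[0, 1])  # identical value
```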

# Why we divide by n-1 instead of n

Just as with the sample variance, we divide by $n-1$ instead of $n$ for the sample covariance. The reasoning is the same: dividing by $n-1$ yields an unbiased estimator of the population covariance.

## Unbiased estimator for population covariance

The sample covariance $S_{XY}$ is an unbiased estimator of the population covariance $\sigma_{XY}=\text{cov}(X,Y)$, that is:

$$\mathbb{E}[S_{XY}]=\sigma_{XY}$$

Proof. We begin with the definition of sample covariance, expanded into its computational form:

$$S_{XY}=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})=\frac{1}{n-1}\left(\sum_{i=1}^nX_iY_i-n\bar{X}\bar{Y}\right)$$

Now, taking the expected value of both sides:

$$\mathbb{E}[S_{XY}]=\frac{1}{n-1}\left(\sum_{i=1}^n\mathbb{E}[X_iY_i]-n\,\mathbb{E}[\bar{X}\bar{Y}]\right)$$

Now, recall the computational form of covariance:

$$\text{cov}(X,Y)=\mathbb{E}[XY]-\mathbb{E}[X]\,\mathbb{E}[Y]$$

This can be rewritten as:

$$\mathbb{E}[XY]=\text{cov}(X,Y)+\mathbb{E}[X]\,\mathbb{E}[Y]$$

Let's apply this identity to the first term, $\sum_{i=1}^n\mathbb{E}[X_iY_i]$. Each pair $(X_i,Y_i)$ has covariance $\sigma_{XY}$ and means $\mu_X$ and $\mu_Y$, so:

$$\sum_{i=1}^n\mathbb{E}[X_iY_i]=\sum_{i=1}^n\big(\sigma_{XY}+\mu_X\mu_Y\big)=n\sigma_{XY}+n\mu_X\mu_Y$$

Let's now apply the same identity to the second term, $\mathbb{E}[\bar{X}\bar{Y}]$:

$$\mathbb{E}[\bar{X}\bar{Y}]=\text{cov}(\bar{X},\bar{Y})+\mu_X\mu_Y$$

Now, using the bilinearity of covariance, we can take the summation signs outside:

$$\text{cov}(\bar{X},\bar{Y})=\text{cov}\left(\frac{1}{n}\sum_{i=1}^nX_i,\;\frac{1}{n}\sum_{j=1}^nY_j\right)=\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n\text{cov}(X_i,Y_j)$$

Notice that $\text{cov}(X_i,Y_j)=0$ when $i\ne{j}$ - for instance, $\text{cov}(X_1,Y_2)$ is zero because $X_1$ and $Y_2$ come from independent observations, whereas $X_1$ and $Y_1$ are dependent. Therefore, we have that:

$$\text{cov}(\bar{X},\bar{Y})=\frac{1}{n^2}\sum_{i=1}^n\text{cov}(X_i,Y_i)=\frac{\sigma_{XY}}{n}\qquad\Longrightarrow\qquad\mathbb{E}[\bar{X}\bar{Y}]=\frac{\sigma_{XY}}{n}+\mu_X\mu_Y$$

Substituting the two terms back into the expectation gives:

$$\mathbb{E}[S_{XY}]=\frac{1}{n-1}\left(n\sigma_{XY}+n\mu_X\mu_Y-n\left(\frac{\sigma_{XY}}{n}+\mu_X\mu_Y\right)\right)=\frac{(n-1)\,\sigma_{XY}}{n-1}=\sigma_{XY}$$

This completes the proof that the sample covariance is an unbiased estimator of the population covariance of $X$ and $Y$.
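A quick simulation (a sketch, not part of the proof) makes the correction visible: drawing many small samples with a known population covariance, the $n-1$ divisor averages out to the true value while the $n$ divisor undershoots by a factor of $(n-1)/n$. The population covariance of $0.8$ below is an assumed value chosen for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
true_cov = 0.8  # population covariance we build into the data

# Draw `trials` samples, each of n correlated (X, Y) pairs.
cov = [[1.0, true_cov], [true_cov, 1.0]]
samples = rng.multivariate_normal([0.0, 0.0], cov, size=(trials, n))

x, y = samples[..., 0], samples[..., 1]
xbar = x.mean(axis=1, keepdims=True)
ybar = y.mean(axis=1, keepdims=True)
cross = ((x - xbar) * (y - ybar)).sum(axis=1)  # sum of products per sample

print((cross / (n - 1)).mean())  # ≈ 0.80: unbiased
print((cross / n).mean())        # ≈ 0.64: biased low by (n-1)/n = 4/5
```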

# Computing sample covariance using Python

We can easily compute the sample covariance using Python's `numpy` library. Suppose we have the same dataset as earlier. The sample covariance can be computed like so:

```
import numpy as np

x = [1, 2, 4, 5]
y = [1, 3, 6, 7]
cov_matrix = np.cov(x, y)  # uses unbiased estimator (divides by n-1 instead of n)
cov_matrix
array([[3.33333333, 5.        ],
       [5.        , 7.58333333]])
```

Here, NumPy's `cov(~)` method returns a covariance matrix - a symmetric matrix whose diagonal entries are the sample variances of $\boldsymbol{x}$ and $\boldsymbol{y}$, and whose off-diagonal entries are the sample covariance. To extract the covariance value, use NumPy's `[~]` syntax:

```
cov_matrix[0][1]
5.0
```

This is exactly what we got earlier when we computed the sample covariance by hand!
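As a final sanity check, `np.cov` also accepts a `bias` parameter that switches the divisor from $n-1$ to $n$, which connects back to the earlier discussion of the two divisors:

```python
import numpy as np

x = [1, 2, 4, 5]
y = [1, 3, 6, 7]

# Default: divides by n-1 (sample covariance), matching the hand computation.
print(np.cov(x, y)[0, 1])             # 5.0

# bias=True divides by n instead: 15 / 4 rather than 15 / 3.
print(np.cov(x, y, bias=True)[0, 1])  # 3.75
```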