
# Comprehensive Guide on Hypergeometric Distribution

May 20, 2023

Let's go through a motivating example to derive the hypergeometric distribution!

# Motivating example

Suppose we draw $3$ balls from a bag containing $5$ red balls and $4$ green balls **without replacement**. What is the probability of drawing $2$ green balls?

Since we are drawing without replacement, the probability of drawing a green ball changes after each draw. Because the draws are not independent, we cannot use the binomial distribution here.

We can instead approach this problem using the idea of combinations. The probability of drawing $2$ green balls is:

\begin{equation}\label{eq:zZfkFhwdzVg3yttG45N}
\mathbb{P}(2\text{ green balls})=\frac{\text{number of ways of drawing $2$ green balls and $1$ red ball}}{\text{number of ways of drawing $3$ balls from $9$ balls}}
\end{equation}

Let's focus on the numerator first. Since we are drawing $3$ balls, drawing $2$ green balls means that the other ball we draw must be red. The number of ways of drawing $2$ green balls from $4$ green balls is:

$$\binom{4}{2}=6$$

The reason we use combinations here is to avoid double counting. For instance, if we label the $4$ green balls as $\{G_1, G_2, G_3,G_4\}$, then the pairs of green balls $\{G_1,G_2\}$ and $\{G_2,G_1\}$ should be treated as the same pair.

The number of ways of drawing the remaining $1$ red ball from $5$ red balls is:

$$\binom{5}{1}=5$$

Therefore, the number of outcomes with $2$ green balls and $1$ red ball is the product:

\begin{equation}\label{eq:DFD8IJMAzjaHX3FiZOB}
\binom{4}{2}\binom{5}{1}=6\times5=30
\end{equation}

Now, let's focus on the denominator of \eqref{eq:zZfkFhwdzVg3yttG45N}. The total number of possible outcomes when drawing $3$ balls from $9$ balls is:

\begin{equation}\label{eq:BXb3bOoPAURpRj7tx4y}
\binom{9}{3}=84
\end{equation}

Here, we use combinations again instead of permutations because we don't want to double-count, that is, $\color{green}\mathrm{G} \color{green}\mathrm{G} \color{red}\mathrm{R}$ is the same as $\color{green}\mathrm{G} \color{red}\mathrm{R} \color{green}\mathrm{G}$.

Substituting \eqref{eq:DFD8IJMAzjaHX3FiZOB} and \eqref{eq:BXb3bOoPAURpRj7tx4y} into \eqref{eq:zZfkFhwdzVg3yttG45N} gives:

\begin{equation}\label{eq:nJVLT02keJzQSeO8TGH}
\mathbb{P}(2\text{ green balls})=\frac{\binom{4}{2}\binom{5}{1}}{\binom{9}{3}}=\frac{30}{84}=\frac{5}{14}
\end{equation}
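As a quick sanity check (not part of the original derivation), this arithmetic can be reproduced with Python's built-in `math.comb`:

```python
from math import comb

# Favorable outcomes: 2 green balls out of 4, times 1 red ball out of 5
favorable = comb(4, 2) * comb(5, 1)   # 6 * 5 = 30
# Total outcomes: any 3 balls out of 9
total = comb(9, 3)                    # 84
print(favorable / total)              # 30/84 ≈ 0.357
```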

Let's now generalize the fraction containing the three combinations in \eqref{eq:nJVLT02keJzQSeO8TGH}. Recall that \eqref{eq:nJVLT02keJzQSeO8TGH} answers the following specific question:

we draw $3$ balls from a bag containing $5$ red balls and $4$ green balls **without replacement**. What is the probability of drawing $2$ green balls?

We only care whether a ball is green or not, so the color red is irrelevant. Therefore, without losing any information, we can rephrase this question as:

we draw without replacement $3$ balls from a bag containing $9$ balls, of which $4$ are green. What is the probability of drawing $2$ green balls?

Again, the answer to this question is still \eqref{eq:nJVLT02keJzQSeO8TGH}. Now, let's replace all these numbers with variables:

we draw without replacement $n$ balls from a bag containing $N$ balls, of which $m$ are green. What is the probability of drawing $x$ green balls?

The answer to this question is:

$$p(x)=\frac{\dbinom{m}{x}\dbinom{N-m}{n-x}}{\dbinom{N}{n}}$$

What we have just derived is the probability mass function of the so-called hypergeometric random variable! Compared to other special discrete probability distributions, the hypergeometric distribution has the most parameters $(N,m,n)$. Let's now take a moment to understand the relationship between the parameters and the input $x$, specifically what values they are allowed to take.

Instead of discussing in abstract terms, let's go back to our example once more:

we draw without replacement $n$ balls from a bag containing $N$ balls, of which $m$ are green. What is the probability of drawing $x$ green balls?

Clearly, the possible values that $x$ can take are $0,1,2,\ldots,n$ because the sample size is $n$. Since there are only $m$ green balls in the bag, $x$ can never be greater than $m$, that is, $x\le{m}$. Moreover, the number of non-green balls in our sample $n-x$ must not be greater than the number of non-green balls in the bag $N-m$, that is, $n-x\le{N-m}$.
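These constraints on $x$ can be captured in a small helper function. The name `hypergeom_support` is our own illustrative choice, not a library function:

```python
def hypergeom_support(N, m, n):
    # x is bounded above by min(n, m) and below by max(0, n - (N - m)),
    # since the sample cannot contain more than N - m non-green balls
    return range(max(0, n - (N - m)), min(n, m) + 1)

print(list(hypergeom_support(9, 4, 3)))   # [0, 1, 2, 3]
print(list(hypergeom_support(5, 4, 3)))   # [2, 3]
```

For instance, with $N=5$, $m=4$ and $n=3$, there is only one non-green ball in the bag, so at least two of the three drawn balls must be green.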

# Hypergeometric distribution

A random variable $X$ is said to have a hypergeometric probability distribution with parameters $(N,m,n)$ if and only if $X$ has the following probability mass function:

$$p(x)=\frac{\dbinom{m}{x}\dbinom{N-m}{n-x}}{\dbinom{N}{n}}$$

Where:

$x$ is an integer $0, 1, 2, \cdots, n$.

$x\le{m}$ and $n-x\le{N-m}$.

If $X$ follows a hypergeometric distribution with parameters $(N,m,n)$, then we can write $X\sim\text{HypGeom}(N,m,n)$.
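The probability mass function above can be sketched directly with `math.comb`; `hypergeom_pmf` is an illustrative helper written for this guide, not a library function:

```python
from math import comb

def hypergeom_pmf(x, N, m, n):
    """P(X = x) when X ~ HypGeom(N, m, n): x green balls in n draws
    without replacement from N balls, of which m are green."""
    return comb(m, x) * comb(N - m, n - x) / comb(N, n)

# The motivating example: 2 green balls when drawing 3 from N=9, m=4
print(hypergeom_pmf(2, 9, 4, 3))   # 30/84 ≈ 0.357
```

The probabilities over the whole support sum to one, as any probability mass function must.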

## Finding the number of males in a group

Suppose we wanted to form a group by selecting $3$ out of $10$ people of which $4$ are male and $6$ are female. What is the probability that we have $2$ males in the formed group?

Solution. The key here is to notice that we are sampling without replacement, that is, we cannot select the same person twice. If we let random variable $X$ represent the number of males in our group, then $X$ follows a hypergeometric distribution with parameters $N=10$, $m=4$ and $n=3$. This means that the probability mass function of $X$ is:

\begin{equation}\label{eq:vwTQCr5dqSWjIfMi1w1}
p(x)=\frac{\dbinom{4}{x}\dbinom{6}{3-x}}{\dbinom{10}{3}}
\end{equation}

The probability that we have $x=2$ males in the formed group is:

$$p(2)=\frac{\binom{4}{2}\binom{6}{1}}{\binom{10}{3}}=\frac{6\times6}{120}=0.3$$

The hypergeometric probability mass function \eqref{eq:vwTQCr5dqSWjIfMi1w1} looks as follows:

Indeed, we can see that the probability of $X=2$ is $0.3$.

# Properties of hypergeometric distribution

## Expected value of hypergeometric random variable

If $X$ is a hypergeometric random variable with parameters $(N,m,n)$, then the expected value or the mean of $X$ is:

$$\mathbb{E}(X)=\frac{nm}{N}$$

Proof. Let's first derive a useful identity that we will use later:

$$x\binom{m}{x}
=x\cdot\frac{m!}{x!\,(m-x)!}
=\frac{m!}{(x-1)!\,(m-x)!}
=m\cdot\frac{(m-1)!}{(x-1)!\,(m-x)!}
=m\binom{m-1}{x-1}$$

Let's now derive $\mathbb{E}(X)$. By definition of expected value, we have that:

$$\mathbb{E}(X)
=\sum_{x=0}^{n}x\,\frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}}
=\sum_{x=1}^{n}\frac{m\binom{m-1}{x-1}\binom{N-m}{n-x}}{\frac{N}{n}\binom{N-1}{n-1}}
=\frac{nm}{N}\sum_{x=1}^{n}\frac{\binom{m-1}{x-1}\binom{(N-1)-(m-1)}{(n-1)-(x-1)}}{\binom{N-1}{n-1}}$$

Here, we used the identity above on the numerator as well as $\binom{N}{n}=\frac{N}{n}\binom{N-1}{n-1}$ on the denominator.

Now, notice that the term in the summation is the probability mass function of the hypergeometric distribution with parameters $N-1$, $m-1$ and $n-1$. By definition of probability mass functions, the sum of probabilities across all possible values of $X$ should equal one, which means:

$$\mathbb{E}(X)=\frac{nm}{N}$$

This completes the proof.
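We can numerically corroborate $\mathbb{E}(X)=nm/N$ by summing $x\,p(x)$ over the support; this is a quick check using the earlier example $N=10$, $m=4$, $n=3$, not part of the proof:

```python
from math import comb

N, m, n = 10, 4, 3
pmf = lambda x: comb(m, x) * comb(N - m, n - x) / comb(N, n)

# Mean computed directly from the pmf matches the closed form nm/N
mean = sum(x * pmf(x) for x in range(n + 1))
print(mean)          # 1.2 (up to float rounding)
print(n * m / N)     # 1.2
```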

## Variance of hypergeometric random variable

If $X$ is a hypergeometric random variable with parameters $(N,m,n)$, then the variance of $X$ is:

$$\text{Var}(X)=n\cdot\frac{m}{N}\left(1-\frac{m}{N}\right)\cdot\frac{N-n}{N-1}$$

Proof. To derive the variance, let's first prove another useful identity. Suppose we draw $n$ balls from a bag containing $N$ balls, of which $m$ are red. Let random variable $X$ denote the number of red balls drawn. The identity we want to prove is:

$$\mathbb{E}\left[\binom{X}{2}\right]=\binom{n}{2}\cdot\frac{\binom{m}{2}}{\binom{N}{2}}$$

Firstly, the number of pairs of red balls drawn is:

$$\binom{X}{2}$$

The expected number of red ball pairs drawn is therefore:

\begin{equation}\label{eq:PegLhsaiqreqf23VjUs}
\mathbb{E}\left[\binom{X}{2}\right]
\end{equation}

Another equivalent expression for the expected number of red ball pairs is:

\begin{equation}\label{eq:OUlUMJOScf9fPm3VZTe}
\mathbb{E}\big[\text{red pairs drawn}\big]=
\big(\text{number of pairs among the $n$ drawn balls}\big)\times
\big(\text{proportion of red pairs in the bag}\big)
\end{equation}

This might look complicated, so let's slightly rephrase it by ignoring the word "pairs" like so:

$$\mathbb{E}\big[\text{red balls drawn}\big]=n\times\frac{m}{N}$$

The fraction on the right represents the proportion of the red balls in the bag. If we drew $n$ balls, then we should expect the number of red balls drawn to be $n$ times this proportion. For instance, suppose there were $100$ balls in the bag, of which $50$ are red. If we draw $10$ balls, we should expect $10\times(50/100)=5$ balls to be red.

Now, instead of single balls, we focus on pairs of balls. The same idea must hold for pairs of balls - this is why \eqref{eq:OUlUMJOScf9fPm3VZTe} is true. Our goal now is to find the three terms in \eqref{eq:JtFedLwopARRCNG0T1x}.

The number of pairs in $n$ drawn balls is:

\begin{equation}\label{eq:eEMnoTJTToZDRHgUuur}
\binom{n}{2}
\end{equation}

The proportion of red pairs among all pairs of balls in the bag is:

\begin{equation}\label{eq:wo8vkktEEZFTtJ4yjEt}
\frac{\binom{m}{2}}{\binom{N}{2}}
\end{equation}

Substituting \eqref{eq:eEMnoTJTToZDRHgUuur} and \eqref{eq:wo8vkktEEZFTtJ4yjEt} into \eqref{eq:OUlUMJOScf9fPm3VZTe} gives:

\begin{equation}\label{eq:DIb4h2uqwL5HX030Mvh}
\mathbb{E}\big[\text{red pairs drawn}\big]=\binom{n}{2}\cdot\frac{\binom{m}{2}}{\binom{N}{2}}
\end{equation}

Finally, equating \eqref{eq:PegLhsaiqreqf23VjUs} and \eqref{eq:DIb4h2uqwL5HX030Mvh} gives us the desired identity:

\begin{equation}\label{eq:uvkUlb1aYOANyavpT8L}
\mathbb{E}\left[\binom{X}{2}\right]=\binom{n}{2}\cdot\frac{\binom{m}{2}}{\binom{N}{2}}
\end{equation}

Let's simplify \eqref{eq:uvkUlb1aYOANyavpT8L} to derive $\mathbb{E}[X(X-1)]$:

$$\frac{\mathbb{E}\big[X(X-1)\big]}{2}=\frac{n(n-1)}{2}\cdot\frac{m(m-1)}{N(N-1)}$$

\begin{equation}\label{eq:OAR7VgWFEPk3EsPK0JC}
\mathbb{E}\big[X(X-1)\big]=\frac{n(n-1)\,m(m-1)}{N(N-1)}
\end{equation}

Now, recall that the variance can be computed as:

$$\text{Var}(X)=\mathbb{E}(X^2)-\big[\mathbb{E}(X)\big]^2$$

This can also be written as:

\begin{equation}\label{eq:FRleloliiG0A42mkRb8}
\text{Var}(X)=\mathbb{E}\big[X(X-1)\big]+\mathbb{E}(X)-\big[\mathbb{E}(X)\big]^2
\end{equation}

Substituting $\mathbb{E}\big[X(X-1)\big]$ from \eqref{eq:OAR7VgWFEPk3EsPK0JC} and $\mathbb{E}(X)$ into \eqref{eq:FRleloliiG0A42mkRb8} gives:

$$\text{Var}(X)=\frac{n(n-1)\,m(m-1)}{N(N-1)}+\frac{nm}{N}-\left(\frac{nm}{N}\right)^2
=n\cdot\frac{m}{N}\left(1-\frac{m}{N}\right)\cdot\frac{N-n}{N-1}$$

This completes the proof.
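The variance formula can likewise be verified numerically against a direct computation from the pmf (a sanity check under the same $N=10$, $m=4$, $n=3$ example):

```python
from math import comb

N, m, n = 10, 4, 3
pmf = lambda x: comb(m, x) * comb(N - m, n - x) / comb(N, n)

# Variance computed directly from the pmf vs. the closed-form expression
mean = sum(x * pmf(x) for x in range(n + 1))
var = sum((x - mean) ** 2 * pmf(x) for x in range(n + 1))
closed_form = n * (m / N) * (1 - m / N) * (N - n) / (N - 1)
print(var, closed_form)   # both ≈ 0.56
```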

## The binomial distribution as a special case of the hypergeometric distribution

The hypergeometric distribution with parameters $(N,m,n)$ converges to a binomial distribution with parameters $(n,m/N)$ when $N$ tends to infinity.

Intuition. Recall that the key difference between the hypergeometric distribution and the binomial distribution is that the hypergeometric distribution is concerned with sampling without replacement whereas the binomial distribution deals with sampling with replacement. In other words, the trials of the hypergeometric distribution are not independent whereas the trials of the binomial distribution are.

To understand why the binomial distribution provides a good approximation to the hypergeometric distribution when parameters $N$ and $m$ are both large, consider the following question:

Suppose we draw without replacement $2$ balls from a bag containing $1000$ balls, of which $300$ are green. What is the probability of drawing $2$ green balls?

The probability of initially drawing a green ball is:

$$\frac{300}{1000}=0.3$$

Suppose we draw a green ball first. The probability of drawing a green ball again is:

$$\frac{299}{999}\approx0.2993$$

What if we had drawn a non-green ball first? The probability of drawing a green ball next is:

$$\frac{300}{999}\approx0.3003$$

Notice how the probabilities are approximately the same, just as if the trials are performed with replacement! This only happens when $N$ is large and the sample size $n$ is not too large.

When $N$ is small, say $N=10$ and $m=5$, then the above probabilities would be:

$$\frac{5}{10}=0.5,\qquad\frac{4}{9}\approx0.444,\qquad\frac{5}{9}\approx0.556$$

Notice how the probabilities vary in this case, which means that there is a clear difference between sampling with and without replacement.

Moreover, the sample size $n$ should ideally be quite small for the approximation to work well. For example, suppose we have a large number of balls in the bag, say $N=1000$. If the sample size is large, say $n=999$, then the denominator of the probabilities will get smaller and smaller as we keep sampling, thereby amplifying the effects of sampling without replacement.

As a rule of thumb, if the sample is at most $5\%$ of the population, then the binomial approximation performs well:

$$\frac{n}{N}\le0.05$$

The binomial approximation can sometimes be useful because the probability mass function of the binomial distribution is much easier to compute compared to that of the hypergeometric distribution.
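To see the approximation in action, we can compare the two probability mass functions side by side; the helpers below are hand-rolled with `math.comb` rather than taken from any library:

```python
from math import comb

def hypergeom_pmf(x, N, m, n):
    # P(x successes in n draws without replacement)
    return comb(m, x) * comb(N - m, n - x) / comb(N, n)

def binom_pmf(x, n, p):
    # P(x successes in n independent trials with success probability p)
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Large population (N=1000, m=300, so p=0.3), small sample (n=2)
print(hypergeom_pmf(2, 1000, 300, 2))   # ≈ 0.0898
print(binom_pmf(2, 2, 0.3))             # 0.09 -- very close

# Small population (N=10, m=5): the approximation degrades
print(hypergeom_pmf(2, 10, 5, 2))       # ≈ 0.2222
print(binom_pmf(2, 2, 0.5))             # 0.25
```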

Proof. The proof involves some tedious algebraic manipulation, so we will color-code some terms to clarify how they are being manipulated. The hypergeometric probability mass function $p_H(x)$ is:

\begin{equation}\label{eq:T4YBAJzVJMBuTcipJKa}
p_H(x)=\frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}}
=\binom{n}{x}\cdot
\frac{{\color{red}m(m-1)\cdots(m-x+1)}}{N(N-1)\cdots(N-x+1)}\cdot
\frac{{\color{blue}(N-m)(N-m-1)\cdots(N-m-n+x+1)}}{(N-x)(N-x-1)\cdots(N-n+1)}
\end{equation}

Let's focus on the fraction with the red numerator:

$$\frac{{\color{red}m(m-1)\cdots(m-x+1)}}{N(N-1)\cdots(N-x+1)}$$

Let's check how many terms we are taking the product of on the numerator and the denominator. This is easy to do because each term decreases by one, so counting the number of terms is as easy as subtracting the rightmost term (smallest) from the leftmost term (largest) and adding one. The number of terms on the numerator is:

$$m-(m-x+1)+1=x$$

The number of terms on the denominator is:

$$N-(N-x+1)+1=x$$

Therefore, the number of terms on the numerator and the denominator is the same! This means that we can express the fraction like so:

\begin{equation}\label{eq:NVrU0WmaY15TOL2T3dQ}
\frac{m(m-1)\cdots(m-x+1)}{N(N-1)\cdots(N-x+1)}
=\left(\frac{m}{N}\right)
\left(\frac{\frac{m}{N}-\frac{1}{N}}{1-\frac{1}{N}}\right)\cdots
\left(\frac{\frac{m}{N}-\frac{x-1}{N}}{1-\frac{x-1}{N}}\right)
\end{equation}

As $N$ tends to infinity, all the fractions that we subtract will be zero:

If we define $p=m/N$, then $p$ can be interpreted as a probability because $N$ is always equal to or greater than $m$. For instance, if we had a bag containing $N$ balls of which $m$ are green balls, then $p=m/N$ represents the probability of randomly drawing a green ball. Since $p$ is to be treated as a fixed constant (e.g. $p=0.3$) and $m=pN$, we know that $m$ must also tend to infinity as $N$ tends to infinity.

Now going back to our equation \eqref{eq:NVrU0WmaY15TOL2T3dQ}, as $N$ and $m$ both tend to infinity, all the remaining fractions will tend to $m/N$, that is:

Therefore, \eqref{eq:NVrU0WmaY15TOL2T3dQ} becomes:

\begin{equation}\label{eq:YC73YRSjCI63yUA92pn}
\lim_{N\to\infty}\frac{m(m-1)\cdots(m-x+1)}{N(N-1)\cdots(N-x+1)}=p^x
\end{equation}

Let's now perform the same steps on the fraction with the blue numerator in \eqref{eq:T4YBAJzVJMBuTcipJKa} shown below:

$$\frac{{\color{blue}(N-m)(N-m-1)\cdots(N-m-n+x+1)}}{(N-x)(N-x-1)\cdots(N-n+1)}$$

Once again, let's check how many terms we are taking the product of on the numerator and the denominator. The number of terms in the numerator is:

$$(N-m)-(N-m-n+x+1)+1=n-x$$

The number of terms in the denominator is:

$$(N-x)-(N-n+1)+1=n-x$$

Since the number of terms on the numerator and the denominator is the same, we can express the fraction as:

$$\left(\frac{N-m}{N-x}\right)\left(\frac{N-m-1}{N-x-1}\right)\cdots\left(\frac{N-m-n+x+1}{N-n+1}\right)$$

Since $n$ and $x$ are finite, all the fractions that we are subtracting become $0$ as $N$ tends to infinity:

Note that this is true only when $n$ does not tend to infinity. This is why $n$, which represents the sample size, should be small!

On the other hand, as $N$ and $m$ both tend to infinity, all the remaining fractions become:

$$\frac{N-m}{N}=1-\frac{m}{N}=1-p$$

Therefore, we have that:

\begin{equation}\label{eq:nk3iF6FRxPkkVJnPb0Z}
\lim_{N\to\infty}\frac{(N-m)(N-m-1)\cdots(N-m-n+x+1)}{(N-x)(N-x-1)\cdots(N-n+1)}=(1-p)^{n-x}
\end{equation}

Substituting \eqref{eq:YC73YRSjCI63yUA92pn} and \eqref{eq:nk3iF6FRxPkkVJnPb0Z} into \eqref{eq:T4YBAJzVJMBuTcipJKa} gives:

$$\lim_{N\to\infty}p_H(x)=\binom{n}{x}p^x(1-p)^{n-x}$$

The right-hand side is the binomial probability mass function! Therefore, the binomial distribution is the limit of the hypergeometric distribution when $N$ tends to infinity. This completes the proof.

# Working with hypergeometric distribution using Python

## Computing probabilities

Recall the example from earlier:

Suppose we wanted to form a group by selecting $3$ out of $10$ people of which $4$ are male and $6$ are female. What is the probability that we have $2$ males in the formed group?

If we let random variable $X$ denote the number of males in the formed group, then we have that:

$$X\sim\text{HypGeom}(N=10,\,m=4,\,n=3)$$

The probability mass function of $X$ is therefore:

$$p(x)=\frac{\dbinom{4}{x}\dbinom{6}{3-x}}{\dbinom{10}{3}}$$

The probability that we get $2$ males in the formed group is:

$$p(2)=\frac{\binom{4}{2}\binom{6}{1}}{\binom{10}{3}}=0.3$$

Instead of manually calculating the probability, we can use Python's SciPy library instead:

```
from scipy.stats import hypergeom

x = 2
N = 10
m = 4
n = 3
hypergeom.pmf(x, N, m, n)
0.30000000000000016
```

## Plotting probability mass function

To draw the hypergeometric probability mass function, use the `hypergeom.pmf(~)` function on a list of integers instead:

```
import matplotlib.pyplot as plt
from scipy.stats import hypergeom

N = 10
m = 4
n = 3
xs = list(range(n+1))   # [0,1,2,3]
pmfs = hypergeom.pmf(xs, N, m, n)
plt.bar(xs, pmfs)
plt.xlabel('$x$')
plt.ylabel('$p(x)$')
plt.show()
```

This generates the following diagram: