Comprehensive Guide on Box Plot Diagrams
What is a boxplot diagram?
A boxplot diagram, or box-whisker diagram, is a popular way to visualize the spread of our dataset using quartiles. Recall that quartiles are 3 values that split an ordered sequence of values into four parts of equal size. For instance, the quartiles for the following values are:
Here, note the following:
the sequence of values is ordered.
25% of the data are lower than the 1st quartile (Q1).
50% of the data are lower than the 2nd quartile (Q2), which is the median.
75% of the data are lower than the 3rd quartile (Q3).
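To make the definition concrete, here is a minimal sketch that computes the three quartiles with Python's standard library. The sample is the one used in the plotting examples later in this guide. Quartile conventions differ slightly between tools; `method='inclusive'` is an assumption made here because it interpolates linearly between neighbouring values, the same way NumPy's default percentile method does, so other software may report slightly different numbers.

```python
import statistics

xs = [1, 3, 3, 5, 5, 6, 6, 7, 8, 9]

# quantiles() sorts the data internally and, with n=4, returns the
# three cut points that split it into four parts.
q1, q2, q3 = statistics.quantiles(xs, n=4, method='inclusive')

print(q1, q2, q3)  # 3.5 5.5 6.75
```

Note that the 2nd quartile equals `statistics.median(xs)`, as expected.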
The boxplot diagram visualizes these quartiles like so:
Note the following:
apart from the quartiles, we also include information about the minimum and maximum value.
we draw a line, or a whisker, from the minimum to the 1st quartile and from the 3rd quartile to the maximum.
we draw a box from the 1st quartile to the 3rd quartile and include a vertical line to indicate the median. The box represents the interquartile range, which holds the middle 50% of the data.
Note that boxplot diagrams are typically laid out vertically instead of horizontally as in the case here.
Interpreting boxplot diagrams
Consider the following two sets of values, one symmetric and one skewed:
Symmetric distribution | Skewed distribution |
---|---|
Here, the red vertical lines from left to right represent the 1st, 2nd and 3rd quartiles:
Let's examine the difference between the shapes of these boxplot diagrams:
for the skewed distribution, the 1st quartile and the median both appear on the far left, close to the minimum. In comparison, the 3rd quartile and the maximum are far apart. This means that the majority of the values are concentrated at the lower end of the distribution, while the larger values are much more spread out. This agrees with the histogram of the skewed distribution.
for the symmetric distribution, the boxplot diagram is also symmetric. In particular, from our histogram, we can see that our symmetric distribution is a normal distribution, and so the bulk of the values are concentrated at the center. This is why the median appears right in the middle. Values further away from the center are more spread out, which is why we see long whiskers.
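The same comparison can be made numerically by measuring the four segments of a boxplot diagram. The sketch below uses two small hypothetical samples (not the ones plotted above) and a hypothetical helper `box_segments`. It ignores outlier handling, so the whiskers run all the way to the minimum and maximum.

```python
import statistics

def box_segments(xs):
    """Return the lengths of the four boxplot segments, from low to high:
    lower whisker, lower half of the box, upper half of the box, upper whisker.
    Outlier handling is ignored, so the whiskers run to the min and max."""
    q1, q2, q3 = statistics.quantiles(xs, n=4, method='inclusive')
    return (q1 - min(xs), q2 - q1, q3 - q2, max(xs) - q3)

symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical sample
skewed = [1, 1, 2, 2, 3, 4, 6, 10, 20]   # hypothetical sample

print(box_segments(symmetric))  # (2.0, 2.0, 2.0, 2.0)
print(box_segments(skewed))     # (1.0, 1.0, 3.0, 14.0)
```

For the symmetric sample the segments mirror each other around the median. For the skewed sample the lower segments are short and the upper ones are long, which is the pattern described for the skewed distribution.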
Boxplot for outlier detection
Consider the following set of values:
Here, the minimum and maximum are outliers because they are far away from the bulk of the values. To identify these outliers, we can define a lower and an upper threshold: values smaller than the lower threshold or larger than the upper threshold are considered outliers. One way of computing these thresholds is by using the quartiles and the interquartile range:

$$L=\text{Q}_1-1.5\cdot\text{IQR},\qquad U=\text{Q}_3+1.5\cdot\text{IQR}$$
Here:
$\text{Q}_1$ and $\text{Q}_3$ are the lower and upper quartiles, respectively.
$\text{IQR}$ is the interquartile range, which is computed by $\text{Q}_3-\text{Q}_1$.
For our example, $\text{Q}_1$, $\text{Q}_3$ and $\text{IQR}$ are as follows:
Using the threshold formula, let's compute the lower threshold $L$ and the upper threshold $U$:
This means that any value lower than $L=3.5$ or larger than $U=19.5$ is considered an outlier. Let's draw these thresholds in our diagram:
Here, we can see that the minimum and maximum are to the left and right of $L$ and $U$ respectively, which means that they are outliers!
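The threshold computation can be sketched in a few lines. The quartile values below are not quoted from the text: they are the only pair consistent with the stated thresholds $L=3.5$ and $U=19.5$ under the usual factor of $1.5$, so treat them as an inferred assumption. The helper name `iqr_thresholds` is hypothetical.

```python
def iqr_thresholds(q1, q3, factor=1.5):
    """Lower and upper outlier thresholds from the two outer quartiles."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

# Quartiles inferred from L = 3.5 and U = 19.5 (an assumption, see above).
lower, upper = iqr_thresholds(q1=9.5, q3=13.5)
print(lower, upper)  # 3.5 19.5
```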
Finally, let's draw the boxplot diagram:
Notice how the original minimum and maximum are now drawn as white circles to mark them as outliers. Because they are excluded from the box and whiskers, the whiskers end at a new minimum and maximum, namely the smallest and largest values that lie within the thresholds.
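To see where the whiskers end once the outliers are set aside, here is a small sketch. The sample is hypothetical: it was chosen so that the thresholds $3.5$ and $19.5$ from this section apply to it, and it is not the data behind the diagram. The helper `whisker_ends` is likewise a made-up name.

```python
def whisker_ends(xs, lower, upper):
    """Smallest and largest values that lie within [lower, upper]."""
    kept = [x for x in xs if lower <= x <= upper]
    return min(kept), max(kept)

xs = [3, 8, 9, 11, 11, 12, 12, 14, 15, 25]  # hypothetical sample

# 3 and 25 fall outside the thresholds, so the whiskers stop at 8 and 15.
print(whisker_ends(xs, lower=3.5, upper=19.5))  # (8, 15)
```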
We have said that the formula to compute the outlier thresholds is as follows:

$$L=\text{Q}_1-1.5\cdot\text{IQR},\qquad U=\text{Q}_3+1.5\cdot\text{IQR}$$

We typically use the factor $1.5$, but this number is a convention rather than a requirement, and we can select any factor we wish. For instance, if we want to flag only the more extreme values as outliers, we should select a larger factor such as $2$.
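The effect of the factor is easy to check on a small example. The sketch below uses a hypothetical sample and a hypothetical helper `find_outliers`. The quartiles are computed with the standard library's `'inclusive'` method, which is an assumption about the quartile convention.

```python
import statistics

def find_outliers(xs, factor):
    """Values outside Q1 - factor*IQR and Q3 + factor*IQR."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method='inclusive')
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [x for x in xs if x < lower or x > upper]

xs = [3, 8, 9, 11, 11, 12, 12, 14, 15, 25]  # hypothetical sample

print(find_outliers(xs, 1.5))  # [3, 25]
print(find_outliers(xs, 2))    # [25]
```

With the larger factor the thresholds move outwards, so fewer values are flagged. In matplotlib, the same factor can be passed to `boxplot` through its `whis` parameter.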
Overlaying values on boxplot diagrams
Traditionally, boxplot diagrams only consist of whiskers and a box. In practice, we often overlay the values on top of boxplot diagrams for additional insights. For instance, consider the following two sets of values and their corresponding boxplot diagrams:
Here, we have included the raw values. We can see that even though the boxplot diagrams are identical, the number of values used to construct them is different. If you imagine the set of values as a sample from some population (e.g. size of family), then the top boxplot diagram is more trustworthy than the one below because the sample size is larger.
On top of this, overlaying values allows us to see the distribution of the points within each quartile. For instance, for the first boxplot diagram, we can see that the values are concentrated on the left, which is an insight that we gain only by overlaying the values!
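One way to produce such an overlay is to scatter the raw values on top of each box with a little random jitter. This is a minimal sketch, not the code behind the figures above: the two samples are hypothetical (picked so that their minimum, quartiles and maximum coincide), the helper `boxplot_with_points` is a made-up name, and the boxes are drawn vertically.

```python
import random
import matplotlib.pyplot as plt

def boxplot_with_points(ax, xs, position, jitter=0.08, seed=0):
    """Draw one vertical boxplot at `position` and overlay the raw values."""
    ax.boxplot(xs, positions=[position], widths=0.5)
    # Small random horizontal offsets keep equal values from overlapping.
    rng = random.Random(seed)
    offsets = [position + rng.uniform(-jitter, jitter) for _ in xs]
    ax.scatter(offsets, xs, alpha=0.6, zorder=3)
    return offsets

# Hypothetical samples with the same minimum, quartiles and maximum.
large = [1, 2, 3, 4, 5, 6, 7, 8, 9]
small = [1, 3, 5, 7, 9]

fig, ax = plt.subplots()
boxplot_with_points(ax, large, position=1)
boxplot_with_points(ax, small, position=2)
ax.set_xticks([1, 2])
ax.set_xticklabels(['n = 9', 'n = 5'])
plt.show()
```

The two boxes should look the same, while the overlaid points make the difference in sample size visible.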
Drawing boxplot diagrams using Python
Basic boxplot diagram
Drawing boxplot diagrams is a piece of cake using matplotlib, which is Python's main graphing library. Let's use the same set of values as before:
```python
import matplotlib.pyplot as plt

xs = [1, 3, 3, 5, 5, 6, 6, 7, 8, 9]  # data does not have to be sorted
plt.boxplot(xs)                       # or add labels=[''] to get rid of '1'
plt.xlabel('$x$')
plt.show()
```
This produces the following boxplot diagram:
This is basically the same boxplot diagram we had before, except that the orientation is vertical instead of horizontal.
Boxplot diagrams with outliers
By default, outliers as computed by the quartile-based thresholds described earlier will be indicated by circles:
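```python
import matplotlib.pyplot as plt

xs = [1, 3, 3, 5, 5, 6, 6, 7, 10, 20]
plt.boxplot(xs)  # 20 lies beyond the upper threshold, so it is drawn as a circle
plt.xlabel('$x$')
plt.show()
```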
This generates the following plot:
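If we want the numbers behind the plot rather than the picture, matplotlib exposes the underlying computation as `matplotlib.cbook.boxplot_stats`. The sketch below applies it to the same values. The comments describe the values we expect; the exact printed formatting may vary between versions.

```python
from matplotlib import cbook

xs = [1, 3, 3, 5, 5, 6, 6, 7, 10, 20]

# boxplot_stats returns one dict of summary values per dataset.
stats = cbook.boxplot_stats(xs)[0]

print(stats['q1'], stats['med'], stats['q3'])  # quartiles: 3.5, 5.5 and 6.75
print(stats['whislo'], stats['whishi'])        # whisker ends: 1 and 10
print(stats['fliers'])                         # only 20 is flagged as an outlier
```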
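To hide the outlier markers, add the option showfliers=False. Note that this only hides the circles; the whiskers still end at the most extreme values within the thresholds: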
```python
import matplotlib.pyplot as plt

xs = [1, 3, 3, 5, 5, 6, 6, 7, 10, 20]
plt.boxplot(xs, showfliers=False)  # do not draw the outlier markers
plt.xlabel('$x$')
plt.show()
```
This generates the following plot:
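If the goal is instead to have the whiskers span the whole data range, so that no value is treated as an outlier, one option is the `whis` parameter given as a pair of percentiles. The sketch below checks the effect with `matplotlib.cbook.boxplot_stats`; passing the same `whis=(0, 100)` to `plt.boxplot` should draw the whiskers accordingly.

```python
from matplotlib import cbook

xs = [1, 3, 3, 5, 5, 6, 6, 7, 10, 20]

# whis=(0, 100) places the whiskers at the 0th and 100th percentiles,
# i.e. the minimum and maximum, so no value falls outside them.
stats = cbook.boxplot_stats(xs, whis=(0, 100))[0]

print(stats['whislo'], stats['whishi'])  # whisker ends: 1 and 20
print(len(stats['fliers']))              # 0
```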
Horizontal boxplot diagram
By default, the orientation of the boxplot diagram is vertical. We can change this to horizontal like so:
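```python
import matplotlib.pyplot as plt

xs = [1, 3, 3, 5, 5, 6, 6, 7, 10, 20]
# recent Matplotlib releases (3.10+) also accept orientation='horizontal'
plt.boxplot(xs, vert=False)
plt.xlabel('$x$')
plt.show()
```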
This generates the following plot:
Multiple boxplot diagrams
To draw multiple boxplot diagrams in a single plot:
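```python
import matplotlib.pyplot as plt

xs1 = [1, 3, 3, 5, 5, 6, 6, 7, 8, 9]
xs2 = [7, 6, 2, 3, 5, 6, 7, 8, 9, 9]
# one box per list; Matplotlib 3.9+ renames `labels` to `tick_labels`
plt.boxplot([xs1, xs2], labels=['x1', 'x2'])
plt.xlabel('$x$')
plt.show()
```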
This generates the following plot:
Final remarks
Boxplot diagrams are one of the most popular techniques for exploratory data analysis. We can visualize the quartiles to roughly understand how values are distributed. The other useful application of boxplot diagrams is outlier detection, which is also based on the quartiles. There are other, more complicated techniques to identify outliers, but I recommend that you start off with the boxplot technique because of its simplicity and robustness.