Comprehensive Guide on Box Plot Diagrams

Last updated: Nov 5, 2022
Tags: Probability and Statistics

What is a boxplot diagram?

A boxplot diagram, or box-whisker diagram, is a popular way to visualize the spread of our dataset using quartiles. Recall that quartiles are 3 values that split an ordered sequence of values into four parts of equal size. For instance, the quartiles for the following values are:

Here, note the following:

  • the sequence of values is ordered.

  • 25% of the data are lower than the 1st quartile (Q1).

  • 50% of the data are lower than the 2nd quartile (Q2), which is the median.

  • 75% of the data are lower than the 3rd quartile (Q3).
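
As a rough illustration, the quartiles can be computed with NumPy's `percentile` function. The sample below is an assumption made for illustration (it is the one used in the Python section of this guide), and NumPy's default linear interpolation is only one of several quartile conventions, so other tools may report slightly different numbers.

```python
import numpy as np

# Illustrative sample (an assumption); the values do not need to be pre-sorted
xs = [1, 3, 3, 5, 5, 6, 6, 7, 8, 9]

# 25th, 50th and 75th percentiles, using NumPy's default linear interpolation
q1, q2, q3 = np.percentile(xs, [25, 50, 75])
print(q1, q2, q3)  # 3.5 5.5 6.75
```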

The boxplot diagram visualizes these quartiles like so:

Note the following:

  • apart from the quartiles, we also include information about the minimum and maximum value.

  • we draw a line, or a whisker, from the minimum to the 1st quartile and from the 3rd quartile to the maximum.

  • we draw a box from the 1st quartile to the 3rd quartile and include a vertical line to indicate the median. The box represents the interquartile range, which holds the middle 50% of the data.
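
The five values that a boxplot diagram draws can be collected in one small helper. This is a minimal sketch: the name `five_number_summary` and the sample are assumptions made for illustration, not part of any library.

```python
import numpy as np

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sequence of numbers."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (float(np.min(values)), float(q1), float(median),
            float(q3), float(np.max(values)))

xs = [1, 3, 3, 5, 5, 6, 6, 7, 8, 9]  # illustrative sample (an assumption)
lo, q1, med, q3, hi = five_number_summary(xs)
iqr = q3 - q1  # width of the box (the interquartile range)
print(lo, q1, med, q3, hi, iqr)  # 1.0 3.5 5.5 6.75 9.0 3.25
```

The whiskers span `lo` to `q1` and `q3` to `hi`, while the box spans `q1` to `q3`.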

Note that boxplot diagrams are typically laid out vertically instead of horizontally as in the case here.

Interpreting boxplot diagrams

Consider the following two sets of values:

Symmetric distribution

Skewed distribution

Here, the red vertical lines from left to right represent the 1st, 2nd and 3rd quartiles:

Let's examine the difference between the shapes of these boxplot diagrams:

  • for the skewed distribution, the 1st quartile and the median both sit on the far left, close to the minimum, while the 3rd quartile and the maximum are far apart. This means that the majority of the values are concentrated at the lower end of the distribution, and the larger values are much more spread out. This agrees with the histogram of the skewed distribution.

  • for the symmetric distribution, the boxplot diagram is also symmetric. In particular, from our histogram, we can see that our symmetric distribution is a normal distribution, and so the bulk of the values are concentrated at the center. This is why the median appears right in the middle. Values further away from the center are more spread out, which is why we see long whiskers.
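
The same reading can be made numerically by comparing the two halves of the box. The sketch below uses simulated samples (a normal and an exponential distribution, both assumptions chosen for illustration) rather than the data behind the figures above, and the helper name `box_halves` is hypothetical.

```python
import numpy as np

def box_halves(values):
    """Return (median - Q1, Q3 - median): the widths of the two halves of the box."""
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    return float(q2 - q1), float(q3 - q2)

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
symmetric = rng.normal(loc=10, scale=2, size=10_000)
skewed = rng.exponential(scale=2, size=10_000)

sym_lower, sym_upper = box_halves(symmetric)
skw_lower, skw_upper = box_halves(skewed)

# Roughly equal halves suggest symmetry; a wider upper half suggests right skew
print(f"symmetric: lower={sym_lower:.2f}, upper={sym_upper:.2f}")
print(f"skewed:    lower={skw_lower:.2f}, upper={skw_upper:.2f}")
```

For the skewed sample, the upper half of the box should come out clearly wider than the lower half, which mirrors the median sitting close to the 1st quartile in the diagram.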

Boxplot for outlier detection

Consider the following set of values:

Here, the minimum and maximum are outliers because they are far away from the bulk of the values. To identify these outliers, we can define a lower and an upper threshold: values smaller than the lower threshold or larger than the upper threshold are considered outliers. One way of computing these thresholds uses the quartiles and the interquartile range:

$$\begin{equation}\label{eq:zV0JeVWuNA7MLyXxPaK} \begin{aligned}[b] L&=\text{Q}_1-(1.5\times\text{IQR})\\ U&=\text{Q}_3+(1.5\times\text{IQR}) \end{aligned} \end{equation}$$

Here:

  • $\text{Q}_1$ and $\text{Q}_3$ are the lower and upper quartiles, respectively.

  • $\text{IQR}$ is the interquartile range, which is computed by $\text{Q}_3-\text{Q}_1$.

For our example, $\text{Q}_1$, $\text{Q}_3$ and $\text{IQR}$ are as follows:

$$\begin{align*} \text{Q}_1&=9.5\\ \text{Q}_3&=13.5\\ \text{IQR}&=4\\ \end{align*}$$

Using our formula \eqref{eq:zV0JeVWuNA7MLyXxPaK}, let's compute the lower $L$ and upper $U$ thresholds:

$$\begin{align*} L&=(9.5)-(1.5\times4)=3.5\\ U&=(13.5)+(1.5\times4)=19.5\\ \end{align*}$$

This means that any value lower than $L=3.5$ or larger than $U=19.5$ is considered an outlier. Let's draw these thresholds in our diagram:

Here, we can see that the minimum and maximum are to the left and right of $L$ and $U$ respectively, which means that they are outliers!
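
The arithmetic above can be reproduced in a few lines. This is a minimal sketch in which the helper names `iqr_thresholds` and `is_outlier` are hypothetical; the quartiles $9.5$ and $13.5$ are the ones from our example.

```python
def iqr_thresholds(q1, q3, factor=1.5):
    """Return the (lower, upper) outlier thresholds L and U."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

def is_outlier(x, lower, upper):
    """A value is an outlier if it lies strictly below L or strictly above U."""
    return x < lower or x > upper

lower, upper = iqr_thresholds(9.5, 13.5)
print(lower, upper)  # 3.5 19.5
```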

Finally, let's draw the boxplot diagram:

Notice how the original minimum and maximum are now outliers and are replaced with white circles. We can consider them to be excluded from our values, which means that we have a new minimum and maximum.
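
To make the "new minimum and maximum" concrete, here is a sketch that separates the outliers and reports where the whiskers end. The values behind the figure are not listed in the text, so the sample below is borrowed from the Python section of this guide, and the function name is hypothetical.

```python
import numpy as np

def whiskers_and_outliers(values, factor=1.5):
    """Return (new_minimum, new_maximum, outliers) under the IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    inside = values[(values >= lower) & (values <= upper)]
    outliers = values[(values < lower) | (values > upper)]
    # The whiskers end at the most extreme values that are not outliers
    return float(inside.min()), float(inside.max()), outliers.tolist()

xs = [1, 3, 3, 5, 5, 6, 6, 7, 10, 20]
print(whiskers_and_outliers(xs))  # (1.0, 10.0, [20.0])
```

Here only $20$ lies beyond the upper threshold, so the upper whisker stops at $10$.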

WARNING

We have said that the formula to compute outlier thresholds is as follows:

$$\begin{align*} L&=\text{Q}_1-(1.5\times\text{IQR})\\ U&=\text{Q}_3+(1.5\times\text{IQR}) \end{align*}$$

We typically use the factor $1.5$, but this number is a convention rather than a fixed rule, and we can select a different one. For instance, if we want to be stricter about what counts as an outlier, so that fewer values are flagged, then we should select a larger factor such as $2$.
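
The effect of the factor is easy to check by counting how many values are flagged. This is a sketch under assumptions: the sample and the factors $0.5$, $1.5$ and $5$ are chosen only for illustration, and `count_outliers` is a hypothetical helper.

```python
import numpy as np

def count_outliers(values, factor):
    """Count values below Q1 - factor*IQR or above Q3 + factor*IQR."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return int(np.sum((values < lower) | (values > upper)))

xs = [1, 3, 3, 5, 5, 6, 6, 7, 10, 20]
for factor in (0.5, 1.5, 5.0):
    print(factor, count_outliers(xs, factor))
# 0.5 3
# 1.5 1
# 5.0 0
```

A larger factor widens the thresholds, so fewer values are flagged.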

Overlaying values on boxplot diagrams

Traditionally, boxplot diagrams only consist of whiskers and a box. In practice, we often overlay the values on top of boxplot diagrams for additional insights. For instance, consider the following two sets of values and their corresponding boxplot diagrams:

Here, we have included the raw values. We can see that even though the boxplot diagrams are identical, the number of values used to construct them is different. If you imagine the set of values as a sample from some population (e.g. size of family), then the top boxplot diagram is more trustworthy than the one below because the sample size is larger.

On top of this, overlaying values allows us to see the distribution of the points within each quartile. For instance, for the first boxplot diagram, we can see that the values are concentrated on the left, which is an insight that we gain only by overlaying the values!
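
One possible way to produce such an overlay with matplotlib is to draw the raw values as a scatter plot on top of the box. This is a sketch, not the method used for the figures above: the function name, the jitter amount and the sample are all assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def boxplot_with_points(ax, values, jitter=0.04, seed=0):
    """Draw a boxplot of `values` on `ax` and overlay the raw values as dots."""
    values = np.asarray(values, dtype=float)
    # Hide the default outlier markers so that no value is drawn twice
    ax.boxplot(values, showfliers=False)
    # A single box is centred at x=1; a little horizontal jitter separates ties
    rng = np.random.default_rng(seed)
    x_positions = 1 + rng.uniform(-jitter, jitter, size=values.size)
    return ax.scatter(x_positions, values, alpha=0.6, zorder=3)

fig, ax = plt.subplots()
points = boxplot_with_points(ax, [1, 3, 3, 5, 5, 6, 6, 7, 8, 9])
ax.set_ylabel("$x$")
```

Calling `plt.show()` afterwards displays the figure. The jitter only moves the dots sideways, so their heights still match the data.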

Drawing boxplot diagrams using Python

Basic boxplot diagram

Drawing boxplot diagrams is a piece of cake using matplotlib, which is Python's main graphing library. Let's use the same set of values as before:

import matplotlib.pyplot as plt
xs = [1,3,3,5,5,6,6,7,8,9] # data does not have to be sorted
plt.boxplot(xs) # or add labels=[''] to get rid of '1'
plt.xlabel('$x$')
plt.show()

This produces the following boxplot diagram:

This is basically the same boxplot diagram we had before, except that the orientation is vertical instead of horizontal.

Boxplot diagrams with outliers

By default, outliers as computed by \eqref{eq:zV0JeVWuNA7MLyXxPaK} will be indicated by circles:

xs = [1,3,3,5,5,6,6,7,10,20]
plt.boxplot(xs)
plt.xlabel('$x$')
plt.show()

This generates the following plot:

To hide the outlier markers, add the option showfliers=False. This only suppresses the circles; the whiskers are still drawn using the same thresholds:

xs = [1,3,3,5,5,6,6,7,10,20]
plt.boxplot(xs, showfliers=False)
plt.xlabel('$x$')
plt.show()

This generates the following plot:
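
Note that `showfliers=False` only hides the markers and leaves the whiskers where they were. As a hedged sketch, `boxplot` also accepts a `whis` parameter: a single number replaces the factor $1.5$, and the pair `(0, 100)` places the whiskers at the minimum and maximum so that no value is treated as an outlier. The dictionary returned by `boxplot` lets us inspect what was drawn.

```python
import matplotlib.pyplot as plt

xs = [1, 3, 3, 5, 5, 6, 6, 7, 10, 20]
fig, (ax_default, ax_full) = plt.subplots(1, 2)

# Default rule (factor 1.5): 20 is drawn as an outlier
default = ax_default.boxplot(xs)
# Whiskers at the 0th and 100th percentiles: they cover the whole range
full = ax_full.boxplot(xs, whis=(0, 100))

# Each entry is a Line2D; for a vertical box the y-data holds the values
print(default["fliers"][0].get_ydata())       # outlier values
print(default["whiskers"][1].get_ydata()[1])  # end of the upper whisker
print(full["whiskers"][1].get_ydata()[1])     # now reaches the maximum
```

With the default rule the upper whisker stops at $10$ and $20$ is a flier, whereas with `whis=(0, 100)` the whisker extends to $20$.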

Horizontal boxplot diagram

By default, the orientation of the boxplot diagram is vertical. We can change this to horizontal like so:

xs = [1,3,3,5,5,6,6,7,10,20]
plt.boxplot(xs, vert=False)
plt.xlabel('$x$')
plt.show()

This generates the following plot:

Multiple boxplot diagrams

To draw multiple boxplot diagrams in a single plot:

xs1 = [1,3,3,5,5,6,6,7,8,9]
xs2 = [7,6,2,3,5,6,7,8,9,9]
plt.boxplot([xs1,xs2], labels=['x1','x2']) # in Matplotlib 3.9+, labels has been renamed to tick_labels
plt.xlabel('$x$')
plt.show()

This generates the following plot:

Final remarks

Boxplot diagrams are one of the most popular techniques for exploratory data analysis. We can visualize the quartiles to roughly understand how values are distributed. The other useful application of boxplot diagrams is for outlier detection, which is also based on the quartiles. There are other more complicated techniques to identify outliers, but I recommend that you start off with the boxplot technique because of its simplicity and robustness.

Published by Isshin Inada