Pandas | cut method
Pandas cut(~) method categorises numerical values into bins (intervals).
Parameters
1. x | array-like
A 1D input array whose numerical values will be segmented into bins.
2. bins | int or sequence<scalar> or IntervalIndex
The specified type of bins determines how the bins are computed:
| Type | Description |
|---|---|
| int | The number of equal-width bins. The range of x is extended by 0.1% so that the minimum and maximum values of x are included. |
| sequence<scalar> | The desired bin edges. Values that do not fall in a bin will be set to NaN. |
| IntervalIndex | The exact bins to use. |
3. right | boolean | optional
Whether to make the left bin edge exclusive and the right bin edge inclusive. By default, right=True.
4. labels | array or False | optional
The desired labels of the bins. By default, labels=None.
5. retbins | boolean | optional
Whether or not to return bins. By default, retbins=False.
6. precision | int | optional
The number of decimal places to display in the bin labels. By default, precision=3.
7. include_lowest | boolean | optional
Whether to make the left edge of the first bin inclusive. By default, include_lowest=False.
8. duplicates | string | optional
How to deal with duplicate bin edges:
| Value | Description |
|---|---|
| "raise" | Throw an error if any duplicate bin edges are set. |
| "drop" | Remove the duplicate bin edge and just keep one. |
By default, duplicates="raise".
9. ordered | boolean | optional | v1.1.0~
Whether or not to embed ordering information. This is only relevant if the return type is Categorical or Series of data-type Categorical. ordered can only be set to False if labels is provided. By default, ordered=True.
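The three accepted forms of bins can be compared side by side. The following is a minimal sketch; the sample values and variable names are made up for illustration, and the integer codes are used only to make the bin assignment easy to read.

```python
import pandas as pd

values = [1, 5, 9]  # illustrative data, not from the examples in this article

# int: three equal-width bins spanning the range of the data
by_count = pd.cut(values, bins=3)

# sequence of scalars: explicit bin edges
by_edges = pd.cut(values, bins=[0, 4, 8, 12])

# IntervalIndex: the exact intervals to use (gaps between intervals are allowed)
intervals = pd.IntervalIndex.from_tuples([(0, 4), (8, 12)])
by_intervals = pd.cut(values, bins=intervals)

# .codes gives the position of each value's bin; -1 marks a value in no bin
print(by_count.codes)
print(by_edges.codes)
print(by_intervals.codes)
```

With the IntervalIndex above, the value 5 falls in the gap between the two intervals, so it is left without a bin.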
Return Value
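The return type depends on the type of the labels parameter:
- if labels is unspecified:
  - if x is a Series, then a Series is returned. The values stored within this Series are intervals (the bins).
  - else, a Categorical is returned. The values stored within the Categorical are intervals (the bins).
- if labels is an array of scalars:
  - if x is a Series, then a Series is returned. The type of the values stored within this Series matches the type of the values stored in labels.
  - else, a Categorical is returned. The type of the values stored within the Categorical matches the type of the values stored in labels.
- if labels is a boolean False, then a NumPy array of integers is returned (or a Series of integers if x is a Series).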
If retbins=True, then in addition to the above, the bins are returned as a NumPy array. If bins is an IntervalIndex, then bins itself is returned instead.
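The return types described above can be checked directly. This is a small sketch with made-up input values; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

grades = [3, 6, 8]  # illustrative data

from_series = pd.cut(pd.Series(grades), bins=2)   # Series in, Series out
from_list = pd.cut(grades, bins=2)                # list in, Categorical out
as_codes = pd.cut(grades, bins=2, labels=False)   # integer bin indices
with_bins = pd.cut(grades, bins=2, retbins=True)  # tuple of (result, bin edges)

print(type(from_series).__name__)
print(type(from_list).__name__)
print(isinstance(as_codes, np.ndarray))
print(type(with_bins).__name__, type(with_bins[1]).__name__)
```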
Examples
Consider the following DataFrame about students and their grades:
    import pandas as pd

    raw_grades = [3,6,8,7,4,6]
    students = ["alex", "bob", "cathy", "doge", "eric", "fred"]
    df = pd.DataFrame({"name":students,"raw_grade":raw_grades})
    df

        name  raw_grade
    0   alex          3
    1    bob          6
    2  cathy          8
    3   doge          7
    4   eric          4
    5   fred          6
Basic Usage
To categorise the raw grades into four bins (segments):
    df["grade"] = pd.cut(df["raw_grade"], bins=4)   # returns a Series
    df

        name  raw_grade          grade
    0   alex          3  (2.995, 4.25]
    1    bob          6    (5.5, 6.75]
    2  cathy          8    (6.75, 8.0]
    3   doge          7    (6.75, 8.0]
    4   eric          4  (2.995, 4.25]
    5   fred          6    (5.5, 6.75]

The grade column now contains the bin that each raw grade falls into. Four equal-width bins are created in total, though only three of them appear here because no raw grade falls in (4.25, 5.5]. Note that (2.995, 4.25] just means 2.995 < raw_grade <= 4.25.
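To see where edges such as 4.25 and 6.75 come from, the computed edges can be compared against an evenly spaced grid between the minimum and maximum raw grade. This is a sketch for illustration; the use of np.linspace here is a way to check the result and is not claimed to be how cut(~) is implemented.

```python
import numpy as np
import pandas as pd

raw_grades = [3, 6, 8, 7, 4, 6]

# retbins=True also returns the computed bin edges
_, edges = pd.cut(raw_grades, bins=4, retbins=True)

# Five evenly spaced points between min (3) and max (8) delimit four bins
grid = np.linspace(min(raw_grades), max(raw_grades), 5)

print(edges)
print(grid)

# All edges except the first coincide with the grid; the first edge sits
# slightly below the minimum so that the value 3 is not excluded
print(np.allclose(edges[1:], grid[1:]))
print(edges[0] < min(raw_grades))
```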
Specifying custom bin edges
To specify custom bin edges, we can pass in an array of bin edges instead of an int:
    df["grade"] = pd.cut(df["raw_grade"], bins=[0,4,6,10])
    df

        name  raw_grade    grade
    0   alex          3   (0, 4]
    1    bob          6   (4, 6]
    2  cathy          8  (6, 10]
    3   doge          7  (6, 10]
    4   eric          4   (0, 4]
    5   fred          6   (4, 6]
We show the same df here for your reference:
    df

        name  raw_grade
    0   alex          3
    1    bob          6
    2  cathy          8
    3   doge          7
    4   eric          4
    5   fred          6
Specifying right
To make the left bin edge inclusive and the right bin edge exclusive, set right=False:
    df["grade"] = pd.cut(df["raw_grade"], bins=[0,4,6,10], right=False)
    df

        name  raw_grade    grade
    0   alex          3   [0, 4)
    1    bob          6  [6, 10)
    2  cathy          8  [6, 10)
    3   doge          7  [6, 10)
    4   eric          4   [4, 6)
    5   fred          6  [6, 10)
Notice how we have [0, 4) instead of the default (0, 4].
Specifying labels
We can give labels to our bins by setting the labels parameter:
    df["grade"] = pd.cut(df["raw_grade"], bins=3, labels=["C","B","A"])
    df

        name  raw_grade grade
    0   alex          3     C
    1    bob          6     B
    2  cathy          8     A
    3   doge          7     A
    4   eric          4     C
    5   fred          6     B
This is an extremely practical feature of the cut(~) method. The length of the labels array must equal the specified number of bins.
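As a quick check of the length requirement, the sketch below passes two labels for three bins and captures the resulting error. The label names and the mismatch_error variable are made up for illustration.

```python
import pandas as pd

raw_grades = [3, 6, 8, 7, 4, 6]

mismatch_error = None
try:
    # Two labels for three bins: the lengths do not match
    pd.cut(raw_grades, bins=3, labels=["low", "high"])
except ValueError as err:
    mismatch_error = err

print(type(mismatch_error).__name__, mismatch_error)
```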
Setting labels=False returns a NumPy array of integer bin indices instead:

    raw_grades = [3,6,8,7,4,5]
    pd.cut(raw_grades, bins=3, labels=False)

    array([0, 1, 2, 2, 0, 1])

Here, the output tells us that:
- the raw grade 3 belongs to bin 0 (first bin).
- the raw grade 6 belongs to bin 1 (second bin).
- and so on.
Specifying retbins
To get the computed bin edges as well, set retbins=True:
    raw_grades = [3,6,8,7,4,5]
    res = pd.cut(raw_grades, bins=2, retbins=True)
    print("Categories: ", res[0])
    print("Bin edges: ", res[1])

    Categories:  [(2.995, 5.5], (5.5, 8.0], (5.5, 8.0], (5.5, 8.0], (2.995, 5.5], (2.995, 5.5]]
    Categories (2, interval[float64]): [(2.995, 5.5] < (5.5, 8.0]]
    Bin edges:  [2.995 5.5   8.   ]
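One practical use of the returned edges is to bin a second set of values with exactly the same boundaries. The following is a sketch under that assumption; the labels and the new_values list are hypothetical.

```python
import pandas as pd

raw_grades = [3, 6, 8, 7, 4, 5]
binned, edges = pd.cut(raw_grades, bins=2, labels=["low", "high"], retbins=True)

# Reuse the computed edges so the new values are binned consistently
new_values = [4, 7, 10]  # hypothetical new data
rebinned = pd.cut(new_values, bins=edges, labels=["low", "high"])

print(list(rebinned))
```

The value 10 lies beyond the last edge, so it is not assigned to any bin.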
We show the same df here for your reference:
    df

        name  raw_grade
    0   alex          3
    1    bob          6
    2  cathy          8
    3   doge          7
    4   eric          4
    5   fred          6
Specifying precision
To control how many decimal places are displayed, set the precision parameter:
    res = pd.cut(df["raw_grade"], bins=[0,4.33333,6.6,10], precision=2)
    print(res)

    0    (0.0, 4.33]
    1    (4.33, 6.6]
    2    (6.6, 10.0]
    3    (6.6, 10.0]
    4    (0.0, 4.33]
    5    (4.33, 6.6]
    Name: raw_grade, dtype: category
    Categories (3, interval[float64]): [(0.0, 4.33] < (4.33, 6.6] < (6.6, 10.0]]

Here, notice how 4.33333 got rounded to 4.33, as specified by the precision value of 2.
Specifying include_lowest
Consider the following:
    df["grade"] = pd.cut(df["raw_grade"], bins=[3,6,10])
    df

        name  raw_grade       grade
    0   alex          3         NaN
    1    bob          6  (3.0, 6.0]
    2  ...
By default, include_lowest=False, which means that the first bin interval is left-exclusive. This is why the raw_grade of 3 does not fall in any bin here.
We can make the first bin interval left-inclusive by setting include_lowest=True:
    df["grade"] = pd.cut(df["raw_grade"], bins=[3,6,10], include_lowest=True)
    df

        name  raw_grade         grade
    0   alex          3  (2.999, 6.0]
    1    bob          6  (2.999, 6.0]
    ...
We now see that the raw_grade of 3 has been included in the first bin.
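The effect can also be confirmed by counting how many values are left without a bin in each case. This is a minimal sketch; the variable names are illustrative.

```python
import pandas as pd

raw_grades = pd.Series([3, 6, 8, 7, 4, 6])
edges = [3, 6, 10]

default = pd.cut(raw_grades, bins=edges)                         # include_lowest=False
inclusive = pd.cut(raw_grades, bins=edges, include_lowest=True)

# Count the values that were not assigned to any bin
print(default.isna().sum())
print(inclusive.isna().sum())
```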
Specifying duplicates
By default, the bin edges must be unique, otherwise an error will be thrown. For instance:
    x = [3,7,8,7,4,5]
    pd.cut(x, bins=[2,6,6,10])   # duplicates="raise"

    ValueError: Bin edges must be unique: array([ 2, 6, 6, 10]).
Here, we have two bin edges of value 6, so that's why we get an error.
In order to drop (remove) redundant bin edges, set duplicates="drop", like so:
    x = [3,7,8,7,4,5]
    pd.cut(x, bins=[2,6,6,10], duplicates="drop")

    [(2, 6], (6, 10], (6, 10], (6, 10], (2, 6], (2, 6]]
    Categories (2, interval[int64]): [(2, 6] < (6, 10]]

We see that one of the two bin edges of value 6 got dropped.
Specifying ordered
By default, ordered=True, which means that the resulting Categorical will be ordered:
    grades = [3,6,8,7,4,5]
    pd.cut(grades, bins=2, labels=["B","A"])   # ordered=True

    ['B', 'A', 'A', 'A', 'B', 'B']
    Categories (2, object): ['B' < 'A']
Notice how the information about ordering is embedded as ['B'<'A'].
By setting ordered=False, such ordering information is omitted:
    grades = [3,6,8,7,4,5]
    pd.cut(grades, bins=2, labels=["B","A"], ordered=False)

    ['B', 'A', 'A', 'A', 'B', 'B']
    Categories (2, object): ['B', 'A']
To set ordered=False, make sure to have specified labels.
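The ordering information matters for order-aware operations such as comparisons and max(). The sketch below wraps the grades in a Series so that the .cat accessor is available; that wrapping is a choice made here for illustration.

```python
import pandas as pd

grades = pd.Series([3, 6, 8, 7, 4, 5])

ordered = pd.cut(grades, bins=2, labels=["B", "A"])                   # ordered=True
unordered = pd.cut(grades, bins=2, labels=["B", "A"], ordered=False)

print(ordered.cat.ordered)
print(unordered.cat.ordered)

# With ordering embedded, "A" ranks above "B" because it labels the higher bin
print(ordered.max())
print((ordered > "B").tolist())
```

The same comparison and max() call are not expected to work on the unordered result, since its categories carry no ranking.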