Pandas | cut method

Last updated: Aug 11, 2023

Pandas cut(~) method categorises numerical values into bins (intervals).

Parameters

1. x | array-like

A 1D input array whose numerical values will be segmented into bins.

2. bins | int or sequence<scalar> or IntervalIndex

The specified type of bins determines how the bins are computed:

Type | Description
int | The number of equal-width bins. The range of x is increased by 0.1% to ensure that all values fall in some bin.
sequence<scalar> | The desired bin edges. Values that do not fall in a bin will be set to NaN.
IntervalIndex | The exact bins to use.

3. right | boolean | optional

Whether to make the left bin edge exclusive and the right bin edge inclusive. By default, right=True.

4. labels | array or False | optional

The desired labels of the bins. By default, labels=None.

5. retbins | boolean | optional

Whether or not to return bins. By default, retbins=False.

6. precision | int | optional

The number of decimal places to display in the bin labels. By default, precision=3.

7. include_lowest | boolean | optional

Whether to make the left edge of the first bin inclusive. By default, include_lowest=False.

8. duplicates | string | optional

How to deal with duplicate bin edges:

Value | Description
"raise" | Throw an error if any duplicate bin edges are set.
"drop" | Remove the duplicate bin edge and just keep one.

By default, duplicates="raise".

9. ordered | boolean | optional | v1.1.0~

Whether or not to embed ordering information. This is only relevant if the return type is Categorical or Series of data-type Categorical. ordered can only be set to False if labels is provided. By default, ordered=True.

Return Value

The return type depends on the type of the labels parameter:

  • if labels is unspecified:

    • if x is a Series, then a Series that encodes the bin for each value is returned. Each bin interval is represented by an Interval.

    • else, a Categorical is returned. Each bin interval is represented by an Interval.

  • if labels is an array of scalars:

    • if x is a Series, then a Series is returned. The type of the values stored within this Series matches the type of the values stored in labels.

    • else, a Categorical is returned. The type of the values stored within the Categorical matches the type of the values stored in labels.

  • if labels is a boolean False, then a Numpy array of integers is returned.

If retbins=True, then in addition to the above, the bins are returned as a Numpy array. If bins is an IntervalIndex, then bins is returned instead.
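
The return types described above can be checked with a short sketch like the following. The input values are made up for illustration, and the variable names are not part of the pandas API.

```python
import numpy as np
import pandas as pd

values = [3, 6, 8]  # illustrative data

from_series = pd.cut(pd.Series(values), bins=2)   # Series of dtype category
from_list = pd.cut(values, bins=2)                # Categorical
codes = pd.cut(values, bins=2, labels=False)      # NumPy array of bin positions
_, edges = pd.cut(values, bins=2, retbins=True)   # second item holds the bin edges

print(type(from_series).__name__, from_series.dtype)
print(type(from_list).__name__)
print(type(codes).__name__, codes)
print(type(edges).__name__, edges)
```

Here the list input gives a Categorical, the Series input gives a Series, and labels=False gives the integer position of each value's bin.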

Examples

Consider the following DataFrame about students and their grades:

import pandas as pd
raw_grades = [3,6,8,7,4,6]
students = ["alex", "bob", "cathy", "doge", "eric", "fred"]
df = pd.DataFrame({"name":students,"raw_grade":raw_grades})
df
name raw_grade
0 alex 3
1 bob 6
2 cathy 8
3 doge 7
4 eric 4
5 fred 6

Basic Usage

To categorise the raw grades into four bins (segments):

df["grade"] = pd.cut(df["raw_grade"], bins=4) # returns a Series
df
name raw_grade grade
0 alex 3 (2.995, 4.25]
1 bob 6 (5.5, 6.75]
2 cathy 8 (6.75, 8.0]
3 doge 7 (6.75, 8.0]
4 eric 4 (2.995, 4.25]
5 fred 6 (5.5, 6.75]

The grade column now contains the bin that each raw grade falls into. Four equal-width bins are created in total, namely (2.995, 4.25], (4.25, 5.5], (5.5, 6.75] and (6.75, 8.0], although no raw grade falls in (4.25, 5.5] here. Note that (2.995, 4.25] just means that 2.995 < raw_grade <= 4.25.
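
As a rough sketch of where these edges come from, the equal-width edges for bins=4 can be reproduced by hand and compared against what retbins=True returns. This assumes the default right=True, where the 0.1% adjustment is applied to the lowest edge. The names lo, hi and expected are only illustrative.

```python
import numpy as np
import pandas as pd

raw_grades = [3, 6, 8, 7, 4, 6]
lo, hi = min(raw_grades), max(raw_grades)

# Four equal-width bins need five edges between the minimum and the maximum
expected = np.linspace(lo, hi, 5)      # 3.0, 4.25, 5.5, 6.75, 8.0
# Lower the first edge by 0.1% of the range so that the minimum is included
expected[0] -= 0.001 * (hi - lo)       # 3 - 0.005 = 2.995

_, edges = pd.cut(raw_grades, bins=4, retbins=True)
print(edges)
print(np.allclose(edges, expected))
```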

Specifying custom bin edges

To specify custom bin edges, we can pass in an array of bin edges instead of an int:

df["grade"] = pd.cut(df["raw_grade"], bins=[0,4,6,10])
df
name raw_grade grade
0 alex 3 (0, 4]
1 bob 6 (4, 6]
2 cathy 8 (6, 10]
3 doge 7 (6, 10]
4 eric 4 (0, 4]
5 fred 6 (4, 6]
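
Values that do not fall in any bin are set to NaN. The sketch below illustrates this with a sequence of edges and with an IntervalIndex. The sample values and the gap between the two intervals are made up for illustration.

```python
import pandas as pd

grades = pd.Series([0, 3, 5, 11])  # illustrative values

# Sequence of edges: 0 sits on the excluded left edge, 11 is past the last edge
by_edges = pd.cut(grades, bins=[0, 4, 6, 10])

# IntervalIndex with a deliberate gap between 4 and 6, so 5 has no bin either
intervals = pd.IntervalIndex.from_tuples([(0, 4), (6, 10)])
by_intervals = pd.cut(grades, bins=intervals)

print(by_edges.isna().tolist())
print(by_intervals.isna().tolist())
```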

We show the same df here for your reference:

df
name raw_grade
0 alex 3
1 bob 6
2 cathy 8
3 doge 7
4 eric 4
5 fred 6

Specifying right

To make the left bin edge inclusive and the right bin edge exclusive, set right=False:

df["grade"] = pd.cut(df["raw_grade"], bins=[0,4,6,10], right=False)
df
name raw_grade grade
0 alex 3 [0, 4)
1 bob 6 [6, 10)
2 cathy 8 [6, 10)
3 doge 7 [6, 10)
4 eric 4 [4, 6)
5 fred 6 [6, 10)

Notice how we have [0, 4) instead of the default (0, 4].
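
The effect of right is easiest to see for a value that sits exactly on a bin edge. The following is a small sketch that uses labels=False so that only the bin position is returned. The variable names are illustrative.

```python
import pandas as pd

edges = [0, 4, 6, 10]
grade = [4]  # a value that sits exactly on a bin edge

# right=True (default): 4 belongs to (0, 4], the first bin
closed_right = pd.cut(grade, bins=edges, labels=False)
# right=False: 4 belongs to [4, 6), the second bin
closed_left = pd.cut(grade, bins=edges, right=False, labels=False)

print(closed_right, closed_left)
```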

Specifying labels

We can give labels to our bins by setting the labels parameter:

df["grade"] = pd.cut(df["raw_grade"], bins=3, labels=["C","B","A"])
df
name raw_grade grade
0 alex 3 C
1 bob 6 B
2 cathy 8 A
3 doge 7 A
4 eric 4 C
5 fred 6 B

This is an extremely practical feature of the cut(~) method. The length of the labels array must equal the specified number of bins.
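
To illustrate the length requirement, the sketch below passes two labels for three bins and catches the resulting error. The exact error message may differ between pandas versions, so it is printed rather than assumed.

```python
import pandas as pd

raw_grades = [3, 6, 8, 7, 4, 6]

try:
    # Three bins but only two labels
    pd.cut(raw_grades, bins=3, labels=["C", "B"])
    error = None
except ValueError as exc:
    error = exc

print(error)
```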

By setting labels=False, a Numpy array of int is returned:

raw_grades = [3,6,8,7,4,5]
pd.cut(raw_grades, bins=3, labels=False)
array([0, 1, 2, 2, 0, 1])

Here, the output tells us that:

  • the raw grade 3 belongs to bin 0 (first bin).

  • the raw grade 6 belongs to bin 1 (second bin).

  • and so on.

Specifying retbins

To get the computed bin edges as well, set retbins=True:

raw_grades = [3,6,8,7,4,5]
res = pd.cut(raw_grades, bins=2, retbins=True)
print("Categories: ", res[0])
print("Bin edges: ", res[1])
Categories: [(2.995, 5.5], (5.5, 8.0], (5.5, 8.0], (5.5, 8.0], (2.995, 5.5], (2.995, 5.5]]
Categories (2, interval[float64]): [(2.995, 5.5] < (5.5, 8.0]]
Bin edges: [2.995 5.5 8. ]
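
One possible use of the returned edges is to bin new data with exactly the same intervals. The following is a minimal sketch of that idea. The new_grades values are hypothetical.

```python
import pandas as pd

raw_grades = [3, 6, 8, 7, 4, 5]
# Unpack the returned tuple instead of indexing into it
categories, edges = pd.cut(raw_grades, bins=2, retbins=True)

# Hypothetical new grades, binned with the edges computed above
new_grades = [5, 7]
new_categories = pd.cut(new_grades, bins=edges)

print(new_categories)
```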

We show the same df here for your reference:

df
name raw_grade
0 alex 3
1 bob 6
2 cathy 8
3 doge 7
4 eric 4
5 fred 6

Specifying precision

To control how many decimal places are displayed, set the precision parameter:

res = pd.cut(df["raw_grade"], bins=[0,4.33333,6.6,10], precision=2)
print(res)
0 (0.0, 4.33]
1 (4.33, 6.6]
2 (6.6, 10.0]
3 (6.6, 10.0]
4 (0.0, 4.33]
5 (4.33, 6.6]
Name: raw_grade, dtype: category
Categories (3, interval[float64]): [(0.0, 4.33] < (4.33, 6.6] < (6.6, 10.0]]

Here, notice how 4.33333 is displayed as 4.33, as specified by the precision value of 2.
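
For comparison, this sketch puts the default precision=3 next to precision=2 for the same edges and prints the first interval of each result. The sample grades are illustrative.

```python
import pandas as pd

edges = [0, 4.33333, 6.6, 10]
grades = [3, 6, 8]

default_labels = pd.cut(grades, bins=edges)              # precision=3
short_labels = pd.cut(grades, bins=edges, precision=2)

print(default_labels.categories[0])  # first interval with 3 decimal places
print(short_labels.categories[0])    # first interval with 2 decimal places
```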

Specifying include_lowest

Consider the following:

df["grade"] = pd.cut(df["raw_grade"], bins=[3,6,10])
df
name raw_grade grade
0 alex 3 NaN
1 bob 6 (3.0, 6.0]
2 ...

By default, include_lowest=False, which means that the first bin interval is left-exclusive. This is why the raw_grade of 3 does not fall in any bin here.

We can make the first bin interval left-inclusive by setting include_lowest=True:

df["grade"] = pd.cut(df["raw_grade"], bins=[3,6,10], include_lowest=True)
df
name raw_grade grade
0 alex 3 (2.999, 6.0]
1 bob 6 (2.999, 6.0]
...

We now see that the raw_grade of 3 has been included in the first bin.
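
A compact way to confirm this behaviour is to check which values end up as NaN with and without include_lowest. This sketch uses a two-value Series made up for illustration.

```python
import pandas as pd

grades = pd.Series([3, 6])
edges = [3, 6, 10]

default = pd.cut(grades, bins=edges)                         # include_lowest=False
inclusive = pd.cut(grades, bins=edges, include_lowest=True)

print(default.isna().tolist())    # 3 has no bin
print(inclusive.isna().tolist())  # 3 is now in the first bin
```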

Specifying duplicates

By default, the bin edges must be unique, otherwise an error will be thrown. For instance:

x = [3,7,8,7,4,5]
pd.cut(x, bins=[2,6,6,10]) # duplicates="raise"
ValueError: Bin edges must be unique: array([ 2, 6, 6, 10]).

Here, we have two bin edges of value 6, so that's why we get an error.

In order to drop (remove) redundant bin edges, set duplicates="drop", like so:

x = [3,7,8,7,4,5]
pd.cut(x, bins=[2,6,6,10], duplicates="drop")
[(2, 6], (6, 10], (6, 10], (6, 10], (2, 6], (2, 6]]
Categories (2, interval[int64]): [(2, 6] < (6, 10]]

We see that one of the duplicate bin edges of value 6 got dropped.

Specifying ordered

By default, ordered=True, which means that the resulting Categorical will be ordered:

grades = [3,6,8,7,4,5]
pd.cut(grades, bins=2, labels=["B","A"]) # ordered=True
['B', 'A', 'A', 'A', 'B', 'B']
Categories (2, object): ['B' < 'A']

Notice how the information about ordering is embedded as ['B'<'A'].

By setting ordered=False, such ordering information is omitted:

grades = [3,6,8,7,4,5]
pd.cut(grades, bins=2, labels=["B","A"], ordered=False)
['B', 'A', 'A', 'A', 'B', 'B']
Categories (2, object): ['B', 'A']

To set ordered=False, make sure to have specified labels.
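
One practical consequence of the ordering information is that order-based operations such as max() are only defined for an ordered Categorical. The following sketch shows the difference. The variable names are illustrative.

```python
import pandas as pd

grades = [3, 6, 8, 7, 4, 5]

ordered = pd.cut(grades, bins=2, labels=["B", "A"])                   # ordered=True
unordered = pd.cut(grades, bins=2, labels=["B", "A"], ordered=False)

# An ordered Categorical knows that 'B' < 'A', so it has a maximum
print(ordered.ordered, ordered.max())

# An unordered Categorical does not define a maximum
try:
    unordered.max()
    unordered_error = None
except TypeError as exc:
    unordered_error = exc

print(unordered.ordered, unordered_error)
```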

Published by Isshin Inada