Pandas | get_dummies method

Last updated: Jul 1, 2022
Tags: Pandas, Python

Pandas get_dummies(~) method performs one-hot encoding or dummy coding on categorical variables.

Parameters

1. data | array-like or DataFrame

The source data whose categorical variables will be one-hot encoded.

2. prefix | string or list<string> or dict | optional

The prefix to prepend to the labels of the dummy-encoded columns. By default, prefix=None.

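3. prefix_sep | string | optional

The separator to use between the prefix and the category value. This only takes effect when a prefix is in use: either one you specify, or, for a DataFrame, the column label that is used as the prefix by default. By default, prefix_sep="_".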

4. dummy_na | boolean | optional

Whether or not to append a new column that indicates a missing value. By default, dummy_na=False.

5. columns | array-like | optional

The labels of the columns that will be one-hot encoded. By default, columns=None, which means that all columns of type object, string or category are encoded.

6. sparse | boolean | optional

Whether or not to use a SparseArray to represent the dummy-encoded columns. By default, sparse=False.

7. drop_first | boolean | optional

Whether or not to remove the first dummy-encoded column of each encoded variable. By default, drop_first=False.

8. dtype | dtype | optional

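The data type of the new dummy columns. In Pandas versions before 2.0, the default is dtype=np.uint8, which is what the outputs on this page show. From Pandas 2.0 onwards, the default is bool, so the dummy columns display as True/False instead of 1/0.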
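As a rough illustration of the dtype parameter, the sketch below requests float dummy columns explicitly, so the result does not depend on the default of the installed Pandas version. The DataFrame is the same small example used in the Examples section.

```python
import pandas as pd

df = pd.DataFrame({"name": ["alex", "bob", "cathy"], "group": ["A", "B", "A"]})

# Ask for float dummy columns explicitly rather than relying on
# the version-dependent default dtype.
encoded = pd.get_dummies(df, columns=["group"], dtype=float)
print(encoded.dtypes)
```

Passing dtype explicitly is one way to keep the output stable across Pandas versions.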

Return Value

A DataFrame whose categorical variables have been one-hot encoded.

Examples

Basic usage

Consider the following DataFrame:

import pandas as pd

df = pd.DataFrame({"name":["alex","bob","cathy"], "group":["A","B","A"]})
df
    name group
0   alex     A
1    bob     B
2  cathy     A

Here, the column group holds a categorical variable. However, by default, every column of strings is treated as categorical. This is undesirable in this case since we know that name is not a categorical variable:

pd.get_dummies(df)
   name_alex  name_bob  name_cathy  group_A  group_B
0          1         0           0        1        0
1          0         1           0        0        1
2          0         0           1        1        0

In order to specify that the group column is the categorical variable to one-hot encode, we just need to set the columns parameter, like so:

pd.get_dummies(df, columns=["group"])
    name  group_A  group_B
0   alex        1        0
1    bob        0        1
2  cathy        1        0

Here, notice how the name column is not one-hot encoded.
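The following is a small self-contained sketch of the same idea. The age column is a hypothetical addition, included only to show that columns not listed in columns are passed through unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["alex", "bob", "cathy"],
    "age": [20, 30, 40],  # hypothetical numeric column, not in the original example
    "group": ["A", "B", "A"],
})

# Only "group" is encoded; "name" and "age" are left as they are.
encoded = pd.get_dummies(df, columns=["group"])
print(list(encoded.columns))  # ['name', 'age', 'group_A', 'group_B']
```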

One-hot encoding using a list

To build a one-hot encoded DataFrame from a list:

pd.get_dummies(["A","B","C","B"])
   A  B  C
0  1  0  0
1  0  1  0
2  0  0  1
3  0  1  0

We show df here again for your reference:

df
    name group
0   alex     A
1    bob     B
2  cathy     A

Specifying prefix

By default, the column label of the categorical variables becomes the prefix of the new column labels:

pd.get_dummies(df, columns=["group"])
    name  group_A  group_B
0   alex        1        0
1    bob        0        1
2  cathy        1        0

We can specify a custom prefix by setting the prefix parameter:

pd.get_dummies(df, columns=["group"], prefix="Group")
    name  Group_A  Group_B
0   alex        1        0
1    bob        0        1
2  cathy        1        0
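The prefix parameter also accepts a dict that maps column labels to prefixes, which is handy when several columns are encoded at once. The sketch below assumes a hypothetical city column added to the example DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["alex", "bob", "cathy"],
    "group": ["A", "B", "A"],
    "city": ["Paris", "London", "Paris"],  # hypothetical second categorical column
})

# Map each encoded column to its own prefix.
encoded = pd.get_dummies(
    df,
    columns=["group", "city"],
    prefix={"group": "G", "city": "C"},
)
print(list(encoded.columns))  # ['name', 'G_A', 'G_B', 'C_London', 'C_Paris']
```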

Specifying prefix_sep

By default, the separator between the prefix and value of the categorical variable is "_". We can change this to whatever we wish:

pd.get_dummies(df, columns=["group"], prefix_sep="@")
    name  group@A  group@B
0   alex        1        0
1    bob        0        1
2  cathy        1        0
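When the input is a plain list there is no column label to act as a default prefix, so the separator only shows up once a prefix is supplied. A minimal sketch, using the arbitrary prefix "g":

```python
import pandas as pd

values = ["A", "B", "C", "B"]

# No prefix: the separator is not used and the labels are the raw values.
no_prefix = pd.get_dummies(values, prefix_sep="@")
print(list(no_prefix.columns))  # ['A', 'B', 'C']

# With a prefix: the separator joins the prefix and each value.
with_prefix = pd.get_dummies(values, prefix="g", prefix_sep="@")
print(list(with_prefix.columns))  # ['g@A', 'g@B', 'g@C']
```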

Specifying dummy_na

Consider the following DataFrame:

import numpy as np

df = pd.DataFrame({"name":["alex","bob","cathy"], "group":["A","B",np.nan]})
df
    name group
0   alex     A
1    bob     B
2  cathy   NaN

Here, we've got a missing value (NaN) for Cathy's group.

By default, dummy_na=False, which means that a missing value will result in all 0s for that row:

pd.get_dummies(df, columns=["group"])
    name  group_A  group_B
0   alex        1        0
1    bob        0        1
2  cathy        0        0

A missing value can be treated as a category of its own if we set dummy_na=True like so:

pd.get_dummies(df, columns=["group"], dummy_na=True)
    name  group_A  group_B  group_nan
0   alex        1        0          0
1    bob        0        1          0
2  cathy        0        0          1

Notice how we have a new column called group_nan.
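One way to check the effect of dummy_na is to count how many dummy columns are set in each row. In the sketch below, the cast to int is only there so the sums look the same whether the dummy columns are uint8 or bool.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["alex", "bob", "cathy"], "group": ["A", "B", np.nan]})

without_na = pd.get_dummies(df, columns=["group"])
with_na = pd.get_dummies(df, columns=["group"], dummy_na=True)

# Number of dummy columns set per row.
flags_without = without_na.drop(columns="name").astype(int).sum(axis=1).tolist()
flags_with = with_na.drop(columns="name").astype(int).sum(axis=1).tolist()

print(flags_without)  # [1, 1, 0] -> the missing value has no flag set
print(flags_with)     # [1, 1, 1] -> the missing value is flagged by group_nan
```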

Specifying sparse

One-hot encoding, by nature, results in a sparse set of columns (i.e. many 0s). To reduce memory usage, we can choose to use SparseArray to store the one-hot encoded columns instead of the conventional Numpy arrays.

The caveat is that SparseArray does not support as many operations as Numpy arrays, so only set sparse=True when you are dealing with a large DataFrame that causes memory issues.

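Consider the original df, redefined here without the missing value:

df = pd.DataFrame({"name":["alex","bob","cathy"], "group":["A","B","A"]})
df
    name group
0   alex     A
1    bob     B
2  cathy     A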

Here's the default dtype of the dummy-encoded columns (as produced by Pandas versions before 2.0):

pd.get_dummies(df, columns=["group"]).dtypes
name       object
group_A     uint8
group_B     uint8
dtype: object

Here's the dtype when we set sparse=True:

pd.get_dummies(df, columns=["group"], sparse=True).dtypes
name                 object
group_A    Sparse[uint8, 0]
group_B    Sparse[uint8, 0]
dtype: object

We see that the internal representation of the dummy-encoded columns has changed.
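To get a feel for the memory difference, the sketch below encodes a synthetic Series with many categories and compares the reported memory usage. The data (10,000 rows drawn from 50 made-up labels) is purely illustrative, and the exact byte counts will vary with the Pandas version.

```python
import numpy as np
import pandas as pd

# Synthetic data: 10,000 values drawn from 50 made-up category labels.
rng = np.random.default_rng(0)
labels = [f"cat_{i}" for i in range(50)]
s = pd.Series(rng.choice(labels, size=10_000))

dense = pd.get_dummies(s)
sparse = pd.get_dummies(s, sparse=True)

# Total bytes used by each encoded DataFrame.
dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()
print(dense_bytes, sparse_bytes)
```

With this many categories, almost every entry is 0, so the sparse version should report noticeably less memory than the dense one.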

Specifying drop_first

Consider the following DataFrame:

df = pd.DataFrame({"name":["alex","bob","cathy"], "group":["A","B","A"]})
df
    name group
0   alex     A
1    bob     B
2  cathy     A

By default, drop_first=False, which means that each category (distinct value) gets a column of its own:

pd.get_dummies(df, columns=["group"]) # drop_first=False
    name  group_A  group_B
0   alex        1        0
1    bob        0        1
2  cathy        1        0

By setting drop_first=True, we drop one dummy-encoded column:

pd.get_dummies(df, columns=["group"], drop_first=True)
    name  group_B
0   alex        0
1    bob        1
2  cathy        0

The key here is that, even if we drop a single dummy-encoded column, we can still figure out what group a person belongs to: a 0 in group_B means the person must be in group A, the category that was dropped.
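The same reasoning extends to more than two categories: a row whose remaining dummy columns are all 0 belongs to the dropped category. The sketch below recovers the original labels from a drop_first encoding of a small Series; it is one possible way to decode, not something get_dummies(~) does for you.

```python
import pandas as pd

groups = pd.Series(["A", "B", "C", "A"])

# "A" is the first category, so its column is dropped; only B and C remain.
encoded = pd.get_dummies(groups, drop_first=True).astype(int)

# A row with no flag set must belong to the dropped category "A".
has_flag = encoded.any(axis=1)
recovered = encoded.idxmax(axis=1).where(has_flag, "A")
print(recovered.tolist())  # ['A', 'B', 'C', 'A']
```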

Published by Isshin Inada