Pandas | get_dummies method
Pandas get_dummies(~) method performs one-hot encoding or dummy coding on categorical variables.
Parameters
1. data | array-like or DataFrame
The source data whose categorical variables will be one-hot encoded.
2. prefix | string or list<string> or dict | optional
The prefix to prepend to the labels of the dummy-encoded columns. By default, prefix=None.
3. prefix_sep | string | optional
The separator to use between the prefix and the category value. When encoding DataFrame columns, the column label serves as the prefix if prefix is not specified. By default, prefix_sep="_".
4. dummy_na | boolean | optional
Whether or not to append a new column that indicates a missing value. By default, dummy_na=False.
5. columns | array-like | optional
The labels of the columns that will be one-hot encoded. By default, columns=None, which means that all columns holding strings or categories are encoded.
6. sparse | boolean | optional
Whether or not to use a SparseArray to represent the dummy-encoded columns. By default, sparse=False.
7. drop_first | boolean | optional
Whether or not to remove the first dummy-encoded column of each categorical variable. By default, drop_first=False.
8. dtype | dtype | optional
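The data type of the new dummy columns. By default, dtype=np.uint8 in pandas versions before 2.0, and dtype=bool from pandas 2.0 onwards. The outputs in this guide were produced with the uint8 default, so newer versions show True and False in place of 1 and 0.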
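The dtype parameter has no dedicated example in this guide, so here is a minimal sketch of it. It reuses the small name/group DataFrame from the examples below; the variable name df_float is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"name": ["alex", "bob", "cathy"], "group": ["A", "B", "A"]})

# Ask for float dummy columns explicitly instead of relying on the
# version-dependent default dtype
df_float = pd.get_dummies(df, columns=["group"], dtype=float)
print(df_float.dtypes)
```

Passing dtype explicitly is also a simple way to get the same output regardless of the installed pandas version.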
Return Value
A DataFrame whose categorical variables have been one-hot encoded.
Examples
Basic usage
Consider the following DataFrame:
import pandas as pd
df = pd.DataFrame({"name":["alex","bob","cathy"], "group":["A","B","A"]})
df
    name group
0   alex     A
1    bob     B
2  cathy     A
Here, the group column holds a categorical variable. By default, however, get_dummies(~) encodes every column that holds strings. This is undesirable in this case since we know that name is not a categorical variable:
pd.get_dummies(df)
   name_alex  name_bob  name_cathy  group_A  group_B
0          1         0           0        1        0
1          0         1           0        0        1
2          0         0           1        1        0
In order to specify that the group column is the categorical variable to one-hot encode, we just need to set the columns parameter, like so:
pd.get_dummies(df, columns=["group"])
    name  group_A  group_B
0   alex        1        0
1    bob        0        1
2  cathy        1        0
Here, notice how the name column is not one-hot encoded.
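To make the effect of columns a little more concrete, here is a small sketch with an extra numeric column. The age column and the names df_mixed and encoded are hypothetical additions for illustration, and dtype=int is passed only so that the result does not depend on the pandas version.

```python
import pandas as pd

df_mixed = pd.DataFrame({
    "name": ["alex", "bob", "cathy"],
    "age": [20, 30, 40],            # hypothetical numeric column
    "group": ["A", "B", "A"],
})

# Only the listed column is encoded; all other columns pass through unchanged
encoded = pd.get_dummies(df_mixed, columns=["group"], dtype=int)
print(encoded.columns.tolist())
# ['name', 'age', 'group_A', 'group_B']
```

The untouched columns come first, followed by the dummy columns of each encoded variable.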
One-hot encoding using a list
To build a one-hot encoded DataFrame from a list:
pd.get_dummies(["A","B","C","B"])
   A  B  C
0  1  0  0
1  0  1  0
2  0  0  1
3  0  1  0
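As a quick sanity check, the encoding of a list can be reversed by picking, for each row, the column that holds the 1. The sketch below uses idxmax for this; the variable names are illustrative, and dtype=int is passed so that the result does not depend on the pandas version.

```python
import pandas as pd

labels = ["A", "B", "C", "B"]

# One column per distinct value; column labels are the values themselves
dummies = pd.get_dummies(labels, dtype=int)

# For each row, idxmax returns the label of the column containing the 1
recovered = dummies.idxmax(axis=1).tolist()
print(recovered)
# ['A', 'B', 'C', 'B']
```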
We show df here again for your reference:
df
    name group
0   alex     A
1    bob     B
2  cathy     A
Specifying prefix
By default, the column label of the categorical variables becomes the prefix of the new column labels:
pd.get_dummies(df, columns=["group"])
    name  group_A  group_B
0   alex        1        0
1    bob        0        1
2  cathy        1        0
We can specify a custom prefix by setting the prefix parameter:
pd.get_dummies(df, columns=["group"], prefix="Group")
    name  Group_A  Group_B
0   alex        1        0
1    bob        0        1
2  cathy        1        0
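The parameter list mentions that prefix also accepts a dict. A minimal sketch of that form is shown here, mapping each encoded column to its own prefix; the city column and the prefixes G and C are hypothetical values chosen for illustration.

```python
import pandas as pd

df_two = pd.DataFrame({
    "name": ["alex", "bob", "cathy"],
    "group": ["A", "B", "A"],
    "city": ["X", "Y", "Y"],        # hypothetical second categorical column
})

# A dict maps each encoded column label to the prefix it should use
encoded = pd.get_dummies(
    df_two,
    columns=["group", "city"],
    prefix={"group": "G", "city": "C"},
    dtype=int,                      # keeps the output independent of the pandas version
)
print(encoded.columns.tolist())
# ['name', 'G_A', 'G_B', 'C_X', 'C_Y']
```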
Specifying prefix_sep
By default, the separator between the prefix and value of the categorical variable is "_". We can change this to whatever we wish:
pd.get_dummies(df, columns=["group"], prefix_sep="@")
    name  group@A  group@B
0   alex        1        0
1    bob        0        1
2  cathy        1        0
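The prefix and prefix_sep parameters can also be combined. The following is a short sketch of this, reusing the same df; dtype=int is passed only to keep the result independent of the pandas version.

```python
import pandas as pd

df = pd.DataFrame({"name": ["alex", "bob", "cathy"], "group": ["A", "B", "A"]})

# Custom prefix and custom separator together
encoded = pd.get_dummies(
    df, columns=["group"], prefix="Group", prefix_sep="@", dtype=int
)
print(encoded.columns.tolist())
# ['name', 'Group@A', 'Group@B']
```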
Specifying dummy_na
Consider the following DataFrame:
import numpy as np
df = pd.DataFrame({"name":["alex","bob","cathy"], "group":["A","B",np.nan]})
df
    name group
0   alex     A
1    bob     B
2  cathy   NaN
Here, we've got a missing value (NaN) for Cathy's group.
By default, dummy_na=False, which means that a missing value will result in all 0s for that row:
pd.get_dummies(df, columns=["group"])
    name  group_A  group_B
0   alex        1        0
1    bob        0        1
2  cathy        0        0
A missing value can be treated as a category of its own if we set dummy_na=True like so:
pd.get_dummies(df, columns=["group"], dummy_na=True)
    name  group_A  group_B  group_nan
0   alex        1        0          0
1    bob        0        1          0
2  cathy        0        0          1
Notice how we have a new column called group_nan.
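One way to see the difference between the two settings is to sum the dummy columns of each row. This sketch assumes the same data as above; the variable names are illustrative, and dtype=int is passed so that the sums do not depend on the pandas version.

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame(
    {"name": ["alex", "bob", "cathy"], "group": ["A", "B", np.nan]}
)

without_na = pd.get_dummies(df_na, columns=["group"], dtype=int)
with_na = pd.get_dummies(df_na, columns=["group"], dummy_na=True, dtype=int)

# Sum across the dummy columns only (everything except name)
sums_without = without_na.drop(columns="name").sum(axis=1).tolist()
sums_with = with_na.drop(columns="name").sum(axis=1).tolist()

print(sums_without)  # [1, 1, 0] -> the missing value leaves an all-zero row
print(sums_with)     # [1, 1, 1] -> every row now belongs to exactly one column
```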
Specifying sparse
One-hot encoding, by nature, results in a sparse set of columns (i.e. many 0s). In order to save memory usage, we can choose to use SparseArray to store the one-hot encoded columns instead of the conventional Numpy arrays.
The caveat is that SparseArray does not support as many operations as Numpy arrays, so only set sparse=True when you are dealing with a large DataFrame that causes memory issues.
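Consider the df from the basic usage section, which has no missing values:
df = pd.DataFrame({"name":["alex","bob","cathy"], "group":["A","B","A"]})
df
    name group
0   alex     A
1    bob     B
2  cathy     A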
Here's the default dtype of the dummy-encoded columns:
pd.get_dummies(df, columns=["group"]).dtypes
name       object
group_A     uint8
group_B     uint8
dtype: object
Here's the dtype when we set sparse=True:
pd.get_dummies(df, columns=["group"], sparse=True).dtypes
name                 object
group_A    Sparse[uint8, 0]
group_B    Sparse[uint8, 0]
dtype: object
We see that the internal representation of the dummy-encoded columns has changed.
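The memory saving only becomes visible on larger data. The sketch below compares the two representations on a synthetic column with 100,000 rows and 50 categories; the data, the seed and the variable names are assumptions made for illustration, and the exact byte counts will vary with the pandas version.

```python
import numpy as np
import pandas as pd

# Synthetic data: 100,000 rows drawn from 50 hypothetical categories
rng = np.random.default_rng(0)
big = pd.DataFrame({"group": rng.integers(0, 50, size=100_000).astype(str)})

dense_df = pd.get_dummies(big, columns=["group"])
sparse_df = pd.get_dummies(big, columns=["group"], sparse=True)

# Total bytes used by each representation
dense_bytes = dense_df.memory_usage(deep=True).sum()
sparse_bytes = sparse_df.memory_usage(deep=True).sum()

print(dense_bytes, sparse_bytes)
```

With many categories, each dummy column is almost entirely zeros, so storing only the non-zero entries should take noticeably less memory.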
Specifying drop_first
Consider the following DataFrame:
df = pd.DataFrame({"name":["alex","bob","cathy"], "group":["A","B","A"]})
df
    name group
0   alex     A
1    bob     B
2  cathy     A
By default, drop_first=False, which means that each category gets a column of its own:
pd.get_dummies(df, columns=["group"]) # drop_first=False
    name  group_A  group_B
0   alex        1        0
1    bob        0        1
2  cathy        1        0
By setting drop_first=True, we drop one dummy-encoded column:
pd.get_dummies(df, columns=["group"], drop_first=True)
    name  group_B
0   alex        0
1    bob        1
2  cathy        0
The key here is that, even after dropping one dummy-encoded column, we can still figure out which group a person belongs to: since there are only two groups, a 0 in group_B means that the person belongs to group A.
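The same reasoning extends to more than two categories. Here is a minimal sketch with three groups, where a row of all zeros stands for the dropped first category; the DataFrame df3 and the recovery loop are illustrative and not part of the pandas API.

```python
import pandas as pd

df3 = pd.DataFrame({"group": ["A", "B", "C", "A"]})

# The first category ("A") is dropped, leaving group_B and group_C
encoded = pd.get_dummies(df3, columns=["group"], drop_first=True, dtype=int)

recovered = []
for _, row in encoded.iterrows():
    # Columns holding a 1 for this row (at most one)
    hot = [col for col in encoded.columns if row[col] == 1]
    # No 1 anywhere means the row belongs to the dropped category "A"
    recovered.append(hot[0].split("_", 1)[1] if hot else "A")

print(recovered)
# ['A', 'B', 'C', 'A']
```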