
One-hot encoding in PySpark

Last updated: Aug 12, 2023
Tags: PySpark

To perform one-hot encoding in PySpark, we must:

  1. convert the categorical column into a numeric column (0, 1, ...) using StringIndexer

  2. convert the numeric column into one-hot encoded columns using OneHotEncoder

One-hot encoding categorical columns as sparse vectors

Consider the following PySpark DataFrame:

rows = [['Alex','B'], ['Bob','A'], ['Cathy','B'], ['Dave','C'], ['Eric','D']]
df = spark.createDataFrame(rows, ['name','class'])
df.show()
+-----+-----+
| name|class|
+-----+-----+
| Alex|    B|
|  Bob|    A|
|Cathy|    B|
| Dave|    C|
| Eric|    D|
+-----+-----+

Our goal is to one-hot encode the categorical column class.

The first step is to convert the class column into a numeric column using StringIndexer:

from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='class', outputCol='class_numeric')
indexer_fitted = indexer.fit(df)
df_indexed = indexer_fitted.transform(df)
df_indexed.show()
+-----+-----+-------------+
| name|class|class_numeric|
+-----+-----+-------------+
| Alex|    B|          0.0|
|  Bob|    A|          1.0|
|Cathy|    B|          0.0|
| Dave|    C|          2.0|
| Eric|    D|          3.0|
+-----+-----+-------------+

Here, note the following:

  • the inputCol argument is the label of the categorical column, while outputCol is the label of the new numerically encoded column.

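  • we need to call both fit(~) and transform(~): fit(~) learns the mapping from categories to indexes from the PySpark DataFrame, and transform(~) applies that mapping to produce the new column.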

  • the numeric category that is assigned will depend on the frequency of the category. By default stringOrderType='frequencyDesc', which means that the class that occurs the most will be assigned the category index of 0. In this case, class B occurs the most and so it is assigned a category index of 0. You can reverse this by setting stringOrderType='frequencyAsc'.

  • the indexer_fitted object has a labels property that lists the categories in index order, that is, the category at position i is assigned the category index i:

    indexer_fitted.labels
    ['B', 'A', 'C', 'D']
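
The ordering rule described above can be illustrated without Spark. The following is a minimal pure-Python sketch, not StringIndexer's actual implementation. The helper name frequency_desc_labels is hypothetical, and breaking ties alphabetically is an assumption made here that happens to agree with the labels shown above:

```python
from collections import Counter

def frequency_desc_labels(values):
    """Return labels ordered by descending frequency (hypothetical helper).

    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    counts = Counter(values)
    # Sort by count (largest first), then by label to break ties
    return sorted(counts, key=lambda label: (-counts[label], label))

classes = ['B', 'A', 'B', 'C', 'D']
labels = frequency_desc_labels(classes)

# Each label's position in the ordered list becomes its category index
label_to_index = {label: float(i) for i, label in enumerate(labels)}
class_numeric = [label_to_index[c] for c in classes]

print(labels)
print(class_numeric)
```

The position of each label in the ordered list plays the same role as the values in the class_numeric column.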

Now that we have converted the categorical strings into category indexes, we can use PySpark's OneHotEncoder to perform one-hot encoding:

from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder(inputCols=['class_numeric'], outputCols=['class_onehot'])
df_onehot = encoder.fit(df_indexed).transform(df_indexed)
df_onehot.show()
+-----+-----+-------------+-------------+
| name|class|class_numeric| class_onehot|
+-----+-----+-------------+-------------+
| Alex|    B|          0.0|(3,[0],[1.0])|
|  Bob|    A|          1.0|(3,[1],[1.0])|
|Cathy|    B|          0.0|(3,[0],[1.0])|
| Dave|    C|          2.0|(3,[2],[1.0])|
| Eric|    D|          3.0|    (3,[],[])|
+-----+-----+-------------+-------------+

Here, after calling OneHotEncoder's fit(~) and transform(~) on our PySpark DataFrame, we end up with a new column as specified by the outputCols argument. Since one-hot encoded vectors typically contain mostly zeroes, PySpark stores them in a column of type vector and displays them in sparse form:

df_onehot.printSchema()
root
|-- name: string (nullable = true)
|-- class: string (nullable = true)
|-- class_numeric: double (nullable = false)
|-- class_onehot: vector (nullable = true)

A sparse vector is defined by three values (in order):

  • size: the length of the vector (here, the number of categories minus one)

  • indices: the positions in the vector that hold non-zero values

  • values: the values at those positions

Let's take the vector (3,[0],[1.0]) as an example. The size of the vector is 3 even though we have 4 unique categories (A,B,C,D) because one category is used as the base category - we will explain this part in a bit. The second value [0] and the third value [1.0] mean that index position 0 in the vector holds the value 1.0. All other positions in the sparse vector are zero. Since the vectors in this column represent one-hot encoded vectors, the third value will always be 1.0 whenever it is non-empty.

Now, let's take a look at the last one-hot encoded vector (3,[],[]). The second and third values are both empty []. This means that the vector is just filled with zeroes, that is, category D is treated as a base category. This is the reason why we can represent 4 unique categories with a vector of size 3.
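
To make the sparse notation concrete, here is a small pure-Python sketch that expands a (size, indices, values) triple into an ordinary list. The helper name sparse_to_dense is hypothetical and is only meant to show how to read the notation, not how PySpark stores vectors internally:

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a plain list (hypothetical helper)."""
    dense = [0.0] * size  # every position starts as zero
    for index, value in zip(indices, values):
        dense[index] = value  # fill only the positions that are listed
    return dense

class_b = sparse_to_dense(3, [0], [1.0])  # the vector shown for class B
class_d = sparse_to_dense(3, [], [])      # the vector shown for class D

print(class_b)
print(class_d)
```

The base category D expands to a list of zeros because no index is listed for it.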

Note that we can still choose to represent our unique categories without using a base category by supplying the argument dropLast=False:

encoder = OneHotEncoder(inputCols=['class_numeric'], outputCols=['class_onehot'], dropLast=False)
df_onehot_no_base = encoder.fit(df_indexed).transform(df_indexed)
df_onehot_no_base.show()
+-----+-----+-------------+-------------+
| name|class|class_numeric| class_onehot|
+-----+-----+-------------+-------------+
| Alex|    B|          0.0|(4,[0],[1.0])|
|  Bob|    A|          1.0|(4,[1],[1.0])|
|Cathy|    B|          0.0|(4,[0],[1.0])|
| Dave|    C|          2.0|(4,[2],[1.0])|
| Eric|    D|          3.0|(4,[3],[1.0])|
+-----+-----+-------------+-------------+

Here, notice how the size of our vectors is 4 instead of 3 and also how category D is assigned an index of 3.
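
The effect of the base category can also be sketched in plain Python. The function below is a hypothetical helper written for illustration, and its drop_last parameter only imitates the behaviour of dropLast described above. It is not part of PySpark:

```python
def one_hot(index, num_categories, drop_last=True):
    """One-hot encode a category index as a dense list (hypothetical helper)."""
    # Dropping the last category shortens the vector by one position
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    # With drop_last=True the last index has no slot, so it stays all zeros
    if index < size:
        vector[index] = 1.0
    return vector

with_base = one_hot(3, 4)                      # category D with a base category
without_base = one_hot(3, 4, drop_last=False)  # category D without a base category

print(with_base)
print(without_base)
```

With a base category, D is the only category whose vector is all zeros, so no information is lost by dropping its position.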

One-hot encoding categorical columns as a set of binary columns (dummy encoding)

OneHotEncoder encodes a numeric categorical column as a sparse vector, which is convenient as input to PySpark's machine learning models such as decision trees (DecisionTreeClassifier).

However, you may want the one-hot encoding to be done in a similar way to Pandas' get_dummies(~) method that produces a set of binary columns instead. In this section, we will convert the sparse vector into binary one-hot encoded columns.

We begin by converting the sparse vectors into arrays using the vector_to_array(~) method:

from pyspark.ml.functions import vector_to_array
df_col_onehot = df_onehot.select('*', vector_to_array('class_onehot').alias('col_onehot'))
df_col_onehot.show()
+-----+-----+-------------+-------------+---------------+
| name|class|class_numeric| class_onehot|     col_onehot|
+-----+-----+-------------+-------------+---------------+
| Alex|    B|          0.0|(3,[0],[1.0])|[1.0, 0.0, 0.0]|
|  Bob|    A|          1.0|(3,[1],[1.0])|[0.0, 1.0, 0.0]|
|Cathy|    B|          0.0|(3,[0],[1.0])|[1.0, 0.0, 0.0]|
| Dave|    C|          2.0|(3,[2],[1.0])|[0.0, 0.0, 1.0]|
| Eric|    D|          3.0|    (3,[],[])|[0.0, 0.0, 0.0]|
+-----+-----+-------------+-------------+---------------+

Here, note the following:

  • '*' refers to all columns in df_onehot.

  • the alias(~) method assigns a label to the column returned by vector_to_array(~).

Next, we will unpack this column of arrays into a set of columns:

import pyspark.sql.functions as F
num_categories = len(df_col_onehot.first()['col_onehot']) # 3
cols_expanded = [(F.col('col_onehot')[i]) for i in range(num_categories)]
df_cols_onehot = df_col_onehot.select('name', 'class', *cols_expanded)
df_cols_onehot.show()
+-----+-----+-------------+-------------+-------------+
| name|class|col_onehot[0]|col_onehot[1]|col_onehot[2]|
+-----+-----+-------------+-------------+-------------+
| Alex|    B|          1.0|          0.0|          0.0|
|  Bob|    A|          0.0|          1.0|          0.0|
|Cathy|    B|          1.0|          0.0|          0.0|
| Dave|    C|          0.0|          0.0|          1.0|
| Eric|    D|          0.0|          0.0|          0.0|
+-----+-----+-------------+-------------+-------------+

Here, note the following:

  • we are first fetching the number of categories. The first(~) method returns the first row as a Row object and the length of an array in the col_onehot column represents the number of categories (minus one since we are using one category as the base category).

  • we then use a list comprehension to obtain a list of binary columns. F.col('col_onehot')[2], for instance, returns a Column holding the 3rd value of each array.

  • the * in *cols_expanded unpacks the list of Column objects into positional arguments.

Finally, notice how the encoded binary columns have awkward labels like col_onehot[0] by default. We can replace these with the corresponding categorical labels by tweaking the list comprehension in the previous code snippet:

num_categories = len(df_col_onehot.first()['col_onehot']) # 3
cols_expanded = [F.col('col_onehot')[i].alias(indexer_fitted.labels[i]) for i in range(num_categories)]
df_cols_onehot = df_col_onehot.select('name', 'class', *cols_expanded)
df_cols_onehot.show()
+-----+-----+---+---+---+
| name|class|  B|  A|  C|
+-----+-----+---+---+---+
| Alex|    B|1.0|0.0|0.0|
|  Bob|    A|0.0|1.0|0.0|
|Cathy|    B|1.0|0.0|0.0|
| Dave|    C|0.0|0.0|1.0|
| Eric|    D|0.0|0.0|0.0|
+-----+-----+---+---+---+

Here we are using the PySpark column's alias(~) method to assign the original categorical labels given by indexer_fitted.labels:

indexer_fitted.labels
['B', 'A', 'C', 'D']
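
The pairing of labels with array positions can be sketched in plain Python. This is only an illustration under the assumption that the last label is the base category (the default dropLast behaviour used above). The function name decode is hypothetical:

```python
labels = ['B', 'A', 'C', 'D']   # the list returned by indexer_fitted.labels above
col_onehot = [0.0, 0.0, 1.0]    # the array shown for Dave (class C)

# zip stops at the shorter sequence, so the base category 'D' gets no column
named = dict(zip(labels, col_onehot))
print(named)

def decode(row_values, labels):
    """Recover the category from one row of binary values (hypothetical helper).

    Assumes the last label is the base category, encoded as all zeros.
    """
    for label, value in zip(labels, row_values):
        if value == 1.0:
            return label
    return labels[-1]

print(decode([0.0, 0.0, 1.0], labels))
print(decode([0.0, 0.0, 0.0], labels))
```

This is why the labelled output has columns B, A and C only: a row with zeros in all three columns belongs to the base category D.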
Published by Isshin Inada