Pandas | factorize method
Pandas factorize(~) method returns the following:
an array of integer indices to map the input array to the unique values.
all the unique values of the input array.
Parameters
1. values | sequence
A 1D sequence of values.
2. sort | boolean | optional
Whether or not to sort the resulting array of unique values. By default, sort=False.
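3. na_sentinel | int | optional
The value used to mark NaN in the array of integer indices. By default, na_sentinel=-1. Note that this parameter was deprecated in pandas 1.5 and removed in pandas 2.0 in favor of use_na_sentinel.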
Return Value
A tuple of the following two arrays is returned (both are NumPy arrays when the input is a list or a NumPy array):
an array of integer indices that maps the input array to the array of unique values.
an array containing the unique values of the input array.
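The container type of the second element depends on what is passed in. As a minimal sketch (the variable names here are illustrative, not part of the pandas API), a NumPy array input gives back a NumPy array of uniques, whereas a Series input gives back a pandas Index:

```python
import numpy as np
import pandas as pd

values = np.array(["B", "A", "A", "C", "B"], dtype=object)

# NumPy array in -> NumPy array of uniques out
codes_arr, uniques_arr = pd.factorize(values)

# Series in -> pandas Index of uniques out
codes_ser, uniques_ser = pd.factorize(pd.Series(values))

print(isinstance(uniques_arr, np.ndarray))  # True
print(isinstance(uniques_ser, pd.Index))    # True
```

In both cases the integer codes themselves come back as a NumPy array.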
Examples
Basic usage
import pandas as pd
codes, uniques = pd.factorize(["B", "A", "A", "C", "B"])
print("codes:", codes)
print("uniques:", uniques)
codes: [0 1 1 2 0]
uniques: ['B' 'A' 'C']
Note the following:
the codes array maps the values in the input array to the uniques array.
the unique values are ordered as they appear in the input array.
You can recreate the input array using codes and uniques like so:
uniques[codes]
array(['B', 'A', 'A', 'C', 'B'], dtype=object)
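The round trip can also be checked programmatically. The sketch below assumes a NumPy array input with no missing values; the names rebuilt and round_trip_ok are only illustrative:

```python
import numpy as np
import pandas as pd

values = np.array(["B", "A", "A", "C", "B"], dtype=object)
codes, uniques = pd.factorize(values)

# Indexing uniques with codes looks up each original value by its integer code
rebuilt = uniques[codes]

# Compare the rebuilt array with the original, element by element
round_trip_ok = np.array_equal(rebuilt, values)
print(round_trip_ok)  # True
```

This lookup is only safe when the input has no missing values: a code of -1 would be read by NumPy as "the last element" of uniques rather than as a missing value.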
Specifying sort
By default, sort=False, which means that the returned array of unique values is not sorted.
To have the array of unique values sorted, set sort=True like so:
codes, uniques = pd.factorize(["B", "A", "A", "C", "B"], sort=True)
print("codes:", codes)
print("uniques:", uniques)
codes: [1 0 0 2 1]
uniques: ['A' 'B' 'C']
Notice how the uniques are sorted, and the codes array also reflects this.
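To make the effect of sort concrete, the following sketch (with illustrative variable names) factorizes the same input both ways and confirms that, although the codes differ, each pair still describes the same input:

```python
import numpy as np
import pandas as pd

values = np.array(["B", "A", "A", "C", "B"], dtype=object)

codes_unsorted, uniques_unsorted = pd.factorize(values)
codes_sorted, uniques_sorted = pd.factorize(values, sort=True)

# sort=True only reorders the uniques; the set of unique values is unchanged
print(uniques_sorted.tolist() == sorted(uniques_unsorted.tolist()))  # True

# Each (codes, uniques) pair reconstructs the same original array
same_input = np.array_equal(
    uniques_unsorted[codes_unsorted], uniques_sorted[codes_sorted]
)
print(same_input)  # True
```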
Specifying na_sentinel
By default, NaN values are marked as -1 in the codes array:
import numpy as np
codes, uniques = pd.factorize(["B", np.nan, "A", "C", "B"])
print("codes:", codes)
print("uniques:", uniques)
codes: [ 0 -1  1  2  0]
uniques: ['B' 'A' 'C']
In pandas versions before 2.0, we can choose our own value by passing in na_sentinel like so:
codes, uniques = pd.factorize(["B", np.nan, "A", "C", "B"], na_sentinel=50)
print("codes:", codes)
print("uniques:", uniques)
codes: [ 0 50  1  2  0]
uniques: ['B' 'A' 'C']
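Since na_sentinel was deprecated in pandas 1.5 and removed in pandas 2.0, one version-independent way to get a custom marker is to rely on the default -1 and remap it afterwards. This is only a sketch of that workaround, not a built-in pandas feature; custom_codes is an illustrative name and 50 is simply the value used above:

```python
import numpy as np
import pandas as pd

values = np.array(["B", np.nan, "A", "C", "B"], dtype=object)

# Rely only on the default behaviour: missing values receive the code -1
codes, uniques = pd.factorize(values)

# Replace the default marker with a custom one, leaving other codes untouched
custom_codes = np.where(codes == -1, 50, codes)
print(custom_codes)  # [ 0 50  1  2  0]
```

The custom marker should not collide with a real code, so it needs to be negative or at least as large as the number of unique values.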