Counting frequency of values in PySpark DataFrame Column
Consider the following PySpark DataFrame:
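For illustration, assume a small DataFrame with a single column col1 (the exact values here are chosen arbitrarily):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample DataFrame with a single column col1 (values assumed for illustration)
df = spark.createDataFrame([("A",), ("A",), ("B",), ("A",), ("B",)], ["col1"])
df.show()

+----+
|col1|
+----+
|   A|
|   A|
|   B|
|   A|
|   B|
+----+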
Counting frequency of values using aggregation (groupBy and count)
To count the frequency of values in column col1, use the groupBy(~) and count() methods:
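For example, with the assumed df from above (the row order of the output is not guaranteed):

# Group by the distinct values of col1 and count the rows in each group
df.groupBy("col1").count().show()

+----+-----+
|col1|count|
+----+-----+
|   A|    3|
|   B|    2|
+----+-----+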
Here, we first group the rows by the values in col1, and then count the number of rows in each group.
Sorting PySpark DataFrame by frequency counts
The resulting PySpark DataFrame is not sorted in any particular order by default. We can sort it by the count column using the orderBy(~) method:
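For example, again with the assumed df from above:

# Sort the frequency counts in descending order
df.groupBy("col1").count().orderBy("count", ascending=False).show()

+----+-----+
|col1|count|
+----+-----+
|   A|    3|
|   B|    2|
+----+-----+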
Here, the output is similar to that of Pandas' value_counts(~) method, which returns the frequency counts in descending order.
Assigning label to count aggregate column
Instead of using the count() method, we can also use the agg(~) method, which takes as input an aggregate function:
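For example, with the assumed df from above:

from pyspark.sql import functions as F

# Count the rows per group and label the aggregate column my_count
df.groupBy("col1").agg(F.count("col1").alias("my_count")).show()

+----+--------+
|col1|my_count|
+----+--------+
|   A|       3|
|   B|       2|
+----+--------+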
This is more verbose than the solution using count(), but the advantage is that we can use the alias(~) method to assign a name to the resulting aggregate column. Here, the label is my_count instead of the default count.