Counting frequency of values in PySpark DataFrame Column

Last updated: Jul 2, 2022
Tags: PySpark

Consider the following PySpark DataFrame:

df = spark.createDataFrame([['A'],['A'],['B']], ['col1'])
df.show()
+----+
|col1|
+----+
|   A|
|   A|
|   B|
+----+

Counting frequency of values using aggregation (groupBy and count)

To count the frequency of values in column col1:

df.groupBy('col1').count().show()
+----+-----+
|col1|count|
+----+-----+
|   A|    2|
|   B|    1|
+----+-----+

Here, we first group the rows by the values in col1, and then count the number of rows in each group.
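
If you need the frequencies back on the driver as a plain Python dictionary, one option is to collect the aggregated result. This is only a minimal sketch, and it assumes the grouped result is small enough to fit in driver memory (the variable names are illustrative):

freq_rows = df.groupBy('col1').count().collect()            # list of Row objects
freq = {row['col1']: row['count'] for row in freq_rows}     # e.g. {'A': 2, 'B': 1}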

Sorting PySpark DataFrame by frequency counts

The resulting PySpark DataFrame is not sorted in any particular order by default. We can sort the DataFrame by the count column using the orderBy(~) method:

df.groupBy('col1').count().orderBy('count', ascending=False).show()
+----+-----+
|col1|count|
+----+-----+
|   A|    2|
|   B|    1|
+----+-----+

Here, the output is similar to that of Pandas' value_counts(~) method, which returns the frequency counts in descending order.
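
For comparison, here is a sketch of the Pandas equivalent. Note that toPandas() pulls all rows onto the driver, so this is only sensible for small DataFrames:

pandas_counts = df.toPandas()['col1'].value_counts()   # Pandas Series: A -> 2, B -> 1, sorted in descending order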

Assigning label to count aggregate column

Instead of calling count() directly after groupBy(~), we can also use the agg(~) method, which takes an aggregate function as input:

import pyspark.sql.functions as F
df.groupBy('col1').agg(F.count('col1').alias('my_count')).show()
+----+--------+
|col1|my_count|
+----+--------+
|   A|       2|
|   B|       1|
+----+--------+

This is more verbose than the solution using groupBy(~) and count(), but the advantage is that we can use the alias(~) method to assign a name to the resulting aggregate column; here the label is my_count instead of the default count.
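
If you prefer the shorter groupBy(~) and count() approach but still want a custom label, another option (sketched below) is to rename the column after the aggregation using withColumnRenamed(~):

df.groupBy('col1').count().withColumnRenamed('count', 'my_count').show()
+----+--------+
|col1|my_count|
+----+--------+
|   A|       2|
|   B|       1|
+----+--------+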

Published by Isshin Inada