# Counting frequency of values in PySpark DataFrame Column

Aug 12, 2023
PySpark
Consider the following PySpark DataFrame:

```
df = spark.createDataFrame([['A'],['A'],['B']], ['col1'])
df.show()

+----+
|col1|
+----+
|   A|
|   A|
|   B|
+----+
```

# Counting frequency of values using aggregation (groupBy and count)

To count the frequency of values in column `col1`:

```
df.groupBy('col1').count().show()

+----+-----+
|col1|count|
+----+-----+
|   A|    2|
|   B|    1|
+----+-----+
```

Here, we are first grouping by the values in `col1`, and then for each group, we are counting the number of rows.
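To make the two steps concrete, here is a minimal plain-Python sketch of the same group-then-count logic. It is only a conceptual illustration: Spark performs this work in a distributed fashion across partitions, and the names `values`, `groups` and `counts` are made up for this example.

```python
from collections import defaultdict

# Same values as in col1 of the DataFrame above
values = ["A", "A", "B"]

# Step 1: group the rows by their value
groups = defaultdict(list)
for v in values:
    groups[v].append(v)

# Step 2: count the number of rows in each group
counts = {key: len(rows) for key, rows in groups.items()}
print(counts)  # {'A': 2, 'B': 1}
```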

# Sorting PySpark DataFrame by frequency counts

By default, the resulting PySpark DataFrame is not guaranteed to be in any particular order. We can sort it by the `count` column in descending order using the `orderBy(~)` method:

```
df.groupBy('col1').count().orderBy('count', ascending=False).show()

+----+-----+
|col1|count|
+----+-----+
|   A|    2|
|   B|    1|
+----+-----+
```

Here, the output is similar to that of Pandas' `value_counts(~)` method, which returns the frequency counts in descending order.
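For comparison, the Pandas side might look like the sketch below. It assumes Pandas is installed, and `pdf` is a hypothetical name for a Pandas DataFrame holding the same values as `col1`.

```python
import pandas as pd

# Hypothetical Pandas DataFrame with the same values as col1
pdf = pd.DataFrame({"col1": ["A", "A", "B"]})

# value_counts() sorts the counts in descending order by default
counts = pdf["col1"].value_counts()
print(counts)
```

The most frequent value `A` comes first, matching the ordering we get in PySpark after calling `orderBy('count', ascending=False)`.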

# Assigning label to count aggregate column

As an alternative to calling `groupBy(~)` followed by `count()`, we can pass an aggregate function to the `agg(~)` method:

```
import pyspark.sql.functions as F
df.groupBy('col1').agg(F.count('col1').alias('my_count')).show()

+----+--------+
|col1|my_count|
+----+--------+
|   A|       2|
|   B|       1|
+----+--------+
```

This is more verbose than the solution using `groupBy(~)` and `count()`, but the advantage is that we can use the `alias(~)` method to assign a name to the resulting aggregate column - here the label is `my_count` instead of the default `count`.

