
# PySpark SQL Functions | count_distinct method

Mar 5, 2023


PySpark SQL Functions' `count_distinct(~)` method counts the number of distinct values in the specified columns.

# Parameters

1. `*cols` | `string` or `Column`

   The columns in which to count the number of distinct values.

# Return Value

A PySpark `Column` holding an integer.

# Examples

Consider the following PySpark DataFrame:

```
+-----+-----+
| name|class|
+-----+-----+
| Alex|    A|
|  Bob|    A|
|Cathy|    B|
+-----+-----+
```

## Counting the number of distinct values in a single column in PySpark

To count the number of distinct values in the `class` column:

```
import pyspark.sql.functions as F

df.select(F.count_distinct("class").alias("c")).show()

+---+
|  c|
+---+
|  2|
+---+
```

Here, we are giving the name `"c"` to the `Column` returned by `count_distinct(~)` via `alias(~)`.

Note that we could also supply a `Column` object to `count_distinct(~)` instead:

```
# df["class"] is used rather than df.class because class is a reserved word in Python
df.select(F.count_distinct(df["class"]).alias("c")).show()

+---+
|  c|
+---+
|  2|
+---+
```

### Obtaining an integer count

By default, `count_distinct(~)` returns a PySpark `Column`. To get an integer count instead:

```
df.select(F.count_distinct("class")).collect()[0][0]

2
```

Here, we use the `select(~)` method to convert the `Column` into a PySpark DataFrame. We then use the `collect(~)` method to convert the DataFrame into a list of `Row` objects. Since there is only one `Row` in this list, and only one value in that `Row`, we use `[0][0]` to access the integer count.

## Counting the number of distinct values in a set of columns in PySpark

To count the number of distinct values for the columns `name` and `class`:

```
df.select(F.count_distinct("name", "class").alias("c")).show()

+---+
|  c|
+---+
|  3|
+---+
```