
# PySpark DataFrame | sample method

Last updated: Jul 1, 2022

Tags: PySpark

PySpark DataFrame's `sample(~)` method returns a random subset of rows of the DataFrame.

# Parameters

1. `withReplacement` | `boolean` | `optional`

• If `True`, then sample with replacement, that is, allow for duplicate rows.

• If `False`, then sample without replacement, that is, do not allow for duplicate rows.

By default, `withReplacement=False`.

WARNING

If `withReplacement=False`, Bernoulli sampling is performed: each row is included in the sample independently with probability `fraction`, so every row appears at most once. If `withReplacement=True`, Poisson sampling is performed instead: the number of times each row appears in the sample is drawn from a Poisson distribution with mean `fraction`, so a single row can appear multiple times.
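The difference between the two sampling schemes can be sketched in plain Python (an illustration of the schemes only, not PySpark's actual implementation; the `rows` list and `poisson` helper are made up for this example):

```python
import math
import random

rows = ["Alex", "Bob", "Cathy", "Doge"]
fraction = 0.5
random.seed(0)

# Bernoulli sampling (withReplacement=False): each row is kept
# independently with probability `fraction`, so it appears 0 or 1 times.
bernoulli_sample = [r for r in rows if random.random() < fraction]

def poisson(lam):
    # Knuth's algorithm for drawing a Poisson variate; fine for small lam.
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

# Poisson sampling (withReplacement=True): each row appears k times,
# where k is drawn from a Poisson distribution with mean `fraction`.
poisson_sample = [r for r in rows for _ in range(poisson(fraction))]
```

Note that `bernoulli_sample` can never contain duplicates, while `poisson_sample` can.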

2. `fraction` | `float`

A number between `0` and `1`, which represents the probability that a value will be included in the sample. For instance, if `fraction=0.5`, then each element will be included in the sample with a probability of `0.5`.

WARNING

The size of the returned subset is random because each row is included or excluded independently (Bernoulli sampling, the default when `withReplacement=False`). This means that even with `fraction=0.5`, the sample may contain no rows at all! On average, though, the fraction of rows returned will match the supplied `fraction` value.
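This variability is easy to see with a quick plain-Python simulation (a sketch of Bernoulli sampling itself, not a PySpark call):

```python
import random

random.seed(42)
n, fraction, trials = 4, 0.5, 10_000

# Draw many Bernoulli samples from an n-row "DataFrame" and record each size.
sizes = [sum(1 for _ in range(n) if random.random() < fraction)
         for _ in range(trials)]

print(min(sizes), max(sizes))  # individual samples range from 0 to n rows
print(sum(sizes) / trials)     # but the mean size is close to n * fraction = 2
```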

3. `seed` | `int` | `optional`

The seed for reproducibility. By default, no seed is set, which means the derived samples will differ each time the method is called.
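To illustrate what setting a seed buys you, here is a plain-Python sketch of seeded Bernoulli sampling (the `sample_names` helper is invented for this example; it is not part of PySpark):

```python
import random

def sample_names(names, fraction, seed=None):
    # A seeded RNG makes the Bernoulli draws reproducible.
    rng = random.Random(seed)
    return [n for n in names if rng.random() < fraction]

names = ["Alex", "Bob", "Cathy", "Doge"]

# Same seed => identical sample on every call.
assert sample_names(names, 0.5, seed=42) == sample_names(names, 0.5, seed=42)

# With no seed, each call draws fresh random numbers, so results vary.
```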

# Return Value

A PySpark DataFrame (`pyspark.sql.dataframe.DataFrame`).

# Examples

Consider the following PySpark DataFrame:

```
df = spark.createDataFrame([["Alex", 20],
                            ["Bob", 24],
                            ["Cathy", 22],
                            ["Doge", 22]],
                           ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
| Doge| 22|
+-----+---+
```

## Sampling random rows from a PySpark DataFrame (Bernoulli sampling)

To get a random sample in which the probability that an element is included in the sample is `0.5`:

```
df.sample(fraction=0.5).show()
+----+---+
|name|age|
+----+---+
|Doge| 22|
+----+---+
```

Running the code once again may yield a sample of different size:

```
df.sample(fraction=0.5).show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|Cathy| 22|
+-----+---+
```

This is because the rows are drawn via Bernoulli sampling, as explained in the Parameters section.

## Sampling with replacement (Poisson Sampling)

Once again, consider the following PySpark DataFrame:

```
df = spark.createDataFrame([["Alex", 20],
                            ["Bob", 24],
                            ["Cathy", 22],
                            ["Doge", 22]],
                           ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
| Doge| 22|
+-----+---+
```

To sample with replacement (using Poisson sampling), use `withReplacement=True`:

```
df.sample(fraction=0.5, withReplacement=True).show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|  Bob| 24|
|  Bob| 24|
|Cathy| 22|
+-----+---+
```

Notice how the sample size can exceed the original dataset size.
