# PySpark DataFrame | sample method

Last updated: Aug 12, 2023

Tags: PySpark

PySpark DataFrame's `sample(~)` method returns a random subset of rows of the DataFrame.

# Parameters

1. `withReplacement` | `boolean` | `optional`

• If `True`, then sample with replacement, that is, allow for duplicate rows.

• If `False`, then sample without replacement, that is, do not allow for duplicate rows.

By default, `withReplacement=False`.

WARNING

If `withReplacement=False`, then Bernoulli sampling is performed: PySpark iterates over each row and includes it in the sample independently with probability `fraction`, so no row appears more than once. If `withReplacement=True`, Poisson sampling is performed instead: each row appears in the sample `k` times, where `k` is drawn from a Poisson distribution with mean `fraction`, so duplicate rows are possible.
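
The distinction can be sketched in plain Python. The helper names below are hypothetical illustrations of the two per-row schemes, not part of the PySpark API:

```python
import math
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Include each row independently with probability `fraction`
    (the scheme used when withReplacement=False); no duplicates."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

def poisson_sample(rows, fraction, seed=None):
    """Each row appears k times, with k drawn from Poisson(fraction)
    (the scheme used when withReplacement=True); duplicates allowed."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        # Knuth's algorithm: draw k ~ Poisson(fraction) by multiplying
        # uniforms until the product drops below exp(-fraction).
        threshold = math.exp(-fraction)
        k, p = 0, 1.0
        while True:
            p *= rng.random()
            if p <= threshold:
                break
            k += 1
        out.extend([row] * k)
    return out

rows = ["Alex", "Bob", "Cathy", "Doge"]
print(bernoulli_sample(rows, 0.5, seed=0))  # a subset, no duplicates
print(poisson_sample(rows, 0.5, seed=0))    # may contain duplicates
```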

2. `fraction` | `float`

A number between `0` and `1`, which represents the probability that a value will be included in the sample. For instance, if `fraction=0.5`, then each element will be included in the sample with a probability of `0.5`.

WARNING

The size of the returned subset is itself random, because each row is included independently (Bernoulli sampling, when `withReplacement=False`). This means that even setting `fraction=0.5` may result in a sample without any rows! On average, though, the proportion of rows returned will match the supplied `fraction`.
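
A quick pure-Python simulation (not PySpark code) makes the variability concrete: repeating a Bernoulli draw over 4 rows with `fraction=0.5` yields sample sizes anywhere from 0 to 4, averaging 2:

```python
import random

# Simulate many Bernoulli sampling runs over 4 rows with fraction=0.5
# and record how many rows each run keeps.
rng = random.Random(0)
sizes = [sum(rng.random() < 0.5 for _ in range(4)) for _ in range(10_000)]

print(min(sizes), max(sizes))   # both an empty and a full sample occur
print(sum(sizes) / len(sizes))  # mean size is close to 4 * 0.5 = 2
```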

3. `seed` | `int` | `optional`

The seed for reproducibility. By default, no seed is set, which means the sampled rows will differ on each call.
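
The effect of a fixed seed can be sketched with Python's own pseudo-random generator; the helper below is a hypothetical stand-in for the Bernoulli scheme, not PySpark itself:

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    # With a fixed seed the pseudo-random draws repeat exactly, so the
    # same rows are selected on every call; with seed=None the generator
    # is seeded from system entropy and the result varies per run.
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = ["Alex", "Bob", "Cathy", "Doge"]
print(bernoulli_sample(rows, 0.5, seed=42))
print(bernoulli_sample(rows, 0.5, seed=42))  # identical to the first call
```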

# Return Value

A PySpark DataFrame (`pyspark.sql.dataframe.DataFrame`).

# Examples

Consider the following PySpark DataFrame:

```
df = spark.createDataFrame([["Alex", 20],
                            ["Bob", 24],
                            ["Cathy", 22],
                            ["Doge", 22]],
                           ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
| Doge| 22|
+-----+---+
```

## Sampling random rows from a PySpark DataFrame (Bernoulli sampling)

To get a random sample in which the probability that an element is included in the sample is `0.5`:

```
df.sample(fraction=0.5).show()
+----+---+
|name|age|
+----+---+
|Doge| 22|
+----+---+
```

Running the code once again may yield a sample of different size:

```
df.sample(fraction=0.5).show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|Cathy| 22|
+-----+---+
```

This is because the sampling is Bernoulli-based, as explained above: each row is included independently, so the sample size itself varies between runs.

## Sampling with replacement (Poisson sampling)

Once again, consider the following PySpark DataFrame:

```
df = spark.createDataFrame([["Alex", 20],
                            ["Bob", 24],
                            ["Cathy", 22],
                            ["Doge", 22]],
                           ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
| Doge| 22|
+-----+---+
```

To sample with replacement (using Poisson sampling), use `withReplacement=True`:

```
df.sample(fraction=0.5, withReplacement=True).show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|  Bob| 24|
|  Bob| 24|
|Cathy| 22|
+-----+---+
```

Notice that the sample can now contain duplicate rows, so its size can even exceed that of the original DataFrame.
