PySpark DataFrame | sample method
The sample(~) method returns a random subset of rows of the DataFrame.
If True, then sample with replacement, that is, allow for duplicate rows. If False, then sample without replacement, that is, do not allow for duplicate rows.
When withReplacement=False, Bernoulli sampling is performed, which is a technique where we iterate over each element and include it in the sample with a probability of fraction. On the other hand, withReplacement=True uses Poisson sampling: each element appears in the sample a number of times drawn from a Poisson distribution with mean fraction, which is why the same row can occur more than once.
A number between 0 and 1, which represents the probability that a value will be included in the sample. For instance, if fraction=0.5, then each element will be included in the sample with a probability of 0.5.
The sample size of the subset will be random since the sampling is performed using Bernoulli sampling (if withReplacement=False). This means that even setting fraction=0.5 may result in a sample without any rows! On average, though, the number of rows returned will reflect the supplied fraction value.
The seed for reproducibility. By default, no seed is set, which means that the derived samples will be different each time.
A PySpark DataFrame (pyspark.sql.DataFrame).
Consider the following PySpark DataFrame:
Sampling random rows from a PySpark DataFrame (Bernoulli sampling)
To get a random sample in which the probability that an element is included in the sample is 0.5, pass fraction=0.5:
Running the code once again may yield a sample of different size:
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|Cathy| 22|
+-----+---+
This is because the sampling is based on Bernoulli sampling as explained in the beginning.
Sampling with replacement (Poisson Sampling)
Once again, consider the following PySpark DataFrame:
To sample with replacement (using Poisson sampling), set withReplacement=True:
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|  Bob| 24|
|  Bob| 24|
|Cathy| 22|
+-----+---+
Notice how the sample size can exceed the original dataset size.