PySpark SparkContext | parallelize method

Last updated: Jun 17, 2022

PySpark SparkContext's parallelize(~) method creates an RDD (Resilient Distributed Dataset) from the given dataset.

Parameters

1. c | any

The data you want to convert into an RDD. Typically, you would pass a list of values.

2. numSlices | int | optional

The number of partitions to use. By default, the parallelism level set in the Spark configuration will be used for the number of partitions:

sc.defaultParallelism # For my configs, this is set to 8
8

Return Value

A PySpark RDD (pyspark.rdd.RDD).

Examples

Creating an RDD from a list of values

To create an RDD, use the parallelize(~) method:

rdd = sc.parallelize(["A","B","C","A"])
rdd.collect()
['A', 'B', 'C', 'A']

Since numSlices was not specified here, the number of partitions defaults to the parallelism level of my Spark configuration, which is 8.

Creating an RDD with a specific number of partitions

To create an RDD with 3 partitions from a list:

rdd = sc.parallelize(["A","B","C","A"], numSlices=3)
rdd.collect()
['A', 'B', 'C', 'A']

Here, Spark is partitioning our list into 3 sub-datasets. We can see the content of each partition using the glom() method:

rdd.glom().collect()
[['A'], ['B'], ['C', 'A']]

We can indeed see that there are 3 partitions:

  • Partition one: 'A'

  • Partition two: 'B'

  • Partition three: 'C' and 'A'

Notice how the same value 'A' does not necessarily end up in the same partition - the partitioning is done naively based on the ordering of the list.
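This positional behavior can be mimicked in plain Python. The following is a minimal sketch, not Spark's actual implementation; the helper name slice_positions is hypothetical, and it assumes elements are assigned to partitions purely by contiguous index ranges:

```python
def slice_positions(data, num_slices):
    """Split a list into num_slices contiguous chunks, mimicking how
    parallelize(~) assigns elements to partitions by position."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

print(slice_positions(["A", "B", "C", "A"], 3))
# [['A'], ['B'], ['C', 'A']]
```

Note how the chunk boundaries reproduce the glom() output above: element values play no role, only their positions in the list.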

Creating a pair RDD

To create a pair RDD, pass a list of tuples like so:

rdd = sc.parallelize([("A",1),("B",1),("C",1),("A",1)], numSlices=3)
rdd.collect()
[('A', 1), ('B', 1), ('C', 1), ('A', 1)]

Note that parallelize will not perform partitioning based on the key, as shown here:

rdd.glom().collect()
[[('A', 1)], [('B', 1)], [('C', 1), ('A', 1)]]

We can see that, just as in the previous case, the partitioning is based on the ordering of the list.

NOTE

A pair RDD is not a type of its own:

type(rdd)
pyspark.rdd.RDD

What makes pair RDDs special is that we can call additional methods such as reduceByKey(~), which groups the tuples by key and then applies a custom reduction function to each group's values:

rdd = sc.parallelize([("A",1),("B",1),("C",1),("A",1)], numSlices=3)
new_rdd = rdd.reduceByKey(lambda a,b: a+b)
new_rdd.collect()
[('B', 1), ('C', 1), ('A', 2)]

Here, the reduction function that we used is a simple summation.
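To make the semantics concrete, here is a plain-Python sketch of the result reduceByKey computes. The helper name reduce_by_key is hypothetical, and unlike Spark, the output order here simply follows the first appearance of each key:

```python
def reduce_by_key(pairs, f):
    """Fold the values of each key with f, mirroring the result
    (though not the distributed execution) of RDD.reduceByKey."""
    out = {}
    for key, value in pairs:
        # Combine with the running value if the key was seen before
        out[key] = f(out[key], value) if key in out else value
    return list(out.items())

print(reduce_by_key([("A", 1), ("B", 1), ("C", 1), ("A", 1)],
                    lambda a, b: a + b))
# [('A', 2), ('B', 1), ('C', 1)]
```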

Creating an RDD from a Pandas DataFrame

Consider the following Pandas DataFrame:

import pandas as pd
df_pandas = pd.DataFrame({"A":[3,4],"B":[5,6]})
df_pandas
A B
0 3 5
1 4 6

To create an RDD that contains the values of this Pandas DataFrame:

df_spark = spark.createDataFrame(df_pandas)
rdd = df_spark.rdd
rdd.collect()
[Row(A=3, B=5), Row(A=4, B=6)]

Notice how only the values of the DataFrame are kept - column labels are not included in the RDD.

WARNING

Even though parallelize(~) can accept a Pandas DataFrame directly, this does not give us the desired RDD:

import pandas as pd
df_pandas = pd.DataFrame({"A":[3,4],"B":[5,6]})
rdd = sc.parallelize(df_pandas)
rdd.collect()
['A', 'B']

As you can see, the RDD contains only the column labels, not the data itself.
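This happens because parallelize(~) simply iterates over whatever it is given, and iterating over a Pandas DataFrame yields its column labels rather than its rows. A quick demonstration in plain Python:

```python
import pandas as pd

df_pandas = pd.DataFrame({"A": [3, 4], "B": [5, 6]})

# Iterating over a DataFrame yields its column labels, not its rows;
# this iteration is exactly what parallelize(~) consumes.
print(list(df_pandas))
# ['A', 'B']
```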

Published by Isshin Inada