PySpark SparkSession | createDataFrame method

Last updated: Jul 1, 2022

PySpark's createDataFrame(~) method creates a new DataFrame from the given list, Pandas DataFrame or RDD.
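All examples on this page assume an active SparkSession referred to as spark (with its SparkContext available as sc). A minimal local setup could look like the following sketch - the application name here is arbitrary:

from pyspark.sql import SparkSession

# create (or reuse) a local SparkSession; the examples refer to it as `spark`
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
sc = spark.sparkContext  # SparkContext, used by the RDD examples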

Parameters

1. data | list-like or Pandas DataFrame or RDD

The data used to create the new DataFrame.

2. schema | pyspark.sql.types.DataType, string or list | optional

The column names and the data type of each column.

3. samplingRatio | float | optional

If the data types are not provided via schema, then samplingRatio indicates the proportion of rows to sample when inferring the column types. By default, only the first row is used for type inference (see the sketch after this parameter list).

4. verifySchema | boolean | optional

Whether or not to check the data against the given schema. If a value does not match its declared type, an error is thrown. By default, verifySchema=True.
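As a brief illustration of samplingRatio - a sketch, assuming the spark and sc handles from the setup above - sampling every row during inference can be requested like so:

rdd = sc.parallelize([["Alex", None], ["Bob", 30]])
# samplingRatio=1.0 scans all rows when inferring column types,
# rather than relying primarily on the first row
df = spark.createDataFrame(rdd, ["name", "age"], samplingRatio=1.0)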

Return Value

A PySpark DataFrame.

Examples

Creating a PySpark DataFrame from a list of lists

To create a PySpark DataFrame from a list of lists:

rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows)
df.show()
+----+---+
|  _1| _2|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

To create a PySpark DataFrame from a list of lists with the column names specified:

rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

Creating a PySpark DataFrame with column names and type

To create a PySpark DataFrame with column name and type:

rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, "name:string, age:int")
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

Creating a PySpark DataFrame from a list of values

To create a PySpark DataFrame from a list of values:

from pyspark.sql.types import *
vals = [3,4,5]
spark.createDataFrame(vals, IntegerType()).show()
+-----+
|value|
+-----+
|    3|
|    4|
|    5|
+-----+

Here, the IntegerType() indicates that the column is of type integer. This is needed in this case - without it, PySpark cannot infer a schema from plain scalar values and will throw an error, as sketched below.
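As a sketch of that failure mode (using the same spark session; the exact message can vary between versions), omitting the element type makes schema inference fail for plain Python integers:

vals = [3, 4, 5]
# no element type given - PySpark cannot infer a schema from bare ints
# and raises an error such as: TypeError: Can not infer schema for type: <class 'int'>
spark.createDataFrame(vals).show()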

Creating a PySpark DataFrame from a list of tuples

To create a PySpark DataFrame from a list of tuples:

rows = [("Alex", 25), ("Bob", 30)]
df = spark.createDataFrame(rows, ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

Creating a PySpark DataFrame from a list of dictionaries

To create a PySpark DataFrame from a list of dictionaries, where each dictionary maps column names to values:

data = [{"name":"Alex", "age":20},{"name":"Bob", "age":30}]
df = spark.createDataFrame(data)
df.show()
+---+----+
|age|name|
+---+----+
| 20|Alex|
| 30| Bob|
+---+----+

Creating a PySpark DataFrame from an RDD

To create a PySpark DataFrame from an RDD:

rdd = sc.parallelize([["Alex", 25], ["Bob", 30]])
df = spark.createDataFrame(rdd, ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

Here, we are using the parallelize(~) method to create an RDD from a Python list.

Creating a PySpark DataFrame from a Pandas DataFrame

Consider the following Pandas DataFrame:

import pandas as pd
df = pd.DataFrame({"A":[3,4],"B":[5,6]})
df
   A  B
0  3  5
1  4  6

To create a PySpark DataFrame from this Pandas DataFrame:

pyspark_df = spark.createDataFrame(df)
pyspark_df.show()
+---+---+
|  A|  B|
+---+---+
|  3|  5|
|  4|  6|
+---+---+

Creating a PySpark DataFrame with a schema (StructType)

To create a PySpark DataFrame while specifying the column names and types:

from pyspark.sql.types import *
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())])

rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, schema)
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

Here, name is of type string and age is of type integer.

Creating a PySpark DataFrame with date columns

To create a PySpark DataFrame with date columns, use the datetime library:

import datetime
df = spark.createDataFrame([["Alex", datetime.date(1995,12,16)], ["Bob", datetime.date(1995,5,9)]], ["name", "birthday"])
df.show()
+----+----------+
|name|  birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-09|
+----+----------+

Specifying verifySchema

By default, verifySchema=True, which means that an error is thrown if there is a mismatch between the type indicated by the schema and the type inferred from the data:

from pyspark.sql.types import *
schema = StructType([
    StructField("name", IntegerType()),
    StructField("age", IntegerType())])

rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, schema) # verifySchema=True
df.show()
org.apache.spark.api.python.PythonException:
'TypeError: field name: IntegerType can not accept object 'Alex' in type <class 'str'>'

Here, an error is thrown because the inferred type of column name is string, but we have specified the column type to be integer in our schema.

By setting verifySchema=False, PySpark will fill the column with nulls instead of throwing an error:

from pyspark.sql.types import *
schema = StructType([
    StructField("name", IntegerType()),
    StructField("age", IntegerType())])

rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, schema, verifySchema=False)
df.show()
+----+---+
|name|age|
+----+---+
|null| 25|
|null| 30|
+----+---+