PySpark SparkSession | createDataFrame method

Last updated: Aug 12, 2023

PySpark's createDataFrame(~) method creates a new DataFrame from a given list, Pandas DataFrame, or RDD.

Parameters

1. data | list-like or Pandas DataFrame or RDD

The data used to create the new DataFrame.

2. schema | pyspark.sql.types.DataType, string or list | optional

The column names and the data type of each column.

3. samplingRatio | float | optional

If the data type is not provided via schema, then samplingRatio indicates the proportion of rows to sample when inferring each column's type. By default, only the first few rows are used for type inference.

4. verifySchema | boolean | optional

Whether or not to check each row of data against the given schema. If a value's type does not match the schema, an error is thrown. By default, verifySchema=True.

Return Value

A PySpark DataFrame.

Examples

Creating a PySpark DataFrame from a list of lists

To create a PySpark DataFrame from a list of lists:

rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows)
df.show()
+----+---+
| _1| _2|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

To create a PySpark DataFrame from a list of lists with the column names specified:

rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

Creating a PySpark DataFrame with column names and types

To create a PySpark DataFrame with column names and types specified as a schema string:

rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, "name:string, age:int")
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

Creating a PySpark DataFrame from a list of values

To create a PySpark DataFrame from a list of values:

from pyspark.sql.types import *
vals = [3,4,5]
spark.createDataFrame(vals, IntegerType()).show()
+-----+
|value|
+-----+
| 3|
| 4|
| 5|
+-----+

Here, IntegerType() indicates that the values form a single integer column. This is required in this case; without it, PySpark cannot infer a schema from a flat list of values and throws an error.

Creating a PySpark DataFrame from a list of tuples

To create a PySpark DataFrame from a list of tuples:

rows = [("Alex", 25), ("Bob", 30)]
df = spark.createDataFrame(rows, ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

Creating a PySpark DataFrame from a list of dictionaries

To create a PySpark DataFrame from a list of dictionaries:

data = [{"name":"Alex", "age":20},{"name":"Bob", "age":30}]
df = spark.createDataFrame(data)
df.show()
+---+----+
|age|name|
+---+----+
| 20|Alex|
| 30| Bob|
+---+----+

Creating a PySpark DataFrame from an RDD

To create a PySpark DataFrame from an RDD:

rdd = sc.parallelize([["Alex", 25], ["Bob", 30]])
df = spark.createDataFrame(rdd, ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

Here, we are using the parallelize(~) method to create an RDD.

Creating a PySpark DataFrame from a Pandas DataFrame

Consider the following Pandas DataFrame:

import pandas as pd
df = pd.DataFrame({"A":[3,4],"B":[5,6]})
df
   A  B
0  3  5
1  4  6

To create a PySpark DataFrame from this Pandas DataFrame:

pyspark_df = spark.createDataFrame(df)
pyspark_df.show()
+---+---+
| A| B|
+---+---+
| 3| 5|
| 4| 6|
+---+---+

Creating a PySpark DataFrame with a schema (StructType)

To create a PySpark DataFrame while specifying the column names and types:

from pyspark.sql.types import *
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())])

rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, schema)
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

Here, name is of type string and age is of type integer.

Creating a PySpark DataFrame with date columns

To create a PySpark DataFrame with date columns, use the datetime library:

import datetime
rows = [["Alex", datetime.date(1995, 12, 16)], ["Bob", datetime.date(1995, 5, 9)]]
df = spark.createDataFrame(rows, ["name", "birthday"])
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-09|
+----+----------+

Specifying verifySchema

By default, verifySchema=True, which means that an error is thrown if there is a mismatch between the type declared in the schema and the type inferred from data:

from pyspark.sql.types import *
schema = StructType([
    StructField("name", IntegerType()),
    StructField("age", IntegerType())])

rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, schema) # verifySchema=True
df.show()
org.apache.spark.api.python.PythonException:
'TypeError: field name: IntegerType can not accept object 'Alex' in type <class 'str'>'

Here, an error is thrown because the inferred type of column name is string, but we have specified the column type to be integer in our schema.

By setting verifySchema=False, PySpark will fill the column with nulls instead of throwing an error:

from pyspark.sql.types import *
schema = StructType([
    StructField("name", IntegerType()),
    StructField("age", IntegerType())])

rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, schema, verifySchema=False)
df.show()
+----+---+
|name|age|
+----+---+
|null| 25|
|null| 30|
+----+---+
Published by Isshin Inada