Using SQL against a PySpark DataFrame

Last updated: Jul 1, 2022
Tags: PySpark

Consider the following PySpark DataFrame:

df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+

Registering PySpark DataFrame as a SQL table

Before we can run SQL queries against a PySpark DataFrame, we must first register the DataFrame as a SQL table:

df.createOrReplaceTempView("users")

Here, we have registered the DataFrame as a temporary view called users. This view is dropped whenever the Spark session ends. In contrast, createGlobalTempView(~) creates a view that is shared across Spark sessions and is dropped only when the Spark application ends; global temporary views live in the reserved global_temp database, so they must be referenced as global_temp.users in queries.

Running SQL queries against PySpark DataFrame

We can now run SQL queries against our PySpark DataFrame:

spark.sql("SELECT * FROM users").show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
WARNING

Only read-only SQL statements are allowed. Data manipulation language (DML) statements such as UPDATE and DELETE are not supported, since PySpark DataFrames are immutable and Spark has no notion of transactions.

Using variables in SQL queries

The sql(~) method takes in a SQL query expression as a string, so variables can be incorporated using an f-string:

table_name = "users"
query = f"SELECT * FROM {table_name}"
df_res = spark.sql(query)
df_res.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
Published by Isshin Inada