PySpark DataFrame | select method
The select(~) method of PySpark DataFrame returns a new DataFrame with the specified columns.
Parameters
1. *cols | string, Column or list
The columns to include in the returned DataFrame.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
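One way to construct it (a minimal sketch, assuming an active SparkSession named spark):
df = spark.createDataFrame([("Alex", 25), ("Bob", 30)], ["name", "age"])
df.show()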
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Selecting a single column of a PySpark DataFrame
To select a single column, pass the name of the column as a string:
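df.select("name").show()   # select the name column by its label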
+----+
|name|
+----+
|Alex|
| Bob|
+----+
Or equivalently, we could pass in a Column object:
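df.select(df["name"]).show()   # df["name"] is a Column object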
+----+
|name|
+----+
|Alex|
| Bob|
+----+
Here, df["name"] is of type Column. Here, you can think of the role of select(~) as converting a Column object into a PySpark DataFrame.
Or equivalently, the Column object can also be obtained using pyspark.sql.functions:
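import pyspark.sql.functions as F
df.select(F.col("name")).show()   # F.col("name") also returns a Column
This returns the same DataFrame as above.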
Selecting multiple columns of a PySpark DataFrame
To select the columns name and age:
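df.select("name", "age").show()   # pass multiple column labels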
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Or equivalently, we can supply multiple Column objects:
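df.select(df["name"], df["age"]).show()   # pass multiple Column objects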
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Or equivalently, we can supply Column objects obtained from sql.functions:
import pyspark.sql.functions as F
df.select(F.col("name"), F.col("age")).show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Selecting all columns of a PySpark DataFrame
To select all columns, pass "*":
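df.select("*").show()   # "*" selects every column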
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Selecting columns given a list of column labels
To select columns given a list of column labels, use the * operator:
cols = ["name", "age"]
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Here, the * operator is used to unpack the list into positional arguments.
Selecting columns that begin with a certain substring
To select columns that begin with a certain substring:
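A sketch using df.columns (assuming str.startswith for the prefix check):
cols = [col for col in df.columns if col.startswith("na")]
df.select(*cols).show()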
+----+
|name|
+----+
|Alex|
| Bob|
+----+
Here, we are using Python's list comprehension to get a list of column labels that begin with the substring "na":
cols
['name']