PySpark DataFrame | select method
The select(~) method of PySpark DataFrame returns a new DataFrame with the specified columns.
Parameters
1. *cols | string, Column or list
The columns to include in the returned DataFrame.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
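One way to construct it (a minimal sketch, assuming an active SparkSession named spark):
df = spark.createDataFrame([("Alex", 25), ("Bob", 30)], ["name", "age"])
df.show()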
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Selecting a single column of a PySpark DataFrame
To select a single column, pass the name of the column as a string:
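df.select("name").show()   # select the name column by its label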
+----+
|name|
+----+
|Alex|
| Bob|
+----+
Or equivalently, we could pass in a Column object:
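df.select(df["name"]).show()   # df["name"] is a Column object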
+----+
|name|
+----+
|Alex|
| Bob|
+----+
Here, df["name"] is of type Column. Here, you can think of the role of select(~) as converting a Column object into a PySpark DataFrame.
Or equivalently, the Column object can also be obtained using pyspark.sql.functions:
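import pyspark.sql.functions as F
df.select(F.col("name")).show()   # F.col("name") also returns a Column
This returns the same DataFrame as above.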
Selecting multiple columns of a PySpark DataFrame
To select the columns name and age:
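df.select("name", "age").show()   # pass multiple column labels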
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Or equivalently, we can supply multiple Column objects:
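df.select(df["name"], df["age"]).show()   # pass multiple Column objects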
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Or equivalently, we can supply Column objects obtained from sql.functions:
import pyspark.sql.functions as F
df.select(F.col("name"), F.col("age")).show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Selecting all columns of a PySpark DataFrame
To select all columns, pass "*":
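df.select("*").show()   # "*" selects every column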
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Selecting columns given a list of column labels
To select columns given a list of column labels, use the * operator:
cols = ["name", "age"]
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Here, the * operator is used to unpack the list into positional arguments.
Selecting columns that begin with a certain substring
To select columns that begin with a certain substring:
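A sketch using df.columns (assuming str.startswith for the prefix check):
cols = [col for col in df.columns if col.startswith("na")]
df.select(*cols).show()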
+----+
|name|
+----+
|Alex|
| Bob|
+----+
Here, we are using Python's list comprehension to get a list of column labels that begin with the substring "na":
cols
['name']