PySpark SQL Functions | col method
PySpark SQL's col(~) method returns a Column object based on the given column label.

Parameters

1. col | string

The label of the column to return.
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
Selecting a column in PySpark
To select the name column, pass F.col("name") to the select(~) method:
Note that we could also select the name column without the explicit use of F.col(~) like so:
Creating a new column
To create a new column called status whose values are dependent on the age column, use the when(~) and otherwise(~) methods:
Note the following:
"*"refers to all the columns of
we are using the when(~).otherwise(~) pattern to fill the values of our column conditionally
we use the alias(~) method to assign a label to the new column
F.col("age") can also be replaced by
How does col know which DataFrame's column to refer to?
Notice how the col(~) method only takes in as argument the name of the column. PySpark executes our code lazily and waits until an action is invoked (e.g. show()) to run all the transformations (e.g. df.select(~)). Therefore, PySpark will have the needed context to decipher which DataFrame's column the col(~) is referring to.
For example, suppose we have the following two PySpark DataFrames with the same schema:
df1 = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df2 = spark.createDataFrame([["Cathy", 40], ["Doge", 50]], ["name", "age"])
my_col = F.col("name")
Let's select the name column from df1:
Here, PySpark knows that we are referring to df1's name column because df1 is invoking the transformation (df1.select(~)).
Let's now select the name column from df2:
+-----+
| name|
+-----+
|Cathy|
| Doge|
+-----+
Again, PySpark is aware that this time the name column is referring to df2's name column, because df2 is the DataFrame invoking select(~).