df = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22], ["Doge", 30]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
| Doge| 30|
+-----+---+

Getting the summary statistics of numeric columns of PySpark DataFrame

The summary statistics of our DataFrame are as follows:

df.summary().show()
+-------+----+-----------------+
|summary|name|              age|
+-------+----+-----------------+
|  count|   4|                4|
|   mean|null|             24.0|
| stddev|null|4.320493798938574|
|    min|Alex|               20|
|    25%|null|               20|
|    50%|null|               22|
|    75%|null|               24|
|    max|Doge|               30|
+-------+----+-----------------+
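
The mean and stddev rows for age can be cross-checked outside Spark. The following is a minimal sketch using Python's standard statistics module. The list ages is simply the age column copied by hand, and the check assumes that stddev is the sample standard deviation (dividing by n - 1), which is what the value above is consistent with:

```python
import statistics

# The age column from the DataFrame above, copied by hand for illustration.
ages = [20, 24, 22, 30]

# Arithmetic mean; corresponds to the "mean" row.
mean_age = statistics.mean(ages)

# Sample standard deviation (n - 1 in the denominator);
# assumed here to correspond to the "stddev" row.
stddev_age = statistics.stdev(ages)

print(mean_age, stddev_age)
```

A population standard deviation (statistics.pstdev) would give a smaller value, roughly 3.74, so the two should not be confused when comparing against other tools.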

To compute only certain summary statistics, pass their names to summary(~):

df.summary("max", "min").show()
+-------+----+---+
|summary|name|age|
+-------+----+---+
|    max|Doge| 30|
|    min|Alex| 20|
+-------+----+---+

Getting n-th percentile of numeric columns in PySpark DataFrame

To compute the 60th percentile:

df.summary("60%").show()
+-------+----+---+
|summary|name|age|
+-------+----+---+
|    60%|null| 24|
+-------+----+---+
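
For a small DataFrame like this one, the percentile values shown by summary(~) are consistent with a nearest-rank reading: sort the values and take the one at position ceil(p / 100 * n). The sketch below illustrates that reading only and is not Spark's actual algorithm (the PySpark documentation describes these percentiles as approximate). The function nearest_rank_percentile is a hypothetical helper written for this example:

```python
import math

def nearest_rank_percentile(values, pct):
    """Return the nearest-rank percentile of values (pct in 0..100).

    Hypothetical helper for illustration; not part of PySpark.
    """
    ordered = sorted(values)
    # 1-based rank of the element to pick, clamped to at least 1.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# The age column from the DataFrame above, copied by hand.
ages = [20, 24, 22, 30]

p60 = nearest_rank_percentile(ages, 60)
quartiles = [nearest_rank_percentile(ages, p) for p in (25, 50, 75)]

print(p60, quartiles)
```

On larger datasets Spark's result may differ from this exact calculation, since the percentiles reported by summary(~) are approximate.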

Getting summary statistics of certain columns in PySpark DataFrame

To summarize only certain columns, first use the select(~) method to pick the columns that you want to summarize:

df.select("age").summary("max", "min").show()
+-------+---+
|summary|age|
+-------+---+
|    max| 30|
|    min| 20|
+-------+---+