How to use pivot() function in PySpark Azure Databricks?

Are you looking to find out how to use the pivot() function in Azure Databricks cloud or maybe you are looking for a solution, to column group values into multiple columns in PySpark Databricks using the select methods? If you are looking for any of these problem solutions, you have landed on the correct page. I will also show you how to use PySpark to pivot column values in DataFrames in Azure Databricks. I will explain it by taking a practical example. So don’t waste time let’s start with a step-by-step guide to understanding how to use pivot() function in PySpark DataFrame.

In this blog, I will teach you the following with practical examples:

  • Syntax of pivot()
  • Pivot PySpark DataFrame
  • Unpivot PySpark DataFrame

The pivot () method is used to reverse or transpose column grouped or identical values with multiple columns.

Syntax:

dataframe_name.groupBy(column_name).pivot(column_name).sum(column_name)

What is the syntax of the pivot() function in PySpark Azure Databricks?

The syntax is as follows:

dataframe_name.groupBy(group_column_name).pivot(pivot_column_name, value).sum(aggregation_column)
Parameter NameRequiredDescription
pivot_column_name (str)YesIt represents the name of the pivot column.
values (list)OptionalIt represents the list of values that will be translated to columns in the output DataFrame.
Table 1: pivot() Method in PySpark Databricks Parameter list with Details

Official documentation link: pivot()

Create a simple DataFrame

Let’s understand the use of the pivot() function with a variety of examples. Let’s start by creating a DataFrame.

Gentle reminder:

In Databricks,

  • sparkSession made available as spark
  • sparkContext made available as sc

In case, you want to create it manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder 
    .master("local[*]") 
    .appName("azurelib.com") 
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

data = [
    ("Foolish","Comedy",6026908,1940),
    ("Four Rooms","Comedy",4301000,1940),
    ("1941","Comedy",34175000,1950),
    ("1776","Drama",0,1940),
    ("Braveheart","Drama",75545647,1940),
    ("Chariots of Fire","Drama",57159946,1950),
    ("Bloodsport","Action",11806119,1950),
    ("Bad Boys","Action",65647413,1940),
    ("3 Ninjas Kick Back","Action",11744960,1950)
]

df = spark.createDataFrame(data, schema=["title", "genre", "gross", "release_year"])
df.printSchema()
df.show(truncate=False)

"""
root
 |-- title: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- gross: long (nullable = true)
 |-- release_year: long (nullable = true)

+------------------+------+--------+------------+
|title             |genre |gross   |release_year|
+------------------+------+--------+------------+
|Foolish           |Comedy|6026908 |1940        |
|Four Rooms        |Comedy|4301000 |1940        |
|1941              |Comedy|34175000|1950        |
|1776              |Drama |0       |1940        |
|Braveheart        |Drama |75545647|1940        |
|Chariots of Fire  |Drama |57159946|1950        |
|Bloodsport        |Action|11806119|1950        |
|Bad Boys          |Action|65647413|1940        |
|3 Ninjas Kick Back|Action|11744960|1950        |
+------------------+------+--------+------------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace the file_path with the source file location which you have downloaded.

df_2 = spark.read.format("csv").option("inferSchema", True).option("header", True).load("file_path")
df_2.printSchema()

"""
root
 |-- title: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- gross: integer (nullable = true)
 |-- release_year: integer (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to pivot a column in PySpark Azure Databricks?

Data is rotated or transposed from one column into many Dataframe columns using the PySpark pivot() method and then reversed using unpivot (). Pivot() is an aggregate that transforms the values of one of the grouping columns into separate columns with distinct data.

Example:

df = df.groupBy("release_year").pivot("genre").sum("gross")
df.show()

"""
Output:

+------------+--------+--------+--------+
|release_year|  Action|  Comedy|   Drama|
+------------+--------+--------+--------+
|        1950|23551079|34175000|57159946|
|        1940|65647413|10327908|75545647|
+------------+--------+--------+--------+

"""

How to unpivot a column in PySpark Azure Databricks?

Below, we are going to unpivot the pivot data frame using SQL expressions.

Example:

from pyspark.sql.functions import expr

unpivotExpr = "STACK(3, 'Comedy', Comedy, 'Drama', Drama, 'Action', Action) AS (genre,gross)"
unPivotDF = df.select("release_year", expr(unpivotExpr)) \
    .where("gross is not null")

unPivotDF.show(truncate=False)

"""
Output

+------------+------+--------+
|release_year|genre |gross   |
+------------+------+--------+
|1950        |Comedy|34175000|
|1950        |Drama |57159946|
|1950        |Action|23551079|
|1940        |Comedy|10327908|
|1940        |Drama |75545647|
|1940        |Action|65647413|
+------------+------+--------+

"""

I have attached the complete code used in this blog in a notebook format in this GitHub link. You can download and import this notebook in databricks, jupyter notebook, etc.

When should you use the pivot() function in PySpark Azure Databricks?

We can use the pivot() function, whenever we want to rotate or transpose the data from one column into multiple Dataframe columns.

Real World Use Case Scenarios for using pivot() function in PySpark Azure Databricks?

What are the alternatives to pivot() function in PySpark Azure Databricks?

The PySpark function pivot() is the only one that helps in rotating or transposing a column. The pivot function is explained in detail with a practical example in the above section.

Final Thoughts

In this article, we have learned about the PySpark pivot() and unpivot methods to select the columns of a dataframe in Azure Databricks along with the examples explained clearly. I have also covered different scenarios with practical examples that could be possible. I hope the information that was provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.