How to use explode() function in PySpark Azure Databricks?

Are you looking to find out how to create new rows from an ArrayType column of PySpark DataFrame using Azure Databricks cloud or maybe you are looking for a solution, to explode an array column in PySpark Databricks using the explode() function? If you are looking for any of these problem solutions, you have landed on the correct page. I will also help you how to use PySpark explode() function with multiple examples in Azure Databricks. I will explain it by taking a practical example. So please don’t waste time let’s start with a step-by-step guide to understand how to use the explode() function in PySpark.

In this blog, I will teach you the following with practical examples:

  • Syntax of explode()
  • Using explode()
  • Using explode_outer()
  • Using posexplode()
  • Using posexplode_outer()
  • Flattening nested array

The Pyspark explode() function is used to transform each element of a list-like to a row, replicating index values.

Syntax:

explode()

What is the syntax of the explode() function in PySpark Azure Databricks?

The syntax is as follows:

explode(column)
Parameter NameRequiredDescription
column (str, Tuple)YesIt represents the exploding column.
Table 1: explode() Method in PySpark Databricks Parameter list with Details

Apache Spark Official Documentation Link: explode()

Create a simple DataFrame

Let’s understand the use of the explode() function with various examples. Let’s start by creating a DataFrame.

Gentle reminder:

In Databricks,

  • sparkSession made available as spark
  • sparkContext made available as sc

In case, you want to create it manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder 
    .master("local[*]") 
    .appName("azurelib.com") 
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

from pyspark.sql.types import StructType, StructField, ArrayType, StringType

data = [
    ("Anand",["Apple", "Banana"],[["Simrun", "Kuspu"], ["Ajith", "Suriya"]]),
    ("Berne",["Pine Apple", "Orange"],[["Trisha", "Samantha"], ["Vijay", "Suriya"]]),
    ("Charan",["Grapes", "Pome"],[["Nayanthara", "Jothika"], ["Simbu", "Vishal"]]),
    ("Denish",["Starwberry", "Blueberry"],[["Asin", "Katrina"], ["Rajini", "Aswin"]]),
    ("Eren",None,[["Sherina", "Emi"], ["Kamal", "Sivaji"]])
]

columns = ["name","foods","actors"]
df = spark.createDataFrame(data, schema=columns)
df.printSchema()
df.show(truncate=False)

"""
root
 |-- name: string (nullable = true)
 |-- foods: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- actors: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)

+------+-----------------------+----------------------------------------+
|name  |foods                  |actors                                  |
+------+-----------------------+----------------------------------------+
|Anand |[Apple, Banana]        |[[Simrun, Kuspu], [Ajith, Suriya]]      |
|Berne |[Pine Apple, Orange]   |[[Trisha, Samantha], [Vijay, Suriya]]   |
|Charan|[Grapes, Pome]         |[[Nayanthara, Jothika], [Simbu, Vishal]]|
|Denish|[Starwberry, Blueberry]|[[Asin, Katrina], [Rajini, Aswin]]      |
|Eren  |null                   |[[Sherina, Emi], [Kamal, Sivaji]]       |
+------+-----------------------+----------------------------------------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace the file_path with the source file location which you have downloaded.

df_2 = spark.read.format("csv").option("header", True).load(file_path)
df_2.printSchema()

"""
root
 |-- name: string (nullable = true)
 |-- foods: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- actors: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to explode ArrayType column in PySpark Azure Databricks?

We can generate new rows from the given column of ArrayType by using the PySpark explode() function. The explode function will not create a new row for an ArrayType column having null as a value.

Example:

from pyspark.sql.functions import explode

df.select("name", explode("foods").alias("explode")).show()

"""
Output:

+------+----------+
|  name|   explode|
+------+----------+
| Anand|     Apple|
| Anand|    Banana|
| Berne|Pine Apple|
| Berne|    Orange|
|Charan|    Grapes|
|Charan|      Pome|
|Denish|Starwberry|
|Denish| Blueberry|
+------+----------+

"""

As mentioned above, the person “Eren” record was not listed in the above example output.

How to explode ArrayType column elements having null values?

We can generate new rows from the given column of ArrayType by using the PySpark explode_outer() function. The explode_outer() function also creates new rows for a map column having null as a value.

Examples:

from pyspark.sql.functions import explode_outer

df.select("name", explode_outer("foods").alias("explode_outer")).show()

"""
Output:

+------+-------------+
|  name|explode_outer|
+------+-------------+
| Anand|        Apple|
| Anand|       Banana|
| Berne|   Pine Apple|
| Berne|       Orange|
|Charan|       Grapes|
|Charan|         Pome|
|Denish|   Starwberry|
|Denish|    Blueberry|
|  Eren|         null|
+------+-------------+

"""

As mentioned above, the person “Eren” record has a null record but is listed in the above example output.

How to explode an ArrayType column with their index position elements?

We can generate new rows from the given column of ArrayType by using the PySpark explode_outer() function. The posexplode() function will create new rows along with the position of the elements inside the map column.

Examples:

from pyspark.sql.functions import posexplode

df.select("name", posexplode("foods").alias("index", "posexplode")).show()

"""
Output:

+------+-----+----------+
|  name|index|posexplode|
+------+-----+----------+
| Anand|    0|     Apple|
| Anand|    1|    Banana|
| Berne|    0|Pine Apple|
| Berne|    1|    Orange|
|Charan|    0|    Grapes|
|Charan|    1|      Pome|
|Denish|    0|Starwberry|
|Denish|    1| Blueberry|
+------+-----+----------+

"""

As mentioned above, each element’s position is recorded in a separate column.

How to explode ArrayType column elements having null values along with their index position in PySpark DataFrame?

We can generate new rows from the given column of ArrayType by using the PySpark explode_outer() function. The explode_outer() function also creates new rows for a map column having null as a value and creates an index column that represents the element index position.

Examples:

from pyspark.sql.functions import posexplode_outer

df.select("name", posexplode_outer("foods").alias("index", "posexplode_outer")).show()

"""
Output:

+------+-----+----------------+
|  name|index|posexplode_outer|
+------+-----+----------------+
| Anand|    0|           Apple|
| Anand|    1|          Banana|
| Berne|    0|      Pine Apple|
| Berne|    1|          Orange|
|Charan|    0|          Grapes|
|Charan|    1|            Pome|
|Denish|    0|      Starwberry|
|Denish|    1|       Blueberry|
|  Eren| null|            null|
+------+-----+----------------+

"""

As mentioned above, each element’s position is recorded in a separate column and includes null records too.

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook in databricks, jupyter notebook, etc.

When should you use the PySpark explode() in Azure Databricks?

You could use the PySpark explode function to unstack the array values into separate records.

Real World Use Case Scenarios for PySpark DataFrame explode() in Azure Databricks?

Assume that you were given a people dataset. The dataset contains people’s ID, names, and favorite_books (an ArrayType column), for example (“EMP1”, “Berne”, [“Wings on Fire”, “Rich Dad Poor Dad”]). Now you were given a requirement to unstack the favourite_books column and to look like (“EMP1”, “Berne”, “Wings on Fire”) and (“EMP1”, “Berne”, “Rich Dad Poor Dad”). You should use the PySpark explode() function to achieve the requirement.

What are the alternatives to the explode() function in PySpark Azure Databricks?

There are multiple alternatives to the explode() function, which is as follows:

  • explode() used for unstacking an ArrayType column
  • explode_outer() used for unstacking ArrayType columns having null values
  • posexplode() used for unstacking an ArrayType column with their index position
  • posexplode_outer() used for unstacking ArrayType columns having null values along with their index position

Final Thoughts

In this article, we have learned about the PySpark explode() method of DataFrame in Azure Databricks along with the examples explained clearly. I have also covered different scenarios with practical examples that could be possible. I hope the information that was provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.