How to create and manipulate ArrayType columns in PySpark Azure Databricks?

Are you looking to find out how to create an ArrayType column in a PySpark DataFrame on Azure Databricks, or maybe you are looking for a solution to manipulate an ArrayType column’s values using PySpark’s built-in methods? If you are looking for any of these problem solutions, you have landed on the correct page. I will show you how to use PySpark to create and manipulate ArrayType columns and their values in DataFrames in Azure Databricks, explaining everything with practical examples. So, without wasting time, let’s start with a step-by-step guide to understanding how to create and change ArrayType column values in a PySpark DataFrame.

In this blog, I will teach you the following with practical examples:

  • Syntax of ArrayType
  • Creating an instance of the ArrayType column
  • Creating ArrayType column using StructType
  • Array functions

The ArrayType() method is used to create an instance of the ArrayType column datatype of a DataFrame in PySpark Azure Databricks.

What is the syntax of the ArrayType() function in PySpark Azure Databricks?

The syntax is as follows:

ArrayType(element_data_type, contains_null)

Parameter Name               | Required | Description
element_data_type (DataType) | Yes      | It represents the data type of the array elements.
contains_null (bool)         | Optional | It represents whether the array elements can contain null values (defaults to True).

Table 1: ArrayType() Method in PySpark Databricks Parameter list with Details

Apache Spark Official Documentation Link: ArrayType()

Gentle reminder:

In Databricks,

  • SparkSession is made available as spark
  • SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

How to create an instance of ArrayType column in PySpark Azure Databricks?

The PySpark ArrayType() takes two arguments: an element datatype and a bool value representing whether the array can contain null elements. By default, containsNull is True. Let’s start by creating an ArrayType instance.

from pyspark.sql.types import ArrayType, IntegerType

array_column = ArrayType(elementType=IntegerType(), containsNull=True)
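
You can quickly verify what the created instance describes; here is a minimal check using the array_column created above:

print(array_column.simpleString())
print(array_column.containsNull)

"""
Output:

array<int>
True

"""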

How to create an ArrayType column in PySpark Azure Databricks?

Let’s create a DataFrame containing an ArrayType column. You can also download and use the source file attached to this blog.

from pyspark.sql.types import StructType, StructField, ArrayType, StringType

schema = StructType([
    StructField("full_name", StringType(), True),
    StructField("items", ArrayType(StringType()), True),
    StructField("city", StringType(), True),
    StructField("state", StringType(), True),
])

data = [
    ("Indhu Madhi", (["cakes", "cookies", "chocolates"]), "Salem", "TN"),
    ("Asha Shree", (["apple", "orange", "banana"]), "Cochin", "KL"),
    ("Anand Kumar", (["carrot", "banana", "cookies"]), "Erode", "TN"),
    ("Aswin Raj", None, "Whitefield", "KA")
]

# If you want to read data from a file, I have attached the dataset in the blog.

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
df.show(truncate=False)

"""
Output:

root
 |-- full_name: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)

+-----------+----------------------------+----------+-----+
|full_name  |items                       |city      |state|
+-----------+----------------------------+----------+-----+
|Indhu Madhi|[cakes, cookies, chocolates]|Salem     |TN   |
|Asha Shree |[apple, orange, banana]     |Cochin    |KL   |
|Anand Kumar|[carrot, banana, cookies]   |Erode     |TN   |
|Aswin Raj  |null                        |Whitefield|KA   |
+-----------+----------------------------+----------+-----+

"""

How to create new rows from ArrayType column in PySpark Azure Databricks?

We can generate new rows from a given ArrayType column by using the PySpark explode() function. The explode() function will not create a row for a record whose ArrayType column is null.

Example:

from pyspark.sql.functions import explode

df.select("full_name", explode("items").alias("foods")).show()

"""
Output:

+-----------+----------+
|  full_name|     foods|
+-----------+----------+
|Indhu Madhi|     cakes|
|Indhu Madhi|   cookies|
|Indhu Madhi|chocolates|
| Asha Shree|     apple|
| Asha Shree|    orange|
| Asha Shree|    banana|
|Anand Kumar|    carrot|
|Anand Kumar|    banana|
|Anand Kumar|   cookies|
+-----------+----------+

"""

As shown above, the record for “Aswin Raj” was not listed in the output because his items column is null.

How to create new rows from ArrayType column having null values in PySpark Azure Databricks?

We can generate new rows from a given ArrayType column by using the PySpark explode_outer() function. The explode_outer() function also creates a row for a record whose array column is null.

Example:

from pyspark.sql.functions import explode_outer

df.select("full_name", explode_outer("items").alias("foods")).show()

"""
Output:

+-----------+----------+
|  full_name|     foods|
+-----------+----------+
|Indhu Madhi|     cakes|
|Indhu Madhi|   cookies|
|Indhu Madhi|chocolates|
| Asha Shree|     apple|
| Asha Shree|    orange|
| Asha Shree|    banana|
|Anand Kumar|    carrot|
|Anand Kumar|    banana|
|Anand Kumar|   cookies|
|  Aswin Raj|      null|
+-----------+----------+

"""

As shown above, the “Aswin Raj” record has a null items value but is still listed in the output.

How to create new rows with their index position from ArrayType column in PySpark Azure Databricks?

We can generate new rows from a given ArrayType column by using the PySpark posexplode() function. The posexplode() function creates new rows along with the position of each element inside the ArrayType column.

Example:

from pyspark.sql.functions import posexplode

df.select("full_name", posexplode("items").alias("food_index", "foods")).show()

"""
Output:

+-----------+----------+----------+
|  full_name|food_index|     foods|
+-----------+----------+----------+
|Indhu Madhi|         0|     cakes|
|Indhu Madhi|         1|   cookies|
|Indhu Madhi|         2|chocolates|
| Asha Shree|         0|     apple|
| Asha Shree|         1|    orange|
| Asha Shree|         2|    banana|
|Anand Kumar|         0|    carrot|
|Anand Kumar|         1|    banana|
|Anand Kumar|         2|   cookies|
+-----------+----------+----------+

"""

As shown above, each element’s position is recorded in a separate column.

How to create new rows from ArrayType column having null values along with their index value in PySpark Azure Databricks?

We can generate new rows from a given ArrayType column by using the PySpark posexplode_outer() function. The posexplode_outer() function also creates a row for an array column having null as a value, and it adds an index column that represents each element’s position.

Example:

from pyspark.sql.functions import posexplode_outer

df.select("full_name", posexplode_outer("items").alias("food_index", "foods")).show()

"""
Output:

+-----------+----------+----------+
|  full_name|food_index|     foods|
+-----------+----------+----------+
|Indhu Madhi|         0|     cakes|
|Indhu Madhi|         1|   cookies|
|Indhu Madhi|         2|chocolates|
| Asha Shree|         0|     apple|
| Asha Shree|         1|    orange|
| Asha Shree|         2|    banana|
|Anand Kumar|         0|    carrot|
|Anand Kumar|         1|    banana|
|Anand Kumar|         2|   cookies|
|  Aswin Raj|      null|      null|
+-----------+----------+----------+

"""

As shown above, each element’s position is recorded in a separate column, and the null record is included too.

How to split string column values into ArrayType in PySpark Azure Databricks?

We can split a StringType column into an ArrayType column by using the PySpark split() function. The split() function takes the column to be split, a regex pattern, and an optional limit as arguments.

Example:

from pyspark.sql.functions import split

df.select(split("full_name", " ").alias("split_name")).show()

"""
Output:

+--------------+
|    split_name|
+--------------+
|[Indhu, Madhi]|
| [Asha, Shree]|
|[Anand, Kumar]|
|  [Aswin, Raj]|
+--------------+

"""

As shown above, each full_name value gets split into an array of elements using a space (" ") as the pattern.
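
Since Spark 3.0, split() also accepts the optional limit argument. Here is a minimal sketch where limit=1 keeps each string as a single-element array:

from pyspark.sql.functions import split

df.select(split("full_name", " ", 1).alias("split_name")).show()

"""
Output:

+-------------+
|   split_name|
+-------------+
|[Indhu Madhi]|
| [Asha Shree]|
|[Anand Kumar]|
|  [Aswin Raj]|
+-------------+

"""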

How to merge two columns into ArrayType column in PySpark Azure Databricks?

We can merge multiple columns into a single column of ArrayType by using the PySpark array() function. The array() function takes columns as arguments.

Example:

from pyspark.sql.functions import array

df.select("full_name", array("city", "state").alias("address")).show(truncate=False)

"""
Output:

+-----------+----------------+
|full_name  |address         |
+-----------+----------------+
|Indhu Madhi|[Salem, TN]     |
|Asha Shree |[Cochin, KL]    |
|Anand Kumar|[Erode, TN]     |
|Aswin Raj  |[Whitefield, KA]|
+-----------+----------------+

"""

As shown above, the city and state columns are merged into a single column (address).

How to check element in ArrayType column in PySpark Azure Databricks?

We can check whether a column of type ArrayType contains a specific element or not by using the PySpark array_contains() function. The array_contains() function takes a column and a literal value as arguments.

Example:

In the below example, let’s find out who loves or has “cookies” in their items list.

from pyspark.sql.functions import array_contains

df.select("full_name", "items").filter(array_contains("items", "cookies")).show(truncate=False)

"""
Output:

+-----------+----------------------------+
|full_name  |items                       |
+-----------+----------------------------+
|Indhu Madhi|[cakes, cookies, chocolates]|
|Anand Kumar|[carrot, banana, cookies]   |
+-----------+----------------------------+

"""

Using the array_contains() function, we can see that “Indhu Madhi” and “Anand Kumar” have cookies in their items list.
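
Instead of filtering, array_contains() can also be selected as a boolean flag column. A minimal sketch (note that a null array yields null rather than false):

from pyspark.sql.functions import array_contains

df.select("full_name", array_contains("items", "cookies").alias("has_cookies")).show()

"""
Output:

+-----------+-----------+
|  full_name|has_cookies|
+-----------+-----------+
|Indhu Madhi|       true|
| Asha Shree|      false|
|Anand Kumar|       true|
|  Aswin Raj|       null|
+-----------+-----------+

"""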

I have attached the complete code used in this blog in notebook format at this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you use ArrayType column functions in PySpark Azure Databricks?

These are some of the possible reasons:

  1. To create new records from an ArrayType column by exploding it
  2. To split a string value into an array
  3. To create an ArrayType column from existing columns
  4. To check whether an element is present in an array

Real World Use Case Scenarios for using ArrayType column functions in PySpark Azure Databricks

  • Assume you have a column named “full_name” containing a person’s name. You want to create two columns named “first_name” and “last_name” by splitting the “full_name” column, for example (“Vijay Antony”) into (“Vijay”, “Antony”). You can use the split() function with the separator as the delimiter, as shown in the sketch after this list.
  • Assume you were given a dataset containing people’s names and their favorite foods, for example (“Berne”, [“apple”, “pancake”, “waffle”]). Now you have been given a requirement to fetch the count of unique foods. For this, you might need to explode the ArrayType column into StringType rows and apply a distinct count, also shown below.
  • Assume you were given a candidate dataset that contains ID, name, and preferred locations as an array, for example (“Berne”, [“Chennai”, “Bangalore”, “Mumbai”]). You have been asked to fetch all the candidates who prefer to work in the “Chennai” location. You can use the array_contains() function to check whether “Chennai” is inside the array column and pass it inside the filter() function to get the candidates’ records.
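
Here is a minimal sketch of the first two scenarios, reusing the df created earlier (the first_name and last_name column names are just illustrative):

from pyspark.sql.functions import split, explode, col

# Scenario 1: split full_name into first_name and last_name columns
name_parts = split(col("full_name"), " ")
df.select(
    name_parts.getItem(0).alias("first_name"),
    name_parts.getItem(1).alias("last_name")
).show()

# Scenario 2: count the unique foods across all items arrays
print(df.select(explode("items").alias("food")).distinct().count())  # prints 7 for the sample data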

These are some of the real-world examples; each of the array functions is explained in detail with a practical example in the sections above.

What are the alternatives for creating ArrayType column in PySpark Azure Databricks?

These alternatives were discussed with multiple examples in the above section.

  • Use the lit() function inside array() for creating an ArrayType column from literal values (see the sketch after this list)
  • Use the array() function to create an ArrayType column from existing columns
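
Here is a minimal sketch of the lit() approach (the "tags" column name and its values are just illustrative; wrapping lit() values inside array() works across Spark versions):

from pyspark.sql.functions import array, lit

# Add a constant ArrayType column built from literal values to every row
df.withColumn("tags", array(lit("snack"), lit("favorite"))) \
    .select("full_name", "tags").show(truncate=False)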

Final Thoughts

In this article, we have learned about PySpark ArrayType and its methods to manipulate the columns of a DataFrame in Azure Databricks, along with clearly explained examples. I have also covered different possible scenarios with practical examples. I hope the information provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

Arud Seka Berne S

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.