How to create ArrayType column from existing columns PySpark Azure Databricks?

Are you looking to find out how to join columns into a column of ArrayType of PySpark DataFrame using Azure Databricks cloud or maybe you are looking for a solution, to group multiple columns together in PySpark Databricks using the array_contains() function? If you are looking for any of these problem solutions, you have landed on the correct page. I will also help you how to use PySpark array() function with multiple examples in Azure Databricks. I will explain it by taking a practical example. So please don’t waste time let’s start with a step-by-step guide to understand how to use the array() function in PySpark.

In this blog, I will teach you the following with practical examples:

  • Syntax of array()
  • Using array() function with select()
  • Using array() function with withColumn()

The Pyspark array() function is used to create a column of ArrayType.

Syntax:

array()

What is the syntax of the array() function in PySpark Azure Databricks?

The syntax is as follows:

array(columns)
Parameter NameRequiredDescription
column (str, Column)YesIt represents the list of columns to be grouped together.
Table 1: array() Method in PySpark Databricks Parameter list with Details

Apache Spark Official Documentation Link: array()

Create a simple DataFrame

Let’s understand the use of the array() function with various examples. Let’s start by creating a DataFrame.

Gentle reminder:

In Databricks,

  • sparkSession made available as spark
  • sparkContext made available as sc

In case, you want to create it manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder 
    .master("local[*]") 
    .appName("azurelib.com") 
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

data = [
    ("Anandhi", "Chennai", "Cochin", "Hyderabad"),
    ("Benish", "Coimbatore", "Mumbai", "Chennai"),
    ("Chandru", "Salem", "Bangalore", None),
    ("Derif", "Delhi", "Bangalore", "Noida"),
    ("Fayaz", "Mumbai", "Pune", "Cochin"),
    ("Gomathi", "Chennai", "Mumbai", None),
    ("Harini", "Delhi", "Noida", "Kolkata")
]

columns = ["name", "pref1", "pref2", "pref3"]
df = spark.createDataFrame(data, schema=columns)
df.printSchema()
df.show(truncate=False)

"""
root
 |-- name: string (nullable = true)
 |-- pref1: string (nullable = true)
 |-- pref2: string (nullable = true)
 |-- pref3: string (nullable = true)

+-------+----------+---------+---------+
|name   |pref1     |pref2    |pref3    |
+-------+----------+---------+---------+
|Anandhi|Chennai   |Cochin   |Hyderabad|
|Benish |Coimbatore|Mumbai   |Chennai  |
|Chandru|Salem     |Bangalore|null     |
|Derif  |Delhi     |Bangalore|Noida    |
|Fayaz  |Mumbai    |Pune     |Cochin   |
|Gomathi|Chennai   |Mumbai   |null     |
|Harini |Delhi     |Noida    |Kolkata  |
+-------+----------+---------+---------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace the file_path with the source file location which you have downloaded.

df_2 = spark.read.format("json").load(file_path)
df_2.printSchema()

"""
root
 |-- name: string (nullable = true)
 |-- pref1: string (nullable = true)
 |-- pref2: string (nullable = true)
 |-- pref3: string (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to create an ArrayType column from existing columns in PySpark Azure Databricks?

In this example, let’s see how to use array() function in different ways to create a column of ArrayType in PySpark Azure Databricks.

Example 1:

In this example, let’s try to group multiple columns into a single column with different methods using select().

from pyspark.sql.functions import array, col

# Method 1:
df.select("name", array("pref1", "pref2", "pref3").alias("preferences")).show(truncate=False)

# Method 2:
prefs_col = ["pref1", "pref2", "pref3"]
df.select("name", array(*prefs_col).alias("preferences")).show(truncate=False)

# Method 3:
prefs_col = ["pref1", "pref2", "pref3"]
df \
.select("name", array([col(pref) for pref in prefs_col]).alias("preferences"))\
.show(truncate=False)

# The above three method gives the same output as mentioned below

"""
Output:

+-------+-----------------------------+
|name   |preferences                  |
+-------+-----------------------------+
|Anandhi|[Chennai, Cochin, Hyderabad] |
|Benish |[Coimbatore, Mumbai, Chennai]|
|Chandru|[Salem, Bangalore, null]     |
|Derif  |[Delhi, Bangalore, Noida]    |
|Fayaz  |[Mumbai, Pune, Cochin]       |
|Gomathi|[Chennai, Mumbai, null]      |
|Harini |[Delhi, Noida, Kolkata]      |
+-------+-----------------------------+

"""

Example 2:

In this example, let’s try to group multiple columns into a single column with different methods using withColumn().

from pyspark.sql.functions import array, col

# Method 1:
df.withColumn("preferences", array("pref1", "pref2", "pref3")) \
.select("name", "preferences").show(truncate=False)

# Method 2:
prefs_col = ["pref1", "pref2", "pref3"]
df.withColumn("preferences", array(*prefs_col)) \
.select("name", "preferences").show(truncate=False)

# Method 3:
prefs_col = ["pref1", "pref2", "pref3"]
df \
.withColumn("preferences", array([col(pref) for pref in prefs_col]))\
.select("name", "preferences").show(truncate=False)


# The above three method gives the same output as mentioned below

"""
Output:

+-------+-----------------------------+
|name   |preferences                  |
+-------+-----------------------------+
|Anandhi|[Chennai, Cochin, Hyderabad] |
|Benish |[Coimbatore, Mumbai, Chennai]|
|Chandru|[Salem, Bangalore, null]     |
|Derif  |[Delhi, Bangalore, Noida]    |
|Fayaz  |[Mumbai, Pune, Cochin]       |
|Gomathi|[Chennai, Mumbai, null]      |
|Harini |[Delhi, Noida, Kolkata]      |
+-------+-----------------------------+

"""

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook in databricks, jupyter notebook, etc.

When should you use the PySpark array() in Azure Databricks?

These could be the possible reasons:

  1. For creating new ArrayType column from existing columns
  2. When you want to group multiple columns into a single ArrayType column

Real World Use Case Scenarios for PySpark DataFrame array() in Azure Databricks?

Assume you were given a dataset of candidates that includes candidate ID, name, preference1, and preference2. Instead of two separate preference columns as “perference1” and “perference2”, you could combine both StringType columns into an ArrayType column as “preferences” so that the values look like [“perf_1”, “pref_2”].

What are the alternatives to the array() function in PySpark Azure Databricks?

The PySpark function array() is the only one that helps in creating a new ArrayType column from existing columns, and this function is explained in detail in the above section.

  • lit() can be used for creating an ArrayType column from a literal value

Final Thoughts

In this article, we have learned about the PySpark array() method of DataFrame in Azure Databricks along with the examples explained clearly. I have also covered different scenarios with practical examples that could be possible. I hope the information that was provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.