How to convert column values to a list in PySpark Azure Databricks?

Are you looking for a way to convert a DataFrame column into a Python list in PySpark on Azure Databricks, whether through the RDD API or the DataFrame API? If so, you have landed on the correct page. I will show you how to perform these actions with PySpark in Azure Databricks, explaining each one with a practical example. So, without wasting any time, let's start the step-by-step guide to understanding how to convert column values to a list in PySpark Azure Databricks.

In this blog, I will teach you the following with practical examples:

  • Converting column values into a list using the column index
  • Converting column values into a list using the column name
  • Converting column values into a list using flatMap()
  • Converting column values into a list using Pandas
  • Getting column values as Row types
  • Converting multiple columns into Python lists
  • Removing duplicate column values

Create a simple DataFrame

Gentle reminder:

In Databricks,

  • the SparkSession is made available as spark
  • the SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

data = [ 
    (1,"Jacynth","DL"),
    (2,"Rand","TN"),
    (3,"Marquita","DL"),
    (4,"Rodrick","KL"),
    (5,"Ingram","TN")
]

df = spark.createDataFrame(data, schema=["id","name","state"])
df.printSchema()
df.show(truncate=False)

"""
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

+---+--------+-----+
|id |name    |state|
+---+--------+-----+
|1  |Jacynth |DL   |
|2  |Rand    |TN   |
|3  |Marquita|DL   |
|4  |Rodrick |KL   |
|5  |Ingram  |TN   |
+---+--------+-----+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace file_path with the location of the source file you downloaded

df_2 = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load(file_path)
df_2.printSchema()

"""
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)
"""

Do remember one thing: to fetch a column and convert its values into a list (or any other iterable), you have to select the column and then apply the collect() action on top of that transformation.
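For example, here is that general pattern as a minimal sketch (using the df created above): collect() returns a list of Row objects, which you then unpack into plain Python values.

# General pattern: select the column, collect the Row objects, unpack the values
rows = df.select("state").collect()        # list of Row objects
values = [row["state"] for row in rows]    # plain Python list
print(values)

"""
Output:

['DL', 'TN', 'DL', 'KL', 'TN']

"""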

How to convert columns to list in PySpark Azure Databricks using index value?

Let’s look at how to convert columns to lists in PySpark Azure Databricks using index values with a practical example in this section.

Example:

In this example, let's try to convert the 'state' column values to a Python list.

# Index 2 refers to the third column, 'state'
df.rdd.map(lambda column: column[2]).collect()

"""
Output:

['DL', 'TN', 'DL', 'KL', 'TN']

"""

In the above example, we used the map() transformation to pick the value at index 2 out of each Row, and then applied the collect() action to bring those values back to the driver as a list. This is how we can collect column values into a list using a column index.

How to convert columns to list in PySpark Azure Databricks using the column name?

Let’s look at how to convert columns to lists in PySpark Azure Databricks using the column name with some practical examples in this section.

Example:

In this example, let's try to convert the 'state' column values to a Python list.

from pyspark.sql.functions import collect_list

# Method 1: Using RDD
df.rdd.map(lambda column: column.state).collect()

# Method 2: Using DataFrame
column_row_values = df.select("state").collect()
state_list = list(map(lambda row: row.state, column_row_values))
print(state_list)

# Method 3: Using the collect_list() aggregate function
df.select(collect_list("state").alias("states")).collect()[0]["states"]

# The above codes generate the following output

"""
Output:

['DL', 'TN', 'DL', 'KL', 'TN']

"""

How to convert columns to list in PySpark Azure Databricks using the flatMap() function?

In this section, we’ll look at how to convert columns to lists in PySpark Azure Databricks using the flatMap() function with a practical example.

Example:

In this example, let's try to convert the 'state' column values to a Python list.

df.select("state").rdd.flatMap(lambda state: state).collect()

"""
Output:

['DL', 'TN', 'DL', 'KL', 'TN']

"""

How to convert columns to list in PySpark Azure Databricks using a Pandas DataFrame?

In this section, we'll look at how to convert columns to lists in PySpark Azure Databricks using a Pandas DataFrame with a practical example.

Example:

In this example, let's try to convert the 'state' column values to a Python list.

# toPandas() brings the whole DataFrame to the driver as a pandas DataFrame
states = df.toPandas()['state']
print(list(states))

"""
Output:

['DL', 'TN', 'DL', 'KL', 'TN']

"""

How to convert columns to a list of Row types in PySpark Azure Databricks?

In this section, we'll look at how to convert columns to a list of Row types in PySpark Azure Databricks using various methods.

Example:

In this example, let's try to fetch the 'state' column values as a list of Row objects.

from pyspark.sql.functions import col

# Method 1: referring to the column by name
df.select("state").collect()

# Method 2: referring to the column as a DataFrame attribute
df.select(df.state).collect()

# Method 3: using the col() function
df.select(col("state")).collect()

# Method 4: using bracket notation
df.select(df["state"]).collect()

# The above code generates the following output

"""
Output:

[Row(state='DL'),
 Row(state='TN'),
 Row(state='DL'),
 Row(state='KL'),
 Row(state='TN')]

"""

How to convert multiple columns to Python lists in PySpark Azure Databricks?

In this section, we'll look at how to convert multiple column values into Python lists in PySpark Azure Databricks using Pandas.

Example:

# Convert to Pandas once, then pull each column out as a list
pandas_df = df.toPandas()
names = list(pandas_df["name"])
states = list(pandas_df["state"])

print(f"Name: {names}")
print(f"States: {states}")

"""
Output:

Name: ['Jacynth', 'Rand', 'Marquita', 'Rodrick', 'Ingram']
States: ['DL', 'TN', 'DL', 'KL', 'TN']

"""

How to remove duplicate column values after collecting them in PySpark Azure Databricks?

In this section, we'll look at how to remove duplicate column values after collecting them in PySpark Azure Databricks, using multiple methods.

Example:

from pyspark.sql.functions import collect_set

# Method 1: Using RDD
row_states = df.rdd.map(lambda column: column.state).collect()
print(list(set(row_states)))

# Method 2: Using DataFrame
row_states = df.select("state").collect()
unique_states = list(set(map(lambda row: row.state, row_states)))
print(unique_states)

# Method 3: Using the collect_set() aggregate function
df.select(collect_set("state").alias("state")).collect()[0]["state"]

# The above methods generate the following output

"""
Output:

['DL', 'TN', 'KL']

"""

I have attached the complete code used in this blog, in notebook format, to this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you convert column values into a list in PySpark Azure Databricks?

These could be the possible reasons:

  • You need the values on the driver, for example to pass them into plain Python code or a third-party library that expects a list.
  • You want to build a filter from the results of one query, for example by passing the collected list to isin() in a later query.
  • You need a small reference list, such as the distinct state codes, for validation or for driving a loop.

Real-world use case scenarios for converting column values into lists in PySpark Azure Databricks

  • Collecting the distinct values of a column such as state and iterating over them to process or export each subset of the data separately.
  • Fetching a small set of IDs from one DataFrame so that they can be used to filter another DataFrame.

What are the alternatives for converting column values into lists in PySpark Azure Databricks?

There are multiple alternatives for converting column values into lists in PySpark Azure Databricks, which are as follows (see the comparison sketch after this list):

  1. collect_list(): an aggregate function that gathers all of a column's values, duplicates included, into an array that you can then collect into a Python list.
  2. collect_set(): an aggregate function that gathers only a column's unique values into an array.
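
A minimal side-by-side sketch of the two aggregate functions on the example DataFrame:

from pyspark.sql.functions import collect_list, collect_set

# collect_list() keeps duplicates; collect_set() drops them
all_states = df.select(collect_list("state")).first()[0]
unique_states = df.select(collect_set("state")).first()[0]
print(all_states)     # ['DL', 'TN', 'DL', 'KL', 'TN']
print(unique_states)  # e.g. ['TN', 'DL', 'KL'] - element order is not guaranteed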

Final Thoughts

In this article, we have learned about converting column values into Python lists in PySpark Azure Databricks, with clearly explained examples. I have also covered the different scenarios that could come up, using practical examples. I hope the information provided here has helped you gain a good understanding of the topic.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.