How to convert column values to a list in PySpark Azure Databricks?

Are you looking for a way to convert a DataFrame column into a Python list in PySpark on Azure Databricks, whether through the RDD API or the DataFrame API? If so, you have landed on the correct page. I will show you how to perform these actions with PySpark in Azure Databricks, explaining each one with a practical example. So, without wasting any time, let's start the step-by-step guide to understanding how to convert column values to a list in PySpark Azure Databricks.

In this blog, I will teach you the following with practical examples:

  • Converting column values into a list using the column index
  • Converting column values into a list using the column name
  • Converting column values into a list using flatMap()
  • Converting column values into a list using Pandas
  • Getting column values as Row types
  • Converting multiple columns into Python lists
  • Removing duplicate column values

Create a simple DataFrame

Gentle reminder:

In Databricks,

  • the SparkSession is made available as spark
  • the SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

data = [ 
    (1,"Jacynth","DL"),
    (2,"Rand","TN"),
    (3,"Marquita","DL"),
    (4,"Rodrick","KL"),
    (5,"Ingram","TN")
]

df = spark.createDataFrame(data, schema=["id","name","state"])
df.printSchema()
df.show(truncate=False)

"""
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

+---+--------+-----+
|id |name    |state|
+---+--------+-----+
|1  |Jacynth |DL   |
|2  |Rand    |TN   |
|3  |Marquita|DL   |
|4  |Rodrick |KL   |
|5  |Ingram  |TN   |
+---+--------+-----+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace file_path with the location of the source file you downloaded

df_2 = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load(file_path)
df_2.printSchema()

"""
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)
"""

Do remember one thing: to fetch a column and convert its values into a list (or any other iterable), you have to select the column and then apply the collect() action on top of that transformation.
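For example, here is that general pattern as a minimal sketch (using the df created above): collect() returns a list of Row objects, which you then unpack into plain Python values.

# General pattern: select the column, collect the Row objects, unpack the values
rows = df.select("state").collect()        # list of Row objects
values = [row["state"] for row in rows]    # plain Python list
print(values)

"""
Output:

['DL', 'TN', 'DL', 'KL', 'TN']

"""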

How to convert columns to list in PySpark Azure Databricks using index value?

Let’s look at how to convert columns to lists in PySpark Azure Databricks using index values with a practical example in this section.

Example:

In this example, let's try to convert the 'state' column values to a Python list.

# Index 2 refers to the third column, 'state'
df.rdd.map(lambda column: column[2]).collect()

"""
Output:

['DL', 'TN', 'DL', 'KL', 'TN']

"""

In the above example, we used the map() transformation to pick the value at index 2 out of each Row, and then applied the collect() action to bring those values back to the driver as a list. This is how we can collect column values into a list using a column index.

How to convert columns to list in PySpark Azure Databricks using the column name?

Let’s look at how to convert columns to lists in PySpark Azure Databricks using the column name with some practical examples in this section.

Example:

In this example, let's try to convert the 'state' column values to a Python list.

from pyspark.sql.functions import collect_list

# Method 1: Using RDD
df.rdd.map(lambda column: column.state).collect()

# Method 2: Using DataFrame
column_row_values = df.select("state").collect()
state_list = list(map(lambda row: row.state, column_row_values))
print(state_list)

# Method 3: Using the collect_list() aggregate function
df.select(collect_list("state").alias("states")).collect()[0]["states"]

# The above codes generate the following output

"""
Output:

['DL', 'TN', 'DL', 'KL', 'TN']

"""

How to convert columns to list in PySpark Azure Databricks using the flatMap() function?

In this section, we’ll look at how to convert columns to lists in PySpark Azure Databricks using the flatMap() function with a practical example.

Example:

In this example, let's try to convert the 'state' column values to a Python list.

df.select("state").rdd.flatMap(lambda state: state).collect()

"""
Output:

['DL', 'TN', 'DL', 'KL', 'TN']

"""

How to convert columns to list in PySpark Azure Databricks using a Pandas DataFrame?

In this section, we'll look at how to convert columns to lists in PySpark Azure Databricks using a Pandas DataFrame with a practical example.

Example:

In this example, let's try to convert the 'state' column values to a Python list.

# toPandas() brings the whole DataFrame to the driver as a pandas DataFrame
states = df.toPandas()['state']
print(list(states))

"""
Output:

['DL', 'TN', 'DL', 'KL', 'TN']

"""

How to convert columns to a list of Row types in PySpark Azure Databricks?

In this section, we'll look at how to convert columns to a list of Row types in PySpark Azure Databricks using various methods.

Example:

In this example, let's try to fetch the 'state' column values as a list of Row objects.

from pyspark.sql.functions import col

# Method 1: referring to the column by name
df.select("state").collect()

# Method 2: referring to the column as a DataFrame attribute
df.select(df.state).collect()

# Method 3: using the col() function
df.select(col("state")).collect()

# Method 4: using bracket notation
df.select(df["state"]).collect()

# The above code generates the following output

"""
Output:

[Row(state='DL'),
 Row(state='TN'),
 Row(state='DL'),
 Row(state='KL'),
 Row(state='TN')]

"""

How to convert multiple columns to Python lists in PySpark Azure Databricks?

In this section, we'll look at how to convert multiple column values into Python lists in PySpark Azure Databricks using Pandas.

Example:

# Convert to Pandas once, then pull each column out as a list
pandas_df = df.toPandas()
names = list(pandas_df["name"])
states = list(pandas_df["state"])

print(f"Name: {names}")
print(f"States: {states}")

"""
Output:

Name: ['Jacynth', 'Rand', 'Marquita', 'Rodrick', 'Ingram']
States: ['DL', 'TN', 'DL', 'KL', 'TN']

"""

How to remove duplicate column values after collecting them in PySpark Azure Databricks?

In this section, we'll look at how to remove duplicate column values after collecting them in PySpark Azure Databricks, using multiple methods.

Example:

from pyspark.sql.functions import collect_set

# Method 1: Using RDD
row_states = df.rdd.map(lambda column: column.state).collect()
print(list(set(row_states)))

# Method 2: Using DataFrame
row_states = df.select("state").collect()
unique_states = list(set(map(lambda row: row.state, row_states)))
print(unique_states)

# Method 3: Using the collect_set() aggregate function
df.select(collect_set("state").alias("state")).collect()[0]["state"]

# The above methods generate the following output

"""
Output:

['DL', 'TN', 'KL']

"""

I have attached the complete code used in this blog, in notebook format, to this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you convert column values into a list in PySpark Azure Databricks?

These could be the possible reasons:

  • You need the values on the driver, for example to pass them into plain Python code or a third-party library that expects a list.
  • You want to build a filter from the results of one query, for example by passing the collected list to isin() in a later query.
  • You need a small reference list, such as the distinct state codes, for validation or for driving a loop.

Real-world use case scenarios for converting column values into lists in PySpark Azure Databricks

  • Collecting the distinct values of a column such as state and iterating over them to process or export each subset of the data separately.
  • Fetching a small set of IDs from one DataFrame so that they can be used to filter another DataFrame.

What are the alternatives for converting column values into lists in PySpark Azure Databricks?

There are multiple alternatives for converting column values into lists in PySpark Azure Databricks, which are as follows (see the comparison sketch after this list):

  1. collect_list(): an aggregate function that gathers all of a column's values, duplicates included, into an array that you can then collect into a Python list.
  2. collect_set(): an aggregate function that gathers only a column's unique values into an array.
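
A minimal side-by-side sketch of the two aggregate functions on the example DataFrame:

from pyspark.sql.functions import collect_list, collect_set

# collect_list() keeps duplicates; collect_set() drops them
all_states = df.select(collect_list("state")).first()[0]
unique_states = df.select(collect_set("state")).first()[0]
print(all_states)     # ['DL', 'TN', 'DL', 'KL', 'TN']
print(unique_states)  # e.g. ['TN', 'DL', 'KL'] - element order is not guaranteed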

Final Thoughts

In this article, we have learned about converting column values into Python lists in PySpark Azure Databricks, with clearly explained examples. I have also covered the different scenarios that could come up, using practical examples. I hope the information provided here has helped you gain a good understanding of the topic.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.