How to split columns in PySpark Azure Databricks?

Are you looking to find out how to split columns of a PySpark DataFrame based on a delimiter in Azure Databricks, or perhaps how to split columns using a SQL expression in PySpark Databricks with the split() function? If you are looking for a solution to either of these problems, you have landed on the correct page. I will show you how to use the PySpark split() function on both DataFrames and SQL expressions in Azure Databricks, explained with practical examples. So without wasting time, let's start with a step-by-step guide to splitting columns in PySpark using the split() function.

In this blog, I will teach you the following with practical examples:

  • Syntax of split()
  • Split columns using DataFrame
  • Split columns using SQL expression
  • split() function limit parameter
  • Split columns on multiple delimiters

PySpark's split() function is used to split columns of a DataFrame in Azure Databricks. The split() function takes a column name, a delimiter pattern, and an optional limit as arguments.

What is the syntax of the split() function in PySpark Azure Databricks?

The syntax is as follows:

split(column_name, delimiter, limit)
Parameter Name | Required | Description
---------------|----------|------------
column_name    | Yes      | It represents the column that has to be split.
delimiter      | Yes      | It represents the pattern on which the column has to be split.
limit          | Optional | It represents the number of times the delimiter pattern has to be applied.

Table 1: split() Method in PySpark Databricks Parameter list with Details

Apache Spark Official Documentation Link: split()
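To get a quick feel for these three parameters, here is a minimal sketch (assuming a SparkSession is already available as spark, as it is in Databricks; the column csv_col and its sample value are placeholders made up purely for illustration):

from pyspark.sql.functions import split

# Hypothetical one-column DataFrame holding the value "a,b,c,d".
demo = spark.createDataFrame([("a,b,c,d",)], schema=["csv_col"])

# Apply the comma pattern at most once (limit=2), giving ["a", "b,c,d"].
demo.select(split("csv_col", ",", 2).alias("parts")).show(truncate=False)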

Create a simple DataFrame

Let’s understand the use of the split() function with various examples. Let’s start by creating a DataFrame.

Gentle reminder:

In Databricks,

  • the SparkSession is made available as spark
  • the SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

data = [
    ("Andie,Ebbutt","Andie|Ebbutt|Ebbutt","Andie,Ebbutt Andie|Ebbutt"),
    ("Hobie,Deegan","Hobie|Deegan|Deegan","Hobie,Deegan Hobie|Deegan"),
    ("Denys,Belverstone","Denys|Belverstone|Belverstone","Denys,Belverstone Denys|Belverstone"),
    ("Delphine,Pietersma","Delphine|Pietersma|Pietersma","Delphine,Pietersma Delphine|Pietersma"),
    ("Putnem,Chasson","Putnem|Chasson|Chasson","Putnem,Chasson Putnem|Chasson")
]

df = spark.createDataFrame(data, schema=["names_1","names_2","names_3"])
df.printSchema()
df.show(truncate=False)

"""
root
 |-- names_1: string (nullable = true)
 |-- names_2: string (nullable = true)
 |-- names_3: string (nullable = true)

+------------------+-----------------------------+-------------------------------------+
|names_1           |names_2                      |names_3                              |
+------------------+-----------------------------+-------------------------------------+
|Andie,Ebbutt      |Andie|Ebbutt|Ebbutt          |Andie,Ebbutt Andie|Ebbutt            |
|Hobie,Deegan      |Hobie|Deegan|Deegan          |Hobie,Deegan Hobie|Deegan            |
|Denys,Belverstone |Denys|Belverstone|Belverstone|Denys,Belverstone Denys|Belverstone  |
|Delphine,Pietersma|Delphine|Pietersma|Pietersma |Delphine,Pietersma Delphine|Pietersma|
|Putnem,Chasson    |Putnem|Chasson|Chasson       |Putnem,Chasson Putnem|Chasson        |
+------------------+-----------------------------+-------------------------------------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# Replace file_path with the location of the source file you downloaded.

df_2 = spark.read.format("csv").option("header", True).load(file_path)
df_2.printSchema()

"""
root
 |-- names_1: string (nullable = true)
 |-- names_2: string (nullable = true)
 |-- names_3: string (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to split columns based on delimiters in PySpark Azure Databricks?

Let’s see how to split columns of PySpark’s DataFrame using a split() in Azure Databricks.

Example:

from pyspark.sql.functions import split

s_df1 = df.select(split("names_1", ",").alias("comma_delimiter"))
s_df1.printSchema()
s_df1.show(truncate=False)

"""
Output:

root
 |-- comma_delimiter: array (nullable = true)
 |    |-- element: string (containsNull = false)

+---------------------+
|comma_delimiter      |
+---------------------+
|[Andie, Ebbutt]      |
|[Hobie, Deegan]      |
|[Denys, Belverstone] |
|[Delphine, Pietersma]|
|[Putnem, Chasson]    |
+---------------------+

"""

How to split columns in PySpark Azure Databricks using SQL expression?

Let’s see how to split columns using the SQL expression of PySpark DataFrame using a split() in Azure Databricks.

Example:

In order to use raw SQL expressions in PySpark, we have to convert DataFrame to a SQL view.

df.createOrReplaceTempView("names")

spark.sql("SELECT SPLIT(names_1, ',') AS comma_delimiter FROM names") \
.show(truncate=False)

"""
Output:

+---------------------+
|comma_delimiter      |
+---------------------+
|[Andie, Ebbutt]      |
|[Hobie, Deegan]      |
|[Denys, Belverstone] |
|[Delphine, Pietersma]|
|[Putnem, Chasson]    |
+---------------------+

"""

How to use the limit parameter in the split() function in PySpark Azure Databricks?

The limit is an integer that controls the number of times the pattern is applied. By default, the value is -1.

  • limit > 0: the length of the resulting array will not exceed limit, and the array’s final entry will contain all input beyond the last matched pattern.
  • limit <= 0: the pattern will be applied as many times as possible, and the resulting array can be of any length.

Example:

from pyspark.sql.functions import split, size

df.select(
    split("names_2", "\|", limit=2).alias("limit_delimiter"),
    size(split("names_2", "\|", limit=2)).alias("length"),
).show(truncate=False)

"""
Output:

+--------------------------------+------+
|limit_delimiter                 |length|
+--------------------------------+------+
|[Andie, Ebbutt|Ebbutt]          |2     |
|[Hobie, Deegan|Deegan]          |2     |
|[Denys, Belverstone|Belverstone]|2     |
|[Delphine, Pietersma|Pietersma] |2     |
|[Putnem, Chasson|Chasson]       |2     |
+--------------------------------+------+

"""

How to split columns based on multiple delimiters in PySpark Azure Databricks?

Let’s see how to split columns based on multiple delimiters of PySpark’s DataFrame using a split() in Azure Databricks.

Example:

from pyspark.sql.functions import split, size

df.select(
    split("names_3", "[, |]").alias("multiple_delimiter"),
    size(split("names_3", "[, |]")).alias("length"),
).show(truncate=False)

"""
Output:

+------------------------------------------+------+
|multiple_delimiter                        |length|
+------------------------------------------+------+
|[Andie, Ebbutt, Andie, Ebbutt]            |4     |
|[Hobie, Deegan, Hobie, Deegan]            |4     |
|[Denys, Belverstone, Denys, Belverstone]  |4     |
|[Delphine, Pietersma, Delphine, Pietersma]|4     |
|[Putnem, Chasson, Putnem, Chasson]        |4     |
+------------------------------------------+------+

"""

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you use the split() function in PySpark Azure Databricks?

These could be the possible reasons:

  1. To split StringType columns with a delimiter
  2. To split StringType columns with multiple delimiters

Real World Use Case Scenarios for the split() function in PySpark Azure Databricks

  • Assume you were given a ‘full_name’ column and a requirement to split its value into a first name and a last name, for example (‘John Cena’) into ([‘John’, ‘Cena’]). You can use the PySpark split() function to solve this problem by specifying the delimiter, in this case ‘ ‘ (a space), as shown in the sketch after this list.
  • Assume you were given a time column in the format ‘HH.mm.ss:SSS’ and you have been asked to split the value into [‘HH’, ‘mm’, ‘ss’, ‘SSS’]. Note that here you have two delimiters (‘.’ and ‘:’). By passing both delimiters as a pattern you can solve this kind of problem.
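Here is a rough sketch of both scenarios; the full_name and event_time columns and their sample values are assumptions made purely for illustration:

from pyspark.sql.functions import split

demo = spark.createDataFrame(
    [("John Cena", "10.30.45:250")],
    schema=["full_name", "event_time"],
)

demo.select(
    split("full_name", " ").alias("name_parts"),      # ['John', 'Cena']
    split("event_time", "[.:]").alias("time_parts"),  # ['10', '30', '45', '250']
).show(truncate=False)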

What are the alternatives to the split() function in PySpark Azure Databricks?

The PySpark split() function is the main way to split string column values on a delimiter into an ArrayType column. The SPLIT() function in PySpark SQL is its SQL-expression equivalent, as shown earlier.

Final Thoughts

In this article, we have learned how to use the PySpark split() method to separate string values based on a delimiter or pattern in Azure Databricks, with clearly explained examples. I have also covered the different practical scenarios that could come up. I hope the information provided helped you gain knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

Arud Seka Berne S

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.