Are you looking to split columns based on a delimiter in a PySpark DataFrame in Azure Databricks, or perhaps to split columns using a SQL expression with the split() function in PySpark Databricks? If so, you have landed on the correct page. I will show you how to use PySpark’s split() function on both DataFrames and SQL expressions in Azure Databricks, explaining each case with a practical example. So without wasting time, let’s start with a step-by-step guide to splitting columns in PySpark using the split() function.
In this blog, I will teach you the following with practical examples:
- Syntax of split()
- Split columns using DataFrame
- Split columns using SQL expression
- split() function limit parameter
- Split columns on multiple delimiters
PySpark’s split() function is used to split string columns of a DataFrame in Azure Databricks. The split() function takes a column name, a delimiter pattern, and an optional limit as arguments.
Syntax:
split(column_name, delimiter, limit)
Contents
- 1 What is the syntax of the split() function in PySpark Azure Databricks?
- 2 Create a simple DataFrame
- 3 How to split columns based on delimiters in PySpark Azure Databricks?
- 4 How to split columns in PySpark Azure Databricks using SQL expression?
- 5 How to use the limit parameter in the split() function in PySpark Azure Databricks?
- 6 How to split columns based on multiple delimiters in PySpark Azure Databricks?
- 7 When should you use the split() function in PySpark Azure Databricks?
- 8 Real World Use Case Scenarios for split() function in PySpark Azure Databricks?
- 9 What are the alternatives to the split() function in PySpark Azure Databricks?
- 10 Final Thoughts
What is the syntax of the split() function in PySpark Azure Databricks?
The syntax is as follows:
split(column_name, delimiter, limit)
| Parameter Name | Required | Description |
|----------------|----------|-------------|
| column_name | Yes | The column that has to be split. |
| delimiter | Yes | The pattern on which the column has to be split. |
| limit | Optional | The number of times the delimiter pattern is applied. |
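For instance, here is a minimal sketch of all three parameters used together. The DataFrame df and its full_name column are hypothetical at this point; a real DataFrame is created in the next section.
from pyspark.sql.functions import split

# Split the hypothetical "full_name" column on a comma;
# limit=2 caps the resulting array at two entries
df.select(split("full_name", ",", 2).alias("name_parts"))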
Apache Spark Official Documentation Link: split()
Create a simple DataFrame
Let’s understand the use of the split() function with various examples. Let’s start by creating a DataFrame.
Gentle reminder:
In Databricks,
- the SparkSession is made available as spark
- the SparkContext is made available as sc
In case you want to create them manually, use the code below.
from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext
a) Creating a PySpark DataFrame manually
data = [
("Andie,Ebbutt","Andie|Ebbutt|Ebbutt","Andie,Ebbutt Andie|Ebbutt"),
("Hobie,Deegan","Hobie|Deegan|Deegan","Hobie,Deegan Hobie|Deegan"),
("Denys,Belverstone","Denys|Belverstone|Belverstone","Denys,Belverstone Denys|Belverstone"),
("Delphine,Pietersma","Delphine|Pietersma|Pietersma","Delphine,Pietersma Delphine|Pietersma"),
("Putnem,Chasson","Putnem|Chasson|Chasson","Putnem,Chasson Putnem|Chasson")
]
df = spark.createDataFrame(data, schema=["names_1","names_2","names_3"])
df.printSchema()
df.show(truncate=False)
"""
root
|-- names_1: string (nullable = true)
|-- names_2: string (nullable = true)
|-- names_3: string (nullable = true)
+------------------+-----------------------------+-------------------------------------+
|names_1 |names_2 |names_3 |
+------------------+-----------------------------+-------------------------------------+
|Andie,Ebbutt |Andie|Ebbutt|Ebbutt |Andie,Ebbutt Andie|Ebbutt |
|Hobie,Deegan |Hobie|Deegan|Deegan |Hobie,Deegan Hobie|Deegan |
|Denys,Belverstone |Denys|Belverstone|Belverstone|Denys,Belverstone Denys|Belverstone |
|Delphine,Pietersma|Delphine|Pietersma|Pietersma |Delphine,Pietersma Delphine|Pietersma|
|Putnem,Chasson |Putnem|Chasson|Chasson |Putnem,Chasson Putnem|Chasson |
+------------------+-----------------------------+-------------------------------------+
"""
b) Creating a DataFrame by reading files
Download and use the below source file.
# replace file_path with the location of the source file you downloaded
df_2 = spark.read.format("csv").option("header", True).load(file_path)
df_2.printSchema()
"""
root
|-- names_1: string (nullable = true)
|-- names_2: string (nullable = true)
|-- names_3: string (nullable = true)
"""
Note: Here, I will be using the manually created DataFrame.
How to split columns based on delimiters in PySpark Azure Databricks?
Let’s see how to split columns of a PySpark DataFrame using the split() function in Azure Databricks.
Example:
from pyspark.sql.functions import split

# Split "names_1" on the comma; the result is an ArrayType column
s_df1 = df.select(split("names_1", ",").alias("comma_delimiter"))
s_df1.printSchema()
s_df1.show(truncate=False)
"""
Output:
root
|-- comma_delimiter: array (nullable = true)
| |-- element: string (containsNull = false)
+---------------------+
|comma_delimiter |
+---------------------+
|[Andie, Ebbutt] |
|[Hobie, Deegan] |
|[Denys, Belverstone] |
|[Delphine, Pietersma]|
|[Putnem, Chasson] |
+---------------------+
"""
How to split columns in PySpark Azure Databricks using SQL expression?
Let’s see how to split columns of a PySpark DataFrame using the SPLIT() function in a SQL expression in Azure Databricks.
Example:
To use raw SQL expressions in PySpark, we first have to register the DataFrame as a temporary SQL view.
df.createOrReplaceTempView("names")
spark.sql("SELECT SPLIT(names_1, ',') AS comma_delimiter FROM names") \
.show(truncate=False)
"""
Output:
+---------------------+
|comma_delimiter |
+---------------------+
|[Andie, Ebbutt] |
|[Hobie, Deegan] |
|[Denys, Belverstone] |
|[Delphine, Pietersma]|
|[Putnem, Chasson] |
+---------------------+
"""
How to use the limit parameter in the split() function in PySpark Azure Databricks?
The limit is an integer that controls the number of times the pattern is applied. By default, the value is -1.
- limit > 0: the length of the resulting array will not exceed the limit, and its final entry will contain all input beyond the last matched pattern.
- limit <= 0: the pattern is applied as many times as possible, and the resulting array can be of any size.
Example:
from pyspark.sql.functions import split, size
df.select(
    # "|" is a regex metacharacter, so it must be escaped; limit=2 caps the array at two entries
    split("names_2", r"\|", limit=2).alias("limit_delimiter"),
    size(split("names_2", r"\|", limit=2)).alias("length")
).show(truncate=False)
"""
Output:
+--------------------------------+------+
|limit_delimiter |length|
+--------------------------------+------+
|[Andie, Ebbutt|Ebbutt] |2 |
|[Hobie, Deegan|Deegan] |2 |
|[Denys, Belverstone|Belverstone]|2 |
|[Delphine, Pietersma|Pietersma] |2 |
|[Putnem, Chasson|Chasson] |2 |
+--------------------------------+------+
"""
How to split columns based on multiple delimiters in PySpark Azure Databricks?
Let’s see how to split columns of a PySpark DataFrame based on multiple delimiters using the split() function in Azure Databricks.
Example:
from pyspark.sql.functions import split, size
df.select(
    # "[, |]" is a regex character class: split on a comma, a space, or a pipe
    split("names_3", "[, |]").alias("multiple_delimiter"),
    size(split("names_3", "[, |]")).alias("length")
).show(truncate=False)
"""
Output:
+------------------------------------------+------+
|multiple_delimiter |length|
+------------------------------------------+------+
|[Andie, Ebbutt, Andie, Ebbutt] |4 |
|[Hobie, Deegan, Hobie, Deegan] |4 |
|[Denys, Belverstone, Denys, Belverstone] |4 |
|[Delphine, Pietersma, Delphine, Pietersma]|4 |
|[Putnem, Chasson, Putnem, Chasson] |4 |
+------------------------------------------+------+
"""
I have attached the complete code used in this blog in notebook format at this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.
When should you use the split() function in PySpark Azure Databricks?
These are the most common reasons:
- To split StringType columns with a delimiter
- To split StringType columns with multiple delimiters
Real World Use Case Scenarios for split() function in PySpark Azure Databricks?
- Assume you were given a ‘full_name’ column and a requirement to split its value into a first name and a last name, for example (‘John Cena’) into ([‘John’, ‘Cena’]). You can use the PySpark split() function to solve this by specifying the delimiter, in this case ‘ ‘ (a space).
- Assume you were given a time column in the format ‘HH.mm.ss:SSS’ and you have been asked to split its value into [‘HH’, ‘mm’, ‘ss’, ‘SSS’]. Note that here you have two delimiters (‘.’ and ‘:’); by passing both in a single regex pattern you can solve this kind of problem, as shown in the sketch after this list.
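For the second scenario, here is a minimal sketch; the event_time column and its value are hypothetical:
from pyspark.sql.functions import split

# Hypothetical time value in "HH.mm.ss:SSS" format
time_df = spark.createDataFrame([("12.30.45:500",)], schema=["event_time"])

# "[.:]" is a regex character class matching either delimiter;
# inside the class the dot is literal, so no escaping is needed
time_df.select(split("event_time", "[.:]").alias("time_parts")).show(truncate=False)
# Result: [12, 30, 45, 500]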
What are the alternatives to the split() function in PySpark Azure Databricks?
In PySpark, split() is the standard function for splitting string column values on a delimiter into an ArrayType column, and the equivalent SPLIT() function is available in PySpark SQL, as shown earlier.
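If you only need a single piece of a string rather than the whole array, substring_index() is a related function worth knowing: it returns the part of the string before (positive count) or after (negative count) the given number of delimiter occurrences. A short sketch:
from pyspark.sql.functions import substring_index

# Everything before the first comma, and everything after the last comma
df.select(
    substring_index("names_1", ",", 1).alias("before_comma"),
    substring_index("names_1", ",", -1).alias("after_comma")
).show(truncate=False)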
Final Thoughts
In this article, we have learned how to use the PySpark split() function to separate string values based on a delimiter or pattern in Azure Databricks, with clearly explained practical examples covering the different scenarios you are likely to meet. I hope the information provided helped you gain knowledge.
Please share your comments and suggestions in the comment section below, and I will try to answer all your queries as time permits.