How to replace characters of a column in PySpark Azure Databricks?

Are you looking to find out how to replace characters in a column in PySpark using Azure Databricks, or maybe you are looking for a solution to remove characters in PySpark Databricks using the translate() function? If you are looking for a solution to any of these problems, you have landed on the correct page. I will show you how to replace and remove characters in PySpark using the translate() function in Azure Databricks, and I will explain it with practical examples. So, without wasting time, let’s start with a step-by-step guide to understanding how to use the translate() function in PySpark.

In this blog, I will teach you the following with practical examples:

  • Syntax of translate()
  • Replacing one character with another character
  • Replacing multiple characters with other characters
  • Removing characters

The PySpark translate() function takes a column name, the characters that need to be replaced, and the characters to replace them with.

What is the syntax of the translate() function in PySpark Azure Databricks?

The syntax is as follows:

translate(column_name, matching_character, replacing_character)
Parameter Name      | Required | Description
--------------------|----------|------------------------------------------
column_name         | Yes      | The column to operate on.
matching_character  | Yes      | The characters that need to be replaced.
replacing_character | Yes      | The characters to replace them with.

Table 1: translate() Method in PySpark Databricks Parameter list with Details

Apache Spark Official Documentation Link: translate()
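
To get a quick feel for the syntax, here is a minimal sketch; the DataFrame df and the column name col_a are hypothetical placeholders:

from pyspark.sql.functions import translate

# Characters are paired by position: 'a' -> 'x' and 'b' -> 'y' in column "col_a".
df.select(translate("col_a", "ab", "xy").alias("translated"))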

Create a simple DataFrame

Let’s understand the use of the translate() function with a variety of examples. Let’s start by creating a DataFrame.

Gentle reminder:

In Databricks,

  • the SparkSession is made available as spark
  • the SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Create a manual PySpark DataFrame

data = [
    ("Andie,Ebbutt","Andie|Ebbutt|Ebbutt","Andie,Ebbutt Andie|Ebbutt"),
    ("Hobie,Deegan","Hobie|Deegan|Deegan","Hobie,Deegan Hobie|Deegan"),
    ("Denys,Belverstone","Denys|Belverstone|Belverstone","Denys,Belverstone Denys|Belverstone"),
    ("Delphine,Pietersma","Delphine|Pietersma|Pietersma","Delphine,Pietersma Delphine|Pietersma"),
    ("Putnem,Chasson","Putnem|Chasson|Chasson","Putnem,Chasson Putnem|Chasson")
]

df = spark.createDataFrame(data, schema=["names_1","names_2","names_3"])
df.printSchema()
df.show(truncate=False)

"""
root
 |-- names_1: string (nullable = true)
 |-- names_2: string (nullable = true)
 |-- names_3: string (nullable = true)

+------------------+-----------------------------+-------------------------------------+
|names_1           |names_2                      |names_3                              |
+------------------+-----------------------------+-------------------------------------+
|Andie,Ebbutt      |Andie|Ebbutt|Ebbutt          |Andie,Ebbutt Andie|Ebbutt            |
|Hobie,Deegan      |Hobie|Deegan|Deegan          |Hobie,Deegan Hobie|Deegan            |
|Denys,Belverstone |Denys|Belverstone|Belverstone|Denys,Belverstone Denys|Belverstone  |
|Delphine,Pietersma|Delphine|Pietersma|Pietersma |Delphine,Pietersma Delphine|Pietersma|
|Putnem,Chasson    |Putnem|Chasson|Chasson       |Putnem,Chasson Putnem|Chasson        |
+------------------+-----------------------------+-------------------------------------+
"""

b) Create a DataFrame by reading files

Download and use the below source file.

# Replace file_path with the location of the source file you downloaded.

df_2 = spark.read.format("csv").option("header", True).load(file_path)
df_2.printSchema()

"""
root
 |-- names_1: string (nullable = true)
 |-- names_2: string (nullable = true)
 |-- names_3: string (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to replace a character in a column with another in PySpark Azure Databricks?

Let’s see how to replace a single character with another character in a PySpark DataFrame in Azure Databricks.

Example 1:

In this example, let’s try to replace the character “,” with “*”.

# Using select()

from pyspark.sql.functions import translate

df.select("names_1", translate("names_1", ",", "*").alias("names_translate")).show()

"""
Output:

+------------------+------------------+
|           names_1|   names_translate|
+------------------+------------------+
|      Andie,Ebbutt|      Andie*Ebbutt|
|      Hobie,Deegan|      Hobie*Deegan|
| Denys,Belverstone| Denys*Belverstone|
|Delphine,Pietersma|Delphine*Pietersma|
|    Putnem,Chasson|    Putnem*Chasson|
+------------------+------------------+


"""

Example 2:

In this example, let’s try to replace the character “,” with “+”.

# Using withColumn()

from pyspark.sql.functions import translate

df.withColumn("names_translate", translate("names_1", ",", "+"))\
.select("names_1", "names_translate").show()

"""
Output:

+------------------+------------------+
|           names_1|   names_translate|
+------------------+------------------+
|      Andie,Ebbutt|      Andie+Ebbutt|
|      Hobie,Deegan|      Hobie+Deegan|
| Denys,Belverstone| Denys+Belverstone|
|Delphine,Pietersma|Delphine+Pietersma|
|    Putnem,Chasson|    Putnem+Chasson|
+------------------+------------------+

"""

How to replace multiple characters in a column in PySpark Azure Databricks?

Let’s see how to replace multiple characters with other characters in a PySpark DataFrame in Azure Databricks.

Let’s try to replace the following characters in this example:

  • “,” with “1”
  • “ ” (a space) with “2”
  • “|” with “3”

Example 1:

# Using select

from pyspark.sql.functions import translate

df.select("names_3", 
          translate("names_3", ", |", "123").alias("names_translate")
         ).show(truncate=False)

"""
Output:

+-------------------------------------+-------------------------------------+
|names_3                              |names_translate                      |
+-------------------------------------+-------------------------------------+
|Andie,Ebbutt Andie|Ebbutt            |Andie1Ebbutt2Andie3Ebbutt            |
|Hobie,Deegan Hobie|Deegan            |Hobie1Deegan2Hobie3Deegan            |
|Denys,Belverstone Denys|Belverstone  |Denys1Belverstone2Denys3Belverstone  |
|Delphine,Pietersma Delphine|Pietersma|Delphine1Pietersma2Delphine3Pietersma|
|Putnem,Chasson Putnem|Chasson        |Putnem1Chasson2Putnem3Chasson        |
+-------------------------------------+-------------------------------------+

"""

Example 2:

# Using withColumn

from pyspark.sql.functions import translate

df.withColumn("names_translate", translate("names_3", ", |", "123"))\
.select("names_3", "names_translate")\
.show(truncate=False)

"""
Output:

+-------------------------------------+-------------------------------------+
|names_3                              |names_translate                      |
+-------------------------------------+-------------------------------------+
|Andie,Ebbutt Andie|Ebbutt            |Andie1Ebbutt2Andie3Ebbutt            |
|Hobie,Deegan Hobie|Deegan            |Hobie1Deegan2Hobie3Deegan            |
|Denys,Belverstone Denys|Belverstone  |Denys1Belverstone2Denys3Belverstone  |
|Delphine,Pietersma Delphine|Pietersma|Delphine1Pietersma2Delphine3Pietersma|
|Putnem,Chasson Putnem|Chasson        |Putnem1Chasson2Putnem3Chasson        |
+-------------------------------------+-------------------------------------+

"""

How to remove characters from a column in PySpark Azure Databricks?

Let’s see how to remove characters from a column of a PySpark DataFrame in Azure Databricks.

Example 1:

In this example, let’s see how to remove the delimiters using the translate() function.

from pyspark.sql.functions import translate

df.select("names_3", 
          translate("names_3", ", |", "").alias("names_translate")
         ).show(truncate=False)

"""
Output:

+-------------------------------------+----------------------------------+
|names_3                              |names_translate                   |
+-------------------------------------+----------------------------------+
|Andie,Ebbutt Andie|Ebbutt            |AndieEbbuttAndieEbbutt            |
|Hobie,Deegan Hobie|Deegan            |HobieDeeganHobieDeegan            |
|Denys,Belverstone Denys|Belverstone  |DenysBelverstoneDenysBelverstone  |
|Delphine,Pietersma Delphine|Pietersma|DelphinePietersmaDelphinePietersma|
|Putnem,Chasson Putnem|Chasson        |PutnemChassonPutnemChasson        |
+-------------------------------------+----------------------------------+

"""

In the above example, we passed an empty string as the replacing character. Since no replacing characters are given for any of the matching characters, PySpark removes the matching characters from the column.

Example 2:

In this example, let’s see what happens when we don’t provide a replacing character for some of the matching characters.

from pyspark.sql.functions import translate

df.withColumn("names_translate", translate("names_3", ", |", "*"))\
.select("names_3", "names_translate")\
.show(truncate=False)

"""
Output:

+-------------------------------------+-----------------------------------+
|names_3                              |names_translate                    |
+-------------------------------------+-----------------------------------+
|Andie,Ebbutt Andie|Ebbutt            |Andie*EbbuttAndieEbbutt            |
|Hobie,Deegan Hobie|Deegan            |Hobie*DeeganHobieDeegan            |
|Denys,Belverstone Denys|Belverstone  |Denys*BelverstoneDenysBelverstone  |
|Delphine,Pietersma Delphine|Pietersma|Delphine*PietersmaDelphinePietersma|
|Putnem,Chasson Putnem|Chasson        |Putnem*ChassonPutnemChasson        |
+-------------------------------------+-----------------------------------+

"""

In the above example, we only provided a replacing character for “,” and did not provide replacing characters for the rest of the matching characters. Because translate() pairs characters by position, “,” becomes “*”, while the unmatched characters (“ ” and “|”) are removed from the column.

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you use the PySpark translate() function in Azure Databricks?

These could be the possible reasons:

  1. To replace a single character with another
  2. To replace multiple characters with others
  3. To remove column characters

Real World Use Case Scenarios for PySpark DataFrame translate() in Azure Databricks

  • Assume you were given a date column of StringType. The column values use different delimiters, and you have been asked to replace every delimiter with ‘-’, for example turning ‘2022/11/08’ into ‘2022-11-08’. You can perform this activity using the PySpark translate() function.
  • Assume you were given a date column of StringType. The column values use different delimiters, and you have been asked to remove all delimiters, for example turning ‘2022-11-08’ into ‘20221108’. You can also perform this activity using the PySpark translate() function, as shown in the sketch below.
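
Here is a minimal sketch of both scenarios. The DataFrame dates_df and the column name event_date are hypothetical, and I assume the only delimiters present are ‘/’, ‘.’, and ‘-’:

from pyspark.sql.functions import translate

dates_df = spark.createDataFrame([("2022/11/08",), ("2022.11.09",)], ["event_date"])

# Replace delimiters: '/' -> '-' and '.' -> '-' (characters are paired by position)
dates_df.select(translate("event_date", "/.", "--").alias("normalized")).show()

# Remove delimiters: no replacing characters given, so '/', '.', and '-' are all deleted
dates_df.select(translate("event_date", "/.-", "").alias("compact")).show()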

What are the alternatives to the translate() function in PySpark Azure Databricks?

There are multiple alternatives to the translate() function, which are as follows:

  • translate(): replaces column characters with other characters.
  • regexp_replace(): replaces column values that match a regular expression.
  • overlay(): replaces part of a source column with a replacing value, starting at a position and proceeding for a length.
  • substring(): extracts part of a column, starting at a position and proceeding for a length.
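
Below is a hedged sketch of these alternatives applied to the names_1 column of the DataFrame created earlier; the pattern, positions, and lengths are arbitrary illustrations:

from pyspark.sql.functions import regexp_replace, overlay, substring, lit

# regexp_replace(): replace every comma, space, or pipe with '-'
df.select(regexp_replace("names_1", "[, |]", "-").alias("regex_replaced")).show()

# overlay(): overwrite 5 characters of names_1 with 'XXXXX', starting at position 1
df.select(overlay("names_1", lit("XXXXX"), 1, 5).alias("overlaid")).show()

# substring(): extract 5 characters of names_1, starting at position 1
df.select(substring("names_1", 1, 5).alias("first_five")).show()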

Final Thoughts

In this article, we have learned how to use the PySpark translate() method to replace characters in DataFrame columns in Azure Databricks, with clearly explained examples. I have also covered the different scenarios you are likely to encounter, with practical examples. I hope the information provided helped you gain knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.