Are you looking to find how to drop duplicates of PySpark Dataframe into Azure Databricks cloud or maybe you are looking for a solution, to get unique records of a Dataframe in PySpark Databricks? If you are looking for any of these problem solutions, then you have landed on the correct page. I will also show you what and how to use PySpark to drop all duplicate records of a Dataframe in Azure Databricks. I will explain it with a practical example. So don’t waste time let’s start with a step-by-step guide to understanding how to get rid of duplicate elements using a PySpark Dataframe.
In this blog, I will teach you the following with practical examples:
- Syntax of dropDuplicates()
- Drop records of duplicated values
- Drop records based on the selected column duplicate values
dropDuplicates() method is used to drop or remove duplicate records of Dataframe based on columns specified in PySpark Azure Databricks.
Syntax: dataframe_name.dropDuplicates(column_names)
Contents
- 1 What is the syntax of the dropDuplicates() function in PySpark Azure Databricks?
- 2 Create a simple DataFrame
- 3 How to drop duplicate records in PySpark Azure Databricks?
- 4 How to drop duplicate records based on selected columns in PySpark Azure Databricks?
- 5 When should you use the PySpark dropDuplicates() function in Azure Databricks?
- 6 Real World Use Case Scenarios for PySpark DataFrame collect() function in Azure Databricks?
- 7 What are the alternatives of the dropDuplicates() function in PySpark Azure Databricks?
- 8 Final Thoughts
What is the syntax of the dropDuplicates() function in PySpark Azure Databricks?
The syntax is as follows:
dataframe_name.dropDuplicates(subset)
Parameter Name | Required | Description |
subset (List[str]) | Optional | It represents the list of columns to be considered for duplicate check |
Apache Spark Official documentation link: dropDuplicates()
Create a simple DataFrame
Gentle reminder:
In Databricks,
- sparkSession made available as spark
- sparkContext made available as sc
In case, you want to create it manually, use the below code.
from pyspark.sql.session import SparkSession
spark = SparkSession.builder
.master("local[*]")
.appName("azurelib.com")
.getOrCreate()
sc = spark.sparkContext
a) Create manual PySpark DataFrame
data = [
("Meenu", "Teacher", "Trichy"),
("Meenu", "Teacher", "Trichy"),
("Chandru", "Electrician", "Salem"),
("Chandru", "Electrician", "Chennai")
]
df = spark.createDataFrame(data = data, schema = ["name", "designation", "city"])
df.printSchema()
df.show()
"""
root
|-- name: string (nullable = true)
|-- designation: string (nullable = true)
|-- city: string (nullable = true)
+-------+-----------+-------+
| name|designation| city|
+-------+-----------+-------+
| Meenu| Teacher| Trichy|
| Meenu| Teacher| Trichy|
|Chandru|Electrician| Salem|
|Chandru|Electrician|Chennai|
+-------+-----------+-------+
"""
b) Creating a DataFrame by reading files
Download and use the below source file.
# replace the file_paths with the source file location which you have downloaded.
df_2 = spark.read.format("csv").option("header", True).load(file_path)
df_2.printSchema()
"""
root
|-- name: string (nullable = true)
|-- designation: string (nullable = true)
|-- city: string (nullable = true)
"""
Note: Here, I will be using the manually created DataFrame.
How to drop duplicate records in PySpark Azure Databricks?
The PySpark function dropDuplicates() is used to drop duplicate records from a DataFrame. It is useful in retrieving all the unique elements of the Dataframe.
Drop Duplicate Example in pyspark:
In the below example, we are trying to drop only duplicate records out of all records.
df.dropDuplicates().show()
"""
Output:
+-------+-----------+-------+
| name|designation| city|
+-------+-----------+-------+
| Meenu| Teacher| Trichy|
|Chandru|Electrician| Salem|
|Chandru|Electrician|Chennai|
+-------+-----------+-------+
"""
In the above example, you can see that we have two duplicate records, rows 2 and 4. But we have only dropped the second record because record 1 and 2 record is exactly the same whereas 3 and 4 have a difference in the column city.
How to drop duplicate records based on selected columns in PySpark Azure Databricks?
By specifying the column names as a List of strings in a subset argument, we can drop duplicates based on a particular column.
Drop Duplicated based on Selected Column Example:
In this example, we are trying to drop records 2 and 4 based on the name and designation column.
df.dropDuplicates(subset=["name", "designation"]).show()
"""
Output:
+-------+-----------+------+
| name|designation| city|
+-------+-----------+------+
|Chandru|Electrician| Salem|
| Meenu| Teacher|Trichy|
+-------+-----------+------+
"""
I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook in databricks, jupyter notebook, etc.
When should you use the PySpark dropDuplicates() function in Azure Databricks?
You can use the dropDuplicates() function to drop all duplicate records from a DataFrame. Because the dropDuplicates() function is a transformation that provides a Dataframe from an existing DataFrame.
Real World Use Case Scenarios for PySpark DataFrame collect() function in Azure Databricks?
- Assume that due to bad data the existing data frame has many duplicate records and you may want to solve this bad data issue. In this scenario, you can use drop_duplicate method to delete those records from the DataFrame.
- Assume that you have an employee who has to be unique across the employee DataFrame. However, due to a bad ETL job, some records have been inserted as duplicate employee IDs in the DataFrame. In this case, we can use the drop duplicate with the employee id as the column to remove repetitive emp id.
What are the alternatives of the dropDuplicates() function in PySpark Azure Databricks?
Use the distinct() function to drop duplicate records from a DataFrame. But the distinct() function is used to drop/remove duplicate rows based on all columns. Whereas in dropDuplicates() function, you can specify the selected columns.
Final Thoughts
In this article, we have learned about the PySpark dropDuplicates() method to remove duplicate records or rows of DataFrame in Azure Databricks, along with the examples explained clearly. I have also covered different scenarios with practical examples that could be possible. I hope the information that was provided helped in gaining knowledge.
Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.
- For Azure Study material Join Telegram group : Telegram group link:
- Azure Jobs and other updates Follow me on LinkedIn: Azure Updates on LinkedIn
- Azure Tutorial Videos: Videos Link
- Azure Databricks Lesson 1
- Azure Databricks Lesson 2
- Azure Databricks Lesson 3
- Azure Databricks Lesson 4
- Azure Databricks Lesson 5
- Azure Databricks Lesson 6
- Azure Databricks Lesson 7
As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.