Are you looking for a way to use the join() function on a PySpark DataFrame in Azure Databricks, or for a solution to join two DataFrames in PySpark Databricks? If so, you have landed on the correct page. In this post I will show you what the PySpark join() function does and how to use it to join two DataFrames in Azure Databricks, explained with practical examples. So without wasting time, let's start the step-by-step guide to joining two DataFrames with PySpark.
In this blog, I will teach you the following with practical examples:
- Syntax of join()
- Common Join types
- Semi Join
- Anti Join
- Cross Join
- Self Join
The join() method is used to join two DataFrames together based on a specified condition in PySpark on Azure Databricks.
Syntax: dataframe_name.join()
Contents
- 1 What is the syntax of the join() function in PySpark Azure Databricks?
- 2 Create a simple DataFrame
- 3 How to join DataFrames in PySpark Azure Databricks?
- 4 How to perform Semi Join in PySpark Azure Databricks?
- 5 How to perform Anti Join in PySpark Azure Databricks?
- 6 How to perform Cross Join in PySpark Azure Databricks?
- 7 How to perform Self Join in PySpark Azure Databricks?
- 8 When should you use the PySpark join() function in Azure Databricks?
- 9 Real World Use Case Scenarios for join() function in PySpark Azure Databricks?
- 10 What are the alternatives of the join() function in PySpark Azure Databricks?
- 11 Final Thoughts
What is the syntax of the join() function in PySpark Azure Databricks?
The syntax is as follows:
dataframe_name.join(other, on, how)
| Parameter Name | Required | Description |
| --- | --- | --- |
| other (DataFrame) | Yes | The second DataFrame, i.e. the right side of the join. |
| on (str, list, or Column) | Optional | A string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides. |
| how (str) | Optional | The join type; by default how="inner". |
Apache Spark Official Documentation Link: join()
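To make the parameter forms concrete, here is a minimal sketch of the ways the on argument can be passed. The DataFrames df1 and df2 and their columns are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrames sharing the columns "id" and "year".
df1 = spark.createDataFrame([(1, 2020, "a")], ["id", "year", "val1"])
df2 = spark.createDataFrame([(1, 2020, "b")], ["id", "year", "val2"])

df1.join(df2, "id").show()                            # on as a single column name
df1.join(df2, ["id", "year"]).show()                  # on as a list of column names
df1.join(df2, df1["id"] == df2["id"], "left").show()  # on as a join expression (Column)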
Create a simple DataFrame
Gentle reminder:
In Databricks,
- the SparkSession is made available as spark
- the SparkContext is made available as sc
In case you want to create them manually, use the below code.
from pyspark.sql.session import SparkSession
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()
sc = spark.sparkContext
a) Creating a DataFrame manually
# 1. Employee Dataset
emp_data = [
(1,"Bel",2),
(2,"Thornie",1),
(3,"Chris",2),
(4,"Aloysia",2),
(5,"Dorena",5)
]
emp_df = spark.createDataFrame(data = emp_data, schema = ["emp_id", "emp_name", "dept_id"])
emp_df.printSchema()
emp_df.show(truncate=False)
"""
root
|-- emp_id: long (nullable = true)
|-- emp_name: string (nullable = true)
|-- dept_id: long (nullable = true)
+------+--------+-------+
|emp_id|emp_name|dept_id|
+------+--------+-------+
|1 |Bel |2 |
|2 |Thornie |1 |
|3 |Chris |2 |
|4 |Aloysia |2 |
|5 |Dorena |5 |
+------+--------+-------+
"""
# 2. Department Dataset
dept_data = [
(1, "IT"),
(2, "HR"),
(3, "Sales"),
(4, "Marketing")
]
dept_df = spark.createDataFrame(data = dept_data, schema = ["dept_id", "dept_name"])
dept_df.printSchema()
dept_df.show(truncate=False)
"""
root
|-- dept_id: long (nullable = true)
|-- dept_name: string (nullable = true)
+-------+---------+
|dept_id|dept_name|
+-------+---------+
|1 |IT |
|2 |HR |
|3 |Sales |
|4 |Marketing|
+-------+---------+
"""
b) Creating a DataFrame by reading files
# replace the file_path with the source file location which you have downloaded.
# inferSchema tells Spark to infer column types; without it, every CSV column is read as string.
emp_df_2 = spark.read.format("csv").option("header", True).option("inferSchema", True).load("employee_file_path")
emp_df_2.printSchema()
dept_df_2 = spark.read.format("csv").option("header", True).option("inferSchema", True).load("department_file_path")
dept_df_2.printSchema()
"""
root
|-- emp_id: integer (nullable = true)
|-- emp_name: string (nullable = true)
|-- dept_id: integer (nullable = true)
root
|-- dept_id: integer (nullable = true)
|-- dept_name: string (nullable = true)
"""
Note: Here, I will be using the manually created DataFrame.
How to join DataFrames in PySpark Azure Databricks?
The PySpark join() function is used to join two DataFrames together. PySpark DataFrame supports all basic SQL join types like INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI, CROSS, and SELF JOIN.
In the below example, we join the employee DataFrame and the department DataFrame on the column "dept_id" using different methods and join types.
Example 1: Inner Join
emp_df.join(dept_df, "dept_id", "inner").show(truncate=False)
"""
Output:
+-------+------+--------+---------+
|dept_id|emp_id|emp_name|dept_name|
+-------+------+--------+---------+
|1 |2 |Thornie |IT |
|2 |1 |Bel |HR |
|2 |3 |Chris |HR |
|2 |4 |Aloysia |HR |
+-------+------+--------+---------+
"""
Example 2: Left Outer Join
emp_df.join(dept_df, emp_df.dept_id == dept_df.dept_id, "left").show(truncate=False)
"""
Output:
+------+--------+-------+-------+---------+
|emp_id|emp_name|dept_id|dept_id|dept_name|
+------+--------+-------+-------+---------+
|2 |Thornie |1 |1 |IT |
|1 |Bel |2 |2 |HR |
|3 |Chris |2 |2 |HR |
|4 |Aloysia |2 |2 |HR |
|5 |Dorena |5 |null |null |
+------+--------+-------+-------+---------+
"""
Example 3: Right Outer Join
emp_df.join(dept_df, "dept_id", "right").show(truncate=False)
"""
Output:
+-------+------+--------+---------+
|dept_id|emp_id|emp_name|dept_name|
+-------+------+--------+---------+
|1 |2 |Thornie |IT |
|2 |1 |Bel |HR |
|2 |3 |Chris |HR |
|2 |4 |Aloysia |HR |
|3 |null |null |Sales |
|4 |null |null |Marketing|
+-------+------+--------+---------+
"""
Example 4: Full Outer Join
emp_df.join(dept_df, "dept_id", "outer").show(truncate=False)
"""
Output:
+-------+------+--------+---------+
|dept_id|emp_id|emp_name|dept_name|
+-------+------+--------+---------+
|1 |2 |Thornie |IT |
|2 |1 |Bel |HR |
|2 |3 |Chris |HR |
|2 |4 |Aloysia |HR |
|3 |null |null |Sales |
|4 |null |null |Marketing|
|5 |5 |Dorena |null |
+-------+------+--------+---------+
"""
How to perform Semi Join in PySpark Azure Databricks?
A semi join is very similar to an inner join. The difference between a Spark left semi join (accepted type names: "semi", "leftsemi", and "left_semi") and an inner join is that a semi join returns only the columns of the left DataFrame/Dataset, ignoring all columns from the right Dataset, whereas an inner join returns columns from both sides.
Example:
In the below example, we keep only the employees whose "dept_id" has a match in the department DataFrame.
emp_df.join(dept_df, "dept_id", "semi").show(truncate=False)
"""
Output:
+-------+------+--------+
|dept_id|emp_id|emp_name|
+-------+------+--------+
|1 |2 |Thornie |
|2 |1 |Bel |
|2 |3 |Chris |
|2 |4 |Aloysia |
+-------+------+--------+
"""
How to perform Anti Join in PySpark Azure Databricks?
An anti join returns the rows from the left relation that have no match in the right relation. It is also called a left anti join.
Example:
In the below example, we keep only the employees whose "dept_id" has no match in the department DataFrame.
emp_df.join(dept_df, "dept_id", "anti").show(truncate=False)
"""
Output:
+-------+------+--------+
|dept_id|emp_id|emp_name|
+-------+------+--------+
|5 |5 |Dorena |
+-------+------+--------+
"""
How to perform Cross Join in PySpark Azure Databricks?
A cross join simply combines each row of the first table with each row of the second table. For instance, if we have m rows in one table and n rows in another, the resulting table will have m*n rows.
Example:
In the below example, we combine each record of the employee DataFrame with every record of the department DataFrame.
emp_df.crossJoin(dept_df).show(truncate=False)
"""
Output:
+------+--------+-------+-------+---------+
|emp_id|emp_name|dept_id|dept_id|dept_name|
+------+--------+-------+-------+---------+
|1 |Bel |2 |1 |IT |
|1 |Bel |2 |2 |HR |
|1 |Bel |2 |3 |Sales |
|1 |Bel |2 |4 |Marketing|
|2 |Thornie |1 |1 |IT |
|2 |Thornie |1 |2 |HR |
|2 |Thornie |1 |3 |Sales |
|2 |Thornie |1 |4 |Marketing|
|3 |Chris |2 |1 |IT |
|3 |Chris |2 |2 |HR |
|3 |Chris |2 |3 |Sales |
|3 |Chris |2 |4 |Marketing|
|4 |Aloysia |2 |1 |IT |
|4 |Aloysia |2 |2 |HR |
|4 |Aloysia |2 |3 |Sales |
|4 |Aloysia |2 |4 |Marketing|
|5 |Dorena |5 |1 |IT |
|5 |Dorena |5 |2 |HR |
|5 |Dorena |5 |3 |Sales |
|5 |Dorena |5 |4 |Marketing|
+------+--------+-------+-------+---------+
"""
How to perform Self Join in PySpark Azure Databricks?
As the name suggests, a SELF JOIN connects a table to itself. In other words, based on a given condition, every row of the table can be matched with other rows of the same table, as well as with itself.
Example:
In the below example, we join the employee DataFrame with itself, matching each employee's "dept_id" against another employee's "emp_id".
from pyspark.sql.functions import col
emp_df.alias("emp1") \
.join(emp_df.alias("emp2"), col("emp1.dept_id") == col("emp2.emp_id"), "left") \
.show()
"""
Output:
+------+--------+-------+------+--------+-------+
|emp_id|emp_name|dept_id|emp_id|emp_name|dept_id|
+------+--------+-------+------+--------+-------+
| 2| Thornie| 1| 1| Bel| 2|
| 1| Bel| 2| 2| Thornie| 1|
| 3| Chris| 2| 2| Thornie| 1|
| 4| Aloysia| 2| 2| Thornie| 1|
| 5| Dorena| 5| 5| Dorena| 5|
+------+--------+-------+------+--------+-------+
"""
I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.
When should you use the PySpark join() function in Azure Databricks?
You can use the join() function whenever you need to join two DataFrames in PySpark. PySpark joins come with a number of optimizations by default (thanks to DataFrames and Datasets), but a join is a wide transformation that shuffles data across the cluster, so there are still performance issues to consider when using it.
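For example, when one side of the join is small enough to fit in each executor's memory, a broadcast hint can avoid an expensive shuffle. A minimal sketch using the DataFrames from this blog:

from pyspark.sql.functions import broadcast

# Hint Spark to broadcast the small department table to every executor,
# so the join runs as a broadcast hash join instead of a shuffle join.
emp_df.join(broadcast(dept_df), "dept_id", "inner").show(truncate=False)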
Real World Use Case Scenarios for join() function in PySpark Azure Databricks?
Assume that you have two DataFrames: an employee dataset and a department dataset. The employee dataset includes the employee ID, name, and department_id; the department dataset has the department_id and name. For example, an employee record looks like ("EMP1", "Berne", 1) and a department record looks like (1, "IT"). If you want to combine both into ("EMP1", "Berne", 1, 1, "IT"), you can use the PySpark join() function. It supports the different join types, each of which has been explained with a practical example in the sections above.
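A minimal sketch of that scenario, using the sample records from the paragraph above (the names employee_df and department_df are illustrative):

employee_df = spark.createDataFrame([("EMP1", "Berne", 1)], ["emp_id", "emp_name", "dept_id"])
department_df = spark.createDataFrame([(1, "IT")], ["dept_id", "dept_name"])

# Joining on an expression keeps dept_id from both sides,
# giving ("EMP1", "Berne", 1, 1, "IT").
employee_df.join(department_df, employee_df.dept_id == department_df.dept_id, "inner").show()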
What are the alternatives of the join() function in PySpark Azure Databricks?
The PySpark join() function is the primary way of joining two or more DataFrames together, but you can get the same result in PySpark SQL too. If you want to combine DataFrames vertically, i.e. stack rows rather than join columns, use the union() function instead.
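A minimal sketch of both alternatives, using the emp_df and dept_df DataFrames created earlier:

# Alternative 1: the same inner join expressed in Spark SQL through temp views.
emp_df.createOrReplaceTempView("employees")
dept_df.createOrReplaceTempView("departments")
spark.sql("""
    SELECT e.*, d.dept_name
    FROM employees e
    JOIN departments d ON e.dept_id = d.dept_id
""").show(truncate=False)

# Alternative 2: union() stacks the rows of two DataFrames with matching
# schemas (a vertical combination, not a join).
emp_df.union(emp_df).show(truncate=False)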
Final Thoughts
In this article, we have learned how to join two DataFrames based on columns of both DataFrames in Azure Databricks, with clearly explained examples. I have also covered the different scenarios that could come up, each with a practical example. I hope the information provided helped you gain knowledge.
Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.