How to join DataFrames in PySpark Azure Databricks?

Are you looking for how to use the join() function on a PySpark DataFrame in Azure Databricks, or perhaps for a solution to join two DataFrames in PySpark Databricks? If so, you have landed on the correct page. I will show you how to join two DataFrames in PySpark using the join() function, and I will explain it with practical examples. So without wasting time, let's start this step-by-step guide to understanding how to join two DataFrames using PySpark.

In this blog, I will teach you the following with practical examples:

  • Syntax of join()
  • Common Join types
  • Semi Join
  • Anti Join
  • Cross Join
  • Self Join

The join() method is used to join two DataFrames together based on a specified condition in PySpark Azure Databricks.

Syntax: dataframe_name.join()

What is the syntax of the join() function in PySpark Azure Databricks?

The syntax is as follows:

dataframe_name.join(other, on, how)
  • other (DataFrame), required: the right-side DataFrame to join with.
  • on (str, list, or Column), optional: a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides.
  • how (str), optional: the join type; by default how="inner".

Table 1: join() Method in PySpark Databricks Parameter list with Details

Apache Spark Official Documentation Link: join()

Create a simple DataFrame

Gentle reminder:

In Databricks,

  • SparkSession is made available as spark
  • SparkContext is made available as sc

In case you want to create them manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Creating a DataFrame manually

# 1. Employee Dataset
emp_data = [
    (1,"Bel",2),
    (2,"Thornie",1),
    (3,"Chris",2),
    (4,"Aloysia",2),
    (5,"Dorena",5)
]

emp_df = spark.createDataFrame(data = emp_data, schema = ["emp_id", "emp_name", "dept_id"])
emp_df.printSchema()
emp_df.show(truncate=False)

"""
root
 |-- emp_id: long (nullable = true)
 |-- emp_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+------+--------+-------+
|emp_id|emp_name|dept_id|
+------+--------+-------+
|1     |Bel     |2      |
|2     |Thornie |1      |
|3     |Chris   |2      |
|4     |Aloysia |2      |
|5     |Dorena  |5      |
+------+--------+-------+
"""
# 2. Department Dataset
dept_data = [
    (1, "IT"),
    (2, "HR"),
    (3, "Sales"),
    (4, "Marketing")
]

dept_df = spark.createDataFrame(data = dept_data, schema = ["dept_id", "dept_name"])
dept_df.printSchema()
dept_df.show(truncate=False)

"""
root
 |-- dept_id: long (nullable = true)
 |-- dept_name: string (nullable = true)

+-------+---------+
|dept_id|dept_name|
+-------+---------+
|1      |IT       |
|2      |HR       |
|3      |Sales    |
|4      |Marketing|
+-------+---------+
"""

b) Creating a DataFrame by reading files

# Replace the file_path with the source file location which you have downloaded.

emp_df_2 = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("employee_file_path")
emp_df_2.printSchema()

dept_df_2 = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("department_file_path")
dept_df_2.printSchema()

"""
root
 |-- emp_id: integer (nullable = true)
 |-- emp_name: string (nullable = true)
 |-- dept_id: integer (nullable = true)

root
 |-- dept_id: integer (nullable = true)
 |-- dept_name: string (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to join DataFrames in PySpark Azure Databricks?

The PySpark join() function is used to join two DataFrames together. PySpark DataFrame supports all basic SQL join types like INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI, CROSS, and SELF JOIN.

In the below examples, we join the employee DataFrame and the department DataFrame on the column “dept_id” using different methods and join types.

Example 1: Inner Join

emp_df.join(dept_df, "dept_id", "inner").show(truncate=False)

"""
Output:

+-------+------+--------+---------+
|dept_id|emp_id|emp_name|dept_name|
+-------+------+--------+---------+
|1      |2     |Thornie |IT       |
|2      |1     |Bel     |HR       |
|2      |3     |Chris   |HR       |
|2      |4     |Aloysia |HR       |
+-------+------+--------+---------+
"""

Example 2: Left Outer Join

emp_df.join(dept_df, emp_df.dept_id == dept_df.dept_id, "left").show(truncate=False)

"""
Output:

+------+--------+-------+-------+---------+
|emp_id|emp_name|dept_id|dept_id|dept_name|
+------+--------+-------+-------+---------+
|2     |Thornie |1      |1      |IT       |
|1     |Bel     |2      |2      |HR       |
|3     |Chris   |2      |2      |HR       |
|4     |Aloysia |2      |2      |HR       |
|5     |Dorena  |5      |null   |null     |
+------+--------+-------+-------+---------+
"""

Example 3: Right Outer Join

emp_df.join(dept_df, "dept_id", "right").show(truncate=False)

"""
Output:

+-------+------+--------+---------+
|dept_id|emp_id|emp_name|dept_name|
+-------+------+--------+---------+
|1      |2     |Thornie |IT       |
|2      |1     |Bel     |HR       |
|2      |3     |Chris   |HR       |
|2      |4     |Aloysia |HR       |
|3      |null  |null    |Sales    |
|4      |null  |null    |Marketing|
+-------+------+--------+---------+
"""

Example 4: Full Outer Join

emp_df.join(dept_df, "dept_id", "outer").show(truncate=False)

"""
Output:

+-------+------+--------+---------+
|dept_id|emp_id|emp_name|dept_name|
+-------+------+--------+---------+
|1      |2     |Thornie |IT       |
|2      |1     |Bel     |HR       |
|2      |3     |Chris   |HR       |
|2      |4     |Aloysia |HR       |
|3      |null  |null    |Sales    |
|4      |null  |null    |Marketing|
|5      |5     |Dorena  |null     |
+-------+------+--------+---------+
"""

How to perform Semi Join in PySpark Azure Databricks?

A semi join is very similar to an inner join. The difference between Spark's left semi join (how can be “semi”, “leftsemi”, or “left_semi”) and an inner join is that the semi join returns only the columns from the left DataFrame/Dataset, ignoring all columns from the right dataset.

Example:

In the below example, we get only the employee records whose “dept_id” has a match in the department DataFrame.

emp_df.join(dept_df, "dept_id", "semi").show(truncate=False)

"""
Output:

+-------+------+--------+
|dept_id|emp_id|emp_name|
+-------+------+--------+
|1      |2     |Thornie |
|2      |1     |Bel     |
|2      |3     |Chris   |
|2      |4     |Aloysia |
+-------+------+--------+

"""

How to perform Anti Join in PySpark Azure Databricks?

An anti join returns rows from the left relation that have no match in the right relation. It is also called a left anti join.

Example:

In the below example, we get only the employee records whose “dept_id” has no match in the department DataFrame.

emp_df.join(dept_df, "dept_id", "anti").show(truncate=False)

"""
Output:

+-------+------+--------+
|dept_id|emp_id|emp_name|
+-------+------+--------+
|5      |5     |Dorena  |
+-------+------+--------+

"""

How to perform Cross Join in PySpark Azure Databricks?

This join simply combines each row of the first table with each row of the second table. For instance, if we have m rows in one table and n rows in another, the resulting table will have m*n rows.

Example:

In the below example, we combine each employee record with every department record.

emp_df.crossJoin(dept_df).show(truncate=False)

"""
Output:

+------+--------+-------+-------+---------+
|emp_id|emp_name|dept_id|dept_id|dept_name|
+------+--------+-------+-------+---------+
|1     |Bel     |2      |1      |IT       |
|1     |Bel     |2      |2      |HR       |
|1     |Bel     |2      |3      |Sales    |
|1     |Bel     |2      |4      |Marketing|
|2     |Thornie |1      |1      |IT       |
|2     |Thornie |1      |2      |HR       |
|2     |Thornie |1      |3      |Sales    |
|2     |Thornie |1      |4      |Marketing|
|3     |Chris   |2      |1      |IT       |
|3     |Chris   |2      |2      |HR       |
|3     |Chris   |2      |3      |Sales    |
|3     |Chris   |2      |4      |Marketing|
|4     |Aloysia |2      |1      |IT       |
|4     |Aloysia |2      |2      |HR       |
|4     |Aloysia |2      |3      |Sales    |
|4     |Aloysia |2      |4      |Marketing|
|5     |Dorena  |5      |1      |IT       |
|5     |Dorena  |5      |2      |HR       |
|5     |Dorena  |5      |3      |Sales    |
|5     |Dorena  |5      |4      |Marketing|
+------+--------+-------+-------+---------+
"""

How to perform Self Join in PySpark Azure Databricks?

As the name suggests, a SELF JOIN connects a table to itself. In other words, based on the specified condition, each row of the table is matched against the other rows of the same table (and possibly itself).

Example:

In the below example, we join the employee DataFrame to itself, matching each employee’s “dept_id” to another employee’s “emp_id”. Aliasing the two sides avoids ambiguous column references.

from pyspark.sql.functions import col

emp_df.alias("emp1") \
.join(emp_df.alias("emp2"), col("emp1.dept_id") == col("emp2.emp_id"), "left") \
.show()

"""
Output:

+------+--------+-------+------+--------+-------+
|emp_id|emp_name|dept_id|emp_id|emp_name|dept_id|
+------+--------+-------+------+--------+-------+
|     2| Thornie|      1|     1|     Bel|      2|
|     1|     Bel|      2|     2| Thornie|      1|
|     3|   Chris|      2|     2| Thornie|      1|
|     4| Aloysia|      2|     2| Thornie|      1|
|     5|  Dorena|      5|     5|  Dorena|      5|
+------+--------+-------+------+--------+-------+
"""

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook in Databricks, Jupyter Notebook, etc.

When should you use the PySpark join() function in Azure Databricks?

You can use the join() function to join two DataFrames in PySpark. PySpark joins come with a lot of optimization by default (thanks to DataFrames and Datasets), but there are still performance issues to consider when using them, such as the shuffle triggered by joining two large DataFrames.

Real World Use Case Scenarios for the join() function in PySpark Azure Databricks

Assume you have two DataFrames: an employee dataset and a department dataset. The employee dataset includes the employee ID, name, and department ID; the department dataset has the department ID and name. For example, an employee record looks like (“EMP1”, “Berne”, 1) and a department record looks like (1, “IT”). To combine both into (“EMP1”, “Berne”, 1, 1, “IT”), you can use the PySpark join() function. It supports different join types, each explained with a practical example in the sections above.

What are the alternatives of the join() function in PySpark Azure Databricks?

The PySpark join() function is the primary way to join two or more DataFrames together, and you get the same capability in PySpark SQL too. You could also use the union() function to combine DataFrames vertically (stacking rows rather than matching them on a condition).

Final Thoughts

In this article, we have learned about how to join two DataFrames based on columns of both DataFrames in Azure Databricks along with the examples explained clearly. I have also covered different scenarios with practical examples that could be possible. I hope the information that was provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.