How to join DataFrames in PySpark Azure Databricks?

Are you looking for how to use the join() function on a PySpark DataFrame in Azure Databricks, or perhaps for a solution to join two DataFrames in PySpark Databricks? If so, you have landed on the correct page. I will show you how to join two DataFrames in PySpark using the join() function, and I will explain it with practical examples. So without wasting time, let's start this step-by-step guide to understanding how to join two DataFrames using PySpark.

In this blog, I will teach you the following with practical examples:

  • Syntax of join()
  • Common Join types
  • Semi Join
  • Anti Join
  • Cross Join
  • Self Join

The join() method is used to join two DataFrames together based on a specified condition in PySpark Azure Databricks.

Syntax: dataframe_name.join()

What is the syntax of the join() function in PySpark Azure Databricks?

The syntax is as follows:

dataframe_name.join(other, on, how)
  • other (DataFrame), required: the right-side DataFrame to join with.
  • on (str, list, or Column), optional: a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides.
  • how (str), optional: the join type; by default how="inner".

Table 1: join() Method in PySpark Databricks Parameter list with Details

Apache Spark Official Documentation Link: join()

Create a simple DataFrame

Gentle reminder:

In Databricks,

  • SparkSession is made available as spark
  • SparkContext is made available as sc

In case you want to create them manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Creating a DataFrame manually

# 1. Employee Dataset
emp_data = [
    (1,"Bel",2),
    (2,"Thornie",1),
    (3,"Chris",2),
    (4,"Aloysia",2),
    (5,"Dorena",5)
]

emp_df = spark.createDataFrame(data = emp_data, schema = ["emp_id", "emp_name", "dept_id"])
emp_df.printSchema()
emp_df.show(truncate=False)

"""
root
 |-- emp_id: long (nullable = true)
 |-- emp_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+------+--------+-------+
|emp_id|emp_name|dept_id|
+------+--------+-------+
|1     |Bel     |2      |
|2     |Thornie |1      |
|3     |Chris   |2      |
|4     |Aloysia |2      |
|5     |Dorena  |5      |
+------+--------+-------+
"""
# 2. Department Dataset
dept_data = [
    (1, "IT"),
    (2, "HR"),
    (3, "Sales"),
    (4, "Marketing")
]

dept_df = spark.createDataFrame(data = dept_data, schema = ["dept_id", "dept_name"])
dept_df.printSchema()
dept_df.show(truncate=False)

"""
root
 |-- dept_id: long (nullable = true)
 |-- dept_name: string (nullable = true)

+-------+---------+
|dept_id|dept_name|
+-------+---------+
|1      |IT       |
|2      |HR       |
|3      |Sales    |
|4      |Marketing|
+-------+---------+
"""

b) Creating a DataFrame by reading files

# Replace the file_path with the source file location which you have downloaded.

emp_df_2 = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("employee_file_path")
emp_df_2.printSchema()

dept_df_2 = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("department_file_path")
dept_df_2.printSchema()

"""
root
 |-- emp_id: integer (nullable = true)
 |-- emp_name: string (nullable = true)
 |-- dept_id: integer (nullable = true)

root
 |-- dept_id: integer (nullable = true)
 |-- dept_name: string (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to join DataFrames in PySpark Azure Databricks?

The PySpark join() function is used to join two DataFrames together. PySpark DataFrame supports all basic SQL join types like INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI, CROSS, and SELF JOIN.

In the below examples, we join the employee DataFrame and the department DataFrame on the column “dept_id” using different methods and join types.

Example 1: Inner Join

emp_df.join(dept_df, "dept_id", "inner").show(truncate=False)

"""
Output:

+-------+------+--------+---------+
|dept_id|emp_id|emp_name|dept_name|
+-------+------+--------+---------+
|1      |2     |Thornie |IT       |
|2      |1     |Bel     |HR       |
|2      |3     |Chris   |HR       |
|2      |4     |Aloysia |HR       |
+-------+------+--------+---------+
"""

Example 2: Left Outer Join

emp_df.join(dept_df, emp_df.dept_id == dept_df.dept_id, "left").show(truncate=False)

"""
Output:

+------+--------+-------+-------+---------+
|emp_id|emp_name|dept_id|dept_id|dept_name|
+------+--------+-------+-------+---------+
|2     |Thornie |1      |1      |IT       |
|1     |Bel     |2      |2      |HR       |
|3     |Chris   |2      |2      |HR       |
|4     |Aloysia |2      |2      |HR       |
|5     |Dorena  |5      |null   |null     |
+------+--------+-------+-------+---------+
"""

Example 3: Right Outer Join

emp_df.join(dept_df, "dept_id", "right").show(truncate=False)

"""
Output:

+-------+------+--------+---------+
|dept_id|emp_id|emp_name|dept_name|
+-------+------+--------+---------+
|1      |2     |Thornie |IT       |
|2      |1     |Bel     |HR       |
|2      |3     |Chris   |HR       |
|2      |4     |Aloysia |HR       |
|3      |null  |null    |Sales    |
|4      |null  |null    |Marketing|
+-------+------+--------+---------+
"""

Example 4: Full Outer Join

emp_df.join(dept_df, "dept_id", "outer").show(truncate=False)

"""
Output:

+-------+------+--------+---------+
|dept_id|emp_id|emp_name|dept_name|
+-------+------+--------+---------+
|1      |2     |Thornie |IT       |
|2      |1     |Bel     |HR       |
|2      |3     |Chris   |HR       |
|2      |4     |Aloysia |HR       |
|3      |null  |null    |Sales    |
|4      |null  |null    |Marketing|
|5      |5     |Dorena  |null     |
+-------+------+--------+---------+
"""

How to perform Semi Join in PySpark Azure Databricks?

A semi join is very similar to an inner join. The difference between Spark's left semi join (how can be “semi”, “leftsemi”, or “left_semi”) and an inner join is that the semi join returns only the columns from the left DataFrame/Dataset, ignoring all columns from the right dataset.

Example:

In the below example, we get only the employee records whose “dept_id” has a match in the department DataFrame.

emp_df.join(dept_df, "dept_id", "semi").show(truncate=False)

"""
Output:

+-------+------+--------+
|dept_id|emp_id|emp_name|
+-------+------+--------+
|1      |2     |Thornie |
|2      |1     |Bel     |
|2      |3     |Chris   |
|2      |4     |Aloysia |
+-------+------+--------+

"""

How to perform Anti Join in PySpark Azure Databricks?

An anti join returns rows from the left relation that have no match in the right relation. It is also called a left anti join.

Example:

In the below example, we get only the employee records whose “dept_id” has no match in the department DataFrame.

emp_df.join(dept_df, "dept_id", "anti").show(truncate=False)

"""
Output:

+-------+------+--------+
|dept_id|emp_id|emp_name|
+-------+------+--------+
|5      |5     |Dorena  |
+-------+------+--------+

"""

How to perform Cross Join in PySpark Azure Databricks?

This join simply combines each row of the first table with each row of the second table. For instance, if we have m rows in one table and n rows in another, the resulting table will have m*n rows.

Example:

In the below example, we combine each employee record with every department record.

emp_df.crossJoin(dept_df).show(truncate=False)

"""
Output:

+------+--------+-------+-------+---------+
|emp_id|emp_name|dept_id|dept_id|dept_name|
+------+--------+-------+-------+---------+
|1     |Bel     |2      |1      |IT       |
|1     |Bel     |2      |2      |HR       |
|1     |Bel     |2      |3      |Sales    |
|1     |Bel     |2      |4      |Marketing|
|2     |Thornie |1      |1      |IT       |
|2     |Thornie |1      |2      |HR       |
|2     |Thornie |1      |3      |Sales    |
|2     |Thornie |1      |4      |Marketing|
|3     |Chris   |2      |1      |IT       |
|3     |Chris   |2      |2      |HR       |
|3     |Chris   |2      |3      |Sales    |
|3     |Chris   |2      |4      |Marketing|
|4     |Aloysia |2      |1      |IT       |
|4     |Aloysia |2      |2      |HR       |
|4     |Aloysia |2      |3      |Sales    |
|4     |Aloysia |2      |4      |Marketing|
|5     |Dorena  |5      |1      |IT       |
|5     |Dorena  |5      |2      |HR       |
|5     |Dorena  |5      |3      |Sales    |
|5     |Dorena  |5      |4      |Marketing|
+------+--------+-------+-------+---------+
"""

How to perform Self Join in PySpark Azure Databricks?

As the name suggests, a SELF JOIN connects a table to itself. In other words, based on the specified condition, each row of the table is matched against the other rows of the same table (and possibly itself).

Example:

In the below example, we join the employee DataFrame to itself, matching each employee’s “dept_id” to another employee’s “emp_id”. Aliasing the two sides avoids ambiguous column references.

from pyspark.sql.functions import col

emp_df.alias("emp1") \
.join(emp_df.alias("emp2"), col("emp1.dept_id") == col("emp2.emp_id"), "left") \
.show()

"""
Output:

+------+--------+-------+------+--------+-------+
|emp_id|emp_name|dept_id|emp_id|emp_name|dept_id|
+------+--------+-------+------+--------+-------+
|     2| Thornie|      1|     1|     Bel|      2|
|     1|     Bel|      2|     2| Thornie|      1|
|     3|   Chris|      2|     2| Thornie|      1|
|     4| Aloysia|      2|     2| Thornie|      1|
|     5|  Dorena|      5|     5|  Dorena|      5|
+------+--------+-------+------+--------+-------+
"""

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook in Databricks, Jupyter Notebook, etc.

When should you use the PySpark join() function in Azure Databricks?

You can use the join() function to join two DataFrames in PySpark. PySpark joins come with a lot of optimization by default (thanks to DataFrames and Datasets), but there are still performance issues to consider when using them, such as the shuffle triggered by joining two large DataFrames.

Real World Use Case Scenarios for the join() function in PySpark Azure Databricks

Assume you have two DataFrames: an employee dataset and a department dataset. The employee dataset includes the employee ID, name, and department ID; the department dataset has the department ID and name. For example, an employee record looks like (“EMP1”, “Berne”, 1) and a department record looks like (1, “IT”). To combine both into (“EMP1”, “Berne”, 1, 1, “IT”), you can use the PySpark join() function. It supports different join types, each explained with a practical example in the sections above.

What are the alternatives of the join() function in PySpark Azure Databricks?

The PySpark join() function is the primary way to join two or more DataFrames together, and you get the same capability in PySpark SQL too. You could also use the union() function to combine DataFrames vertically (stacking rows rather than matching them on a condition).

Final Thoughts

In this article, we have learned about how to join two DataFrames based on columns of both DataFrames in Azure Databricks along with the examples explained clearly. I have also covered different scenarios with practical examples that could be possible. I hope the information that was provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.