What is the difference between repartition() and partitionBy() in PySpark Azure Databricks?

Are you looking to find out the difference between the repartition() and partitionBy() functions in PySpark on Azure Databricks, or perhaps how repartition() and partitionBy() write files? If you are looking for either of these answers, you have landed on the correct page. I will show you how to use both functions in PySpark Azure Databricks, explained with practical examples. So let's not waste time and start with a step-by-step guide to understanding the difference between repartition() and partitionBy() in Azure Databricks.

In this blog, I will teach you the following with practical examples:

  • Difference between repartition() and partitionBy()
  • Syntax of repartition()
  • Repartition by number of partitions
  • Repartition by number of partitions and columns
  • Syntax of partitionBy()
  • Partitioning DataFrame based on columns

What is the difference between repartition() and partitionBy() functions in PySpark Azure Databricks?

repartition() is a DataFrame function used for increasing or decreasing the number of partitions in memory. When the repartitioned DataFrame is saved to disk, all part files are created in a single directory.

partitionBy() is a DataFrameWriter function used for partitioning files on disk while writing, and it creates a sub-directory for each distinct value of the partition column(s).
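To make the contrast concrete, here is a minimal sketch, assuming a DataFrame named df and hypothetical output paths:

# repartition(): changes the number of in-memory partitions;
# every part file is written into a single output directory
df.repartition(4).write.format("csv").save("/tmp/out_repartition")

# partitionBy(): leaves the in-memory partitions untouched, but creates
# one sub-directory per distinct value of the partition column on disk
df.write.partitionBy("country").format("csv").save("/tmp/out_partitionby")

Both forms are explained step by step in the sections below.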

Create a simple DataFrame

Gentle reminder:

In Databricks,

  • The SparkSession is made available as spark
  • The SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

data = [
    (1,"Ricki","Russia"),
    (2,"Madelaine","Russia"),
    (3,"Neil","Brazil"),
    (4,"Lock","Brazil"),
    (5,"Prissie","India"),
    (6,"Myrilla","India"),
    (7,"Correna","USA"),
    (8,"Ricki","USA")
]

df = spark.createDataFrame(data, schema=["id","name","country"])
df.printSchema()
df.show()

"""
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- country: string (nullable = true)

+---+---------+-------+
| id|     name|country|
+---+---------+-------+
|  1|    Ricki| Russia|
|  2|Madelaine| Russia|
|  3|     Neil| Brazil|
|  4|     Lock| Brazil|
|  5|  Prissie|  India|
|  6|  Myrilla|  India|
|  7|  Correna|    USA|
|  8|    Ricki|    USA|
+---+---------+-------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace file_path with the location of the source file you downloaded

df_2 = spark.read.format("csv").option("header", True).option("inferSchema", True).load(file_path)
df_2.printSchema()

"""
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- country: string (nullable = true)
"""

Note: In this blog, I have used Azure Data Lake Storage Gen2 for writing the partitioned records, so that I can demonstrate how the files are stored in ADLS using dbutils.fs.ls(). To use ADLS in Azure Databricks, kindly mount your container to Databricks first. Here, I will be using the manually created DataFrame.
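For reference, mounting is typically done once with dbutils.fs.mount(). The snippet below is only a hypothetical sketch using an account key and placeholder names; an ADLS Gen2 mount via abfss usually uses a service principal with OAuth configs instead, so adapt it to your own setup:

# Hypothetical mount sketch - replace every placeholder with your own values
dbutils.fs.mount(
    source="wasbs://<container>@<storage-account>.blob.core.windows.net/",
    mount_point="/mnt/azurelib",
    extra_configs={
        "fs.azure.account.key.<storage-account>.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope-name>", key="<key-name>")
    }
)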

What is the syntax of the repartition() function in PySpark Azure Databricks?

The syntax is as follows:

dataframe_name.repartition(number_of_partitions, *columns)
Parameter Name        | Required | Description
number_of_partitions  | Yes      | It represents the target number of partitions. It becomes optional when partitioning columns are specified.
columns (str, Column) | Optional | It represents the columns to be considered for partitioning.
Table 1: repartition() Method in PySpark Databricks Parameter list with Details

Apache Spark Official Documentation Link: repartition()
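Before moving on to the full examples, here is a quick look at the call forms the syntax above allows (assuming the df created earlier):

df.repartition(4)              # by number of partitions only
df.repartition(2, "country")   # by number of partitions and a column
df.repartition("country")      # by column only; the partition count then
                               # defaults to spark.sql.shuffle.partitions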

How to use the repartition() function in PySpark Azure Databricks?

PySpark’s repartition() function is used for increasing or decreasing the number of partitions in memory. Let’s understand this with some practical examples. Each example below has three parts:

  • Using the repartition() function
  • Writing the repartitioned DataFrame to disk
  • Checking how the files have been written

Example 1:

df1 = df.repartition(4)

# replace save_location with your saving location
df1.write.format("csv").mode("overwrite").save("save_location")

# Check how many files have been written
for record in dbutils.fs.ls("save_location"):
    # The written part file names always start with 'part'
    if record.name.startswith("part"):
        print(record.name)

"""
Output:

part-00000-tid-8211750290546117208-f9cd75cc-b2c5-45d7-bca8-109ace4cc629-39-1-c000.csv
part-00001-tid-8211750290546117208-f9cd75cc-b2c5-45d7-bca8-109ace4cc629-40-1-c000.csv
part-00002-tid-8211750290546117208-f9cd75cc-b2c5-45d7-bca8-109ace4cc629-41-1-c000.csv
part-00003-tid-8211750290546117208-f9cd75cc-b2c5-45d7-bca8-109ace4cc629-42-1-c000.csv

"""

In the above example, you can see that the DataFrame has been written out as 4 part files.
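If you want to verify the in-memory partition count before writing, a quick check (using the df1 from Example 1) is:

print(df1.rdd.getNumPartitions())   # 4, matching the 4 part files above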

Example 2:

df2 = df.repartition(2, "country")

# replace save_location with your saving location
df2.write.format("csv").mode("overwrite").save("save_location")

# Check how many files have been written
for record in dbutils.fs.ls("save_location"):
    # The written part file names always start with 'part'
    if record.name.startswith("part"):
        print(record.name)

"""
Output:

part-00000-tid-2978030412679744719-1f0ab132-1e3f-4c0b-ad74-cd5022283147-45-1-c000.csv
part-00001-tid-2978030412679744719-1f0ab132-1e3f-4c0b-ad74-cd5022283147-46-1-c000.csv

"""

In the above example, you can see that the DataFrame has been written out as 2 part files. repartition(2, "country") hash-partitions the rows on the country column into 2 partitions. For example (illustrative hash values):

  • hash(“India”) % 2 = 0
  • hash(“Russia”) % 2 = 0
  • hash(“USA”) % 2 = 1
  • hash(“Brazil”) % 2 = 1

With these hash values, the “India” and “Russia” rows land in the first file “part-00000” and the remaining countries end up in the second file “part-00001”. It is guaranteed that every row with the same country (the partition key) will be placed in the same partition file.
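If you want to see the actual row-to-partition assignment on your cluster (the exact 0/1 split may differ from the illustrative hash values above), a small sketch using the built-in spark_partition_id() function is:

from pyspark.sql.functions import spark_partition_id

df2.withColumn("partition_id", spark_partition_id()) \
    .select("country", "partition_id") \
    .distinct() \
    .show()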

What is the syntax of the partitionBy() function in PySpark Azure Databricks?

The syntax is as follows:

dataframe_name.write.partitionBy(*columns)
Parameter Name        | Required | Description
columns (str or list) | Yes      | It represents the columns to be considered for partitioning.
Table 2: partitionBy() Method in PySpark Databricks Parameter list with Details

Apache Spark Official Documentation Link: partitionBy()
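Note that partitionBy() also accepts more than one column, in which case the sub-directories are nested. A hypothetical two-level example (partitioning by country and then by name) would look like this:

df.write.partitionBy("country", "name").format("csv") \
    .mode("overwrite").save("save_location_nested")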

How to use the partitionBy() function in PySpark Azure Databricks?

As mentioned above, PySpark’s partitionBy() function is used for partitioning files on disk while writing, and it creates a sub-directory for each distinct value of the partition column(s).

Example:

df.write.partitionBy("country").format("csv").mode("overwrite").save("save_location")

# Check which partition folders have been written
for record in dbutils.fs.ls("save_location"):
    if record.name.startswith("country"):
        print(record.name)

"""
Output:

country=Brazil/
country=India/
country=Russia/
country=USA/

"""

In the above example, you can see that the DataFrame has been written into 4 different folders named after the country values, and the data files themselves no longer contain the country column, which saves some space.
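When you read the partitioned output back, Spark rebuilds the country column from the directory names, and a filter on it only scans the matching sub-directory (partition pruning). A minimal sketch, assuming the save_location used above:

df_back = spark.read.format("csv").load("save_location")
df_back.filter(df_back.country == "India").show()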

I have attached the complete code used in this blog in notebook format at this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you use repartition() and partitionBy() functions of PySpark using Azure Databricks?

These are the typical reasons for using each function:

  1. repartition(): for increasing or decreasing the number of partitions in memory
  2. partitionBy(): for partitioning the data into sub-directories on disk while writing

Real World Use Case Scenarios for using repartition() and partitionBy() of PySpark using Azure Databricks

  • repartition(): Assume you have a huge dataset that has to be processed, and executing transformations on this DataFrame takes a long time. To avoid this, we repartition the data, which increases the number of partitions in memory and results in faster, more parallel processing.
  • partitionBy(): Assume that you have 100M records of student data. The student dataset has their id, name, school, and city. Only 10 students study in Chennai, and you get a requirement to fetch all the students who study in Chennai. You can easily achieve this by filtering records, but the time consumption is high because the execution has to scan every record to check the city column. To avoid this, we can partition and write the student records on a per-city basis, as shown in the sketch after this list.
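Here is a hypothetical sketch of the student scenario above (students_df and the storage path are assumptions): write the data once partitioned by city, and later reads of a single city only scan that city's sub-directory instead of all 100M records.

students_df.write.partitionBy("city").format("parquet") \
    .mode("overwrite").save("/mnt/azurelib/students")

chennai_df = (
    spark.read.format("parquet")
    .load("/mnt/azurelib/students")
    .filter("city = 'Chennai'")
)
chennai_df.show()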

What are the alternatives for using repartition() and partitionBy() functions of PySpark using Azure Databricks?

There are multiple alternatives to the repartition() and partitionBy() functions of PySpark DataFrame, which are as follows:

  • coalesce(): used for decreasing the number of partitions in memory; unlike repartition(), it cannot increase them and it avoids a full shuffle
  • bucketBy(): used for bucketing the files based on a hash key; it is supported only when writing records as a Hive table (see the sketch after this list)
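A minimal sketch of both alternatives, assuming the df created earlier (the table name below is hypothetical):

# coalesce(): reduce the number of in-memory partitions without a full shuffle
df_small = df.coalesce(1)

# bucketBy(): hash-bucket the rows into a fixed number of buckets;
# it only works together with saveAsTable()
df.write.format("parquet").bucketBy(4, "country") \
    .sortBy("id").mode("overwrite").saveAsTable("country_bucketed")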

Final Thoughts

In this article, we have learned about the difference between the repartition() and partitionBy() functions of PySpark in Azure Databricks, along with clearly explained examples. I have also covered different practical scenarios you may come across. I hope the information provided helped you gain knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.