Difference between repartition() and coalesce() functions of PySpark in Azure Databricks?

Are you looking to find out the difference between the repartition() and coalesce() functions of PySpark in Azure Databricks, or are you looking for a way to use these functions to partition a PySpark RDD or DataFrame? If so, you have landed on the correct page. I will also show you how to increase or decrease the number of partitions of a PySpark RDD or DataFrame in Azure Databricks, explained with practical examples. Without wasting any more time, let’s start with a step-by-step guide to understanding the difference between the two functions.

In this blog, I will teach you the following with practical examples:

  • Difference between the repartition() and coalesce() functions
  • Increase partition of RDD using repartition()
  • Decrease partition of RDD using coalesce()
  • Increase partition of DataFrame using repartition()
  • Decrease partition of DataFrame using coalesce()
  • PySpark default shuffle partition

Difference between repartition() and coalesce() functions:

First of all, let’s try to understand what the repartition() and coalesce() functions are.

The PySpark repartition() function is used for both increasing and decreasing the number of partitions of both RDD and DataFrame.

The PySpark coalesce() function is used for decreasing the number of partitions of both RDD and DataFrame in an efficient manner.

Note that the PySpark repartition() function is expensive because it performs a full shuffle of data across executors and even nodes; coalesce() is cheaper because it only merges existing partitions, but it still moves data around. Therefore, try to minimize the use of these functions as much as possible.

Let’s differentiate repartition and coalesce with the help of a table:

| Sl. No. | repartition() | coalesce() |
| --- | --- | --- |
| 1 | Increases and decreases the number of partitions. | Only decreases the number of partitions. |
| 2 | Creates new partitions and does a full shuffle. | Uses existing partitions to minimize the amount of data that is shuffled. |
| 3 | Results in equal-sized partitions. | Results in partitions with different amounts of data. |
| 4 | Does not affect the parent RDD. | Affects the parent RDD. |

Table 1: Difference between the repartition() and coalesce() functions.

Gentle reminder:

In Databricks,

  • the SparkSession is made available as spark
  • the SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

Let’s create an RDD, which will be used in the following sections:

rdd = sc.parallelize(range(1, 9), 4)
rdd.getNumPartitions()

"""
Output:

4

"""

# The above code distributes the RDD into 4 partitions, and the data is distributed as shown below.

"""
Partitioned data:

Partition 0: [1,2]
Partition 1: [3,4]
Partition 2: [5,6]
Partition 3: [7,8]

"""

How to increase the partition of PySpark RDD in Azure Databricks?

The PySpark repartition() function helps to increase and decrease the partitions of PySpark RDD. Let’s try to increase the partition with a practical example.

Example:

rdd = rdd.repartition(8)
rdd.getNumPartitions()

"""
Output:

8

"""

# The above code repartitions the RDD into 8 partitions, and the data is distributed as shown below.

"""
Partitioned data:

Partition 0: [1]
Partition 1: [2]
Partition 2: [3]
Partition 3: [4]
Partition 4: [5]
Partition 5: [6]
Partition 6: [7]
Partition 7: [8]

"""

In the above example, we have seen how the repartition() function helps in increasing the number of partitions of the RDD.

How to decrease the partition of PySpark RDD in Azure Databricks?

The PySpark coalesce() function helps to decrease the partitions of a PySpark RDD. Let’s try to decrease the partitions with a practical example.

Example:

rdd = rdd.coalesce(2)
rdd.getNumPartitions()

"""
Output:

2

"""

# The above code coalesces the RDD into 2 partitions, and the data is distributed as shown below.

"""
Partitioned data:

Partition 0: [1,2]
Partition 1: [3,4,5,6,7,8]

"""

As explained before, the coalesce() function does not redistribute the data evenly; it merges existing partitions in order to reduce shuffling.

Let’s create a DataFrame, which will be used in the following sections:

df = spark.range(16)
df.printSchema()
df.rdd.getNumPartitions()

"""
Output:

root
 |-- id: long (nullable = false)

8

"""

# The above code creates the DataFrame with 8 partitions (the exact count depends on the cluster's default parallelism), and the data is distributed as shown below.

"""
Partitioned data:

Partition 0: [0,1]
Partition 1: [2,3]
Partition 2: [4,5]
Partition 3: [6,7]
Partition 4: [8,9]
Partition 5: [10,11]
Partition 6: [12,13]
Partition 7: [14,15]

"""

How to increase the partition of PySpark DataFrame in Azure Databricks?

The PySpark repartition() function helps to increase and decrease the partitions of PySpark DataFrame. Let’s try to increase the partition with a practical example.

Example:

df_1 = df.repartition(16)
df_1.rdd.getNumPartitions()

"""
Output:

16

"""

# The above code repartitions the DataFrame into 16 partitions, and the data is distributed as shown below.

"""
Partitioned data:

Partition 0: [0]
Partition 1: [1]
Partition 2: [2]
Partition 3: [3]
Partition 4: [4]
Partition 5: [5]
Partition 6: [6]
Partition 7: [7]
Partition 8: [8]
Partition 9: [9]
Partition 10: [10]
Partition 11: [11]
Partition 12: [12]
Partition 13: [13]
Partition 14: [14]
Partition 15: [15]

"""

In the above example, we have seen how the repartition() function helps in increasing the partitions of a DataFrame.

How to decrease the partition of PySpark DataFrame in Azure Databricks?

The PySpark coalesce() function helps to decrease the partitions of the PySpark DataFrame. Let’s try to decrease the partition with a practical example.

Example:

df_2 = df.coalesce(4)
df_2.rdd.getNumPartitions()

"""
Output:

4

"""

# The above code coalesces the DataFrame into 4 partitions, and the data is distributed as shown below.

"""
Partitioned data:

Partition 0: [0,1]
Partition 1: [2,3]
Partition 2: [4,5,6,7,8,9]
Partition 3: [10,11,12,13,14,15]

"""

As explained before, the coalesce() function does not redistribute the data evenly; it merges existing partitions in order to reduce shuffling.
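To see the shuffle difference from Table 1 in the physical plan, you can compare the explain() output of the two functions: repartition() introduces an Exchange (a full shuffle), while coalesce() only adds a Coalesce step. A minimal sketch against the df created above; the exact plan text varies across Spark versions:

# repartition() shows an Exchange node (round-robin shuffle) in the plan.
df.repartition(4).explain()

# coalesce() shows only a Coalesce node, with no Exchange.
df.coalesce(4).explain()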

Note: When calling wide transformations such as groupBy() and join() on a DataFrame, the data is moved between different executors and even machines and is then automatically repartitioned into 200 partitions. This default number of shuffle partitions comes from the spark.sql.shuffle.partitions setting, which PySpark sets to 200.

Example:

df_3 = df.groupBy("id").count()
print(df_3.rdd.getNumPartitions())

"""
Output:

200

"""

Using the coalesce() or repartition() functions, you can modify the number of partitions after shuffle operations.
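You can also change the default number of shuffle partitions for the whole session through the spark.sql.shuffle.partitions setting. A minimal sketch, assuming adaptive query execution does not further coalesce the shuffle partitions; 8 is just an illustrative value:

# Lower the default shuffle partition count from 200 to 8 for this session.
spark.conf.set("spark.sql.shuffle.partitions", "8")

df_4 = df.groupBy("id").count()
print(df_4.rdd.getNumPartitions())

"""
Output:

8

"""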

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you use the repartition() and coalesce() functions in Azure Databricks?

These could be the possible reasons:

  1. For increasing and decreasing partitions
  2. For optimisation

Real World Use Case Scenarios for PySpark repartition() and coalesce() functions in Azure Databricks

  • Repartition: Assume that you were given a huge dataset to process, the processing seems slow, and you want to finish it faster in a balanced manner. You can go for repartition().
  • Coalesce: Assume that you were given a huge dataset to process, you have limited resources, and you don’t want to cause out-of-memory issues. You can go for coalesce() (see the sketch below). But always remember that the coalesce() function affects the parent RDD.
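As an illustration of the coalesce() scenario above, a common pattern is to shrink the DataFrame to a small number of partitions just before writing, so fewer tasks run in parallel and fewer output files are produced. A minimal sketch; the output path is hypothetical:

# Merge the result into a single partition so the write produces
# one output file; "/tmp/azurelib_output" is just an example path.
df.coalesce(1).write.mode("overwrite").parquet("/tmp/azurelib_output")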

What are the alternatives for increasing and decreasing partitions of RDD and DataFrame in Azure Databricks?

The two functions for increasing and decreasing the partitions of an RDD or DataFrame are as follows:

  • repartition(): used for both increasing and decreasing the number of partitions; produces roughly equal-sized partitions.
  • coalesce(): used only for decreasing the number of partitions; doesn’t guarantee equal-sized partitions.

Final Thoughts

In this article, we have learned about the difference between the repartition() and coalesce() functions and how to increase and decrease the number of partitions of a PySpark RDD and DataFrame in Azure Databricks, with clearly explained examples. I have also covered the different scenarios in which each function is useful, along with practical examples. I hope the information provided here helped you gain some knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.