How to use the parallelize() function of PySpark in Azure Databricks?

Are you looking to create a PySpark RDD in Azure Databricks, or perhaps an empty one? If so, you have landed on the right page. I will show you how to create an RDD using the parallelize() function of PySpark in Azure Databricks, working through practical examples. Let’s start with a step-by-step guide to creating an RDD.

In this blog, I will teach you the following with practical examples:

  • Syntax of parallelize() function
  • Create an RDD
  • Create an empty RDD
  • Check whether an RDD has values

The PySpark function parallelize() is a SparkContext method used for creating an RDD from a Python collection.

SparkContext.parallelize()

What is the syntax of the parallelize() function in PySpark Azure Databricks?

The syntax is as follows:

SparkContext.parallelize(c, numSlices=None)

Parameter Name | Required | Description
c (Iterable) | Yes | The Python collection to distribute.
numSlices (int) | Optional | The number of partitions of the resulting RDD.

Table 1: parallelize() Method in PySpark Databricks Parameter list with Details

Apache Spark Official Documentation Link: parallelize()

Gentle reminder:

In Databricks,

  • the SparkSession is available as spark
  • the SparkContext is available as sc

If you want to create them manually, use the code below.

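from pyspark.sql.session import SparkSession

# Build (or reuse) a SparkSession that runs locally on all available cores
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("azurelib.com")
    .getOrCreate()
)

# The SparkContext is exposed through the session
sc = spark.sparkContext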

How to create a PySpark RDD on Azure Databricks using the parallelize() function?

The PySpark parallelize() function creates a new RDD from a Python collection. Let’s try to create an RDD using practical examples.

Example 1:

rdd_1 = sc.parallelize([1,2,3,4,5,6,7,8,9,10])

print(f"Number of partitions: {rdd_1.getNumPartitions()}")
print(f"Data: {rdd_1.collect()}")

"""
Output:

Number of partitions: 8
Data: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

"""

We used collect() to bring the elements of the distributed RDD back to the driver, and getNumPartitions() to see how many partitions the RDD has. In simple terms, the RDD was divided into multiple partitions so that it can be processed in parallel. When numSlices is not given, the partition count (8 here) comes from the cluster’s default parallelism, so your number may differ.

Example 2:

In this example, let’s try to control the number of RDD partitions by using the numSlices parameter of parallelize().

rdd_2 = sc.parallelize([1,2,3,4,5,6,7,8,9,10], numSlices=2)

print(f"Number of partitions: {rdd_2.getNumPartitions()}")
rdd_2.collect()

"""
Output:

Number of partitions: 2
Out[32]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

"""

You can see that the number of partitions of rdd_2 was controlled by specifying the number of partitions in the numSlices parameter.

How to create an empty PySpark RDD on Azure Databricks?

An empty RDD can be created either with sc.emptyRDD() or by passing an empty list to parallelize(). Let’s try both.

Example 1:

rdd_3 = sc.emptyRDD()
rdd_3.collect()

"""
Output:

[]

"""

Example 2:

rdd_4 = sc.parallelize([])
rdd_4.collect()

"""
Output:

[]

"""

How to check whether a PySpark RDD has values in Azure Databricks?

Let’s see how to check whether a PySpark RDD contains any values, using a practical example.

Example:

# Non-empty RDDs
print(f"First RDD is empty: {rdd_1.isEmpty()}")
print(f"Second RDD is empty: {rdd_2.isEmpty()}")

# Empty RDDs
print(f"Third RDD is empty: {rdd_3.isEmpty()}")
print(f"Fourth RDD is empty: {rdd_4.isEmpty()}")

"""
Output:

First RDD is empty: False
Second RDD is empty: False
Third RDD is empty: True
Fourth RDD is empty: True

"""

With the help of the RDD’s isEmpty() function, we can find out whether an RDD has any values in it.

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you use the parallelize() function in Azure Databricks?

These could be the possible reasons:

  1. To create an RDD in PySpark
  2. To create an RDD by defining the number of partitions in PySpark

Real World Use Case Scenarios for PySpark parallelize() function in Azure Databricks?

Assume you are given a Python collection, such as a list, that has to be converted into an RDD for parallel processing. In this scenario, PySpark’s parallelize() creates the RDD from that collection, and you can specify the number of partitions right at creation time.

What are the alternatives for creating PySpark RDD in Azure Databricks?

There are multiple alternatives for creating a PySpark RDD, which are as follows:

  • Reading from files
  • Transforming an existing RDD

Final Thoughts

In this article, we have learned how to create a PySpark RDD in Azure Databricks with the parallelize() function, how to create an empty RDD, and how to check whether an RDD has values, each with a practical example. I hope you found the information useful.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.