How to collect all elements in PySpark Azure Databricks?

Are you looking to find how to collect all elements of PySpark DataFrame into Azure Databricks cloud or maybe you are looking for a solution, to fetch all records of a DataFrame in PySpark Databricks? If you are looking for any of these problem solutions, then you have landed on the correct page. I will also show you what and how to use PySpark to get all records of a DataFrame in Azure Databricks. I will explain it by taking a practical example. So don’t waste time let’s start step by step guide to understanding how to get all elements using PySpark DataFrame.

In this blog, I will teach you the following with practical examples:

  • Syntax of collect()
  • Collecting RDD and DataFrame records
  • Making use of collected records
  • Collecting specific records

collect() method is used in retrieving all the elements of the row from each partition in a dataframe and brings that over the driver node/program in PySpark Azure Databricks.

Syntax: dataframe_name.collect()

What is the syntax of the collect() function in PySpark Azure Databricks?

The syntax is as follows:

dataframe_name.collect()

Official Apache Spark documentation link: collect()

How to get all records of RDD and DataFrame in PySpark Azure Databricks?

Collect() is a RDD and Dataframe function and operation used to retrieve data from a Dataframe. It is useful for retrieving all the elements of the row from each partition in an RDD and bringing them over to the driver node/program.

Examples:

# 1. RDD
rdd = df.rdd
rdd.collect()

# 2. DataFrame
df.rdd.collect()

# The above code s generate the following output

"""
Output:

[
Row(subject_id='SUB001', subject_name='Tamil'),
 Row(subject_id='SUB002', subject_name='English'),
 Row(subject_id='SUB003', subject_name='Maths'),
 Row(subject_id='SUB004', subject_name='Science'),
 Row(subject_id='SUB005', subject_name='Social')
]

"""

How to make use of collected records in PySpark Azure Databricks?

Since the collect() function is an action, it returns an Array of Rows, which you can iterate through using the Python loop function.

Example:

for record in df.collect():
    print(f"The subject id of {record.subject_name} is {record.subject_id.upper()}.")

"""
Output:

The subject id of Tamil is SUB001.
The subject id of English is SUB002.
The subject id of Maths is SUB003.
The subject id of Science is SUB004.
The subject id of Social is SUB005.

"""

How to get particular records out of all records in PySpark Azure Databricks using the collect() function?

In this section, let’s try to get the first two rows using the count() function as an example.

Example:

df.collect()[:2]

"""
Output:

[
 Row(subject_id='SUB001', subject_name='Tamil'),
 Row(subject_id='SUB002', subject_name='English')
]

"""

How to fetch data only for the first row and first column from a DataFrame?

Example:

df.collect()[0][0]

"""
Output:

'sub001'

"""

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook in databricks, jupyter notebook, etc.

When should you use the PySpark collect() function in Azure Databricks?

You can use the collect() function to retrieve all elements from a small Dataframe. Because the collect() function is an action that instructs Spark to perform computation and send the result back to the driver. This may lead to Out of Memory issues.

Real World Use Case Scenarios for PySpark DataFrame collect() function in Azure Databricks?

  • You will use the collect method only when the DataFrame is small in size.

What are the alternatives of the collect() function in PySpark Azure Databricks?

There are multiple ways to fetch the records from the DataFrame. For example:

  • Take() method to fetch n number of records
  • Show() method to display the records

What is the disadvantage of using the collect() function in pySpark Azure Databricks?

The collect() method brings all the records from the DataFrame on the driver node. If the DataFrame contains a huge amount of data, then there would be chances that the driver node goes out of memory and crashed.

Final Thoughts

In this article, we have learned about the PySpark collect() method to retrieve all elements of DataFrame in Azure Databricks along with the examples explained clearly. I have also covered different scenarios with practical examples that could be possible. I hope the information that was provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.