Explained: How to convert PySpark RDD to DataFrame in Databricks

Are you looking for a way to convert PySpark RDDs into DataFrames in Azure Databricks, or perhaps for a solution to create a DataFrame from a PySpark RDD in Databricks? If you are looking to solve either of these problems, you have landed on the correct page. I will show you which conversion functions PySpark provides and how to use them, explained with practical examples. So without wasting time, let’s start the step-by-step guide to understanding PySpark RDD to DataFrame conversion in Azure Databricks.

There are two ways to convert a PySpark RDD to a DataFrame in Databricks:

1. Use the rdd.toDF() function to convert the PySpark RDD to a DataFrame.

2. Use the createDataFrame() function, passing the RDD to it.

What is an RDD in PySpark?

RDD stands for Resilient Distributed Dataset. It is the lowest level of abstraction for representing data in Spark and Databricks.
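
For illustration, here is a minimal sketch of working with an RDD directly; it assumes the spark session object that a Databricks notebook creates automatically:

#Create an RDD from a Python list and run a simple transformation on it
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
squares_rdd = numbers_rdd.map(lambda x: x * x)
print(squares_rdd.collect())   #prints [1, 4, 9, 16, 25]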

What is a DataFrame in Databricks?

DataFrames are tables of data organized into rows and columns in Databricks. A DataFrame is a two-dimensional structure in which each column contains the values of a specific variable, while each row contains one set of values, one from each column.
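
Similarly, here is a minimal sketch that builds a small DataFrame directly from Python data, again assuming the spark session that a Databricks notebook provides automatically:

#Build a two-column DataFrame from a list of tuples and display it
df = spark.createDataFrame([("John", "AWS"), ("Mike", "GCP")], ["employee_name", "expertise"])
df.show()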

What are the different ways to convert a PySpark RDD to a DataFrame in Azure Databricks, with examples?

There are two common ways to convert a PySpark RDD to a DataFrame. Let’s look at both of them in detail with the examples below.

Solution 1: Use the rdd.toDF() function to convert the PySpark RDD to a DataFrame.

PySpark provides the toDF() function on RDDs, which can be used to convert an RDD into a DataFrame in Databricks.

Let’s first create an RDD in PySpark on Databricks:

import pyspark
from pyspark.sql import SparkSession

#The line below is needed when you run the code outside Databricks, where a spark session object is not created automatically
#spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

#Create a list to hold the sample data
data = [("John","AWS"),("Mike","GCP"),("Harry","Spark R"),("Robin","Spark scala")]

#use the parallelize() function to convert the list to an RDD
rdd = spark.sparkContext.parallelize(data)

Create a DataFrame from the RDD in PySpark on Databricks:

df = rdd.toDF()
df.printSchema()
df.show()

"""
Output

 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)


+-----+-----------+
|   _1|         _2|
+-----+-----------+
| John|        AWS|
| Mike|        GCP|
|Harry|    Spark R|
|Robin|Spark scala|
+-----+-----------+

"""

Solution 2: Use the createDataFrame() function to convert the RDD to a DataFrame by passing the RDD to it

The createDataFrame() function takes two arguments:

Parameter Name    Parameter Description
rdd               The RDD that needs to be converted to a DataFrame
schema            The schema definition (column names)

Table 1: createDataFrame() function arguments

Let’s see the example below:

import pyspark
from pyspark.sql import SparkSession
data = [("John","AWS"),("Mike","GCP"),("Harry","Spark R"),("Robin","Spark scala")]
rdd = spark.sparkContext.parallelize(data)
employeeSchema =  ["employee_name","expertise"]

#pass both the rdd and the schema to convert the rdd to a DataFrame
df = spark.createDataFrame(rdd, schema = employeeSchema)
df.printSchema()
df.show()

"""
Output::

 |-- employee_name: string (nullable = true)
 |-- expertise: string (nullable = true)

+-------------+-----------+
|employee_name|  expertise|
+-------------+-----------+
|         John|        AWS|
|         Mike|        GCP|
|        Harry|    Spark R|
|        Robin|Spark scala|
+-------------+-----------+

"""

How to convert an RDD to a DataFrame in PySpark on Databricks with a schema example?

There are two ways to convert a PySpark RDD to a DataFrame with a schema in Databricks.

Solution 1: Using the toDF() function

import pyspark
from pyspark.sql import SparkSession

#spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = [("John","AWS"),("Mike","GCP"),("Harry","Spark R"),("Robin","Spark scala")]
rdd = spark.sparkContext.parallelize(data)
employeeSchema =  ["employee_name","expertise"]
#pass the column names to toDF() so the DataFrame uses the defined schema
df = rdd.toDF(employeeSchema)
df.printSchema()
df.show()

"""

Output::

 |-- employee_name: string (nullable = true)
 |-- expertise: string (nullable = true)

+-------------+-----------+
|employee_name|  expertise|
+-------------+-----------+
|         John|        AWS|
|         Mike|        GCP|
|        Harry|    Spark R|
|        Robin|Spark scala|
+-------------+-----------+

"""

Solution 2: Using the createDataFrame() function

The syntax is given below with an example:

#Example-2

import pyspark
from pyspark.sql import SparkSession
data = [("John","AWS"),("Mike","GCP"),("Harry","Spark R"),("Robin","Spark scala")]
rdd = spark.sparkContext.parallelize(data)
employeeSchema =  ["employee_name","expertise"]
df = spark.createDataFrame(rdd, schema = employeeSchema)
df.printSchema()
df.show()

"""
Output::

 |-- employee_name: string (nullable = true)
 |-- expertise: string (nullable = true)

+-------------+-----------+
|employee_name|  expertise|
+-------------+-----------+
|         John|        AWS|
|         Mike|        GCP|
|        Harry|    Spark R|
|        Robin|Spark scala|
+-------------+-----------+
"""

When and why should you convert a PySpark RDD to a DataFrame in Databricks?

There are certain scenarios in which it is recommended to convert a PySpark RDD to a DataFrame within Databricks, such as the following:

  • The data size is small and you have created the RDD manually by passing in the data. In this scenario, you can convert the RDD to a DataFrame and continue working through the DataFrame API (see the sketch after this list).
  • You have old legacy code written against the RDD API and now want to leverage the latest Spark functionality; in that case it makes sense to convert the RDDs to DataFrames.
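
As a rough sketch of the first scenario, you can build the RDD by hand, convert it once, and then work through the optimized DataFrame API (the column names below are only illustrative, and the spark session is assumed to exist as it does in a Databricks notebook):

data = [("John","AWS"),("Mike","GCP"),("Harry","Spark R")]
rdd = spark.sparkContext.parallelize(data)

#Convert the hand-built RDD once, then query it through the DataFrame API
df = rdd.toDF(["employee_name","expertise"])
df.filter(df.expertise == "AWS").show()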

Real-world use case scenarios for PySpark RDD to DataFrame conversion in Databricks

  • Legacy code that was written for a Spark version lower than 2.0 and that you are now upgrading. You can update the code to use the DataFrame capabilities of the Spark engine so that it runs in a more optimized manner.

toDF() Function Official Documentation Link

Final Thoughts

In this article, we have learned about PySpark RDD to DataFrame conversion in Azure Databricks, with clearly explained examples. I have also covered the different scenarios, with practical examples, where this conversion can be useful. I hope the information provided helped you gain some knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

Deepak Goyal

Deepak Goyal is a certified Azure Cloud Solution Architect. He has around a decade and a half of experience in designing, developing, and managing enterprise cloud solutions. He is also a Big Data certified professional and a passionate cloud advocate.