How to convert RDD to DataFrame in PySpark Azure Databricks?

Are you looking to convert a PySpark RDD into a DataFrame in Azure Databricks, or to build a PySpark DataFrame out of an existing RDD? If so, you have landed on the right page. In this post, I will walk you through several practical examples, step by step, using the functions PySpark provides for this conversion.

In this blog, I will teach you the following with practical examples:

  • Syntax of toDF() and createDataFrame()
  • Converting RDD into DataFrame using toDF()
  • Converting RDD into DataFrame using createDataFrame()

The PySpark toDF() and createDataFrame() functions create a DataFrame from an existing RDD or collection of data, optionally with specified column names, in PySpark Azure Databricks.

Syntax:

rdd.toDF()

spark.createDataFrame()

What is the syntax of the toDF() function in PySpark Azure Databricks?

The syntax is as follows:

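rdd.toDF(schema, sampleRatio)

Parameter Name | Required | Description
schema (DataType, str, list) | Optional | The column names, or the full schema, of the new DataFrame.
sampleRatio (float) | Optional | The ratio of rows sampled when inferring the schema.

Table 1: toDF() Method in PySpark Databricks Parameter list with Details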

Apache Spark Official documentation link: toDF()

What is the syntax of the createDataFrame() function in PySpark Azure Databricks?

The syntax is as follows:

spark.createDataFrame(data, schema, samplingRatio, verifySchema)

Parameter Name | Required | Description
data (RDD, iterable) | Yes | The data to be converted into a DataFrame.
schema (str, list, DataType) | Optional | The structure of the DataFrame.
samplingRatio (float) | Optional | The ratio of rows sampled when inferring the schema.
verifySchema (bool) | Optional | Verifies the data type of every row against the schema. It is enabled by default.

Table 2: createDataFrame() Method in PySpark Databricks Parameter list with Details

Apache Spark Official documentation link: createDataFrame()

Gentle reminder:

In Databricks,

  • The SparkSession is made available as spark
  • The SparkContext is made available as sc

If you want to create them manually, use the code below.

from pyspark.sql import SparkSession

# Build (or reuse) a local SparkSession
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("azurelib.com")
    .getOrCreate()
)

# The SparkContext is exposed through the session
sc = spark.sparkContext

Create a simple RDD

Before getting started, let’s create an RDD which will be helpful in the upcoming code samples.

columns = ["name", "dept", "salary"]
employees_data = [
    ("Kumar", "Sales", 25000),
    ("Shankar", "IT", 32000),
    ("Kavitha", "HR", 27000)
]

rdd = sc.parallelize(employees_data)
rdd.collect()

"""
Output:

[('Kumar', 'Sales', 25000), ('Shankar', 'IT', 32000), ('Kavitha', 'HR', 27000)]

"""

How to convert PySpark RDD into DataFrame in Azure Databricks using toDF() function?

Let’s see how to convert a PySpark RDD into a DataFrame in Azure Databricks using the toDF() function.

Example 1:

# a) without column name
df1 = rdd.toDF()
df1.printSchema()
df1.show()

"""
Output:

root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: long (nullable = true)

+-------+-----+-----+
|     _1|   _2|   _3|
+-------+-----+-----+
|  Kumar|Sales|25000|
|Shankar|   IT|32000|
|Kavitha|   HR|27000|
+-------+-----+-----+

"""

As you can see, when we don’t specify any column names, PySpark names the columns _1, _2, and so on. To name them yourself, pass a list of column names through the ‘schema’ parameter.

Example 2:

# b) Using toDF() with column name
df2 = rdd.toDF(schema=columns)
df2.printSchema()
df2.show()

"""
Output:

root
 |-- name: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- salary: long (nullable = true)

+-------+-----+------+
|   name| dept|salary|
+-------+-----+------+
|  Kumar|Sales| 25000|
|Shankar|   IT| 32000|
|Kavitha|   HR| 27000|
+-------+-----+------+

"""

How to convert PySpark RDD into DataFrame in Azure Databricks using createDataFrame() function?

Let’s see how to convert a PySpark RDD into a DataFrame in Azure Databricks using the createDataFrame() function.

Example 1:

# a) without column name
df3 = spark.createDataFrame(rdd)
df3.printSchema()
df3.show()

"""
Output:

root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: long (nullable = true)

+-------+-----+-----+
|     _1|   _2|   _3|
+-------+-----+-----+
|  Kumar|Sales|25000|
|Shankar|   IT|32000|
|Kavitha|   HR|27000|
+-------+-----+-----+

"""

The same thing happens here: when we don’t specify any column names, PySpark names the columns _1, _2, and so on. To name them yourself, pass a list of column names through the ‘schema’ parameter.

Example 2:

# b) with column name
df4 = spark.createDataFrame(rdd, schema=columns)
df4.printSchema()
df4.show()

"""
Output:

root
 |-- name: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- salary: long (nullable = true)

+-------+-----+------+
|   name| dept|salary|
+-------+-----+------+
|  Kumar|Sales| 25000|
|Shankar|   IT| 32000|
|Kavitha|   HR| 27000|
+-------+-----+------+

"""

The complete code used in this blog is available in notebook format at this GitHub link. You can download the notebook and import it into Databricks, Jupyter Notebook, etc.

When should you convert PySpark RDD to DataFrame in Azure Databricks?

These are the most common reasons:

  1. The DataFrame API offers higher-level operations, such as select, filter, groupBy, and join, than the RDD API.
  2. Data is organized into named columns, similar to database tables.
  3. DataFrame operations go through the Catalyst optimizer, which optimizes the query plan before execution.

Real-world use case scenarios for converting PySpark RDD to DataFrame in Azure Databricks

Assume that you are given an RDD and need to run some aggregations on it. With the RDD API, even simple grouping and aggregation have to be written by hand as key-value transformations, and they do not benefit from the Catalyst optimizer, so they often take longer both to write and to run. The DataFrame API is simpler to use, and for exploratory analysis it makes computing aggregated statistics on sizable datasets quicker.

What are the alternatives to convert PySpark RDD to DataFrame in Azure Databricks?

There are two ways to convert a PySpark RDD to a DataFrame, which are as follows:

  • toDF()
  • spark.createDataFrame()

Final Thoughts

In this article, we have learned how to convert a PySpark RDD into a DataFrame in Azure Databricks, using toDF() and createDataFrame(), both with and without column names. I hope the information provided here was helpful.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.