How to create DataFrames in PySpark Azure Databricks?

Are you looking to find out how to create a PySpark DataFrame from collections of data in Azure Databricks, or maybe you are looking for a way to create a PySpark DataFrame by reading a data source? If you are looking for a solution to any of these problems, you have landed on the correct page. I will also show you how to use the different PySpark functions with multiple examples in Azure Databricks, each explained with a practical example. So, without wasting time, let's start with a step-by-step guide to understanding how to create a DataFrame using various functions in PySpark.

In this blog, I will teach you the following with practical examples:

  • Syntax of toDF()
  • Creating DataFrame from existing RDD
  • Creating DataFrame from the Collections
  • Creating DataFrame from reading files

The PySpark toDF() and createDataFrame() functions are used to manually create DataFrames from an existing RDD or collection of data with specified column names in PySpark Azure Databricks.

Syntax:

data_frame.toDF()

spark.createDataFrame()

What is the syntax of the toDF() function in PySpark Azure Databricks?

The syntax is as follows:

data_frame.toDF(column_names)
Parameter Name                  Required    Description
column_names (str, Column)      Optional    It represents the new column names.

Table 1: toDF() Method in PySpark Databricks Parameter list with Details

Apache Spark Official documentation link: toDF()
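
For example, toDF() is commonly used to replace all the column names of an existing DataFrame in a single call. Here is a minimal sketch (the sample data and the new names are illustrative assumptions, not from the examples later in this article):

# Minimal sketch: toDF() called on an existing DataFrame replaces every
# column name positionally (the sample data below is an illustrative assumption)
df = spark.createDataFrame([(1, "Arjun", 23)])   # columns default to _1, _2, _3
renamed_df = df.toDF("id", "name", "age")
renamed_df.printSchema()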

What is the syntax of the createDataFrame() function in PySpark Azure Databricks?

The syntax is as follows:

spark.createDataFrame(data, schema, samplingRatio, verifySchema)

Parameter Name                  Required    Description
data (RDD, iterable)            Yes         It represents the data that has to be converted into a DataFrame.
schema (str, list, DataType)    Optional    It represents the structure of the DataFrame.
samplingRatio (float)           Optional    It represents the sampling ratio of rows used for schema inference.
verifySchema (bool)             Optional    It verifies each row against the schema; enabled by default.

Table 2: createDataFrame() Method in PySpark Databricks Parameter list with Details

Apache Spark Official documentation link: createDataFrame()
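
The optional samplingRatio and verifySchema parameters are not demonstrated later in this article, so here is a minimal sketch of how they can be used (the sample rows and the DDL schema string are illustrative assumptions):

# Minimal sketch of the optional createDataFrame() parameters
# (illustrative data, not taken from the article's examples)
sample_rows = [(1, "Arjun", 23), (2, "Sandhiya", 25)]

# schema can also be a DDL-formatted string; verifySchema=False skips
# checking every row against that schema
df_unverified = spark.createDataFrame(sample_rows,
                                      schema="id long, name string, age long",
                                      verifySchema=False)
df_unverified.printSchema()

# samplingRatio controls the fraction of rows scanned when the schema has to be
# inferred from an RDD (it has no effect on plain Python lists)
df_inferred = spark.createDataFrame(spark.sparkContext.parallelize(sample_rows),
                                    samplingRatio=0.5)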

Gentle reminder:

In Databricks,

  • SparkSession is made available as spark
  • SparkContext is made available as sc

In case you want to create them manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

Before getting started, let’s create an RDD which will be helpful in the upcoming code samples.

columns = ["id","Name","Age"]
students_data_1 = [
    (1, "Arjun", 23),
    (2, "Sandhiya", 25),
    (3, "Ranjith", 27)
]

rdd = sc.parallelize(students_data)
rdd.collect()

"""
Output:

[(1, 'Arjun', 23), (2, 'Sandhiya', 25)]

"""

How to create a PySpark DataFrame from an existing RDD in Azure Databricks?

Let’s see how to create a PySpark DataFrame from an existing RDD in Azure Databricks using various methods.

Example using toDF() function:

# a) without column name
df1 = rdd.toDF()
df1.printSchema()
df1.show()

"""
Output:

root
 |-- _1: long (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: long (nullable = true)

+---+--------+---+
| _1|      _2| _3|
+---+--------+---+
|  1|   Arjun| 23|
|  2|Sandhiya| 25|
+---+--------+---+

"""

As you can see, when we don’t specify any column names, PySpark names the columns _1, _2, and so on. To specify the column names, pass them using the ‘schema’ parameter.

# b) Using toDF() with column name
df2 = rdd.toDF(schema=columns)
df2.printSchema()
df2.show()

"""
Output:

root
 |-- id: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)

+---+--------+---+
| id|    Name|Age|
+---+--------+---+
|  1|   Arjun| 23|
|  2|Sandhiya| 25|
+---+--------+---+

"""

Example using createDataFrame() function:

# a) without column name
df3 = spark.createDataFrame(rdd)
df3.printSchema()
df3.show()

"""
Output:

root
 |-- _1: long (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: long (nullable = true)

+---+--------+---+
| _1|      _2| _3|
+---+--------+---+
|  1|   Arjun| 23|
|  2|Sandhiya| 25|
+---+--------+---+

"""

The same thing happens here: when we don’t specify any column names, PySpark names the columns _1, _2, and so on. To specify the column names, pass them using the ‘schema’ parameter.

# b) with column name
df4 = spark.createDataFrame(rdd, schema=columns)
df4.printSchema()
df4.show()

"""
Output:

root
 |-- id: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)

+---+--------+---+
| id|    Name|Age|
+---+--------+---+
|  1|   Arjun| 23|
|  2|Sandhiya| 25|
+---+--------+---+

"""

How to create a PySpark DataFrame from different collections in Azure Databricks?

Let’s see how to create a PySpark DataFrame from different collections in Azure Databricks.

Examples:

# 1. From list
list_data = [
    ["Arun", "Kumar", 13], 
    ["Janani", "Shree", 25]]

spark.createDataFrame(list_data, schema=["first_name", "last_name", "age"]).show()

"""
Output:

+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|      Arun|    Kumar| 13|
|    Janani|    Shree| 25|
+----------+---------+---+

"""
# 2. From Tuple
tuple_data = [
    ("Charan", "Nagesh", 32), 
    ["Indhu", "Madhi", 26]]

spark.createDataFrame(tuple_data, schema=["first_name", "last_name", "age"]).show()

"""
Output:

+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|    Charan|   Nagesh| 32|
|     Indhu|    Madhi| 26|
+----------+---------+---+

"""
# 3. From key value pair
dict_data = [
    {'first_name': 'Balaji', 'last_name': 'Pandian', 'age': 46},
    {'first_name': 'Elango', 'last_name': 'Dharan', 'age': 41}
]
spark.createDataFrame(dict_data).show()

"""
Output:

+---+----------+---------+
|age|first_name|last_name|
+---+----------+---------+
| 46|    Balaji|  Pandian|
| 41|    Elango|   Dharan|
+---+----------+---------+

"""

Whenever we create a DataFrame from a collection of dictionaries, the column order is taken from the dictionary keys rather than the order we wrote them in; as the output above shows, the columns come out in alphabetical key order. Therefore, don’t pass a plain list of column names as the schema here, as it might result in a column and data mismatch. If you need a fixed column order, pass an explicit StructType instead, as sketched below.
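
Here is a minimal sketch of pinning the column order with an explicit StructType (the field names simply mirror the dictionary keys used above):

# Minimal sketch: controlling the column order with an explicit StructType
from pyspark.sql.types import StructType, StructField, StringType, LongType

dict_schema = StructType([
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("age", LongType(), True),
])

# With a StructType schema, each dictionary is matched to the fields by key
spark.createDataFrame(dict_data, schema=dict_schema).show()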

# 4. Using Row
from pyspark.sql import Row

row_data = [
    Row("Nithesh", "Khan", 14), 
    Row("Akash", "Sharma", 14)]

spark.createDataFrame(row_data, schema=["first_name", "last_name", "age"]).show()

"""
Output:

+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|   Nithesh|     Khan| 14|
|     Akash|   Sharma| 14|
+----------+---------+---+

"""
# 5. Using Row with key value arguments

from pyspark.sql import Row

row_kv_data = [
    Row(first_name="Manik", last_name="Basha", age=54),
    Row(first_name="Vijay", last_name="Antony", age=36)
]
spark.createDataFrame(row_kv_data).show()

"""
Output:

+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|     Manik|    Basha| 54|
|     Vijay|   Antony| 36|
+----------+---------+---+

"""

How to create a PySpark DataFrame by reading files in Azure Databricks?

Let’s see how to create a PySpark DataFrame by reading files in Azure Databricks. With the help of the PySpark DataFrameReader (spark.read), we can read files of various types, as shown below.

Examples:

# 1. From CSV file
spark.read.csv(r"C:\Users\USER\Desktop\sample_csv.csv")

# 2. From Text file
spark.read.text(r"C:\Users\USER\Desktop\sample_text.txt")

# 3. From JSON file
spark.read.json(r"C:\Users\USER\Desktop\sample_json.json")

# 4. From Parquet file
spark.read.parquet(r"C:\Users\USER\Desktop\sample_parquet.parquet")

I have attached the complete code used in this blog in notebook format at this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you use toDF() & createDataFrame() in Azure Databricks?

These could be the possible reasons:

  1. For converting RDD into DataFrame
  2. For converting a pandas DataFrame into a PySpark DataFrame (see the sketch below)
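
For the second case, here is a minimal sketch of converting a pandas DataFrame into a PySpark DataFrame (the pandas data below is an illustrative assumption):

import pandas as pd

# Illustrative pandas DataFrame (assumed data, not taken from the article)
pandas_df = pd.DataFrame({
    "first_name": ["Arjun", "Sandhiya"],
    "age": [23, 25]
})

# spark.createDataFrame() accepts a pandas DataFrame directly
spark_df = spark.createDataFrame(pandas_df)
spark_df.show()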

Real World Use Case Scenarios for using toDF() & createDataFrame() in Azure Databricks

  • Assume that you were given an RDD and asked to perform some aggregation on top of it. With the RDD API, even simple grouping and aggregation procedures take longer to write and run. The DataFrame API is much simpler to use, and for exploratory analysis, computing aggregated statistics on sizable datasets is quicker.
  • You have a pandas DataFrame, and to make processing faster you might need to convert it into a PySpark DataFrame, because PySpark runs stages and tasks across multiple executors and nodes.

What are the alternatives for creating a DataFrame in PySpark Azure Databricks?

There are multiple alternatives for creating a DataFrame in PySpark Azure Databricks, which are as follows:

  • toDF()
  • spark.createDataFrame()

These alternatives were discussed with multiple examples in the above section.

Final Thoughts

In this article, we have learned about the PySpark toDF() and createDataFrame() methods for creating a DataFrame in Azure Databricks, along with clearly explained examples. I have also covered the different scenarios that could come up, with practical examples. I hope the information provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.