How to convert DataFrame from Pandas to PySpark in Azure Databricks?

Are you looking to convert a Pandas DataFrame into a PySpark DataFrame in the Azure Databricks cloud, or maybe for a way to change the column names while converting a Pandas DataFrame to a PySpark DataFrame using the createDataFrame() function? If you are looking for a solution to any of these problems, you have landed on the correct page. I will also show you how to change the column names and column datatypes while converting, and I will explain it with practical examples. So without wasting any time, let’s start with a step-by-step guide to understanding how to convert Pandas to PySpark DataFrames in Azure Databricks.

In this blog, I will teach you the following with practical examples:

  • Syntax of createDataFrame() function
  • Converting Pandas to PySpark DataFrame
  • Changing column datatype while converting

The PySpark createDataFrame() function is used to manually create a DataFrame from an existing RDD, a collection of data, or a pandas DataFrame, optionally with specified column names, in PySpark on Azure Databricks.

Syntax:

spark.createDataFrame()

What is the syntax of the createDataFrame() function in PySpark Azure Databricks?

The syntax is as follows:

spark.createDataFrame(data, schema, samplingRatio, verifySchema)
Parameter Name               | Required | Description
data (RDD, iterable)         | Yes      | The data to be converted into a DataFrame.
schema (str, list, DataType) | Optional | The structure of the DataFrame.
samplingRatio (float)        | Optional | The ratio of rows sampled when inferring the schema.
verifySchema (bool)          | Optional | Verifies each row against the schema; enabled by default.
Table 1: createDataFrame() Method in PySpark Databricks Parameter list with Details

Apache Spark Official Documentation Link: createDataFrame()
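As a quick illustration of the optional parameters, the sketch below passes the schema as a DDL string and turns off per-row verification with verifySchema (the sample rows here are made up for demonstration). Note that samplingRatio only comes into play when the schema is left out and has to be inferred from an RDD.

# A minimal sketch of the optional parameters (illustrative values only)
rows = [(1, "Anand"), (2, "Bernald")]

df = spark.createDataFrame(
    rows,
    schema="id INT, name STRING",  # schema given as a DDL string
    verifySchema=False,            # skip verifying each row against the schema
)
df.show()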

Gentle reminder:

In Databricks,

  • the SparkSession is made available as spark
  • the SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

Create a simple Pandas DataFrame

Let’s start by creating a simple Pandas DataFrame using the pandas module and functions.

import pandas as pd

data = [
    [1, "Anand"],
    [2, "Bernald"],
    [3, "Chandran"],
    [4, "Delisha"],
    [5, "Maran"],
]

pandas_df = pd.DataFrame(data, columns=["id", "name"])
print(pandas_df)

"""
   id      name
0   1     Anand
1   2   Bernald
2   3  Chandran
3   4   Delisha
4   5     Maran
"""

How to convert Pandas to PySpark DataFrame in Azure Databricks?

Let’s see how to convert Pandas to PySpark DataFrame in Azure Databricks with a practical example.

Example:

pyspark_df = spark.createDataFrame(pandas_df)
pyspark_df.show()

"""
Output:

+---+--------+
| id|    name|
+---+--------+
|  1|   Anand|
|  2| Bernald|
|  3|Chandran|
|  4| Delisha|
|  5|   Maran|
+---+--------+

"""

By passing the pandas DataFrame to the createDataFrame() function, we were able to convert it from a Pandas to a PySpark DataFrame. Note that the column names and datatypes were inferred from the pandas DataFrame.
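If you ever need to go the other way, PySpark’s toPandas() method collects the distributed DataFrame back to the driver as a pandas DataFrame; use it only when the result is small enough to fit in the driver’s memory. A minimal sketch:

# Collect the distributed DataFrame back onto the driver as a pandas DataFrame
back_to_pandas = pyspark_df.toPandas()
print(type(back_to_pandas))  # <class 'pandas.core.frame.DataFrame'>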

How to change column names and datatypes while converting Pandas to PySpark DataFrame in Azure Databricks?

Let’s see how to change column names and datatypes while converting Pandas to PySpark DataFrame in Azure Databricks with some practical examples.

Example 1:

# Method 1: Using StructType

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

new_schema = StructType([
    StructField("emp_id", IntegerType()),
    StructField("emp_name", StringType()),
])

pyspark_df = spark.createDataFrame(pandas_df, schema=new_schema)
pyspark_df.printSchema()
pyspark_df.show()

"""
Output:

root
 |-- emp_id: integer (nullable = true)
 |-- emp_name: string (nullable = true)

+------+--------+
|emp_id|emp_name|
+------+--------+
|     1|   Anand|
|     2| Bernald|
|     3|Chandran|
|     4| Delisha|
|     5|   Maran|
+------+--------+

"""

Example 2:

# Method 2: Using DDL Format

ddl_schema = "empId STRING, empName STRING"

pyspark_df = spark.createDataFrame(pandas_df, schema=ddl_schema)
pyspark_df.printSchema()
pyspark_df.show()

"""
Output:

root
 |-- empId: string (nullable = true)
 |-- empName: string (nullable = true)

+-----+--------+
|empId| empName|
+-----+--------+
|    1|   Anand|
|    2| Bernald|
|    3|Chandran|
|    4| Delisha|
|    5|   Maran|
+-----+--------+

"""

In the above two examples, we have used both a StructType and a DDL-format schema to change the column names and datatypes while converting Pandas to PySpark DataFrame.
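As an alternative (a sketch, not one of the methods shown above), you can also convert first and then rename and cast the columns on the PySpark side with toDF() and cast():

from pyspark.sql.functions import col

renamed_df = (
    spark.createDataFrame(pandas_df)
    .toDF("emp_id", "emp_name")  # rename all columns positionally
    .withColumn("emp_id", col("emp_id").cast("string"))  # change the datatype
)
renamed_df.printSchema()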

I have attached the complete code used in this blog in a notebook format in this GitHub link. You can download and import this notebook in Databricks, Jupyter Notebook, etc.

When should you convert Pandas to PySpark DataFrame in Azure Databricks?

When working with a large dataset, a Python pandas DataFrame is not good enough to perform complex transformation operations on big data, because it is limited to a single machine. For this reason, if you have a Spark cluster, it’s preferable to convert the pandas DataFrame to a PySpark DataFrame, perform the complex transformations on the Spark cluster, and then convert the result back to pandas if needed.
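Databricks Runtime enables Apache Arrow for these conversions by default, which speeds up the data transfer between pandas and Spark considerably. On a plain Spark 3.x setup you can turn it on yourself; a minimal sketch:

# Enable Arrow-based columnar data transfer for pandas <-> Spark conversions
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pyspark_df = spark.createDataFrame(pandas_df)  # now uses Arrow under the hood
result_pandas_df = pyspark_df.toPandas()       # and so does the reverse direction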

What are the real-world use case scenarios for converting Pandas to PySpark DataFrame in Azure Databricks?

Assume that you have a huge dataset to process and have already written all the transformations in pandas, but the processing is too slow compared to PySpark computing, because PySpark supports parallel processing and thus results in faster computation.
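For example, here is a minimal sketch (reusing the sample pandas_df from above) of an aggregation that runs in parallel across the cluster once the data lives in a PySpark DataFrame:

from pyspark.sql.functions import count

pyspark_df = spark.createDataFrame(pandas_df)

# The aggregation is distributed across the cluster's executors
pyspark_df.groupBy("name").agg(count("id").alias("id_count")).show()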

What are the alternatives to convert Pandas to PySpark DataFrame in Azure Databricks?

The best way to convert a Pandas DataFrame into a PySpark DataFrame is to pass the Pandas DataFrame to the createDataFrame() function.
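If you are on Spark 3.2 or later, the pandas API on Spark (the pyspark.pandas module) offers another route; a minimal sketch, assuming that module is available in your runtime:

import pyspark.pandas as ps

ps_df = ps.from_pandas(pandas_df)  # distributed pandas-on-Spark DataFrame
pyspark_df = ps_df.to_spark()      # plain PySpark DataFrame
pyspark_df.show()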

Final Thoughts

In this article, we have learned about converting Pandas to PySpark DataFrames in Azure Databricks, along with clearly explained examples. I have also covered the different scenarios that could come up, with practical examples. I hope the information provided helped you gain knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.