Are you looking to convert a Pandas DataFrame to a PySpark DataFrame in Azure Databricks, or perhaps for a way to change the column names while converting a Pandas DataFrame to a PySpark DataFrame using the createDataFrame() function? If you are looking for a solution to either of these problems, you have landed on the correct page. I will also show you how to change the column names and column datatypes while converting, and I will explain everything with practical examples. So without wasting time, let’s start with a step-by-step guide to understanding how to convert Pandas to PySpark DataFrame in Azure Databricks.
In this blog, I will teach you the following with practical examples:
- Syntax of createDataFrame() function
- Converting Pandas to PySpark DataFrame
- Changing column datatype while converting
The PySpark createDataFrame() function is used to manually create a DataFrame from an existing RDD, a collection of data, or a Pandas DataFrame, with optionally specified column names, in PySpark on Azure Databricks.
Syntax:
spark.createDataFrame()
Contents
- 1 What is the syntax of the createDataFrame() function in PySpark Azure Databricks?
- 2 Create a simple Pandas DataFrame
- 3 How to convert Pandas to PySpark DataFrame in Azure Databricks?
- 4 How to change column name and datatype while converting Pandas to PySpark DataFrame in Azure Databricks?
- 5 When should you convert Pandas to PySpark DataFrame in Azure Databricks?
- 6 Real World Use Case Scenarios for converting Pandas to PySpark DataFrame in Azure Databricks?
- 7 What are the alternatives to convert Pandas to PySpark DataFrame in Azure Databricks?
- 8 Final Thoughts
What is the syntax of the createDataFrame() function in PySpark Azure Databricks?
The syntax is as follows:
spark.createDataFrame(data, schema, samplingRatio, verifySchema)
| Parameter Name | Required | Description |
|---|---|---|
| data (RDD, iterable) | Yes | The data to be converted into a DataFrame. |
| schema (str, list, DataType) | Optional | The structure of the DataFrame. |
| samplingRatio (float) | Optional | The sampling ratio of rows used when inferring the schema. |
| verifySchema (bool) | Optional | Verifies each row against the schema; enabled by default. |
Apache Spark Official Documentation Link: createDataFrame()
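Here is a minimal sketch of how these parameters fit together; the sample rows and schema below are illustrative, not from the example dataset used later in this post.

# A minimal sketch of the createDataFrame() parameters
rows = [(1, "Anand"), (2, "Bernald")]
df = spark.createDataFrame(
    data=rows,                     # an RDD or an iterable of rows/tuples
    schema="id INT, name STRING",  # a DDL string, list of names, or StructType
    verifySchema=True,             # check each row against the schema (default)
)
df.show()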
Gentle reminder:
In Databricks,
- the SparkSession is made available as spark
- the SparkContext is made available as sc
In case you want to create them manually, use the code below.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext
Create a simple Pandas DataFrame
Let’s start by creating a simple Pandas DataFrame using the pandas module.
import pandas as pd
data = [
[1, "Anand"],
[2, "Bernald"],
[3, "Chandran"],
[4, "Delisha"],
[5, "Maran"],
]
pandas_df = pd.DataFrame(data, columns = ["id", "name"])
print(pandas_df)
"""
id name
0 1 Anand
1 2 Bernald
2 3 Chandran
3 4 Delisha
4 5 Maran
"""
How to convert Pandas to PySpark DataFrame in Azure Databricks?
Let’s see how to convert Pandas to PySpark DataFrame in Azure Databricks with a practical example.
Example:
pyspark_df = spark.createDataFrame(pandas_df)
pyspark_df.show()
"""
Output:
+---+--------+
| id| name|
+---+--------+
| 1| Anand|
| 2| Bernald|
| 3|Chandran|
| 4| Delisha|
| 5| Maran|
+---+--------+
"""
By passing the Pandas DataFrame to the createDataFrame() function, we were able to convert it from a Pandas to a PySpark DataFrame.
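Under the hood, recent Spark versions can use Apache Arrow to speed up this conversion considerably, and Databricks enables it by default. If you are running plain Spark 3.x and want to control it yourself, the relevant configuration looks like this:

# Enable Arrow-based columnar data transfer for Pandas conversions (Spark 3.x)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")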
How to change column name and datatype while converting Pandas to PySpark DataFrame in Azure Databricks?
Let’s see how to change column name and datatype while converting Pandas to PySpark DataFrame in Azure Databricks with some practical examples.
Example 1:
# Method 1: Using StructType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
new_schema = StructType([
    StructField("emp_id", IntegerType()),
    StructField("emp_name", StringType()),
])
pyspark_df = spark.createDataFrame(pandas_df, schema=new_schema)
pyspark_df.printSchema()
pyspark_df.show()
"""
Output:
root
|-- emp_id: integer (nullable = true)
|-- emp_name: string (nullable = true)
+------+--------+
|emp_id|emp_name|
+------+--------+
| 1| Anand|
| 2| Bernald|
| 3|Chandran|
| 4| Delisha|
| 5| Maran|
+------+--------+
"""
Example 2:
# Method 2: Using DDL Format
ddl_schema = "empId STRING, empName STRING"
pyspark_df = spark.createDataFrame(pandas_df, schema=ddl_schema)
pyspark_df.printSchema()
pyspark_df.show()
"""
Output:
root
|-- empId: string (nullable = true)
|-- empName: string (nullable = true)
+-----+--------+
|empId| empName|
+-----+--------+
| 1| Anand|
| 2| Bernald|
| 3|Chandran|
| 4| Delisha|
| 5| Maran|
+-----+--------+
"""
In the above two examples, we used both a StructType and a DDL-format schema to change the column names and datatypes while converting Pandas to PySpark DataFrame.
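As a side note, if you only need to rename the columns and are happy with the inferred datatypes, a lighter-weight option is to convert first and then rename with toDF(). This sketch assumes the same pandas_df as above; note that the Pandas int64 column is inferred as long:

pyspark_df = spark.createDataFrame(pandas_df).toDF("emp_id", "emp_name")
pyspark_df.printSchema()
"""
root
 |-- emp_id: long (nullable = true)
 |-- emp_name: string (nullable = true)
"""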
I have attached the complete code used in this blog in notebook format at this GitHub link. You can download the notebook and import it into Databricks, Jupyter Notebook, etc.
When should you convert Pandas to PySpark DataFrame in Azure Databricks?
When working with a large dataset, a Python Pandas DataFrame is not powerful enough to perform complex transformation operations on big data. For this reason, if you have a Spark cluster, it is preferable to convert the Pandas DataFrame to a PySpark DataFrame, perform the complex transformations on the Spark cluster, and then convert the result back to Pandas, as shown below.
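Converting back is a single call. Keep in mind that toPandas() collects the entire result to the driver, so it is only safe when the result fits in driver memory:

# Convert the PySpark DataFrame back to Pandas (collects all rows to the driver)
result_pandas = pyspark_df.toPandas()
print(type(result_pandas))
"""
<class 'pandas.core.frame.DataFrame'>
"""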
Real World Use Case Scenarios for converting Pandas to PySpark DataFrame in Azure Databricks?
Assume that you have a huge dataset to process and have written all of your transformations in Pandas, but the processing is far slower than it would be in PySpark. Because PySpark distributes the work across the cluster and processes it in parallel, converting to a PySpark DataFrame results in much faster computation.
What are the alternatives to convert Pandas to PySpark DataFrame in Azure Databricks?
The most straightforward way to convert a Pandas DataFrame into a PySpark DataFrame is simply to pass the Pandas DataFrame into the createDataFrame() function, as we did above.
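If you would rather keep a Pandas-like API while still running distributed on the cluster, the pandas API on Spark (bundled with Spark 3.2+) is another option worth knowing. A minimal sketch:

import pyspark.pandas as ps

psdf = ps.from_pandas(pandas_df)  # a distributed, Pandas-like DataFrame
pyspark_df = psdf.to_spark()      # convert to a regular PySpark DataFrame when needed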
Final Thoughts
In this article, we have learned how to convert Pandas to PySpark DataFrame in Azure Databricks, with clearly explained examples. I have also covered the different scenarios you are likely to encounter, each with a practical example. I hope the information provided helped you gain knowledge.
Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.