How to create schemas for DataFrame in PySpark Azure Databricks?

Do you want to manually define a schema for a PySpark DataFrame in Azure Databricks, or build a nested schema in more than one way? If so, you have landed on the right page. I will also show you how to modify an existing schema, using practical examples throughout. Let’s start with a step-by-step guide to manually creating a schema for a DataFrame in PySpark.

In this blog, I will teach you the following with practical examples:

  • Syntax of StructField()
  • Creating simple schema
  • Creating nested schema
  • Creating complex schema
  • Creating schema using DDL format
  • Modifying existing schema
  • Changing existing column types

The PySpark StructType() and StructField() classes are used to manually define a schema for a PySpark DataFrame in Azure Databricks. StructType represents the structure of the DataFrame, whereas each StructField describes one column: its name, data type, nullability, and optional metadata.

StructType([
    StructField("column_name_1", column_type, contains_null),
    StructField("column_name_2", column_type, contains_null)
])

What is the syntax of the StructField() function in PySpark Azure Databricks?

The syntax is as follows:

StructField(column_name, column_type, contains_null, metadata)

Parameter Name          Required   Description
column_name (str)       Yes        The name of the column.
column_type (DataType)  Yes        The data type of the column.
contains_null (bool)    Optional   Whether the column can contain null values. Defaults to True.
metadata (dict)         Optional   A dictionary from string to simple type that can be serialized to JSON automatically.

Table 1: StructField() Method in PySpark Databricks Parameter list with Details

Note: the names above are placeholders. In PySpark itself, the keyword arguments are name, dataType, nullable, and metadata.

Apache Spark Official documentation link: StructField()

Gentle reminder:

In Databricks,

  • the SparkSession is available as spark
  • the SparkContext is available as sc

If you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

# Parentheses allow the builder chain to span several lines
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("azurelib.com")
    .getOrCreate()
)

sc = spark.sparkContext

How to manually create a simple schema for PySpark DataFrame in Azure Databricks?

Let’s see how to manually create a simple schema for a PySpark DataFrame using the StructType() and StructField() functions in Azure Databricks.

Example:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

defined_schema = StructType([
    StructField("first_name", StringType(), nullable=True),
    StructField("last_name", StringType(), nullable=True),
    StructField("age", IntegerType(), nullable=True)
])

data = [("Anand", "Raj", 22),("Benish", "Chris", 42),("Nandhini", "Sree", 52)]
df = spark.createDataFrame(data, schema=defined_schema)
df.printSchema()
df.show()

"""
Output:

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- age: integer (nullable = true)

+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|     Anand|      Raj| 22|
|    Benish|    Chris| 42|
|  Nandhini|     Sree| 52|
+----------+---------+---+

"""

As mentioned above, StructType() defines the DataFrame structure and StructField() defines the metadata of each column. PySpark offers many data types; for simplicity, we have used only StringType and IntegerType above.

How to manually create a nested schema for PySpark DataFrame in Azure Databricks?

Let’s see how to manually create a nested schema for a PySpark DataFrame using the StructType() and StructField() functions in Azure Databricks.

Example:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

defined_schema = StructType([
    StructField("name", StructType([
        StructField("first_name", StringType(), nullable=True),
        StructField("last_name", StringType(), nullable=True)
    ]), nullable=True),
    StructField("age", IntegerType(), nullable=True)])

data = [(("Anand", "Raj"), 22),(("Benish", "Chris"), 42),(("Nandhini", "Sree"), 52)]
df = spark.createDataFrame(data, schema=defined_schema)
df.printSchema()
df.show()

"""
Output:

root
 |-- name: struct (nullable = true)
 |    |-- first_name: string (nullable = true)
 |    |-- last_name: string (nullable = true)
 |-- age: integer (nullable = true)

+----------------+---+
|            name|age|
+----------------+---+
|    {Anand, Raj}| 22|
| {Benish, Chris}| 42|
|{Nandhini, Sree}| 52|
+----------------+---+

"""

How to create an ArrayType column using StructType() in Azure Databricks?

In this section, let’s see how to create an ArrayType column using StructType() in Azure Databricks. Let’s group the first and last names together in a single array column.

Example:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType

defined_schema = StructType([
    StructField("array_name", ArrayType(StringType(), True)),
    StructField("age", IntegerType())
])

data = [(("Anand", "Raj"), 22),(("Benish", "Chris"), 42),(("Nandhini", "Sree"), 52)]
df = spark.createDataFrame(data, schema=defined_schema)
df.printSchema()
df.show()

"""
Output:

root
 |-- array_name: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- age: integer (nullable = true)

+----------------+---+
|      array_name|age|
+----------------+---+
|    [Anand, Raj]| 22|
| [Benish, Chris]| 42|
|[Nandhini, Sree]| 52|
+----------------+---+

"""

How to create a MapType column using StructType() in Azure Databricks?

In this section, let’s see how to create a MapType column using StructType() in Azure Databricks. Let’s create a MapType column out of a collection of key-value data.

Example:

from pyspark.sql.types import StructType, StructField, StringType, MapType, IntegerType

defined_schema = StructType([
    StructField("map_name", MapType(StringType(), StringType())),
    StructField("age", IntegerType())
])

data = [
    (({"first_name":"Anand", "last_name": "Raj"}), 22),
    (({"first_name":"Benish", "last_name": "Chris"}), 42),
    (({"first_name":"Nandhini", "last_name": "Sree"}), 52)]

df = spark.createDataFrame(data, schema=defined_schema)
df.printSchema()
df.show(truncate=False)

"""
Output:

root
 |-- map_name: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- age: integer (nullable = true)

+-------------------------------------------+---+
|map_name                                   |age|
+-------------------------------------------+---+
|{last_name -> Raj, first_name -> Anand}    |22 |
|{last_name -> Chris, first_name -> Benish} |42 |
|{last_name -> Sree, first_name -> Nandhini}|52 |
+-------------------------------------------+---+

"""

How to create a simple schema for PySpark DataFrame using DDL format in Azure Databricks?

Let’s see how to manually create a simple schema for a PySpark DataFrame using DDL format strings in Azure Databricks.

Example:

ddl_schema = "first_name STRING, last_name STRING, age STRING"

data = [("Anand", "Raj", 22),("Benish", "Chris", 42),("Nandhini", "Sree", 52)]
df = spark.createDataFrame(data, schema=ddl_schema)
df.printSchema()
df.show()

"""
Output:

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- age: string (nullable = true)

+----------+---------+---+
|first_name|last_name|age|
+----------+---------+---+
|     Anand|      Raj| 22|
|    Benish|    Chris| 42|
|  Nandhini|     Sree| 52|
+----------+---------+---+

"""

How to change a PySpark DataFrame structure in Azure Databricks?

In this section, we will see how to change the structure of a PySpark DataFrame in Azure Databricks. Let’s create a new StructType column out of individual columns.

Example:

from pyspark.sql.functions import struct, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

defined_schema = StructType([
    StructField("first_name", StringType(), nullable=True),
    StructField("last_name", StringType(), nullable=True),
    StructField("age", IntegerType(), nullable=True)
])

data = [("Anand", "Raj", 22),("Benish", "Chris", 42),("Nandhini", "Sree", 52)]
df = spark.createDataFrame(data, schema=defined_schema)

modified_df = df.withColumn("name", struct(
    col("first_name").alias("f_name"),
    col("last_name").alias("l_name")
)).select("name", "age")

modified_df.printSchema()
modified_df.show()

"""
Output:

root
 |-- name: struct (nullable = false)
 |    |-- f_name: string (nullable = true)
 |    |-- l_name: string (nullable = true)
 |-- age: integer (nullable = true)

+----------------+---+
|            name|age|
+----------------+---+
|    {Anand, Raj}| 22|
| {Benish, Chris}| 42|
|{Nandhini, Sree}| 52|
+----------------+---+

"""

How to change a PySpark DataFrame column DataType in Azure Databricks?

In this section, we will see how to change the data type of a PySpark DataFrame column in Azure Databricks. Let’s cast the “age” column from IntegerType to StringType and store the result in a new column, “string_age”.

Example:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

modified_df = modified_df.withColumn("string_age", col("age").cast(StringType()))
modified_df.printSchema()
modified_df.show()

"""
Output:

root
 |-- name: struct (nullable = false)
 |    |-- f_name: string (nullable = true)
 |    |-- l_name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- string_age: string (nullable = true)

+----------------+---+----------+
|            name|age|string_age|
+----------------+---+----------+
|    {Anand, Raj}| 22|        22|
| {Benish, Chris}| 42|        42|
|{Nandhini, Sree}| 52|        52|
+----------------+---+----------+

"""

I have attached the complete code used in this blog, in notebook format, at this GitHub link. You can download the notebook and import it into Databricks, Jupyter Notebook, etc.

When should you manually create a DataFrame schema in Azure Databricks?

PySpark can infer a DataFrame schema, but sometimes we need to define it manually. For example, when reading a file, schema inference may leave a date column as a string; to load it as a DateType column, we have to provide an explicit schema.

Real-world use case scenarios for manually creating a DataFrame schema in Azure Databricks

Assume that you were given a CSV file and have to create a DataFrame from it with the correct data type applied to each column. This can often be achieved by inferring the schema, but in some cases you need to state the data type of a particular column explicitly. For example, to parse a date-time column stored as a string into DateType, you specify the dateFormat option while reading the file, and for the column to be parsed successfully you also declare it as DateType by creating and applying a schema.

What are the alternatives for creating a DataFrame schema manually in Azure Databricks?

There are two ways to create a DataFrame schema manually:

  • Using StructType and StructField
  • Using DDL format schema

Both were discussed with multiple examples in the sections above.

Final Thoughts

In this article, we have learned how to create and modify a PySpark DataFrame schema, and how to change column types, in Azure Databricks, with clear examples. I have also covered different scenarios with practical examples. I hope the information provided was helpful.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.
