How to use JSON SQL Functions in PySpark Azure Databricks?

Are you looking to find out how to convert a JSON string into a StructType column of a PySpark DataFrame in Azure Databricks, or maybe you are looking for a solution to convert a StructType column into a StringType column using PySpark’s SQL methods? If you are looking for any of these solutions, you have landed on the correct page. I will also show you how to use PySpark SQL functions to extract values from JSON and to create new columns from the JSON columns of PySpark DataFrames in Azure Databricks, and I will explain everything with practical examples. So, without wasting time, let’s start with a step-by-step guide to extracting, creating, and converting JSON column values in a PySpark DataFrame.

In this blog, I will teach you the following with practical examples:

  • Introduction to PySpark JSON SQL Functions
  • Commonly used JSON SQL functions

Create a simple DataFrame

Let’s understand the use of JSON SQL functions with various examples. Let’s start by creating a DataFrame.

Gentle reminder:

In Databricks,

  • the SparkSession is made available as spark
  • the SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

json_string = "{'Name':'chevrolet', 'Miles_per_Gallon':18, 'Cylinders':8, 'Displacement':307, 'Horsepower':130, 'Weight_in_lbs':3504, 'Acceleration':12, 'Year':'1970-01-01', 'Origin':'USA'}"

df = spark.createDataFrame([(1, json_string)], schema=["id", "value"])
df.printSchema()
df.show(truncate=False)

"""
root
 |-- id: long (nullable = true)
 |-- value: string (nullable = true)

+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |value                                                                                                                                                                         |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1  |{'Name':'chevrolet', 'Miles_per_Gallon':18, 'Cylinders':8, 'Displacement':307, 'Horsepower':130, 'Weight_in_lbs':3504, 'Acceleration':12, 'Year':'1970-01-01', 'Origin':'USA'}|
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace the file_path with the source file location which you have downloaded.

df_2 = spark.read.format("csv").option("header", True).load(file_path)
df_2.printSchema()

"""
root
 |-- id: string (nullable = true)
 |-- value: string (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

What are the commonly used JSON functions in PySpark Azure Databricks?

In PySpark, we have a number of functions for working with JSON string columns. Some of the commonly used JSON functions are listed below:

  • from_json(): Converts a JSON string into a StructType or MapType column
  • to_json(): Converts a MapType or StructType column into a JSON string
  • json_tuple(): Extracts values from a JSON string and creates new columns from them
  • get_json_object(): Extracts a JSON element from a JSON string based on the specified JSON path
  • schema_of_json(): Creates a schema string from a JSON string

How to convert JSON string to StructType or MapType in PySpark Azure Databricks?

In this section, I will explain the from_json() function using practical examples.

from pyspark.sql.types import StringType, MapType
from pyspark.sql.functions import from_json

map_df = df.withColumn("value", from_json("value", MapType(StringType(),StringType())))
map_df.printSchema()
map_df.show(truncate=False)

"""
Output:

root
 |-- id: long (nullable = true)
 |-- value: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |value                                                                                                                                                                            |
+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1  |{Name -> chevrolet, Miles_per_Gallon -> 18, Cylinders -> 8, Displacement -> 307, Horsepower -> 130, Weight_in_lbs -> 3504, Acceleration -> 12, Year -> 1970-01-01, Origin -> USA}|
+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

"""

How to convert StructType or MapType to JSON string in PySpark Azure Databricks?

In this section, I will explain the to_json() function using practical examples.

from pyspark.sql.functions import to_json

# Convert the map column back into a JSON string.
json_df = map_df.withColumn("value", to_json("value"))
json_df.printSchema()
json_df.show(truncate=False)

"""
Output:

root
 |-- id: long (nullable = true)
 |-- value: string (nullable = true)

+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |value                                                                                                                                                                             |
+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1  |{"Name":"chevrolet","Miles_per_Gallon":"18","Cylinders":"8","Displacement":"307","Horsepower":"130","Weight_in_lbs":"3504","Acceleration":"12","Year":"1970-01-01","Origin":"USA"}|
+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

"""

How to extract and create columns from JSON columns in PySpark Azure Databricks?

In this section, I will explain the json_tuple() function using practical examples. Note that json_tuple() names its generated columns c0, c1, and so on, which is why toDF() is used below to rename them.

from pyspark.sql.functions import json_tuple

cols = ["Name", "Miles_per_Gallon", "Cylinders", "Displacement", "Horsepower",
        "Weight_in_lbs", "Acceleration", "Year", "Origin"]

ind_df = json_df \
    .select("id", json_tuple(json_df.value, *cols)) \
    .toDF("id", *cols)
ind_df.printSchema()
ind_df.show(truncate=False)

"""
Output:

root
 |-- id: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Miles_per_Gallon: string (nullable = true)
 |-- Cylinders: string (nullable = true)
 |-- Displacement: string (nullable = true)
 |-- Horsepower: string (nullable = true)
 |-- Weight_in_lbs: string (nullable = true)
 |-- Acceleration: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Origin: string (nullable = true)

+---+---------+----------------+---------+------------+----------+-------------+------------+----------+------+
|id |Name     |Miles_per_Gallon|Cylinders|Displacement|Horsepower|Weight_in_lbs|Acceleration|Year      |Origin|
+---+---------+----------------+---------+------------+----------+-------------+------------+----------+------+
|1  |chevrolet|18              |8        |307         |130       |3504         |12          |1970-01-01|USA   |
+---+---------+----------------+---------+------------+----------+-------------+------------+----------+------+

"""

How to extract element from a JSON string in PySpark Azure Databricks?

In this section, I will explain the get_json_object() function using practical examples.

from pyspark.sql.functions import get_json_object

df.select("id", \
          get_json_object("value", "$.Name").alias("Name"), \
          get_json_object("value", "$.Origin").alias("Origin"), \
         ).show()

"""
Output:

+---+---------+------+
| id|     Name|Origin|
+---+---------+------+
|  1|chevrolet|   USA|
+---+---------+------+

"""

How to extract schema from JSON string in PySpark Azure Databricks?

In this section, I will explain the schema_of_json() function using practical examples.

from pyspark.sql.functions import schema_of_json, lit

json_string = df.select("value").collect()[0]["value"]

schema_df = spark.range(1) \
    .select(schema_of_json(lit(json_string)))

print(schema_df.collect()[0][0])

"""
Output:

STRUCT<Acceleration: BIGINT, Cylinders: BIGINT, Displacement: BIGINT, Horsepower: BIGINT, Miles_per_Gallon: BIGINT, Name: STRING, Origin: STRING, Weight_in_lbs: BIGINT, Year: STRING>

"""

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you use the PySpark JSON SQL Functions in Azure Databricks?

These are some of the possible reasons:

  1. To convert JSON string into MapType or StructType
  2. To convert MapType or StructType into JSON string
  3. To extract the schema of a JSON string
  4. To create new columns out of JSON string

Real World Use Case Scenarios for PySpark DataFrame JSON SQL Functions in Azure Databricks

  • Assume that you were given a column of JSON string values and you want to convert that StringType column into a MapType or StructType column so its values are easier to access. You can use the JSON SQL function from_json() for this conversion.
  • Assume that you were given a MapType or StructType column and you want to convert it into a JSON string value. You can use the JSON SQL function to_json() for this conversion.
  • Assume that you were given a column of JSON string values and you want to extract only some keys out of it to create new columns. You can use json_tuple() for this, and to extract only one key you can use get_json_object().
  • Assume that you have a JSON string column and you want to parse it into a MapType or StructType. To parse it with a defined data type, you first need its schema; to extract the schema of the JSON string column, you can use the schema_of_json() function.

The PySpark JSON SQL functions have been covered in the above section with practical examples.

What are the alternatives of the JSON SQL Functions in PySpark Azure Databricks?

You can use PySpark user-defined functions (UDFs) for handling JSON in a PySpark DataFrame. However, the PySpark built-in functions perform better than UDFs, are compile-time safe, and should be used instead of custom UDFs. If the performance of your PySpark application is crucial, avoid custom UDFs wherever possible, because their performance cannot be guaranteed.
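For comparison, here is a minimal UDF-based sketch that extracts a single key with plain Python. It produces the same result as get_json_object("value", "$.Name") on the df used above, but every row has to round-trip through a Python worker:

import ast
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def extract_name(value):
    # ast.literal_eval handles the single-quoted sample string;
    # for strict JSON you would use json.loads instead.
    return str(ast.literal_eval(value).get("Name"))

df.select("id", extract_name("value").alias("Name")).show()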

Final Thoughts

In this article, we have learned how to use the PySpark JSON SQL functions to extract, manipulate, and convert DataFrame columns in Azure Databricks, with clearly explained examples. I have also covered several practical scenarios you are likely to encounter. I hope the information provided helped you gain knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.