How to create and manipulate MapType columns in PySpark Azure Databricks?

Are you looking to find out how to create a MapType column in a PySpark DataFrame in the Azure Databricks cloud, or maybe you are looking for a way to manipulate a DataFrame MapType column's values in PySpark Databricks using PySpark's built-in methods? If you are looking for any of these solutions, you have landed on the correct page. I will show you how to use PySpark to create and manipulate MapType columns and their values in DataFrames in Azure Databricks, and I will explain it with practical examples. So, without wasting time, let's start with a step-by-step guide to understanding how to create and change MapType column values in a PySpark DataFrame.

In this blog, I will teach you the following with practical examples:

  • Syntax of MapType
  • Creating an instance of the MapType column
  • Creating MapType column using StructType
  • Selecting the sub-columns of a MapType column
  • Functions of MapType column

The MapType() is used to create an instance of the MapType column datatype of the DataFrame in PySpark Azure Databricks.

Syntax:

MapType(key_datatype, value_datatype, contains_null)

What is the syntax of MapType() in PySpark Azure Databricks?

The syntax is as follows:

MapType(key_datatype, value_datatype, contains_null)

| Parameter Name             | Required | Description                                       |
|----------------------------|----------|---------------------------------------------------|
| key_datatype (DataType)    | Yes      | It represents the key column data type.           |
| value_datatype (DataType)  | Yes      | It represents the value column data type.         |
| contains_null (bool)       | Optional | It represents whether the map values can be null. |

Table 1: MapType() Method in PySpark Databricks Parameter list with Details

Apache Spark Official Documentation Link: MapType()

Gentle reminder:

In Databricks,

  • The SparkSession is made available as spark
  • The SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

How to create instance of MapType column in PySpark Azure Databricks?

The PySpark MapType() takes three arguments: the key column datatype, the value column datatype, and a bool value representing whether the map values can be null. By default, contains_null is True. Let's start by creating a MapType instance.

from pyspark.sql.types import IntegerType, StringType, MapType

map_column = MapType(IntegerType(), StringType(), False)

Note: The key column will not accept a null/None value; the contains_null flag applies only to the map's values.
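
To see what these nullability rules mean in practice, here is a minimal sketch (with hypothetical data) that embeds a strict MapType in a schema and shows that a None map value is rejected when contains_null is set to False. The exact exception type can vary between Spark versions, so the example catches it broadly.

from pyspark.sql.types import StructType, StructField, MapType, IntegerType, StringType

# Hypothetical schema: map values must not be null (contains_null=False)
strict_schema = StructType([
    StructField("scores", MapType(IntegerType(), StringType(), False), True)
])

try:
    # The second map entry has a None value, so schema verification rejects it
    spark.createDataFrame([({1: "pass", 2: None},)], schema=strict_schema)
except Exception as error:
    print("Rejected by schema verification:", error)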

How to create MapType columns in PySpark Azure Databricks?

Download and use the below source file.

from pyspark.sql.types import StructType, StructField, MapType, StringType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("info", MapType(StringType(), StringType(), True), True)
])

data = [
    ("Arasan", ({"age": "23", "blood": "B+"})),
    ("Karthik", ({"age": "24", "blood": "B-"})),
    ("Sanjay", ({"age": "28", "blood": "O-"})),
    ("Marish", None)
]

# If you want to read data from a file, I have attached the dataset above this code, download and use it.

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
df.show(truncate=False)

"""
Output:

root
 |-- name: string (nullable = true)
 |-- info: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+-------+------------------------+
|name   |info                    |
+-------+------------------------+
|Arasan |{blood -> B+, age -> 23}|
|Karthik|{blood -> B-, age -> 24}|
|Sanjay |{blood -> O-, age -> 28}|
|Marish |null                    |
+-------+------------------------+

"""

How to access sub-columns of MapType column in PySpark Azure Databricks?

Examples:

from pyspark.sql.functions import col

# Method 1
df.select("info.age").show(truncate=False)

"""
Output:

+----+
|age |
+----+
|23  |
|24  |
|28  |
|null|
+----+

"""

# Method 2
df.select(df.info.age).show(truncate=False)

# Method 3
df.select(col("info").age).show(truncate=False)

# Method 4
df.select(col("info")["age"]).show(truncate=False)

Methods 2, 3, and 4 all give the following output (only the column name differs from Method 1).

"""
Output:

+---------+
|info[age]|
+---------+
|23       |
|24       |
|28       |
|null     |
+---------+

"""

How to extract sub-columns out of MapType column in PySpark Azure Databricks?

Examples:

# 1. Using select() function
df.select("name", "info.age", "info.blood").show()

# 2. Using withColumn() function
df.withColumn("age", df.info.age)\
.withColumn("blood", df.info.blood)\
.select("name", "age", "blood")\
.show()

Both of the above examples give the following output.

"""
Output:

+-------+----+-----+
|   name| age|blood|
+-------+----+-----+
| Arasan|  23|   B+|
|Karthik|  24|   B-|
| Sanjay|  28|   O-|
| Marish|null| null|
+-------+----+-----+

"""

How to create new rows out of MapType column in PySpark Azure Databricks?

We can generate new rows from a given MapType column by using the PySpark explode() function. The explode() function will not create a row for a record whose map column is null.

Example:

from pyspark.sql.functions import explode

df.select("name", explode("info")).show()

"""
Output:

+-------+-----+-----+
|   name|  key|value|
+-------+-----+-----+
| Arasan|blood|   B+|
| Arasan|  age|   23|
|Karthik|blood|   B-|
|Karthik|  age|   24|
| Sanjay|blood|   O-|
| Sanjay|  age|   28|
+-------+-----+-----+

"""

As shown above, the record for “Marish” is not listed in the output because its map column is null.

How to create new rows out of MapType column having null values in PySpark Azure Databricks?

We can generate new rows from a given MapType column by using the PySpark explode_outer() function. Unlike explode(), the explode_outer() function also creates a row when the map column is null.

Example:

from pyspark.sql.functions import explode_outer

df.select("name", explode_outer("info")).show()

"""
Output:

+-------+-----+-----+
|   name|  key|value|
+-------+-----+-----+
| Arasan|blood|   B+|
| Arasan|  age|   23|
|Karthik|blood|   B-|
|Karthik|  age|   24|
| Sanjay|blood|   O-|
| Sanjay|  age|   28|
| Marish| null| null|
+-------+-----+-----+

"""

As shown above, the record for “Marish” has a null map column but is still listed in the output.

How to create new rows with their index position out of MapType column in PySpark Azure Databricks?

We can generate new rows from a given MapType column by using the PySpark posexplode() function. The posexplode() function creates new rows along with the position of each element inside the map column.

Example:

from pyspark.sql.functions import posexplode

df.select("name", posexplode("info")).show()

"""
Output:

+-------+---+-----+-----+
|   name|pos|  key|value|
+-------+---+-----+-----+
| Arasan|  0|blood|   B+|
| Arasan|  1|  age|   23|
|Karthik|  0|blood|   B-|
|Karthik|  1|  age|   24|
| Sanjay|  0|blood|   O-|
| Sanjay|  1|  age|   28|
+-------+---+-----+-----+

"""

As mentioned above, each element’s position is recorded in a separate column.

How to create new rows out of MapType column having null values along with their index value in PySpark Azure Databricks?

We can generate new rows from a given MapType column by using the PySpark posexplode_outer() function. The posexplode_outer() function also creates a row when the map column is null and adds a pos column that represents each element's index position.

Example:

from pyspark.sql.functions import posexplode_outer

df.select("name", posexplode_outer("info")).show()

"""
Output:

+-------+----+-----+-----+
|   name| pos|  key|value|
+-------+----+-----+-----+
| Arasan|   0|blood|   B+|
| Arasan|   1|  age|   23|
|Karthik|   0|blood|   B-|
|Karthik|   1|  age|   24|
| Sanjay|   0|blood|   O-|
| Sanjay|   1|  age|   28|
| Marish|null| null| null|
+-------+----+-----+-----+

"""

As shown above, each element's position is recorded in a separate column, and the null record is included too.

How to extract keys and values of MapType column into an array in PySpark Azure Databricks?

We can extract MapType column keys or values into an ArrayType column by using the PySpark map_keys() and map_values() functions. The map_keys() and map_values() functions take a column as input.

Example 1:

from pyspark.sql.functions import map_keys

df.select("name", map_keys("info")).show()

"""
Output:

+-------+--------------+
|   name|map_keys(info)|
+-------+--------------+
| Arasan|  [blood, age]|
|Karthik|  [blood, age]|
| Sanjay|  [blood, age]|
| Marish|          null|
+-------+--------------+

"""

Example 2:

from pyspark.sql.functions import map_values

df.select("name", map_values("info")).show()

"""
Output:

+-------+----------------+
|   name|map_values(info)|
+-------+----------------+
| Arasan|        [B+, 23]|
|Karthik|        [B-, 24]|
| Sanjay|        [O-, 28]|
| Marish|            null|
+-------+----------------+

"""

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you use MapType column functions in PySpark Azure Databricks?

These are some of the possible reasons:

  1. To create new records from a MapType column by exploding
  2. To convert a MapType column into a JSON string (see the sketch after this list)
  3. To collect all the keys of the MapType column
  4. To collect all the values of the MapType column
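
For the second point in the list above, here is a minimal sketch of converting the MapType column into a JSON string with the to_json() function, reusing the df created earlier in this post.

from pyspark.sql.functions import to_json

# Serialize the MapType column "info" into a JSON string column
df.select("name", to_json("info").alias("info_json")).show(truncate=False)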

Real World Use Case Scenarios for using MapType column functions in PySpark Azure Databricks?

  • Assume you were given a dataset containing people's names and their favorite foods, for example (“Berne”, {food1: “Apple”, food2: “Pancake”, food3: “Waffle”}). Now you are given a requirement to fetch all the favorite foods of people without duplication. For this, you might need to explode the MapType column into rows and apply the collect_set() function (see the sketch after this list).
  • Assume you were given a dataset containing the candidate's ID, name, and contacts, for example (“Berne”, {“email”: “abc@gmail.com”, “mobile”: “976543210”}). Your requirement is to get all the keys into an array, for example (“Berne”, [“email”, “mobile”]), to check which candidates haven't submitted their contact details yet. In order to fulfill this requirement, we can use the map_keys() function.
  • Assume you were given a dataset containing the candidate's ID, name, and preferences, for example (“Berne”, {“pref1”: “Chennai”, “pref2”: “Bangalore”, “pref3”: “Mumbai”}). Your requirement is to get all the values into an array, for example (“Berne”, [“Chennai”, “Bangalore”, “Mumbai”]). In order to fulfill this requirement, we can use the map_values() function.
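
As referenced in the first scenario above, here is a minimal sketch, using a small hypothetical dataset, of exploding a MapType column of favorite foods and collecting the distinct values with collect_set().

from pyspark.sql.functions import explode, collect_set

# Hypothetical dataset: names and their favorite foods stored as a map
foods_df = spark.createDataFrame(
    [("Berne", {"food1": "Apple", "food2": "Pancake", "food3": "Waffle"}),
     ("Asha", {"food1": "Apple", "food2": "Dosa"})],
    ["name", "favourites"]
)

# Explode the map into key/value rows, then collect the distinct food values
foods_df.select(explode("favourites").alias("pref", "food")) \
    .agg(collect_set("food").alias("all_foods")) \
    .show(truncate=False)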

These are some real-world examples; each of the map functions is explained in detail with a practical example in the sections above.

What are the alternatives for creating MapType column in PySpark Azure Databricks?

Here are a few alternatives; a short sketch of the create_map() and from_json() options follows the list.

  • Use lit() function for creating a MapType column from a literal value
  • Use from_json() function to create a MapType column from a JSON string
  • Use create_map() function to create a MapType column from existing columns
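
Here is a minimal sketch (with hypothetical data and column names) of the create_map() and from_json() alternatives: the first builds a MapType column from existing columns, and the second parses a JSON string column into a MapType column.

from pyspark.sql.functions import create_map, from_json, lit
from pyspark.sql.types import MapType, StringType

# create_map(): build a map from alternating key/value expressions
people_df = spark.createDataFrame(
    [("Arasan", "23", "B+"), ("Karthik", "24", "B-")],
    ["name", "age", "blood"]
)
people_df.select(
    "name",
    create_map(lit("age"), "age", lit("blood"), "blood").alias("info")
).show(truncate=False)

# from_json(): parse a JSON string column into a MapType column
json_df = spark.createDataFrame(
    [("Sanjay", '{"age": "28", "blood": "O-"}')],
    ["name", "info_json"]
)
json_df.select(
    "name",
    from_json("info_json", MapType(StringType(), StringType())).alias("info")
).show(truncate=False)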

Final Thoughts

In this article, we have learned about the PySpark MapType and the methods for manipulating MapType columns of a DataFrame in Azure Databricks, along with clearly explained examples. I have also covered different scenarios with practical examples where possible. I hope the information provided helped you gain knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.