How to replace null values in PySpark Azure Databricks?

Are you looking for a way to replace the null values of PySpark DataFrame columns in Azure Databricks, or maybe you are looking for a solution to change the None values of DataFrame columns in PySpark Databricks? If you are looking for a solution to either of these problems, you have landed on the correct page. I will show you how to use PySpark to replace column null values in DataFrames in Azure Databricks, and I will explain it with practical examples. So, without wasting time, let's start the step-by-step guide to understanding how to replace null values of columns in a PySpark DataFrame.

In this blog, I will teach you the following with practical examples:

  • Syntax of fill() and fillna()
  • Replace null values of the entire column
  • Replace null values of the selected column
  • Replace null values of the selected column with different values
  • Replace null values of the selected column with an aggregated value

The fill() and fillna() methods are used to replace the null/None values of DataFrame columns in PySpark on Azure Databricks.

Syntax:

dataframe_name.na.fill()

dataframe_name.fillna()

What is the syntax of the fill() and fillna() functions in PySpark Azure Databricks?

The syntax is as follows:

dataframe_name.na.fill(replace_value, selected_columns)

dataframe_name.fillna(replace_value, selected_columns)
Parameter Name | Required | Description
replace_value (int, float, string, bool, or dict) | Yes | The value used to replace the null values. If the value is a dict, selected_columns can be ignored.
selected_columns (str, tuple or list) | Optional | The column(s) whose null values have to be replaced.

Table 1: fill() & fillna() Method in PySpark Databricks Parameter list with Details
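
For quick orientation, here is a minimal sketch of both parameter styles (the column names refer to the example DataFrame created later in this post):

# Scalar replacement value with an explicit subset of columns
df.na.fill(value="unknown", subset=["genre"])

# Dict replacement value: the keys name the columns, so no subset is needed
df.fillna({"genre": "unknown", "imdb_rating": 0.0})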

Official documentation link: fill() and fillna()

Create a simple DataFrame

Let’s understand the use of the fill() and fillna() functions with a variety of examples. Let’s start by creating a DataFrame.

Gentle reminder:

In Databricks,

  • The SparkSession is made available as spark
  • The SparkContext is made available as sc

In case you want to create them manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Creating a PySpark DataFrame manually

data = [
    ("Duel in the Sun",None,None,7.0),
    ("Tom Jones",None,None,7.0),
    ("Oliver!","Musical","Sony Pictures",7.5),
    ("To Kill A Mockingbird",None,"Universal",8.4),
    ("Tora, Tora, Tora",None,None,None)
]

df = spark.createDataFrame(data, schema=["title", "genre", "distributor", "imdb_rating"])
df.printSchema()
df.show(5, truncate=False)

"""
root
 |-- title: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- distributor: string (nullable = true)
 |-- imdb_rating: double (nullable = true)

+---------------------+-------+-------------+-----------+
|title                |genre  |distributor  |imdb_rating|
+---------------------+-------+-------------+-----------+
|Duel in the Sun      |null   |null         |7.0        |
|Tom Jones            |null   |null         |7.0        |
|Oliver!              |Musical|Sony Pictures|7.5        |
|To Kill A Mockingbird|null   |Universal    |8.4        |
|Tora, Tora, Tora     |null   |null         |null       |
+---------------------+-------+-------------+-----------+
"""

b) Creating a DataFrame by reading files

Download and use the source file below.

# replace the file_path with the source file location which you have downloaded.

df_2 = spark.read.format("csv").option("inferSchema", True).option("header", True).load("file_path")
df_2.printSchema()

"""
root
 |-- title: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- distributor: string (nullable = true)
 |-- imdb_rating: double (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to replace null values of columns in PySpark Azure Databricks?

By passing only a replacement value to the fill() or fillna() PySpark function in Azure Databricks, you can replace the null values in every column of a matching data type.

Note that if you pass 0 as the value, the fill() or fillna() functions will replace the null values only in numeric columns. Likewise, if you pass a string value to the function, it will replace the null values only in string columns.

Examples:

# 1. Replace only the numeric columns
df.na.fill(0).show()
df.fillna(0).show()

# Both statements above give the following output.

"""
Output:

+--------------------+-------+-------------+-----------+
|               title|  genre|  distributor|imdb_rating|
+--------------------+-------+-------------+-----------+
|     Duel in the Sun|   null|         null|        7.0|
|           Tom Jones|   null|         null|        7.0|
|             Oliver!|Musical|Sony Pictures|        7.5|
|To Kill A Mocking...|   null|    Universal|        8.4|
|    Tora, Tora, Tora|   null|         null|        0.0|
+--------------------+-------+-------------+-----------+

"""

# 2. Replace only the string columns
df.na.fill("unknown").show()
df.fillna("unknown").show()

# Both statements above give the following output.

"""
Output:

+--------------------+-------+-------------+-----------+
|               title|  genre|  distributor|imdb_rating|
+--------------------+-------+-------------+-----------+
|     Duel in the Sun|unknown|      unknown|        7.0|
|           Tom Jones|unknown|      unknown|        7.0|
|             Oliver!|Musical|Sony Pictures|        7.5|
|To Kill A Mocking...|unknown|    Universal|        8.4|
|    Tora, Tora, Tora|unknown|      unknown|       null|
+--------------------+-------+-------------+-----------+

"""

How to replace the null values of selected columns in PySpark Azure Databricks?

By passing both a value and the column(s) to be considered to the fill() or fillna() PySpark function in Azure Databricks, you can replace the null values in just the selected columns.

Example 1:

# Changing the null values of 'genre' column to 'unknown-genre'

df.na.fill(value="unknown-genre", subset=["genre"]).show()
df.fillna(value="unknown-genre", subset=["genre"]).show()

# Both statements above give the following output.

"""
Output:

+--------------------+-------------+-------------+-----------+
|               title|        genre|  distributor|imdb_rating|
+--------------------+-------------+-------------+-----------+
|     Duel in the Sun|unknown-genre|         null|        7.0|
|           Tom Jones|unknown-genre|         null|        7.0|
|             Oliver!|      Musical|Sony Pictures|        7.5|
|To Kill A Mocking...|unknown-genre|    Universal|        8.4|
|    Tora, Tora, Tora|unknown-genre|         null|       null|
+--------------------+-------------+-------------+-----------+

"""

Example 2:

# Changing the null values of 'distributor' column to 'unknown-distributor'

df.na.fill(value="unknown-distributor", subset=["distributor"]).show()
df.fillna(value="unknown-distributor", subset=["distributor"]).show()

# Both statements above give the following output.

"""
Output:

+--------------------+-------+-------------------+-----------+
|               title|  genre|        distributor|imdb_rating|
+--------------------+-------+-------------------+-----------+
|     Duel in the Sun|   null|unknown-distributor|        7.0|
|           Tom Jones|   null|unknown-distributor|        7.0|
|             Oliver!|Musical|      Sony Pictures|        7.5|
|To Kill A Mocking...|   null|          Universal|        8.4|
|    Tora, Tora, Tora|   null|unknown-distributor|       null|
+--------------------+-------+-------------------+-----------+

"""

How to replace null values of the selected columns with different values?

By passing the value as a dict to the fill() or fillna() PySpark function in Azure Databricks, you can replace the null values of selected columns with different values.

Note that if you pass the value as a dict, there is no need to specify a subset; the dict keys select the columns.

Example:

# Changing the null of selected columns with different values
df.na.fill({"genre": "unknown-genre", "distributor": "unknown-distributor"}).show()

"""
Output:

+--------------------+-------------+-------------------+-----------+
|               title|        genre|        distributor|imdb_rating|
+--------------------+-------------+-------------------+-----------+
|     Duel in the Sun|unknown-genre|unknown-distributor|        7.0|
|           Tom Jones|unknown-genre|unknown-distributor|        7.0|
|             Oliver!|      Musical|      Sony Pictures|        7.5|
|To Kill A Mocking...|unknown-genre|          Universal|        8.4|
|    Tora, Tora, Tora|unknown-genre|unknown-distributor|       null|
+--------------------+-------------+-------------------+-----------+

"""

How to replace the null value with an aggregated value?

Here is an example. The average value of the IMDB rating column works out to 7.475. By passing this value to the function, you can replace the null values of the column with its average.

Note: You can't insert a string value into a numeric column. If you try to insert a string value into a numeric column, the fill() or fillna() functions will simply ignore it.
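
A quick sketch of that behavior, reusing the df created above:

# A string replacement value is silently ignored on the numeric 'imdb_rating' column
df.na.fill(value="not-rated", subset=["imdb_rating"]).show()
# 'imdb_rating' still shows null for "Tora, Tora, Tora"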

Example:

from pyspark.sql.functions import avg

avg_rating = df.select(avg("imdb_rating").alias("avg_rating")).collect()[0]["avg_rating"]
print(f"The average IMDB rating is: {avg_rating}")

df.fillna(value=avg_rating, subset=["imdb_rating"]).show()
# Note: you can't replace the null values of a numeric column with a string value.

"""
Output:

The average IMDB rating is: 7.475
+--------------------+-------+-------------+-----------+
|               title|  genre|  distributor|imdb_rating|
+--------------------+-------+-------------+-----------+
|     Duel in the Sun|   null|         null|        7.0|
|           Tom Jones|   null|         null|        7.0|
|             Oliver!|Musical|Sony Pictures|        7.5|
|To Kill A Mocking...|   null|    Universal|        8.4|
|    Tora, Tora, Tora|   null|         null|      7.475|
+--------------------+-------+-------------+-----------+

"""

I have attached the complete code used in this blog in notebook format at this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you use the PySpark fill() or fillna() function in Azure Databricks?

These could be the possible reasons:

  1. To fill null values
  2. To fill all null values of numeric columns
  3. To fill all null values of string columns
  4. To replace null values with some aggregated values

Real World Use Case Scenarios for using the fill() function in PySpark Azure Databricks

By default, aggregation functions omit null values from the calculation. For example, consider a numeric column with the values (10, 20, None): if you perform sum() on this column, you will get 30 as the output. Now suppose the requirement is to replace the null values with the average of that column. To do this, you have to find the average of the column and pass that value into the fill() function, as the sketch below shows.
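
Here is a rough sketch of that scenario, using a hypothetical single-column DataFrame named demo with a column called value:

from pyspark.sql.functions import avg, sum as sum_

demo = spark.createDataFrame([(10.0,), (20.0,), (None,)], ["value"])
demo.select(sum_("value")).show()  # nulls are omitted: the sum is 30.0

# Fill the nulls with the column average (15.0), then aggregate again
avg_value = demo.select(avg("value")).collect()[0][0]
demo.na.fill(value=avg_value, subset=["value"]).select(sum_("value")).show()  # 45.0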

What are the alternatives for filling null values in PySpark Azure Databricks?

These alternatives were discussed with multiple examples in the above section.

  • fill() is used for filling null values in PySpark
  • fillna() is an alias of fill() and behaves identically, as the quick check below shows
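
A quick sanity check, reusing the df created earlier, that confirms the two are interchangeable:

filled_a = df.na.fill("unknown")
filled_b = df.fillna("unknown")
assert filled_a.collect() == filled_b.collect()  # identical results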

Final Thoughts

In this article, we have learned how to use the PySpark fill() and fillna() methods to replace null values of DataFrame columns in Azure Databricks, with clearly explained examples. I have also covered the different scenarios that could possibly arise, with practical examples. I hope the information provided helped you gain knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.