How to count null, None, NaN, and an empty string in PySpark Azure Databricks?

Are you looking to find out how to count null, None, and an empty string in PySpark Azure Databricks cloud or maybe you are looking for a solution, to count the numpy NaN value in PySpark Databricks using the PySpark inbuilt function? If you are looking for any of these problem solutions, you have landed on the correct page. I will also show you how to use some of the PySpark inbuilt functions to find the null values in Azure Databricks. I will explain it with a practical example. So please don’t waste time let’s start with a step-by-step guide to understanding how to count null, None, NaN, and an empty string of PySpark DataFrame in Azure Databricks.

In this blog, I will teach you the following with practical examples:

  • Count Empty strings
  • Count “null” strings
  • Count None values
  • Count Numpy NaN values
  • Using it all together

Gentle reminder:

In Databricks,

  • sparkSession made available as spark
  • sparkContext made available as sc

In case, you want to create it manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder 
    .master("local[*]") 
    .appName("azurelib.com") 
    .getOrCreate()

sc = spark.sparkContext

Create a simple DataFrame

Let’s start by creating a simple Pandas DataFrame, which will be used in upcoming topics.

import numpy as np

data = [
    (1, "", None),
    (2, "null", None),
    (3, "NULL", np.NaN),
    (4, "Suresh", np.NaN),
    (5, "James", np.NaN)
]

df = spark.createDataFrame(data, schema=["id","name","salary"])
df.printSchema()
df.show(truncate=False)

"""
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: double (nullable = true)

+---+------+------+
|id |name  |salary|
+---+------+------+
|1  |      |null  |
|2  |null  |null  |
|3  |NULL  |NaN   |
|4  |Suresh|NaN   |
|5  |James |NaN   |
+---+------+------+
"""

How to count empty strings in a PySpark DataFrame in Azure Databricks?

Let’s see how to count empty strings in a PySpark DataFrame in Azure Databricks. In the below example, let’s try to find the empty strings in the “name” column using different methods.

Example 1:

from pyspark.sql.functions import col, when, count

# method 1:
df.select(
    count(
        when(col("name") == "", 1)
    ).alias("1. Empty string")
 ).show()

"""
Output:

+---------------+
|1. Empty string|
+---------------+
|              1|
+---------------+

"""

As you can see, we have only one empty string value inside the “name” column. Here we have used a comparison operator, and if it matches, it returns 1. By counting all the values, it returns the number of empty strings in that particular column.

Example 2:

from pyspark.sql.functions import col, when, length, count

# Method 2:
df.select(
    count(
        when(length(col("name")) == 0, 1)
    ).alias("2. Empty string")
 ).show()

"""
Output:

+---------------+
|2. Empty string|
+---------------+
|              1|
+---------------+

"""

You can also use the length() function to check the length of the “name column values. If the column value length is 0, which means the column has no value in it.

How to count null strings in a PySpark DataFrame in Azure Databricks?

Let’s see how to count null strings in a PySpark DataFrame in Azure Databricks. In the below example, let’s try to find the null strings in the “name” column using practical example.

Example:

from pyspark.sql.functions import col, when, lower

df.select(
    count(
        when(lower(col("name")) == "null", 1) # case-sensitive
    ).alias("Null string")
 ).show()

"""
Output:

+-----------+
|Null string|
+-----------+
|          2|
+-----------+

"""

As you can see, we have two null string values inside the “name” column. Here we have used a comparison operator along with a lower() function which converts the value into lowercase. This is because the comparison operators are case-sensitive.

If we mention “null”, it only looks for “null” not “NULL”. Therefore, it is always better to convert the value to lower or uppercase before comparison.

How to count None values in a PySpark DataFrame in Azure Databricks?

Let’s see how to count None values in a PySpark DataFrame in Azure Databricks. In the below example, let’s try to find the None value or null values in the “salary” column using some practical examples.

Example 1:

from pyspark.sql.functions import col, when, isnull

# Method 1:
df.select(
    count(
        when(isnull(col("salary")), 1)
    ).alias("1. None value")
 ).show()

"""
Output:

+-------------+
|1. None value|
+-------------+
|            2|
+-------------+

"""

The PySpark’s isnull() function is used for finding the null or None type values in a PySpark DataFrame.

Example 2:

from pyspark.sql.functions import col, when

df.select(
    count(
        when(col("salary").isNull(), 1)
    ).alias("2. None value")
 ).show()

"""
Output:

+-------------+
|2. None value|
+-------------+
|            2|
+-------------+

"""

Alternatively, you can also use the column function isnull() to check for null or None values in a DataFrame.

How to count Numpy NaN values in a PySpark DataFrame in Azure Databricks?

Let’s see how to count Numpy NaN values in a PySpark DataFrame in Azure Databricks. In the below example, let’s try to find the Numpy NaN values in the “salary” column using a practical example.

Example:

from pyspark.sql.functions import col, when, isnan

df.select(
    count(
        when(isnan(col("salary")), 1)
    ).alias("NaN value")
 ).show()

"""
Output:

+---------+
|NaN value|
+---------+
|        3|
+---------+

"""

The PySpark’s isnan() function is used for finding the Numpy NaN type values in a PySpark DataFrame.

How to use all the above-listed functions together in a PySpark DataFrame using Azure Databricks?

Let’s see how to use all the above-listed functions in a PySpark DataFrame using Azure Databricks. In the below example, let’s try to count the null, None, Nan, and an empty string of the DataFrame using a practical example.

Example:

from pyspark.sql.functions import col, when, isnull, isnan, count, lower

df.select([count(when(
    isnan(col(each_col)) | \
    (col(each_col) == "") | \
    isnull(col(each_col)) | \
    (lower(col(each_col)) == "null"), 1 \
)).alias(f"{each_col} -> count") for each_col in df.columns]).show()

"""
Output:

+-----------+-------------+---------------+
|id -> count|name -> count|salary -> count|
+-----------+-------------+---------------+
|          0|            3|              5|
+-----------+-------------+---------------+

"""

I have attached the complete code used in this blog in a notebook format in this GitHub link. You can download and import this notebook in databricks, jupyter notebook, etc.

When should you count null, None, NaN, and an empty string in a PySpark DataFrame using Azure Databricks?

These could be the possible reasons:

  1. When you want to filter out null or None values
  2. When you want to remove null or None values

Real World Use Case Scenarios for counting null, None, NaN, and an empty string in a PySpark DataFrame using Azure Databricks?

  • Assume you have a student dataset and you have been asked to count the number of students who have not filled in the parent contact number.
  • Assume that you have a student dataset and you have been asked to report the number of students in a department. These are some of the scenario for counting null or None values.

What are the alternatives for count null, None, NaN, and an empty string in a PySpark DataFrame using Azure Databricks?

There are multiple alternatives for counting null, None, NaN, and an empty string in a PySpark DataFrame, which are as follows:

  • col() == “” method used for finding empty value
  • isnan() function used for finding the NumPy null values
  • isNull() and col().isNull() functions are used for finding the null values

These are some of the function that helps in finding the null, None, and NaN values. Use the above-mentioned function with the when and count functions to get the counts.

Final Thoughts

In this article, we have learned about counting null, None, NaN, and an empty string in a PySpark DataFrame in Azure Databricks along with the examples explained clearly. I have also covered different scenarios with practical examples that could be possible. I hope the information that was provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

arud seka berne s

I have experience in developing solutions in Python, Big Data, and applications spanning across technologies. A Technology Evangelist for Bigdata (Hadoop, Hive, Spark) and other technologies.