How to find null and not null values in PySpark Azure Databricks?

Are you looking for a way to find null values in PySpark on Azure Databricks, or perhaps for a way to find not null values using PySpark's built-in functions? If so, you have landed on the correct page. I will also show you how to use SQL expressions to find null and not null values in PySpark on Azure Databricks, and I will explain everything with practical examples. So without wasting time, let's start with a step-by-step guide to finding the null and not null values of a PySpark DataFrame in Azure Databricks.

In this blog, I will teach you the following with practical examples:

  • Finding null values
  • Finding not null values
  • Finding null and not null values using multiple expressions
  • Finding null and not null values using SQL expression

Create a simple DataFrame

Gentle reminder:

In Databricks,

  • the SparkSession is made available as spark
  • the SparkContext is made available as sc

In case you want to create them manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

data = [
    (1,"Philis",None,None),
    (2,"Issiah",None,"American"),
    (3,"Moishe","9134728713",None),
    (4,"Vivianna","8272404634",None),
    (5,"Bendix",None,None)
]

df = spark.createDataFrame(data, schema=["id","name","phone","address"])
df.printSchema()
df.show(5, truncate=False)

"""
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- address: string (nullable = true)

+---+--------+----------+--------+
|id |name    |phone     |address |
+---+--------+----------+--------+
|1  |Philis  |null      |null    |
|2  |Issiah  |null      |American|
|3  |Moishe  |9134728713|null    |
|4  |Vivianna|8272404634|null    |
|5  |Bendix  |null      |null    |
+---+--------+----------+--------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace the file_path with the source file location which you have downloaded.

df_2 = spark.read.format("csv").option("header", True).load(file_path)
df_2.printSchema()

"""
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- address: string (nullable = true)
"""

Note that without the inferSchema option, every column of a CSV file is read as a string.

Note: Here, I will be using the manually created DataFrame.

How to find null values of the PySpark DataFrame Column in Azure Databricks?

Let’s see how to find the null values of a PySpark DataFrame column in Azure Databricks. In the below examples, let’s try to find the people who have not given their mobile numbers.

Example 1:

# Method 1:

from pyspark.sql.functions import col

df.select("id", "name", col("phone").isNull().alias("has_no_phone")).show()

"""
Output:

+---+--------+------------+
| id|    name|has_no_phone|
+---+--------+------------+
|  1|  Philis|        true|
|  2|  Issiah|        true|
|  3|  Moishe|       false|
|  4|Vivianna|       false|
|  5|  Bendix|        true|
+---+--------+------------+

"""

Example 2:

# Method 2:

from pyspark.sql.functions import col, isnull

df.select("id", "name", isnull("phone").alias("has_no_phone")).show()

"""
Output:

+---+--------+------------+
| id|    name|has_no_phone|
+---+--------+------------+
|  1|  Philis|        true|
|  2|  Issiah|        true|
|  3|  Moishe|       false|
|  4|Vivianna|       false|
|  5|  Bendix|        true|
+---+--------+------------+

"""

Both the Column.isNull() and isnull() functions are used for finding the null/None values of a DataFrame column. Since the above two functions return a boolean Column, they can be used with filter() or where() to fetch the empty records.

df.filter(isnull("phone")) \
.select("id","name","phone").show()

"""
Output:

+---+------+-----+
| id|  name|phone|
+---+------+-----+
|  1|Philis| null|
|  2|Issiah| null|
|  5|Bendix| null|
+---+------+-----+

"""

How to find not null values of the PySpark DataFrame Column in Azure Databricks?

Let’s see how to find the not null values of a PySpark DataFrame column in Azure Databricks. In the below example, let’s try to find the people who have given their mobile numbers.

Example:

from pyspark.sql.functions import col

df.select("id", "name", col("phone").isNotNull().alias("has_phone")).show()

"""
Output:

+---+--------+---------+
| id|    name|has_phone|
+---+--------+---------+
|  1|  Philis|    false|
|  2|  Issiah|    false|
|  3|  Moishe|     true|
|  4|Vivianna|     true|
|  5|  Bendix|    false|
+---+--------+---------+

"""

The Column.isNotNull() function is used for finding the not null/None values of a DataFrame column. Since this function returns a boolean Column, it can be used with filter() or where() to fetch the non-empty records.

df.filter(col("phone").isNotNull()) \
.select("id","name","phone").show()

"""
Output:

+---+--------+----------+
| id|    name|     phone|
+---+--------+----------+
|  3|  Moishe|9134728713|
|  4|Vivianna|8272404634|
+---+--------+----------+

"""

How to use isNull() and isNotNull() column functions together in PySpark Azure Databricks?

Let’s see how to use the isNull() and isNotNull() column functions together in PySpark on Azure Databricks. In the below example, let’s try to find the people who have an address but no phone number.

Example:

from pyspark.sql.functions import col

df.filter(
    (col("phone").isNull()) & (col("address").isNotNull())
).show()

"""
Output:

+---+------+-----+--------+
| id|  name|phone| address|
+---+------+-----+--------+
|  2|Issiah| null|American|
+---+------+-----+--------+

"""

For best practice, always wrap each condition in parentheses ‘()’, because the & and | operators have higher precedence than comparison operators in Python.

How to find null and not null values in PySpark DataFrame using SQL expression?

Let’s see how to find null and not null values in PySpark on Azure Databricks using SQL expressions. In order to use raw SQL expressions, we have to register our DataFrame as a temporary view.

df.createOrReplaceTempView("contacts")

Let’s use Spark SQL for finding the null and not null values with some practical examples.

Example 1:

# isNull

spark.sql('''
    SELECT id, name, phone FROM contacts
    WHERE phone IS NULL
''').show()

"""
Output:

+---+------+-----+
| id|  name|phone|
+---+------+-----+
|  1|Philis| null|
|  2|Issiah| null|
|  5|Bendix| null|
+---+------+-----+

"""

Example 2:

# isNotNull

spark.sql('''
    SELECT id, name, phone FROM contacts
    WHERE phone IS NOT NULL
''').show()

"""
Output:

+---+--------+----------+
| id|    name|     phone|
+---+--------+----------+
|  3|  Moishe|9134728713|
|  4|Vivianna|8272404634|
+---+--------+----------+

"""

Example 3:

# Multiple expression

spark.sql('''
    SELECT * FROM contacts
    WHERE phone IS NULL AND address IS NULL
''').show()

"""
Output:

+---+------+-----+-------+
| id|  name|phone|address|
+---+------+-----+-------+
|  1|Philis| null|   null|
|  5|Bendix| null|   null|
+---+------+-----+-------+

"""

I have attached the complete code used in this blog in notebook format at this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you find null and not null values of PySpark DataFrame using Azure Databricks?

These could be the possible reasons:

  1. We should check for null values when a column is subjected to aggregation, since aggregate functions silently ignore nulls and can skew the results.
  2. Finding the not null values helps in filtering out the null records.

Real World Use Case Scenarios for finding null and not null values of PySpark DataFrame using Azure Databricks?

  • Assume you have a student dataset and you have been asked to fetch the students who have not submitted their parents’ phone numbers. The isNull() function helps in this scenario to fetch the null values.
  • Assume you have a student dataset and you have been asked to send letters to all the students who have filled in their address. By using the isNotNull() function, you can fetch the students who have entered an address.

What are the alternatives for finding null and not null values of PySpark DataFrame using Azure Databricks?

There are multiple alternatives for finding null and not null values of PySpark DataFrame, which are as follows:

  • Column.isNull() and Column.isNotNull()
  • PySpark SQL expression

Final Thoughts

In this article, we have learned how to find the null and not null values of a PySpark DataFrame in Azure Databricks, with clearly explained examples. I have also covered the different scenarios that could come up, with practical examples. I hope the information provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.


I have experience in developing solutions in Python, Big Data, and applications spanning across technologies. A Technology Evangelist for Bigdata (Hadoop, Hive, Spark) and other technologies.