Are you looking for a way to find null values in PySpark on Azure Databricks, or perhaps a way to find not null values using PySpark's built-in functions? If so, you have landed on the correct page. I will also show you how to use SQL expressions to find null and not null values in PySpark on Azure Databricks, and I will explain everything with practical examples. So without wasting time, let's start with a step-by-step guide to finding null and not null values of a PySpark DataFrame in Azure Databricks.
In this blog, I will teach you the following with practical examples:
- Finding null values
- Finding not null values
- Finding null and not null values using multiple expressions
- Finding null and not null values using SQL expression
Contents
- 1 Create a simple DataFrame
- 2 How to find null values of the PySpark DataFrame Column in Azure Databricks?
- 3 How to find not null values of the PySpark DataFrame Column in Azure Databricks?
- 4 How to use isNull() and isNotNull() column functions together in PySpark Azure Databricks?
- 5 How to find null and not null values in PySpark DataFrame using SQL expression?
- 6 When should you find null and not null values of PySpark DataFrame using Azure Databricks?
- 7 Real World Use Case Scenarios for finding null and not null values of PySpark DataFrame using Azure Databricks?
- 8 What are the alternatives for finding null and not null values of PySpark DataFrame using Azure Databricks?
- 9 Final Thoughts
Create a simple DataFrame
Gentle reminder:
In Databricks,
- the SparkSession is made available as spark
- the SparkContext is made available as sc
In case you want to create them manually, use the code below.
from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext
a) Create manual PySpark DataFrame
data = [
(1,"Philis",None,None),
(2,"Issiah",None,"American"),
(3,"Moishe","9134728713",None),
(4,"Vivianna","8272404634",None),
(5,"Bendix",None,None)
]
df = spark.createDataFrame(data, schema=["id","name","phone","address"])
df.printSchema()
df.show(5, truncate=False)
"""
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- phone: string (nullable = true)
|-- address: string (nullable = true)
+---+--------+----------+--------+
|id |name |phone |address |
+---+--------+----------+--------+
|1 |Philis |null |null |
|2 |Issiah |null |American|
|3 |Moishe |9134728713|null |
|4 |Vivianna|8272404634|null |
|5 |Bendix |null |null |
+---+--------+----------+--------+
"""
b) Creating a DataFrame by reading files
Download and use the below source file.
# replace the file_path with the source file location which you have downloaded.
df_2 = spark.read.format("csv").option("header", True).load(file_path)
df_2.printSchema()
"""
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- phone: string (nullable = true)
|-- address: string (nullable = true)
"""
Note: Here, I will be using the manually created DataFrame.
How to find null values of the PySpark DataFrame Column in Azure Databricks?
Let’s see how to find the null values of a PySpark DataFrame column in Azure Databricks. In the examples below, let’s try to find the people who have not given their phone numbers.
Example 1:
# Method 1:
from pyspark.sql.functions import col
df.select("id", "name", col("phone").isNull().alias("has_no_phone")).show()
"""
Output:
+---+--------+------------+
| id|    name|has_no_phone|
+---+--------+------------+
|  1|  Philis|        true|
|  2|  Issiah|        true|
|  3|  Moishe|       false|
|  4|Vivianna|       false|
|  5|  Bendix|        true|
+---+--------+------------+
"""
Example 2:
# Method 2:
from pyspark.sql.functions import col, isnull
df.select("id", "name", isnull("phone").alias("has_no_phone")).show()
"""
Output:
+---+--------+------------+
| id|    name|has_no_phone|
+---+--------+------------+
|  1|  Philis|        true|
|  2|  Issiah|        true|
|  3|  Moishe|       false|
|  4|Vivianna|       false|
|  5|  Bendix|        true|
+---+--------+------------+
"""
Both the Column.isNull() and isnull() functions are used for finding the null/None values of a DataFrame column. Since both return a boolean Column expression, they can be used with filter() or where() to fetch the empty records.
df.filter(isnull("phone")) \
.select("id","name","phone").show()
"""
Output:
+---+------+-----+
| id| name|phone|
+---+------+-----+
| 1|Philis| null|
| 2|Issiah| null|
| 5|Bendix| null|
+---+------+-----+
"""
How to find not null values of the PySpark DataFrame Column in Azure Databricks?
Let’s see how to find the not null values of a PySpark DataFrame column in Azure Databricks. In the example below, let’s try to find the people who have given their phone numbers.
Example:
from pyspark.sql.functions import col
df.select("id", "name", col("phone").isNotNull().alias("has_phone")).show()
"""
Output:
+---+--------+---------+
| id|    name|has_phone|
+---+--------+---------+
|  1|  Philis|    false|
|  2|  Issiah|    false|
|  3|  Moishe|     true|
|  4|Vivianna|     true|
|  5|  Bendix|    false|
+---+--------+---------+
"""
The Column.isNotNull() function is used for finding the not null/None values of a DataFrame column. Since it returns a boolean Column expression, it can be used with filter() or where() to fetch the non-empty records.
df.filter(col("phone").isNotNull()) \
.select("id","name","phone").show()
"""
Output:
+---+--------+----------+
| id| name| phone|
+---+--------+----------+
| 3| Moishe|9134728713|
| 4|Vivianna|8272404634|
+---+--------+----------+
"""
How to use isNull() and isNotNull() column functions together in PySpark Azure Databricks?
Let’s see how to use the isNull() and isNotNull() column functions together in PySpark Azure Databricks. In the example below, let’s try to find the people who have an address but no phone number.
Example:
from pyspark.sql.functions import col
df.filter(
(col("phone").isNull()) & (col("address").isNotNull())
).show()
"""
Output:
+---+------+-----+--------+
| id| name|phone| address|
+---+------+-----+--------+
| 2|Issiah| null|American|
+---+------+-----+--------+
"""
As a best practice, always wrap each condition in parentheses. Python’s & and | operators bind more tightly than comparison operators, so unparenthesized conditions can be parsed in unexpected ways.
How to find null and not null values in PySpark DataFrame using SQL expression?
Let’s see how to find null and not null values in PySpark Azure Databricks using SQL expressions. In order to use a raw SQL expression, we first have to register our DataFrame as a temporary view.
df.createOrReplaceTempView("contacts")
Let’s use Spark SQL for finding the null and not null values with some practical examples.
Example 1:
# isNull
spark.sql('''
SELECT id, name, phone FROM contacts
WHERE phone IS NULL
''').show()
"""
Output:
+---+------+-----+
| id| name|phone|
+---+------+-----+
| 1|Philis| null|
| 2|Issiah| null|
| 5|Bendix| null|
+---+------+-----+
"""
Example 2:
# isNotNull
spark.sql('''
SELECT id, name, phone FROM contacts
WHERE phone IS NOT NULL
''').show()
"""
Output:
+---+--------+----------+
| id| name| phone|
+---+--------+----------+
| 3| Moishe|9134728713|
| 4|Vivianna|8272404634|
+---+--------+----------+
"""
Example 3:
# Multiple expression
spark.sql('''
SELECT * FROM contacts
WHERE phone IS NULL AND address IS NULL
''').show()
"""
Output:
+---+------+-----+-------+
| id| name|phone|address|
+---+------+-----+-------+
| 1|Philis| null| null|
| 5|Bendix| null| null|
+---+------+-----+-------+
"""
I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.
When should you find null and not null values of PySpark DataFrame using Azure Databricks?
These could be the possible reasons:
- You should check for null values before a column is aggregated, because most aggregate functions silently skip nulls.
- Finding the not null values helps in filtering out incomplete records.
Real World Use Case Scenarios for finding null and not null values of PySpark DataFrame using Azure Databricks?
- Assume you have a student dataset and you have been asked to fetch the students who have not submitted their parents' phone numbers. The isNull() function helps in this scenario to fetch those records.
- Assume you have a student dataset and you have been asked to send letters to all the students who have filled in their address. By using the isNotNull() function, you can fetch the students who have entered an address.
What are the alternatives for finding null and not null values of PySpark DataFrame using Azure Databricks?
There are multiple alternatives for finding null and not null values of PySpark DataFrame, which are as follows:
- Column.isNull() and Column.isNotNull()
- PySpark SQL expression
Final Thoughts
In this article, we have learned how to find null and not null values of a PySpark DataFrame in Azure Databricks, with clearly explained examples. I have also covered different practical scenarios where this is useful. I hope the information provided helped you gain knowledge.
Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.
As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.