How to filter records of DataFrame in PySpark Azure Databricks?

Are you looking to find how to filter records of PySpark Dataframe into Azure Databricks cloud or maybe you are looking for a solution, to get records of a Dataframe based on condition in PySpark Databricks? If you are looking for any of these problem solutions, then you have landed on the correct page. I will also show you what and how to use PySpark to filter out records of a Dataframe using the filter() function in Azure Databricks. I will explain it with a practical example. So don’t waste time let’s start with a step-by-step guide to understanding how to filter records or rows using a PySpark Dataframe.

In this blog, I will teach you the following with practical examples:

  • Syntax of filter()
  • Filter records based on a single condition
  • Filter records based on multiple conditions
  • Filter records based on array values
  • Filter records using string functions

filter() method is used to get matching records from Dataframe based on column conditions specified in PySpark Azure Databricks.

Syntax: dataframe_name.filter(condition)

What is the syntax of the filter() function in PySpark Azure Databricks?

The syntax is as follows:

dataframe_name.filter(condition)
Parameter NameRequiredDescription
condition (Column or str)YesIt represents the condition to fetch records from DataFrame.
Table 1: filter() Method in PySpark Databricks Parameter list with Details

Apache Spark Official documentation link: filter()

Create a simple DataFrame

Gentle reminder:

In Databricks,

  • sparkSession made available as spark
  • sparkContext made available as sc

In case, you want to create it manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder 
    .master("local[*]") 
    .appName("azurelib.com") 
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

data = [     
    ("Sentra",25,8,"USA"),
    ("Titan",13,10,"Japan"),
    ("1000",25,8,"Japan"),
    ("Voyager",19,8,"USA"),
    ("Cabriolet",14,10,"Japan")
]

df = spark.createDataFrame(data, schema=["name","miles","cylinders","origin"])
df.printSchema()
df.show()

"""
root
 |-- name: string (nullable = true)
 |-- miles: long (nullable = true)
 |-- cylinders: long (nullable = true)
 |-- origin: string (nullable = true)

+---------+-----+---------+------+
|     name|miles|cylinders|origin|
+---------+-----+---------+------+
|   Sentra|   25|        8|   USA|
|    Titan|   13|       10| Japan|
|     1000|   25|        8| Japan|
|  Voyager|   19|        8|   USA|
|Cabriolet|   14|       10| Japan|
+---------+-----+---------+------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace the file_paths with the source file location which you have downloaded.

df_2 = spark.read.format("csv").option("header", True).load(file_path)
df_2.printSchema()

"""
root
 |-- name: string (nullable = true)
 |-- miles: long (nullable = true)
 |-- cylinders: long (nullable = true)
 |-- origin: string (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to filter records based on column conditions in PySpark Azure Databricks?

The PySpark function filter() is used to fetch records from a DataFrame based on column condition.

Filter Example in pyspark Azure Databricks:

In the below example, we are trying to get 8-cylinder cars.

from pyspark.sql.functions import col

# Method 1
df.filter("cylinders = 8").show()

# Method 3
df.filter(df.cylinders == 8).show()

# Method 3
df.filter(col("cylinders") == 8).show()

# Method 4
df.filter(df["cylinders"] == 8).show()

# The above code sample gives the same output as mentioned below.

"""
Output:

+-------+-----+---------+------+
|   name|miles|cylinders|origin|
+-------+-----+---------+------+
| Sentra|   25|        8|   USA|
|   1000|   25|        8| Japan|
|Voyager|   19|        8|   USA|
+-------+-----+---------+------+

"""

How to filter records based on multiple conditions in PySpark Azure Databricks?

Filter Example in pyspark Azure Databricks with multiple conditions :

In the below example, we are trying to get 8-cylinder cars that are not from the USA.

from pyspark.sql.functions import col

# Method 1
df.filter("cylinders = 8 AND origin != 'USA'").show()

# Method 2
df.filter( (df.cylinders == 8) & (df.origin != "USA") ).show()

# Method 3
df.filter( (col("cylinders") == 8) & (col("origin") != "USA") ).show()

# Method 4
df.filter( (df["cylinders"] == 8) & ~(df["origin"] == "USA") ).show()

# The above code sample gives the same output as mentioned below.

"""
Output:

+----+-----+---------+------+
|name|miles|cylinders|origin|
+----+-----+---------+------+
|1000|   25|        8| Japan|
+----+-----+---------+------+

"""

How to filter records based on array values in PySpark Azure Databricks?

Example filter records based on array values:

In the below example, we are trying to get cars from Japan and German.

from pyspark.sql.functions import col

origin_list = ["Japan", "German"]
df.filter(col("origin").isin(origin_list)).show()

"""
Output:

+---------+-----+---------+------+
|     name|miles|cylinders|origin|
+---------+-----+---------+------+
|    Titan|   13|       10| Japan|
|     1000|   25|        8| Japan|
|Cabriolet|   14|       10| Japan|
+---------+-----+---------+------+

"""


How to filter records using string functions in PySpark Azure Databricks?

Example:

In the below example, we are trying to get cars whose name starts with ford.

# 1. Starts with car name as 'T'
df.filter(col('name').startswith('T')).show()

"""
Output:

+-----+-----+---------+------+
| name|miles|cylinders|origin|
+-----+-----+---------+------+
|Titan|   13|       10| Japan|
+-----+-----+---------+------+

"""

# 2. Ends with origin as 'an'
df.filter(col('origin').endswith('an')).show()

"""
Output:

+---------+-----+---------+------+
|     name|miles|cylinders|origin|
+---------+-----+---------+------+
|    Titan|   13|       10| Japan|
|     1000|   25|        8| Japan|
|Cabriolet|   14|       10| Japan|
+---------+-----+---------+------+

"""

# 3. Contains 'ya' in car name
df.filter(col('name').contains('ya')).show()

"""
Output:

+-------+-----+---------+------+
|   name|miles|cylinders|origin|
+-------+-----+---------+------+
|Voyager|   19|        8|   USA|
+-------+-----+---------+------+

"""

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook in databricks, jupyter notebook, etc.

When should you use the PySpark filter() function in Azure Databricks?

You can use the filter() function to get or fetch records based on certain column conditions. For best performance, always use the filter as the first transformation in your code.

Real World Use Case Scenarios for PySpark DataFrame collect() function in Azure Databricks?

  • My DataFrame contains all the records. However, for my business use case, only certain records are relevant. Hence in this, we can use the filter function to remove all the irrelevant records from the DataFrame.

What are the alternatives of the dropDuplicates() function in PySpark Azure Databricks?

If you are coming from an SQL background use the where() function to get records from a DataFrame based on column condition, both these functions operate exactly the same.

where() function works almost similarly to the filter function.

Final Thoughts

In this article, we have learned about the PySpark filter() method to fetch records or rows of DataFrame based on column conditions in Azure Databricks along with the examples explained clearly. I have also covered different scenarios with practical examples that could be possible. I hope the information that was provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.