How to find number of rows and columns in PySpark Azure Databricks?

Are you looking for a way to find the number of rows of a PySpark DataFrame in Azure Databricks, or perhaps the number of columns of a DataFrame? If you are looking for a solution to either of these problems, you have landed on the correct page. I will also show you how to use the count() function to count non-null values in PySpark on Azure Databricks, and I will explain everything with practical examples. So without wasting time, let's start with a step-by-step guide to finding the number of rows and columns in PySpark on Azure Databricks.

In this blog, I will teach you the following with practical examples:

  • Get Row count
  • Get Column count
  • Count Null Values
  • Count values of column

PySpark offers two flavors of count() on Azure Databricks: the DataFrame method dataframe.count() returns the total number of records, nulls included, while the aggregate function pyspark.sql.functions.count(column) counts only the non-null values in a column.

Syntax: dataframe_name.count()

Apache Spark Official documentation link: count()

Create a simple DataFrame

Gentle reminder:

In Databricks,

  • the SparkSession is made available as spark
  • the SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

data = [
    (1,"Em","VN"),
    (2,"Emalia","DE"),
    (3,"Odette",None),
    (4,"Mandy","DE"),
    (4,"Mandy","DE")
]

df = spark.createDataFrame(data, schema=["id","name","country"])
df.printSchema()
df.show(truncate=False)

"""
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- country: string (nullable = true)

+---+------+-------+
|id |name  |country|
+---+------+-------+
|1  |Em    |VN     |
|2  |Emalia|DE     |
|3  |Odette|null   |
|4  |Mandy |DE     |
|4  |Mandy |DE     |
+---+------+-------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace file_path with the location of the source file you downloaded

df_2 = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)  # without this, every column is read as string
    .load(file_path)
)
df_2.printSchema()

"""
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- country: string (nullable = true)
"""

How to get number of rows in PySpark DataFrame using Azure Databricks?

In this section, let’s see how to get the number of rows in PySpark DataFrame using Azure Databricks with various methods.

Example 1:

# Method 1:

rows = df.count()
print(f"Number of rows: {rows}")

"""
Output:

Number of rows: 5

"""

Example 2:

from pyspark.sql.functions import col, lit, count

# Method 2:
df.select(count("*")).show()

# Method 3:
df.select(count(lit(1))).show()

# Both methods above produce the same output

"""
Output:

+--------+
|count(1)|
+--------+
|       5|
+--------+

"""

How to get number of columns in PySpark DataFrame using Azure Databricks?

In this section, let’s see how to get the number of columns in PySpark DataFrame using Azure Databricks with an example.

Example:

columns = len(df.columns)
print(f"Number of columns: {columns}")

"""
Output:

Number of columns: 3

"""

How to count null values in PySpark DataFrame using Azure Databricks?

In this section, let’s see how to count null values in PySpark DataFrame using Azure Databricks using various methods.

Example 1:

from pyspark.sql.functions import count, col

# Method 1:
nulls = df.filter(col("country").isNull()).count()
print(f"1. Number of null records: {nulls}")

"""
Output:

1. Number of null records: 1

"""

Example 2:

from pyspark.sql.functions import count, col, when

# Method 2:
df.select(
    count(
        when(col("country").isNull(), 1).otherwise(None) # <- count() ignores null/None
     ).alias("2. Number of null records")
 ).show()

"""
Output:

+-------------------------+
|2. Number of null records|
+-------------------------+
|                        1|
+-------------------------+

"""

How to count column values of PySpark DataFrame in Azure Databricks?

In this section, let’s see how to count column values of PySpark DataFrame in Azure Databricks using SQL expression with various methods.

Example 1:

# Method 1:

from pyspark.sql.functions import count

df.select(count("id"), count("name"), count("country")).show()

"""
Output:

+---------+-----------+--------------+
|count(id)|count(name)|count(country)|
+---------+-----------+--------------+
|        5|          5|             4|
+---------+-----------+--------------+

"""

Example 2:

# Method 2:

from pyspark.sql.functions import count

df.select(*[count(column) for column in df.columns]).show()

"""
Output:

+---------+-----------+--------------+
|count(id)|count(name)|count(country)|
+---------+-----------+--------------+
|        5|          5|             4|
+---------+-----------+--------------+

"""

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download this notebook and import it into Databricks, Jupyter Notebook, etc.

When should you count rows and columns in PySpark Azure Databricks?

These could be the possible reasons:

  1. Whenever you want to see how many columns you have
  2. Whenever you want to get the total number of records in your DataFrame

Real-world use case scenarios for counting rows and columns in PySpark Azure Databricks

  • Assume that you have two datasets and you want to compare the number of records in each before broadcasting. The techniques shown above for finding the number of rows will give you the row counts, and the smaller dataset can then be used for broadcasting.
  • Assume you always expect a CSV file with 10 columns. If the column count does not match, the file should not be considered for processing. The method discussed above for finding the number of columns will be helpful here.

What are the alternatives for counting rows and columns in PySpark Azure Databricks?

There are multiple alternatives for counting rows and columns, which are as follows:

  • rdd.count(): counts the number of records in an RDD; df.rdd.count() returns the same total as df.count()
  • count('*'): counts all records, including rows that contain null values
  • count(1): counts all records, including nulls, and is equivalent to count('*')
  • count(column_name): counts only the non-null values in that column of the DataFrame

Final Thoughts

In this article, we learned how to count rows and columns in PySpark on Azure Databricks, with clearly explained examples. I have also covered the different scenarios you are likely to encounter, using practical examples. I hope the information provided helped you gain knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.