How to find the time difference in PySpark Azure Databricks?

Are you looking for a way to find the time difference between two timestamps in PySpark on Azure Databricks? If so, you have landed on the correct page. In this post, I will show you how to find the time difference using both PySpark and SQL functions, and I will explain each approach with a practical example. So without wasting any time, let’s start with a step-by-step guide to finding the time difference in PySpark.

In this blog, I will teach you the following with practical examples:

  • Syntax of the to_timestamp() function
  • Time difference by casting column
  • Time difference using UNIX time functions
  • Time difference in minutes
  • Time difference in hours
  • Time difference by parsing time
  • Time difference by parsing date time

Important: Because there is no built-in PySpark function for calculating a time difference, we have to devise our own methods. Let’s discuss these methods in the sections below.

In this blog, we will use the to_timestamp() function frequently, so it is best to get familiar with it before using it.

The PySpark to_timestamp() function is a PySpark SQL function used for converting a date-time column of string type to timestamp type.

to_timestamp()

What is the syntax of the to_timestamp() function in PySpark Azure Databricks?

The syntax is as follows:

to_timestamp(date_time_column, pattern)
Parameter Name                 | Required | Description
date_time_column (str, Column) | Yes      | It represents the date-time column.
pattern (str)                  | Optional | It represents the date-time format pattern of the date-time column.
Table 1: to_timestamp() Method in PySpark Databricks Parameter list with Details

Apache Spark Official Documentation Link: to_timestamp()
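
For instance, here is a minimal sketch of how to_timestamp() parses a string column, both with and without an explicit pattern (this assumes a SparkSession named spark is already available, as it is in Databricks):

from pyspark.sql.functions import to_timestamp

# Minimal sketch: parse a date-time string with and without an explicit pattern
sketch_df = spark.createDataFrame([("2022-09-15 22:47:05",)], schema=["raw"])
sketch_df.select(
    to_timestamp("raw").alias("default_parse"),
    to_timestamp("raw", "yyyy-MM-dd HH:mm:ss").alias("explicit_parse")
).show(truncate=False)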

Note: PySpark’s SQL Timestamp function supports both DataFrame and SQL work, very similar to traditional SQL. If you work with data extraction, transformation, and loading, you should have a good understanding of SQL Date functions.

Create a simple DataFrame

Let’s understand how to work with time differences through a variety of examples. Let’s start by creating a DataFrame.

Gentle reminder:

In Databricks,

  • the SparkSession is made available as spark
  • the SparkContext is made available as sc

In case you want to create them manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Creating a PySpark DataFrame manually

data = [
    ("17:54:02.000","2022-09-15 22:47:05","5:40:01.000 PM"),
    ("15:13:55.000","2022-05-21 02:30:12","3:50:43.000 AM"),
    ("10:30:55.000","2022-10-11 11:12:11","3:33:13.000 PM"),
    ("12:28:13.000","2022-06-14 06:11:39","12:22:24.000 AM"),
    ("12:38:58.000","2022-06-12 18:19:34","4:23:13.000 PM")
]

df = spark.createDataFrame(data, schema=["time_1","date_time_1","date_time_2"])
df.printSchema()
df.show(5, truncate=False)

"""
root
 |-- time_1: string (nullable = true)
 |-- date_time_1: string (nullable = true)
 |-- date_time_2: string (nullable = true)

+------------+-------------------+---------------+
|time_1      |date_time_1        |date_time_2    |
+------------+-------------------+---------------+
|17:54:02.000|2022-09-15 22:47:05|5:40:01.000 PM |
|15:13:55.000|2022-05-21 02:30:12|3:50:43.000 AM |
|10:30:55.000|2022-10-11 11:12:11|3:33:13.000 PM |
|12:28:13.000|2022-06-14 06:11:39|12:22:24.000 AM|
|12:38:58.000|2022-06-12 18:19:34|4:23:13.000 PM |
+------------+-------------------+---------------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace the file_path with the source file location which you have downloaded.

df_2 = spark.read.format("csv").option("inferSchema", True).option("header", True).load(file_path)
df_2.printSchema()

"""
root
 |-- time_1: string (nullable = true)
 |-- date_time_1: string (nullable = true)
 |-- date_time_2: string (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to find time difference by casting date time column in PySpark Azure Databricks?

In this section, we’ll look at how to find the time difference in PySpark Azure Databricks by casting the date-time column. Let me explain the process before proceeding with an example.

  • Convert the date-time column from string type to timestamp type
  • Cast the date-time column to long type
  • Subtract both columns to get the time difference

Example:

from pyspark.sql.functions import to_timestamp, current_timestamp, col

# 1. Converting the date_time_1 column to timestamp format
df1 = df.select(
    to_timestamp("date_time_1").alias("date_time"),
    current_timestamp().alias("curr_time")
)
df1.printSchema()
df1.show(truncate=False)

# 2. Converting timestamp to long type
df1 = df1.select(col("date_time").cast("long"), col("curr_time").cast("long"))

# 3. Subtract casted columns
df1 = df1.withColumn("diff_sec", col("curr_time")-col("date_time"))
df1.show(truncate=False)

"""
Output:

root
 |-- date_time: timestamp (nullable = true)
 |-- curr_time: timestamp (nullable = false)

+-------------------+-----------------------+
|date_time          |curr_time              |
+-------------------+-----------------------+
|2022-09-15 22:47:05|2022-11-15 10:41:53.314|
|2022-05-21 02:30:12|2022-11-15 10:41:53.314|
|2022-10-11 11:12:11|2022-11-15 10:41:53.314|
|2022-06-14 06:11:39|2022-11-15 10:41:53.314|
|2022-06-12 18:19:34|2022-11-15 10:41:53.314|
+-------------------+-----------------------+

"""

How to find time difference in PySpark Azure Databricks using UNIX functions?

In this section, we’ll look at how to find the time difference in PySpark Azure Databricks by using UNIX functions. Let me first prepare a DataFrame, and then explain the process before proceeding with an example.

from pyspark.sql.functions import current_timestamp, unix_timestamp, col

# Preparing the DataFrame
df2 = df.select(
    col("date_time_1").alias("date_time"),
    current_timestamp().alias("curr_time")
)
df2.printSchema()
df2.show(truncate=False)

"""
Output:

root
 |-- date_time: string (nullable = true)
 |-- curr_time: timestamp (nullable = false)

+-------------------+-----------------------+
|date_time          |curr_time              |
+-------------------+-----------------------+
|2022-09-15 22:47:05|2022-11-15 11:09:32.124|
|2022-05-21 02:30:12|2022-11-15 11:09:32.124|
|2022-10-11 11:12:11|2022-11-15 11:09:32.124|
|2022-06-14 06:11:39|2022-11-15 11:09:32.124|
|2022-06-12 18:19:34|2022-11-15 11:09:32.124|
+-------------------+-----------------------+

"""

Example:

  • Convert the date-time columns to UNIX time
  • Subtract both columns to get the time difference

from pyspark.sql.functions import unix_timestamp, col

# 1. Converting both columns to UNIX time
df2 = df2.select(
    unix_timestamp("date_time").alias("date_time"),
    unix_timestamp("curr_time").alias("curr_time"))

# 2. Subtract times
df2 = df2.withColumn("diff_sec", col("curr_time")-col("date_time"))
df2.show(truncate=False)

"""
Output:

+----------+----------+--------+
|date_time |curr_time |diff_sec|
+----------+----------+--------+
|1663282025|1668510572|5228547 |
|1653100212|1668510572|15410360|
|1665486731|1668510572|3023841 |
|1655187099|1668510572|13323473|
|1655057974|1668510572|13452598|
+----------+----------+--------+

"""

How to find the time difference in minutes in PySpark Azure Databricks?

In this section, we’ll look at how to find the time difference in minutes in PySpark Azure Databricks with an example.

Example:

from pyspark.sql.functions import col, round

df3 = df1 \
.withColumn("diff_min", round(col("diff_sec") / 60)) \
.drop("diff_sec")

df3.show(truncate=False)

"""
Output:

+----------+----------+--------+
|date_time |curr_time |diff_min|
+----------+----------+--------+
|1663282025|1668508910|87115.0 |
|1653100212|1668508910|256812.0|
|1665486731|1668508910|50370.0 |
|1655187099|1668508910|222030.0|
|1655057974|1668508910|224182.0|
+----------+----------+--------+

"""

How to find the time difference in hours in PySpark Azure Databricks?

In this section, we’ll look at how to find the time difference in hours in PySpark Azure Databricks with an example.

Example:

from pyspark.sql.functions import col, round

df4 = df1 \
.withColumn("diff_hrs", round(col("diff_sec") / 3600)) \
.drop("diff_sec")

df4.show(truncate=False)

"""
Output:

+----------+----------+--------+
|date_time |curr_time |diff_hrs|
+----------+----------+--------+
|1663282025|1668508910|1452.0  |
|1653100212|1668508910|4280.0  |
|1665486731|1668508910|839.0   |
|1655187099|1668508910|3701.0  |
|1655057974|1668508910|3736.0  |
+----------+----------+--------+

"""

How to find time differences in PySpark Azure Databricks by parsing time?

In this section, we’ll look at how to find the time difference in PySpark Azure Databricks by parsing time. Note that when to_timestamp() parses a time-only string, Spark attaches the current date to it, so the difference can be negative whenever the parsed time of day is still ahead of the current time, as you can see in the output below. Let me explain the process before proceeding with an example.

  • Convert the time column from string type to timestamp type
  • Cast the date-time column to long type
  • Subtract both columns to get the time difference

Example:

from pyspark.sql.functions import to_timestamp, current_timestamp, col

# 1. Converting the time_1 column to timestamp format
df5 = df.select(
    to_timestamp("time_1").alias("date_time"),
    current_timestamp().alias("curr_time")
)

# 2. Converting timestamp to long type
df5 = df5.select(col("date_time").cast("long"), col("curr_time").cast("long"))

# 3. Subtract times
df5 = df5.withColumn("diff_sec", col("curr_time")-col("date_time"))
df5.show(truncate=False)

"""
Output:

+----------+----------+--------+
|date_time |curr_time |diff_sec|
+----------+----------+--------+
|1668534842|1668511148|-23694  |
|1668525235|1668511148|-14087  |
|1668508255|1668511148|2893    |
|1668515293|1668511148|-4145   |
|1668515938|1668511148|-4790   |
+----------+----------+--------+

"""

How to find time differences in PySpark Azure Databricks by parsing date time?

In this section, we’ll look at how to find the time difference in PySpark Azure Databricks by parsing date time. Note that since the pattern "h:mm:ss.SSS a" contains no date fields, the parsed values default to the epoch date 1970-01-01, which is why the diff_sec values in the output below are so large. Let me explain the process before proceeding with an example.

  • Convert the date time column from string type to timestamp type
  • Cast the date-time column to long type
  • Subtract both columns to get the time difference

Example:

from pyspark.sql.functions import to_timestamp, current_timestamp, col

# 1. Converting the date_time_2 column to timestamp format
df6 = df.select(
    to_timestamp("date_time_2", "h:mm:ss.SSS a").alias("date_time"),
    current_timestamp().alias("curr_time")
)

# 2. Converting timestamp to long type
df6 = df6.select(col("date_time").cast("long"), col("curr_time").cast("long"))

# 3. Subtract times
df6 = df6.withColumn("diff_sec", col("curr_time")-col("date_time"))
df6.show(truncate=False)

"""
Output:

+---------+----------+----------+
|date_time|curr_time |diff_sec  |
+---------+----------+----------+
|63601    |1668511210|1668447609|
|13843    |1668511210|1668497367|
|55993    |1668511210|1668455217|
|1344     |1668511210|1668509866|
|58993    |1668511210|1668452217|
+---------+----------+----------+

"""

How to find time difference in PySpark Azure Databricks using SQL expression?

Let’s see how to find the time difference in PySpark Azure Databricks with a SQL example by creating a small DataFrame. In order to use a raw SQL expression, we have to create a view of our PySpark DataFrame.

Example:

small_data = [("2020-01-01 00:00:00", "2021-12-31 11:59:59")]
df7 = spark.createDataFrame(small_data, schema=["from","to"])
df7.createOrReplaceTempView("sql_date_time")

spark.sql('''
SELECT
    unix_timestamp(`to`) - unix_timestamp(`from`) AS diff_sec,
    ROUND((unix_timestamp(`to`) - unix_timestamp(`from`)) / 60) AS diff_min,
    ROUND((unix_timestamp(`to`) - unix_timestamp(`from`)) / 3600) AS diff_hrs
FROM `sql_date_time`
''').show(truncate=False)

"""
Output:

+--------+---------+--------+
|diff_sec|diff_min |diff_hrs|
+--------+---------+--------+
|63115199|1051920.0|17532.0 |
+--------+---------+--------+

"""

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook in Databricks, Jupyter Notebook, etc.

When should you find time difference in PySpark Azure Databricks?

These could be the possible reasons:

  1. To find the time difference between two timestamps
  2. To find the time difference between two date-time values

What are the real-world use cases for finding the time difference in PySpark Azure Databricks?

Assume that you have a flight dataset and you want to find out the time taken for a flight from source to destination. Although PySpark lacks a built-in function for calculating time differences, you can use any of the above-mentioned methods to solve this, as shown in the sketch below.
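
Here is a minimal sketch of the UNIX-time method applied to this scenario (the flight numbers, column names, and timestamps below are hypothetical, made up just for illustration):

from pyspark.sql.functions import unix_timestamp, round

# Hypothetical flight data with departure and arrival timestamps as strings
flights = spark.createDataFrame([
    ("AZ101", "2022-11-15 06:10:00", "2022-11-15 09:45:00"),
    ("AZ204", "2022-11-15 13:05:00", "2022-11-15 14:40:00")
], schema=["flight_no", "departure", "arrival"])

# Duration in minutes: convert both columns to UNIX time and subtract
flights.withColumn(
    "duration_min",
    round((unix_timestamp("arrival") - unix_timestamp("departure")) / 60)
).show(truncate=False)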

What are the alternatives for finding time differences in PySpark Azure Databricks?

There are multiple alternatives for finding time differences in PySpark, which are as follows:

  • Time difference by casting column
  • Time difference using UNIX time functions

Final Thoughts

In this article, we have learned how to find time differences using various date and time formats in PySpark Azure Databricks, with clearly explained examples. I have also covered the different practical scenarios in which this could be useful. I hope the information provided helped you gain some knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.