How to use Timestamp SQL Functions in PySpark Azure Databricks?

Are you looking to find out how to extract, manipulate, or convert the timestamp columns of a PySpark DataFrame in Azure Databricks, or perhaps how to manipulate a DataFrame timestamp column's values using PySpark's SQL methods? If so, you have landed on the correct page. In this guide, I will show you how to use PySpark SQL functions to extract, manipulate, and convert the timestamp column values of DataFrames in Azure Databricks, explained with practical examples. So, without wasting time, let's start with a step-by-step guide to working with timestamp columns in a PySpark DataFrame.

In this blog, I will teach you the following with practical examples:

  • Introduction to PySpark SQL Functions
  • Commonly used SQL Timestamp functions
  • Timestamp conversion
  • Data type conversion

Create a simple DataFrame

Let's understand the use of the SQL timestamp functions with a variety of examples, starting by creating a DataFrame.

Gentle reminder:

In Databricks,

  • the SparkSession is made available as spark
  • the SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Create a PySpark DataFrame manually

data = [
    ("11:58:12.000 PM","12:20:06.000","2022-06-28 18:51:38"),
    ("10:59:42.000 PM","12:26:04.000","2021-12-23 07:51:28"),
    ("5:54:46.000 PM","17:47:13.000","2022-04-24 16:54:36"),
    ("6:16:26.000 PM","2:49:42.000","2021-11-15 17:22:32"),
    ("12:28:07.000 PM","19:01:51.000","2022-08-23 07:18:19"),
    ("6:55:01.000 AM","22:09:54.000","2022-02-23 01:00:06"),
    ("12:50:06.000 PM","4:17:42.000","2022-04-02 17:34:48"),
    ("4:16:39.000 PM","19:30:41.000","2021-10-31 00:24:51"),
    ("10:05:03.000 PM","15:59:34.000","2022-08-06 23:37:36"),
    ("9:44:34.000 PM","4:14:07.000","2022-07-19 04:57:58")
]

columns = ["time_1","time_2","date_time"]
df = spark.createDataFrame(data, schema=columns)
df.printSchema()
df.show(5, truncate=False)

"""
root
 |-- time_1: string (nullable = true)
 |-- time_2: string (nullable = true)
 |-- date_time: string (nullable = true)

+---------------+------------+-------------------+
|time_1         |time_2      |date_time          |
+---------------+------------+-------------------+
|11:58:12.000 PM|12:20:06.000|2022-06-28 18:51:38|
|10:59:42.000 PM|12:26:04.000|2021-12-23 07:51:28|
|5:54:46.000 PM |17:47:13.000|2022-04-24 16:54:36|
|6:16:26.000 PM |2:49:42.000 |2021-11-15 17:22:32|
|12:28:07.000 PM|19:01:51.000|2022-08-23 07:18:19|
+---------------+------------+-------------------+
"""

b) Create a DataFrame by reading files

Download and use the source file below.

# Replace file_path with the location of the source file you downloaded.

df_2 = spark.read.format("csv").option("inferSchema", True).option("header", True).load("file_path")
df_2.printSchema()

"""
root
 |-- time_1: string (nullable = true)
 |-- time_2: string (nullable = true)
 |-- date_time: string (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

PySpark's SQL timestamp functions work with both the DataFrame API and SQL queries, very much like traditional SQL. If you work with data extraction, transformation, and loading (ETL), you should have a good understanding of these timestamp functions. For instance, the same extraction can be written either way, as shown in the sketch below.
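Here is a minimal sketch that applies the hour() function (covered in detail below) both through the DataFrame API and in a spark.sql() query over a temporary view; the view name tbl_times is only an illustration:

from pyspark.sql.functions import hour

# DataFrame API
df.select(hour("date_time").alias("hr")).show(3)

# Equivalent SQL on a temporary view (view name is illustrative)
df.createOrReplaceTempView("tbl_times")
spark.sql("SELECT hour(date_time) AS hr FROM tbl_times").show(3)

# Both print the hours of the first three rows: 18, 7, and 16.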

What are the commonly used Timestamp functions in PySpark Azure Databricks?

In PySpark, we have lots of functions for working with timestamp columns. But, for the sake of simplicity, let me divide these functions into three types. They are:

  • Extract functions
  • Manipulate functions
  • Format functions

How to extract times from time columns in PySpark Azure Databricks?

In this section, I will explain to you how to extract times from time columns in PySpark Azure Databricks using practical examples.

# 1. Extract time function

from pyspark.sql.functions import hour, minute, second

df \
.withColumn("Hour", hour("time_2")) \
.withColumn("Minute", minute("time_2")) \
.withColumn("Second", second("time_2")) \
.select("time_2", "Hour", "Minute", "Second") \
.show(5, truncate=False)

"""
Output:

+------------+----+------+------+
|time_2      |Hour|Minute|Second|
+------------+----+------+------+
|12:20:06.000|12  |20    |6     |
|12:26:04.000|12  |26    |4     |
|17:47:13.000|17  |47    |13    |
|2:49:42.000 |2   |49    |42    |
|19:01:51.000|19  |1     |51    |
+------------+----+------+------+

"""
# 2. current time

from pyspark.sql.functions import current_timestamp

df \
.withColumn("current_time", current_timestamp()) \
.select("current_time") \
.show(5, truncate=False)

"""
Output:

+-----------------------+
|current_time           |
+-----------------------+
|2022-10-28 11:29:48.641|
|2022-10-28 11:29:48.641|
|2022-10-28 11:29:48.641|
|2022-10-28 11:29:48.641|
|2022-10-28 11:29:48.641|
+-----------------------+

"""

How to format time columns in PySpark Azure Databricks?

In this section, I will explain to you all the time format functions using practical examples.

Example 1: Convert time format from 12hr to 24hr.

from pyspark.sql.functions import unix_timestamp, from_unixtime

fmt_df = df \
.withColumn("unix_time", unix_timestamp("time_1", "h:mm:ss.SSS a")) \
.withColumn("from_unix", from_unixtime("unix_time", "k:mm:ss.SSS")) \
.selectExpr("time_1 as 12hr", "unix_time", "from_unix as 24hr")

fmt_df.show(5)

"""
Output:

+---------------+---------+------------+
|           12hr|unix_time|        24hr|
+---------------+---------+------------+
|11:58:12.000 PM|    86292|23:58:12.000|
|10:59:42.000 PM|    82782|22:59:42.000|
| 5:54:46.000 PM|    64486|17:54:46.000|
| 6:16:26.000 PM|    65786|18:16:26.000|
|12:28:07.000 PM|    44887|12:28:07.000|
+---------------+---------+------------+

"""

Example 2: Convert time format from 24hr to 12hr.

from pyspark.sql.functions import unix_timestamp, from_unixtime

fmt_df.select("24hr") \
.withColumn("unix_time", unix_timestamp("24hr", "k:mm:ss.SSS")) \
.withColumn("12hr", from_unixtime("unix_time", "h:mm:ss.SSS a")) \
.show(5)

"""
Output:

+------------+---------+---------------+
|        24hr|unix_time|           12hr|
+------------+---------+---------------+
|23:58:12.000|    86292|11:58:12.000 PM|
|22:59:42.000|    82782|10:59:42.000 PM|
|17:54:46.000|    64486| 5:54:46.000 PM|
|18:16:26.000|    65786| 6:16:26.000 PM|
|12:28:07.000|    44887|12:28:07.000 PM|
+------------+---------+---------------+

"""

How to convert time column of StringType to TimestampType in PySpark Azure Databricks?

In this section, I will explain to you all the methods for converting a time column of a StringType to a TimestampType using practical examples.

from pyspark.sql.functions import to_timestamp, col

# String
df.select("time_2").printSchema()

"""
Output:

root
 |-- time_2: string (nullable = true)

"""

# Various methods to convert a string column to TimestampType
# Method 1:
df.withColumn("time_2", to_timestamp("time_2")) \
.select("time_2").printSchema()

# Method 2:
df.withColumn("time_2", col("time_2").cast("Timestamp")).select("time_2").printSchema()

# Method 3:
df.selectExpr("CAST(time_2 as TIMESTAMP)").printSchema()

# All three methods above give the following output.

"""
Output:

root
 |-- time_2: timestamp (nullable = true)

"""

I have attached the complete code used in this blog, in notebook format, to this GitHub link. You can download and import the notebook into Databricks, Jupyter Notebook, etc.

When should you use the PySpark SQL Timestamp Functions in Azure Databricks?

These are some of the possible reasons:

  1. To get the current time
  2. To extract the hour, minute, and second from a time column
  3. To reformat a time value from one format to another
  4. To cast a time column from StringType to TimestampType

Real-world use case scenarios for PySpark Timestamp SQL functions in Azure Databricks

  • Assume you have a pipeline that runs once a day, and for audit purposes you have to add a column recording the current time. The PySpark current_timestamp() function helps in creating this column.
  • Assume you have a pipeline that runs all day and fetches data once every hour. The dataset includes several columns plus an audit column at the end. For better optimization, you decide to partition the records by time. You already have an audit value, for example (1, "Berne", "2022-11-30 10:02:15.000"), and you want to create a new column by extracting the hour out of it. The PySpark hour() function helps in extracting this; similarly, there are minute() and second() functions. A combined sketch of these first two scenarios follows this list.
  • Assume you have a time column, for example "5:54:46.000 PM" in 12-hour format, but to keep date and time values consistent across all datasets, you need to reformat it. You can use the UNIX time formatting functions (unix_timestamp() and from_unixtime()) for this.
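A combined, hedged sketch of the first two scenarios; the column names audit_time and ingest_hour and the output path are illustrative placeholders:

from pyspark.sql.functions import current_timestamp, hour

# Add an audit column with the load time, then extract the hour for partitioning
audited_df = df \
.withColumn("audit_time", current_timestamp()) \
.withColumn("ingest_hour", hour("audit_time"))

audited_df.select("audit_time", "ingest_hour").show(3, truncate=False)

# Write partitioned by the extracted hour (the output path is a placeholder)
# audited_df.write.partitionBy("ingest_hour").mode("overwrite").parquet("<output_path>")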

The PySpark Timestamp SQL functions have been covered in the sections above with practical examples.

What are the alternatives of the SQL Timestamp Functions in PySpark Azure Databricks?

You can use PySpark User Defined Functions (UDFs) for handling times in a PySpark DataFrame. However, the PySpark built-in functions perform better than UDFs, are compile-time safe, and should be used instead of creating your own custom functions. Avoid custom UDFs wherever possible if the performance of your PySpark application is crucial, because their performance cannot be guaranteed. A comparative sketch is shown below.
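For illustration only, here is a hedged sketch of a hand-rolled UDF next to the equivalent built-in. The function name extract_hour_udf is hypothetical, and only the built-in hour() benefits from Catalyst optimization:

from pyspark.sql.functions import udf, hour
from pyspark.sql.types import IntegerType

# Hypothetical UDF: parse the hour out of an "H:mm:ss.SSS" string in plain Python
@udf(returnType=IntegerType())
def extract_hour_udf(time_str):
    return int(time_str.split(":")[0]) if time_str else None

# The built-in hour() returns the same values but runs inside the JVM
df.select(
    extract_hour_udf("time_2").alias("hour_udf"),
    hour("time_2").alias("hour_builtin")
).show(3)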

Final Thoughts

In this article, we have learned how to use the PySpark SQL Timestamp functions to extract, manipulate, and convert DataFrame columns in Azure Databricks, with clearly explained examples. I have also covered several possible real-world scenarios with practical examples. I hope the information provided helps you gain knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.