How to convert time of StringType into TimestampType in PySpark Azure Databricks?

Are you looking to find out how to convert date time column of string datatype to timestamp format in PySpark using Azure Databricks cloud or maybe you are looking for a solution, to format date time column of StringType to PySpark’s TimestampType format in PySpark Databricks using the to_timestamp() function? If you are looking for any of these problem solutions, you have landed on the correct page. I will also help you how to use PySpark to_timestamp() function with multiple examples in Azure Databricks. I will explain it by taking a practical example. So please don’t waste time let’s start with a step-by-step guide to understand how to use the to_timestamp() function in PySpark.

In this blog, I will teach you the following with practical examples:

  • Syntax of to_timestamp()
  • Convert String to PySpark Timestamp type
  • Format different time format of String type to PySpark Timestamp type
  • Using SQL expression

The Pyspark to_timestamp() function takes two arguments as input, a column and a time pattern.

Syntax:

to_timestamp()

What is the syntax of the to_timestamp() function in PySpark Azure Databricks?

The syntax is as follows:

to_timestamp(column_name, patter)
Parameter NameRequiredDescription
column_nameYesIt represents the column name whose datatype has to be changed.
patternYesIt represents the date time format patter of the column.
Table 1: to_timestamp() Method in PySpark Databricks Parameter list with Details

Apache Spark Official Documentation Link: to_timestamp()

Note: PySpark’s SQL Timestamp function supports both DataFrame and SQL work, very similar to traditional SQL. If you work with data extraction, transformation, and loading, you should have a good understanding of SQL Date functions.

Create a simple DataFrame

Let’s understand the use of the to_timestamp() function with a variety of examples. Let’s start by creating a DataFrame.

Gentle reminder:

In Databricks,

  • sparkSession made available as spark
  • sparkContext made available as sc

In case, you want to create it manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder 
    .master("local[*]") 
    .appName("azurelib.com") 
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

data = [
    ("11:58:12.000 PM","12:20:06.000","2022-06-28 18:51:38"),
    ("10:59:42.000 PM","12:26:04.000","2021-12-23 07:51:28"),
    ("5:54:46.000 PM","17:47:13.000","2022-04-24 16:54:36"),
    ("6:16:26.000 PM","2:49:42.000","2021-11-15 17:22:32"),
    ("12:28:07.000 PM","19:01:51.000","2022-08-23 07:18:19"),
    ("6:55:01.000 AM","22:09:54.000","2022-02-23 01:00:06"),
    ("12:50:06.000 PM","4:17:42.000","2022-04-02 17:34:48"),
    ("4:16:39.000 PM","19:30:41.000","2021-10-31 00:24:51"),
    ("10:05:03.000 PM","15:59:34.000","2022-08-06 23:37:36"),
    ("9:44:34.000 PM","4:14:07.000","2022-07-19 04:57:58")
]

columns = ["time_1","time_2","date_time"]
df = spark.createDataFrame(data, schema=columns)
df.printSchema()
df.show(5, truncate=False)

"""
root
 |-- time_1: string (nullable = true)
 |-- time_2: string (nullable = true)
 |-- date_time: string (nullable = true)

+---------------+------------+-------------------+
|time_1         |time_2      |date_time          |
+---------------+------------+-------------------+
|11:58:12.000 PM|12:20:06.000|2022-06-28 18:51:38|
|10:59:42.000 PM|12:26:04.000|2021-12-23 07:51:28|
|5:54:46.000 PM |17:47:13.000|2022-04-24 16:54:36|
|6:16:26.000 PM |2:49:42.000 |2021-11-15 17:22:32|
|12:28:07.000 PM|19:01:51.000|2022-08-23 07:18:19|
+---------------+------------+-------------------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace the file_path with the source file location which you have downloaded.

df_2 = spark.read.format("csv").option("header", True).load(file_path)
df_2.printSchema()

"""
root
 |-- time_1: string (nullable = true)
 |-- time_2: string (nullable = true)
 |-- date_time: string (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to convert date time of StringType into TimestampType in PySpark Azure Databricks?

Let’s see how to convert the date time column of StringType into TimestampType of PySpark DataFrame in Azure Databricks.

Before using to_timestamp():

df.select("date_time").printSchema()
df.select("date_time").show(2)

"""
Output:

root
 |-- date_time: string (nullable = true)

+-------------------+
|          date_time|
+-------------------+
|2022-06-28 18:51:38|
|2021-12-23 07:51:28|
+-------------------+


"""

After using to_timestamp():

from pyspark.sql.functions import to_timestamp

mod_df = df.withColumn("date_time", to_timestamp("date_time")).select("date_time")
mod_df.printSchema()
mod_df.show(2)

"""
Output:

root
 |-- date_time: timestamp (nullable = true)

+-------------------+
|          date_time|
+-------------------+
|2022-06-28 18:51:38|
|2021-12-23 07:51:28|
+-------------------+

"""

Here you can see the “date_time” column datatype has been converted into TimestampType.

How to convert other date time format of StringType into TimestampType in PySpark Azure Databricks?

Let’s see how to convert different DateTime format columns of StringType to TimestampType of PySpark DataFrame in Azure Databricks.

Note: Spark uses pattern letters for date and timestamp parsing and formatting.

Example 1:

In this example let’s try to convert a 12 hours time format of StringType column into PySpark’s Timestamp format.

from pyspark.sql.functions import to_timestamp

fmt_df = df.withColumn("fmt_time", to_timestamp("time_1", "h:mm:s.SSS a"))\
.select("time_1", "fmt_time")

fmt_df.printSchema()
fmt_df.show(2)

"""
Output:

root
 |-- time_1: string (nullable = true)
 |-- fmt_time: timestamp (nullable = true)

+---------------+-------------------+
|         time_1|           fmt_time|
+---------------+-------------------+
|11:58:12.000 PM|1970-01-01 23:58:12|
|10:59:42.000 PM|1970-01-01 22:59:42|
+---------------+-------------------+

"""

Example 2:

In this example let’s try to convert a 24 hours time format of StringType column into PySpark’s Timestamp format.

from pyspark.sql.functions import to_timestamp

fmt_df = df.withColumn("fmt_time", to_timestamp("time_2", "k:mm:s.SSS")) \
.select("time_2", "fmt_time")

fmt_df.printSchema()
fmt_df.show(2)

"""
Output:

root
 |-- time_2: string (nullable = true)
 |-- fmt_time: timestamp (nullable = true)

+------------+-------------------+
|      time_2|           fmt_time|
+------------+-------------------+
|12:20:06.000|1970-01-01 12:20:06|
|12:26:04.000|1970-01-01 12:26:04|
+------------+-------------------+

"""

How to convert date time of StringType into TimestampType in PySpark Azure Databricks using SQL expression?

Let’s see how to convert DateTime column of StringType to TimestampType of PySpark DataFrame in Azure Databricks using SQL expression.

Example:

Note: In order to use raw SQL format, we have to create a view for the DataFrame.

df.createOrReplaceTempView("time_table")

spark.sql('''
SELECT date_time, to_timestamp(date_time) AS fmt_date_time FROM time_table
''').printSchema()

"""
Output:

root
 |-- date_time: string (nullable = true)
 |-- fmt_date_time: timestamp (nullable = true)

"""

I have attached the complete code used in this blog in a notebook format to this GitHub link. You can download and import this notebook in databricks, jupyter notebook, etc.

When should you use the PySpark to_timestamp() in Azure Databricks?

You can use the pyspark to_timestamp() function to format any date time formats into PySpark TImestampType.

Real World Use Case Scenarios for PySpark DataFrame to_timestamp() in Azure Databricks?

What are the alternatives to the to_timestamp() function in PySpark Azure Databricks?

There are multiple alternatives to the date_format() function, which are as follows:

  • to_date(): used for converting PySpark DateTime format of String Type into DateType format and cast the column to convert it into StringType.

Final Thoughts

In this article, we have learned about the PySpark to_timestamp() method of DataFrame in Azure Databricks along with the examples explained clearly. I have also covered different scenarios with practical examples that could be possible. I hope the information that was provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.