How to work with UNIX Time in PySpark Azure Databricks?

Are you looking to convert a timestamp to UNIX time format in PySpark on Azure Databricks, or the other way around, from UNIX time back to a timestamp? If you are looking for either of these solutions, you have landed on the correct page. I will also show you how to work with different time formats in PySpark, each explained with a practical example. So without wasting time, let’s start with a step-by-step guide to working with UNIX time in PySpark.

In this blog, I will teach you the following with practical examples:

  • Syntax of unix_timestamp()
  • What is Unix Time?
  • Current time in UNIX format
  • Convert date time to UNIX time
  • Syntax of from_unixtime() function
  • Convert UNIX time to date time

The unix_timestamp() method is used to convert a time string with the specified pattern into a Unix timestamp (in seconds).

What is the syntax of the unix_timestamp() function in PySpark Azure Databricks?

The syntax is as follows:

unix_timestamp(date_time_column, pattern)

| Parameter Name                 | Required | Description |
|--------------------------------|----------|-------------|
| date_time_column (str, Column) | Optional | It represents the date-time column. Defaults to the current time. |
| pattern (str)                  | Optional | It represents the date-time format pattern of the date-time column. Defaults to "yyyy-MM-dd HH:mm:ss". |

Table 1: unix_timestamp() Method in PySpark Databricks Parameter list with Details

Apache Spark Official documentation link: unix_timestamp()

Important: PySpark’s timestamp functions can be used both through the DataFrame API and through SQL expressions, very much like traditional SQL. If you work with data extraction, transformation, and loading (ETL), you should have a good understanding of the UNIX time functions.

Create a simple DataFrame

Let’s understand how to work with UNIX time with a variety of examples. Let’s start by creating a DataFrame.

Gentle reminder:

In Databricks,

  • the SparkSession is made available as spark
  • the SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Creating a PySpark DataFrame manually

data = [
    ("2022-08-08","2022-04-17 17:16:20","19-04-2022 23:02:32"),
    ("2022-04-29","2022-11-07 04:03:11","27-07-2022 18:09:39"),
    ("2022-08-22","2022-02-07 09:15:31","08-11-2022 09:58:34"),
    ("2021-12-28","2022-02-28 02:47:25","03-01-2022 01:59:22"),
    ("2022-02-13","2022-05-22 11:25:29","25-02-2022 04:46:47")
]

df = spark.createDataFrame(data, schema=["date","date_time_1","date_time_2"])
df.printSchema()
df.show(5, truncate=False)

"""
root
 |-- date: string (nullable = true)
 |-- date_time_1: string (nullable = true)
 |-- date_time_2: string (nullable = true)

+----------+-------------------+-------------------+
|date      |date_time_1        |date_time_2        |
+----------+-------------------+-------------------+
|2022-08-08|2022-04-17 17:16:20|19-04-2022 23:02:32|
|2022-04-29|2022-11-07 04:03:11|27-07-2022 18:09:39|
|2022-08-22|2022-02-07 09:15:31|08-11-2022 09:58:34|
|2021-12-28|2022-02-28 02:47:25|03-01-2022 01:59:22|
|2022-02-13|2022-05-22 11:25:29|25-02-2022 04:46:47|
+----------+-------------------+-------------------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# Replace file_path with the location of the source file you downloaded.

df_2 = spark.read.format("csv").option("inferSchema", True).option("header", True).load(file_path)
df_2.printSchema()

"""
root
 |-- date: string (nullable = true)
 |-- date_time_1: string (nullable = true)
 |-- date_time_2: string (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

What is UNIX Timestamp?

Unix time, also referred to as Epoch time, is the number of seconds elapsed since “1970-01-01 00:00:00 UTC.”
Epoch time, which is frequently used in Unix operating systems, counts whole seconds and cannot represent fractions of a second (expressed as “.SSS”). Therefore, we haven’t used fractions of seconds in our examples.
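To make the definition concrete, here is a minimal pure-Python sketch (no Spark required) that computes the Unix time of the first date_time_1 value from our DataFrame, assuming UTC, which is the timezone the Spark outputs later in this article reflect:

from datetime import datetime, timezone

# Unix time = whole seconds elapsed since 1970-01-01 00:00:00 UTC
dt = datetime(2022, 4, 17, 17, 16, 20, tzinfo=timezone.utc)
print(int(dt.timestamp()))  # 1650215780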

How to get the current time in UNIX time format in PySpark Azure Databricks?

In this section, let’s try to get the current time in UNIX time format using the unix_timestamp() function in Azure Databricks, with a practical example.

Example:

from pyspark.sql.functions import unix_timestamp

df.select(unix_timestamp().alias("curr_unix_time")).show(1)

"""
Output:

+--------------+
|curr_unix_time|
+--------------+
|    1668503165|
+--------------+

"""

How to convert date time to UNIX time in PySpark Azure Databricks?

In this section, let’s try to convert date time to UNIX time using the unix_timestamp() function in PySpark Azure Databricks with a practical example.

Example:

In this example, we will try to convert the following date-time formats:

  • From proper time format to UNIX time
  • From improper time format to UNIX time
  • From date to UNIX time

from pyspark.sql.functions import unix_timestamp

df2 = df.select(
    unix_timestamp("date_time_1").alias("proper->unix"),
    unix_timestamp("date_time_2", "dd-MM-yyyy HH:mm:ss").alias("improper->unix"),
    unix_timestamp("date", "yyyy-MM-dd").alias("date_only->unix"))

df2.printSchema()
df2.show()

"""
Output:

root
 |-- proper->unix: long (nullable = true)
 |-- improper->unix: long (nullable = true)
 |-- date_only->unix: long (nullable = true)

+------------+--------------+---------------+
|proper->unix|improper->unix|date_only->unix|
+------------+--------------+---------------+
|  1650215780|    1650409352|     1659916800|
|  1667793791|    1658945379|     1651190400|
|  1644225331|    1667901514|     1661126400|
|  1646016445|    1641175162|     1640649600|
|  1653218729|    1645764407|     1644710400|
+------------+--------------+---------------+

"""

In the above example, we have converted date-times in different formats to UNIX time using the unix_timestamp() function. Note that the pattern you pass must match the input string exactly.
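As a cautionary side note, here is a small sketch (an illustration under assumed default settings, not part of the original recipe) of what happens when the pattern does not match the column’s actual format:

from pyspark.sql.functions import unix_timestamp

# date_time_2 holds dd-MM-yyyy strings, so a yyyy-MM-dd pattern cannot
# parse them; depending on your Spark version and the
# spark.sql.legacy.timeParserPolicy setting, the mismatched values either
# come back as null or Spark raises a parsing error.
df.select(
    unix_timestamp("date_time_2", "yyyy-MM-dd HH:mm:ss").alias("mismatch->unix")
).show(1)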

How to convert UNIX time to date time format in PySpark Azure Databricks?

In this section, let’s try to convert UNIX time to date time using the from_unixtime() function in PySpark Azure Databricks with a practical example.

Example:

from pyspark.sql.functions import from_unixtime

df3 = df2.select(
    from_unixtime("proper->unix", "yyyy-MM-dd").alias("yyyy-MM-dd"),
    from_unixtime("improper->unix", "dd-MM-yyyy").alias("dd-MM-yyyy"),
    from_unixtime("date_only->unix", "dd/MM/yyyy").alias("dd/MM/yyyy")
)

df3.printSchema()
df3.show()

"""
Output:

root
 |-- yyyy-MM-dd: string (nullable = true)
 |-- dd-MM-yyyy: string (nullable = true)
 |-- dd/MM/yyyy: string (nullable = true)

+----------+----------+----------+
|yyyy-MM-dd|dd-MM-yyyy|dd/MM/yyyy|
+----------+----------+----------+
|2022-04-17|19-04-2022|08/08/2022|
|2022-11-07|27-07-2022|29/04/2022|
|2022-02-07|08-11-2022|22/08/2022|
|2022-02-28|03-01-2022|28/12/2021|
|2022-05-22|25-02-2022|13/02/2022|
+----------+----------+----------+

"""

In the above example, we have converted UNIX time to date-times in different formats using the from_unixtime() function. Note from the schema above that from_unixtime() returns string columns, not timestamp columns.
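If you need a true timestamp column rather than a formatted string, one option (a minimal sketch, one of several possible approaches) is to cast the epoch seconds directly, since Spark interprets a numeric-to-timestamp cast as seconds since the epoch:

from pyspark.sql.functions import col

# Casting epoch seconds (long) to timestamp yields a real timestamp column,
# so the resulting schema shows as_timestamp as timestamp instead of string
df2.select(col("proper->unix").cast("timestamp").alias("as_timestamp")).printSchema()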

How to work with UNIX time in PySpark Azure Databricks using SQL expression?

Let’s see how to work with UNIX time in PySpark Azure Databricks with some practical examples. In order to use a raw SQL expression, we have to create a view of our PySpark DataFrame.

Example 1:

df.createOrReplaceTempView("date_time")

spark.sql('''
SELECT 
    UNIX_TIMESTAMP(date_time_1) AS `proper->unix`,
    UNIX_TIMESTAMP(date_time_2, "dd-MM-yyyy HH:mm:ss") AS `improper->unix`,
    UNIX_TIMESTAMP(date, "yyyy-MM-dd") AS `date_only->unix`
FROM date_time
''').show()

"""
Output:

+------------+--------------+---------------+
|proper->unix|improper->unix|date_only->unix|
+------------+--------------+---------------+
|  1650215780|    1650409352|     1659916800|
|  1667793791|    1658945379|     1651190400|
|  1644225331|    1667901514|     1661126400|
|  1646016445|    1641175162|     1640649600|
|  1653218729|    1645764407|     1644710400|
+------------+--------------+---------------+

"""

Example 2:

df2.createOrReplaceTempView("unix_time")

spark.sql('''
SELECT 
    FROM_UNIXTIME(`proper->unix`, "yyyy-MM-dd") AS `yyyy-MM-dd`,
    FROM_UNIXTIME(`improper->unix`, "dd-MM-yyyy") AS `dd-MM-yyyy`,
    FROM_UNIXTIME(`date_only->unix`, "dd/MM/yyyy") AS `dd/MM/yyyy`
FROM unix_time
''').show()

"""
Output:

+----------+----------+----------+
|yyyy-MM-dd|dd-MM-yyyy|dd/MM/yyyy|
+----------+----------+----------+
|2022-04-17|19-04-2022|08/08/2022|
|2022-11-07|27-07-2022|29/04/2022|
|2022-02-07|08-11-2022|22/08/2022|
|2022-02-28|03-01-2022|28/12/2021|
|2022-05-22|25-02-2022|13/02/2022|
+----------+----------+----------+

"""

I have attached the complete code used in this blog in notebook format at this GitHub link. You can download the notebook and import it into Databricks, Jupyter Notebook, etc.

When should you convert date-time to UNIX time format in PySpark Azure Databricks?

These could be the possible reasons:

  1. For getting the current date-time in UNIX format
  2. For finding time differences in seconds, hours, or days (see the sketch after this list)
  3. For converting other date-time formats into PySpark’s default date-time format
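For the second reason, here is a minimal sketch against our sample DataFrame; diff_seconds and diff_hours are illustrative column names, not built-ins:

from pyspark.sql.functions import unix_timestamp, col

# Epoch seconds subtract cleanly, so time differences reduce to arithmetic
diff_df = (
    df.withColumn(
        "diff_seconds",
        unix_timestamp("date_time_1") - unix_timestamp("date_time_2", "dd-MM-yyyy HH:mm:ss")
    )
    .withColumn("diff_hours", col("diff_seconds") / 3600)
)

diff_df.select("diff_seconds", "diff_hours").show(5)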

Real World Use Case Scenarios for converting date-time to UNIX time format in PySpark Azure Databricks

  • Assume that you have a dataset with dates and times in lots of different formats, for example a date in the format dd-MM-yyyy that has to be converted into PySpark’s default date-time format. You can use the PySpark UNIX functions for this conversion, as shown below.
  • Whenever you want to find the time difference between two date-time columns, you can use PySpark’s UNIX functions to find the difference in seconds, hours, or days.
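As a sketch of the first scenario, a round trip through unix_timestamp() and from_unixtime() normalizes a dd-MM-yyyy HH:mm:ss string into Spark’s default yyyy-MM-dd HH:mm:ss format:

from pyspark.sql.functions import unix_timestamp, from_unixtime

# Parse the non-default format into epoch seconds, then format it back out
# using from_unixtime()'s default pattern (yyyy-MM-dd HH:mm:ss)
df.select(
    "date_time_2",
    from_unixtime(unix_timestamp("date_time_2", "dd-MM-yyyy HH:mm:ss")).alias("normalized")
).show(5, truncate=False)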

What are the alternatives to the UNIX time functions for converting date-times in PySpark Azure Databricks?

There are multiple alternatives for converting date-time strings without going through UNIX time, which are as follows (a short sketch follows the list):

  • to_date(): used for converting a string column into a date column
  • to_timestamp(): used for converting a string column into a timestamp column
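Here is a minimal sketch of both alternatives against our sample DataFrame; unlike the UNIX round trip, these return native date and timestamp columns directly:

from pyspark.sql.functions import to_date, to_timestamp

# to_date() and to_timestamp() parse strings straight into native
# date and timestamp types, with no epoch-seconds detour needed
df.select(
    to_date("date", "yyyy-MM-dd").alias("as_date"),
    to_timestamp("date_time_2", "dd-MM-yyyy HH:mm:ss").alias("as_timestamp")
).printSchema()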

Final Thoughts

In this article, we have learned how to work with UNIX time in PySpark Azure Databricks, with clearly explained examples. I have also covered the different scenarios in which these functions are useful, again with practical examples. I hope the information provided here helps you in your work.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

Arud Seka Berne S

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.