How to use PySpark SQL Date Functions in Azure Databricks?

Are you looking to find out how to extract, manipulate, or convert PySpark’s DataFrame date columns in Azure Databricks cloud, or maybe you are looking for a solution, to manipulate the DataFrame date column’s value in PySpark Databricks using PySpark’s SQL methods? If you are looking for any of these problem solutions, you have landed on the correct page. I will also show you how to use PySpark SQL functions to extract, manipulate, or convert the date column values of DataFrames in Azure Databricks. I will explain it by taking a practical example. So don’t waste time let’s start with a step-by-step guide to understanding how to extract, manipulate, or convert date columns in PySpark DataFrame.

In this blog, I will teach you the following with practical examples:

  • Introduction to PySpark SQL Functions
  • Commonly used SQL Date functions
  • Improper Date conversion
  • Data type conversion

Create a simple DataFrame

Let’s understand the use of the SQL date functions with a variety of examples. Let’s start by creating a DataFrame.

Gentle reminder:

In Databricks,

  • sparkSession made available as spark
  • sparkContext made available as sc

In case, you want to create it manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder 
    .master("local[*]") 
    .appName("azurelib.com") 
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

data = [
    ("2019-12-05","2021-11-30","2022-06","2022-02-24 06:06:39"),
    ("2020-09-20","2022-07-13","2022-05","2022-06-18 04:59:25"),
    ("2021-04-09","2022-05-24","2022-10","2022-01-25 15:35:45"),
    ("2021-05-27","2022-06-05","2021-12","2022-05-11 14:49:42"),
    ("2019-06-30","2022-10-05","2022-01","2022-10-27 02:33:50")
]

columns = ["date_1","date_2","date_3","date_time"]
df = spark.createDataFrame(data, schema=columns)
df.printSchema()
df.show(truncate=False)

"""
root
 |-- date_1: string (nullable = true)
 |-- date_2: string (nullable = true)
 |-- date_3: string (nullable = true)
 |-- date_time: string (nullable = true)

+----------+----------+-------+-------------------+
|date_1    |date_2    |date_3 |date_time          |
+----------+----------+-------+-------------------+
|2019-12-05|2021-11-30|2022-06|2022-02-24 06:06:39|
|2020-09-20|2022-07-13|2022-05|2022-06-18 04:59:25|
|2021-04-09|2022-05-24|2022-10|2022-01-25 15:35:45|
|2021-05-27|2022-06-05|2021-12|2022-05-11 14:49:42|
|2019-06-30|2022-10-05|2022-01|2022-10-27 02:33:50|
+----------+----------+-------+-------------------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace the file_path with the source file location which you have downloaded.

df_2 = spark.read.format("csv").option("inferSchema", True).option("header", True).load("file_path")
df_2.printSchema()

"""
root
 |-- date_1: timestamp (nullable = true)
 |-- date_2: timestamp (nullable = true)
 |-- date_3: timestamp (nullable = true)
 |-- date_time: timestamp (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

PySpark’s SQL Date function supports both DataFrame and SQL work, very similar to traditional SQL. If you work with data extraction, transformation, and loading, you should have a good understanding of SQL Date functions.

What are the commonly used Date functions in PySpark Azure Databricks?

In PySpark, we have lots of functions for the date column. But, for the sake of simplicity, let me divide these columns into three types. They are:

  • Extract functions
  • Manipulate functions
  • Format functions

How to extract date from date columns in PySpark Azure Databricks?

In this section, I will explain to you all the date extract functions using practical examples.

# 1. Extract month, quarter, and year from date column

from pyspark.sql.functions import month, quarter, year

df \
.withColumn("month", month("date_1")) \
.withColumn("quarter", quarter("date_1")) \
.withColumn("year", year("date_1")) \
.select("date_1", "month", "quarter", "year") \
.show(5, truncate=False)

"""
Output:

+----------+-----+-------+----+
|date_1    |month|quarter|year|
+----------+-----+-------+----+
|2019-12-05|12   |4      |2019|
|2020-09-20|9    |3      |2020|
|2021-04-09|4    |2      |2021|
|2021-05-27|5    |2      |2021|
|2019-06-30|6    |2      |2019|
+----------+-----+-------+----+

"""
# 2. Extract dayofweek, dayofmonth, dayofyear, and weekofyear from date column

from pyspark.sql.functions import dayofweek, dayofmonth, dayofyear, weekofyear

df \
.withColumn("dayofweek", dayofweek("date_1")) \
.withColumn("dayofmonth", dayofmonth("date_1")) \
.withColumn("dayofyear", dayofyear("date_1")) \
.withColumn("weekofyear", weekofyear("date_1")) \
.select("date_1", "dayofweek", "dayofmonth", "dayofyear", "weekofyear") \
.show(5, truncate=False)

"""
Output:

+----------+---------+----------+---------+----------+
|date_1    |dayofweek|dayofmonth|dayofyear|weekofyear|
+----------+---------+----------+---------+----------+
|2019-12-05|5        |5         |339      |49        |
|2020-09-20|1        |20        |264      |38        |
|2021-04-09|6        |9         |99       |14        |
|2021-05-27|5        |27        |147      |21        |
|2019-06-30|1        |30        |181      |26        |
+----------+---------+----------+---------+----------+

"""

How to manipulate date column in PySpark Azure Databricks?

In this section, I will explain to you all the date manipulation functions using practical examples.

# 1. Get today's date

from pyspark.sql.functions import current_date

df \
.withColumn("today", current_date()) \
.select("date_1", "today") \
.show(5, truncate=False)

"""
Output:

+----------+----------+
|date_1    |today     |
+----------+----------+
|2019-12-05|2022-10-28|
|2020-09-20|2022-10-28|
|2021-04-09|2022-10-28|
|2021-05-27|2022-10-28|
|2019-06-30|2022-10-28|
+----------+----------+

"""
# 2. Get next_day(column, dayOfWeek)

from pyspark.sql.functions import next_day

df \
.withColumn("next_day", next_day("date_1", "Mon")) \
.select("date_1", "next_day") \
.show(5, truncate=False)

# where dayOfWeek is case sensitive: “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat”, “Sun”.

"""
Output:

+----------+----------+
|date_1    |next_day  |
+----------+----------+
|2019-12-05|2019-12-09|
|2020-09-20|2020-09-21|
|2021-04-09|2021-04-12|
|2021-05-27|2021-05-31|
|2019-06-30|2019-07-01|
+----------+----------+

"""
# 3. Get last_day

from pyspark.sql.functions import last_day

df \
.withColumn("last_day", last_day("date_1")) \
.select("date_1", "last_day") \
.show(5, truncate=False)

"""
Output:

+----------+----------+
|date_1    |last_day  |
+----------+----------+
|2019-12-05|2019-12-31|
|2020-09-20|2020-09-30|
|2021-04-09|2021-04-30|
|2021-05-27|2021-05-31|
|2019-06-30|2019-06-30|
+----------+----------+

"""
# 4. Add date, subtract date, and add month of date column

from pyspark.sql.functions import date_add, date_sub, add_months

df \
.withColumn("date_add", date_add("date_1", 1)) \
.withColumn("date_sub", date_sub("date_1", 1)) \
.withColumn("add_months", add_months("date_1", 1)) \
.select("date_1", "date_add", "date_sub", "add_months") \
.show(5, truncate=False)

"""
Output:

+----------+----------+----------+----------+
|date_1    |date_add  |date_sub  |add_months|
+----------+----------+----------+----------+
|2019-12-05|2019-12-06|2019-12-04|2020-01-05|
|2020-09-20|2020-09-21|2020-09-19|2020-10-20|
|2021-04-09|2021-04-10|2021-04-08|2021-05-09|
|2021-05-27|2021-05-28|2021-05-26|2021-06-27|
|2019-06-30|2019-07-01|2019-06-29|2019-07-30|
+----------+----------+----------+----------+

"""
# 5. Difference between two days

from pyspark.sql.functions import datediff, months_between, floor

df \
.withColumn("date_diff", datediff("date_2", "date_1")) \
.withColumn("months_between", floor(months_between("date_2", "date_1"))) \
.select("date_1", "date_diff", "months_between", "date_2") \
.show(5, truncate=False)

"""
Output:

+----------+---------+--------------+----------+
|date_1    |date_diff|months_between|date_2    |
+----------+---------+--------------+----------+
|2019-12-05|726      |23            |2021-11-30|
|2020-09-20|661      |21            |2022-07-13|
|2021-04-09|410      |13            |2022-05-24|
|2021-05-27|374      |12            |2022-06-05|
|2019-06-30|1193     |39            |2022-10-05|
+----------+---------+--------------+----------+

"""
# 6. Rounding date to month or year start

from pyspark.sql.functions import trunc

df \
.withColumn("trunc_month", trunc("date_1", "mon")) \
.withColumn("trunc_year", trunc("date_1", "year")) \
.select("date_1", "trunc_month", "trunc_year") \
.show(5, truncate=False)

"""
Output:

+----------+-----------+----------+
|date_1    |trunc_month|trunc_year|
+----------+-----------+----------+
|2019-12-05|2019-12-01 |2019-01-01|
|2020-09-20|2020-09-01 |2020-01-01|
|2021-04-09|2021-04-01 |2021-01-01|
|2021-05-27|2021-05-01 |2021-01-01|
|2019-06-30|2019-06-01 |2019-01-01|
+----------+-----------+----------+

"""

How to format date column in PySpark Azure Databricks?

In this section, I will explain to you all the date format functions using practical examples.

from pyspark.sql.functions import date_format

# 1. Converts proper date format(yyyy-MM-dd) to other formats(dd-MM-yy)
df \
.withColumn("formatted_date", date_format("date_1", "dd-MM-yy")) \
.select("date_1", "formatted_date").show(5, truncate=False)

"""
Output:

+----------+--------------+
|date_1    |formatted_date|
+----------+--------------+
|2019-12-05|05-12-19      |
|2020-09-20|20-09-20      |
|2021-04-09|09-04-21      |
|2021-05-27|27-05-21      |
|2019-06-30|30-06-19      |
+----------+--------------+

"""
from pyspark.sql.functions import unix_timestamp, from_unixtime

# 2. Convert date to unix format and conver unix to date format
# unix_timestamp -> converts date to unix time
# from_unixtime -> converts unixtime to date

df \
.withColumn("date_to_unix", unix_timestamp("date_1", "yyyy-MM-dd")) \
.withColumn("unix_to_date", from_unixtime("date_to_unix", "yyyy-MM-dd")) \
.select("date_1", "date_to_unix", "unix_to_date").show(5, truncate=False)

"""
Output:

+----------+------------+------------+
|date_1    |date_to_unix|unix_to_date|
+----------+------------+------------+
|2019-12-05|1575504000  |2019-12-05  |
|2020-09-20|1600560000  |2020-09-20  |
|2021-04-09|1617926400  |2021-04-09  |
|2021-05-27|1622073600  |2021-05-27  |
|2019-06-30|1561852800  |2019-06-30  |
+----------+------------+------------+

"""

Note: Spark uses pattern letters for date and timestamp parsing and formatting.

How to convert date column of StringType into DateType in PySpark Azure Databricks?

In this section, I will explain to you all the methods for converting a date column of StringType to DateType using practical examples.

from pyspark.sql.functions import to_date, col

# String
df.select("date_time").printSchema()

"""
Output:

root
 |-- date_time: string (nullable = true)

"""

# Various method to convert string to date format
# Method 1:
df \
.withColumn("date_time", to_date("date_time", "yyyy-MM-dd HH:mm:ss")) \
.select("date_time").printSchema()

# Method 2:
df.withColumn("date_time", col("date_time").cast("Date")).select("date_time").printSchema()

# Method 3:
df.selectExpr("CAST(date_time as DATE)").printSchema()

#The above three example gives the following output.

"""
Output:

root
 |-- date_time: date (nullable = true)

"""

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook in databricks, jupyter notebook, etc.

When should you use the PySpark SQL Date Functions in Azure Databricks?

These are some of the possible reasons:

  1. To get the current date
  2. To extract month, quarter, and year from a date column
  3. To extract the day of the week, day of the month, day of the year, week of the year
  4. To get the next day and last day of the month
  5. To add and subtract dates
  6. To add months
  7. To find the difference between the two dates
  8. For rounding date values
  9. To format date from one format to another format
  10. To cast the date column from StringType to DateType

Real World Use Case Scenarios for PySpark DataFrame SQL functions in Azure Databricks?

  • Assume that you have a pipeline running one time for each day. The dataset includes some columns and an audit column at the last. For better optimization, you decided to partition the records based on the specific time. You already have an audit column for example (1, “Berne”, “2022-11-30 10.02.15:000”) and you want to create a new column by extracting the date out of it. The PySpark Timestamp year() function helps in extracting this. Similarly, we have month() and quarter() functions too.
  • Assume that you have given a date column and you have manipulated it, maybe getting today’s date or the next day or last day of that month, adding or subtracting dates, adding months, rounding dates to month or year, or finding the difference between two dates.
  • Assume that you have a date time column in StringType for example “2022-06-18”, but you have to column type into DateTime format. You can use the to_date() function method for casting this column.

The PySpark Date SQL functions have been covered in the above section with practical examples.

What are the alternatives of the SQL Date Functions in PySpark Azure Databricks?

You can use the PySpark User Defined Functions (UDF) for handling dates in a PySpark DataFrame. But the PySpark in-built functions are better performing than PySpark UDF, compile-time safe, and should be used instead of creating your own custom functions (UDF). Avoid utilizing custom UDF at all costs if the performance of your PySpark application is crucial because they cannot be guaranteed to perform.

Final Thoughts

In this article, we have learned about the PySpark SQL Date methods to extract, manipulate, and convert the columns of DataFrame in Azure Databricks along with the examples explained clearly. I have also covered different scenarios with practical examples that could be possible. I hope the information that was provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.