Are you looking to extract, manipulate, or convert PySpark DataFrame date columns in Azure Databricks, or for a way to change a DataFrame date column’s values using PySpark’s SQL methods? If so, you have landed on the correct page. I will show you how to use PySpark SQL functions to extract, manipulate, and convert the date column values of DataFrames in Azure Databricks, explaining each with a practical example. So let’s not waste time and start with a step-by-step guide to extracting, manipulating, and converting date columns in a PySpark DataFrame.
In this blog, I will teach you the following with practical examples:
- Introduction to PySpark SQL Functions
- Commonly used SQL Date functions
- Improper Date conversion
- Data type conversion
Contents
- 1 Create a simple DataFrame
- 2 What are the commonly used Date functions in PySpark Azure Databricks?
- 3 How to extract date parts from date columns in PySpark Azure Databricks?
- 4 How to manipulate date columns in PySpark Azure Databricks?
- 5 How to format date columns in PySpark Azure Databricks?
- 6 How to convert a date column of StringType into DateType in PySpark Azure Databricks?
- 7 When should you use the PySpark SQL Date Functions in Azure Databricks?
- 8 Real World Use Case Scenarios for PySpark DataFrame SQL functions in Azure Databricks
- 9 What are the alternatives to the SQL Date Functions in PySpark Azure Databricks?
- 10 Final Thoughts
Create a simple DataFrame
Let’s understand the use of the SQL date functions with a variety of examples. Let’s start by creating a DataFrame.
Gentle reminder:
In Databricks,
- the SparkSession is made available as spark
- the SparkContext is made available as sc
If you want to create them manually, use the code below.
from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext
a) Creating a PySpark DataFrame manually
data = [
("2019-12-05","2021-11-30","2022-06","2022-02-24 06:06:39"),
("2020-09-20","2022-07-13","2022-05","2022-06-18 04:59:25"),
("2021-04-09","2022-05-24","2022-10","2022-01-25 15:35:45"),
("2021-05-27","2022-06-05","2021-12","2022-05-11 14:49:42"),
("2019-06-30","2022-10-05","2022-01","2022-10-27 02:33:50")
]
columns = ["date_1","date_2","date_3","date_time"]
df = spark.createDataFrame(data, schema=columns)
df.printSchema()
df.show(truncate=False)
"""
root
|-- date_1: string (nullable = true)
|-- date_2: string (nullable = true)
|-- date_3: string (nullable = true)
|-- date_time: string (nullable = true)
+----------+----------+-------+-------------------+
|date_1 |date_2 |date_3 |date_time |
+----------+----------+-------+-------------------+
|2019-12-05|2021-11-30|2022-06|2022-02-24 06:06:39|
|2020-09-20|2022-07-13|2022-05|2022-06-18 04:59:25|
|2021-04-09|2022-05-24|2022-10|2022-01-25 15:35:45|
|2021-05-27|2022-06-05|2021-12|2022-05-11 14:49:42|
|2019-06-30|2022-10-05|2022-01|2022-10-27 02:33:50|
+----------+----------+-------+-------------------+
"""
b) Creating a DataFrame by reading files
Download and use the below source file.
# replace the file_path with the source file location which you have downloaded.
df_2 = spark.read.format("csv").option("inferSchema", True).option("header", True).load("file_path")
df_2.printSchema()
"""
root
|-- date_1: timestamp (nullable = true)
|-- date_2: timestamp (nullable = true)
|-- date_3: timestamp (nullable = true)
|-- date_time: timestamp (nullable = true)
"""
Note: Here, I will be using the manually created DataFrame.
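By the way, if you want the file-based read to match the manually created DataFrame (all columns as strings) rather than inferring timestamps, you can pass an explicit schema. A minimal sketch; the string_schema and df_3 names are my own additions:
from pyspark.sql.types import StructType, StructField, StringType

string_schema = StructType([
    StructField("date_1", StringType(), True),
    StructField("date_2", StringType(), True),
    StructField("date_3", StringType(), True),
    StructField("date_time", StringType(), True)
])

# replace the file_path with the source file location which you have downloaded.
df_3 = spark.read.format("csv").option("header", True).schema(string_schema).load("file_path")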
PySpark’s SQL date functions work with both the DataFrame API and SQL queries, much like traditional SQL. If you work in data extraction, transformation, and loading, you should have a good understanding of the SQL date functions.
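For instance, the same extraction can be written with the DataFrame API or as a SQL query over a temporary view. A minimal sketch using the df created above; the view name "dates" is my own:
from pyspark.sql.functions import year

# DataFrame API
df.select(year("date_1").alias("year")).show(1)

# The same logic through a temporary view and a SQL query
df.createOrReplaceTempView("dates")
spark.sql("SELECT year(date_1) AS year FROM dates").show(1)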
What are the commonly used Date functions in PySpark Azure Databricks?
In PySpark, we have lots of functions for date columns. But, for the sake of simplicity, let me divide these functions into three types. They are:
- Extract functions
- Manipulate functions
- Format functions
How to extract date parts from date columns in PySpark Azure Databricks?
In this section, I will walk you through all of the date extraction functions with practical examples.
# 1. Extract month, quarter, and year from date column
from pyspark.sql.functions import month, quarter, year
df \
.withColumn("month", month("date_1")) \
.withColumn("quarter", quarter("date_1")) \
.withColumn("year", year("date_1")) \
.select("date_1", "month", "quarter", "year") \
.show(5, truncate=False)
"""
Output:
+----------+-----+-------+----+
|date_1 |month|quarter|year|
+----------+-----+-------+----+
|2019-12-05|12 |4 |2019|
|2020-09-20|9 |3 |2020|
|2021-04-09|4 |2 |2021|
|2021-05-27|5 |2 |2021|
|2019-06-30|6 |2 |2019|
+----------+-----+-------+----+
"""
# 2. Extract dayofweek, dayofmonth, dayofyear, and weekofyear from date column
from pyspark.sql.functions import dayofweek, dayofmonth, dayofyear, weekofyear
df \
.withColumn("dayofweek", dayofweek("date_1")) \
.withColumn("dayofmonth", dayofmonth("date_1")) \
.withColumn("dayofyear", dayofyear("date_1")) \
.withColumn("weekofyear", weekofyear("date_1")) \
.select("date_1", "dayofweek", "dayofmonth", "dayofyear", "weekofyear") \
.show(5, truncate=False)
"""
Output:
+----------+---------+----------+---------+----------+
|date_1 |dayofweek|dayofmonth|dayofyear|weekofyear|
+----------+---------+----------+---------+----------+
|2019-12-05|5 |5 |339 |49 |
|2020-09-20|1 |20 |264 |38 |
|2021-04-09|6 |9 |99 |14 |
|2021-05-27|5 |27 |147 |21 |
|2019-06-30|1 |30 |181 |26 |
+----------+---------+----------+---------+----------+
"""
How to manipulate date columns in PySpark Azure Databricks?
In this section, I will walk you through all of the date manipulation functions with practical examples.
# 1. Get today's date
from pyspark.sql.functions import current_date
df \
.withColumn("today", current_date()) \
.select("date_1", "today") \
.show(5, truncate=False)
"""
Output:
+----------+----------+
|date_1 |today |
+----------+----------+
|2019-12-05|2022-10-28|
|2020-09-20|2022-10-28|
|2021-04-09|2022-10-28|
|2021-05-27|2022-10-28|
|2019-06-30|2022-10-28|
+----------+----------+
"""
# 2. Get next_day(column, dayOfWeek)
from pyspark.sql.functions import next_day
df \
.withColumn("next_day", next_day("date_1", "Mon")) \
.select("date_1", "next_day") \
.show(5, truncate=False)
# The dayOfWeek argument accepts: “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat”, “Sun” (case-insensitive in recent Spark versions).
"""
Output:
+----------+----------+
|date_1 |next_day |
+----------+----------+
|2019-12-05|2019-12-09|
|2020-09-20|2020-09-21|
|2021-04-09|2021-04-12|
|2021-05-27|2021-05-31|
|2019-06-30|2019-07-01|
+----------+----------+
"""
# 3. Get last_day
from pyspark.sql.functions import last_day
df \
.withColumn("last_day", last_day("date_1")) \
.select("date_1", "last_day") \
.show(5, truncate=False)
"""
Output:
+----------+----------+
|date_1 |last_day |
+----------+----------+
|2019-12-05|2019-12-31|
|2020-09-20|2020-09-30|
|2021-04-09|2021-04-30|
|2021-05-27|2021-05-31|
|2019-06-30|2019-06-30|
+----------+----------+
"""
# 4. Add days, subtract days, and add months to a date column
from pyspark.sql.functions import date_add, date_sub, add_months
df \
.withColumn("date_add", date_add("date_1", 1)) \
.withColumn("date_sub", date_sub("date_1", 1)) \
.withColumn("add_months", add_months("date_1", 1)) \
.select("date_1", "date_add", "date_sub", "add_months") \
.show(5, truncate=False)
"""
Output:
+----------+----------+----------+----------+
|date_1 |date_add |date_sub |add_months|
+----------+----------+----------+----------+
|2019-12-05|2019-12-06|2019-12-04|2020-01-05|
|2020-09-20|2020-09-21|2020-09-19|2020-10-20|
|2021-04-09|2021-04-10|2021-04-08|2021-05-09|
|2021-05-27|2021-05-28|2021-05-26|2021-06-27|
|2019-06-30|2019-07-01|2019-06-29|2019-07-30|
+----------+----------+----------+----------+
"""
# 5. Difference between two dates
from pyspark.sql.functions import datediff, months_between, floor
df \
.withColumn("date_diff", datediff("date_2", "date_1")) \
.withColumn("months_between", floor(months_between("date_2", "date_1"))) \
.select("date_1", "date_diff", "months_between", "date_2") \
.show(5, truncate=False)
"""
Output:
+----------+---------+--------------+----------+
|date_1 |date_diff|months_between|date_2 |
+----------+---------+--------------+----------+
|2019-12-05|726 |23 |2021-11-30|
|2020-09-20|661 |21 |2022-07-13|
|2021-04-09|410 |13 |2022-05-24|
|2021-05-27|374 |12 |2022-06-05|
|2019-06-30|1193 |39 |2022-10-05|
+----------+---------+--------------+----------+
"""
# 6. Rounding date to month or year start
from pyspark.sql.functions import trunc
df \
.withColumn("trunc_month", trunc("date_1", "mon")) \
.withColumn("trunc_year", trunc("date_1", "year")) \
.select("date_1", "trunc_month", "trunc_year") \
.show(5, truncate=False)
"""
Output:
+----------+-----------+----------+
|date_1 |trunc_month|trunc_year|
+----------+-----------+----------+
|2019-12-05|2019-12-01 |2019-01-01|
|2020-09-20|2020-09-01 |2020-01-01|
|2021-04-09|2021-04-01 |2021-01-01|
|2021-05-27|2021-05-01 |2021-01-01|
|2019-06-30|2019-06-01 |2019-01-01|
+----------+-----------+----------+
"""
How to format date columns in PySpark Azure Databricks?
In this section, I will walk you through all of the date formatting functions with practical examples.
from pyspark.sql.functions import date_format
# 1. Convert the default date format (yyyy-MM-dd) to another format (dd-MM-yy)
df \
.withColumn("formatted_date", date_format("date_1", "dd-MM-yy")) \
.select("date_1", "formatted_date").show(5, truncate=False)
"""
Output:
+----------+--------------+
|date_1 |formatted_date|
+----------+--------------+
|2019-12-05|05-12-19 |
|2020-09-20|20-09-20 |
|2021-04-09|09-04-21 |
|2021-05-27|27-05-21 |
|2019-06-30|30-06-19 |
+----------+--------------+
"""
from pyspark.sql.functions import unix_timestamp, from_unixtime
# 2. Convert a date to unix time and convert unix time back to a date
# unix_timestamp -> converts date to unix time
# from_unixtime -> converts unixtime to date
df \
.withColumn("date_to_unix", unix_timestamp("date_1", "yyyy-MM-dd")) \
.withColumn("unix_to_date", from_unixtime("date_to_unix", "yyyy-MM-dd")) \
.select("date_1", "date_to_unix", "unix_to_date").show(5, truncate=False)
"""
Output:
+----------+------------+------------+
|date_1 |date_to_unix|unix_to_date|
+----------+------------+------------+
|2019-12-05|1575504000 |2019-12-05 |
|2020-09-20|1600560000 |2020-09-20 |
|2021-04-09|1617926400 |2021-04-09 |
|2021-05-27|1622073600 |2021-05-27 |
|2019-06-30|1561852800 |2019-06-30 |
+----------+------------+------------+
"""
Note: Spark uses a defined set of pattern letters for date and timestamp parsing and formatting; see Spark’s datetime pattern documentation for the full list.
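For example, yyyy is the year, MM the month number, MMM the abbreviated month name, dd the day of the month, and EEEE the full day name. A short sketch combining a couple of these patterns:
from pyspark.sql.functions import date_format

df \
    .withColumn("month_name", date_format("date_1", "MMM yyyy")) \
    .withColumn("day_name", date_format("date_1", "EEEE")) \
    .select("date_1", "month_name", "day_name") \
    .show(5, truncate=False)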
How to convert a date column of StringType into DateType in PySpark Azure Databricks?
In this section, I will walk you through all of the methods for converting a date column of StringType to DateType with practical examples.
from pyspark.sql.functions import to_date, col
# String
df.select("date_time").printSchema()
"""
Output:
root
|-- date_time: string (nullable = true)
"""
# Various methods to convert a string to date format
# Method 1:
df \
.withColumn("date_time", to_date("date_time", "yyyy-MM-dd HH:mm:ss")) \
.select("date_time").printSchema()
# Method 2:
df.withColumn("date_time", col("date_time").cast("Date")).select("date_time").printSchema()
# Method 3:
df.selectExpr("CAST(date_time as DATE)").printSchema()
# All three methods above give the following output.
"""
Output:
root
|-- date_time: date (nullable = true)
"""
I have attached the complete code used in this blog in notebook format at this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.
When should you use the PySpark SQL Date Functions in Azure Databricks?
These are some of the possible reasons:
- To get the current date
- To extract month, quarter, and year from a date column
- To extract the day of the week, day of the month, day of the year, week of the year
- To get the next day and last day of the month
- To add and subtract dates
- To add months
- To find the difference between two dates
- For rounding date values
- To format date from one format to another format
- To cast the date column from StringType to DateType
Real World Use Case Scenarios for PySpark DataFrame SQL functions in Azure Databricks
- Assume that you have a pipeline that runs once each day. The dataset includes some columns plus an audit column at the end. For better optimization, you decide to partition the records by a time unit. You already have an audit column, for example (1, “Berne”, “2022-11-30 10.02.15:000”), and you want to create a new column by extracting the year out of it. The PySpark year() function helps with this, and similarly we have month() and quarter() functions too (see the sketch after this list).
- Assume that you are given a date column that you need to manipulate: getting today’s date, the next day, or the last day of that month; adding or subtracting days; adding months; rounding dates to the start of the month or year; or finding the difference between two dates.
- Assume that you have a date-time column in StringType, for example “2022-06-18”, but you need to convert the column into DateType. You can use the to_date() function for casting this column.
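Here is a rough sketch of the first scenario; the events_df input, its audit_ts column (assumed to hold standard yyyy-MM-dd HH:mm:ss strings), and the output path are all hypothetical:
from pyspark.sql.functions import to_timestamp, year, month

# hypothetical input: events_df with an audit_ts string column
partitioned = events_df \
    .withColumn("audit_ts", to_timestamp("audit_ts", "yyyy-MM-dd HH:mm:ss")) \
    .withColumn("year", year("audit_ts")) \
    .withColumn("month", month("audit_ts"))

# hypothetical output path; records land in year=/month= folders
partitioned.write.partitionBy("year", "month").mode("overwrite").parquet("/mnt/output/events")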
The PySpark SQL Date functions have been covered in the sections above with practical examples.
What are the alternatives to the SQL Date Functions in PySpark Azure Databricks?
You can use PySpark User Defined Functions (UDFs) for handling dates in a PySpark DataFrame. However, PySpark’s built-in functions perform better than UDFs and are compile-time safe, so use them instead of writing your own. If the performance of your PySpark application is crucial, avoid custom UDFs at all costs: Spark cannot optimize them, so their performance cannot be guaranteed.
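To illustrate, here is a hedged sketch of the same string-to-date conversion done with a UDF versus the built-in to_date(). Both produce DateType, but the UDF (parse_date below is my own example, not from the original post) pushes every row through Python, while the built-in stays inside Spark’s optimizer:
from datetime import datetime
from pyspark.sql.functions import udf, to_date
from pyspark.sql.types import DateType

# UDF version: runs row by row in Python and cannot be optimized by Catalyst
@udf(returnType=DateType())
def parse_date(s):
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S").date() if s else None

df.withColumn("via_udf", parse_date("date_time")).select("via_udf").show(2)

# Built-in version: preferred, optimized by Spark
df.withColumn("via_builtin", to_date("date_time", "yyyy-MM-dd HH:mm:ss")).select("via_builtin").show(2)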
Final Thoughts
In this article, we have learned about the PySpark SQL Date methods to extract, manipulate, and convert the date columns of a DataFrame in Azure Databricks, along with clearly explained examples. I have also covered different possible scenarios with practical examples. I hope the information provided helped you gain knowledge.
Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.