How to format dates in PySpark Azure Databricks?

Do you need to convert a date or timestamp column into a formatted string in PySpark on Azure Databricks? If so, you have landed on the correct page. In this guide, I will show you how to use the PySpark date_format() function with multiple practical examples in Azure Databricks. So let’s start with a step-by-step guide to understand how to use the date_format() function in PySpark.

In this blog, I will teach you the following with practical examples:

  • Syntax of date_format()
  • Convert Date column to string using DataFrame
  • Convert Date column to string using SQL expression
  • Check what happens when we pass dates in a different format

The PySpark date_format() function converts a date, a timestamp, or a string in PySpark datetime format into a string value, formatted according to the pattern given as the second parameter.

Syntax:

date_format()

What is the syntax of the date_format() function in PySpark Azure Databricks?

The syntax is as follows:

date_format(date, format)
| Parameter Name              | Required | Description                                        |
|-----------------------------|----------|----------------------------------------------------|
| date (Date, Timestamp, str) | Yes      | It represents the date or timestamp column name.   |
| format                      | Yes      | It represents the pattern of the date or timestamp. |

Table 1: Parameters of the date_format() function in PySpark Databricks

Apache Spark Official Documentation Link: date_format()

Note: PySpark’s date and time functions can be used both through the DataFrame API and in SQL expressions, and they behave much like the date functions in traditional SQL. If you work on data extraction, transformation, and loading (ETL), a good understanding of SQL date functions will help.

Create a simple DataFrame

Let’s understand the use of the date_format() function with various examples. Let’s start by creating a DataFrame.

Gentle reminder:

In Databricks,

  • SparkSession is made available as spark
  • SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

# Parentheses let the builder chain span several lines
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("azurelib.com")
    .getOrCreate()
)

sc = spark.sparkContext

a) Creating a PySpark DataFrame manually

data = [
    ("2022-08-02","29-01-2022","2022-04-03 21:08:37"),
    ("2022-03-28","13-10-2022","2022-10-07 00:07:36"),
    ("2022-02-17","11-11-2021","2022-05-06 13:06:57"),
    ("2022-09-08","07-10-2022","2022-08-26 17:09:36"),
    ("2022-09-04","22-07-2022","2022-09-23 22:58:26")
]

columns = ["pyspark_date","other_date","pyspark_timestamp"]
df = spark.createDataFrame(data, schema=columns)
df.printSchema()
df.show(truncate=False)

"""
root
 |-- pyspark_date: string (nullable = true)
 |-- other_date: string (nullable = true)
 |-- pyspark_timestamp: string (nullable = true)

+------------+----------+-------------------+
|pyspark_date|other_date|pyspark_timestamp  |
+------------+----------+-------------------+
|2022-08-02  |29-01-2022|2022-04-03 21:08:37|
|2022-03-28  |13-10-2022|2022-10-07 00:07:36|
|2022-02-17  |11-11-2021|2022-05-06 13:06:57|
|2022-09-08  |07-10-2022|2022-08-26 17:09:36|
|2022-09-04  |22-07-2022|2022-09-23 22:58:26|
+------------+----------+-------------------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace the file_path with the source file location which you have downloaded.

df_2 = spark.read.format("csv").option("header", True).load(file_path)
df_2.printSchema()

"""
root
 |-- pyspark_date: string (nullable = true)
 |-- other_date: string (nullable = true)
 |-- pyspark_timestamp: string (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to format date column into other date formats in PySpark Azure Databricks?

Let’s see how to convert a date column into StringType values with different date formats in PySpark Azure Databricks.

Example 1:

# using select()

from pyspark.sql.functions import date_format

df.select("pyspark_date", 
          date_format("pyspark_date", "MMM-yyyy").alias("fmt_date"),
          "pyspark_timestamp", 
          date_format("pyspark_timestamp", "dd-MMM-yyyy").alias("fmt_time"),
         ).show()

"""
Output:

+------------+--------+-------------------+-----------+
|pyspark_date|fmt_date|  pyspark_timestamp|   fmt_time|
+------------+--------+-------------------+-----------+
|  2022-08-02|Aug-2022|2022-04-03 21:08:37|03-Apr-2022|
|  2022-03-28|Mar-2022|2022-10-07 00:07:36|07-Oct-2022|
|  2022-02-17|Feb-2022|2022-05-06 13:06:57|06-May-2022|
|  2022-09-08|Sep-2022|2022-08-26 17:09:36|26-Aug-2022|
|  2022-09-04|Sep-2022|2022-09-23 22:58:26|23-Sep-2022|
+------------+--------+-------------------+-----------+

"""

Example 2:

# Using withColumn()

from pyspark.sql.functions import date_format

df \
.withColumn("fmt_date", date_format("pyspark_date", "dd/MMM/yyyy")) \
.withColumn("fmt_time", date_format("pyspark_timestamp", "dd/MMM")) \
.select("pyspark_date", "fmt_date", "pyspark_timestamp", "fmt_time") \
.show()


"""
Output:

+------------+-----------+-------------------+--------+
|pyspark_date|   fmt_date|  pyspark_timestamp|fmt_time|
+------------+-----------+-------------------+--------+
|  2022-08-02|02/Aug/2022|2022-04-03 21:08:37|  03/Apr|
|  2022-03-28|28/Mar/2022|2022-10-07 00:07:36|  07/Oct|
|  2022-02-17|17/Feb/2022|2022-05-06 13:06:57|  06/May|
|  2022-09-08|08/Sep/2022|2022-08-26 17:09:36|  26/Aug|
|  2022-09-04|04/Sep/2022|2022-09-23 22:58:26|  23/Sep|
+------------+-----------+-------------------+--------+

"""

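Note: Spark uses pattern letters for date and timestamp parsing and formatting, for example yyyy (year), MM (two-digit month), MMM (abbreviated month name), dd (day of month), HH (hour, 0-23), mm (minute), and ss (second).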

How to format date column into other date formats in PySpark Azure Databricks using SQL expression?

Let’s see how to convert a date column to a different date format of StringType using SQL expression in PySpark Azure Databricks.

Example:

In order to use a raw SQL expression, we have to register our DataFrame as a temporary SQL view.

df.createOrReplaceTempView("datetime")

spark.sql("""
SELECT
    pyspark_date,
    date_format(pyspark_date, 'dd-MM-yyyy') AS fmt_date,
    pyspark_timestamp,
    date_format(pyspark_timestamp, 'dd-MMM-yyyy') AS fmt_time
FROM datetime
""").show()

"""
Output:

+------------+----------+-------------------+-----------+
|pyspark_date|  fmt_date|  pyspark_timestamp|   fmt_time|
+------------+----------+-------------------+-----------+
|  2022-08-02|02-08-2022|2022-04-03 21:08:37|03-Apr-2022|
|  2022-03-28|28-03-2022|2022-10-07 00:07:36|07-Oct-2022|
|  2022-02-17|17-02-2022|2022-05-06 13:06:57|06-May-2022|
|  2022-09-08|08-09-2022|2022-08-26 17:09:36|26-Aug-2022|
|  2022-09-04|04-09-2022|2022-09-23 22:58:26|23-Sep-2022|
+------------+----------+-------------------+-----------+

"""

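What happens when we use the date_format() function on a date column that is not in PySpark DateTime format?

When we use the date_format() function on a string column that is not in PySpark’s default date format (yyyy-MM-dd), Spark cannot implicitly cast the value to a timestamp, so the function returns null. (With ANSI mode enabled, the failed cast raises an error instead.) Let’s try to understand this using an example.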

Example:

from pyspark.sql.functions import date_format

df.select("other_date", date_format("other_date", "MMM-yyyy").alias("fmt_date")).show()

"""
Output:

+----------+--------+
|other_date|fmt_date|
+----------+--------+
|29-01-2022|    null|
|13-10-2022|    null|
|11-11-2021|    null|
|07-10-2022|    null|
|22-07-2022|    null|
+----------+--------+

"""

I have attached the complete code used in this blog in notebook format at this GitHub link. You can download the notebook and import it into Databricks, Jupyter Notebook, etc.

When should you use the PySpark date_format() in Azure Databricks?

These could be the possible reasons:

  1. To format the PySpark date of StringType into another StringType date format
  2. To format the PySpark date of DateType into another StringType date format
  3. To format the PySpark date time of TimestampType into another StringType date format

Real World Use Case Scenarios for PySpark DataFrame date_format() in Azure Databricks

  • Assume that you have a PySpark date column in the form “yyyy-MM-dd” and you want to extract the year part. You can use the date_format() function by passing the date column and the output pattern “yyyy”.

  • Assume that you have a PySpark timestamp column in the form “yyyy-MM-dd HH:mm:ss” and you need to format it as “dd-MM-yyyy” for easier reading. You can use the date_format() function by passing the timestamp column and the output pattern “dd-MM-yyyy”.

What are the alternatives to the date_format() function in PySpark Azure Databricks?

There are multiple alternatives to the date_format() function, which are as follows:

  • to_date(): converts a string column in a PySpark date format into DateType; casting the result back to StringType gives a string in the default yyyy-MM-dd format.
  • to_timestamp(): converts a string column in a PySpark timestamp format into TimestampType; casting the result back to StringType gives a string in the default yyyy-MM-dd HH:mm:ss format.

Final Thoughts

In this article, we have learned how to use the PySpark date_format() function in Azure Databricks, with examples for both the DataFrame API and SQL expressions. I have also covered what happens when the input is not in the default date format. I hope this information was helpful.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.