How to add new columns in PySpark Azure Databricks?

Are you looking to find how to add columns of DataFrame in PySpark Azure Databricks cloud or maybe you are looking for a solution, to change the existing value or data type of a Dataframe column in PySpark Databricks using the withColumn() method? If you are looking for any of these problem solutions, then you have landed on the correct page. I will also show you how to use PySpark to add or change the old column value of a Dataframe in Azure Databricks. I will explain it by taking a practical example. So don’t waste time let’s start step by step guide to understanding how to add or change the existing value of columns in PySpark Dataframe.

In this blog, I will teach you the following with practical examples:

  • Syntax of withColumn()
  • Adding new column
  • Changing existing column value
  • Changing existing column DataType

withColumn() method used to add a column or replace the existing column that has the same name.

Syntax: dataframe_name.withColumn( column_name, expression)

What is the syntax of the withColumn() function in PySpark Azure Databricks?

The syntax is as follows:

dataframe_name.withColumn(colName , col)
Parameter NameRequiredDescription
colName (str)YesName of the new column.
col (pyspark.sql.column.Column)YesA column expression for the new column.
Table 1: withColumn() Method in PySpark Databricks Parameter list with Details

Official Apache Spark documentation link: withColumn()

Create a simple DataFrame

Gentle reminder:

In Databricks,

  • sparkSession made available as spark
  • sparkContext made available as sc

In case, you want to create it manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder 
    .master("local[*]") 
    .appName("azurelib.com") 
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

data = [    
    (1,"Jeanie","fm",446833),
    (2,"Poul","m",597923),
    (3,"Jonis","m",405672),
    (4,"Nellie","fm",742674),
    (5,"Sibbie","fm",35890)
]

df = spark.createDataFrame(data, schema=["id", "name", "gender", "salary"])
df.printSchema()
df.show()

"""
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

+---+------+------+------+
| id|  name|gender|salary|
+---+------+------+------+
|  1|Jeanie|    fm|446833|
|  2|  Poul|     m|597923|
|  3| Jonis|     m|405672|
|  4|Nellie|    fm|742674|
|  5|Sibbie|    fm| 35890|
+---+------+------+------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace the file_paths with the source file location which you have downloaded.

df_2 = spark.read.format("csv").option("header", True).load(file_path)
df_2.printSchema()

"""
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to add column in PySpark Azure Databricks using the withColumn() function?

The PySpark withColumn() function is a transformation function of DataFrame which is used to create a new column.

Example:

In this example, we are trying to create a new column called ‘country’ with a literal value. If you want to provide a literal value for a column, use the lit() function, which creates a literal value in PySpark.

from pyspark.sql.functions import lit

df.withColumn('country', lit('India')).show()

"""
Output:

+---+------+------+------+-------+
| id|  name|gender|salary|country|
+---+------+------+------+-------+
|  1|Jeanie|    fm|446833|  India|
|  2|  Poul|     m|597923|  India|
|  3| Jonis|     m|405672|  India|
|  4|Nellie|    fm|742674|  India|
|  5|Sibbie|    fm| 35890|  India|
+---+------+------+------+-------+

"""

How to change the value of an existing column in PySpark Azure Databricks?

To change the value of an existing DataFrame, use the withColumn() function.

Example:

In this example, we are trying to change the gender column value from lowercase to uppercase using the upper() function.

from pyspark.sql.functions import col, upper

df.withColumn('gender', upper(col('gender'))).show()

"""
Output:

+---+------+------+------+
| id|  name|gender|salary|
+---+------+------+------+
|  1|Jeanie|    FM|446833|
|  2|  Poul|     M|597923|
|  3| Jonis|     M|405672|
|  4|Nellie|    FM|742674|
|  5|Sibbie|    FM| 35890|
+---+------+------+------+

"""

How to change the DataType of an existing column in PySpark Azure Databricks?

To change the data type of an existing DataFrame, use the withColumn() function.

Example:

As you can see the column type of salary column of String type.

df.select('salary').printSchema()

"""
Output:

root
 |-- salary: long (nullable = true)

"""

In this example, we are trying to change the data type of the salary column from LongType to DoubleType using the withColumn() function and you can see the result of the salary column in decimal format.

from pyspark.sql.functions import col

df = df.withColumn("salary", col("salary").cast("Double"))  
df.select("Salary").printSchema()
df.show(truncate=False)

"""
Output:

root
 |-- salary: double (nullable = true)

+---+------+------+--------+
|id |name  |gender|salary  |
+---+------+------+--------+
|1  |Jeanie|fm    |446833.0|
|2  |Poul  |m     |597923.0|
|3  |Jonis |m     |405672.0|
|4  |Nellie|fm    |742674.0|
|5  |Sibbie|fm    |35890.0 |
+---+------+------+--------+

"""

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook in databricks, jupyter notebook, etc.

When should you use PySpark withColumn() function in Azure Databricks?

In PySpark withColumn() is a transformation function of DataFrame that is used to change the value, convert the datatype of an existing column, create a new column, and many more.

Real World Use Case Scenarios for PySpark Dataframe withColumnRenamed() function in Azure Databricks?

  • Assume that in the existing DataFrame you want to add the constant column or derived column then withColumn can be useful. For example assume that in the Employee DataFrame which contains the Country column, want to add another column that tells employee id Indian or Non-India. Then we can add withColumn using the boolean expression.

Final Thoughts

In this article, we have learned about the PySpark withColumn() method to add a new column, change the value of the existing DataFrame column, and change the existing DataFrame column data type in Azure Databricks along with the examples explained clearly. I have also covered different scenarios with practical examples that could be possible. I hope the information that was provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.