Are you wondering how to change a column's data type, or maybe you want to modify the value of an existing column of a DataFrame in Azure Databricks? Then you have reached the right blog post. In this article I will take you through a step-by-step guide on how to use the withColumn() function in PySpark to add and modify DataFrame columns. We will also see how you can add or drop a column in an Azure Databricks PySpark DataFrame. So let's start.
Azure Databricks Spark Tutorial for Beginner to Advanced Level – Lesson 1
Contents
- How to use WithColumn() function in Azure Databricks pyspark?
- 1. Change DataType using withColumn() in Databricks
- 2. Update Value of an Existing Column in Databricks pyspark
- 3. Create a Column from an Existing One in Databricks
- 4. Add a New Column using withColumn() in Databricks
- 5. Rename Column Name in Databricks
- 6. Drop Column From DataFrame in Databricks
- 7. WithColumn() Complete Example in Azure Databricks pyspark
- Conclusion
How to use WithColumn() function in Azure Databricks pyspark?
WithColumn() is a transformation function of DataFrame in Databricks that is used to change the value of a column, convert the data type of an existing column, create a new column, and more. In this post, we will walk you through commonly used DataFrame column operations using withColumn() examples.
First, let’s create a DataFrame to work with.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = [('James','','Smith','1991-04-01','M',3000),
('Michael','Rose','','2000-05-19','M',4000),
('Robert','','Williams','1978-09-05','M',4000),
('Maria','Anne','Jones','1967-12-01','F',4000),
('Jen','Mary','Brown','1980-02-17','F',-1)
]
columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema=columns)
1. Change DataType using withColumn() in Databricks
By using withColumn() on a DataFrame, we can change or cast the data type of a column. In order to change the data type, we also need to use the cast() function along with withColumn(). The statement below casts the salary column to Integer (the col() function comes from pyspark.sql.functions).
df.withColumn("salary",col("salary").cast("Integer")).show()
2. Update Value of an Existing Column in Databricks pyspark
The withColumn() function of DataFrame can also be used to change the value of an existing column. In order to change the value, pass the existing column name as the first argument and the value to be assigned as the second argument to the withColumn() function.
Note: The second argument should be of Column type. Also, see Different Ways to Update DataFrame Column.
df.withColumn("salary",col("salary")*100).show()
This snippet multiplies the value of the “salary” column by 100 and writes the result back to the “salary” column.
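To illustrate the note above that the second argument must be of Column type, here is a minimal sketch that assigns a constant value instead of an expression; the plain value (5000 is just an illustrative number) has to be wrapped in lit() so it becomes a Column.

from pyspark.sql.functions import lit

# The constant must be wrapped in lit() to satisfy the Column-type requirement
df.withColumn("salary", lit(5000)).show()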
3. Create a Column from an Existing One in Databricks
To add/create a new column, pass the name you want for the new column as the first argument, and use the second argument to assign a value by applying an operation on an existing column.
df.withColumn("CopiedColumn",col("salary")* -1).show()
This snippet creates a new column “CopiedColumn” by multiplying the “salary” column by -1.
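Deriving a column from existing ones is not limited to arithmetic. As a hypothetical illustration (the “fullname” column is not part of the original walkthrough), you could combine the name columns using concat_ws() from pyspark.sql.functions:

from pyspark.sql.functions import concat_ws, col

# Hypothetical example: build a "fullname" column from the existing name columns
df.withColumn("fullname", concat_ws(" ", col("firstname"), col("lastname"))).show()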
4. Add a New Column using withColumn() in Databricks
In order to create a new column, pass the column name you want as the first argument of the withColumn() transformation function. Make sure this new column is not already present on the DataFrame; if it is present, withColumn() updates the value of that column instead.
In the snippet below, the lit() function is used to add a constant value to a DataFrame column. We can also chain withColumn() calls in order to add multiple columns.
df.withColumn("Country", lit("USA")).show()
df.withColumn("Country", lit("USA")) \
.withColumn("anotherColumn",lit("anotherValue")) \
.show()
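As a side note on the chained version above, the same result can also be sketched with a single select(); this is not part of the withColumn() API itself, just a common alternative when adding several constant columns at once.

# Alternative sketch: add several columns in one select() instead of chaining withColumn()
df.select("*", lit("USA").alias("Country"), lit("anotherValue").alias("anotherColumn")).show()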
5. Rename Column Name in Databricks
Although you cannot rename a column using withColumn(), I still wanted to cover this, as renaming is one of the common operations we perform on a DataFrame. To rename an existing column, use the withColumnRenamed() function on a DataFrame.
df.withColumnRenamed("gender","sex") \
.show(truncate=False)
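If you need to rename more than one column, a minimal sketch is to chain withColumnRenamed() calls (the “dateOfBirth” name here is just an illustrative choice):

# Chain withColumnRenamed() to rename several columns in one statement
df.withColumnRenamed("gender", "sex") \
.withColumnRenamed("dob", "dateOfBirth") \
.show(truncate=False)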
6. Drop Column From DataFrame in Databricks
Use the drop() function to remove a specific column from the DataFrame.
df.drop("salary") \
.show()
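drop() also accepts several column names in one call; as a minimal sketch (the chosen columns are just examples):

# Drop multiple columns in a single call
df.drop("middlename", "salary") \
.show()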
Note: All of these functions return a new DataFrame after applying the transformation instead of updating the existing DataFrame.
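Because of this immutability, you need to assign the result to a variable (or back to df) if you want to keep it; a minimal sketch:

# The original df is unchanged; keep the result by assigning it
df_no_salary = df.drop("salary")
df.show()            # still contains the salary column
df_no_salary.show()  # salary column removed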
7. WithColumn() Complete Example in Azure Databricks pyspark
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [('James','','Smith','1991-04-01','M',3000),
('Michael','Rose','','2000-05-19','M',4000),
('Robert','','Williams','1978-09-05','M',4000),
('Maria','Anne','Jones','1967-12-01','F',4000),
('Jen','Mary','Brown','1980-02-17','F',-1)
]
columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.printSchema()
df.show(truncate=False)
df2 = df.withColumn("salary",col("salary").cast("Integer"))
df2.printSchema()
df2.show(truncate=False)
df3 = df.withColumn("salary",col("salary")*100)
df3.printSchema()
df3.show(truncate=False)
df4 = df.withColumn("CopiedColumn",col("salary")* -1)
df4.printSchema()
df5 = df.withColumn("Country", lit("USA"))
df5.printSchema()
df6 = df.withColumn("Country", lit("USA")) \
.withColumn("anotherColumn",lit("anotherValue"))
df6.printSchema()
df.withColumnRenamed("gender","sex") \
.show(truncate=False)
df4.drop("CopiedColumn") \
.show(truncate=False)
Databricks Official Documentation Link
Conclusion
In this article, you have learned about the usage of the withColumn() function with some examples in Databricks. I hope this helped you gain a good understanding of the function.
- For Azure study material, join the Telegram group: Telegram group link:
- For Azure jobs and other updates, follow me on LinkedIn: Azure Updates on LinkedIn
- Azure Tutorial Videos: Videos Link
How to Select Columns From DataFrame in Databricks
How to Collect() – Retrieve data from DataFrame in Databricks