How to use Column class of PySpark in Azure Databricks?

Are you looking to find out how to use the PySpark Column class in the Azure Databricks cloud, or maybe you are looking for a way to access or modify columns using the Column functions? If so, you have landed on the correct page. I will show you how to use the Column class and perform operations on top of it in Azure Databricks, explained with practical examples. So without wasting any time, let’s start with a step-by-step guide to understanding how to use the Column class and work with columns in PySpark.

In this blog, I will teach you the following with practical examples:

  • Creating Column Class Object
  • Accessing columns
  • Column operations
  • Common column functions

The PySpark Column class helps with accessing columns, manipulating column values, performing operations, and using some common column functions.

Create a simple DataFrame

Gentle reminder:

In Databricks,

  • SparkSession is made available as spark
  • SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Create manual PySpark DataFrame

data = [
    ("Talbert","Taysbil","21-07-1992",44,"Male","6024 Onsgard Way","IL"),
    ("Dyana","Tomini","10-03-1963",59,"Female","082 Loomis Pass","TX"),
    ("Fredia","Cuschieri","03-03-1957",23,"Female","681 Mayer Lane","NV"),
    ("Kyle","Helleckas","08-12-1956",44,"Female","054 Manitowish Crossing","FL"),
    ("Linet","Stroyan","23-09-1996",12,"Female","8242 Sunfield Plaza","NC"),
    ("Nev","Borsi","04-01-1992",None,"Male","59 Utah Street","NY"),
    ("Ogdan","Baraja","04-01-1982",44,"Male","567 Oriole Park","FL"),
    ("Darcee","Draper","03-02-1979",None,"Female","5406 Monterey Terrace","AL"),
    ("Delainey","Caley","16-07-1975",47,"Male","83 Fremont Plaza","NC"),
    ("Siward","Calder","04-03-1958",None,"Male","22 Cody Trail","CA")
]

columns = ["f_name","l_name","dob","age","gender","street","state"]
df = spark.createDataFrame(data, schema=columns)
df.printSchema()
df.show(truncate=False)

"""
root
 |-- f_name: string (nullable = true)
 |-- l_name: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- age: long (nullable = true)
 |-- gender: string (nullable = true)
 |-- street: string (nullable = true)
 |-- state: string (nullable = true)

+--------+---------+----------+----+------+-----------------------+-----+
|f_name  |l_name   |dob       |age |gender|street                 |state|
+--------+---------+----------+----+------+-----------------------+-----+
|Talbert |Taysbil  |21-07-1992|44  |Male  |6024 Onsgard Way       |IL   |
|Dyana   |Tomini   |10-03-1963|59  |Female|082 Loomis Pass        |TX   |
|Fredia  |Cuschieri|03-03-1957|23  |Female|681 Mayer Lane         |NV   |
|Kyle    |Helleckas|08-12-1956|44  |Female|054 Manitowish Crossing|FL   |
|Linet   |Stroyan  |23-09-1996|12  |Female|8242 Sunfield Plaza    |NC   |
|Nev     |Borsi    |04-01-1992|null|Male  |59 Utah Street         |NY   |
|Ogdan   |Baraja   |04-01-1982|44  |Male  |567 Oriole Park        |FL   |
|Darcee  |Draper   |03-02-1979|null|Female|5406 Monterey Terrace  |AL   |
|Delainey|Caley    |16-07-1975|47  |Male  |83 Fremont Plaza       |NC   |
|Siward  |Calder   |04-03-1958|null|Male  |22 Cody Trail          |CA   |
+--------+---------+----------+----+------+-----------------------+-----+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# replace the file_path with the source file location which you have downloaded.

df_2 = spark.read.format("csv").option("header", True).option("inferSchema", True).load(file_path)
df_2.printSchema()

"""
root
 |-- f_name: string (nullable = true)
 |-- l_name: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- street: string (nullable = true)
 |-- state: string (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to create new Column instance of PySpark DataFrame on Azure Databricks using lit() function?

The PySpark lit() function helps in creating a new Column instance with a literal value. Let’s try to create a new column on top of our DataFrame.

Example:

from pyspark.sql.functions import lit

df.select("f_name", "age", lit("azurelib.com").alias("url")).show(5)

"""
Output:

+-------+---+------------+
| f_name|age|         url|
+-------+---+------------+
|Talbert| 44|azurelib.com|
|  Dyana| 59|azurelib.com|
| Fredia| 23|azurelib.com|
|   Kyle| 44|azurelib.com|
|  Linet| 12|azurelib.com|
+-------+---+------------+

"""

How to access the column of PySpark DataFrame on Azure Databricks?

Let’s see how to access the columns of a PySpark DataFrame on Azure Databricks. In this section, we will look at the different ways in which a DataFrame column can be accessed.

Example:

from pyspark.sql.functions import col, column

# Method 1: Using column name as string
df.select("f_name").show(1)

# Method 2: Using dot notation
df.select(df.f_name).show(1)

# Method 3: using column name as key
df.select(df['f_name']).show(1)

# Method 4: Using col() function
df.select(col("f_name")).show(1)

# Method 5: Using column() function
df.select(column("f_name")).show(1)

# The above codes generate the following output.

"""
Output:

+-------+
| f_name|
+-------+
|Talbert|
+-------+

"""

How to perform a mathematical operation on PySpark columns in Azure Databricks?

Let’s see how to perform mathematical operations on PySpark columns in Azure Databricks. In this section, we will perform addition, subtraction, multiplication, division, and modulus operations on top of the numeric column ‘age’.

Example:

from pyspark.sql.functions import col

df.select("age",
  (col("age")+1).alias("age + 1"),
  (col("age")-1).alias("age - 1"),
  (col("age")*10).alias("age * 10"),
  (col("age")/10).alias("age / 10"),
  (col("age")%2).alias("age % 2")
).show(2)

"""
Output:

+---+-------+-------+--------+--------+-------+
|age|age + 1|age - 1|age * 10|age / 10|age % 2|
+---+-------+-------+--------+--------+-------+
| 44|     45|     43|     440|     4.4|      0|
| 59|     60|     58|     590|     5.9|      1|
+---+-------+-------+--------+--------+-------+

"""

Note: Mathematical operations can only be performed on numerical columns.

How to use column functions of PySpark DataFrame using Azure Databricks?

PySpark provides various functions for manipulating DataFrame columns: data manipulation functions, data extraction functions, and so on. Let’s explore some of the commonly used column functions with practical examples.

1. Renaming column

from pyspark.sql.functions import col

df.select(
    col("f_name").alias("first_name"),
    col("l_name").name("last_name")
).show(5, truncate=False)

"""
Output:

+----------+---------+
|first_name|last_name|
+----------+---------+
|Talbert   |Taysbil  |
|Dyana     |Tomini   |
|Fredia    |Cuschieri|
|Kyle      |Helleckas|
|Linet     |Stroyan  |
+----------+---------+

"""

In this example, we have used the alias() and name() functions to rename columns. There is no difference between the two: name() is simply another name for alias(), which works much like the AS keyword and so feels familiar to people coming from a SQL background.

2. Modifying column datatype

from pyspark.sql.functions import col

df.select(
    col("age"),
    col("age").cast("INT"),
    col("age").astype("STRING")
).printSchema()

"""
Output:

root
 |-- age: long (nullable = true)
 |-- age: integer (nullable = true)
 |-- age: string (nullable = true)

"""

In this example, we have used the cast() and astype() functions to change column datatypes. There is no difference between the two: astype() is simply an alias for cast(). cast() matches the SQL CAST syntax, while astype() will feel familiar to people coming from a pandas background.
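If you prefer explicit type objects over type-name strings, cast() also accepts instances from pyspark.sql.types. A minimal sketch, assuming the same df as above:

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, StringType

# Passing DataType objects is equivalent to passing the "INT" / "STRING" strings.
df.select(
    col("age").cast(IntegerType()),
    col("age").cast(StringType())
).printSchema()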

3. Extracting column values

# Substring
from pyspark.sql.functions import col

df \
.withColumn("date", col("dob").substr(1, 2)) \
.withColumn("month", col("dob").substr(4, 2)) \
.withColumn("year", col("dob").substr(7, 4)) \
.select("dob","date","month","year") \
.show(5, truncate=False)

"""
Output:

+----------+----+-----+----+
|dob       |date|month|year|
+----------+----+-----+----+
|21-07-1992|21  |07   |1992|
|10-03-1963|10  |03   |1963|
|03-03-1957|03  |03   |1957|
|08-12-1956|08  |12   |1956|
|23-09-1996|23  |09   |1996|
+----------+----+-----+----+

"""

The PySpark substr() function is very similar to the traditional SQL SUBSTRING function. It helps in fetching part of a column value based on a starting position and a length.
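As a related option, the substring() function from pyspark.sql.functions gives the same result as the Column substr() method. A minimal sketch, assuming the same df:

from pyspark.sql.functions import col, substring

# substring(column, starting position, length) - positions are 1-based.
df.select(
    col("dob"),
    substring(col("dob"), 1, 2).alias("date"),
    substring(col("dob"), 4, 2).alias("month"),
    substring(col("dob"), 7, 4).alias("year")
).show(5, truncate=False)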

4. Ordering column

from pyspark.sql.functions import col

# a) Ascending
df.orderBy(col("age").asc()).show(5)
df.orderBy(col("age").asc_nulls_first()).show(5)
df.orderBy(col("age").asc_nulls_last()).show(5)

"""
Output:

+------+---------+----------+----+------+--------------------+-----+
|f_name|   l_name|       dob| age|gender|              street|state|
+------+---------+----------+----+------+--------------------+-----+
|   Nev|    Borsi|04-01-1992|null|  Male|      59 Utah Street|   NY|
|Siward|   Calder|04-03-1958|null|  Male|       22 Cody Trail|   CA|
|Darcee|   Draper|03-02-1979|null|Female|5406 Monterey Ter...|   AL|
| Linet|  Stroyan|23-09-1996|  12|Female| 8242 Sunfield Plaza|   NC|
|Fredia|Cuschieri|03-03-1957|  23|Female|      681 Mayer Lane|   NV|
+------+---------+----------+----+------+--------------------+-----+

"""
from pyspark.sql.functions import col

# b) Descending
df.orderBy(col("age").desc()).show(5)
df.orderBy(col("age").desc_nulls_first()).show(5)
df.orderBy(col("age").desc_nulls_last()).show(5)

"""
Output:

+------+---------+----------+----+------+--------------------+-----+
|f_name|   l_name|       dob| age|gender|              street|state|
+------+---------+----------+----+------+--------------------+-----+
|Siward|   Calder|04-03-1958|null|  Male|       22 Cody Trail|   CA|
|   Nev|    Borsi|04-01-1992|null|  Male|      59 Utah Street|   NY|
|Darcee|   Draper|03-02-1979|null|Female|5406 Monterey Ter...|   AL|
| Linet|  Stroyan|23-09-1996|  12|Female| 8242 Sunfield Plaza|   NC|
|Fredia|Cuschieri|03-03-1957|  23|Female|      681 Mayer Lane|   NV|
+------+---------+----------+----+------+--------------------+-----+

"""

We can use the desc(), desc_nulls_first(), and desc_nulls_last() functions inside orderBy() to sort a column in descending order. The desc_nulls_first() function places the null values first, while desc_nulls_last() (and desc(), by default) places them last. The asc(), asc_nulls_first(), and asc_nulls_last() functions work the same way for ascending order, where asc() places nulls first by default.
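These ordering functions can also be mixed across several columns in a single orderBy() call. A minimal sketch, assuming the same df:

from pyspark.sql.functions import col

# Sort by state ascending, and within each state by age descending with nulls last.
df.orderBy(col("state").asc(), col("age").desc_nulls_last()).show(5)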

5. Column operations

from pyspark.sql.functions import col

# a) Using mathematical operations
df.select("age",
  (col("age")+1).alias("age + 1"),
  (col("age")-1).alias("age - 1"),
  (col("age")*10).alias("age * 10"),
  (col("age")/10).alias("age / 10"),
  (col("age")%2).alias("age % 2")
).show(2)

"""
Output:

+---+-------+-------+--------+--------+-------+
|age|age + 1|age - 1|age * 10|age / 10|age % 2|
+---+-------+-------+--------+--------+-------+
| 44|     45|     43|     440|     4.4|      0|
| 59|     60|     58|     590|     5.9|      1|
+---+-------+-------+--------+--------+-------+

"""

These operations can be done only on numerical columns.

# b) Using the less than operator
df.filter(col("age")<40).show(5)

"""
Output:

+------+---------+----------+---+------+-------------------+-----+
|f_name|   l_name|       dob|age|gender|             street|state|
+------+---------+----------+---+------+-------------------+-----+
|Fredia|Cuschieri|03-03-1957| 23|Female|     681 Mayer Lane|   NV|
| Linet|  Stroyan|23-09-1996| 12|Female|8242 Sunfield Plaza|   NC|
+------+---------+----------+---+------+-------------------+-----+

"""

6. Boolean expression

# a) Starts with

from pyspark.sql.functions import col
df.filter(col("f_name").startswith("D")).show(5)

"""
Output:

+--------+------+----------+----+------+--------------------+-----+
|  f_name|l_name|       dob| age|gender|              street|state|
+--------+------+----------+----+------+--------------------+-----+
|   Dyana|Tomini|10-03-1963|  59|Female|     082 Loomis Pass|   TX|
|  Darcee|Draper|03-02-1979|null|Female|5406 Monterey Ter...|   AL|
|Delainey| Caley|16-07-1975|  47|  Male|    83 Fremont Plaza|   NC|
+--------+------+----------+----+------+--------------------+-----+

"""

The PySpark startswith() function lists all the rows whose first name starts with the value ‘D’. Note that the parameter passed to the startswith() function is case-sensitive.
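Boolean column expressions like this one can also be combined with & (and), | (or), and ~ (not); just wrap each condition in parentheses. A minimal sketch, assuming the same df:

from pyspark.sql.functions import col

# First names starting with "D" AND gender equal to "Female".
df.filter(col("f_name").startswith("D") & (col("gender") == "Female")).show(5)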

# b) Ends with

from pyspark.sql.functions import col
df.filter(col("f_name").endswith("e")).show(5)

"""
Output:

+------+---------+----------+----+------+--------------------+-----+
|f_name|   l_name|       dob| age|gender|              street|state|
+------+---------+----------+----+------+--------------------+-----+
|  Kyle|Helleckas|08-12-1956|  44|Female|054 Manitowish Cr...|   FL|
|Darcee|   Draper|03-02-1979|null|Female|5406 Monterey Ter...|   AL|
+------+---------+----------+----+------+--------------------+-----+

"""

The PySpark endswith() function lists all the rows whose first name ends with the value ‘e’. Note that the parameter passed to the endswith() function is case-sensitive.

# c) Contains

from pyspark.sql.functions import col
df.filter(col("f_name").contains("ce")).show(5)

"""
Output:

+------+------+----------+----+------+--------------------+-----+
|f_name|l_name|       dob| age|gender|              street|state|
+------+------+----------+----+------+--------------------+-----+
|Darcee|Draper|03-02-1979|null|Female|5406 Monterey Ter...|   AL|
+------+------+----------+----+------+--------------------+-----+

"""

The PySpark contains() function looks for the specified value anywhere within a column value. Note that the parameter passed to the contains() function is case-sensitive.
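If you need a case-insensitive match, one option is to normalise the column with lower() before calling contains(). A minimal sketch, assuming the same df:

from pyspark.sql.functions import col, lower

# Lower-case the column first, so "CE", "Ce", and "ce" all match.
df.filter(lower(col("f_name")).contains("ce")).show(5)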

# d) Between

from pyspark.sql.functions import col
df.filter(col("age").between(40, 50)).show(5)

"""
Output:

+--------+---------+----------+---+------+--------------------+-----+
|  f_name|   l_name|       dob|age|gender|              street|state|
+--------+---------+----------+---+------+--------------------+-----+
| Talbert|  Taysbil|21-07-1992| 44|  Male|    6024 Onsgard Way|   IL|
|    Kyle|Helleckas|08-12-1956| 44|Female|054 Manitowish Cr...|   FL|
|   Ogdan|   Baraja|04-01-1982| 44|  Male|     567 Oriole Park|   FL|
|Delainey|    Caley|16-07-1975| 47|  Male|    83 Fremont Plaza|   NC|
+--------+---------+----------+---+------+--------------------+-----+

"""

The PySpark between() function checks whether a numeric or date value falls within a range; both bounds are inclusive.
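Since between() returns a boolean column, it can also be used inside select() to flag rows instead of filtering them. A minimal sketch, assuming the same df; the alias age_40_to_50 is just an illustrative name:

from pyspark.sql.functions import col

# True when age falls within [40, 50], False otherwise, null when age is null.
df.select("f_name", "age", col("age").between(40, 50).alias("age_40_to_50")).show(5)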

# e) Like

from pyspark.sql.functions import col
df.filter(col("l_name").like("T%")).show(5)

"""
Output:

+-------+-------+----------+---+------+----------------+-----+
| f_name| l_name|       dob|age|gender|          street|state|
+-------+-------+----------+---+------+----------------+-----+
|Talbert|Taysbil|21-07-1992| 44|  Male|6024 Onsgard Way|   IL|
|  Dyana| Tomini|10-03-1963| 59|Female| 082 Loomis Pass|   TX|
+-------+-------+----------+---+------+----------------+-----+

"""

The PySpark like() function acts very similar to the SQL LIKE operator. Here, it looks for last names starting with the character ‘T’.

# f) isin

from pyspark.sql.functions import col
df.filter(col("state").isin(['IL','FL'])).show(5)

"""
Output:

+-------+---------+----------+---+------+--------------------+-----+
| f_name|   l_name|       dob|age|gender|              street|state|
+-------+---------+----------+---+------+--------------------+-----+
|Talbert|  Taysbil|21-07-1992| 44|  Male|    6024 Onsgard Way|   IL|
|   Kyle|Helleckas|08-12-1956| 44|Female|054 Manitowish Cr...|   FL|
|  Ogdan|   Baraja|04-01-1982| 44|  Male|     567 Oriole Park|   FL|
+-------+---------+----------+---+------+--------------------+-----+

"""

The PySpark isin() function checks whether a column value is present in the given list of values.
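To keep only the rows whose value is NOT in the list, negate isin() with the ~ operator. A minimal sketch, assuming the same df:

from pyspark.sql.functions import col

# Rows whose state is neither IL nor FL.
df.filter(~col("state").isin(['IL', 'FL'])).show(5)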

# g) Checking for null and not null values

from pyspark.sql.functions import col

# isNull
df.filter(col("age").isNull()).show(5)

"""
Output:

+------+------+----------+----+------+--------------------+-----+
|f_name|l_name|       dob| age|gender|              street|state|
+------+------+----------+----+------+--------------------+-----+
|   Nev| Borsi|04-01-1992|null|  Male|      59 Utah Street|   NY|
|Darcee|Draper|03-02-1979|null|Female|5406 Monterey Ter...|   AL|
|Siward|Calder|04-03-1958|null|  Male|       22 Cody Trail|   CA|
+------+------+----------+----+------+--------------------+-----+

"""

# isNotNull
df.filter(col("age").isNotNull()).show(5)

"""
Output:

+-------+---------+----------+---+------+--------------------+-----+
| f_name|   l_name|       dob|age|gender|              street|state|
+-------+---------+----------+---+------+--------------------+-----+
|Talbert|  Taysbil|21-07-1992| 44|  Male|    6024 Onsgard Way|   IL|
|  Dyana|   Tomini|10-03-1963| 59|Female|     082 Loomis Pass|   TX|
| Fredia|Cuschieri|03-03-1957| 23|Female|      681 Mayer Lane|   NV|
|   Kyle|Helleckas|08-12-1956| 44|Female|054 Manitowish Cr...|   FL|
|  Linet|  Stroyan|23-09-1996| 12|Female| 8242 Sunfield Plaza|   NC|
+-------+---------+----------+---+------+--------------------+-----+

"""

The PySpark isNull() and isNotNull() functions are helpful for identifying or fetching the null and non-null records respectively.
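Both functions return boolean columns, so they can also be selected directly as flags rather than used in a filter. A minimal sketch, assuming the same df; the alias age_missing is just an illustrative name:

from pyspark.sql.functions import col

# True for rows where age is null, False otherwise.
df.select("f_name", "age", col("age").isNull().alias("age_missing")).show(5)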


I have attached the complete code used in this blog, in notebook format, to this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you use the Column class for creating columns in Azure Databricks?

These could be the possible reasons:

  1. To access columns
  2. To make use of column functions
  3. To perform arithmetic operations

Real World Use Case Scenarios for PySpark Column class in Azure Databricks

  • Assume that you have n number of columns in a DataFrame and you want to fetch only some of them. You can use the col() function inside the select() function to do so.
  • Whenever you want to perform an arithmetic operation on a column, such as adding or subtracting a value.
  • The Column class provides a lot of built-in functions that are commonly used in day-to-day work, and the col() function gives you access to them. A short combined sketch follows after this list.
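For illustration, here is a minimal sketch, assuming the same df created earlier in this blog; the alias age_next_year is just an illustrative name.

from pyspark.sql.functions import col

# Keep only rows with a known age, pick a few columns with col(),
# and derive a new value arithmetically (illustrative alias name).
df.filter(col("age").isNotNull()) \
    .select("f_name", "state", (col("age") + 1).alias("age_next_year")) \
    .show(5)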

What are the alternatives for creating PySpark columns in Azure Databricks?

There are multiple alternatives for creating PySpark columns, which are as follows:

  • lit(): used for creating a new column from a literal value
  • withColumn(): used for adding a new column, built from a literal or from existing columns, to a DataFrame (see the sketch below)
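For example, here is a minimal sketch, assuming the same df created earlier; the column names url and age_plus_one are just illustrative.

from pyspark.sql.functions import col, lit

# withColumn() adds a column to the DataFrame: one from a literal value
# via lit(), and one derived from an existing column (illustrative names).
df.withColumn("url", lit("azurelib.com")) \
    .withColumn("age_plus_one", col("age") + 1) \
    .select("f_name", "url", "age_plus_one") \
    .show(5)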

Final Thoughts

In this article, we have learned about creating and accessing columns and using the common functions of the PySpark Column class in Azure Databricks, with examples explained clearly. I have also covered the different scenarios where this could be useful, with practical examples. I hope the information provided has helped you gain knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.