How to Select Columns From DataFrame in Databricks

In this article, I will explain how to use the select() function in PySpark on Azure Databricks. I will also show you how to select specific columns from a DataFrame in Databricks. Let’s get into the step-by-step details of using the select() function in Databricks.


What is the use of the select() function in PySpark Databricks?

select() is a function used to select single columns, multiple columns, columns by index, all columns from a list, and nested columns from a DataFrame. It is a transformation function, so it returns a new DataFrame containing the selected columns.

First, let’s create a DataFrame.

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Azurelib.com').getOrCreate()
data = [("John","Smith","USA","CA"),
    ("Rakesh","Tiwari","USA","NY"),
    ("Mohan","Williams","USA","CA"),
    ("Raj","kumar","USA","FL")
  ]
columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data = data, schema = columns)
df.show(truncate=False)
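
Because select() is a transformation, it returns a new DataFrame rather than modifying the one it is called on. Here is a minimal sketch demonstrating this (names_df is just an illustrative variable name):

# select() returns a new DataFrame; the original df keeps all columns
names_df = df.select("firstname", "lastname")
print(df.columns)        # ['firstname', 'lastname', 'country', 'state']
print(names_df.columns)  # ['firstname', 'lastname']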

1. Select Single & Multiple Columns in Databricks

We can select single or multiple columns of a DataFrame by passing the column names we want to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents. Below are the ways to select single, multiple, or all columns.

df.select("firstname","lastname").show()
df.select(df.firstname,df.lastname).show()
df.select(df["firstname"],df["lastname"]).show()

# By using the col() function
from pyspark.sql.functions import col
df.select(col("firstname"),col("lastname")).show()

# Select columns by regular expression
df.select(df.colRegex("`^.*name*`")).show()
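
As a related tip, col() also lets you rename columns while selecting, via the standard alias() method; a small sketch (the new names first_name and last_name are purely illustrative):

# Rename columns while selecting, using alias()
df.select(col("firstname").alias("first_name"),
          col("lastname").alias("last_name")).show()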

2. Select All the Columns From List in Azure Databricks

Sometimes we may need to select all DataFrame columns from a Python list. In the example below, the columns list object holds all of the DataFrame's column names.

# Select All columns from List
df.select(*columns).show()

# Select All columns
df.select([col for col in df.columns]).show()
df.select("*").show()

3. Select the Columns by Index in Azure Databricks

Using Python list slicing on df.columns, we can select columns by their index.

# Select the first 3 columns and show the top 3 rows
df.select(df.columns[:3]).show(3)

# Select the 3rd and 4th columns (index 2 and 3) and show the top 3 rows
df.select(df.columns[2:4]).show(3)
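
The same slicing approach works for a single column too; for instance, index 1 picks the second column:

# Select only the 2nd column (index 1)
df.select(df.columns[1]).show(3)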

4. Select the Nested Struct Columns in Azure Databricks

If we have a nested struct (StructType) column in a DataFrame, we need to use an explicit column qualifier in order to select its fields.

First, let’s create a new Databricks DataFrame with a struct type.

data = [
        (("John",None,"Smith"),"OH","M"),
        (("Jones","Rose",""),"NY","F"),
        (("Anna","","Williams"),"OH","F"),
        (("Mike","Anne","Jones"),"NY","M"),
        (("Tisan","Mary","Brown"),"NY","M"),
        (("Maria","Mary","Williams"),"OH","M")
        ]

from pyspark.sql.types import StructType,StructField, StringType        
schema = StructType([
    StructField('name', StructType([
         StructField('firstname', StringType(), True),
         StructField('middlename', StringType(), True),
         StructField('lastname', StringType(), True)
         ])),
     StructField('state', StringType(), True),
     StructField('gender', StringType(), True)
     ])
df2 = spark.createDataFrame(data = data, schema = schema)
df2.printSchema()
df2.show(truncate=False) # shows all columns

This yields the schema and output below. Notice that the name column is a struct type consisting of the columns firstname, middlename, and lastname.

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- state: string (nullable = true)
 |-- gender: string (nullable = true)

+-----------------------+-----+------+
|name                   |state|gender|
+-----------------------+-----+------+
|[John,, Smith]         |OH   |M     |
|[Jones, Rose, ]        |NY   |F     |
|[Anna, , Williams]     |OH   |F     |
|[Mike, Anne, Jones]    |NY   |M     |
|[Tisan, Mary, Brown]   |NY   |M     |
|[Maria, Mary, Williams]|OH   |M     |
+-----------------------+-----+------+

Now, let’s select struct column.

df2.select("name").show(truncate=False)

This returns the name struct column as-is.

+-----------------------+
|name                   |
+-----------------------+
|[John,, Smith]         |
|[Jones, Rose, ]        |
|[Anna, , Williams]     |
|[Mike, Anne, Jones]    |
|[Tisan, Mary, Brown]   |
|[Maria, Mary, Williams]|
+-----------------------+

In order to select a specific field from a nested struct, we need to explicitly qualify it with the nested struct column name.

df2.select("name.firstname","name.lastname").show(truncate=False)

This outputs firstname and lastname from the name struct column.


+---------+--------+
|firstname|lastname|
+---------+--------+
|John     |Smith   |
|Jones    |        |
|Anna     |Williams|
|Mike     |Jones   |
|Tisan    |Brown   |
|Maria    |Williams|
+---------+--------+

In order to get all the fields from the struct column, use name.* as shown below.

df2.select("name.*").show(truncate=False)

This yields the output below.

+---------+----------+--------+
|firstname|middlename|lastname|
+---------+----------+--------+
|John     |null      |Smith   |
|Jones    |Rose      |        |
|Anna     |          |Williams|
|Mike     |Anne      |Jones   |
|Tisan    |Mary      |Brown   |
|Maria    |Mary      |Williams|
+---------+----------+--------+
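
For completeness, the string qualifier is not the only way to reach a nested field; col() and bracket indexing on the struct column are equivalent. A minimal sketch:

from pyspark.sql.functions import col

# Equivalent ways to qualify nested struct fields
df2.select(col("name.firstname"), col("name.lastname")).show(truncate=False)
df2.select(df2["name"]["firstname"], df2["name"]["lastname"]).show(truncate=False)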

5. Select Columns in Databricks: Full Practical Example

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Azurelib.com').getOrCreate()

data = [("John","Smith","USA","CA"),
    ("Maria","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Mary","Jones","USA","FL")
  ]

columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data = data, schema = columns)
df.show(truncate=False)

df.select("firstname").show()

df.select("firstname","lastname").show()

# Using the DataFrame object name
df.select(df.firstname,df.lastname).show()

# Using the col() function
from pyspark.sql.functions import col
df.select(col("firstname"),col("lastname")).show()

data = [
        (("John",None,"Smith"),"OH","M"),
        (("Jones","Rose",""),"NY","F"),
        (("Anna","","Williams"),"OH","F"),
        (("Mike","Anne","Jones"),"NY","M"),
        (("Tisan","Mary","Brown"),"NY","M"),
        (("Maria","Mary","Williams"),"OH","M")
        ]

from pyspark.sql.types import StructType,StructField, StringType        
schema = StructType([
    StructField('name', StructType([
         StructField('firstname', StringType(), True),
         StructField('middlename', StringType(), True),
         StructField('lastname', StringType(), True)
         ])),
     StructField('state', StringType(), True),
     StructField('gender', StringType(), True)
     ])

df2 = spark.createDataFrame(data = data, schema = schema)
df2.printSchema()
df2.show(truncate=False) # shows all columns

df2.select("name").show(truncate=False)
df2.select("name.firstname","name.lastname").show(truncate=False)
df2.select("name.*").show(truncate=False)


6. Conclusion

In this article, you have learned that select() is a transformation function of the DataFrame, used to select single or multiple columns, all columns from a list, and columns by index. You have also learned how to select nested struct columns and their individual fields from a DataFrame.