How to select DataFrame columns in PySpark Azure Databricks?

Are you looking for a way to select columns of a PySpark DataFrame in Azure Databricks, or for a solution to display a DataFrame's column names and values using the select() method? If so, you have landed on the correct page. I will show you what the select() function is and how to use it to select DataFrame columns in Azure Databricks, explained with practical examples. So, without wasting time, let's start this step-by-step guide to understanding how to select columns in a PySpark DataFrame.

In this blog, I will teach you the following with practical examples:

  • Syntax of the select() function
  • Selecting a single column
  • Selecting multiple columns
  • Selecting all columns
  • Selecting columns by index
  • Selecting columns in reverse order

The select() method is used to get the specified columns of a DataFrame in PySpark Azure Databricks.

Syntax: dataframe_name.select()

What is the syntax of the select() function in PySpark Azure Databricks?

The syntax is as follows:

dataframe_name.select(*cols)
| Parameter Name | Required | Description |
| --- | --- | --- |
| cols (str, Column, or list) | Yes | The column name(s) to select. |

Table 1: select() method in PySpark Databricks, parameter list with details

Apache Spark Official documentation link: select()

Create a simple DataFrame

Gentle reminder:

In Databricks,

  • SparkSession is made available as spark
  • SparkContext is made available as sc

In case you want to create them manually, use the below code.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

a) Creating a PySpark DataFrame manually

data = [    
    (1,"Sascha","1998-09-03"),
    (2,"Lise","2008-09-17"),
    (3,"Nola","2008-08-23"),
    (4,"Demetra","1997-06-02"),
    (5,"Lowrance","2006-07-02")
]

df = spark.createDataFrame(data, schema=["id","name","dob"])
df.printSchema()
df.show()

"""
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- dob: string (nullable = true)

+---+--------+----------+
| id|    name|       dob|
+---+--------+----------+
|  1|  Sascha|1998-09-03|
|  2|    Lise|2008-09-17|
|  3|    Nola|2008-08-23|
|  4| Demetra|1997-06-02|
|  5|Lowrance|2006-07-02|
+---+--------+----------+
"""

b) Creating a DataFrame by reading files

Download and use the below source file.

# Replace file_path with the location of the source file you downloaded.
# Without the inferSchema option, every CSV column is read as a string.

df_2 = spark.read.format("csv").option("header", True).load(file_path)
df_2.printSchema()

"""
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- dob: string (nullable = true)
"""

Note: Here, I will be using the manually created DataFrame.

How to select a single column in PySpark Azure Databricks using the select() function?

By providing a column name to the select() function, you can choose a column from the DataFrame. Because DataFrames are immutable, this generates a new DataFrame with the chosen column. To display the contents of the DataFrame, use the show() function.

Examples:

from pyspark.sql.functions import col

# 1. Selecting a column using a string
df.select("name").show()

# 2. Selecting a column using Python dot notation
df.select(df.name).show()

# 3. Selecting a column using the column name as a key
df.select(df["name"]).show()

# 4. Selecting a column using the col() function
df.select(col("name")).show()

# The above four examples give the following output.

"""
Output:

+--------+
|    name|
+--------+
|  Sascha|
|    Lise|
|    Nola|
| Demetra|
|Lowrance|
+--------+

"""

How to select multiple columns in PySpark Azure Databricks using the select() function?

Passing multiple column names to the select() function works the same way: because DataFrames are immutable, it returns a new DataFrame containing only the chosen columns. To display the contents of the DataFrame, use the show() function.

Examples:

from pyspark.sql.functions import col

# 1. Selecting columns using strings
df.select("id", "name").show()

# 2. Selecting columns using Python dot notation
df.select(df.id, df.name).show()

# 3. Selecting columns using column names as keys
df.select(df["id"], df["name"]).show()

# 4. Selecting columns using the col() function
df.select(col("id"), col("name")).show()

# The above four examples give the following output.

"""
Output:

+---+--------+
| id|    name|
+---+--------+
|  1|  Sascha|
|  2|    Lise|
|  3|    Nola|
|  4| Demetra|
|  5|Lowrance|
+---+--------+

"""

How to select all columns in PySpark Azure Databricks using the select() function?

There are multiple ways to select all the columns of a DataFrame.

Examples:

# 1. Selecting all columns using the "*" symbol
df.select("*").show()

# 2. Selecting all columns using a list of column names
df.select(["id", "name", "dob"]).show()

# 3. Selecting all columns using the columns attribute
df.select(df.columns).show()

# The above three examples give the following output.

"""
Output:

+---+--------+----------+
| id|    name|       dob|
+---+--------+----------+
|  1|  Sascha|1998-09-03|
|  2|    Lise|2008-09-17|
|  3|    Nola|2008-08-23|
|  4| Demetra|1997-06-02|
|  5|Lowrance|2006-07-02|
+---+--------+----------+   

"""

How to select columns in PySpark Azure Databricks using column index?

Since df.columns returns a plain Python list, you can use list indexing and slicing to select columns by position.

Examples:

# 1. Selecting the first column
df.select(df.columns[0]).show()

"""
Output:

+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
+---+

"""

# 2. Selecting all columns from the second column onwards
df.select(df.columns[1:]).show()

"""
Output:

+--------+----------+
|    name|       dob|
+--------+----------+
|  Sascha|1998-09-03|
|    Lise|2008-09-17|
|    Nola|2008-08-23|
| Demetra|1997-06-02|
|Lowrance|2006-07-02|
+--------+----------+

"""

# 3. Selecting every second column (a step of 2)
df.select(df.columns[::2]).show()

"""
Output:

+---+----------+
| id|       dob|
+---+----------+
|  1|1998-09-03|
|  2|2008-09-17|
|  3|2008-08-23|
|  4|1997-06-02|
|  5|2006-07-02|
+---+----------+

"""

How to select columns in reverse order in PySpark Azure Databricks using the select() function?

Slicing df.columns with a step of -1 reverses the column order.

Example:

df.select(df.columns[::-1]).show()

"""
Output:

+----------+--------+---+
|       dob|    name| id|
+----------+--------+---+
|1998-09-03|  Sascha|  1|
|2008-09-17|    Lise|  2|
|2008-08-23|    Nola|  3|
|1997-06-02| Demetra|  4|
|2006-07-02|Lowrance|  5|
+----------+--------+---+

"""

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download the notebook and import it into Databricks, Jupyter Notebook, etc.

When should you use the PySpark select() function in Azure Databricks?

These could be the possible reasons:

  1. The select() function is the most popular way to choose columns: one or multiple columns, nested columns, columns by index, all columns, columns from a list, or columns matching a regular expression (see the sketch after this list).
  2. Whenever you don’t want to retrieve all the columns from a DataFrame.
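For instance, here is a minimal sketch of the nested-column and regular-expression cases mentioned above. It reuses the spark session from earlier; the nested DataFrame and its columns are hypothetical, created only for this illustration. colRegex() takes a regular expression wrapped in backticks and returns the matching columns.

# Hypothetical DataFrame with a nested "address" struct, for illustration only
nested_df = spark.createDataFrame(
    [(1, "Sascha", ("Chennai", "TN"))],
    schema="id long, name string, address struct<city:string, state:string>"
)

# Selecting a nested column with dot notation
nested_df.select("id", "address.city").show()

# Selecting columns whose names match a regular expression (note the backticks)
nested_df.select(nested_df.colRegex("`^(id|name)$`")).show()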

Real World Use Case Scenarios for PySpark DataFrame select() function in Azure Databricks

  • Assume that you have uploaded customer data as a CSV file and created a DataFrame out of it. It contains 100 columns, but only 10 of them are relevant to your use case; with the select() function you can pick just those specific columns (see the sketch after this list).
  • Your DataFrame contains a huge number of records and many redundant and unwanted columns. By selecting only the columns you need, you reduce the amount of data processed and improve performance.
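Here is a rough sketch of the first scenario. The file path (customer_csv_path) and the column names are assumptions made up for this illustration:

# Hypothetical wide customer dataset; replace the path and column names with your own
customers_df = spark.read.format("csv").option("header", True).load(customer_csv_path)

# Keep only the columns relevant to the use case
relevant_columns = ["customer_id", "name", "email", "country"]
trimmed_df = customers_df.select(relevant_columns)
trimmed_df.show()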

What are the alternatives of the select() function in PySpark Azure Databricks?

There are multiple alternatives to the select() function, which are as follows:

  • selectExpr(): a variant of select() that takes SQL expressions as strings, giving you additional SQL functionality such as built-in functions and aliases.
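As a minimal sketch of selectExpr() against the DataFrame created earlier in this blog (upper() is a built-in Spark SQL function):

# selectExpr() accepts SQL expressions as strings
df.selectExpr("id", "upper(name) as name_upper").show()

"""
Output:

+---+----------+
| id|name_upper|
+---+----------+
|  1|    SASCHA|
|  2|      LISE|
|  3|      NOLA|
|  4|   DEMETRA|
|  5|  LOWRANCE|
+---+----------+

"""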

Final Thoughts

In this article, we have learned how to use the PySpark select() method to select the columns of a DataFrame in Azure Databricks, with clearly explained examples. I have also covered the different scenarios that could come up, using practical examples. I hope the information provided helped you gain knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.