How to use Row class of PySpark in Azure Databricks?

Are you looking to find out how to use the PySpark Row class in Azure Databricks, or are you looking for a way to create records using the Row class? If so, you have landed on the correct page. In this article, I will show you how to use the Row class of PySpark in Azure Databricks and explain it with practical examples. So, without wasting time, let’s start with a step-by-step guide to understanding how to use the Row class and create records with it in PySpark.

In this blog, I will teach you the following with practical examples:

  • What is PySpark Row
  • Creating new records using positional arguments
  • Creating new records using named arguments
  • Creating rows from Another Row object
  • Passing null values
  • Row class methods
  • Using Row on RDDs
  • Using Row on DataFrames

The PySpark Row class is used to create new records using either positional or named arguments.

Apache Spark Official Documentation Link: Row()

Gentle reminder:

In Databricks,

  • the SparkSession is made available as spark
  • the SparkContext is made available as sc

In case you want to create them manually, use the code below.

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("azurelib.com") \
    .getOrCreate()

sc = spark.sparkContext

How to create new rows using positional arguments of the Row class in PySpark on Azure Databricks?

Let’s see how to create new rows using positional arguments of the Row class in PySpark on Azure Databricks.

Example:

from pyspark.sql import Row

row = Row("Alex", "IT", 1000)
row

"""
Output:

<Row('Alex', 'IT', 1000)>

"""

How to create new rows using named arguments of the Row class in PySpark on Azure Databricks?

Let’s see how to create new rows using named arguments of the Row class in PySpark on Azure Databricks.

Example:

from pyspark.sql import Row

row = Row(name="Alex", dept="IT", salary=1000)
row

"""
Output:

Row(name='Alex', dept='IT', salary=1000)

"""

How to create new rows from an existing Row object in PySpark on Azure Databricks?

Let’s see how to create new rows from an existing Row object in PySpark on Azure Databricks.

Example:

from pyspark.sql import Row

emp = Row("name", "dept", "salary")
emp1 = emp("Alex", "IT", 1000)
emp2 = emp("Bruno", "Sales", 2000)
emp2

"""
Output:

Row(name='Bruno', dept='Sales', salary=2000)

"""

In the above example, the values of the emp Row object act as field names (keys) for the new records.

How to pass None values in the Row class of PySpark using Azure Databricks?

Let’s see how to pass None values in the Row class of PySpark using Azure Databricks. In this example, let’s try to pass None using both positional and named arguments and see the result.

Examples:

from pyspark.sql import Row

# Method 1: Using positional argument
null_row_1 = Row("Alex", None, 1000)
print(null_row_1)

"""
Output:

<Row('Alex', None, 1000)>

"""
from pyspark.sql import Row

# Method 2: Named arguments
null_row_2 = Row(name="Bruno", dept="Sales", salary=None)
print(null_row_2)

"""
Output:

Row(name='Bruno', dept='Sales', salary=None)

"""

How to access row values in PySpark on Azure Databricks?

In this section, we will learn how to access row values in PySpark on Azure Databricks. Let’s try to access the values inside a Row using various methods.

Example:

from pyspark.sql import Row

# a) Using index position

row = Row(name="Bruno", dept="Sales", salary=None)
print(row[0])

"""
Output:

Bruno

"""
from pyspark.sql import Row

# b) Using attribute access

row = Row(name="Bruno", dept="Sales", salary=None)
print(row.dept)

"""
Output:

Sales

"""
from pyspark.sql import Row

# c) Using a key, like a dictionary

row = Row(name="Bruno", dept="Sales", salary=None)

print(row['salary'])

"""
Output:

None

"""

As shown above, we can access the values inside a Row by index position, key, or attribute.

How to use Row class methods of PySpark using Azure Databricks?

Let’s see how to use the Row class methods of PySpark using Azure Databricks. We have three methods:

  • count() -> counts the occurrences of the specified value
  • index() -> returns the first index position of the specified value
  • asDict() -> converts the record into a dictionary (available on rows created with named arguments)

Examples:

# Method 1:

from pyspark.sql import Row
row = Row("Alex", "IT", 1000, 1000, None)
print(row.count(1000))

"""
Output:

2

"""
# Method 2:

from pyspark.sql import Row
row = Row("Alex", "IT", 1000, 1000, None)
print(row.index(None))

"""
Output:

4

"""
# Method 3:

from pyspark.sql import Row

row = Row(id=1, f_name="Shalini", l_name="Shree")
nested_row = Row(id=1, name=Row(f_name="Shalini", l_name="Shree"))

print(row.asDict() == {'id':1,'f_name':'Shalini', 'l_name':'Shree'})
print(nested_row.asDict() == {'id':1,'name':Row(f_name="Shalini", l_name="Shree")})
print(nested_row.asDict(recursive=True) == {'id':1,'name':{'f_name':'Shalini', 'l_name':'Shree'}})

"""
Output:

True
True
True

"""

In the above example, we compared the records against dictionary values. If a record is in a nested format, set recursive to True so that nested Row objects are also converted into dictionaries.

How to create a PySpark RDD using the Row class on Azure Databricks?

Let’s see how to create a PySpark RDD using the Row class on Azure Databricks.

Example:

from pyspark.sql import Row

employee_data = [
    Row("Alex", "IT", 1000),
    Row("Bruno", "Sales", 2000)
]

rdd = sc.parallelize(employee_data)
rdd.collect()

"""
Output:

[<Row('Alex', 'IT', 1000)>, <Row('Bruno', 'Sales', 2000)>]

"""

In the above code, we have used the collect() function to retrieve the records from the RDD and display them.

How to create a PySpark DataFrame using the Row class on Azure Databricks?

Let’s see how to create a PySpark DataFrame using the Row class on Azure Databricks.

Example:

from pyspark.sql import Row

employee_data = [
    Row(name="Alex", dept="IT", salary=1000),
    Row(name="Bruno", dept="Sales", salary=2000)
]

df = spark.createDataFrame(employee_data)
df.show()

"""
Output:

+-----+-----+------+
| name| dept|salary|
+-----+-----+------+
| Alex|   IT|  1000|
|Bruno|Sales|  2000|
+-----+-----+------+

"""

In the above code, we have used the show() function to show the records.

I have attached the complete code used in this blog in notebook format to this GitHub link. You can download and import this notebook into Databricks, Jupyter Notebook, etc.

When should you use the Row class for creating records in Azure Databricks?

These are the most common reasons:

  1. Creating a Row
  2. Creating RDD from the Row collection

Real World Use Case Scenarios for the PySpark Row class in Azure Databricks

Assume you were given a DataFrame and asked to check whether a record matches a given dictionary value. You can make use of the Row class methods (for example, asDict()) to perform this check, as mentioned above; see the sketch below.
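
Here is a minimal sketch of that idea, assuming the employee DataFrame from the earlier example and an expected dictionary value; the variable names employee_data and expected are only for illustration.

from pyspark.sql import Row

# Build the same employee DataFrame used in the earlier example
employee_data = [
    Row(name="Alex", dept="IT", salary=1000),
    Row(name="Bruno", dept="Sales", salary=2000)
]
df = spark.createDataFrame(employee_data)

# Dictionary value we want to match against (assumed for illustration)
expected = {"name": "Bruno", "dept": "Sales", "salary": 2000}

# Collect the rows to the driver and compare each record as a dictionary
matches = [row.asDict() == expected for row in df.collect()]
print(matches)

"""
Output:

[False, True]

"""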

What are the alternatives for creating PySpark records in Azure Databricks?

There are multiple alternatives for creating PySpark records, which are as follows (a short sketch of a couple of them is shown after the list):

  • From PySpark Row class
  • From Python List
  • From Python Tuple
  • From Python Dictionary
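
To illustrate a couple of those alternatives, here is a minimal sketch, assuming the same employee records and column names used throughout this article; the variable names are only for illustration.

# Alternative 1: from Python tuples plus a list of column names
df_from_tuples = spark.createDataFrame(
    [("Alex", "IT", 1000), ("Bruno", "Sales", 2000)],
    ["name", "dept", "salary"]
)
df_from_tuples.show()

# Alternative 2: from Python dictionaries (keys become column names);
# depending on your Spark version this may emit a schema-inference warning, but it works
df_from_dicts = spark.createDataFrame(
    [
        {"name": "Alex", "dept": "IT", "salary": 1000},
        {"name": "Bruno", "dept": "Sales", "salary": 2000}
    ]
)
df_from_dicts.show()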

Final Thoughts

In this article, we have learned about creating, accessing, and using a PySpark Row class in Azure Databricks along with the examples explained clearly. I have also covered different scenarios with practical examples that could be possible. I hope the information that was provided helped in gaining knowledge.

Please share your comments and suggestions in the comment section below and I will try to answer all your queries as time permits.

PySpark in Azure Databricks, as explained by Arud Seka Berne S on azurelib.com.

As a big data engineer, I design and build scalable data processing systems and integrate them with various data sources and databases. I have a strong background in Python and am proficient in big data technologies such as Hadoop, Hive, Spark, Databricks, and Azure. My interest lies in working with large datasets and deriving actionable insights to support informed business decisions.