Explained: Create a DataFrame in Databricks

Are you looking for a way to use DataFrames within Databricks, or searching for a step-by-step guide to creating one? If so, you have landed on the correct page. I will explain what a DataFrame is and how to use it, with practical examples. So, without wasting time, let's start the step-by-step guide to understanding DataFrames in Databricks.

What is a DataFrame in Databricks?

A DataFrame is a table of data organized into rows and columns in Databricks. It is a two-dimensional structure in which each column holds the values of one specific variable and each row holds one value from each column.

How can we create a DataFrame in Databricks?

An existing RDD can be used to create a Databricks DataFrame manually. To create a Spark RDD, we first call the parallelize() function on the SparkContext. All of the examples below require this RDD object.

from pyspark.sql import SparkSession

columns = ["Name","Age"]
data = [("Rajeev", "25"), ("Sham", "23"), ("Roj", "26")]

spark = SparkSession.builder.appName('azurelib.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)

  • Databricks allows us to create a DataFrame from an existing RDD by using its toDF() method. Called with no arguments, toDF() names the columns “_1” and “_2”, since an RDD has no column names of its own; passing our column list supplies proper names instead.
dfFromRDD1 = rdd.toDF(columns)
dfFromRDD1.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
  • It is also possible to create a DataFrame manually using createDataFrame() from SparkSession, which takes an RDD object as an argument, and chain it with toDF() to specify the column names.
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)

  • Another way to create a Databricks DataFrame manually is to call createDataFrame() from SparkSession with a list object as an argument, and then chain it with toDF() to specify the column names.
dfFromData = spark.createDataFrame(data).toDF(*columns)

  • In Databricks, createDataFrame() has another signature that takes a collection of Row objects and a schema of column names as arguments. First, we need to convert our “data” object from a list of tuples into a list of Row objects, as sketched below.
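A minimal sketch of this conversion, reusing the “data” list from above (the Row field names “name” and “age” are illustrative):

from pyspark.sql import Row

rowData = [Row(name=t[0], age=t[1]) for t in data]  # list of tuples -> list of Rows
dfFromRows = spark.createDataFrame(rowData)  # column names are taken from the Row fields
dfFromRows.printSchema()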

  • The schema is created first as a StructType, and the DataFrame is then built with this schema, which supplies the column names and data types.
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

data2 = [("Ravi","","Kumar","25441","M",3500),
    ("Rajesh","Nayak","","25778","M",4100)
  ]

schema = StructType([ 
    StructField("firstname",StringType(),True), 
    StructField("middlename",StringType(),True), 
    StructField("lastname",StringType(),True), 
    StructField("id", StringType(), True), 
    StructField("gender", StringType(), True), 
    StructField("salary", IntegerType(), True) 
  ])
 
df = spark.createDataFrame(data=data2,schema=schema)
df.printSchema()
df.show(truncate=False)

What is the Syntax for DataFrame in Databricks?

#Using RDD

from pyspark.sql import SparkSession

columns = ["Name","Age"]
data = [("Rajeev", "25"), ("Sham", "23"), ("Roj", "26")]

spark = SparkSession.builder.appName('azurelib.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)


#Using createDataFrame()

spark.createDataFrame(data).toDF(*columns)


DataFrame Argument Details:

  • data: the actual data
  • columns: the column names

Examples of creating Dataframe in Databricks:

#Example-1

from pyspark.sql import SparkSession

columns = ["Name","Age"]
data = [("Rajeev", "25"), ("Sham", "23"), ("Roj", "26")]

spark = SparkSession.builder.appName('azurelib.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)
dfFromRDD1 = rdd.toDF(columns)

As the example above shows, the DataFrame was created from an RDD built with parallelize() and converted with toDF().

#Example-2

from pyspark.sql import SparkSession

columns = ["Name","Age"]
data = [("Rajeev", "25"), ("Sham", "23"), ("Roj", "26")]

spark = SparkSession.builder.appName('azurelib.com').getOrCreate()

dfFromData = spark.createDataFrame(data).toDF(*columns)

As the example above shows, the DataFrame was created with the help of createDataFrame().

FULL Example of Dataframe in Databricks:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# In a Databricks notebook, the SparkSession is already available as `spark`

data2 = [("Ravi","","Kumar","25441","M",3500),
    ("Rajesh","Nayak","","25778","M",4100),
    ("Sashi","","Ragupudi","29885","F",4300),
    ("Rajya","laxmi","","28148","F",5000),
    ("Shiva","Ram","Raju","34718","M",-1)
  ]

schema = StructType([ 
    StructField("firstname",StringType(),True), 
    StructField("middlename",StringType(),True), 
    StructField("lastname",StringType(),True), 
    StructField("id", StringType(), True), 
    StructField("gender", StringType(), True), 
    StructField("salary", IntegerType(), True) 
  ])
 
df = spark.createDataFrame(data=data2,schema=schema)
df.printSchema()
df.show(truncate=False)


#OUTPUT

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|Ravi     |          |Kumar   |25441|M     |3500  |
|Rajesh   |Nayak     |        |25778|M     |4100  |
|Sashi    |          |Ragupudi|29885|F     |4300  |
|Rajya    |laxmi     |        |28148|F     |5000  |
|Shiva    |Ram       |Raju    |34718|M     |-1    |
+---------+----------+--------+-----+------+------+


How to create DataFrame from CSV?

Create a DataFrame from a CSV file using the csv() method of the DataFrameReader. You can also specify which delimiter to use, whether the data is quoted, date formats, schema inference, and many other options.

df2 = spark.read.csv("/src/resources/file.csv")
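
A minimal sketch with a few common reader options (the path is illustrative):

df2 = spark.read \
    .option("header", True) \
    .option("delimiter", ",") \
    .option("inferSchema", True) \
    .csv("/src/resources/file.csv")  # first line as header, comma-delimited, column types inferred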

How to create DataFrame from TXT?

In the same way, you can create a DataFrame from a Text file by using the text() method of the DataFrameReader.  

df2 = spark.read.text("/src/resources/file.txt")
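
Note that text() reads each line of the file as a row in a single string column named “value”; a quick check, reusing the illustrative path:

df2 = spark.read.text("/src/resources/file.txt")
df2.printSchema()  # root |-- value: string (nullable = true)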


How to create DataFrame from JSON?

Databricks can also handle semi-structured data files, such as JSON. You can read a JSON file into a DataFrame using the json() method of the DataFrameReader. Here is an example.

df2 = spark.read.json("/src/resources/file.json")
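
By default, json() expects one JSON object per line; if an object spans multiple lines, the multiLine reader option handles it. A minimal sketch (the path is illustrative):

df2 = spark.read.option("multiLine", True).json("/src/resources/file.json")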

Similarly, we can also read other formats as well.

When should you use DataFrames in Databricks?

There are certain use case scenarios in which it is recommended to use DataFrames within Databricks, as follows:

  • If we want to represent a combination of columns and data that looks like a table, we can make use of a DataFrame. After creating the DataFrame, we can also apply filters to the data, as sketched below.
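
A minimal sketch of filtering, reusing the df built in the FULL example above:

df.filter(df.salary > 4000).show()  # keep only rows whose salary exceeds 4000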

Real-World Use Case Scenarios for DataFrames in Databricks

  • An online railway ticketing system stores passenger details such as ID, name, age, and phone number under named columns.
  • A dealer keeps the details of customers who purchased products in a table, which maps naturally to a DataFrame.


Final Thoughts

In this article, we have learned about DataFrames and their uses, with clearly explained examples. I have also covered different possible scenarios with practical examples. I hope the information provided has helped you gain knowledge of the topic.

Please share your comments and suggestions in the comment section below, and I will try to answer all your queries as time permits.