Databricks Certified Associate Developer for Apache Spark 3.0 Preparation Guide

Are you planning to take the Databricks certification, or looking for material and a preparation guide that shows how to crack the exam efficiently and effectively? If your answer is yes, then you have landed on the right page. In this article I will take you through all the important tips and tricks you need to cover before going for your certification exam. I will also point you toward the right material for preparation and some sample questions to work on. Let's dive into it now.

What is the Databricks Certified Associate Developer for Apache Spark 3.0 Exam?

Like other IT giants such as Microsoft, Oracle, and AWS, Databricks has launched its own certification program. The Databricks certification path does not yet offer many certifications, and for data engineers this is one of the most popular ones in the Databricks community. Hence, if you are a data engineer who works day in and day out in Databricks using either Scala or Python, this certification is meant for you. It is an associate-level certification and is a good fit for professionals with roughly 4 to 10 years of experience in the IT industry.

Azure Databricks Interview Questions and Answers

Azure Databricks Tutorial

Prerequisites

There is no prerequisite for attempting the Databricks certification exam; however, there are a couple of recommendations shared by Databricks itself. They are as follows:

  • The exam tests your Spark skills, and you can opt for either Scala or Python. It is therefore recommended that you have at least 6 months of experience using the Spark DataFrame API.
  • Candidates should have a basic understanding of the Spark architecture, how it is laid out, and how you can use Databricks both interactively and in scheduled (job) mode.
  • You should know at least the basic DataFrame functionality: selecting, filtering, transforming, and joining DataFrames. You should also be comfortable with Spark SQL, writing your own functions (UDFs), and have sound knowledge of partitioning (see the sketch after this list).
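
If any of these basics feel rusty, here is a minimal PySpark sketch that touches each of them. It is only an illustrative warm-up: the DataFrames, column names, and values (storesDF, salesDF, storeId, sqft, state, amount) are assumptions made up for this example, not part of the exam.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("spark-cert-prep").getOrCreate()

# Hypothetical input data, just to have something to work with
storesDF = spark.createDataFrame(
    [(1, 12000, "NY"), (2, 30000, "CA")], ["storeId", "sqft", "state"]
)
salesDF = spark.createDataFrame([(1, 500.0), (2, 900.0)], ["storeId", "amount"])

# Selecting, filtering, and transforming
bigStoresDF = storesDF.select("storeId", "sqft").filter(col("sqft") > 20000)

# Joining two DataFrames
joinedDF = storesDF.join(salesDF, on="storeId", how="inner")

# Spark SQL on a temporary view
storesDF.createOrReplaceTempView("stores")
spark.sql("SELECT state, COUNT(*) AS cnt FROM stores GROUP BY state").show()

# A simple user-defined function (UDF)
sizeLabel = udf(lambda sqft: "large" if sqft > 20000 else "small", StringType())
storesDF.withColumn("sizeLabel", sizeLabel(col("sqft"))).show()

# Basic control over partitioning
print(joinedDF.repartition(4).rdd.getNumPartitions())

If every line here already feels familiar, you are in good shape for the DataFrame portion of the exam.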

Databricks Certification Exam Details

  • The exam costs $200 as of this writing.
  • There are 60 questions in total, all of them multiple choice.
  • The time limit is 120 minutes; you have to finish the exam within that window.
  • To earn the certification you need a score of 70%, which means answering 42 of the 60 questions correctly.
  • Needless to say, the exam is conducted online and proctored: you are monitored throughout and are not allowed to use any outside reference material.
  • The good news is that you are provided with the Spark documentation for the language you selected; if you opted for Scala you get access to the Scala API documentation for Spark, and if you chose Python you get the Python API documentation.

Syllabus for the Exam

Databricks has not defined a crystal-clear syllabus the way AWS or Microsoft Azure certifications do. However, they have outlined the important topics around which the exam is built. These are as follows:

  • The exam consists of 60 multiple-choice questions. There are three main categories:
  • Spark Architecture: Conceptual understanding (~17%)
  • Spark Architecture: Applied understanding (~11%)
  • Spark DataFrame API Applications (~72%)
  • Officially, Databricks has provided only this much information about the syllabus, but based on the experience of various candidates, the questions are mostly asked around the following topics:
  • Manipulating columns, filtering data, dropping columns, sorting data, DataFrame aggregation, handling missing data, combining DataFrames, reading DataFrames, writing DataFrames, DataFrame partitioning, and reading DataFrames with a schema (a short sketch of these operations follows this list).
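
To make these topics concrete, here is a quick PySpark tour that exercises most of them in a few lines. Everything in it (the /tmp/stores_out path, the storesDF columns, and the values) is an assumption made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col
from pyspark.sql.types import DoubleType, IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

storesDF = spark.createDataFrame(
    [(1, 12000.0, None), (2, 30000.0, "CA"), (3, None, "NY")],
    ["storeId", "sqft", "state"],
)

# Writing a DataFrame and controlling the number of output partitions
storesDF.repartition(2).write.mode("overwrite").option("header", True).csv("/tmp/stores_out")

# Reading a DataFrame back with an explicit schema
schema = StructType([
    StructField("storeId", IntegerType(), True),
    StructField("sqft", DoubleType(), True),
    StructField("state", StringType(), True),
])
readDF = spark.read.schema(schema).option("header", True).csv("/tmp/stores_out")

# Manipulating, filtering, dropping, and sorting columns
resultDF = (readDF
            .withColumn("sqft100", col("sqft") / 100)   # add a derived column
            .filter(col("sqft") <= 25000)                # filter rows
            .drop("sqft")                                # drop a column
            .sort(col("sqft100").desc()))                # sort

# Handling missing data and aggregating
aggDF = storesDF.na.drop("any").groupBy("state").agg(avg("sqft").alias("avgSqft"))

# Combining two DataFrames with the same schema
combinedDF = storesDF.union(storesDF)

If you can write each of these operations without looking them up, the DataFrame API portion of the exam (roughly 72% of the questions) should feel comfortable.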

Topics not included in the exam:

  • GraphX API
  • Spark MLlib API
  • Spark Streaming
Preparation Guide

Study Material

  • It is highly recommended to create a free trial account and go through this video course: Free Video
  • Books: Since the certification revolves entirely around Spark and you need to master its concepts, the following books are very useful. Both e-book and print versions are available.

Spark Book 1

Spark Book 2

Practice Questions for Databricks Certified Associate Developer for Apache Spark 3.0

Question 1: Which of the following operations can be used to split an array column into an individual DataFrame row for each element in the array?


A. extract()
B. split()
C. explode()
D. arrays_zip()
E. unpack()

Question 2: Which of the following operations can be used to convert a DataFrame column from one type to another type?

A. col().cast()
B. convert()
C. castAs()
D. col().coerce()
E. col()

Question 3: Which of the following code blocks returns a new DataFrame where column storeCategory is an all-lowercase version of column storeCategory in DataFrame storesDF? Assume DataFrame storesDF is the only defined language variable.

A. storesDF.withColumn("storeCategory", lower(col("storeCategory")))
B. storesDF.withColumn("storeCategory", col("storeCategory").lower())
C. storesDF.withColumn("storeCategory", tolower(col("storeCategory")))
D. storesDF.withColumn("storeCategory", lower("storeCategory"))
E. storesDF.withColumn("storeCategory", lower(storeCategory))

Question 4: Which of the following code blocks returns a new DataFrame from DataFrame storesDF where column numberOfManagers is the constant integer 1?

A. storesDF.withColumn("numberOfManagers", col(1))
B. storesDF.withColumn("numberOfManagers", 1)
C. storesDF.withColumn("numberOfManagers", lit(1))
D. storesDF.withColumn("numberOfManagers", lit("1"))
E. storesDF.withColumn("numberOfManagers", IntegerType(1))

Question 5: The code block shown below contains an error. The code block is intended to return a new DataFrame where column division from DataFrame storesDF has been renamed to column state and column managerName from DataFrame storesDF has been renamed to column managerFullName. Identify the error. Code block: (storesDF.withColumnRenamed("state", "division").withColumnRenamed("managerFullName", "managerName"))

A. Both arguments to operation withColumnRenamed() should be wrapped in the col() operation.
B. The operations withColumnRenamed() should not be called twice, and the first argument should be ["state", "division"] and the second argument should be ["managerFullName", "managerName"].
C. The old columns need to be explicitly dropped.
D. The first argument to operation withColumnRenamed() should be the old column name and the second argument should be the new column name.
E. The operation withColumnRenamed() should be replaced with withColumn().

Question 6 Which of the following operations fails to return a DataFrame where every row is unique?

A. DataFrame.distinct()
B. DataFrame.drop_duplicates(subset = None)
C. DataFrame.drop_duplicates()
D. DataFrame.dropDuplicates()
E. DataFrame.drop_duplicates(subset = "all")

Question 7: Which of the following code blocks returns a DataFrame where rows in DataFrame storesDF containing missing values in every column have been dropped?

A. storesDF.nadrop("all")
B. storesDF.na.drop("all", subset = "sqft")
C. storesDF.dropna()
D. storesDF.na.drop()
E. storesDF.na.drop("all")

Question 8: Which of the following describes a broadcast variable?

A. A broadcast variable is a Spark object that needs to be partitioned onto multiple worker nodes because it’s too large to fit on a single worker node.
B. A broadcast variable can only be created by an explicit call to the broadcast() operation.
C. A broadcast variable is entirely cached on the driver node so it doesn’t need to be present on any worker nodes.
D. A broadcast variable is entirely cached on each worker node so it doesn’t need to be shipped or shuffled between nodes with each stage.
E. A broadcast variable is saved to the disk of each worker node to be easily read into memory when needed.

Question 9 Which of the following statements about the Spark driver is incorrect?

A. The Spark driver is the node in which the Spark application’s main method runs to coordinate the Spark application.
B. The Spark driver is horizontally scaled to increase overall processing throughput.
C. The Spark driver contains the SparkContext object.
D. The Spark driver is responsible for scheduling the execution of data by various worker nodes in cluster mode.
E. The Spark driver should be as close as possible to worker nodes for optimal performance.

Question 10: Which of the following statements about slots is true?

A. There must be more slots than executors.
B. There must be more tasks than slots.
C. Slots are the most granular level of execution in the Spark execution hierarchy.
D. Slots are not used in cluster mode.
E. Slots are resources for parallelization within a Spark application

Question 11: Which of the following describes nodes in cluster-mode Spark?

A. Nodes are the most granular level of execution in the Spark execution hierarchy.
B. There is only one node and it hosts both the driver and executors.
C. Nodes are another term for executors, so they are processing engine instances for performing computations.
D. There are driver nodes and worker nodes, both of which can scale horizontally.
E. Worker nodes are machines that host the executors responsible for the execution of tasks

Question 12: Which of the following is a combination of a block of data and a set of transformers that will run on a single executor?

A. Executor
B. Node
C. Job
D. Task
E. Slot

Question 13: Which of the following operations will trigger evaluation?

A. DataFrame.filter()
B. DataFrame.distinct()
C. DataFrame.intersect()
D. DataFrame.join()
E. DataFrame.count()

Question 14: Which of the following is a group of tasks that can be executed in parallel to compute the same set of operations on potentially multiple machines?

A. Job
B. Slot
C. Executor
D. Task
E. Stage

Question 15 Which of the following is a combination of a block of data and a set of transformers that will run on a single executor?

A. Executor
B. Node
C. Job
D. Task
E. Slot

Question 16 Which of the following describes a shuffle?

A. A shuffle is the process by which data is compared across partitions.
B. A shuffle is the process by which data is compared across executors.
C. A shuffle is the process by which partitions are allocated to tasks.
D. A shuffle is the process by which partitions are ordered for write.
E. A shuffle is the process by which tasks are ordered for execution.

Question 17 DataFrame df is very large with a large number of partitions, more than there are executors in the cluster. Based on this situation, which of the following is incorrect? Assume there is one core per executor.

A. Performance will be suboptimal because not all executors will be utilized at the same time.
B. Performance will be suboptimal because not all data can be processed at the same time.
C. There will be a large number of shuffle connections performed on DataFrame df when operations inducing a shuffle are called.
D. There will be a lot of overhead associated with managing resources for data processing within each task.
E. There might be risk of out-of-memory errors depending on the size of the executors in the cluster.

Question 18: Which of the following describes the difference between transformations and actions?

A. Transformations work on DataFrames/Datasets while actions are reserved for native language objects.
B. There is no difference between actions and transformations.
C. Actions are business logic operations that do not induce execution while transformations are execution triggers focused on returning results.
D. Actions work on DataFrames/Datasets while transformations are reserved for native language objects.
E. Transformations are business logic operations that do not induce execution while actions are execution triggers focused on returning results.

Question 19: Which of the following DataFrame operations is always classified as a narrow transformation?

A. DataFrame.sort()
B. DataFrame.distinct()
C. DataFrame.repartition()
D. DataFrame.select()
E. DataFrame.join()

Question 20: Spark has a few different execution/deployment modes: cluster, client, and local. Which of the following describes Spark’s execution/deployment mode?

A. Spark’s execution/deployment mode determines where the driver and executors are physically located when a Spark application is run
B. Spark’s execution/deployment mode determines which tasks are allocated to which executors in a cluster
C. Spark’s execution/deployment mode determines which node in a cluster of nodes is responsible for running the driver program
D. Spark’s execution/deployment mode determines exactly how many nodes the driver will connect to when a Spark application is run
E. Spark’s execution/deployment mode determines whether results are run interactively in a notebook environment or in batch

Question 21: Which of the following describes out-of-memory errors in Spark?

A. An out-of-memory error occurs when either the driver or an executor does not have enough memory to collect or process the data allocated to it.
B. An out-of-memory error occurs when Spark’s storage level is too lenient and allows data objects to be cached to both memory and disk.
C. An out-of-memory error occurs when there are more tasks than are executors regardless of the number of worker nodes.
D. An out-of-memory error occurs when the Spark application calls too many transformations in a row without calling an action regardless of the size of the data object on which the transformations are operating.
E. An out-of-memory error occurs when too much data is allocated to the driver for computational purposes.

Question 22: Which of the following is the default storage level for persist() for a non-streaming DataFrame/Dataset?

A. MEMORY_AND_DISK
B. MEMORY_AND_DISK_SER
C. DISK_ONLY
D. MEMORY_ONLY_SER
E. MEMORY_ONLY

Question 23 Which of the following operations is most likely to induce a skew in the size of your data’s partitions?

A. DataFrame.collect()
B. DataFrame.cache()
C. DataFrame.repartition(n)
D. DataFrame.coalesce(n)
E. DataFrame.persist()

Question 24 Which of the following data structures are Spark DataFrames built on top of?

A. Arrays
B. Strings
C. RDDs
D. Vectors
E. SQL Tables

Question 25 Which of the following code blocks returns a DataFrame containing only column storeId and column division from DataFrame storesDF?

A. storesDF.select("storeId").select("division")
B. storesDF.select(storeId, division)
C. storesDF.select("storeId", "division")
D. storesDF.select(col("storeId", "division"))
E. storesDF.select(storeId).select(division)

Question 26: The code block shown below contains an error. The code block is intended to return a DataFrame containing only the rows from DataFrame storesDF where the value in DataFrame storesDF's "sqft" column is less than or equal to 25,000. Assume DataFrame storesDF is the only defined language variable. Identify the error. Code block: storesDF.filter(sqft <= 25000)

A. The column name sqft needs to be quoted like storesDF.filter("sqft" <= 25000).
B. The column name sqft needs to be quoted and wrapped in the col() function like storesDF.filter(col("sqft") <= 25000).
C. The sign in the logical condition inside filter() needs to be changed from <= to >.
D. The sign in the logical condition inside filter() needs to be changed from <= to >=.
E. The column name sqft needs to be wrapped in the col() function like storesDF.filter(col(sqft) <= 25000).

Question 27: The code block shown below should return a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 OR the value in column customerSatisfaction is greater than or equal to 30. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: storesDF.__1__(__2__ __3__ __4__)

A. 1. filter 2. (col("sqft") <= 25000) 3. | 4. (col("customerSatisfaction") >= 30)
B. 1. drop 2. (col(sqft) <= 25000) 3. | 4. (col(customerSatisfaction) >= 30)
C. 1. filter 2. col("sqft") <= 25000 3. | 4. col("customerSatisfaction") >= 30
D. 1. filter 2. col("sqft") <= 25000 3. or 4. col("customerSatisfaction") >= 30
E. 1. filter 2. (col("sqft") <= 25000) 3. or 4. (col("customerSatisfaction") >= 30)

Question 28 Which of the following code blocks returns a new DataFrame with a new column sqft100 that is 1/100th of column sqft in DataFrame storesDF? Note that column sqft100 is not in the original DataFrame storesDF.

A. storesDF.withColumn("sqft100", col("sqft") * 100)
B. storesDF.withColumn("sqft100", sqft / 100)
C. storesDF.withColumn(col("sqft100"), col("sqft") / 100)
D. storesDF.withColumn("sqft100", col("sqft") / 100)
E. storesDF.newColumn("sqft100", sqft / 100)

Correct Answers

  1. C
  2. A
  3. A
  4. C
  5. D
  6. E
  7. E
  8. D
  9. B
  10. E
  11. E
  12. D
  13. E
  14. E
  15. D
  16. A
  17. A
  18. E
  19. D
  20. A
  21. A
  22. A
  23. D
  24. C
  25. C
  26. B
  27. A
  28. D
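
To reinforce a few of the DataFrame API answers above, here are minimal, hypothetical PySpark snippets. The storesDF stand-in and its columns are invented solely so the lines run; they only mirror the column names used in the questions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, lit, lower

spark = SparkSession.builder.getOrCreate()

# A tiny stand-in for storesDF; the columns are assumed for illustration only
storesDF = spark.createDataFrame(
    [(1, 24000.0, "GROCERY", "Ana Diaz", "West", 25, ["a", "b"])],
    ["storeId", "sqft", "storeCategory", "managerName", "division",
     "customerSatisfaction", "employees"],
)

# Q1: explode() produces one row per element of an array column
explodedDF = storesDF.withColumn("employee", explode(col("employees")))

# Q2: col().cast() converts a column from one type to another
castDF = storesDF.withColumn("sqft", col("sqft").cast("integer"))

# Q3 and Q4: withColumn() with lower() and lit()
lowerDF = storesDF.withColumn("storeCategory", lower(col("storeCategory")))
constDF = storesDF.withColumn("numberOfManagers", lit(1))

# Q5: withColumnRenamed(oldName, newName) takes the old name first
renamedDF = (storesDF.withColumnRenamed("division", "state")
                     .withColumnRenamed("managerName", "managerFullName"))

# Q7: drop only the rows where every column is missing
cleanDF = storesDF.na.drop("all")

# Q22: persist() on a DataFrame defaults to MEMORY_AND_DISK in Spark 3.0
storesDF.persist()
print(storesDF.storageLevel)

# Q27: an OR condition inside filter() uses | with each predicate parenthesized
filteredDF = storesDF.filter((col("sqft") <= 25000) | (col("customerSatisfaction") >= 30))

# Q28: column arithmetic with withColumn()
sqftDF = storesDF.withColumn("sqft100", col("sqft") / 100)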

Final Thoughts

Databricks Certified Associate Developer for Apache Spark 3.0 is one of the most popular exams for certifying your Spark big data analytics capabilities, and it can help you gain and validate your Spark skills. If you have some time to spend, you should go for it. I have shared the recipe to crack the exam, and you probably will not need any dumps to clear it. If you have religiously followed the syllabus and my recommendations, I can assure you that you will pass.

Good luck! In case you have any queries or are looking for some extra material, please feel free to drop a comment in the section below and I will definitely get back to you.

DeepakGoyal

Deepak Goyal is a certified Azure Cloud Solution Architect. He has around a decade and a half of experience in designing, developing, and managing enterprise cloud solutions. He is also a certified big data professional and a passionate cloud advocate.