Azure Data Engineer Interview Questions and Answers

In this Azure Data Engineer series, I am going to share the most frequently asked interview questions for freshers and for experienced professionals with anywhere from 2 to 10+ years of experience. Azure Data Engineer is one of the hottest job profiles in today’s cloud era, so if you are planning a career as an Azure data engineer, this is a must-read article. This set of questions focuses on real-time, real-world scenario-based questions, questions for freshers, questions for experienced professionals, questions for data engineers and developers, and questions for Azure data architects.

Real Time Scenario based Azure Data engineer Interview Questions and Answers


Assume that you are working as a data engineer lead. You want to check how many SQL requests have been executed so far in the SQL pool by your team. Where can you go to see all the historical queries executed in the Azure Synapse Analytics workspace?

Whenever you run a query in Azure Synapse Analytics, a history log entry is created for it. Go to the Monitor hub, where you can see the SQL requests. You can filter by time and see all the executions along with their results.

Assume that you have newly started working for XYZ company. Your manager has asked you to write some business analytics logic in an Azure Databricks notebook, reusing some generic functionality code developed by another team member. How would you do it?

I would highly recommend going through this quick video as well, as it explains the Azure Databricks concepts very well, which will definitely help you in the interview.


Whenever we want to reuse code in Databricks, we need to import that code into our notebook. This can be done in two ways. If the code is available in the same workspace, we can import it directly (for example, with the %run magic command). If the code lives outside the workspace, we need to package it as a jar or module, install it on the Databricks cluster, and then import it in the notebook.
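As a toy illustration of the same-workspace case, assume the teammate’s shared code exposes a helper like the one below (the function name and logic are hypothetical); once the shared notebook or library is imported, the business notebook simply calls it:

```python
# Hypothetical shared helper written by another team member. In Databricks
# this would live in a shared notebook (pulled in via the %run magic, e.g.
# `%run /Shared/common/utils`) or in a library installed on the cluster.
def clean_amount(raw: str) -> float:
    """Strip currency symbols and thousands separators, return a float."""
    return float(raw.replace("$", "").replace(",", ""))

# The business analytics logic then just reuses the generic function:
total = sum(clean_amount(v) for v in ["$1,200.50", "$99.50"])
print(total)  # 1300.0
```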

Assume that you have around 1 TB of data stored in Azure Blob Storage, spread across multiple CSV files. You are asked to apply a couple of transformations to this data as per the business logic before moving it into the staging container. How would you plan and architect a solution for this scenario? Explain in detail.

First of all, we need to analyze the situation. If you look closely at the size of the data, you will find that it is very large. Directly transforming such a huge volume of data would be a cumbersome and time-consuming process, so we should think about a big data processing mechanism where we can leverage the advantages of parallel and distributed computing. Here we have two choices.

  1. We can use Hadoop MapReduce through the HDInsight service for the transformation.
  2. We can use Spark through Azure Databricks to transform data at this scale.

Out of these two, Spark on Azure Databricks is the better choice because Spark is much faster than Hadoop thanks to in-memory computation. So let’s choose Azure Databricks.

Next, we need to create a pipeline in Azure Data Factory. The pipeline should use a Databricks Notebook activity.

We can write all the business transformation logic in the Spark notebook. The notebook can be written in Python, Scala, R, or SQL.

When you execute the pipeline, it triggers the Azure Databricks notebook, and the transformation logic you defined in the notebook runs. In the notebook itself, you can write the logic to store the output in the staging area of the blob storage.

That’s how you can solve the problem statement.
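The notebook body can be sketched in PySpark roughly as follows (a non-runnable sketch: the `spark` session is provided by Databricks, and the container names, account, and filter logic are hypothetical placeholders):

```python
# Read all CSV files from the source container (path is a placeholder).
df = (spark.read
          .option("header", "true")
          .csv("wasbs://raw@<account>.blob.core.windows.net/input/"))

# Business-logic transformations go here; this filter is just a placeholder.
transformed = df.filter(df["amount"].isNotNull())

# Write the result to the staging container as the final notebook step.
(transformed.write
            .mode("overwrite")
            .csv("wasbs://staging@<account>.blob.core.windows.net/out/"))
```

Because Spark partitions the 1 TB input across the cluster, each of these steps runs in parallel on all nodes.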

Assume that you are working as a data engineer for Azurelib.com. Your application stores its data in Azure Blob Storage. The application generates some reports which need to be accessible to third-party applications, but only for the next 7 days. After that, access to the reports should be disallowed automatically. How could you solve this problem?

The application is generating the data into Azure Blob Storage. Azure Storage provides shared access signature (SAS) tokens. We can create a SAS token for these reports and set its validity period to the next 7 days. We then share this SAS token with the other applications so that they can use it to fetch the reports. After 7 days, the token expires automatically and no one will be able to access the reports with it.
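A small sketch of the expiry side of this: a SAS token carries its expiry in the `se` query parameter (ISO 8601, UTC), which the SDK (e.g. `generate_blob_sas` in the `azure-storage-blob` package) computes from a datetime you pass in. The fixed start date below is only for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assume the reports are published on this (illustrative) date; real code
# would use datetime.now(timezone.utc) instead.
published = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Token validity window: exactly the next 7 days.
expiry = published + timedelta(days=7)

# This is the value that ends up in the SAS token's `se` parameter.
print(expiry.strftime("%Y-%m-%dT%H:%M:%SZ"))  # 2024-01-08T00:00:00Z
```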

Assume that you are working as the Azure data engineer for Azurelib.com. Your organization has decided to move to the cloud. You have data in SQL Server and MongoDB, plus log file data, in the existing on-premises datacenter. What different Azure services might you need as part of the migration?

I would need the following services.

  • Azure Data Lake Storage: to keep the log file data that we have to move to the cloud.
  • Azure Data Factory: to create the scheduled pipelines that pull the data from on-premises to the cloud.
  • Azure Databricks: could be used depending on the business requirement (tentative).
  • Azure SQL Database: to migrate the on-premises SQL Server to a cloud-based SQL database.
  • Azure Cosmos DB with the MongoDB API: to move MongoDB to Cosmos DB, which is a better-managed alternative to plain MongoDB.
  • Azure Key Vault: to keep all the secrets, such as passwords and server URLs.
  • Azure Active Directory: for user management.
  • Azure Log Analytics: for logging and monitoring the various activities.
  • Azure DevOps: to automate the CI/CD pipelines.

Assume that you are working as the Azure data engineer lead at Azurelib.com. You have been asked to review an Azure Function written by one of the team members. How would you do the code review with cost optimization in mind?

For an Azure Function, the cost is mainly driven by two factors. The first is the memory a single execution consumes. The second is the total execution time it takes. Azure Function cost on the consumption plan is based on these two factors, so when you review the function, check it against both from a cost optimization perspective.
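A back-of-the-envelope model of those two factors (the rates below are assumed for illustration, not current Azure pricing; the consumption plan also has free grants and minimum-billing rules that this sketch ignores):

```python
# Consumption-plan cost is roughly: GB-seconds (memory x duration x runs)
# at one rate, plus a flat per-million-executions charge.
PRICE_PER_GB_SECOND = 0.000016        # assumed illustrative rate, USD
PRICE_PER_MILLION_EXECUTIONS = 0.20   # assumed illustrative rate, USD

def monthly_cost(memory_gb: float, seconds_per_run: float, runs: int) -> float:
    gb_seconds = memory_gb * seconds_per_run * runs
    return (gb_seconds * PRICE_PER_GB_SECOND
            + (runs / 1_000_000) * PRICE_PER_MILLION_EXECUTIONS)

# 1M runs at 512 MB for 2 s each: halving either memory or duration
# halves the dominant GB-second term, which is the point of the review.
print(round(monthly_cost(0.5, 2.0, 1_000_000), 2))  # 16.2
```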

You are a junior data engineer working at Azurelib.com. Assume that a couple of existing Logic Apps workflows are failing for some reason. You have been asked to investigate; where can you find the details about these failures?

I would go to the Logic App resource in the Azure portal. In the Logic App overview section, you will see the Runs history tab. Click on it, and you can review the entire run history of the logic app.

Azure Data Engineer Technical Interview Questions

What is the Azure Data Factory service?

Azure Data Factory is the service that helps you perform data migrations and orchestrate work. For example, if you want to move data from on-premises to the cloud, whether incrementally or as a lift-and-shift, you can leverage Azure Data Factory to do the work. It supports around a hundred different data sources from which we can pull data.

What is a pipeline in the Azure data factory?

A pipeline in Azure Data Factory is a single workflow, a set of activities connected in a specific order to serve a business purpose. Every pipeline has a unique name within the data factory, and based on the requirement, you can schedule the pipeline for execution. We can also track the history of pipeline executions and check for any errors that occurred during a run.

What is the linked service in the Azure data factory?

A linked service is the Azure Data Factory component used to make a connection: to connect to any data source, you first have to create a linked service of the matching type. Different linked services take different parameters. For example, a SQL Server linked service needs the server name, username, and password, whereas an Azure Blob Storage linked service needs the storage account details.

What is the dataset in the Azure Data factory?

A dataset is needed to read or write data from any data source using ADF. A dataset is a representation of the type and structure of the data held by the data source.

What are the parameters in the ADF?

There are parameters at multiple levels in ADF. Parameters are used to make ADF components dynamic rather than static. For example, using a parameter, the same dataset can be used to read multiple tables. The different parameter levels are as follows:

Linked Service Parameters

Dataset Parameters

Pipeline Parameters

Global Parameters
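For example, a dataset parameter looks roughly like this in the dataset’s JSON definition (the dataset and parameter names are hypothetical; `@dataset().TableName` is the expression syntax for referencing the parameter):

```json
{
  "name": "GenericSqlTable",
  "properties": {
    "type": "AzureSqlTable",
    "parameters": {
      "TableName": { "type": "string" }
    },
    "typeProperties": {
      "tableName": {
        "value": "@dataset().TableName",
        "type": "Expression"
      }
    }
  }
}
```

Each pipeline activity that uses this dataset then supplies its own `TableName` value, so one dataset serves many tables.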

How can you schedule a pipeline in ADF?

Sometimes you have a situation where you want to execute the same pipeline again and again on a schedule or in response to an event; then you can schedule the pipeline using triggers. There are multiple types of triggers available, such as time-based triggers (schedule and tumbling window) and storage event-based triggers.

How can you check the history of pipeline runs in ADF?

In Azure Data Factory, we can check the pipeline runs by going to the Monitor tab. There we can search the entire pipeline run history, filtering by various parameters such as the name of the pipeline, the time window, the status of the run (succeeded/failed), and so on.

Why would you use data flows in Azure Data Factory?

Data flow is used for no-code transformations. For example, when you are doing an ETL operation, you may want to apply a couple of transformations and some logic to your input data. You may not find it comfortable to write the query, or, when your input is files, you cannot write a query at all. In that case, data flows come to the rescue. Using a data flow, you can build almost all your business logic with drag and drop, without writing any code. Behind the scenes, the data flow is converted into Spark code and runs on a cluster.

What is the foreach activity in the data factory?

In Azure Data Factory, whenever you have to do some work repetitively, you will probably use the foreach activity. You pass an array to the foreach activity, and the loop runs once for each item in the array. As of now, nesting is not allowed, which means you cannot place one foreach activity inside another foreach activity.

 What is the get metadata activity in the Azure data factory?

In Azure Data Factory, the get metadata activity is used to get information about files. For example, you may want to know the name of a file, the size of the file, or its last modified date. You can retrieve all such metadata about a file using the get metadata activity.

What is the custom activity in the Azure data factory?

The custom activity is used in Azure Data Factory to execute a Python or PowerShell script. Assume that you have some code written in Python or PowerShell and you want to execute it as part of your pipeline; then you can use a custom activity to run that code.

What is the if activity in the Azure data factory?

The if activity is a control activity in Azure Data Factory. In some cases, you want to decide which step to take based on a condition; then you can use the if activity. You pass a Boolean expression to the if activity: if it evaluates to true, the activities in the true branch run; otherwise, the activities in the false branch run.

How can you connect Azure Data Factory with Azure Databricks?

To connect to Azure Databricks, we have to create a linked service that points to the Azure Databricks workspace. Then, in the pipeline, you use a Notebook activity, providing the linked service created for Databricks along with the path of the notebook inside the Azure Databricks workspace. That is how you can use Databricks from Data Factory.
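In the pipeline JSON, the resulting Notebook activity looks roughly like this (the linked service name and notebook path are hypothetical):

```json
{
  "name": "RunTransformNotebook",
  "type": "DatabricksNotebook",
  "linkedServiceName": {
    "referenceName": "AzureDatabricksLinkedService",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "notebookPath": "/Shared/transformations/daily_load"
  }
}
```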

Is it possible to connect to MongoDB from the Azure data factory?

Yes, it is possible to connect to MongoDB from Azure Data Factory. You have to provide the proper connection information for the MongoDB server. If the MongoDB server resides outside Azure (for example, on-premises), you have to create a self-hosted integration runtime, through which you can connect to the MongoDB server.

Can Azure data factory directly connect to the different Azure services?

Yes, Azure Data Factory can connect with various other Azure services like Azure Blob Storage, Azure Functions, and Logic Apps. However, for all of them, you have to grant the data factory the proper roles using RBAC.

How to connect Azure data factory to GitHub?

Azure Data Factory can connect to GitHub using Git integration. (You may instead be using Azure DevOps, which provides its own Git repos.) We configure the Git repository path in the Azure data factory, so that all the changes we make in the data factory are automatically synced with the repository.

How can you move your changes from one environment to another for Azure Data Factory?

We can migrate code from one environment to another for Azure Data Factory using ARM templates. An ARM template is the JSON representation of the data factory resources (pipelines, datasets, linked services, and so on) that we have created.

What is Azure Synapse analytics?

Azure Synapse is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. Azure Synapse brings together the best of SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Pipelines for data integration and ETL/ELT, and deep integration with other Azure services such as Power BI, CosmosDB, and AzureML.

What is Synapse SQL?

Synapse SQL is the ability to do T-SQL based analytics in the Synapse workspace. Synapse SQL has two consumption models: dedicated and serverless. For the dedicated model, use dedicated SQL pools. A workspace can have any number of these pools. To use the serverless model, use the serverless SQL pools. Every workspace has one of these pools.

Inside Synapse Studio, you can work with SQL pools by running SQL scripts.

How can you use Apache Spark through Azure Synapse Analytics?

In Azure Synapse Analytics, you can run Spark code either interactively using a notebook or by creating a job that runs the Spark code. For running Spark code you need a Spark pool, which is simply a cluster of nodes with Spark installed on them.

What are the different types of Synapse SQL pools available?

Azure Synapse Analytics is an analytics service that brings together enterprise data warehousing and big data analytics. Dedicated SQL pool (formerly SQL DW) refers to the enterprise data warehousing features available in Azure Synapse Analytics. There are two types of Synapse SQL pool:

  • Serverless SQL pool
  • Dedicated SQL pool

How can we copy data into a Synapse SQL table from a file in Azure Blob Storage?

-- Load every file under the given storage folder into the target table
COPY INTO [dbo].[Trip] FROM 'https://nytaxiblob.blob.core.windows.net/2013/Trip2013/'
WITH (
   FIELDTERMINATOR='|',   -- column delimiter used in the source files
   ROWTERMINATOR='0x0A'   -- line feed as the row delimiter
) OPTION (LABEL = 'COPY: dbo.trip');  -- label makes this request easy to find in query history

What is Delta Lake?

Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, and durability) transactions to Apache Spark and big data workloads.

The current version of Delta Lake included with Azure Synapse has language support for Scala, PySpark, and .NET. 

What is Azure Synapse Runtime?

Apache Spark pools in Azure Synapse use runtimes to tie together essential component versions, Azure Synapse optimizations, packages, and connectors with a specific Apache Spark version. These runtimes are upgraded periodically to include new improvements, features, and patches. These runtimes have the following advantages:

  • Faster session startup times
  • Tested compatibility with specific Apache Spark versions
  • Access to popular, compatible connectors and open-source packages

Can you run a machine learning algorithm using Azure Synapse Analytics?

Yes, it is possible to run machine learning algorithms using Azure Synapse Analytics. Azure Synapse Analytics includes Apache Spark, so we can write the machine learning code there and have it executed on the Spark cluster.

What is Azure Databricks?

Databricks is the company that provides a Spark-based cluster platform in the cloud. Microsoft has a tie-up with Databricks, so you can use Databricks clusters within Azure, and this offering is called Azure Databricks.

What are the two different types of execution modes provided by Databricks?

You can run Spark code in two modes: interactive mode and scheduled job mode. In interactive mode, you can run the code cell by cell and see the output immediately. In job mode, all the code runs together, and you see the output afterwards.

What are the two different types of clusters provided by Databricks?

The two different types of clusters provided by Databricks are the interactive (all-purpose) cluster and the job cluster. To run notebooks interactively, you use an interactive cluster; to run scheduled jobs, you use a job cluster.

What are the different languages supported by Azure Databricks?

We can write code in four languages in Azure Databricks: Scala, Python, R, and SQL.

How can you run Python code in a Scala notebook in Azure Databricks?

We can run Python code in a Scala notebook by typing the %python magic command at the start of the cell. This prompts the notebook to treat that cell as Python code instead of Scala code.
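For example, in a Scala notebook, a cell beginning with the magic runs as Python (the magic must be the very first line of the cell; it is shown as a comment here so the snippet remains plain Python):

```python
# %python            <- first line of the cell in the Scala notebook
message = "this cell is executed by the Python interpreter"
print(message)
```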

Microsoft Official Document for Azure Data Factory Link

Microsoft Official Document for Azure Synapse Analytics Link

Recommendations

Most Azure data engineers find it a little difficult to understand real-world scenarios from an Azure data engineer’s perspective and face challenges in designing a complete enterprise solution. Hence, I would recommend going through these links to get a better understanding of Azure Data Factory.

Azure Data Factory Insights

Azure Data Engineer Real World scenarios

Azure Databricks Spark Tutorial for beginner to advance level

Latest Azure DevOps Interview Questions and Answers

You can also check out and pin this great YouTube channel for learning Azure for free from industry experts.

IT Skills Upgrade – YouTube

Final Thoughts

In this blog, I have tried to assemble a set of Azure data engineer interview questions and answers. This is a very important guide for fresher and experienced Azure data engineers, developers, and leads. My main focus in this preparation guide was to cover the questions most frequently asked for the Azure data engineer role.

Hope you have liked this interview preparation guide and that it helps you crack the Azure data engineer interview. You can also go through my other interview preparation guides for Azure Data Factory, Azure Databricks, Azure Functions, Azure Logic Apps, and many more.

Please share your suggestions and feedback. You can also ask questions, and I will update this guide based on your interview experiences.
