You might be wondering how to run your PySpark code once you move to the cloud. If you have written Apache Spark code in a language such as Scala, Python, or Java and want to execute it in Azure without using Azure Databricks, there are multiple options. In this article I will take you through the step-by-step process of executing PySpark code in Azure without Databricks. Let's dive into it.
There are multiple ways to run PySpark code in Azure without Databricks:
1. Create a Spark cluster using Azure HDInsight and run the Spark code there.
2. Create an Azure Synapse Analytics workspace and execute the Spark code there.
3. Manually install Spark on Azure virtual machines and run the Spark code on them.
Let's look at each option in detail.
- 1 How to run Spark or PySpark code using the Azure HDInsight service?
- 2 How to run Spark or PySpark code using the Azure Synapse Analytics service?
- 3 How to run Spark or PySpark code using Azure Virtual Machines?
- 4 Recommendations
- 5 Final Thoughts
How to run Spark or PySpark code using the Azure HDInsight service?
Azure HDInsight is a managed, full-spectrum, open-source analytics service in the cloud for enterprises. You can use open-source frameworks such as Hadoop, Apache Spark, Apache Hive, LLAP, Apache Kafka, Apache Storm, R, and more.
Create an Apache Spark cluster in HDInsight
Go to the Azure portal and search for Azure HDInsight.
- From the Basics tab, provide the following information:
| Property | Description |
|---|---|
| Subscription | From the drop-down list, select the Azure subscription to be used for the cluster. |
| Resource group | From the drop-down list, select an existing resource group, or select Create new. |
| Cluster name | Enter a globally unique name. |
| Region | From the drop-down list, select a region where the cluster will be created. |
| Cluster type | Select "Select cluster type" to open a list. From the list, select Spark. |
| Cluster version | This field auto-populates with the default version once the cluster type is selected. |
| Cluster login username | Enter the cluster login username. The default name is admin. You use this account to log in to the Jupyter Notebook later in this quickstart. |
| Cluster login password | Enter the cluster login password. |
| Secure Shell (SSH) username | Enter the SSH username. The SSH username used in this quickstart is sshuser. By default, this account shares the same password as the cluster login username account. |
Once you provide all these details, you will be able to create the Azure HDInsight cluster.
Execute the PySpark code in HDInsight
- Jupyter Notebook is an interactive notebook environment that supports various programming languages. The notebook allows you to interact with your data, combine code with markdown text and perform simple visualizations.
- Go to `https://CLUSTERNAME.azurehdinsight.net/jupyter`, where CLUSTERNAME is the name of your cluster. If prompted, enter the cluster login credentials for the cluster.
- Create a Python notebook, copy your PySpark code into it, and execute it.
How to run Spark or PySpark code using the Azure Synapse Analytics service?
Azure Synapse is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. Azure Synapse brings together the best of SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Pipelines for data integration and ETL/ELT, and deep integration with other Azure services such as Power BI, CosmosDB, and AzureML.
Steps would be as follows:
- Create an Azure Synapse Analytics workspace.
- Create a Spark pool in the Azure Synapse workspace.
- Upload the Spark code and execute it on the Spark pool you created.
Microsoft Synapse official documentation link
How to run Spark or PySpark code using Azure Virtual Machines?
Steps would be as follows:
- Create an Azure virtual machine, or a cluster of virtual machines.
- Install Spark (and PySpark) on each machine manually.
- Configure the Spark cluster.
- Upload the PySpark code file that you want to execute.
- Run the PySpark code.
Most Azure data engineers find it a little difficult to understand real-world scenarios from a data engineer's perspective, and they face challenges in designing a complete enterprise solution. Hence I would recommend that you go through these links to gain a better understanding of Azure Data Factory.
You can also check out and pin this great YouTube channel for learning Azure for free from industry experts.
In this article we have learned that we can run PySpark code in Azure even without using Azure Databricks. We covered Spark code execution using the Azure HDInsight service, Azure Synapse Analytics, and Azure virtual machines. I hope you have enjoyed this post and gained some new knowledge in the Azure and Spark world.
If you still face any issues running your PySpark code, or any other Azure cloud related issues, please don't forget to share them in the comment section below.
Deepak Goyal is a certified Azure Cloud Solution Architect. He has around a decade and a half of experience in designing, developing, and managing enterprise cloud solutions. He is also a certified big data professional and a passionate cloud advocate.