Chapter 6: Using Synapse Spark Pools

In your modern data warehouse project, you may use Azure Data Factory ETL pipelines (see Chapter 5, Integrating Data into Your Modern Data Warehouse) to integrate and transform incoming data according to your needs. However, chances are that you are a more code-oriented developer, that you are already very proficient with Spark, or that your transformational needs reach beyond the functionality or the available compute power of Data Factory.

Maybe you need to train and implement machine learning models as part of your project, and you want a Spark engine that can scale to your needs and offers suitable libraries and tight integration with all the other tools that you plan to use on Azure.

This chapter will discuss Synapse Spark pools and how to implement them on Azure. You will learn about their architecture and how jobs are handled when they are dispatched to a cluster. You will examine how to implement notebooks and Spark jobs and integrate...

Technical requirements

To follow this chapter, you will need the following:

  • An Azure subscription for which you have at least contributor rights.
  • The right to provision a Synapse workspace.
  • The right to provision a Synapse Spark pool.
  • The right to use Synapse Studio.
  • An Azure DevOps Git or GitHub account. This is optional and to be used if you want to integrate your work with a DevOps repository.
  • Your Azure Data Factory from Chapter 5, Integrating Data into Your Modern Data Warehouse.
  • Visual Studio Code (optional, if you wish to follow the batch example later in the chapter): https://code.visualstudio.com/Download.

Setting up a Synapse Spark pool

Let's examine the basic steps to spin up a Synapse Spark pool.

This task is easy to handle in a Synapse workspace (a programmatic alternative is sketched after the following steps):

  1. Navigate to the Management pane and, in the Analytics pools section, select Apache Spark pools.
  2. In the Details pane, click + New. The configuration blade for a new Apache Spark pool is displayed:

    Figure 6.1 – Create Apache Spark pool – The Basics blade

  3. Here, you name your new Spark pool, configure the Node size value, enable Autoscale, and, if enabled, set the lower and upper boundaries for the autoscaling feature. The last row in this view shows the potential cost at the lowest and the highest autoscaling settings. Click Next: Additional settings.
  4. In the upper area of the Additional settings blade, you can now configure Auto-pause and Number of minutes idle, which sets the amount of idle time that will elapse before the cluster pauses. In the Component...
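If you prefer to script the provisioning rather than click through the blades, the pool can also be created programmatically. The following is a minimal sketch, assuming the azure-mgmt-synapse Python SDK and hypothetical resource names; model and method names can differ between SDK versions, so treat it as an illustration rather than the book's approach:

# Sketch: create a Synapse Spark pool with the azure-mgmt-synapse SDK.
# Resource names are placeholders; verify model/method names for your SDK version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient
from azure.mgmt.synapse.models import (
    BigDataPoolResourceInfo, AutoScaleProperties, AutoPauseProperties
)

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "<your-resource-group>"     # placeholder
workspace_name = "<your-synapse-workspace>"  # placeholder

client = SynapseManagementClient(DefaultAzureCredential(), subscription_id)

pool = BigDataPoolResourceInfo(
    location="westeurope",
    node_size="Small",                  # corresponds to the Node size setting
    node_size_family="MemoryOptimized",
    auto_scale=AutoScaleProperties(enabled=True, min_node_count=3, max_node_count=10),
    auto_pause=AutoPauseProperties(enabled=True, delay_in_minutes=15),
    spark_version="2.4",
)

# Long-running operation; wait until the pool has been created
client.big_data_pools.begin_create_or_update(
    resource_group, workspace_name, "SparkPool01", pool
).result()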

Examining the Synapse Spark architecture

With Synapse Spark pools, Microsoft adds another scalable, parallel processing engine to the Synapse ecosystem. The Microsoft implementation of Spark offers in-memory processing and supports languages such as Python, Scala, Java, SQL, and even .NET for Spark.

The engine comes with built-in compatibility with Azure Data Lake Storage Gen2 and Azure Storage. This enables the Spark Core engine, via the YARN layer (which handles resource management and job scheduling/monitoring), to access the data that you have brought to Azure. This way, Spark Core exposes the storage components to libraries such as Spark SQL for interactive querying, MLlib for machine learning, and GraphX for graph computation at scale.
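To make this more concrete, here is a short PySpark sketch of the pattern this enables: reading a dataset straight from an Azure Data Lake Storage Gen2 path and querying it with Spark SQL. The spark session object is the one a Synapse notebook provides automatically; the storage account, container, folder, and column names are placeholders:

# Read a Parquet dataset directly from ADLS Gen2 (placeholder path)
df = spark.read.parquet(
    "abfss://raw@<yourdatalake>.dfs.core.windows.net/sales/2021/"
)

# Register a temporary view and query it with Spark SQL (placeholder columns)
df.createOrReplaceTempView("sales")
spark.sql("""
    SELECT ProductId, SUM(Amount) AS TotalAmount
    FROM sales
    GROUP BY ProductId
    ORDER BY TotalAmount DESC
""").show(10)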

Spark implements in-memory computation algorithms that can run your Spark jobs or notebooks in parallel on defined clusters. As mentioned previously, clusters will hold the data to be computed in memory in a distributed...

Programming with Synapse Spark pools

Now that you understand how to provision a Spark pool and how resources are used, let's proceed and examine the different interfaces that you can use to program against a Spark instance.

Understanding Synapse Spark notebooks

Notebooks are the rising star when it comes to interactive data analysis. They offer a step-by-step programming experience with immediate feedback for each code step. You can enter a single line or a block of code into a cell, run it directly on an available Spark instance, and have the results displayed below the cell.
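As a quick illustration, a single cell such as the following builds a small DataFrame from made-up values and renders the result directly below the cell:

# One notebook cell: build a tiny DataFrame and render it below the cell
data = [("Contoso", 2350.50), ("Fabrikam", 1780.25), ("Tailwind", 990.00)]
df = spark.createDataFrame(data, ["Customer", "Revenue"])

# display() renders an interactive table/chart view in Synapse notebooks;
# df.show() would print a plain-text table instead
display(df)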

To create a new notebook, navigate to the Develop hub in Synapse Studio. Here, you can click the + icon in the navigation pane (next to the word Develop) and select Notebook (Figure 6.11):

Figure 6.11 – Creating a new notebook

Alternatively, you can right-click on the Notebooks section and select New Notebook. An empty notebook will...

Using additional libraries with your Spark pool

There are many cases where you need to rely on additional functionality from third-party libraries. Synapse Spark supports adding libraries to your Spark pool and makes them available when the pool is instantiated. There are different options for using this functionality.

Using public libraries

In the case of PyPI packages, you create a file named requirements.txt and add it to the configuration of your Spark pool. In this file, you list all the packages that you want installed when a Spark instance starts. The package names follow the pip freeze format, with the package version given next to the package name:

packagename==1.2.1

The requirements.txt file can be uploaded to the Packages section of the Spark pool properties during creation. You can do this later, too, if you need to.
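Once the pool starts with the uploaded requirements.txt applied, you can check from a notebook cell that a package was actually picked up. A minimal sketch, using the placeholder package name from above:

# Verify that a package listed in requirements.txt is available on the pool
# ('packagename' is the placeholder used above; replace it with your package)
import pkg_resources

try:
    dist = pkg_resources.get_distribution("packagename")
    print(dist.project_name, dist.version, "is installed")
except pkg_resources.DistributionNotFound:
    print("packagename is not available on this Spark instance")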

You'll find the location to upload your file in Figure 6.16...

Handling security

When you access the data lake storage that was configured during the setting up of the Synapse workspace, you don't need to worry about using the TokenLibrary. The Spark instance will use an Azure Active Directory credential pass-through to access the data in the data lake. This makes it easy for you to integrate your environment and set up detailed control as described in Chapter 3, Understanding the Data Lake Storage Layer. You have been using this throughout this chapter to access your data lake:

Figure 6.20 – Security setup with credential pass-through

There are other options when it comes to accessing Azure Data Lake Storage Gen2. You might have additional Azure Data Lake Storage Gen2 accounts that you have added as linked services to your Synapse workspace. In this case, you have several authentication options when it comes to using the storage:

  • If a linked service uses a storage account key, you will need to create...
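For a linked Azure Data Lake Storage Gen2 account, one common pattern is to point the Spark session at the linked service so that its credential is used for storage access instead of pass-through. The following sketch assumes a hypothetical linked service name, and the configuration keys shown may vary with the Synapse runtime version:

# Use a linked service's credential (instead of AAD pass-through) to reach a
# secondary ADLS Gen2 account; linked service name and path are placeholders
spark.conf.set("spark.storage.synapse.linkedServiceName", "LS_SecondaryDataLake")
spark.conf.set(
    "fs.azure.account.oauth.provider.type",
    "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider",
)

df = spark.read.csv(
    "abfss://data@<secondaryaccount>.dfs.core.windows.net/input/",
    header=True,
)
df.show(5)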

Monitoring your Synapse Spark pools

When you're developing your Spark application, you will sometimes need to dig deep into the engine and examine the details of your jobs and the environment they run in.

To ascertain details regarding your environment, navigate to the Synapse management hub. Your first stop is the Apache Spark pools section. You will see a list of all Spark pools, and by clicking on them, you can get an overview page with information about occupied vCores, allocated memory, and active Spark applications:

Figure 6.21 – Synapse Spark pools overview

The next level of detail to investigate is the application itself. You will find the Applications overview in the Management pane of Synapse Studio, which lists all the applications present in your Synapse environment. By clicking the application name on the line you're interested in, you'll get to the application details...
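From within a running notebook, you can also print a few identifiers that make it easier to find the corresponding entry in these monitoring views. A small sketch; the values you see depend on your pool configuration:

# Identifiers that help you locate this session in the Spark monitoring views
sc = spark.sparkContext
print("application id:    ", sc.applicationId)
print("application name:  ", sc.appName)
print("spark version:     ", sc.version)
print("executor instances:", spark.conf.get("spark.executor.instances", "dynamic"))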

Summary

In this chapter, you have seen how to provision a Synapse Spark pool. You have learned about Spark's architecture in general and Synapse Spark's architecture.

You have learned about the difference between Synapse Spark pools and Synapse Spark instances. You implemented your first Synapse notebook for interactive analytics and learned how to implement a Spark application that can be run as a batch job.

You have seen how to use a Spark pool from an IDE such as Visual Studio Code and you have investigated how to use additional libraries from public sources and your own libraries.

Finally, you saw how you can interact with storage securely, before learning how monitoring works with your Synapse Spark environment.

In Chapter 7, Using Databricks Spark Clusters, you will learn about an alternative Spark environment that Microsoft offers on Azure.
