Chapter 6: Using Synapse Spark Pools

In your modern data warehouse project, you may use Azure Data Factory ETL pipelines (see Chapter 5, Integrating Data into Your Modern Data Warehouse) to integrate and transform incoming data according to your needs. However, chances are that you are a more code-oriented developer, that you are already very proficient with Spark, or that your transformational needs reach beyond the functionality or the available compute power of Data Factory.

Maybe you need to train and implement machine learning models as part of your project, and you want a Spark engine that can scale to your needs and offers suitable libraries and tight integration with all the other tools that you plan to use on Azure.

This chapter will discuss Synapse Spark pools and how to implement them on Azure. You will learn about their architecture and how jobs are handled when they are dispatched to a cluster. You will examine how to implement notebooks and Spark jobs and integrate...

Technical requirements

To follow this chapter, you will need the following:

  • An Azure subscription for which you have at least contributor rights.
  • The right to provision a Synapse workspace.
  • The right to provision a Synapse Spark pool.
  • The right to use Synapse Studio.
  • An Azure DevOps Git or GitHub account. This is optional and to be used if you want to integrate your work with a DevOps repository.
  • Your Azure Data Factory from Chapter 5, Integrating Data into Your Modern Data Warehouse.
  • Visual Studio Code (optional, if you wish to follow the batch example later in the chapter): https://code.visualstudio.com/Download.

Setting up a Synapse Spark pool

Let's examine the basic steps to spin up a Synapse Spark pool.

This task is easy to handle in a Synapse workspace (a programmatic alternative is sketched after the following steps):

  1. Navigate to the Management pane and, in the Analytics pools section, select Apache Spark pools.
  2. In the Details pane, click + New. The configuration blade for a new Apache Spark pool is displayed:

    Figure 6.1 – Create Apache Spark pool – The Basics blade

  3. Here, you name your new Spark pool, configure the Node size value, enable Autoscale, and, if enabled, set the lower and upper boundaries for the autoscaling feature. The last row in this view shows the potential cost at the lowest and the highest autoscaling settings. Click Next: Additional settings.
  4. In the upper area of the Additional settings blade, you can now configure Auto-pause and Number of minutes idle, which sets the amount of idle time that will elapse before the cluster pauses. In the Component...
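If you prefer to script the provisioning rather than click through the blades, the pool can also be created programmatically. The following is a minimal sketch, assuming the azure-mgmt-synapse Python SDK and hypothetical resource names; model and method names can differ between SDK versions, so treat it as an illustration rather than the book's approach:

# Sketch: create a Synapse Spark pool with the azure-mgmt-synapse SDK.
# Resource names are placeholders; verify model/method names for your SDK version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient
from azure.mgmt.synapse.models import (
    BigDataPoolResourceInfo, AutoScaleProperties, AutoPauseProperties
)

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "<your-resource-group>"     # placeholder
workspace_name = "<your-synapse-workspace>"  # placeholder

client = SynapseManagementClient(DefaultAzureCredential(), subscription_id)

pool = BigDataPoolResourceInfo(
    location="westeurope",
    node_size="Small",                  # corresponds to the Node size setting
    node_size_family="MemoryOptimized",
    auto_scale=AutoScaleProperties(enabled=True, min_node_count=3, max_node_count=10),
    auto_pause=AutoPauseProperties(enabled=True, delay_in_minutes=15),
    spark_version="2.4",
)

# Long-running operation; wait until the pool has been created
client.big_data_pools.begin_create_or_update(
    resource_group, workspace_name, "SparkPool01", pool
).result()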

Examining the Synapse Spark architecture

With Synapse Spark pools, Microsoft adds another scalable, parallel processing engine to the Synapse ecosystem. The Microsoft implementation of Spark offers in-memory processing and supports languages such as Python, Scala, Java, SQL, and even .NET for Spark.

The engine comes with built-in compatibility with Azure Data Lake Storage Gen2 and Azure Storage. This enables the Spark Core engine, via the YARN layer (which handles resource management and job scheduling/monitoring), to access the data that you have brought to Azure. This way, Spark Core exposes the storage components to libraries such as Spark SQL for interactive querying, MLlib for machine learning, and GraphX for graph computation at scale.
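To make this more concrete, here is a short PySpark sketch of the pattern this enables: reading a dataset straight from an Azure Data Lake Storage Gen2 path and querying it with Spark SQL. The spark session object is the one a Synapse notebook provides automatically; the storage account, container, folder, and column names are placeholders:

# Read a Parquet dataset directly from ADLS Gen2 (placeholder path)
df = spark.read.parquet(
    "abfss://raw@<yourdatalake>.dfs.core.windows.net/sales/2021/"
)

# Register a temporary view and query it with Spark SQL (placeholder columns)
df.createOrReplaceTempView("sales")
spark.sql("""
    SELECT ProductId, SUM(Amount) AS TotalAmount
    FROM sales
    GROUP BY ProductId
    ORDER BY TotalAmount DESC
""").show(10)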

Spark implements in-memory computation algorithms that can run your Spark jobs or notebooks in parallel on defined clusters. As mentioned previously, clusters will hold the data to be computed in memory in a distributed...

Programming with Synapse Spark pools

Now that you understand how to provision a Spark pool and how resources are used, let's proceed and examine the different interfaces that you can use to program against a Spark instance.

Understanding Synapse Spark notebooks

Notebooks are the rising star when it comes to interactive data analysis. They offer a step-by-step programming experience with immediate feedback for each code step. You can enter a single line or a block of code into a cell, run it directly on an available Spark instance, and have the results displayed below the cell.
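As a quick illustration, a single cell such as the following builds a small DataFrame from made-up values and renders the result directly below the cell:

# One notebook cell: build a tiny DataFrame and render it below the cell
data = [("Contoso", 2350.50), ("Fabrikam", 1780.25), ("Tailwind", 990.00)]
df = spark.createDataFrame(data, ["Customer", "Revenue"])

# display() renders an interactive table/chart view in Synapse notebooks;
# df.show() would print a plain-text table instead
display(df)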

To create a new notebook, navigate to the Develop hub in Synapse Studio. Here, you can click the + icon in the navigation pane (next to the word Develop) and select Notebook (Figure 6.11):

Figure 6.11 – Creating a new notebook

Alternatively, you can right-click on the Notebooks section and select New Notebook. An empty notebook will...

Using additional libraries with your Spark pool

There are many cases where you need to rely on additional functionality from third-party libraries. Synapse Spark supports adding libraries to your Spark pool and makes them available when the pool is instantiated. There are different options for using this functionality.

Using public libraries

In the case of PyPI packages, you create a file named requirements.txt and add it to the configuration of your Spark pool. In this file, you list all the packages that you want installed when a Spark instance starts. The package names follow the pip freeze format, with the package version given next to the package name:

packagename==1.2.1

The requirements.txt file can be uploaded to the Packages section of the Spark pool properties during creation. You can do this later, too, if you need to.
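Once the pool starts with the uploaded requirements.txt applied, you can check from a notebook cell that a package was actually picked up. A minimal sketch, using the placeholder package name from above:

# Verify that a package listed in requirements.txt is available on the pool
# ('packagename' is the placeholder used above; replace it with your package)
import pkg_resources

try:
    dist = pkg_resources.get_distribution("packagename")
    print(dist.project_name, dist.version, "is installed")
except pkg_resources.DistributionNotFound:
    print("packagename is not available on this Spark instance")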

You'll find the location to upload your file in Figure 6.16...

Handling security

When you access the data lake storage that was configured during the setting up of the Synapse workspace, you don't need to worry about using the TokenLibrary. The Spark instance will use an Azure Active Directory credential pass-through to access the data in the data lake. This makes it easy for you to integrate your environment and set up detailed control as described in Chapter 3, Understanding the Data Lake Storage Layer. You have been using this throughout this chapter to access your data lake:

Figure 6.20 – Security setup with credential pass-through

There are other options when it comes to accessing Azure Data Lake Storage Gen2. You might have additional Azure Data Lake Storage Gen2 accounts that you have added as linked services to your Synapse workspace. In this case, you have several authentication options when it comes to using the storage:

  • If a linked service uses a storage account key, you will need to create...
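For a linked Azure Data Lake Storage Gen2 account, one common pattern is to point the Spark session at the linked service so that its credential is used for storage access instead of pass-through. The following sketch assumes a hypothetical linked service name, and the configuration keys shown may vary with the Synapse runtime version:

# Use a linked service's credential (instead of AAD pass-through) to reach a
# secondary ADLS Gen2 account; linked service name and path are placeholders
spark.conf.set("spark.storage.synapse.linkedServiceName", "LS_SecondaryDataLake")
spark.conf.set(
    "fs.azure.account.oauth.provider.type",
    "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider",
)

df = spark.read.csv(
    "abfss://data@<secondaryaccount>.dfs.core.windows.net/input/",
    header=True,
)
df.show(5)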

Monitoring your Synapse Spark pools

When you're developing your Spark application, you will sometimes need to dig deep into the engine and examine the details of your jobs and the environment they run in.

To ascertain details regarding your environment, navigate to the Synapse management hub. Your first stop is the Apache Spark pools section. You will see a list of all Spark pools, and by clicking on them, you can get an overview page with information about occupied vCores, allocated memory, and active Spark applications:

Figure 6.21 – Synapse Spark pools overview

The next level of detail to investigate is the application itself. You will find the Applications overview in the Management pane of Synapse Studio, which lists all the applications present in your Synapse environment. By clicking the application name on the line you're interested in, you'll get to the application details...
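From within a running notebook, you can also print a few identifiers that make it easier to find the corresponding entry in these monitoring views. A small sketch; the values you see depend on your pool configuration:

# Identifiers that help you locate this session in the Spark monitoring views
sc = spark.sparkContext
print("application id:    ", sc.applicationId)
print("application name:  ", sc.appName)
print("spark version:     ", sc.version)
print("executor instances:", spark.conf.get("spark.executor.instances", "dynamic"))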

Summary

In this chapter, you have seen how to provision a Synapse Spark pool. You have learned about Spark's architecture in general and Synapse Spark's architecture.

You have learned about the difference between Synapse Spark pools and Synapse Spark instances. You implemented your first Synapse notebook for interactive analytics and learned how to implement a Spark application that can be run as a batch job.

You have seen how to use a Spark pool from an IDE such as Visual Studio Code and you have investigated how to use additional libraries from public sources and your own libraries.

Finally, you saw how you can interact with storage securely, before learning how monitoring works with your Synapse Spark environment.

In Chapter 7, Using Databricks Spark Clusters, you will learn about an alternative Spark environment that Microsoft offers on Azure.
