You're reading from Limitless Analytics with Azure Synapse

Product type: Book

Published in: Jun 2021

Publisher: Packt

ISBN-13: 9781800205659

Pages: 392

Edition: 1st Edition

Languages: Python

Concepts: Data Science

Author: Prashant Kumar Mishra

Table of Contents (20) Chapters

Preface

Section 1: The Basics and Key Concepts

Chapter 1: Introduction to Azure Synapse

Chapter 2: Considerations for Your Compute Environment

Section 2: Data Ingestion and Orchestration

Chapter 3: Bringing Your Data to Azure Synapse

Chapter 4: Using Synapse Pipelines to Orchestrate Your Data

Chapter 5: Using Synapse Link with Azure Cosmos DB

Section 3: Azure Synapse for Data Scientists and Business Analysts

Chapter 6: Working with T-SQL in Azure Synapse

Chapter 7: Working with R, Python, Scala, .NET, and Spark SQL in Azure Synapse

Chapter 8: Integrating a Power BI Workspace with Azure Synapse

Chapter 9: Perform Real-Time Analytics on Streaming Data

Chapter 10: Generate Powerful Insights on Azure Synapse Using Azure ML

Section 4: Best Practices

Chapter 11: Performing Backup and Restore in Azure Synapse Analytics

Chapter 12: Securing Data on Azure Synapse

Chapter 13: Managing and Monitoring Synapse Workloads

Chapter 14: Coding Best Practices

Other Books You May Enjoy

Chapter 2: Considerations for Your Compute Environment

This chapter covers the analytics runtimes available with Azure Synapse. You will learn about the concepts of SQL Pool, SQL on-demand, and Spark pool. After completing this chapter, you will be able to decide which analytics runtime will be suitable for solving your business problem.

SQL Pool and SQL on-demand are both part of the Structured Query Language (SQL) engine, but they differ in how they are provisioned. When you create a SQL pool, you provision databases under a logical server in your subscription; this means you pay for running the SQL engine continuously until the SQL pool is paused. SQL on-demand, by contrast, is intended for cases where you want to leverage the SQL engine to run your workloads for only a short duration.

On the other hand, Spark pool works with the Apache Spark engine, deeply integrated with Azure Synapse. This gives you the option to configure your Spark pool with just a few clicks, along with...

Technical requirements

To follow the instructions in the following sections, you need to meet these prerequisites:

You need your own Azure subscription, or contributor-level access to another subscription.
You need a Synapse workspace in this subscription. You can follow the instructions in Chapter 1, Introduction to Azure Synapse, to create your Synapse workspace.

Introducing SQL Pool

SQL Pool uses a scale-out, node-based architecture with one control node and multiple compute nodes for distributed computational processing. The control node is the single point of contact through which end users interact with all the compute nodes. It runs the Massively Parallel Processing (MPP) engine, which passes an operation to multiple compute nodes so that they do their work in parallel. MPP databases are optimized for analytical workloads, such as aggregating and processing large datasets. In this type of architecture, each compute node (also called a processing unit) works independently, with its own operating system and dedicated memory.
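To make the control node and compute node roles more concrete, here is a minimal sketch of the scatter-and-gather pattern in plain Python. It is a toy model and not how SQL Pool is implemented: the function names (`distribute`, `local_sum`, `mpp_sum`), the character-sum hash, and the sample `sales` rows are all assumptions made for illustration, and threads stand in for compute nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, key, node_count):
    """Control-node step: assign each row to a compute node by hashing its key."""
    shards = defaultdict(list)
    for row in rows:
        # Toy deterministic hash (sum of character codes); an assumption
        # for illustration, not the hash a real engine uses.
        node = sum(ord(ch) for ch in str(row[key])) % node_count
        shards[node].append(row)
    return shards


def local_sum(shard, key, value):
    """Compute-node step: aggregate only the rows held by this node."""
    totals = defaultdict(int)
    for row in shard:
        totals[row[key]] += row[value]
    return dict(totals)


def mpp_sum(rows, key, value, node_count=4):
    """Scatter rows to nodes, aggregate in parallel, then merge the partial results."""
    shards = distribute(rows, key, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(lambda s: local_sum(s, key, value), shards.values()))
    merged = defaultdict(int)
    for partial in partials:
        for group, total in partial.items():
            merged[group] += total
    return dict(merged)


# Hypothetical sample data.
sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
    {"region": "north", "amount": 3},
]
result = mpp_sum(sales, key="region", value="amount")
print(result)
```

Because rows with the same key always land on the same node in this sketch, each node can aggregate its groups without consulting the others, and the final merge only has to combine the partial results.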

In this section, you will learn about the architecture of SQL Pool, which will help you in understanding data distribution across various nodes in SQL Pool. We will cover how to create a SQL pool using both the Azure portal and Synapse Studio in the following section.

Creating a SQL pool

In this section, you will...

Understanding Synapse SQL on-demand

SQL on-demand is a serverless distributed data processing system that enables you to analyze your big data without provisioning anything up front. There is no need to set up infrastructure or maintain a cluster to start using SQL on-demand, so you can start querying data as soon as the workspace is created.

In this section, we are going to talk about the architecture and components of Synapse SQL on-demand, the benefits of using SQL on-demand, and how you can query files in your Azure Storage accounts using SQL on-demand.
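As a rough illustration of what querying files in a storage account can look like, the following sketch builds a T-SQL statement that uses the `OPENROWSET` function to read Parquet files. The helper name `build_openrowset_query` and the storage URL are hypothetical, and the sketch only produces the query text; you would still run it through a SQL client connected to your own workspace.

```python
def build_openrowset_query(file_url, top=10):
    """Return T-SQL that reads Parquet files in Azure Storage via OPENROWSET."""
    if "'" in file_url:
        # Guard against breaking out of the quoted string literal.
        raise ValueError("file_url must not contain single quotes")
    return (
        f"SELECT TOP {int(top)} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{file_url}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") AS [result];"
    )


# Hypothetical storage account, container, and folder.
query = build_openrowset_query(
    "https://examplestorage.dfs.core.windows.net/data/sales/*.parquet", top=5
)
print(query)
```

The point of the sketch is that the file location is part of the query itself, so there is no table to create or load before you start exploring the data.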

SQL on-demand architecture and components

SQL on-demand is serverless, so it scales automatically to accommodate the resource requirements of any query. The SQL on-demand architecture also has a control node, compute nodes, the Data Movement Service (DMS), and Azure Storage, but it does not have an MPP engine; instead, it uses a Distributed Query Processing (DQP) engine.

The architecture, as illustrated in the following screenshot, explains how a control node leverages a DQP engine...

Understanding Spark pool

Apache Spark is a very fast unified analytics engine for big data and machine learning.

Synapse Spark Pool is one of Microsoft's implementations of Apache Spark in Azure. A Synapse Analytics workspace has a Spark engine built in, along with notebook support. Because Synapse Spark supports C#, we can write .NET for Apache Spark code directly within notebooks. You can also write your code in Python, Scala, and SQL.

One Spark pool can be accessed by multiple users, but for every user, one new Spark instance will be created. A Spark instance is also dependent on the Spark pool capacity: if there is enough capacity in the pool to run multiple queries, the existing instance will be able to process the job; otherwise, a new instance will be created to process the job.
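The placement rule described above can be sketched as a small toy model. This is only an interpretation of the paragraph, under stated assumptions: capacity is reduced to a hypothetical count of job slots per instance, pool-level limits are ignored, and the class and method names (`SparkInstance`, `SparkPool`, `submit`) are invented for illustration and are not part of any Synapse or Spark API.

```python
from dataclasses import dataclass, field


@dataclass
class SparkInstance:
    capacity: int  # hypothetical unit: number of concurrent job slots
    jobs: list = field(default_factory=list)

    def has_room(self):
        return len(self.jobs) < self.capacity


@dataclass
class SparkPool:
    slots_per_instance: int
    instances: dict = field(default_factory=dict)  # user -> list of instances

    def submit(self, user, job):
        """Reuse the user's existing instance if it has room; otherwise start a new one."""
        user_instances = self.instances.setdefault(user, [])
        for instance in user_instances:
            if instance.has_room():
                instance.jobs.append(job)
                return "existing"
        user_instances.append(SparkInstance(self.slots_per_instance, [job]))
        return "new"


pool = SparkPool(slots_per_instance=2)
placements = [
    pool.submit("alice", "job-1"),  # first job for this user: new instance
    pool.submit("alice", "job-2"),  # room left: existing instance
    pool.submit("alice", "job-3"),  # instance full: new instance
    pool.submit("bob", "job-4"),    # different user: new instance
]
print(placements)
```

In this model, users never share an instance, and a user only gets a second instance once the first one has no room left for another job.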

The following diagram displays different components of Apache Spark on Azure Synapse:

Figure 2.17 – Apache Spark in Azure Synapse Analytics

Let's...

Summary

In this chapter, we covered the concepts of Synapse SQL and Synapse Spark. After going through this chapter, you have learned how to create your SQL pool, how to use SQL on-demand and Spark pool, and how to change DWUs for your SQL pool using both the Azure portal and Synapse Studio.

You can refer to other books to learn more about Apache Spark. In this chapter, we have tried to cover the Apache Spark concepts that are most relevant to Synapse.

We have used Azure Data Studio in a couple of places, to give you an idea of how it works. We will be seeing Azure Data Studio again, later on. I personally like to use Azure Data Studio because it offers a very smooth SQL coding experience with built-in features such as multiple tab windows, a rich SQL editor, code navigation, and source control integration.

In the next chapter, we are going to talk about various ways to bring your data to Azure Synapse.