Chapter 4: Understanding Synapse SQL Pools and SQL Options

In this chapter, you will learn what Massively Parallel Processing (MPP) means in terms of a cloud PaaS database service. You will examine the concepts of distributing and replicating data in a database. Furthermore, you will see how to manage the workload in this database to your benefit by avoiding early scaling and leveraging all the performance capabilities of the service. Partitioning will extend your options when you work with massive amounts of data and need to grow from terabytes to petabytes. You will also learn how to load data efficiently into your database service.

Finally, we will have a look at the next evolutionary steps of the SQL pools in Azure Synapse and other SQL components, such as SQL on-demand compute.

At the end, we will compare other SQL database services and their options in Azure and see how they may fit into your architecture.

You will find the following sections covered in this chapter...

Uncovering MPP in the cloud – the power of 60

Synapse SQL – the MPP database service formerly called Azure SQL Data Warehouse – is already well known. It has reached its second generation, and the third evolutionary step is right around the corner: Gen 3 is expected in 2021, so it may already be available, at least as a preview, by the time this book is published.

When we examine Synapse Analytics, we find some interesting concepts. Sure, the idea of MPP is not new; it has already been implemented in several appliances. Microsoft's Parallel Data Warehouse (PDW), now called the Analytics Platform System (APS) (https://docs.microsoft.com/en-us/sql/analytics-platform-system/home-analytics-platform-system-aps-pdw?view=aps-pdw-2016-au7), is one of them. There are also some quite successful on-premises solutions from other vendors on the market – Teradata, Greenplum, and DB2 UDB, to name just a few.
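
To make the idea of the 60 distributions more concrete, the following T-SQL sketch shows the three distribution options a dedicated SQL pool offers – hash-distributed, round-robin, and replicated tables. The table and column names are purely illustrative and not taken from the book:

-- Hash-distributed fact table: each row is assigned to one of the
-- 60 distributions based on the hash of ProductKey.
CREATE TABLE dbo.FactSales
(
    SaleDateKey int NOT NULL,
    ProductKey  int NOT NULL,
    Amount      decimal(18, 2) NOT NULL
)
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX);

-- Round-robin staging table: rows are spread evenly across the distributions.
CREATE TABLE stg.SalesLoad
(
    SaleDateKey int,
    ProductKey  int,
    Amount      decimal(18, 2)
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Replicated dimension table: a full copy is cached on every compute node,
-- which avoids data movement when joining to the fact table.
CREATE TABLE dbo.DimProduct
(
    ProductKey  int NOT NULL,
    ProductName nvarchar(100)
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX (ProductKey));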

...

Provisioning a Synapse dedicated SQL pool

You have two options when it comes to creating a Synapse dedicated SQL pool: a standalone version and a version that is integrated into the Synapse Analytics workspace. As we will need the Synapse workspace several more times later in this book, for example, when you examine the Synapse Spark pools in Chapter 6, Using Synapse Spark Pools, we will provision a workspace directly (a T-SQL sketch for the standalone option follows these steps):

  1. Click + Create a resource in the Azure portal and type Synapse into the search field. You can select Synapse Analytics directly from the quick results, or press Enter and select Synapse Analytics from the full search results. On the Overview blade, click Create to start provisioning.
  2. On the following Basics blade, fill in the corresponding fields. Select the subscription where you want to create your environment and, in the Resource Group field, either select an existing resource group or create a new one.
  3. In the Managed Resource Group field, add...
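
If you go for the standalone version instead, the dedicated SQL pool (formerly Azure SQL Data Warehouse) can also be created with T-SQL, assuming you already have an Azure SQL logical server to host it. The following is only a minimal sketch; the database name and performance level are placeholders:

-- Run against the master database of an existing logical SQL server.
-- 'DW100c' is the smallest Gen2 performance level; choose what your workload needs.
CREATE DATABASE SalesDW
(
    EDITION = 'DataWarehouse',
    SERVICE_OBJECTIVE = 'DW100c'
);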

Talking about partitioning

When you need to load massive amounts of data into your database, partitioning might be another optimization option. However, you should only start considering partitioning when you are genuinely confronted with massive amounts of data.

Do you remember the math of the CCI and why it only performs well when you load around 63 to 100 million rows into a table (see the preceding section, Understanding CCI)? Now, you need to add another factor to this equation: the number of partitions that you are planning for your table.

Let's assume that you want to have one partition for every month (the most typical usage of partitions) in your table, and you plan to load data for 5 years into your database. This adds another factor of 60 to the preceding calculation: 60 distributions x 1,048,576 rows per distribution x 60 months in the database. This results in 3,774,873,600 rows that your table needs to hold as a minimum in order for the CCI to be built over all the partitions...
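
To make this concrete, a partitioned, hash-distributed table with monthly boundaries could be declared roughly as follows; the table, columns, and boundary values are purely illustrative:

-- One partition per month; RANGE RIGHT puts each boundary value
-- into the partition to its right.
CREATE TABLE dbo.FactSalesPartitioned
(
    SaleDateKey int NOT NULL,      -- e.g. 20210115 for January 15, 2021
    ProductKey  int NOT NULL,
    Amount      decimal(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION
    (
        SaleDateKey RANGE RIGHT FOR VALUES
        (20210201, 20210301, 20210401)   -- extend with one boundary per month
    )
);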

Implementing workload management

You have read about control nodes and compute nodes in the preceding sections, and in the Scaling the database section you will learn that you can add more compute nodes when your workload needs more power. But keep in mind that adding compute nodes will also add cost to your bill at the end of the month.

Tip

Try to identify time slots when your database is not under heavy load and either scale it back down or, if you can afford to, pause it. You can save a lot of money by doing so. Development and testing environments are also good candidates for switching the database off when they aren't needed.

Fortunately, the Synapse product group has added some options to the database that help you optimize the usage of a given compute environment before you need to scale the database. Workload management enables you to set different priorities (workload importance) for different processes, to isolate...
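
As an example, a workload group and a classifier for a nightly load account could be sketched roughly like this; the group, classifier, and user names are made up, and the percentages depend entirely on your environment:

-- Reserve resources for data loads and cap how much they can consume.
CREATE WORKLOAD GROUP wgDataLoads
WITH
(
    MIN_PERCENTAGE_RESOURCE = 30,              -- guaranteed share
    CAP_PERCENTAGE_RESOURCE = 60,              -- upper limit
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5     -- grant per single request
);

-- Route requests from the loading account into that group with high importance.
CREATE WORKLOAD CLASSIFIER wcNightlyLoad
WITH
(
    WORKLOAD_GROUP = 'wgDataLoads',
    MEMBERNAME = 'loader_user',
    IMPORTANCE = HIGH
);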

Scaling the database

In Figure 4.13, in the Understanding concurrency and memory settings section, you already learned about the available resource settings that you can configure within Synapse Analytics. But how do we get there?

When you browse to your Synapse Analytics service in your portal, you can scale the database from there. In the Overview blade, you will find the Scale link right next to the Resume or Pause button (displayed depending on the running state of the database):

Figure 4.15 – The Resume and Scale buttons on the Overview blade

When you click on the Scale button, you are taken to the Scaling blade of the database. By dragging the slider in the Scale your system section, you can add or remove resources in your database:

Figure 4.16 – Scaling your database

If your database contains workload groups for workload management, they will also be displayed here with an overview of their possible resource...
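
If you prefer scripting over the portal, a standalone dedicated SQL pool can also be scaled with T-SQL from the master database of its logical server; the pool name and target performance level below are placeholders:

-- Change the performance level (DWUc) of a dedicated SQL pool.
-- Run this against the master database, not against the pool itself.
ALTER DATABASE SalesDW
MODIFY (SERVICE_OBJECTIVE = 'DW200c');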

Loading data

With all the parallel options that the database can offer, you will want to use them when you load data, too. Remember the purpose of the control and the compute nodes? When loading data into your database, you want to use a technique that involves the compute nodes as much as possible.

Using the COPY statement

The COPY statement will support you in doing so. It talks directly to the compute nodes and therefore uses the full parallelism that the database can offer. It comes as part of the T-SQL dialect of the Synapse Analytics database and offers many options to influence the loading of data into the database.
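
A minimal sketch of such a load could look like the following; the storage account, container, and target table are hypothetical, and the authentication option depends on your setup:

-- Load Parquet files from a staging container straight into a distributed table,
-- letting the compute nodes read the files in parallel.
COPY INTO dbo.FactSales
FROM 'https://mystorageaccount.blob.core.windows.net/staging/sales/*.parquet'
WITH
(
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);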

When you talk to the control node instead, in contrast to what the COPY statement does, you will create a bottleneck during your load. The load would be single-threaded, and all the rows that need to be written to the database would first flow through the control node and would then be spread to the distributions using...

Understanding other SQL options in Azure

In Chapter 2, Connecting Requirements and Technology, we talked about different size options for your modern data warehouse. For the M size, we added an Azure SQL database for the presentation layer. When we compare an Azure SQL database to a Synapse Analytics database, the main difference is the SMP character of the Azure SQL database. See also Figure 4.1 for a comparison of SMP versus MPP.

In an Azure SQL database, data is stored in a row orientation by default, just as it is in SQL Server. In general, you can think of an Azure SQL database as a single database that you would spin up and use. Almost all of the functionality of a SQL Server database is available in an Azure SQL database as well.

You will also have the option to create CCIs on your tables. This will give you high analytical performance on your data stored there. In comparison to Synapse Analytics, you won't get the same scale-out architecture with...
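
Creating such an index on an existing table is a one-liner; the table name here is just an example:

-- Convert an existing row-store table into columnstore format
-- to speed up analytical scans and aggregations.
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales
ON dbo.FactSales;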

Summary

In this chapter, you examined the Synapse Analytics database relational storage option. You learned about the MPP architecture and the control and compute nodes, as well as how tables can be distributed or replicated in a database and how partitioning influences the data in the database.

You read about the CCI and how you can benefit from its performance.

Furthermore, you learned about resource allocation, concurrency, and the limits of the DWUc configurations. We touched on workload management and how you can optimize your database workload before you need to scale the database and pay more money for it. But when you need to scale for more concurrency and memory, and therefore more performance, you now know where to search for this functionality.

At the end, you covered the basics of loading data and why it is a good idea to use the COPY statement. You then learned how to maintain your statistics and how to rebuild your CCI to optimize it.

Finally, we compared...
