Limitless Analytics with Azure Synapse

By Prashant Kumar Mishra

About this book

Azure Synapse Analytics, which Microsoft describes as the next evolution of Azure SQL Data Warehouse, is a limitless analytics service that brings enterprise data warehousing and big data analytics together. With this book, you'll learn how to discover insights from your data effectively using this platform.

The book starts with an overview of Azure Synapse Analytics, its architecture, and how it can be used to improve business intelligence and machine learning capabilities. Next, you'll go on to choose and set up the correct environment for your business problem. You'll also learn a variety of ways to ingest data from various sources and orchestrate the data using transformation techniques offered by Azure Synapse. Later, you'll explore how to handle both relational and non-relational data using the SQL language. As you progress, you'll perform real-time streaming and execute data analysis operations on your data using various languages, before going on to apply ML techniques to derive accurate and granular insights from data. Finally, you'll discover how to protect sensitive data in real time by using security and privacy features.

By the end of this Azure book, you'll be able to build end-to-end analytics solutions while focusing on data prep, data management, data warehousing, and AI tasks.

Publication date: June 2021
Publisher: Packt
Pages: 392
ISBN: 9781800205659

 

Chapter 1: Introduction to Azure Synapse

Azure Synapse Analytics, formerly known as Azure SQL Data Warehouse, is not a mere data warehouse anymore. Azure Synapse is an amalgamation of big data analytics with an enterprise data warehouse. It provides two different types of compute environments for different workloads: one is the SQL compute environment, which is called a SQL pool, and the other one is the Spark compute environment, which is called a Spark pool. Now developers can choose their compute environment as per their business needs. Azure Synapse also provides a unified portal called Synapse Studio for developers that creates a workspace for data prep, data management, data exploration, data warehousing, big data, and AI tasks.

This chapter introduces Azure Synapse and guides you through getting started with Synapse Studio. You will learn how to create an Azure Synapse workspace and get acquainted with the components of Azure Synapse. You can start using Synapse with the sample data and queries provided in the Azure portal itself.

In this chapter, our topics will include the following:

  • Introducing the components of Azure Synapse
  • Creating a Synapse workspace
  • Understanding Azure Data Lake
  • Exploring Synapse Studio
 

Technical requirements

In this chapter, you are going to learn how to create your first Synapse workspace in the Azure portal. Before you start working on Azure Synapse, there are a few prerequisites to meet.

It would be beneficial to have basic knowledge of the Azure portal, as well as an understanding of SQL and Spark. Knowledge of Azure Data Factory and Power BI would be helpful but not essential.

You must have your own Azure subscription or access to an Azure subscription with appropriate permissions. If you are new to Azure, you can go through the following link to create a free Azure account: https://azure.microsoft.com/en-us/free/.

Once you have your Azure subscription created, you can proceed further with the main topics of this chapter.

 

Introducing the components of Azure Synapse

Azure Synapse is a limitless analytics service on the Azure platform. It bundles together data warehousing and big data analytics with deep integration of Azure Machine Learning and Power BI. Azure Synapse brings together relational and non-relational data and lets you query files in the data lake without needing any other service.

One of the best features that has been introduced with Azure Synapse is code-free data orchestration where you can build ETL/ELT processes to bring data to Synapse from various sources.

Important note

Synapse provides various layers of security for the data stored; however, you need to follow the security guidelines to keep your data secured. For example, do not expose the username and password in any publicly accessible place – you will invite the biggest threat to your data by doing so. It is important to understand that Azure gives you the power to secure your data, but it is in your hands to best use that power.
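One common way to follow this guideline is to keep credentials out of scripts entirely and read them from the environment at runtime. The following is a minimal sketch of that idea, not a Synapse feature: the variable names SYNAPSE_SQL_USER and SYNAPSE_SQL_PASSWORD and both helper functions are hypothetical and chosen only for illustration.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised when a required credential is not set in the environment."""


def load_sql_credentials(environ=None):
    """Return (username, password) read from environment variables.

    SYNAPSE_SQL_USER and SYNAPSE_SQL_PASSWORD are hypothetical names
    chosen for this sketch; they are not defined by Azure Synapse.
    """
    env = os.environ if environ is None else environ
    required = ("SYNAPSE_SQL_USER", "SYNAPSE_SQL_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise MissingCredentialError(
            "Set these environment variables first: " + ", ".join(missing)
        )
    return env["SYNAPSE_SQL_USER"], env["SYNAPSE_SQL_PASSWORD"]


def mask(secret):
    """Return a fixed-length placeholder that is safe to print or log."""
    return "*" * 8 if secret else ""
```

Because the script never contains the secret itself, it can be shared or committed to source control without exposing the password. If you need to log that a credential was loaded, log mask(password) rather than the value.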

What happens when we embrace a new technology in an organization?

We usually need to find people who already know it, which adds to the cost of the technical implementation itself. However, Azure Synapse supports various programming languages, such as T-SQL, Python, Scala, Spark SQL, and .NET, making it easy to learn for people who are already familiar with those languages. In this chapter, we will show a demo for T-SQL, but we will cover examples for other languages in upcoming chapters.

The following diagram represents all the components of Azure Synapse and how all these components are tied together within Synapse Analytics:

Figure 1.1 – The components of Azure Synapse


The preceding diagram represents all components of Azure Synapse, which includes Analytics runtimes, supported languages, form factors, data integration, and Power BI workspaces. We will cover all these topics in upcoming chapters.

Important note

Although Azure Synapse is deeply integrated with Spark, Azure ML, and Power BI, you do not need to pay for all these services. You will pay only for the features/services that you use. If you are using an Azure Synapse workspace only for enterprise data warehousing, you will be charged only for that. You can find out complete pricing details in Microsoft's documentation: https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/.

 

Creating a Synapse workspace

A Synapse workspace provides an integrated console to manage, monitor, and administer all the components and services of Azure Synapse Analytics. To get started with Azure Synapse Analytics, we first need to create a workspace, which gives us access to the different features of the service.

You can create a Synapse workspace in the Azure portal just by providing some basic details. Follow these steps to create your first Azure Synapse workspace:

  1. Go to https://portal.azure.com and provide your credentials.
  2. Click on Create a resource:
    Figure 1.2 – A screenshot of the Azure portal


  3. Search for Azure Synapse using the search bar.
  4. Select Azure Synapse Analytics (Workspaces preview) from the search drop-down and click on Create:
    Figure 1.3 – A screenshot of the Azure Synapse Analytics page in Azure Marketplace


  5. You need to provide basic details to create your Synapse Analytics workspace:
    • Subscription: You need to select your subscription. If you have many subscriptions in your Azure account, you need to select a specific one that you are going to use to create a Synapse workspace.

      Important note

      All resources in a subscription are billed together.

    • Resource group: A Resource group is a container that holds all the resources for the solution, or only those resources that you want to manage under one group. Select a Resource group for the Synapse workspace. If you do not already have a Resource group created, click on Create new right below the text field for Resource group:
Figure 1.4 – A screenshot highlighting the field to provide a Resource group name


  • Workspace name: Provide an appropriate name for the workspace that you are going to create.

    Important note

    This name must be unique, so it is better to keep it specific to your team/project. 

  • Region: You can see many options in the dropdown. Select the most appropriate region for your Synapse Analytics workspace:
Figure 1.5 – A screenshot of regions appearing in a drop-down list


  • Select Data Lake Storage Gen2: This will be the primary storage account for the workspace, holding catalog data and metadata associated with the workspace:
Figure 1.6 – A screenshot highlighting fields of Select Data Lake Storage Gen2


  • Account name: You can select from the dropdown or you can create a new one. Only Data Lake Gen2 accounts with a hierarchical namespace enabled will appear in the dropdown. However, if you click on Create new, then it will create a Data Lake Gen2 account with hierarchical namespace enabled.

    Important note

    A storage account name must be between 3 and 24 characters in length and use numbers and lowercase letters only.

  • File system name: Again, you can select from the dropdown or you can create a new one. To create a new file system name, click on Create new and provide an appropriate name for it. A file system name must contain only lowercase letters, numbers, or hyphens:
Figure 1.7 – A screenshot highlighting assignment of the Storage Blob Data Contributor role


  6. Click on Security + networking to configure security options and networking settings for your workspace, as seen in Figure 1.8.

    Provide SQL administrator credentials that can be used for administrator access to the workspace's SQL pools. We will talk about SQL pools in future chapters:

    Figure 1.8 – A screenshot of the Security + networking form for Azure Synapse

  7. Click on Tags to provide a name-value pair to this resource.
  8. Go to the next page to review the summary and click on Create after verifying all the details on the summary page.
  9. In your Azure Synapse workspace in the Azure portal, click Open Synapse Studio:
Figure 1.9 – A screenshot highlighting the link for launching Synapse Studio


This deployment takes just a couple of minutes and creates a workspace that bundles Synapse analytics, ETL, reporting, modeling, and analysis together under one umbrella. Now you are ready to build your enterprise-level solution!
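Before filling in the portal form, it can help to check your chosen names against the rules mentioned in the steps above. The following sketch encodes only the two rules stated in this chapter (storage account names of 3 to 24 lowercase letters and numbers, and file system names made of lowercase letters, numbers, or hyphens). The WorkspaceForm class and its field names are hypothetical, and Azure enforces further constraints, such as name uniqueness, that are not checked here.

```python
import re
from dataclasses import dataclass

# Rules as stated in this chapter. Azure applies additional constraints
# (for example, uniqueness) that this sketch does not check.
STORAGE_ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")
FILE_SYSTEM_RE = re.compile(r"[a-z0-9-]+")


@dataclass
class WorkspaceForm:
    """Hypothetical container for the values entered on the Basics tab."""
    subscription: str
    resource_group: str
    workspace_name: str
    region: str
    storage_account: str
    file_system: str

    def problems(self):
        """Return a list of human-readable problems; an empty list means OK."""
        found = []
        for field_name, value in vars(self).items():
            if not value.strip():
                found.append(f"{field_name} is empty")
        if not STORAGE_ACCOUNT_RE.fullmatch(self.storage_account):
            found.append(
                "storage_account must be 3-24 lowercase letters or numbers"
            )
        if not FILE_SYSTEM_RE.fullmatch(self.file_system):
            found.append(
                "file_system may contain only lowercase letters, numbers, or hyphens"
            )
        return found
```

Calling problems() on a form with a storage account named Demo_Lake, for instance, reports the naming violation before you reach the portal's own validation.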

 

Understanding Azure Data Lake

A data lake is a storage repository that allows you to store data at any scale in its native format, without having to structure it first.

Azure Data Lake Storage provides secure, scalable, cost-effective storage for big data analytics. There are two generations of Azure Data Lake, Gen1 and Gen2; however, we will focus on Gen2 only throughout this chapter. Azure Data Lake Gen2 converges the capabilities of Azure Data Lake Gen1 with those of Azure Blob Storage by adding a Hierarchical Namespace to Blob Storage. Because it builds on Azure Blob Storage, you get high-availability and disaster-recovery options for your data lake at a low cost.

The new Azure Blob File System (ABFS) driver is available within Azure HDInsight, Azure Databricks, and Azure Synapse Analytics, which can be used to access the data in a similar way to Hadoop Distributed File System (HDFS).
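With the ABFS driver, a file or folder in Data Lake Storage Gen2 is addressed by a URI that names the file system, the storage account, and the path. The following sketch builds such a URI using the secure abfss scheme. The function name and the sample values are hypothetical, and the dfs.core.windows.net suffix assumes the public Azure cloud.

```python
def abfss_uri(account, file_system, path=""):
    """Build an ABFS URI for a file or folder in Data Lake Storage Gen2.

    Uses the TLS-secured 'abfss' scheme and assumes the public Azure
    cloud endpoint suffix 'dfs.core.windows.net'.
    """
    clean_path = path.lstrip("/")  # avoid a double slash after the host
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{clean_path}"


# Hypothetical account, file system, and path, for illustration only.
example = abfss_uri("demolake01", "raw-data", "/sales/2021/orders.csv")
```

A URI of this shape is what you would typically pass to a Spark reader when loading data from the lake, much as you would pass an hdfs:// path on a Hadoop cluster.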

To use Data Lake Storage Gen2's capabilities, you need to create a storage account that has a hierarchical namespace. You can go through the following steps to create your Azure Data Lake Storage Gen2 account:

  1. Log in to the Azure portal: https://portal.azure.com.
  2. Click on the + Create a Resource link and select Storage account from the list of all available resources.
  3. Select the Resource group where you want to create your storage account. If you don't have a Resource group created, click on the Create new link below the drop-down list.
  4. Fill in the fields for Storage account name and Location.  
  5. Select Standard or Premium Performance as per your business need. If you are new to Data Lake, then it would be better to begin with Standard.
  6. Select an appropriate value for Account kind and Replication as per the business need. Again, the recommendation would be to leave the default selected values in these fields if you are performing this operation just for your learning purposes:
    Figure 1.10 – Creating Azure Data Lake Gen2 in Azure


  7. For now, we can skip the Networking and Data protection tabs and move directly to the Advanced tab.
  8. Click on the Enabled radio button for the Hierarchical namespace property under the Advanced tab:
    Figure 1.11 – Enabling Hierarchical namespace for Data Lake Storage Gen2 on the Advanced tab


  9. Leave the default values for all other fields and click on Review + create.
  10. After reviewing all the details, click on Create and your Azure Data Lake Gen2 account will be created in a couple of minutes.
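The portal choices above correspond to a small set of properties on the storage account resource. As a rough sketch, the function below builds the resource description in the shape used by an Azure Resource Manager template, where enabling Hierarchical namespace maps to the isHnsEnabled property. The function name is hypothetical, the caller must supply a valid API version, and Standard_LRS is only this sketch's default rather than necessarily the portal's default.

```python
def adls_gen2_resource(name, location, api_version, sku="Standard_LRS"):
    """Return an ARM-template-style resource for a Gen2-capable account.

    name, location, and api_version come from the caller; 'Standard_LRS'
    is an illustrative default chosen for this sketch.
    """
    return {
        "type": "Microsoft.Storage/storageAccounts",
        "apiVersion": api_version,
        "name": name,
        "location": location,
        "sku": {"name": sku},
        # General-purpose v2 accounts support the hierarchical namespace.
        "kind": "StorageV2",
        # Equivalent to selecting Enabled for Hierarchical namespace.
        "properties": {"isHnsEnabled": True},
    }
```

Keeping these settings in code makes it easier to recreate the same account later in another subscription or region, instead of clicking through the portal again.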

Now that you have already created your Azure Data Lake Gen2 account, you can use this account with Azure Synapse Analytics. We will learn how to read data from Data Lake in later chapters, but for now, we will learn about Azure Synapse Studio, and how it provides a unified experience when working with various resources under one roof.

 

Exploring Synapse Studio

Synapse Studio is a unified experience for data preparation, data management, data warehousing, and big data analytics. Synapse Studio is a one-stop-shop for developers, data engineers, data scientists, and report analysts.

Before we start exploring more about Synapse Studio, we should know how we can get to Synapse Studio from the Azure portal. There are a couple of ways to navigate to Synapse Studio, but for that, first we need to navigate to our Synapse workspace on the Azure portal. In Figure 1.12, you can see Workspace web URL, which is highlighted. You can either click on that URL or copy that URL and paste it in your browser to access Synapse Studio:

Figure 1.12 – A screenshot of a Synapse workspace in the Azure portal highlighting the links to access Synapse Studio


Another simple approach is to just click on the Open Synapse Studio link under the Getting started section of the Synapse workspace.

You will need to provide credentials to access Synapse Studio. After successful authentication, you will see Synapse Studio opened in a new tab. You will find a direct link to various hubs integrated in Synapse Studio:

Figure 1.13 – A screenshot of the Synapse Studio Home page


As you can see in Figure 1.13, Synapse Studio has six different hubs. We will learn about all these hubs in brief here:

  • Home: The Home hub provides you with a direct link to ingest, explore, or visualize your data. You can also access your recent resources without wasting your time searching across all the resources available on your Synapse Studio. In fact, you can click on the New button at the top of the Synapse Studio screen to create a new SQL script, notebook, data flow, Apache Spark job definition, or pipeline. You do not need to be worried about any of these if you are new to Azure Synapse; we are going to cover all these topics in detail in other chapters:
Figure 1.14 – Synapse Studio highlighting the New button at the top of the screen


  • Data: The Data hub provides a simple way to organize your workspace databases and analytical stores for SQL as well as Spark. You can see two tabs in the Data hub: one is Workspace, which shows your SQL and Spark databases created and managed with your Azure Synapse workspace. The other tab is Linked, which shows connected services such as Data Lake Gen2, operational stores in Azure Cosmos DB, and so on:
Figure 1.15 – A screenshot of the Data hub on Synapse Studio


  • Develop: The Develop hub contains your SQL scripts, notebooks, data flows, and Spark job definitions. You can also find all your Power BI reports created in your Power BI workspace if you have already connected your Power BI workspace with the Synapse workspace. We will learn more about this in Chapter 8, Integrating a Power BI Workspace with Azure Synapse:
Figure 1.16 – A screenshot of the Develop hub on Synapse Studio


  • Integrate: You will find a lot of similarities between the Integrate hub of Synapse Studio and Azure Data Factory if you are familiar with Azure Data Factory already. You can create new data pipelines to perform one-time or scheduled data ingestion from 90+ data sources. We will learn more about this in Chapter 4, Using Synapse Pipelines to Orchestrate Your Data:

Figure 1.17 – Creating a pipeline in the Integrate hub of Synapse Studio


  • Monitor: The Monitor hub enables you to see the statuses of all your Integration resources, activities, and pools in one place:
Figure 1.18 – A screenshot of the Monitor hub in Synapse Studio


  • Manage: From the Manage hub, you can manage your SQL pools, Spark pools, linked services, triggers, and integration runtimes. The Manage hub also provides you with the ability to manage access control and credentials for your Synapse workspace. Microsoft has recently added Git configuration to the Manage hub as well:
Figure 1.19 – A screenshot of the Manage hub on Synapse Studio


In this section, we got an introduction to Synapse Studio. In the following chapters, we are going to explore Synapse Studio in more detail.
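As a quick reference, the six hubs and the items described for each can be captured in a small lookup table. This is only a study aid built from the descriptions in this section: the item labels are paraphrased, and the HUBS table and hubs_for function are hypothetical names that do not correspond to any Synapse API.

```python
# Hub names follow Synapse Studio; item labels paraphrase this section.
HUBS = {
    "Home": ["ingest", "explore", "visualize", "recent resources"],
    "Data": [
        "SQL database",
        "Spark database",
        "connected Data Lake Gen2 storage",
        "connected Azure Cosmos DB store",
    ],
    "Develop": [
        "SQL script",
        "notebook",
        "data flow",
        "Spark job definition",
        "Power BI report",
    ],
    "Integrate": ["pipeline"],
    "Monitor": ["integration status", "activity status", "pool status"],
    "Manage": [
        "SQL pool",
        "Spark pool",
        "linked service",
        "trigger",
        "integration runtime",
        "access control",
        "credentials",
        "Git configuration",
    ],
}


def hubs_for(item):
    """Return the hubs whose items mention the given text (case-insensitive)."""
    needle = item.lower()
    return [
        hub
        for hub, entries in HUBS.items()
        if any(needle in entry.lower() for entry in entries)
    ]
```

For example, hubs_for("notebook") points you to the Develop hub, while hubs_for("trigger") points you to the Manage hub.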

 

Summary

In this chapter, we covered an introduction to Azure Synapse and how you can create your first Azure Synapse workspace. After going through the sample scripts, you should have a fairly good idea about how Azure Synapse Studio works, and some of the different languages supported by Azure Synapse. We also discussed the differences between Azure SQL Data Warehouse and Azure Synapse. You learned about pausing and resuming a SQL pool, as well as automatic pausing of a Spark pool, which will save you some money if implemented.

In the next chapter, we will begin to look at specific analytics runtimes you need to understand and create your first Spark and SQL pool.

About the Author

  • Prashant Kumar Mishra

    Prashant Kumar Mishra is an engineering architect at Microsoft. He has more than 10 years of professional expertise in the Microsoft data and AI segment as a developer, consultant, and architect. He has been focused on Microsoft Azure Cloud technologies for several years now and has helped various customers in their data journey. He prefers to share his knowledge with others to make the data community stronger day by day through his blogs and meetup groups.
