Cloud Scale Analytics with Azure Data Services

Product type Book
Published in Jul 2021
Publisher Packt
ISBN-13 9781800562936
Pages 520
Edition 1st Edition
Author (1): Patrik Borosch

Table of Contents (20 chapters)

Preface
Section 1: Data Warehousing and Considerations Regarding Cloud Computing
Chapter 1: Balancing the Benefits of Data Lakes Over Data Warehouses
Chapter 2: Connecting Requirements and Technology
Section 2: The Storage Layer
Chapter 3: Understanding the Data Lake Storage Layer
Chapter 4: Understanding Synapse SQL Pools and SQL Options
Section 3: Cloud-Scale Data Integration and Data Transformation
Chapter 5: Integrating Data into Your Modern Data Warehouse
Chapter 6: Using Synapse Spark Pools
Chapter 7: Using Databricks Spark Clusters
Chapter 8: Streaming Data into Your MDWH
Chapter 9: Integrating Azure Cognitive Services and Machine Learning
Chapter 10: Loading the Presentation Layer
Section 4: Data Presentation, Dashboarding, and Distribution
Chapter 11: Developing and Maintaining the Presentation Layer
Chapter 12: Distributing Data
Chapter 13: Introducing Industry Data Models
Chapter 14: Establishing Data Governance
Other Books You May Enjoy

Chapter 7: Using Databricks Spark Clusters

In the last chapter, Chapter 6, Using Synapse Spark Pools, you learned about Spark and the Synapse-integrated Spark engine. But what about cases where you only need a Spark cluster to interact with your Data Lake Store? At the time of writing, you would choose Databricks over Synapse Spark pools when you need to work with Spark 3.0 or to implement Structured Streaming. You will also need Databricks if R is a required programming language, or if you want Databricks-specific Delta Lake features, such as vacuuming. Synapse will offer most of these options in the future, too, but at the moment they are only available in Databricks.

With Azure Databricks, Microsoft offers a standalone Spark environment that gives you all the aforementioned options and can still integrate with other Azure data services if needed. And with Databricks, you have the people who invented Spark at your back. The cluster architecture...

Technical requirements

For this chapter, you will need the following:

  • An Azure subscription where you have at least contributor rights or you are the owner
  • The right to create a service principal in Azure Active Directory
  • The right to provision a Databricks workspace

Provisioning Databricks

Provisioning a Databricks workspace is as easy as provisioning the services in the previous chapters:

  1. First, navigate to the Azure portal and click Create a resource.
  2. In the search box, type Databricks and select Azure Databricks from the quick results displayed beneath the search. The Databricks info is displayed.
  3. Click Create and start the provisioning.
  4. In the Basics blade, fill in or select the values for the input fields. You will need to select the subscription in which to create your Databricks workspace, and either select an existing resource group or create a new one (see Chapter 3, Understanding the Data Lake Storage Layer, for a description of resource groups). Name your workspace here and assign it to the region that suits you best. For Pricing Tier, select the appropriate option; for a first test, you might select Trial (Premium - 14 Days Free DBUs), as this won't cost anything. You can then proceed with Next: Networking...
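Behind the portal wizard, these steps correspond to a single deployment call against Azure Resource Manager. The following is a minimal sketch of how that request could be assembled, assuming the Microsoft.Databricks ARM provider and its 2018-04-01 API version; the subscription ID, resource group, and workspace names are hypothetical placeholders:

```python
# Sketch: provisioning a Databricks workspace through the ARM REST API.
# Subscription ID, resource group, and workspace name are placeholders.

SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "rg-mdwh-demo"
WORKSPACE = "dbw-mdwh-demo"

def workspace_request(subscription, resource_group, name,
                      location="westeurope", sku="trial"):
    """Build the URL and body for a PUT on Microsoft.Databricks/workspaces."""
    url = (
        f"https://management.azure.com/subscriptions/{subscription}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Databricks/workspaces/{name}"
        "?api-version=2018-04-01"
    )
    body = {
        "location": location,
        "sku": {"name": sku},  # "trial" maps to Trial (Premium - 14 Days Free DBUs)
        "properties": {
            # Databricks places its cluster VMs in a locked, managed resource group
            "managedResourceGroupId": (
                f"/subscriptions/{subscription}"
                f"/resourceGroups/{resource_group}-managed"
            ),
        },
    }
    return url, body

url, body = workspace_request(SUBSCRIPTION, RESOURCE_GROUP, WORKSPACE)
# An authenticated client would now PUT `body` to `url`, for example:
# requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"})
```

The same payload shape is what an ARM template or the Azure CLI would submit for you; the portal is just one front end to it.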

Examining the Databricks workspace

Like ADF and Synapse Analytics, Databricks follows the concept of a browser-based workspace interface. When your basic deployment has succeeded and you navigate to the resource in the Azure portal, you will find the Launch Workspace button displayed prominently on the Overview blade of your Databricks service:

Figure 7.3 – The Launch Workspace button on the Overview Blade of the Databricks service

Click on the button to enter your workspace for the first time. You are taken to the Databricks workspace portal:

Figure 7.4 – Entering your Databricks workspace

You will find a navigation area on the left side of the screen. Here, you have access to all the different areas of Databricks, such as your workspace, recent artifacts, data environments such as databases and tables, your clusters, Spark jobs, and machine learning models. The final option in the navigation area is Search.

Click...

Understanding the Databricks components

In the last chapter, Chapter 6, Using Synapse Spark Pools, we examined the basic Spark architecture, and Databricks follows those rules as well. You will find driver and worker nodes that process your requests. And we shouldn't forget that Databricks was the first to deliver autoscaling Spark as a Service, which will even shut the compute environment down once an idle-time threshold is reached.

Although Databricks is based on Apache Spark, it has built its own runtime, optimized for usage on Azure. When you spin up a cluster, for example, different sessions will reuse the same cluster rather than instantiating a new one per session, as is the case with Synapse Spark pools.

Creating Databricks clusters

This section will take you through the provisioning process of a Databricks cluster. You will see the different node sizes and the options that you have, such as autotermination and autoscaling, when you create your compute engine here.

But let's see...

Setting up security

The two most important aspects to examine in the security mechanisms of Databricks are networking and access controls.

Examining access controls

Access control lists are, as you have already seen in Chapter 3, Understanding the Data Lake Storage Layer, a fine-grained method for controlling who can see what and do what in the environment. If you have chosen to provision a premium-plan workspace, you can set up access control lists for the following:

  • Workspaces: Within workspaces, you can set ACLs on a finer level for Folders, Notebooks, and MLflow Experiments. The different object types will show up with several different abilities that you can control. You can find a detailed overview via the link in the Further reading, Workspace Access Control, section. You can configure No Permissions, Read, Run, Edit, and Manage for the different abilities in the different artifacts.
  • Clusters: For clusters, you can set No Permissions...
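On a premium-plan workspace, these ACLs can also be managed programmatically. The sketch below assumes the 2.0 Permissions REST API and its cluster permission levels (such as CAN_ATTACH_TO and CAN_MANAGE); the cluster ID and user name are hypothetical placeholders:

```python
# Sketch: granting a user cluster permissions via the Permissions API (2.0).
# The cluster ID and user name below are placeholders.

def cluster_acl_request(cluster_id, user_name, permission_level="CAN_ATTACH_TO"):
    """Build the path and body for a PATCH that adds one cluster ACL entry."""
    path = f"/api/2.0/permissions/clusters/{cluster_id}"
    body = {
        "access_control_list": [
            {
                "user_name": user_name,
                "permission_level": permission_level,
            }
        ]
    }
    return path, body

path, body = cluster_acl_request("0123-456789-abc123", "analyst@contoso.com")
# PATCH this body to https://<workspace-url><path>; a PATCH adds the grant
# alongside existing entries, while a PUT would replace the whole ACL.
```

The same endpoint family covers the other securable objects mentioned above (folders, notebooks, experiments), each with its own set of permission levels.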

Monitoring Databricks

Azure Monitor is the Azure-wide log collection service that enables you to collect, analyze, and correlate logs from Azure, but also from on-premises applications. With Azure Monitor, you will be able to analyze not just one particular service, but bring together information from a wider context, and with this, develop a new level of understanding and insights.

As Azure Databricks is not (yet) natively integrated with Azure Monitor, your applications will need to use an additional library to inject your log events into the Log Analytics workspace of Azure Monitor. Microsoft provides a GitHub repository where you can download and build the required library to be used in your code. You can find the link to the documentation in the Further reading, Monitoring Databricks, section.
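Besides building that library, another route for custom application logs is the Azure Monitor HTTP Data Collector API, which ingests JSON events straight into a Log Analytics workspace. The sketch below builds the SharedKey authorization header that this API requires; the workspace ID and key are dummy placeholders:

```python
import base64
import hashlib
import hmac

# Sketch: the SharedKey signature for the Azure Monitor HTTP Data
# Collector API, which posts custom log events to a Log Analytics
# workspace. Workspace ID and key below are dummy placeholders.

def build_signature(workspace_id, shared_key, date_rfc1123, content_length):
    # String-to-sign format defined by the Data Collector API
    string_to_sign = (
        f"POST\n{content_length}\napplication/json\n"
        f"x-ms-date:{date_rfc1123}\n/api/logs"
    )
    digest = hmac.new(
        base64.b64decode(shared_key),        # the workspace's base64 primary key
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return f"SharedKey {workspace_id}:{base64.b64encode(digest).decode()}"

auth = build_signature(
    "11111111-2222-3333-4444-555555555555",  # Log Analytics workspace ID
    "ZHVtbXkta2V5",                          # dummy base64 key
    "Mon, 05 Jul 2021 12:00:00 GMT",
    256,                                     # byte length of the JSON payload
)
# Send the JSON events with this Authorization header (plus x-ms-date and
# Log-Type headers) to
# https://<workspace-id>.ods.opinsights.azure.com/api/logs?api-version=2016-04-01
```

This route covers ad hoc application events; for full Spark metrics and driver/executor logs, the library from the Microsoft repository remains the more complete option.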

Summary

This chapter took us into the world of Databricks. You provisioned a Databricks workspace and examined it. In the workspace, you created a new Spark cluster and learned how to manage it.

You created a Databricks notebook, ran it interactively, and saw how to visualize data in your notebook. You also saw how to create a batch job from your notebook and learned about other alternatives for running code as a batch in your environment.

In the section that followed, you learned about Databricks tables and we examined additional capabilities, such as using Delta Lake to manage your data in your environment.

We saw how to add functionality using third-party libraries and how to create dashboards from your data.

Finally, we examined security features, such as access controls and secrets, and learned about networking features and how to integrate with Azure Monitor.

There are many more topics related to Azure Databricks that would have exceeded the capacity...
