Cloud Scale Analytics with Azure Data Services

Product type Book
Published in Jul 2021
Publisher Packt
ISBN-13 9781800562936
Pages 520
Edition 1st Edition
Author (1): Patrik Borosch

Table of Contents (20 chapters)

Preface
Section 1: Data Warehousing and Considerations Regarding Cloud Computing
Chapter 1: Balancing the Benefits of Data Lakes Over Data Warehouses
Chapter 2: Connecting Requirements and Technology
Section 2: The Storage Layer
Chapter 3: Understanding the Data Lake Storage Layer
Chapter 4: Understanding Synapse SQL Pools and SQL Options
Section 3: Cloud-Scale Data Integration and Data Transformation
Chapter 5: Integrating Data into Your Modern Data Warehouse
Chapter 6: Using Synapse Spark Pools
Chapter 7: Using Databricks Spark Clusters
Chapter 8: Streaming Data into Your MDWH
Chapter 9: Integrating Azure Cognitive Services and Machine Learning
Chapter 10: Loading the Presentation Layer
Section 4: Data Presentation, Dashboarding, and Distribution
Chapter 11: Developing and Maintaining the Presentation Layer
Chapter 12: Distributing Data
Chapter 13: Introducing Industry Data Models
Chapter 14: Establishing Data Governance
Other Books You May Enjoy

Chapter 7: Using Databricks Spark Clusters

In the last chapter, Chapter 6, Using Synapse Spark Pools, you learned about Spark and the Synapse-integrated Spark engine. But what about cases where you only need a Spark cluster to interact with your Data Lake Store? At the time of writing, you would choose Databricks over Synapse Spark pools when you need to work with Spark 3.0 or to implement Structured Streaming. You will also need Databricks if R is a required programming language, or if you want Databricks-specific Delta Lake features, such as vacuuming. Synapse will offer most of these options in the future, too, but at the moment they are only available in Databricks.

With Azure Databricks, Microsoft offers a standalone Spark environment that gives you all the aforementioned options and can still integrate with other Azure data services if needed. And with Databricks, you have the people who invented Spark at your back. The cluster architecture...

Technical requirements

For this chapter, you will need the following:

  • An Azure subscription where you have at least contributor rights or you are the owner
  • The right to create a service principal in Azure Active Directory
  • The right to provision a Databricks workspace

Provisioning Databricks

Provisioning a Databricks workspace is as easy as provisioning the services in the previous chapters:

  1. First, navigate to the Azure portal and click Create a resource.
  2. In the search box, type Databricks and select Azure Databricks from the quick results displayed beneath the search. The Databricks info is displayed.
  3. Click Create and start the provisioning.
  4. In the Basics blade, fill in or select the values for the input fields. You will need to select the subscription in which to create your Databricks workspace, and either select an existing resource group or create a new one (see Chapter 3, Understanding the Data Lake Storage Layer, for a description of resource groups). Name your workspace here and assign it to the region that suits you best. For Pricing Tier, select the appropriate option; for a first test, you might select Trial (Premium - 14 Days Free DBUs), as this won't cost anything. You can then proceed with Next: Networking...
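Behind the portal wizard, these steps correspond to a single deployment call against Azure Resource Manager. The following is a minimal sketch of how that request could be assembled, assuming the Microsoft.Databricks ARM provider and its 2018-04-01 API version; the subscription ID, resource group, and workspace names are hypothetical placeholders:

```python
# Sketch: provisioning a Databricks workspace through the ARM REST API.
# Subscription ID, resource group, and workspace name are placeholders.

SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "rg-mdwh-demo"
WORKSPACE = "dbw-mdwh-demo"

def workspace_request(subscription, resource_group, name,
                      location="westeurope", sku="trial"):
    """Build the URL and body for a PUT on Microsoft.Databricks/workspaces."""
    url = (
        f"https://management.azure.com/subscriptions/{subscription}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Databricks/workspaces/{name}"
        "?api-version=2018-04-01"
    )
    body = {
        "location": location,
        "sku": {"name": sku},  # "trial" maps to Trial (Premium - 14 Days Free DBUs)
        "properties": {
            # Databricks places its cluster VMs in a locked, managed resource group
            "managedResourceGroupId": (
                f"/subscriptions/{subscription}"
                f"/resourceGroups/{resource_group}-managed"
            ),
        },
    }
    return url, body

url, body = workspace_request(SUBSCRIPTION, RESOURCE_GROUP, WORKSPACE)
# An authenticated client would now PUT `body` to `url`, for example:
# requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"})
```

The same payload shape is what an ARM template or the Azure CLI would submit for you; the portal is just one front end to it.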

Examining the Databricks workspace

Like ADF and Synapse Analytics, Databricks follows the concept of a browser-based workspace interface. When your basic deployment has succeeded and you navigate to the resource in the Azure portal, you will find the Launch Workspace button displayed prominently on the Overview blade of your Databricks service:

Figure 7.3 – The Launch Workspace button on the Overview Blade of the Databricks service

Click on the button to enter your workspace for the first time. You are taken to the Databricks workspace portal:

Figure 7.4 – Entering your Databricks workspace

You will find a navigation area on the left side of the screen. Here, you have access to all the different areas of Databricks, such as your workspace, recent artifacts, data environments such as databases and tables, your clusters, Spark jobs, and machine learning models. The final option in the navigation area is Search.

Click...

Understanding the Databricks components

In the last chapter, Chapter 6, Using Synapse Spark Pools, we examined the basic Spark architecture, and Databricks follows those rules as well. You will find driver and worker nodes that process your requests. And we shouldn't forget that Databricks was the first to deliver autoscaling Spark as a Service, which will even shut the compute environment down once an idle-time threshold is reached.

Although Databricks is based on Apache Spark, it has built its own runtime, optimized for usage on Azure. When you spin up a cluster, for example, different sessions will reuse the same cluster rather than instantiating a new one per session, as is the case with Synapse Spark pools.

Creating Databricks clusters

This section will take you through the provisioning process of a Databricks cluster. You will see the different node sizes and the options that you have, such as autotermination and autoscaling, when you create your compute engine here.

But let's see...

Setting up security

The two most important aspects to examine in the security mechanisms of Databricks are networking and access controls.

Examining access controls

Access control lists are, as you have already seen in Chapter 3, Understanding the Data Lake Storage Layer, a fine-grained method for controlling who can see what and do what in the environment. If you have chosen to provision a premium-plan workspace, you can set up access control lists for the following:

  • Workspaces: Within workspaces, you can set ACLs on a finer level for Folders, Notebooks, and MLflow Experiments. The different object types will show up with several different abilities that you can control. You can find a detailed overview via the link in the Further reading, Workspace Access Control, section. You can configure No Permissions, Read, Run, Edit, and Manage for the different abilities in the different artifacts.
  • Clusters: For clusters, you can set No Permissions...
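On a premium-plan workspace, these ACLs can also be managed programmatically. The sketch below assumes the 2.0 Permissions REST API and its cluster permission levels (such as CAN_ATTACH_TO and CAN_MANAGE); the cluster ID and user name are hypothetical placeholders:

```python
# Sketch: granting a user cluster permissions via the Permissions API (2.0).
# The cluster ID and user name below are placeholders.

def cluster_acl_request(cluster_id, user_name, permission_level="CAN_ATTACH_TO"):
    """Build the path and body for a PATCH that adds one cluster ACL entry."""
    path = f"/api/2.0/permissions/clusters/{cluster_id}"
    body = {
        "access_control_list": [
            {
                "user_name": user_name,
                "permission_level": permission_level,
            }
        ]
    }
    return path, body

path, body = cluster_acl_request("0123-456789-abc123", "analyst@contoso.com")
# PATCH this body to https://<workspace-url><path>; a PATCH adds the grant
# alongside existing entries, while a PUT would replace the whole ACL.
```

The same endpoint family covers the other securable objects mentioned above (folders, notebooks, experiments), each with its own set of permission levels.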

Monitoring Databricks

Azure Monitor is the Azure-wide log collection service that enables you to collect, analyze, and correlate logs from Azure, but also from on-premises applications. With Azure Monitor, you will be able to analyze not just one particular service, but bring together information from a wider context, and with this, develop a new level of understanding and insights.

As Azure Databricks is not (yet) natively integrated with Azure Monitor, your applications will need to use an additional library to inject your log events into the Log Analytics workspace of Azure Monitor. Microsoft provides a GitHub repository where you can download and build the required library to be used in your code. You can find the link to the documentation in the Further reading, Monitoring Databricks, section.
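Besides building that library, another route for custom application logs is the Azure Monitor HTTP Data Collector API, which ingests JSON events straight into a Log Analytics workspace. The sketch below builds the SharedKey authorization header that this API requires; the workspace ID and key are dummy placeholders:

```python
import base64
import hashlib
import hmac

# Sketch: the SharedKey signature for the Azure Monitor HTTP Data
# Collector API, which posts custom log events to a Log Analytics
# workspace. Workspace ID and key below are dummy placeholders.

def build_signature(workspace_id, shared_key, date_rfc1123, content_length):
    # String-to-sign format defined by the Data Collector API
    string_to_sign = (
        f"POST\n{content_length}\napplication/json\n"
        f"x-ms-date:{date_rfc1123}\n/api/logs"
    )
    digest = hmac.new(
        base64.b64decode(shared_key),        # the workspace's base64 primary key
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return f"SharedKey {workspace_id}:{base64.b64encode(digest).decode()}"

auth = build_signature(
    "11111111-2222-3333-4444-555555555555",  # Log Analytics workspace ID
    "ZHVtbXkta2V5",                          # dummy base64 key
    "Mon, 05 Jul 2021 12:00:00 GMT",
    256,                                     # byte length of the JSON payload
)
# Send the JSON events with this Authorization header (plus x-ms-date and
# Log-Type headers) to
# https://<workspace-id>.ods.opinsights.azure.com/api/logs?api-version=2016-04-01
```

This route covers ad hoc application events; for full Spark metrics and driver/executor logs, the library from the Microsoft repository remains the more complete option.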

Summary

This chapter took us into the world of Databricks. You provisioned a Databricks workspace and examined it. In the workspace, you created a new Spark cluster and learned how to manage it.

You created a Databricks notebook, ran it interactively, and saw how to visualize data in your notebook. You also saw how to create a batch job from your notebook and learned about other alternatives for running code as a batch in your environment.

In the section that followed, you learned about Databricks tables and we examined additional capabilities, such as using Delta Lake to manage your data in your environment.

We saw how to add functionality using third-party libraries and how to create dashboards from your data.

Finally, we examined security features, such as access controls and secrets, and learned about networking features and how to integrate with Azure Monitor.

There are many more topics related to Azure Databricks that would have exceeded the capacity...
