Azure Databricks Cookbook

By Phani Raj and Vinod Jaiswal

About this book

Azure Databricks is a unified collaborative platform for performing scalable analytics in an interactive environment. The Azure Databricks Cookbook provides recipes to get hands-on with the analytics process, including ingesting data from various batch and streaming sources and building a modern data warehouse.

The book starts by teaching you how to create an Azure Databricks instance within the Azure portal, Azure CLI, and ARM templates. You’ll work through clusters in Databricks and explore recipes for ingesting data from sources, including files, databases, and streaming sources such as Apache Kafka and EventHub. The book will help you explore all the features supported by Azure Databricks for building powerful end-to-end data pipelines. You'll also find out how to build a modern data warehouse by using Delta tables and Azure Synapse Analytics. Later, you’ll learn how to write ad hoc queries and extract meaningful insights from the data lake by creating visualizations and dashboards with Databricks SQL. Finally, you'll deploy and productionize a data pipeline as well as deploy notebooks and Azure Databricks service using continuous integration and continuous delivery (CI/CD).

By the end of this Azure book, you'll be able to use Azure Databricks to streamline different processes involved in building data-driven apps.

Publication date: September 2021
Publisher: Packt
Pages: 452
ISBN: 9781789809718

 

Chapter 1: Creating an Azure Databricks Service

Azure Databricks is a high-performance Apache Spark-based platform that has been optimized for the Microsoft Azure cloud.

It offers three environments for building and developing data applications:

  • Databricks Data Science and Engineering: This provides an interactive workspace that enables collaboration between data engineers, data scientists, machine learning engineers, and business analysts and allows you to build big data pipelines.
  • Databricks SQL: This allows you to run ad hoc SQL queries on your data lake and supports multiple visualization types to explore your query results.
  • Databricks Machine Learning: This provides an end-to-end machine learning environment for feature development, model training, and experiment tracking, along with model serving and management.

In this chapter, we will cover how to create an Azure Databricks service using the Azure portal, Azure CLI, and ARM templates. We will learn about different...

 

Technical requirements

To follow along with the examples in this chapter, you will need to have the following:

 

Creating a Databricks workspace in the Azure portal

There are multiple ways we can create an Azure Databricks service. This recipe will focus on creating the service in the Azure portal. This method is usually used for learning purposes or ad hoc requests. The preferred methods for creating services are Azure PowerShell, the Azure CLI, and ARM templates.

By the end of this recipe, you will have learned how to create an Azure Databricks service instance using the Azure portal.

Getting ready

You will need access to an Azure subscription and the Contributor role on it.
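If you want to verify the role assignment before starting, the Azure CLI can list it. Here is a minimal sketch, where user@example.com is a placeholder for your own sign-in:

    # Sign in interactively, then list the roles assigned to you
    # on the subscription (user@example.com is a placeholder).
    az login
    az role assignment list --assignee user@example.com --output table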

How to do it…

Follow these steps to create a Databricks service using the Azure portal:

  1. Log into the Azure portal (https://portal.azure.com) and click on Create a resource. Then, search for Azure Databricks and click on Create:

    Figure 1.1 – Azure Databricks – Create button

  2. Create a new resource group (CookbookRG) or pick any existing resource group...
 

Creating a Databricks service using the Azure CLI (command-line interface)

In this recipe, we will look at an automated way of creating and managing a Databricks workspace using the Azure CLI.

By the end of this recipe, you will know how to use the Azure CLI and deploy Azure Databricks. Knowing how to deploy resources using the CLI will help you automate the task of deploying from your DevOps pipeline or running the task from a PowerShell terminal.

Getting ready

Azure hosts Azure Cloud Shell, which can be used to work with Azure services. Cloud Shell comes with the Azure CLI preinstalled, so commands can be executed without us needing to install anything in our local environment.

This recipe will use a service principal (SP) to authenticate to Azure so that we can deploy the Azure Databricks workspace. Before we can run the deployment script, we must create an SP.

You can find out how to create an Azure AD app and an SP in the Azure portal by going to Microsoft identity platform | Microsoft...
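As a sketch of what this recipe builds toward, the following commands create an SP, sign in with it, and deploy a workspace using the Azure CLI's databricks extension. The names, IDs, and region below are placeholders, not values from the book:

    # Create a service principal scoped to the target resource group;
    # note the appId, password, and tenant values it returns.
    az ad sp create-for-rbac \
      --name CookbookSP \
      --role Contributor \
      --scopes /subscriptions/<subscription-id>/resourceGroups/CookbookRG

    # Sign in as the service principal.
    az login --service-principal \
      --username <appId> --password <password> --tenant <tenant-id>

    # Deploy the workspace (az offers to install the databricks
    # extension on first use).
    az databricks workspace create \
      --resource-group CookbookRG \
      --name cookbook-databricks-ws \
      --location eastus \
      --sku standard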

 

Creating a Databricks service using Azure Resource Manager (ARM) templates

Using ARM templates is a well-known method for deploying resources in Azure.

By the end of this recipe, you will have learned how to deploy an Azure Databricks workspace using ARM templates. ARM templates can be deployed from an Azure DevOps pipeline, as well as by using PowerShell or CLI commands.

Getting ready

In this recipe, we will use a service principal to authenticate to Azure so that we can deploy the Azure Databricks workspace. Before running the deployment script, you must create an SP.

You can find out how to create an Azure AD app and service principal in the Azure portal by going to Microsoft identity platform | Microsoft Docs (https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal).

For service principal authentication...
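To sketch the deployment step itself: assuming you have a template file named azuredeploy.json that exposes a workspaceName parameter (both names are placeholders for your own template), it can be deployed from the CLI as follows:

    # Deploy an ARM template that defines a Databricks workspace
    # into an existing resource group.
    az deployment group create \
      --resource-group CookbookRG \
      --template-file azuredeploy.json \
      --parameters workspaceName=cookbook-databricks-ws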

 

Adding users and groups to the workspace

In this recipe, we will learn how to add users and groups to the workspace so that they can collaborate when creating data applications. This exercise will provide you with a detailed understanding of how users are created in a workspace. You will also learn about the different permissions that can be granted to users.

Getting ready

Log into the Databricks workspace as an Azure Databricks admin. Before you add a user to the workspace, ensure that the user exists in Azure Active Directory.

How to do it…

Follow these steps to create users and groups from the Admin Console:

  1. From the Azure Databricks service, click on the Launch workspace option.
  2. After launching the workspace, click the user icon at the top right and click on Admin Console, as shown in the following screenshot:
    Figure 1.12 – Azure Databricks workspace

  3. You can click on Add User and invite users who are part of your Azure Active Directory...
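The same provisioning can also be scripted against the workspace's SCIM API. Here is a minimal sketch, assuming your workspace URL and a personal access token are exported as the placeholder variables DATABRICKS_HOST and DATABRICKS_TOKEN, and that user@example.com exists in your Azure Active Directory:

    # Add an Azure AD user to the workspace via the SCIM API
    # (the scripted equivalent of the Add User button).
    curl -X POST "$DATABRICKS_HOST/api/2.0/preview/scim/v2/Users" \
      -H "Authorization: Bearer $DATABRICKS_TOKEN" \
      -H "Content-Type: application/scim+json" \
      -d '{
            "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
            "userName": "user@example.com"
          }'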
 

Creating a cluster from the user interface (UI)

In this recipe, we will look at the different types of clusters and cluster modes in Azure Databricks and how to create them. Understanding cluster types will help you determine the right type of cluster you should use for your workload and usage pattern.

Getting ready

Before you get started, ensure you have created an Azure Databricks workspace, as shown in the preceding recipes.

How to do it…

Follow these steps to create a cluster via the UI:

  1. After launching your workspace, click on the Clusters option in the left-hand pane:

    Figure 1.17 – Create Cluster page

  2. Provide a name for your cluster and select a cluster mode based on your scenario. Here, we are selecting Standard.
  3. We are not selecting a pool here. Let's go with the latest versions available at the time of writing this book: Spark 3.0.1 and Databricks Runtime 7.4.
  4. The possibility to...
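If you prefer scripting to the UI, the Clusters REST API accepts an equivalent cluster definition. Here is a minimal sketch, with DATABRICKS_HOST and DATABRICKS_TOKEN as placeholder variables and the runtime and node type chosen purely for illustration:

    # Create an autoscaling cluster via the REST API; spark_version
    # and node_type_id are illustrative values.
    curl -X POST "$DATABRICKS_HOST/api/2.0/clusters/create" \
      -H "Authorization: Bearer $DATABRICKS_TOKEN" \
      -d '{
            "cluster_name": "cookbook-cluster",
            "spark_version": "7.4.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "autoscale": { "min_workers": 2, "max_workers": 4 }
          }'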
 

Getting started with notebooks and jobs in Azure Databricks

In this recipe, we will import a notebook into our workspace and learn how to execute and schedule it using jobs. By the end of this recipe, you will know how to import, create, execute, and schedule notebooks in Azure Databricks.

Getting ready

Ensure the Databricks cluster is up and running. Clone the cookbook repository from https://github.com/PacktPublishing/Azure-Databricks-Cookbook to any location on your laptop/PC. You will find the required demo files in the chapter-01 folder.
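For example, the repository can be cloned from a terminal as follows:

    # Clone the book's code repository; the demo files for this
    # recipe are in the chapter-01 folder.
    git clone https://github.com/PacktPublishing/Azure-Databricks-Cookbook.git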

How to do it…

Let's dive into importing the notebook into our workspace:

  1. First, let's create a simple notebook that will be used to create a new job and schedule it.
  2. In the cloned repository, go to chapter-01. You will find a file called DemoRun.dbc. You can import the .dbc file into your workspace by right-clicking the Shared workspace and selecting the Import option:

    Figure 1.20 – Importing the dbc...
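Scheduling can also be done programmatically through the Jobs API once the notebook is imported. Here is a minimal sketch, assuming the notebook landed at /Shared/DemoRun and using DATABRICKS_HOST, DATABRICKS_TOKEN, and the cluster ID as placeholders:

    # Create a job that runs the imported notebook on an existing
    # cluster every day at 08:00 UTC (Quartz cron syntax).
    curl -X POST "$DATABRICKS_HOST/api/2.0/jobs/create" \
      -H "Authorization: Bearer $DATABRICKS_TOKEN" \
      -d '{
            "name": "demo-run-job",
            "existing_cluster_id": "<cluster-id>",
            "notebook_task": { "notebook_path": "/Shared/DemoRun" },
            "schedule": {
              "quartz_cron_expression": "0 0 8 * * ?",
              "timezone_id": "UTC"
            }
          }'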

 

Authenticating to Databricks using a PAT

To authenticate and access Databricks REST APIs, we can use two types of tokens:

  • Personal access tokens (PATs)
  • Azure Active Directory tokens

A PAT is used as an alternative to a password when authenticating to and accessing the Databricks REST APIs. By the end of this recipe, you will have learned how to use PATs to access the Spark managed tables that we created in the preceding recipes using Power BI Desktop and create basic visualizations.

Getting ready

PATs are enabled by default for all Databricks workspaces created in 2018 or later. If they are not enabled, an administrator can enable or disable tokens for a workspace, regardless of when it was created.

Users can create PATs and use them in REST API requests. Tokens have optional expiration dates and can be revoked.
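To illustrate how a PAT is presented to the REST APIs, here is a minimal sketch that lists the clusters in a workspace; the workspace URL and token values are placeholders:

    # A PAT is passed as a bearer token in the Authorization header.
    export DATABRICKS_HOST="https://<databricks-instance>"
    export DATABRICKS_TOKEN="<personal-access-token>"

    curl -X GET "$DATABRICKS_HOST/api/2.0/clusters/list" \
      -H "Authorization: Bearer $DATABRICKS_TOKEN"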

How to do it…

This section will show you how to generate PATs using the Azure Databricks UI. Also, apart from the UI, you can use the Token API to generate and revoke tokens. However, there...
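As a sketch of the Token API mentioned above, the following request creates a new token with a one-hour lifetime, using the same placeholder variables as before:

    # Create a PAT that expires in one hour; the response contains
    # token_value (the secret) and token_info (its metadata).
    curl -X POST "$DATABRICKS_HOST/api/2.0/token/create" \
      -H "Authorization: Bearer $DATABRICKS_TOKEN" \
      -d '{ "lifetime_seconds": 3600, "comment": "cookbook demo token" }'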

About the Authors

  • Phani Raj

    Phani Raj is an Azure data architect at Microsoft. He has more than 12 years of IT experience and works primarily on the architecture, design, and development of complex data warehouses, OLTP, and big data solutions on Azure for customers across the globe.

  • Vinod Jaiswal

    Vinod Jaiswal is a data engineer at Microsoft. He has more than 13 years of IT experience and works primarily on the architecture, design, and development of complex data warehouses, OLTP, and big data solutions on Azure using Azure data services for a variety of customers. He has also worked on designing and developing real-time data processing and analytics reports from the data ingested from streaming systems using Azure Databricks.
