
Chapter 7: Implementing Near-Real-Time Analytics and Building a Modern Data Warehouse

Azure has changed the way data applications are designed and implemented and how data is processed and stored. As more data arrives from various disparate sources, we need better tools and techniques to handle streamed, batched, semi-structured, unstructured, and relational data together. Modern data solutions define a framework that describes how data can be read from various sources, processed together, and stored or sent to other streaming consumers to generate meaningful insights from the raw data.

In this chapter, we will learn how to ingest data coming from disparate sources such as Azure Event Hubs, Azure Data Lake Storage Gen2 (ADLS Gen2) storage, and Azure SQL Database, how this data can be processed together and stored as a data warehouse model with fact and dimension tables in Azure Synapse Analytics, and how to store processed and raw data...

Technical requirements

To follow along with the examples shown in the recipes, you will need to have the following:

  • An Azure subscription and required permissions on the subscription, as mentioned in the Technical requirements section of Chapter 1, Creating an Azure Databricks Service.
  • An Azure Databricks workspace with a Spark 3.x cluster. All the notebooks used in this chapter were executed on a Spark 3.0.1 cluster.
  • You can find the scripts for this chapter in the GitHub repository at https://github.com/PacktPublishing/Azure-Databricks-Cookbook/tree/main/Chapter07. The Chapter07 folder contains all the notebooks, Python code, and a few Parquet files required for the demonstration.

Understanding the scenario for an end-to-end (E2E) solution

In this recipe, you will learn about an E2E solution that you will be building in this chapter. We will look at various components used in the solution and the data warehouse model that we will be creating in a Synapse dedicated pool.

Getting ready

If you are new to Azure Databricks, ensure you have read the following chapters before starting this recipe:

  • Chapter 2, Reading and Writing Data from and to Various Azure Services and File Formats
  • Chapter 4, Working with Streaming Data
  • Chapter 6, Exploring Delta Lake in Azure Databricks

Also, you will need a basic understanding of fact and dimension tables in a data warehouse.

How to do it…

In this section, you will understand the E2E data pipeline scenario along with the modern data warehouse model that will be created as part of the solution. The following points list all the stages of the data pipeline, right from...

Creating required Azure resources for the E2E demonstration

In this recipe, you will create the Azure resources required for an E2E solution that implements batch processing and real-time analytics and builds a modern warehouse solution using the Azure stack.

Getting ready

Before starting, ensure that you have Contributor access to the subscription or are the owner of the resource group.

Before we begin, ensure that you have completed the recipes in Chapter 2, Reading and Writing Data from and to Various Azure Services and File Formats, and Chapter 4, Working with Streaming Data.

How to do it…

In this section, you will learn how to get the required information for various components to establish a connection.

  1. Refer to the Reading and writing data from and to Azure Cosmos DB recipe from Chapter 2, Reading and Writing Data from and to Various Azure Services and File Formats, to learn how to read and write data to and from a...

Simulating a workload for streaming data

In this recipe, you will learn how to simulate vehicle sensor data that can be sent to Event Hubs. We will run a Python script that generates sensor data for 10 vehicle IDs, and you will learn how the events can be pushed to Event Hubs for Kafka.

Getting ready

Before you start this recipe, make sure you have the latest version of Python installed on the machine from which you will run the Python script. The script was tested on Python 3.8.

  • You need to install the confluent_kafka library by running pip install confluent_kafka from bash, PowerShell, or Command Prompt.
  • You need to have Azure Event Hubs for Kafka, which is covered in the previous recipe, Creating required Azure resources for the E2E demonstration. In that recipe, you can find out how to get the Event Hubs connection string that will be used for the bootstrap.servers notebook variable. A minimal producer sketch follows this list.
  • The Python script...
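
The script itself is not reproduced in full here, but the following is a minimal sketch of the kind of producer it uses, assuming an Event Hubs namespace with the Kafka endpoint enabled. The namespace, event hub (topic) name, connection string, and payload fields are placeholders rather than the book's actual values:

import json
import random
import time
from datetime import datetime, timezone

from confluent_kafka import Producer

EVENT_HUB_NAMESPACE = "<your-namespace>"                   # placeholder
EVENT_HUB_NAME = "<your-event-hub>"                        # placeholder (Kafka topic)
CONNECTION_STRING = "<your-event-hub-connection-string>"   # placeholder

producer = Producer({
    "bootstrap.servers": f"{EVENT_HUB_NAMESPACE}.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.username": "$ConnectionString",   # literal value required by Event Hubs for Kafka
    "sasl.password": CONNECTION_STRING,
})

def delivery_report(err, msg):
    # Called once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed: {err}")

for _ in range(100):                        # send a small batch of simulated events
    vehicle_id = f"vehicle-{random.randint(1, 10)}"   # 10 simulated vehicle IDs
    event = {
        "vehicleId": vehicle_id,
        "speed": random.randint(0, 120),
        "eventTime": datetime.now(timezone.utc).isoformat(),
    }
    producer.produce(EVENT_HUB_NAME, key=vehicle_id,
                     value=json.dumps(event), callback=delivery_report)
    producer.poll(0)                        # serve delivery callbacks
    time.sleep(1)

producer.flush()                            # wait for any in-flight messages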

Processing streaming and batch data using Structured Streaming

We often see scenarios where we need to process batch data stored in ADLS Gen2 in comma-separated values (CSV) or Parquet format together with data from real-time streaming sources such as Event Hubs. In this recipe, we will learn how to use Structured Streaming to process both batch and real-time streaming sources together. We will also fetch the metadata required for our processing from Azure SQL Database.
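
The following is a minimal sketch of how these sources can be combined, assuming the cluster can already authenticate to ADLS Gen2 and Azure SQL Database; all namespace, path, table, and credential values are placeholders rather than the recipe's actual configuration:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, IntegerType, TimestampType

# 1. Streaming source: Event Hubs exposed through its Kafka endpoint.
event_schema = (StructType()
                .add("vehicleId", StringType())
                .add("speed", IntegerType())
                .add("eventTime", TimestampType()))

stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
             .option("kafka.security.protocol", "SASL_SSL")
             .option("kafka.sasl.mechanism", "PLAIN")
             .option("kafka.sasl.jaas.config",
                     'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
                     'required username="$ConnectionString" password="<connection-string>";')
             .option("subscribe", "<event-hub-name>")
             .load()
             .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
             .select("e.*"))

# 2. Batch source: Parquet files stored in ADLS Gen2.
batch_df = spark.read.parquet("abfss://<container>@<account>.dfs.core.windows.net/raw/vehicles/")

# 3. Metadata lookup from Azure SQL Database over JDBC.
metadata_df = (spark.read
               .format("jdbc")
               .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
               .option("dbtable", "dbo.VehicleMetadata")
               .option("user", "<user>")
               .option("password", "<password>")
               .load())

# Static DataFrames such as metadata_df (and batch_df, after aligning schemas) can be
# joined directly to the stream before it is written out.
enriched_df = stream_df.join(metadata_df, "vehicleId", "left")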

Getting ready

Before starting, you need to have a valid subscription with contributor access, a Databricks workspace (Premium), and an ADLS Gen2 storage account. Also, ensure that you have been through the previous recipes of this chapter.

We executed the notebook on Databricks Runtime 7.5, which includes Spark 3.0.1.

You can follow along by running the steps in the following notebook:

https://github.com/PacktPublishing/Azure-Databricks-Cookbook/blob/main/Chapter07...

Understanding the various stages of transforming data

Building a near-real-time warehouse has become a common architectural pattern for organizations that want to avoid the delays seen in on-premises data warehouse systems. Customers want to view the data in near real time in their modern warehouse architecture, and they can achieve that by using Azure Databricks Delta Lake with the Spark Structured Streaming APIs. In this recipe, you will learn the various stages involved in building a near-real-time data warehouse in Delta Lake. We store the data in a denormalized way in Delta Lake, but in a Synapse dedicated SQL pool we store the data in fact and dimension tables to enhance reporting capabilities.

As part of data processing in Delta Lake, you will be creating three Delta tables, as follows (a minimal sketch appears after this list):

  1. Bronze table: This will hold the data as received from Event Hubs for Kafka.
  2. Silver table: We will implement the required business rules...
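
The following is a minimal sketch of this Bronze/Silver/Gold flow, assuming a streaming DataFrame named raw_stream_df that has already been parsed from Event Hubs for Kafka (for example, the stream built in the previous recipe); the paths, table names, and the illustrative business rule are placeholders:

from pyspark.sql import functions as F

base_path = "abfss://<container>@<account>.dfs.core.windows.net/delta"

# Bronze: persist the events exactly as received from Event Hubs for Kafka.
(raw_stream_df.writeStream
 .format("delta")
 .option("checkpointLocation", f"{base_path}/bronze/_checkpoint")
 .outputMode("append")
 .start(f"{base_path}/bronze/vehicle_events"))

# Silver: read the Bronze table as a stream, apply business rules, and write the result.
bronze_df = spark.readStream.format("delta").load(f"{base_path}/bronze/vehicle_events")
silver_df = (bronze_df
             .filter(F.col("speed").isNotNull())            # placeholder business rule
             .withColumn("ingestDate", F.to_date("eventTime")))
(silver_df.writeStream
 .format("delta")
 .option("checkpointLocation", f"{base_path}/silver/_checkpoint")
 .outputMode("append")
 .start(f"{base_path}/silver/vehicle_events"))

# Gold: aggregate the Silver data into a denormalized, reporting-friendly table.
gold_df = (spark.readStream.format("delta").load(f"{base_path}/silver/vehicle_events")
           .withWatermark("eventTime", "10 minutes")
           .groupBy("vehicleId", F.window("eventTime", "5 minutes"))
           .agg(F.avg("speed").alias("avgSpeed")))
(gold_df.writeStream
 .format("delta")
 .option("checkpointLocation", f"{base_path}/gold/_checkpoint")
 .outputMode("append")
 .start(f"{base_path}/gold/vehicle_speed_summary"))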

Loading the transformed data into Azure Cosmos DB and a Synapse dedicated pool

In this recipe, we will learn how to write the transformed data to various sinks such as Azure Cosmos DB and a Synapse dedicated SQL pool. The processed data needs to be saved to different destinations for further consumption or to build a data warehouse. Azure Cosmos DB is one of the most widely used NoSQL databases and often acts as a source for web portals. Similarly, an Azure Synapse dedicated SQL pool is used for creating a data warehouse.

Customers want to view data in near real time in their warehouse or web application so that they can get insights from the data in real time. In the following section, you will learn how to read data from the Silver zone (Delta table), do some data processing, and then finally load the data into Cosmos DB and a Synapse dedicated SQL pool.
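
The following is a minimal sketch of writing the processed data to both sinks, assuming the Azure Cosmos DB Spark 3 connector is installed on the cluster; the account names, keys, table names, and staging container are placeholders, and a simple batch write from the Silver Delta table is shown (the recipe itself may drive these writes from a streaming query):

# Read the processed data from the Silver Delta table (path is a placeholder).
silver_df = spark.read.format("delta").load(
    "abfss://<container>@<account>.dfs.core.windows.net/delta/silver/vehicle_events")

# Sink 1: Azure Cosmos DB (Core/SQL API) via the Cosmos DB Spark 3 OLTP connector.
(silver_df.write
 .format("cosmos.oltp")
 .option("spark.cosmos.accountEndpoint", "https://<cosmos-account>.documents.azure.com:443/")
 .option("spark.cosmos.accountKey", "<cosmos-account-key>")
 .option("spark.cosmos.database", "<database>")
 .option("spark.cosmos.container", "<container>")
 .mode("append")
 .save())

# Sink 2: Azure Synapse dedicated SQL pool via the built-in Synapse connector, which
# stages the data in ADLS Gen2 (tempDir) before loading it into the target table.
(silver_df.write
 .format("com.databricks.spark.sqldw")
 .option("url", "jdbc:sqlserver://<synapse-workspace>.sql.azuresynapse.net:1433;"
                "database=<dedicated-pool>;user=<user>;password=<password>")
 .option("tempDir", "abfss://<staging-container>@<account>.dfs.core.windows.net/tmp")
 .option("forwardSparkAzureStorageCredentials", "true")
 .option("dbTable", "dbo.FactVehicleEvents")
 .mode("append")
 .save())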

Getting ready

In the previous recipe, we saw how data was loaded into Bronze, Silver, and Gold Delta tables. In this...

Creating a visualization and dashboard in a notebook for near-real-time analytics

Azure Databricks provides the capability to create various visualizations using Spark SQL and to combine those visualizations into dashboards. This helps data engineers and data scientists create quick dashboards for near-real-time analytics. In this recipe, you will learn how to create visualizations in a notebook and how to create a dashboard for static and near-real-time reporting. The dashboard capabilities of Azure Databricks are very limited compared to reporting tools such as Power BI; if you need more drill-downs and various levels of slicing and dicing, you can use Power BI for reporting purposes.
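
As a minimal illustration, assuming a Gold Delta table named vehicle_speed_summary has been registered, a visualization can be built from a single cell like the one below; display() renders a results grid that can be switched to a chart from the cell menu and pinned to a notebook dashboard:

# Aggregate the (assumed) Gold table into a shape that is easy to chart.
summary_df = spark.sql("""
    SELECT vehicleId, AVG(avgSpeed) AS avgSpeed
    FROM vehicle_speed_summary
    GROUP BY vehicleId
    ORDER BY vehicleId
""")

# In Databricks, display() works for both static and streaming DataFrames, so the
# same cell can back a near-real-time chart when given a streaming query.
display(summary_df)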

Getting ready

Before starting, ensure that you have executed the following notebook, which creates the required Delta tables on which we can build our visualizations and dashboard in the notebook:

https://github.com/PacktPublishing/Azure-Databricks-Cookbook...

Creating a visualization in Power BI for near-real-time analytics

Before starting, we need to ensure we have executed the following notebook. This notebook creates the required tables on which we can build our visualizations and dashboards in the notebook: https://github.com/PacktPublishing/Azure-Databricks-Cookbook/blob/main/Chapter07/7.1-End-to-End%20Data%20Pipeline.ipynb.

Getting ready

Before starting to work on this recipe, you need to get the server hostname and HyperText Transfer Protocol (HTTP) path details for the Azure Databricks cluster.

Go to the Clusters tab in the Azure Databricks workspace and select the cluster you are using. Under Configuration, expand the advanced options and select the JDBC/ODBC option to get the details of the server hostname and HTTP path, as shown in the following screenshot:

Figure 7.23 – Cluster configuration details

Copy the entire string for the server hostname and HTTP path that you...

Using Azure Data Factory (ADF) to orchestrate the E2E pipeline

ADF is a serverless data integration and data transformation service: a cloud Extract, Transform, Load (ETL)/Extract, Load, Transform (ELT) service on the Microsoft Azure platform. In this recipe, we will learn how to orchestrate and automate a data pipeline using ADF.

Getting ready

Before starting with this recipe, you need to ensure that you have a valid Azure subscription, valid permission to create an ADF resource, and Azure Databricks workspace details with an access token.

Note

A detailed explanation of ADF is beyond the scope of this book; readers are expected to have basic knowledge of creating an ADF pipeline and scheduling it.

How to do it…

In this section, we will learn how to invoke a Databricks notebook in an ADF pipeline and how to schedule an E2E data pipeline using an ADF trigger.

  1. Open an ADF workspace from the Azure portal and click on the Author & Monitor link to...
You have been reading a chapter from Azure Databricks Cookbook (Packt, September 2021, ISBN-13: 9781789809718).

Authors (2)

Phani Raj

Phani Raj is an experienced data architect and product manager with 15 years of experience working with customers to build data platforms both on-premises and in the cloud. He has designed and implemented large-scale big data solutions for customers across different verticals. His passion for continuous learning and adapting to the dynamic nature of technology underscores his role as a trusted advisor in the realms of data architecture, data science, and product management.

Vinod Jaiswal

Vinod Jaiswal is an experienced data engineer who excels in transforming raw data into valuable insights. With over 8 years of experience with Databricks, he designs and implements data pipelines, optimizes workflows, and crafts scalable solutions for intricate data challenges. Collaborating seamlessly with diverse teams, Vinod empowers them with the tools and expertise to leverage data effectively. His dedication to staying up to date with the latest data engineering trends ensures cutting-edge, robust solutions. Beyond his technical prowess, Vinod is a proficient educator: through presentations and mentoring, he shares his expertise, enabling others to harness the power of data within the Databricks ecosystem.