
Chapter 7: Implementing Near-Real-Time Analytics and Building a Modern Data Warehouse

Azure has changed the way data applications are designed and implemented and how data is processed and stored. As more data arrives from various disparate sources, we need better tools and techniques to handle streamed, batched, semi-structured, unstructured, and relational data together. Modern data solutions define a framework that describes how data can be read from various sources, processed together, and stored or sent to other streaming consumers to generate meaningful insights from the raw data.

In this chapter, we will learn how to ingest data coming from disparate sources such as Azure Event Hubs, Azure Data Lake Storage Gen2 (ADLS Gen2) storage, and Azure SQL Database, how this data can be processed together and stored as a data warehouse model with fact and dimension tables in Azure Synapse Analytics, and how to store processed and raw data...

Technical requirements

To follow along with the examples shown in the recipes, you will need to have the following:

  • An Azure subscription and required permissions on the subscription, as mentioned in the Technical requirements section of Chapter 1, Creating an Azure Databricks Service.
  • An Azure Databricks workspace with a Spark 3.x cluster. All the notebooks used in this chapter were executed on a Spark 3.0.1 cluster.
  • You can find the scripts for this chapter in the GitHub repository at https://github.com/PacktPublishing/Azure-Databricks-Cookbook/tree/main/Chapter07. The Chapter07 folder contains all the notebooks, Python code, and a few Parquet files required for the demonstration.

Understanding the scenario for an end-to-end (E2E) solution

In this recipe, you will learn about an E2E solution that you will be building in this chapter. We will look at various components used in the solution and the data warehouse model that we will be creating in a Synapse dedicated pool.

Getting ready

If you are new to Azure Databricks, ensure you have read the following chapters before starting this recipe:

  • Chapter 2, Reading and Writing Data from and to Various Azure Services and File Formats
  • Chapter 4, Working with Streaming Data
  • Chapter 6, Exploring Delta Lake in Azure Databricks

Also, you will need a basic understanding of fact and dimension tables in a data warehouse.

How to do it…

In this section, you will understand the E2E data pipeline scenario along with the modern data warehouse model that will be created as part of the solution. The following points list all the stages of the data pipeline, right from...

Creating required Azure resources for the E2E demonstration

In this recipe, you will create the Azure resources required for an E2E solution that implements batch processing and real-time analytics and builds a modern warehouse solution using the Azure stack.

Getting ready

Before starting, ensure that you have Contributor access to the subscription or are the owner of the resource group.

Before we begin, ensure that you have completed the recipes in Chapter 2, Reading and Writing Data from and to Various Azure Services and File Formats, and Chapter 4, Working with Streaming Data.

How to do it…

In this section, you will learn how to get the required information for various components to establish a connection.

  1. Refer to the Reading and writing data from and to Azure Cosmos DB recipe from Chapter 2, Reading and Writing Data from and to Various Azure Services and File Formats, to learn how to read and write data to and from a...

Simulating a workload for streaming data

In this recipe, you will learn how to simulate vehicle sensor data that can be sent to Event Hubs. We will run a Python script that generates sensor data for 10 vehicle IDs, and you will learn how the events can be pushed to Event Hubs for Kafka.

Getting ready

Before you start this recipe, make sure you have the latest version of Python installed on the machine from which you will run the Python script. The script was tested on Python 3.8.

  • You need to install the confluent_kafka library by running pip install confluent_kafka from bash, PowerShell, or Command Prompt.
  • You need to have Azure Event Hubs for Kafka, which is covered in the previous recipe, Creating required Azure resources for the E2E demonstration. In that recipe, you can find out how to get the Event Hubs connection string that will be used for the bootstrap.servers notebook variable. A minimal producer sketch follows this list.
  • The Python script...
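
The script itself is not reproduced in full here, but the following is a minimal sketch of the kind of producer it uses, assuming an Event Hubs namespace with the Kafka endpoint enabled. The namespace, event hub (topic) name, connection string, and payload fields are placeholders rather than the book's actual values:

import json
import random
import time
from datetime import datetime, timezone

from confluent_kafka import Producer

EVENT_HUB_NAMESPACE = "<your-namespace>"                   # placeholder
EVENT_HUB_NAME = "<your-event-hub>"                        # placeholder (Kafka topic)
CONNECTION_STRING = "<your-event-hub-connection-string>"   # placeholder

producer = Producer({
    "bootstrap.servers": f"{EVENT_HUB_NAMESPACE}.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.username": "$ConnectionString",   # literal value required by Event Hubs for Kafka
    "sasl.password": CONNECTION_STRING,
})

def delivery_report(err, msg):
    # Called once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed: {err}")

for _ in range(100):                        # send a small batch of simulated events
    vehicle_id = f"vehicle-{random.randint(1, 10)}"   # 10 simulated vehicle IDs
    event = {
        "vehicleId": vehicle_id,
        "speed": random.randint(0, 120),
        "eventTime": datetime.now(timezone.utc).isoformat(),
    }
    producer.produce(EVENT_HUB_NAME, key=vehicle_id,
                     value=json.dumps(event), callback=delivery_report)
    producer.poll(0)                        # serve delivery callbacks
    time.sleep(1)

producer.flush()                            # wait for any in-flight messages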

Processing streaming and batch data using Structured Streaming

We often see scenarios where we need to process batch data stored in ADLS Gen2 in comma-separated values (CSV) or Parquet format together with data from real-time streaming sources such as Event Hubs. In this recipe, we will learn how to use Structured Streaming to process both batch and real-time streaming sources together. We will also fetch the metadata required for our processing from Azure SQL Database.
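
The following is a minimal sketch of how these sources can be combined, assuming the cluster can already authenticate to ADLS Gen2 and Azure SQL Database; all namespace, path, table, and credential values are placeholders rather than the recipe's actual configuration:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, IntegerType, TimestampType

# 1. Streaming source: Event Hubs exposed through its Kafka endpoint.
event_schema = (StructType()
                .add("vehicleId", StringType())
                .add("speed", IntegerType())
                .add("eventTime", TimestampType()))

stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
             .option("kafka.security.protocol", "SASL_SSL")
             .option("kafka.sasl.mechanism", "PLAIN")
             .option("kafka.sasl.jaas.config",
                     'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
                     'required username="$ConnectionString" password="<connection-string>";')
             .option("subscribe", "<event-hub-name>")
             .load()
             .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
             .select("e.*"))

# 2. Batch source: Parquet files stored in ADLS Gen2.
batch_df = spark.read.parquet("abfss://<container>@<account>.dfs.core.windows.net/raw/vehicles/")

# 3. Metadata lookup from Azure SQL Database over JDBC.
metadata_df = (spark.read
               .format("jdbc")
               .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
               .option("dbtable", "dbo.VehicleMetadata")
               .option("user", "<user>")
               .option("password", "<password>")
               .load())

# Static DataFrames such as metadata_df (and batch_df, after aligning schemas) can be
# joined directly to the stream before it is written out.
enriched_df = stream_df.join(metadata_df, "vehicleId", "left")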

Getting ready

Before starting, you need to have a valid subscription with contributor access, a Databricks workspace (Premium), and an ADLS Gen2 storage account. Also, ensure that you have been through the previous recipes of this chapter.

We executed the notebook on Databricks Runtime 7.5, which includes Spark 3.0.1.

You can follow along by running the steps in the following notebook:

https://github.com/PacktPublishing/Azure-Databricks-Cookbook/blob/main/Chapter07...

Understanding the various stages of transforming data

Building a near-real-time warehouse has become a common architectural pattern for organizations that want to avoid the delays seen in on-premises data warehouse systems. Customers want to view the data in near real time in their modern warehouse architecture, and they can achieve that by using Azure Databricks Delta Lake with the Spark Structured Streaming APIs. In this recipe, you will learn the various stages involved in building a near-real-time data warehouse in Delta Lake. We store the data in a denormalized way in Delta Lake, but in a Synapse dedicated SQL pool we store the data in fact and dimension tables to enhance reporting capabilities.

As part of data processing in Delta Lake, you will be creating three Delta tables, as follows (a minimal sketch appears after this list):

  1. Bronze table: This will hold the data as received from Event Hubs for Kafka.
  2. Silver table: We will implement the required business rules...
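
The following is a minimal sketch of this Bronze/Silver/Gold flow, assuming a streaming DataFrame named raw_stream_df that has already been parsed from Event Hubs for Kafka (for example, the stream built in the previous recipe); the paths, table names, and the illustrative business rule are placeholders:

from pyspark.sql import functions as F

base_path = "abfss://<container>@<account>.dfs.core.windows.net/delta"

# Bronze: persist the events exactly as received from Event Hubs for Kafka.
(raw_stream_df.writeStream
 .format("delta")
 .option("checkpointLocation", f"{base_path}/bronze/_checkpoint")
 .outputMode("append")
 .start(f"{base_path}/bronze/vehicle_events"))

# Silver: read the Bronze table as a stream, apply business rules, and write the result.
bronze_df = spark.readStream.format("delta").load(f"{base_path}/bronze/vehicle_events")
silver_df = (bronze_df
             .filter(F.col("speed").isNotNull())            # placeholder business rule
             .withColumn("ingestDate", F.to_date("eventTime")))
(silver_df.writeStream
 .format("delta")
 .option("checkpointLocation", f"{base_path}/silver/_checkpoint")
 .outputMode("append")
 .start(f"{base_path}/silver/vehicle_events"))

# Gold: aggregate the Silver data into a denormalized, reporting-friendly table.
gold_df = (spark.readStream.format("delta").load(f"{base_path}/silver/vehicle_events")
           .withWatermark("eventTime", "10 minutes")
           .groupBy("vehicleId", F.window("eventTime", "5 minutes"))
           .agg(F.avg("speed").alias("avgSpeed")))
(gold_df.writeStream
 .format("delta")
 .option("checkpointLocation", f"{base_path}/gold/_checkpoint")
 .outputMode("append")
 .start(f"{base_path}/gold/vehicle_speed_summary"))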

Loading the transformed data into Azure Cosmos DB and a Synapse dedicated pool

In this recipe, we will learn how to write the transformed data to various sinks such as Azure Cosmos DB and a Synapse dedicated SQL pool. The processed data needs to be saved to different destinations for further consumption or to build a data warehouse. Azure Cosmos DB is one of the most widely used NoSQL databases and often acts as a source for web portals. Similarly, an Azure Synapse dedicated SQL pool is used for creating a data warehouse.

Customers want to view data in near real time in their warehouse or web application so that they can get insights from the data in real time. In the following section, you will learn how to read data from the Silver zone (Delta table), do some data processing, and then finally load the data into Cosmos DB and a Synapse dedicated SQL pool.
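
The following is a minimal sketch of writing the processed data to both sinks, assuming the Azure Cosmos DB Spark 3 connector is installed on the cluster; the account names, keys, table names, and staging container are placeholders, and a simple batch write from the Silver Delta table is shown (the recipe itself may drive these writes from a streaming query):

# Read the processed data from the Silver Delta table (path is a placeholder).
silver_df = spark.read.format("delta").load(
    "abfss://<container>@<account>.dfs.core.windows.net/delta/silver/vehicle_events")

# Sink 1: Azure Cosmos DB (Core/SQL API) via the Cosmos DB Spark 3 OLTP connector.
(silver_df.write
 .format("cosmos.oltp")
 .option("spark.cosmos.accountEndpoint", "https://<cosmos-account>.documents.azure.com:443/")
 .option("spark.cosmos.accountKey", "<cosmos-account-key>")
 .option("spark.cosmos.database", "<database>")
 .option("spark.cosmos.container", "<container>")
 .mode("append")
 .save())

# Sink 2: Azure Synapse dedicated SQL pool via the built-in Synapse connector, which
# stages the data in ADLS Gen2 (tempDir) before loading it into the target table.
(silver_df.write
 .format("com.databricks.spark.sqldw")
 .option("url", "jdbc:sqlserver://<synapse-workspace>.sql.azuresynapse.net:1433;"
                "database=<dedicated-pool>;user=<user>;password=<password>")
 .option("tempDir", "abfss://<staging-container>@<account>.dfs.core.windows.net/tmp")
 .option("forwardSparkAzureStorageCredentials", "true")
 .option("dbTable", "dbo.FactVehicleEvents")
 .mode("append")
 .save())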

Getting ready

In the previous recipe, we saw how data was loaded into Bronze, Silver, and Gold Delta tables. In this...

Creating a visualization and dashboard in a notebook for near-real-time analytics

Azure Databricks provides the capability to create various visualizations using Spark SQL and to combine those visualizations into dashboards. This helps data engineers and data scientists create quick dashboards for near-real-time analytics. In this recipe, you will learn how to create visualizations in a notebook and how to create a dashboard for static and near-real-time reporting. The dashboard capabilities of Azure Databricks are very limited compared to reporting tools such as Power BI; if you need more drill-downs and various levels of slicing and dicing, you can use Power BI for reporting purposes.
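
As a minimal illustration, assuming a Gold Delta table named vehicle_speed_summary has been registered, a visualization can be built from a single cell like the one below; display() renders a results grid that can be switched to a chart from the cell menu and pinned to a notebook dashboard:

# Aggregate the (assumed) Gold table into a shape that is easy to chart.
summary_df = spark.sql("""
    SELECT vehicleId, AVG(avgSpeed) AS avgSpeed
    FROM vehicle_speed_summary
    GROUP BY vehicleId
    ORDER BY vehicleId
""")

# In Databricks, display() works for both static and streaming DataFrames, so the
# same cell can back a near-real-time chart when given a streaming query.
display(summary_df)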

Getting ready

Before starting, ensure that you have executed the following notebook, which creates the required Delta tables on which we can build our visualizations and dashboard in the notebook:

https://github.com/PacktPublishing/Azure-Databricks-Cookbook...

Creating a visualization in Power BI for near-real-time analytics

Before starting, we need to ensure we have executed the following notebook. This notebook creates the required tables on which we can build our visualizations and dashboards in the notebook: https://github.com/PacktPublishing/Azure-Databricks-Cookbook/blob/main/Chapter07/7.1-End-to-End%20Data%20Pipeline.ipynb.

Getting ready

Before starting to work on this recipe, you need to get the server hostname and HyperText Transfer Protocol (HTTP) path details for the Azure Databricks cluster.

Go to the Clusters tab in the Azure Databricks workspace and select the cluster you are using. Under Configuration, expand the advanced options and select the JDBC/ODBC option to get the details of the server hostname and HTTP path, as shown in the following screenshot:

Figure 7.23 – Cluster configuration details

Copy the entire string for the server hostname and HTTP path that you...

Using Azure Data Factory (ADF) to orchestrate the E2E pipeline

ADF is a serverless data integration and data transformation service: a cloud Extract, Transform, Load (ETL)/Extract, Load, Transform (ELT) service on the Microsoft Azure platform. In this recipe, we will learn how to orchestrate and automate a data pipeline using ADF.

Getting ready

Before starting with this recipe, you need to ensure that you have a valid Azure subscription, valid permission to create an ADF resource, and Azure Databricks workspace details with an access token.

Note

A detailed explanation of ADF is beyond the scope of this book; readers are expected to have basic knowledge of creating an ADF pipeline and scheduling it.

How to do it…

In this section, we will learn how to invoke a Databricks notebook in an ADF pipeline and how to schedule an E2E data pipeline using an ADF trigger.

  1. Open an ADF workspace from the Azure portal and click on the Author & Monitor link to...
You have been reading a chapter from Azure Databricks Cookbook (Packt, September 2021, ISBN-13: 9781789809718).

Authors (2)

Phani Raj

Phani Raj is an experienced data architect and product manager with 15 years of experience working with customers to build data platforms both on-premises and in the cloud. He has designed and implemented large-scale big data solutions for customers across different verticals. His passion for continuous learning and adapting to the dynamic nature of technology underscores his role as a trusted advisor in the realms of data architecture, data science, and product management.

Vinod Jaiswal

Vinod Jaiswal is an experienced data engineer who excels in transforming raw data into valuable insights. With over 8 years of experience with Databricks, he designs and implements data pipelines, optimizes workflows, and crafts scalable solutions for intricate data challenges. Collaborating seamlessly with diverse teams, Vinod empowers them with the tools and expertise to leverage data effectively. His dedication to staying up to date with the latest data engineering trends ensures cutting-edge, robust solutions. Beyond his technical prowess, Vinod is a proficient educator: through presentations and mentoring, he shares his expertise, enabling others to harness the power of data within the Databricks ecosystem.