You're reading from Azure Databricks Cookbook

Product type: Book
Published in: Sep 2021
Publisher: Packt
ISBN-13: 9781789809718
Edition: 1st Edition
Authors (2):

Phani Raj
Phani Raj is an experienced data architect and product manager with 15 years of experience working with customers on building data platforms, both on-premises and in the cloud. He has designed and implemented large-scale big data solutions for customers across different verticals. His passion for continuous learning and adapting to the dynamic nature of technology underscores his role as a trusted advisor in the realm of data architecture, data science, and product management.

Vinod Jaiswal

Vinod Jaiswal is an experienced data engineer who excels in transforming raw data into valuable insights. With over 8 years in Databricks, he designs and implements data pipelines, optimizes workflows, and crafts scalable solutions for intricate data challenges. Collaborating seamlessly with diverse teams, Vinod empowers them with tools and expertise to leverage data effectively. His dedication to staying updated on the latest data engineering trends ensures cutting-edge, robust solutions. Apart from his technical prowess, Vinod is a proficient educator. Through presentations and mentoring, he shares his expertise, enabling others to harness the power of data within the Databricks ecosystem.

Preface

Azure Databricks provides the latest and older versions of Apache Spark and allows you to integrate with various Azure resources for orchestrating, deploying, and monitoring your big data solution. This book shows you how to ingest and transform data coming from various sources and formats and build a modern data warehouse solution that meets near real-time data requirements.

You will begin by learning how to spin up an Azure Databricks service and what cluster options are available. You will gain knowledge of how to process data from various file formats and sources, including Kafka, Event Hub, Azure SQL Database, Azure Synapse Analytics, and Cosmos DB. Once you know how to read and write data from and to various sources, you will build end-to-end big data solutions using large datasets and streaming data.

Once the big data solution has been built, you will learn how to deploy notebooks to various environments such as UAT and production. Later on, you will cover security aspects associated with data isolation, including how to restrict the data in ADLS that AAD users can see when reading it from Azure Databricks. Finally, you will learn how to monitor your Azure Databricks cluster utilization using Ganglia reports.

Who this book is for

This book is for anyone who wants to learn how to ingest and process data from various Azure sources using Azure Databricks and build a modern data warehouse. This book takes a recipe-based approach to help you learn how to ingest, process, secure, deploy, and monitor the big data solution you have built on Azure Databricks. Working knowledge of Spark and familiarity with the basics of Azure should be sufficient to get started with this book.

What this book covers

Chapter 1, Creating an Azure Databricks Service, explains how to create an Azure Databricks service from the Azure portal, Azure CLI, and ARM templates. You will get to understand the different types of clusters in Azure Databricks, while also learning how to add users and groups and how to authenticate to Azure Databricks using a personal access token.
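
To give a flavour of that authentication step, the following is a minimal Python sketch (not taken from the book) that uses a personal access token to call the Databricks REST API and list the clusters in a workspace; the workspace URL and token values are placeholders:

# Minimal sketch: authenticate with a personal access token (PAT) and list
# the clusters in a workspace via the Databricks REST API.
# The workspace URL and token values are placeholders.
import requests

workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

response = requests.get(
    f"{workspace_url}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()
for cluster in response.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"])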

Chapter 2, Reading and Writing Data from and to Various Azure Services and File Formats, explains how to read and write data from and to various data sources, including Azure SQL DB, Azure Synapse Analytics, ADLS Gen2, Storage Blob, and Azure Cosmos DB, as well as CSV, JSON, and Parquet formats. You will use native connectors for Azure SQL and an Azure Synapse dedicated pool to read and write the data for improved performance.
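
As a taste of these recipes, here is a minimal PySpark sketch that reads a table from Azure SQL DB; it uses Spark's generic JDBC reader rather than the native connector covered in the chapter, and the server, database, table, and credential values are placeholders:

# Minimal sketch: read a table from Azure SQL DB with Spark's built-in JDBC
# reader. The chapter's recipes use a native connector for better performance;
# this generic form is shown only for illustration. All values are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = (
    "jdbc:sqlserver://<server>.database.windows.net:1433;"
    "database=<database>"
)

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.Customers")   # hypothetical table name
    .option("user", "<sql-user>")
    .option("password", "<sql-password>")
    .load()
)
df.show(5)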

Chapter 3, Understanding Spark Query Execution, dives deep into query execution and explains how to check the Spark execution plan and learn more about input, shuffle, and output partitions. You will get to understand the different types of joins used when working with DataFrames. You will also learn about a few commonly used session-level configurations for changing the number of partitions.
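
For example, the following minimal sketch (placeholder data only) shows two ideas this chapter revisits often, printing a DataFrame's execution plan and changing the session-level shuffle partition setting:

# Minimal sketch: adjust the session-level shuffle partition setting and
# inspect the physical plan of a simple aggregation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Lower the number of shuffle partitions (the default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "8")

df = spark.range(0, 1_000_000).withColumn("bucket", F.col("id") % 10)
agg = df.groupBy("bucket").count()

# The plan shows the Exchange (shuffle) introduced by the groupBy.
agg.explain()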

Chapter 4, Working with Streaming Data, explains how to ingest streaming data from HDInsight Kafka clusters, Event Hub, and Event Hub for Kafka, and how to perform certain transformations and write the output to Spark tables and Delta tables for downstream consumers. You will get to know the various options to use while ingesting data from streaming sources.
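
As an illustration of the Structured Streaming pattern used in these recipes, here is a minimal sketch that reads from an Event Hub's Kafka-compatible endpoint and writes to a Delta table; the namespace, event hub name, connection string, and paths are placeholders:

# Minimal sketch: read a stream from an Event Hub's Kafka endpoint with
# Structured Streaming and write it to a Delta table. The namespace, event
# hub name, connection string, and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bootstrap = "<namespace>.servicebus.windows.net:9093"
eh_sasl = (
    'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
    'username="$ConnectionString" password="<event-hub-connection-string>";'
)

stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", bootstrap)
    .option("subscribe", "<eventhub-name>")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", eh_sasl)
    .load()
)

query = (
    stream_df.selectExpr("CAST(value AS STRING) AS body")
    .writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events")  # placeholder path
    .start("/tmp/delta/events")                               # placeholder path
)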

Chapter 5, Integrating with Azure Key Vault, App Configuration, and Log Analytics, explains how to integrate Azure Databricks with Azure resources such as Azure Key Vault and App Configuration to store credentials and secrets that are read from Azure Databricks notebooks, and how to integrate Azure Databricks with Log Analytics for telemetry.
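
For instance, once a Key Vault-backed secret scope is in place, reading a secret from a notebook is a one-liner; in this minimal sketch the scope and key names are placeholders (dbutils is available in Databricks notebooks without an import):

# Minimal sketch: fetch a secret from a Key Vault-backed secret scope.
# The scope and key names are placeholders.
sql_password = dbutils.secrets.get(scope="kv-scope", key="sql-password")

# The value is redacted if displayed in a notebook, but can be passed to
# connection options (for example, the password option of a JDBC read).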

Chapter 6, Exploring Delta Lake in Azure Databricks, explains how to use Delta for batch and streaming data in Azure Databricks. You will also understand how Delta Engine will assist in making your queries run faster.
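
As a minimal illustration of the batch side of Delta, the following sketch (placeholder data and path) writes a DataFrame in Delta format and reads it back:

# Minimal sketch: write a DataFrame as a Delta table and read it back.
# The data and path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "open"), (2, "closed")], ["order_id", "status"]
)

# Write in Delta format; the transaction log makes the write ACID.
df.write.format("delta").mode("overwrite").save("/tmp/delta/orders")

# Read it back as a batch DataFrame.
spark.read.format("delta").load("/tmp/delta/orders").show()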

Chapter 7, Implementing Near-Real-Time Analytics and Building a Modern Data Warehouse, explains how to implement an end-to-end big data solution where you read data from streaming sources such as Event Hub for Kafka, as well as from batch sources such as ADLS Gen2, perform various transformations on the data, and later store it in destinations such as Azure Cosmos DB, Azure Synapse Analytics, and Delta Lake. You will build a modern data warehouse and orchestrate the end-to-end pipeline using Azure Data Factory. You will perform near real-time analytics using notebook visualizations and Power BI.

Chapter 8, Databricks SQL, explains how to run ad hoc SQL queries on the data in your data lake. You will get to know how to create SQL endpoints with multiple clusters, write queries, create various visualizations on the data in your data lake, and build dashboards.
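
The kind of ad hoc query the chapter runs against data in the data lake looks like the following; it is shown here through spark.sql purely for illustration, with a placeholder Delta path, whereas in the chapter you run such queries on a SQL endpoint:

# Minimal sketch: an ad hoc aggregation over a Delta table in the data lake.
# The Delta path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT status, COUNT(*) AS order_count
    FROM delta.`/tmp/delta/orders`
    GROUP BY status
    ORDER BY order_count DESC
""").show()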

Chapter 9, DevOps Integrations and Implementing CI/CD for Azure Databricks, details how to integrate Azure DevOps Repos and GitHub with Databricks notebooks. You will learn how to implement Azure DevOps CI/CD for deploying notebooks across various environments (UAT and production) as well as how to deploy Azure Databricks resources using ARM templates and automate deployment using the Azure DevOps release pipeline.

Chapter 10, Understanding Security and Monitoring in Azure Databricks, covers pass-through authentication in Azure Databricks and how to restrict access to ADLS Gen2 using RBAC and ACLs so that users reading data from Azure Databricks can only see the data they have been granted access to. You will learn how to deploy Azure Databricks in a VNet as well as how to securely access the data in an ADLS Gen2 storage account.
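
For example, with credential passthrough enabled on a cluster, a simple read such as the following minimal sketch (placeholder storage account, container, and file) succeeds only if the signed-in AAD user has the required RBAC role or ACLs on the path:

# Minimal sketch: read a CSV file from ADLS Gen2 over abfss. With credential
# passthrough, access is evaluated against the signed-in AAD user's RBAC
# roles and ACLs. All names in the path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "abfss://raw@<storageaccount>.dfs.core.windows.net/customers/customers.csv"
df = spark.read.option("header", "true").csv(path)
df.show(5)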

To get the most out of this book

This book is for data engineers, data analysts, and data scientists who are familiar with Spark and have some knowledge of Azure. If you understand the basics of Spark and want to learn how to build an end-to-end data pipeline in Azure Databricks and productionize it, then this book is for you. Data scientists and business analysts with some knowledge of SQL who want to run ad hoc queries on large volumes of data using Databricks SQL will also find this book useful.

Basic knowledge of Python and PySpark is all you need to understand and execute the code.

If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Azure-Databricks-Cookbook. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789809718_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "--location -l: The Azure region where the workspace will be created."

A block of code is set as follows:

$appId="sdasdasdsdfsa7-xxx-xxx-xx-xx"
$appSecret="xxxxxxx~.xxxxxxjgx"
$tenantId="xxxxx-xxx-xx-xxxx-xxxcdxxxxxx"
$subscriptionName="Pay As You Go"
$resourceGroup="CookbookRG"

Any command-line input or output is written as follows:

az databricks workspace update

Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "We will select New Job Cluster here, select the Edit option, and provide the cluster configuration for the new cluster."

Tips or important notes

Appear like this.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There's more..., and See also).

To give clear instructions on how to complete a recipe, use these sections as follows:

Getting ready

This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section consists of additional information about the recipe in order to make you more knowledgeable about the recipe.

See also

This section provides helpful links to other useful information for the recipe.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at customercare@packtpub.com.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you've read Azure Databricks Cookbook, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
