
You're reading from Optimizing Databricks Workloads

Product type: Book
Published in: Dec 2021
Publisher: Packt
ISBN-13: 9781801819077
Edition: 1st Edition
Authors (3):
Anirudh Kala

Anirudh Kala is an expert in machine learning techniques, artificial intelligence, and natural language processing. He has helped multiple organizations to run their large-scale data warehouses with quantitative research, natural language generation, data science exploration, and big data implementation. He has worked in every aspect of data analytics using the Azure data platform. Currently, he works as the director of Celebal Technologies, a data science boutique firm dedicated to large-scale analytics. Anirudh holds a computer engineering degree from the University of Rajasthan and his work history features the likes of IBM and ZS Associates.

Anshul Bhatnagar

Anshul Bhatnagar is an experienced, hands-on data architect involved in the architecture, design, and implementation of data platforms and distributed systems. He has worked in the IT industry since 2015 in a range of roles, such as Hadoop/Spark developer, data engineer, and data architect. He has also worked in many other sectors, including energy, media, telecoms, and e-commerce. He is currently working for a data and AI boutique company, Celebal Technologies, in India. He is always keen to hear about new ideas and technologies in the areas of big data and AI, so look him up on LinkedIn to ask questions or just to say hi.

Sarthak Sarbahi

Sarthak Sarbahi is a certified data engineer and analyst with a wide technical breadth and a deep understanding of Databricks. His background has led him to a variety of cloud data services with an eye toward data warehousing, big data analytics, robust data engineering, data science, and business intelligence. Sarthak graduated with a degree in mechanical engineering.


Chapter 6: Databricks Delta Lake

Delta Lake is an open source storage layer that brings capabilities to data in a data lake that were previously exclusive to data warehouses. When combined with cloud storage, Databricks and Delta Lake form a Lakehouse. A Lakehouse simply provides the best of both worlds: data lakes and data warehouses. It delivers the same set of capabilities as a traditional data warehouse at a much lower cost. This is made possible by cheap cloud storage such as Azure Data Lake, Spark as the processing engine, and data stored in the Delta Lake format. In this chapter, we will learn about various Delta Lake optimizations that help us build a more performant Lakehouse.

In this chapter, we will cover the following topics:

  • Working with the OPTIMIZE and ZORDER commands
  • Using AUTO OPTIMIZE
  • Learning about delta caching
  • Learning about dynamic partition pruning
  • Understanding bloom filter indexing

Technical requirements

To follow the hands-on tutorials in this chapter, you will need access to a Databricks workspace in which you can create clusters and notebooks.

To start, let's spin up a Spark cluster with the following configuration (a programmatic sketch of the same settings follows the list):

  • Cluster Name: packt-cluster
  • Cluster Mode: Standard
  • Databricks Runtime Version: 8.3 (includes Apache Spark 3.1.1, Scala 2.12)
  • Autoscaling: Disabled
  • Automatic Termination: After 30 minutes of inactivity
  • Worker Type: Standard_DS3_v2
  • Number of workers: 1
  • Spot instances: Disabled
  • Driver Type: Same as the worker
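
If you prefer to create the cluster programmatically, the same settings can be expressed as a payload for the Databricks Clusters API. The following is a minimal sketch rather than a definitive recipe: the workspace URL and token are placeholders, and the field names follow the Clusters API 2.0:

    import requests

    # Placeholders -- substitute your own workspace URL and access token
    WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
    TOKEN = "<personal-access-token>"

    # Equivalent of the cluster settings listed above
    cluster_spec = {
        "cluster_name": "packt-cluster",
        "spark_version": "8.3.x-scala2.12",        # DBR 8.3 (Spark 3.1.1, Scala 2.12)
        "node_type_id": "Standard_DS3_v2",         # worker type
        "driver_node_type_id": "Standard_DS3_v2",  # driver same as the worker
        "num_workers": 1,                          # fixed size, so autoscaling is disabled
        "autotermination_minutes": 30,             # terminate after 30 minutes of inactivity
        "azure_attributes": {
            "availability": "ON_DEMAND_AZURE"      # on-demand VMs, so spot instances are disabled
        },
    }

    response = requests.post(
        f"{WORKSPACE_URL}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=cluster_spec,
    )
    print(response.json())  # returns the new cluster_id on success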

Now, create a new notebook and attach it to the newly created cluster to get started!

Working with the OPTIMIZE and ZORDER commands

Delta Lake on Databricks lets you speed up queries by changing the layout of the data stored in cloud storage. The algorithms that support this functionality are as follows:

  • Bin-packing: This uses the OPTIMIZE command and helps coalesce small files into larger ones.
  • Z-Ordering: This uses the ZORDER command and helps collocate data in the same set of files. This co-locality helps reduce the amount of data that's read by Spark while processing.

Let's learn more about these two layout algorithms with a worked-out example:

  1. Run the following code block:
    from pyspark.sql.types import *
    from pyspark.sql.functions import *
    # Define the schema manually instead of inferring it from the data
    manual_schema = StructType([
      StructField('Year',IntegerType(),True),
      StructField('Month',IntegerType(),True),
      StructField('DayofMonth',IntegerType(),True),
      StructField('DayOfWeek',IntegerType(),True...
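
Before working through the full example, here is a minimal sketch of how the two layout algorithms are invoked once data has been written to a Delta table. The table name flights and the Z-Order column are illustrative, not from the book's example:

    # Assumes the data above has been saved as a Delta table named
    # "flights" (an illustrative name)

    # Bin-packing: coalesce small files into larger, evenly sized ones
    spark.sql("OPTIMIZE flights")

    # Z-Ordering: additionally colocate related values in the same files.
    # Note that ZORDER BY is a clause of OPTIMIZE, not a standalone command
    spark.sql("OPTIMIZE flights ZORDER BY (DayofMonth)")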

Using Auto Optimize

Auto Optimize is a feature that automatically compacts small files during individual writes to a delta table. Unlike bin-packing, it does not require us to run the OPTIMIZE command each time; compaction happens as part of the write itself. It consists of two components:

  • Optimized Writes: Databricks dynamically optimizes Spark partition sizes so that files of approximately 128 MB are written to each table partition.
  • Auto Compaction: After a write to a table completes, Databricks runs an optimize job that compacts small files, attempting to coalesce them into files of approximately 128 MB. It prioritizes the partitions that have the greatest number of small files.

Next, we will learn about the Spark configurations for Auto Optimize and go through a worked-out example to understand how it works:

  1. To enable Auto Optimize for all new tables, we need to run the following Spark configuration code:
    %sql
    set spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true;
    set spark.databricks.delta.properties.defaults.autoOptimize.autoCompact = true;
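
    These session-level defaults only apply to tables created after they are set. As a complementary sketch (not from the book's example), the same behavior can be enabled for an existing Delta table through table properties; the table name flights is illustrative:

    # Enable both Auto Optimize components on an existing Delta table
    # ("flights" is a hypothetical table name)
    spark.sql("""
        ALTER TABLE flights
        SET TBLPROPERTIES (
            delta.autoOptimize.optimizeWrite = true,
            delta.autoOptimize.autoCompact = true
        )
    """)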

Learning about delta caching

Delta caching is an optimization technique that speeds up queries by storing data on the cluster nodes' local storage. The delta cache keeps local copies of data that resides in remote locations such as Azure Data Lake or Azure Blob Storage. It improves the performance of a wide range of queries, but it cannot store the results of arbitrary subqueries.

Once delta caching has been enabled, any data that is fetched from the remote location is automatically added to the cache; no further action is required. To preload data into the delta cache, the CACHE command can be used. The cache also automatically detects any changes made to the underlying data, so its contents stay consistent. The easiest way to use delta caching is to provision a cluster with Standard_L series worker types (Delta Cache Accelerated).
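
On worker types that are not Delta Cache Accelerated, the cache can also be enabled explicitly with a Spark configuration. The following is a minimal sketch, assuming a Delta table named flights (an illustrative name):

    # Enable the delta cache on this cluster's workers
    spark.conf.set("spark.databricks.io.cache.enabled", "true")

    # Optionally preload a table into the cache rather than waiting for
    # the first query to fetch it from remote storage
    spark.sql("CACHE SELECT * FROM flights")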

Now, we will go through a worked-out example with delta caching. To begin with, we will provision a new cluster with the...

Learning about dynamic partition pruning

Dynamic partition pruning is a data-skipping technique that can drastically speed up query execution times. Delta Lake collects metadata on the partition files it manages so that data can be skipped without needing to be read. This technique is very useful for star schema-style queries, as it can dynamically skip partitions and their respective files. Using this technique, we can prune the partitions of a fact table during its join to a dimension table: the filter applied to the dimension table to prune its partitions is dynamically applied to the fact table as well. We will now learn how this technique works by looking at an example. Before we get started, do not forget to spin up the packt-cluster cluster!
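
To make the mechanism concrete before the worked-out example, here is a minimal sketch; the table and column names (sales_fact, date_dim, date_key, year) are illustrative and not from the book's dataset:

    from pyspark.sql.functions import col

    # Hypothetical star schema: sales_fact is a large Delta table
    # partitioned by date_key; date_dim is a small dimension table
    fact = spark.table("sales_fact")
    dim = spark.table("date_dim")

    # The static filter is only on the dimension table. At run time,
    # Spark derives the matching date_key values and prunes the fact
    # table's partitions before scanning them
    result = fact.join(dim, "date_key").where(col("year") == 2021)
    result.explain()  # a pruned plan contains "dynamicpruningexpression"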

In this example, we will demonstrate a star schema model by joining a fact table and a dimension table. A star schema is one of the simplest ways to build a data warehouse. It consists of one or more fact...

Understanding bloom filter indexing

A bloom filter index is a data structure that provides data skipping on columns, especially on fields containing arbitrary text. The filter works by stating either that certain data is definitely not in a file or that it is probably in the file, where "probably" is governed by a configurable false positive probability (FPP). Bloom filter indexes can help speed up "needle in a haystack" types of queries, which are not sped up by other techniques.

Let's go through a worked-out example that illustrates the performance benefits of using a bloom filter index:

  1. We will start by checking the Spark configuration for bloom filter indexes. Run the following line of code in a new cell:
    spark.conf.get('spark.databricks.io.skipping.bloomFilter.enabled')

    By default, it is true.

  2. Now, we can start creating our very first bloom filter index! To begin with, let's create a delta table using the following block of code:
    %sql
    CREATE OR REPLACE TABLE bloom_filter_test...
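
Once the table exists, the index itself is created with a dedicated DDL statement. The following is a minimal sketch, assuming the table has a string column named id; the column name and option values are illustrative:

    # Create a bloom filter index on one column. fpp is the acceptable
    # false positive probability; numItems is the expected number of
    # distinct values the filter is sized for
    spark.sql("""
        CREATE BLOOMFILTER INDEX ON TABLE bloom_filter_test
        FOR COLUMNS (id OPTIONS (fpp = 0.1, numItems = 50000000))
    """)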

Summary

In this chapter, we learned about several optimization techniques for Databricks Delta Lake. We started with file compaction and clustering techniques and ended with techniques for efficient data skipping. These optimizations play a crucial role in making querying and data engineering workloads in Databricks faster and more efficient.

In the next chapter, we will learn about another set of Spark optimization techniques related to Spark core. We will develop a theoretical understanding of these optimizations and write code to understand their practical usage in different scenarios.
