Chapter 6: Solving Common Data Pattern Scenarios with Delta

"Without changing our pattern of thought, we will not be able to solve the problems we created with our current pattern of thoughts"

– Albert Einstein

In the previous chapters, we established the foundation of Delta: how it helps consolidate disparate datasets and how it offers a wide array of tools to slice and dice data using unified processing and storage APIs. We examined basic Create, Retrieve, Update, Delete (CRUD) operations with Delta, along with time travel, which rewinds to a view of the data at a previous point in time to enable rollbacks. We used Delta to showcase fine-grained updates and deletes and the handling of late-arriving data, which may arise on account of an upstream technical glitch or a human error. We demonstrated the ability to...

Technical requirements

To follow along with this chapter, make sure you have the code and instructions detailed at this GitHub location:

https://github.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/tree/main/Chapter06

Examples in this book cover some Databricks-specific features to provide a complete view of capabilities. New features continue to be ported from Databricks to the open source Delta.

Let's get started!

Understanding use case requirements

Each problem that a client brings up will have some similarities to problems you may have seen before, and yet some nuances that make it a little different. So, before rushing to reuse a solution, you need to understand the requirements and their priorities so that they can be handled in the order of importance that the client assigns to them. A good way to look at requirements is by demarcating the functional ones from the non-functional ones. Functional requirements specify what the system should do, whereas non-functional requirements describe how the system should perform. For example, we may be able to perform fine-grained deletes from the enterprise data lake for a GDPR compliance requirement, but if it takes two days and two engineers to do so at the end of each month, it will not meet a 12-hour SLA. The technical capability exists, but the solution is still not usable. The following diagram helps you classify...

Minimizing data movement with Delta time travel

Apart from helping to ensure data quality, the other advantage of minimizing data movement is that it reduces the costs associated with the data. To prevent fragile, disparate systems from being stitched together, the first core requirement is to keep data in an open format that multiple tools of the ecosystem can handle, which is what Delta architectures promote.

There are some scenarios where a data professional needs to make copies of an underlying dataset. For example, a data engineer running a series of A/B tests needs a point-in-time reference to a data snapshot to compare against for debugging and integration testing purposes. A BI analyst may need to run different reports off the same data to perform some audit checks. Similarly, an ML practitioner may need a consistent dataset because experiments have to be compared across different ML model architectures or against different hyperparameter combinations...
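As a hedged sketch of how time travel can satisfy these needs without physical copies, the following snippet reads the same Delta table at a fixed version and at a fixed timestamp. The table path, version number, and timestamp are illustrative assumptions, not values from the book's examples.

# Illustrative sketch: pinning point-in-time snapshots with Delta time travel
# instead of making physical copies. The path, version, and timestamp below
# are hypothetical values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-time-travel").getOrCreate()

events_path = "/mnt/datalake/events"  # hypothetical Delta table location

# Read the table as of a specific commit version...
events_v42 = (spark.read.format("delta")
              .option("versionAsOf", 42)
              .load(events_path))

# ...or as of a timestamp, for a point-in-time reference.
events_snapshot = (spark.read.format("delta")
                   .option("timestampAsOf", "2022-07-01 00:00:00")
                   .load(events_path))

# Both DataFrames are consistent snapshots that debugging runs, audit
# reports, and ML experiments can share without duplicating the data.
print(events_v42.count(), events_snapshot.count())

Because every reader pins the same table version, the comparison is reproducible even while new data continues to land in the live table.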

Delta cloning

Cloning is the process of making a copy. In the previous section, we started out by saying that we should try to minimize data movement and data copies whenever possible, because there will always be a lot of effort required to keep things in sync and reconcile data. However, there are some cases where copying is inevitable because of business requirements. For example, there may be a need for data archiving, for reproducing an MLflow experiment in a different environment, for short-term experimental runs on production data, for sharing data with a different line of business (LOB), or perhaps for tweaking a few table properties without affecting the original source, especially if there are consumers leveraging it with certain assumptions.

Shallow cloning refers to copying only the metadata, whereas deep cloning refers to copying both the metadata and the data. If a shallow clone suffices, it should be preferred, as it is lightweight and inexpensive; a deep clone is a more involved process.
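A hedged sketch of both flavors, expressed as SQL run through PySpark, might look as follows. The table names are hypothetical, and CLONE support (particularly DEEP CLONE) varies between Databricks Runtime and open source Delta Lake versions.

# Hedged sketch of Delta cloning via SQL. Table names (prod.sales,
# dev.sales_shallow, archive.sales_2022) are hypothetical, and CLONE
# support differs across Databricks Runtime and open source Delta versions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-clone").getOrCreate()

# Shallow clone: copies only the table metadata; data files are still
# referenced from the source table, so it is cheap and fast to create.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dev.sales_shallow
    SHALLOW CLONE prod.sales
""")

# Deep clone: copies metadata and data files, producing a fully
# independent table, for example for archiving or cross-environment tests.
spark.sql("""
    CREATE TABLE IF NOT EXISTS archive.sales_2022
    DEEP CLONE prod.sales
""")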

...

Handling CDC

Change Data Capture (CDC) is a process that classifies incoming records in real time to determine which ones are brand new, which ones are modifications of existing data, and which ones are requests for deletes. Operational data stores continuously capture transactions in OLTP systems and stream them across to OLAP systems. These two data systems need to be kept in sync to reconcile data and maintain data fidelity. It is like a replay of the operations, but on a different system.

CDC

This is the flow of data from the OLTP system into an OLAP system, typically into the first landing zone, which is referred to as the bronze layer in the medallion architecture. Several tools, such as GoldenGate from Oracle or PowerGate from Informatica, support the generation of change sets; alternatively, they could be generated by other relational stores that capture this information on a data modification trigger. Moreover, this could be an omni-channel scenario where the same type of data is...
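To make this concrete, here is a hedged sketch of applying one such change set to a Delta table with MERGE. The table name, the change-set DataFrame, and the I/U/D operation flag convention are illustrative assumptions rather than the book's own example.

# Hedged sketch of replaying a CDC change set into a bronze Delta table
# with MERGE. The table (bronze.customers), the columns, and the op flag
# convention (I = insert, U = update, D = delete) are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-merge").getOrCreate()

# A change set typically carries the business key, the new column values,
# and an operation flag emitted by the CDC tool.
changes = spark.createDataFrame(
    [(1, "Alice", "U"), (2, "Bob", "I"), (3, None, "D")],
    ["customer_id", "name", "op"],
)

target = DeltaTable.forName(spark, "bronze.customers")

(target.alias("t")
 .merge(changes.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedDelete(condition="s.op = 'D'")           # replay deletes
 .whenMatchedUpdate(condition="s.op = 'U'",           # replay updates
                    set={"name": "s.name"})
 .whenNotMatchedInsert(condition="s.op = 'I'",        # replay inserts
                       values={"customer_id": "s.customer_id",
                               "name": "s.name"})
 .execute())

The whole MERGE commits atomically, so downstream readers see either the table before the change set was applied or after it, never a partial replay.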

Handling Slowly Changing Dimensions (SCD)

Operational data makes its way into OLAP systems that comprise fact and dimension tables. The facts change frequently and are usually additive in nature. The dimensions do not change as often, but they do experience some change, hence the name "slowly changing dimensions."

Business rules dictate how this change is to be handled and the various types of SCD operations reflect this. The following table lists them.

Figure 6.5 – SCD types

Of all these alternatives, types 1 and 2 are the most popular in the industry. In the next section, we will explore them in more detail.

SCD Type 1

This is fairly straightforward, as there is no need to store the historical data; the newer data simply overwrites the older data. Delta's MERGE construct comes in handy here. There is an initial full load of the data; thereafter, new data is inserted, existing data is updated in place, and delete requests remove the data altogether.
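As a hedged sketch of such an upsert (the dim_customer table and its columns are assumptions made for illustration), a Type 1 merge with Delta could look like this:

# Hedged sketch of an SCD Type 1 upsert: newer attribute values simply
# overwrite older ones, and no history is retained. The dimension table
# (dim_customer) and its columns are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scd-type1").getOrCreate()

# Incoming dimension updates from the source system.
updates = spark.createDataFrame(
    [(1, "Alice", "Boston"), (4, "Dan", "Chicago")],
    ["customer_id", "name", "city"],
)

dim = DeltaTable.forName(spark, "dim_customer")

(dim.alias("t")
 .merge(updates.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedUpdateAll()      # overwrite existing rows; no history kept
 .whenNotMatchedInsertAll()   # insert brand-new dimension members
 .execute())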

...

Summary

Delta Lake's ACID transactions make it much easier to reliably perform UPDATE and DELETE operations. Delta introduces the MERGE INTO operator to perform upsert/merge actions atomically, along with time travel features that provide rewind capabilities on Delta Lake tables. Cloning, CDC, and SCD are patterns found in several use cases that build upon these base operations. In this chapter, we looked at these common data patterns and showed how Delta provides efficient, robust, and elegant solutions to simplify the everyday work scenarios of a data persona, allowing them to focus on the use case at hand.

In the next chapter, we will look at data warehouse use cases and see if all of them can be accommodated in the context of a data lake. We will reflect on whether there is a better architecture strategy to consider instead of just shunting between warehouses and lakes.
