Chapter 6: Solving Common Data Pattern Scenarios with Delta

"Without changing our pattern of thought, we will not be able to solve the problems we created with our current pattern of thoughts"

– Albert Einstein

In the previous chapters, we established the foundation of Delta: how it helps consolidate disparate datasets and how it offers a wide array of tools to slice and dice data using unified processing and storage APIs. We examined basic Create, Retrieve, Update, Delete (CRUD) operations with Delta, along with time travel, which rewinds to a view of the data at a previous point in time to enable rollbacks. We used Delta to showcase fine-grained updates and deletes and the handling of late-arriving data, which may arise on account of an upstream technical glitch or a human error. We demonstrated the ability to...

Technical requirements

To follow along with this chapter, make sure you have the code and instructions detailed at this GitHub location:

https://github.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/tree/main/Chapter06

Examples in this book cover some Databricks-specific features to provide a complete view of capabilities. New features continue to be ported from Databricks to the open source Delta.

Let's get started!

Understanding use case requirements

Each problem that a client brings up will have some similarities to problems you may have seen before, and yet some nuances that make it a little different. So, before rushing to reuse a solution, you need to understand the requirements and their priorities so that they can be handled in the order of importance that the client assigns to them. A good way to look at requirements is by demarcating the functional ones from the non-functional ones. Functional requirements specify what the system should do, whereas non-functional requirements describe how the system should perform. For example, we may be able to perform fine-grained deletes from the enterprise data lake for a GDPR compliance requirement, but if it takes two days and two engineers to do so at the end of each month, it will not meet a 12-hour SLA. The technical capability exists, but the solution is still not usable. The following diagram helps you classify...

Minimizing data movement with Delta time travel

Apart from helping to ensure data quality, the other advantage of minimizing data movement is that it reduces the costs associated with the data. To prevent fragile, disparate systems from being stitched together, the first core requirement is to keep data in an open format that multiple tools of the ecosystem can handle, which is what Delta architectures promote.

There are some scenarios where a data professional needs to make copies of an underlying dataset. For example, a data engineer running a series of A/B tests needs a point-in-time reference to a data snapshot to compare against for debugging and integration testing purposes. A BI analyst may need to run different reports off the same data to perform some audit checks. Similarly, an ML practitioner may need a consistent dataset because experiments have to be compared across different ML model architectures or against different hyperparameter combinations...
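As a hedged sketch of how time travel can satisfy these needs without physical copies, the following snippet reads the same Delta table at a fixed version and at a fixed timestamp. The table path, version number, and timestamp are illustrative assumptions, not values from the book's examples.

# Illustrative sketch: pinning point-in-time snapshots with Delta time travel
# instead of making physical copies. The path, version, and timestamp below
# are hypothetical values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-time-travel").getOrCreate()

events_path = "/mnt/datalake/events"  # hypothetical Delta table location

# Read the table as of a specific commit version...
events_v42 = (spark.read.format("delta")
              .option("versionAsOf", 42)
              .load(events_path))

# ...or as of a timestamp, for a point-in-time reference.
events_snapshot = (spark.read.format("delta")
                   .option("timestampAsOf", "2022-07-01 00:00:00")
                   .load(events_path))

# Both DataFrames are consistent snapshots that debugging runs, audit
# reports, and ML experiments can share without duplicating the data.
print(events_v42.count(), events_snapshot.count())

Because every reader pins the same table version, the comparison is reproducible even while new data continues to land in the live table.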

Delta cloning

Cloning is the process of making a copy. In the previous section, we started out by saying that we should try to minimize data movement and data copies whenever possible, because there will always be a lot of effort required to keep things in sync and reconcile data. However, there are some cases where copying is inevitable because of business requirements. For example, there may be a need for data archiving, for reproducing an MLflow experiment in a different environment, for short-term experimental runs on production data, for sharing data with a different line of business (LOB), or perhaps for tweaking a few table properties without affecting the original source, especially if there are consumers leveraging it with certain assumptions.

Shallow cloning refers to copying only the metadata, whereas deep cloning refers to copying both the metadata and the data. If a shallow clone suffices, it should be preferred, as it is lightweight and inexpensive; a deep clone is a more involved process.
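A hedged sketch of both flavors, expressed as SQL run through PySpark, might look as follows. The table names are hypothetical, and CLONE support (particularly DEEP CLONE) varies between Databricks Runtime and open source Delta Lake versions.

# Hedged sketch of Delta cloning via SQL. Table names (prod.sales,
# dev.sales_shallow, archive.sales_2022) are hypothetical, and CLONE
# support differs across Databricks Runtime and open source Delta versions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-clone").getOrCreate()

# Shallow clone: copies only the table metadata; data files are still
# referenced from the source table, so it is cheap and fast to create.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dev.sales_shallow
    SHALLOW CLONE prod.sales
""")

# Deep clone: copies metadata and data files, producing a fully
# independent table, for example for archiving or cross-environment tests.
spark.sql("""
    CREATE TABLE IF NOT EXISTS archive.sales_2022
    DEEP CLONE prod.sales
""")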

...

Handling CDC

Change Data Capture (CDC) is a process that classifies incoming records in real time to determine which ones are brand new, which ones are modifications of existing data, and which ones are requests for deletes. Operational data stores continuously capture transactions in OLTP systems and stream them across to OLAP systems. These two data systems need to be kept in sync to reconcile data and maintain data fidelity. It is like a replay of the operations, but on a different system.

CDC

This is the flow of data from the OLTP system into an OLAP system, typically into the first landing zone, which is referred to as the bronze layer in the medallion architecture. Several tools, such as GoldenGate from Oracle or PowerGate from Informatica, support the generation of change sets; alternatively, they could be generated by other relational stores that capture this information on a data modification trigger. Moreover, this could be an omni-channel scenario where the same type of data is...
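To make this concrete, here is a hedged sketch of applying one such change set to a Delta table with MERGE. The table name, the change-set DataFrame, and the I/U/D operation flag convention are illustrative assumptions rather than the book's own example.

# Hedged sketch of replaying a CDC change set into a bronze Delta table
# with MERGE. The table (bronze.customers), the columns, and the op flag
# convention (I = insert, U = update, D = delete) are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-merge").getOrCreate()

# A change set typically carries the business key, the new column values,
# and an operation flag emitted by the CDC tool.
changes = spark.createDataFrame(
    [(1, "Alice", "U"), (2, "Bob", "I"), (3, None, "D")],
    ["customer_id", "name", "op"],
)

target = DeltaTable.forName(spark, "bronze.customers")

(target.alias("t")
 .merge(changes.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedDelete(condition="s.op = 'D'")           # replay deletes
 .whenMatchedUpdate(condition="s.op = 'U'",           # replay updates
                    set={"name": "s.name"})
 .whenNotMatchedInsert(condition="s.op = 'I'",        # replay inserts
                       values={"customer_id": "s.customer_id",
                               "name": "s.name"})
 .execute())

The whole MERGE commits atomically, so downstream readers see either the table before the change set was applied or after it, never a partial replay.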

Handling Slowly Changing Dimensions (SCD)

Operational data makes its way into OLAP systems that comprise fact and dimension tables. The facts change frequently and are usually additive in nature. The dimensions do not change as often, but they do experience some change, hence the name "slowly changing dimensions."

Business rules dictate how this change is to be handled and the various types of SCD operations reflect this. The following table lists them.

Figure 6.5 – SCD types

Of all these alternatives, types 1 and 2 are the most popular in the industry. In the next section, we will explore them in more detail.

SCD Type 1

This is fairly straightforward, as there is no need to store the historical data; the newer data simply overwrites the older data. Delta's MERGE construct comes in handy here. There is an initial full load of the data; thereafter, new data is inserted, existing data is updated in place, and delete requests remove the data altogether.
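As a hedged sketch of such an upsert (the dim_customer table and its columns are assumptions made for illustration), a Type 1 merge with Delta could look like this:

# Hedged sketch of an SCD Type 1 upsert: newer attribute values simply
# overwrite older ones, and no history is retained. The dimension table
# (dim_customer) and its columns are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scd-type1").getOrCreate()

# Incoming dimension updates from the source system.
updates = spark.createDataFrame(
    [(1, "Alice", "Boston"), (4, "Dan", "Chicago")],
    ["customer_id", "name", "city"],
)

dim = DeltaTable.forName(spark, "dim_customer")

(dim.alias("t")
 .merge(updates.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedUpdateAll()      # overwrite existing rows; no history kept
 .whenNotMatchedInsertAll()   # insert brand-new dimension members
 .execute())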

...

Summary

Delta Lake's ACID transactions make it much easier to reliably perform UPDATE and DELETE operations. Delta introduces the MERGE INTO operator to perform upsert/merge actions atomically, along with time travel features that provide rewind capabilities on Delta Lake tables. Cloning, CDC, and SCD are patterns found in several use cases that build upon these base operations. In this chapter, we looked at these common data patterns and showed how Delta provides efficient, robust, and elegant solutions to simplify the everyday work scenarios of a data persona, allowing them to focus on the use case at hand.

In the next chapter, we will look at data warehouse use cases and see if all of them can be accommodated in the context of a data lake. We will reflect on whether there is a better architecture strategy to consider instead of just shunting between warehouses and lakes.
