Chapter 3: Delta – The Foundation Block for Big Data

"Without a solid foundation, you will have trouble creating anything of value."

– Erica Oppenheimer, on academic mastery

In the previous chapters, we looked at the trends in big data processing and how to model data. In this chapter, we will look at the need to break down data silos and consolidate all types of data in a centralized data lake to get holistic insights. First, we will understand the importance of the Delta protocol and the specific problems it helps address. Data products have certain repeatable patterns, and we will apply Delta in each situation to analyze the before and after scenarios. Then, we will look at the underlying file format and the components used to build Delta, its genesis, and the high-level features that make Delta the go-to file format for all types of big data workloads. Delta makes the job easier not only for data engineers but also for other data personas...

Technical requirements

The following GitHub link will help you get started with Delta: https://github.com/delta-io/delta. Here, you will find the Delta Lake documentation and QuickStart guide to help you set up your environment and become familiar with the necessary APIs.

To follow along with this chapter, make sure you have the code and instructions detailed at this GitHub location: https://github.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/tree/main/Chapter03

The examples in this book cover some Databricks-specific features to provide a complete view of Delta's capabilities. Newer features continue to be ported from Databricks to open source Delta (https://github.com/delta-io/delta/issues/920).
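If you want to experiment outside Databricks, the following is a minimal local setup sketch based on the Delta Lake QuickStart; the application name is just a placeholder, and the version pairing of pyspark and delta-spark is an assumption you should verify against the Delta documentation.

```python
# Minimal local setup sketch (assumes: pip install pyspark delta-spark,
# with mutually compatible versions).
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-quickstart")  # placeholder app name
    # Register Delta's SQL extensions and catalog with Spark
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
# Pulls in the Delta Lake jars matching the installed delta-spark package
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```

On Databricks, a Delta-enabled Spark session is available out of the box, so this step is only needed in self-managed environments.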

Let's start by examining the main challenges plaguing traditional data lakes.

Motivation for Delta

Data lakes have been in existence for a while now, so their need is no longer questioned. What is more relevant is how the solution is implemented. Consolidating all the siloed data does not, by itself, constitute a data lake, but it is a starting point. Layering in governance makes the data consumable and is a step toward a curated data lake. Big data systems provide scale out of the box but force us to make some accommodations for data quality. Age-old guarantees of transactional integrity were compromised on distributed systems because ACID compliance was very hard to maintain, so BASE properties were favored instead. All of this moved the needle in the wrong direction: instead of pristine data lakes, we were drifting toward data swamps, where the data could not be trusted and, consequently, neither could the insights generated from it. So, what is the point of building a data lake?

Let's consider a few common...

Demystifying Delta

The Delta protocol is based on Parquet and has several components. Let's look at its composition. The transaction log is the secret sauce that supports key features such as ACID compliance, schema evolution, and time travel, and it unlocks the power of Delta. It is an ordered record of every change made to the table by users and can be regarded as the single source of truth. The following diagram shows the sub-components that are broadly regarded as part of a Delta table:

Figure 3.3 – Delta protocol components

The main point to highlight is that the metadata lies alongside the data in the transaction log. Previously, all the metadata was kept in a metastore. However, when data changes frequently, that is too much information to store in a metastore, and storing only the last state means lineage and history are lost. In the context of big data, the transaction history and metadata changes are also big data by...
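To make this concrete, the following sketch shows two ways of looking at the transaction log; the table path is hypothetical, and spark is assumed to be a Delta-enabled session as set up in the Technical requirements section.

```python
import os
from delta.tables import DeltaTable

table_path = "/tmp/delta/events"  # hypothetical table location

# Every commit to the table adds a numbered JSON file under _delta_log/
print(sorted(os.listdir(os.path.join(table_path, "_delta_log"))))
# e.g. ['00000000000000000000.json', '00000000000000000001.json', ...]

# The same ordered history is queryable through the DeltaTable API
(DeltaTable.forPath(spark, table_path)
    .history()
    .select("version", "timestamp", "operation")
    .show())
```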

The main features of Delta

The features we will define in this section are the weapons in the arsenal that Delta provides for creating data products and services. They will help ensure that your pipelines are built on sound principles of reliability and performance, maximizing the effectiveness of the use cases built on top of them. Without any more preamble, let's dive right in.

ACID transaction support

In a cloud ecosystem, even the most robust and well-tested pipelines can fail on account of temporary glitches, reinforcing the fact that a chain is only as strong as its weakest link: it doesn't matter whether a long-running job fails in its first few minutes or its last few. Cleaning up the resulting mess in a distributed system would be an arduous task. Worse still, partial data may already have been exposed to consumers, who could use it in their dashboards or models, arrive at wrong insights, and trigger incorrect alarms...
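The following sketch illustrates what atomic commits buy you, assuming the hypothetical table path below and the Delta-enabled spark session from earlier; the deliberately failing UDF stands in for any mid-job glitch.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

path = "/tmp/delta/events"  # hypothetical table location

# Version 0: a known-good table of 1,000 rows
spark.range(0, 1000).write.format("delta").mode("overwrite").save(path)

@udf(LongType())
def fail_midway(x):
    # Simulates a transient glitch partway through the job
    if x == 500:
        raise ValueError("simulated task failure")
    return x

try:
    # This overwrite fails before it can commit to the transaction log
    (spark.range(0, 1000)
        .withColumn("id", fail_midway("id"))
        .write.format("delta").mode("overwrite").save(path))
except Exception:
    pass

# Readers still see the last committed version: no partial data is exposed
print(spark.read.format("delta").load(path).count())  # 1000
```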

Life with and without Delta

The tech landscape is changing rapidly, with the whole industry innovating faster today than ever before. A complex system is hard to change and is not agile enough to take advantage of this pace of innovation, especially in the open source world. Delta is an open source protocol that facilitates flexible analytics platforms, as it comes prepackaged with a lot of features that benefit all kinds of data personas. With its support for ACID transactions and full compatibility with the Apache Spark APIs, it is a no-brainer to adopt it for all your data use cases. This helps simplify the architecture both during development and during subsequent maintenance phases. Features such as the unification of batch and streaming and schema inference and evolution take the burden off DevOps and data engineering personnel, allowing them to focus on the core use cases that keep the business competitive.

It is very easy to create a Delta table, store data in Delta format, or...
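As a sketch of how little ceremony is involved, the snippet below writes a DataFrame in Delta format, reads it back, registers it for SQL, and converts an existing Parquet directory in place; the paths and table name are hypothetical, and spark is the Delta-enabled session from earlier.

```python
from delta.tables import DeltaTable

# Write a small DataFrame in Delta format
df = spark.range(0, 5)
df.write.format("delta").mode("overwrite").save("/tmp/delta/quickstart")

# Read it back
spark.read.format("delta").load("/tmp/delta/quickstart").show()

# Register it as a SQL table over the same location
spark.sql("""
    CREATE TABLE IF NOT EXISTS quickstart
    USING DELTA LOCATION '/tmp/delta/quickstart'
""")
spark.sql("SELECT count(*) FROM quickstart").show()

# Convert an existing Parquet directory to Delta in place
# (the directory below is hypothetical and must already exist)
DeltaTable.convertToDelta(spark, "parquet.`/tmp/parquet/events`")
```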

Summary

Delta helps address the inherent challenges of traditional data lakes and is the foundational piece of the Lakehouse paradigm, which makes it a clear choice in big data projects.

In this chapter, we examined the Delta protocol and its main features, contrasted the before and after scenarios, and concluded that not only do the features work out of the box, but it is also very easy to transition to Delta and start reaping the benefits instead of spending time, resources, and effort solving the same infrastructure problems over and over again.

There is great value in applying Delta to real-world big data use cases, especially those involving fine-grained updates and deletes (as in the GDPR scenario), enforcing and evolving schemas, or going back in time using its time travel capabilities.
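As a brief illustration of those last two capabilities, the sketch below reads an earlier version of a table and then deletes one user's records in place; the table path and the user_id column are hypothetical, and spark is the Delta-enabled session from earlier.

```python
from delta.tables import DeltaTable

table_path = "/tmp/delta/events"  # hypothetical table location

# Time travel: read the table as it was at an earlier version
old = spark.read.format("delta").option("versionAsOf", 0).load(table_path)

# Fine-grained delete, e.g. removing one user's records for a GDPR request
DeltaTable.forPath(spark, table_path).delete("user_id = 'u-123'")
```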

In the next chapter, we will look at examples of ETL pipelines involving both batch and streaming to see how Delta helps unify them, simplifying not only their creation but also their maintenance.
