Simplifying Data Engineering and Analytics with Delta

Chapter 7: Delta for Data Warehouse Use Cases

"It is not the strongest of the species that survives, nor the most intelligent. It is the one that is the most adaptable to change."

– Charles Darwin, On the Origin of Species ("descent with modification")

In the previous chapters, we went over the main capabilities of Delta and its edge over other data formats and protocols. Delta has its origins in data lakes, and we examined how it addresses the common challenges of traditional lakes. In fact, it is the evolutionary next stage of the data lake and fits into a new category known as the data lakehouse. So, Delta is a no-brainer for any new data lake initiative, but what about data warehouse use cases? Are those a separate category of data scenarios that data lakes do not tackle effectively?

The data warehouse is a concept that was born in the 1980s! It was popularized by relational database platforms. As is true of...

Technical requirements

To follow along with this chapter, make sure you have the code and instructions detailed at this GitHub location:

https://github.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/tree/main/Chapter07

Examples in this book cover some Databricks-specific features to provide a complete view of Delta's capabilities. Newer features continue to be ported from Databricks to open source Delta.

Let's get started!

Choosing the right architecture

If you were evaluating technologies to choose an architecture framework for your data use cases, what guiding criteria would you use to make the decision? The two popular options are the data warehouse and the data lake. The three main dimensions to consider are storage, compute, and governance.

The questions you should ask are as follows:

  • What are your use cases? If they are only BI today, do you need to future-proof your investment?
  • What kinds of data are you going to ingest? If it is mostly structured data, do you want to bear the cost of migrating architectures if your stakeholders suddenly want to understand free text or image data?
  • What is the existing skill set of the data personas who are going to work on this data platform?

If you want to retain the best talent, you'll need to give them opportunities to learn and grow, so you'll need to spend some time understanding the capabilities offered by warehouses and lakes and not make a short...

Understanding what a data warehouse really solves

At its core, a data warehouse is a repository of disparate data sources, mostly structured, that helps answer BI questions through reports and dashboards. It is typically used by heads of departments and businesses to get a bird's-eye view of how the organization is doing holistically, using SQL interfaces. It analyzes operational data to predict growth, identify bottlenecks and other business stragglers, evaluate performance against competitors using KPI metrics, and plan more strategically. Older on-premises offerings are moving into the cloud to take advantage of the elasticity of cloud computing.

This can be simplified into two main parts:

  • Base underlying storage:
    • Cloud storage is increasingly popular because it is affordable, scalable, and reliable.
  • The analytic layer built on top of storage:
    • The analytic layer houses several pieces beyond just data, such...

Discovering when a data lake does not suffice

Data lakes were supposed to fix all the deficiencies of warehouses, but did they? Let's find out. The best contribution of data lakes was to truly unify all kinds of data in an open format at all velocities, including real-time ingestion and analytics. Yes, a big check mark for this one. We can now have all kinds of unstructured data right alongside the structured data, making it a true first-class citizen. Likewise, we can enable ML on this data and use high-level languages with open APIs to grapple with it. Another big advantage is that data remains in an open format, allowing all tools in the ecosystem to leverage it effectively from a single source of truth. No more vendor lock-in!
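
As a minimal illustration of this unification, here is a hedged PySpark sketch that lands structured CSV and semi-structured JSON side by side in the same open Delta format. The paths and file names are illustrative, and it assumes the delta-spark package is on the classpath:

    from pyspark.sql import SparkSession

    # The two configs enable Delta's SQL extensions in open source Spark;
    # they assume the delta-spark package is installed.
    spark = (SparkSession.builder
             .appName("unified-ingest")
             .config("spark.sql.extensions",
                     "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # Structured and semi-structured sources; paths are illustrative.
    orders = spark.read.option("header", "true").csv("/lake/raw/orders.csv")
    events = spark.read.json("/lake/raw/events.json")

    # Both land in the same open format, queryable by any Delta-aware tool.
    orders.write.format("delta").mode("overwrite").save("/lake/bronze/orders")
    events.write.format("delta").mode("overwrite").save("/lake/bronze/events")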

The thing that they had to give up was effective BI, which warehouses were so good at. Exploratory Data Analysis (EDA), which did not have such stringent SLAs, was enabled, but core BI reporting...

Addressing concurrency and latency requirements with Delta

Analytics queries are of two types:

  • Ad hoc data exploration by analysts as they proceed with data discovery activities. Data scientists and BI analysts have some tolerance for ad hoc queries: if results take longer to retrieve, it is undesirable but tolerated.
  • Known queries for well-defined consumption patterns. There is very little tolerance for latency here: consumers expect these to be refreshed quickly, as the end user may be a business executive or someone outside the data organization who will dislike the wait.

We should remember that a dashboard hosts several queries as widgets or sections, and that there are several consumers of that data. The time it takes a query to return its results is referred to as its latency, and the number of simultaneous users it can serve is referred to as its concurrency. Latency and concurrency...
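
To make this concrete, the following is a minimal sketch of how Delta's OPTIMIZE command with Z-ordering (a Databricks feature that has also made its way into open source Delta) compacts small files and co-locates related rows so that known dashboard queries come back faster. The table and column names here are illustrative, not from the book's dataset:

    # Compact small files and co-locate rows on the column that
    # dashboard queries filter by most often; names are illustrative.
    spark.sql("OPTIMIZE sales_gold ZORDER BY (region)")

    # A known query now benefits from data skipping and fewer files.
    spark.sql("""
        SELECT region, SUM(amount) AS total_sales
        FROM sales_gold
        GROUP BY region
    """).show()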

Visualizing data using BI reporting

One of the main use cases served by warehouses is SQL queries and visualization in various reports and dashboards. Visualizing data involves converting it into graphical representations that make it easy to comprehend and to detect outliers. It is especially useful for correlating data and making a memorable statement about it. As noted earlier, the queries themselves fall into the two categories of ad hoc data exploration and known queries. There are a lot of tools on the market for visualizations, including Tableau, Power BI, Looker, Spotfire, and Qlik. They allow non-technical users to easily build personalized reports and dashboards, provided they have access to the right datasets.

Spark supports a distributed SQL engine through its Thrift-based JDBC/ODBC server or its command-line interface (https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html#running-the-thrift-jdbcodbc-server).
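
Per the Spark documentation linked above, that server is started with sbin/start-thriftserver.sh. For a BI tool connecting over JDBC/ODBC to see lake data by name, the Delta files also need to be registered as a table in the metastore. A minimal sketch, with illustrative database, table, and path names:

    # Register an existing Delta location as a named table so that
    # JDBC/ODBC clients (and hence BI tools) can query it directly.
    # Database, table, and path are illustrative.
    spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.daily_kpis
        USING DELTA
        LOCATION '/lake/gold/daily_kpis'
    """)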

Managed platforms such...

Analyzing tradeoffs in a push versus pull data flow

A long, long time ago, we started with a data warehouse. As we discovered its inadequacies, we moved to a data lake. However, a vanilla data lake is no silver bullet, so folks would perform expensive ETL in a data lake and push curated, aggregated data slivers into a downstream warehouse for BI tools to pick up. Another architecture anti-pattern that we've seen in the field is ETL being done in a warehouse and data being pushed to a lake to do ML. We have come a long way from there. Modern data lakes embrace the lakehouse paradigm, and BI tools can reach the data in a lake directly, bypassing the warehouse completely. We believe that this pattern will continue to gain traction in the industry. So, is the warehouse dead? In spirit, yes, but in practice, it'll take a few more years to phase out completely. So, when is it good to have any kind of specialized data store to the right of a data lake? If it can be avoided...

Considerations around data governance

Data governance refers to aligning all aspects of data strategy, business strategy, and compliance requirements. A three-pronged approach of people, policy, and process will provide oversight for all data operations from the time data touches a system to the point it leaves. Roles and responsibilities dictate who has access to what data, something that needs to be enforced and monitored. Data lineage is tracked to provide accountability for how data has been transformed at various steps. Delta's history functionality provides a good audit trail. A central catalog builds on top of it and provides a central place for defining the rules, enforcing them, and monitoring compliance via audit logs. Some of these catalogs have to be built and stitched together unless a managed platform that has taken care of these aspects is leveraged.
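
As a small illustration of that audit trail, the following sketch reads a Delta table's transaction history, which records who ran which operation, when, and with what parameters. The table name is illustrative:

    from delta.tables import DeltaTable

    # Each history row captures the version, timestamp, operation, and
    # operation parameters for the table; the name is illustrative.
    dt = DeltaTable.forName(spark, "sales_gold")
    (dt.history()
       .select("version", "timestamp", "operation", "operationParameters")
       .show(truncate=False))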

People using data need to be assured of its quality, so being able to define constraints, note when they have...
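
On the constraints point, here is a minimal sketch of a Delta CHECK constraint, with illustrative table and column names; once declared, any write that violates the rule is rejected:

    # Declare a data quality rule directly on the table; Delta will
    # fail writes that violate it. Names are illustrative.
    spark.sql("""
        ALTER TABLE sales_gold
        ADD CONSTRAINT amount_positive CHECK (amount > 0)
    """)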

The rise of the lakehouse category

Simply put, "lakehouse" refers to an open data architecture that combines the best of data lakes and data warehouses on a single platform. At this point, it would be fair to say that a lakehouse is closer to a data lake than to a data warehouse. In fact, it is an extension of your data lake to support all use cases, from BI to AI. The data science and ML personas who were shunted into downstream applications, because the tools of their trade were so vastly different, can now share the same stage and have access to the same data as other data personas. This eliminates the need to stitch fragile systems together and leads to better data quality and lower end-to-end latencies, since there is no need to copy data across disparate architectures. The following diagram shows the growing pains of both warehouses and lakes, and how a lakehouse combines the best attributes of both architectures.

Figure 7.5 – From the...

Summary

In this chapter, we emphasized the need to choose the right architecture for future-proofing a business. This choice determines how easily new use cases can be onboarded and how productive data personas will be in exploring and executing them. Traditional data warehouses and data lakes have their own strengths and weaknesses, and the lakehouse is a happy amalgamation of the two technologies.

The data format of warehouses is closed and proprietary, whereas a lakehouse prescribes an open data format. Our recommendation is to use Delta, as it is the best open data format in the open source community today. Warehouses cater mostly to structured data, with some semi-structured support, whereas a lakehouse supports all kinds of data, including unstructured data. Cloud storage is highly scalable, durable, and cost-effective, so a lakehouse is not only highly scalable but also much cheaper and more performant than its warehouse counterpart. A warehouse was designed...
