Chapter 2: Data Modeling and ETL

"Poor programmers care about code, and good programmers care about the data structure and the relationships between data."

— Linus Torvalds, the creator of Linux, on the importance of data modeling

In the previous chapter, we introduced the big data ecosystem, the use cases across different industry verticals that consume this data, the common challenges they all face, and their journey toward digitization. We also looked at the trends in compute and storage technologies, along with cloud adoption, that are paving the way for companies to become more data-driven.

Data platforms are continuously evolving to support business analytics use cases, and speed to insight is critical for a business to remain relevant and competitive. Both BI and AI leverage curated data to produce sound insights, but getting to curated data requires some discipline around data layout, modeling, and governance. In this chapter, we will look at ways...

Technical requirements

To follow along with this chapter, make sure you have the code and instructions detailed in this GitHub location: https://github.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/tree/main/Chapter02

Let’s get started!

What is data modeling and why should you care?

When you design software, you start architecting with a paper design of the various components and how they interact. The same is true of big data systems. To get the most value out of data, you need to understand its intrinsic properties and inherent relationships. Data modeling is the process of organizing, representing, and visualizing your data so that it fits the needs of the business processes. There are several functional and nonfunctional requirements around data, technology, and business that should be taken into consideration. Business operational processes and the structure of the data generated by those operations are inputs to the data model. Let's look at some of the advantages of going through data modeling before rushing to implement a data solution.
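As a small illustration of what the output of a modeling exercise can look like, the following sketch expresses a simple star schema as two Delta tables. The table and column names are hypothetical (not taken from the book's examples) and assume a Spark session with Delta Lake configured; the point is only that relationships, here a fact table referencing a dimension by key, are decided up front rather than discovered later.

# A minimal sketch of a dimensional (star-schema) model expressed as Delta tables.
# Table and column names are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Dimension table: one row per customer, describing who they are
spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_id BIGINT,
        name        STRING,
        segment     STRING
    ) USING DELTA
""")

# Fact table: one row per order event, referencing the dimension by its key
spark.sql("""
    CREATE TABLE IF NOT EXISTS fact_orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_ts    TIMESTAMP,
        amount      DECIMAL(10, 2)
    ) USING DELTA
""")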

Advantages of a data modeling exercise

At its core, data modeling is about persisting data and retrieving it in an optimal way. The following lists some...

Understanding metadata – data about data

Metadata is data about data, and it is an important governance aspect exposed through data catalogs. The data value use case centers on the ability to identify key data assets and assess their economic importance to the organization. Let's examine the different aspects of metadata.

Data catalog

A catalog is a tool that houses the metadata and provides the tooling for search and discoverability. Catalogs are often confused with data dictionaries, which are just data artifacts and do not necessarily have the associated tooling to facilitate data search and retrieval.

There are several vendors in this space, and some of the popular ones include Collibra, Alation, and AWS Glue. The data discovery use case is probably the most valuable, as it helps users (data engineers, data analysts, and data scientists) search for, find, and understand data.
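As a minimal example of the kind of metadata a catalog surfaces, the sketch below inspects table-level and column-level metadata through Spark's built-in catalog. It assumes a Spark session with Delta configured and an existing table named fact_orders (the hypothetical table from the earlier sketch).

# Inspecting metadata through Spark's built-in catalog; the table name is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Table-level metadata: which tables the catalog knows about in the current database
for table in spark.catalog.listTables():
    print(table.name, table.tableType, table.isTemporary)

# Column-level metadata: names, types, comments, and table properties
spark.sql("DESCRIBE EXTENDED fact_orders").show(truncate=False)

# For Delta tables, DESCRIBE DETAIL adds format, location, size, and file counts
spark.sql("DESCRIBE DETAIL fact_orders").show(truncate=False)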

Data governance is another important capability, where data lineage is documented in a central place and...

Moving and transforming data using ETL

A data pipeline is an artifact of a data engineering process. It transforms raw data into data that is ready for analytics, which in turn helps solve problems, support decisions, and make our lives more convenient. In some ways, it can be thought of as the stitch between OLTP and OLAP systems. Data pipelines are sometimes referred to as ETL, which stands for extract, transform, load; a common variation is extract, load, transform (ELT). The main difference between the two is whether the incoming data is transformed before it is loaded into the target, or first landed as-is and then transformed in place (data wrangling). Either way, the processing is loosely referred to as ETL, although it is fair to say that ELT is more relevant in the context of data lakes and unstructured data, whereas ETL is typically used for data warehouses. The following diagram shows how ETL bridges the gap between OLTP and OLAP systems:

Figure 2.9 – ETL stitches OLTP and OLAP systems

Data pipelines include...
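To make the ETL flavor concrete, here is a minimal PySpark sketch of an extract-transform-load flow into Delta. The paths, column names, and table names are hypothetical, and it assumes a Spark session with Delta Lake configured.

# Extract raw files, transform them, and load the result into a Delta table.
# Paths and names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Extract: read raw CSV files landed by an upstream (OLTP-side) system
raw = spark.read.option("header", "true").csv("/landing/orders/")

# Transform: fix types, derive a date column, and drop malformed rows
clean = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .withColumn("order_date", F.to_date("order_ts"))
       .dropna(subset=["order_id", "amount"])
)

# Load: write curated data to a Delta table for downstream BI/AI consumers
clean.write.format("delta").mode("append").saveAsTable("fact_orders")

# An ELT variant would land the raw data first and transform it in place afterward:
# raw.write.format("delta").mode("append").saveAsTable("raw_orders")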

How to choose the right data format

Not all tools support all data formats. Every tool reads data off disk in blocks (KB/MB/GB in size), so minimizing these fetches helps improve the speed of access to data. Conversely, a single read for a single record brings back much more data than you may want, so caching it may help with subsequent queries. Different systems have different default block sizes. To choose the right data format, you need to consider several factors, such as the following (a short sketch after this list illustrates the column-pruning point):

  • What is the optimal tradeoff between cost, performance, and throughput considerations of ingestion and access patterns?
  • Are you constrained by storage or memory or CPU or I/O?
  • How large is a file? If your data is not splittable, you lose the parallelism that allows fast queries.
  • How many columns are being stored, and how many columns are used for the analysis?
  • Does your data change over time? If it does, how often does it happen, and how does it change?...
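The following sketch contrasts a row-oriented format with a columnar one for an analysis that touches only a couple of columns; the columnar read can skip the wide payload column entirely. The paths and column names are made up for illustration.

# Write the same data as CSV (row-oriented) and Parquet (columnar), then run a
# narrow aggregation. Only the columnar format lets the scan prune unused columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).select(
    F.col("id"),
    (F.col("id") % 100).alias("customer_id"),
    (F.rand() * 100).alias("amount"),
    F.lit("x" * 200).alias("payload"),   # a wide column the analysis never uses
)

# Row-oriented: every query has to read whole rows, payload included
df.write.mode("overwrite").option("header", "true").csv("/tmp/orders_csv")

# Columnar: a query touching only customer_id and amount skips the payload column
df.write.mode("overwrite").parquet("/tmp/orders_parquet")

spark.read.parquet("/tmp/orders_parquet") \
     .groupBy("customer_id").agg(F.sum("amount").alias("total")).show(5)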

Common big data design patterns

Design patterns provide a common vocabulary for data personas to understand and share design and architecture blueprints. So, given a set of requirements, everyone understands the likely design pattern to apply as a plausible solution. Traditional software engineering design patterns are object-oriented and fall into three categories: creational, structural, and behavioral. Data engineering is not necessarily object-oriented (OO), and its patterns are better articulated around the concepts of data ingestion, transformation, storage, and analytics. In the next few sections, we will look at reusable patterns in each of these areas.

Ingestion

Ingestion refers to all aspects of consolidating data from multiple sources into a target site for further processing and analysis. The sources may use different languages and file formats and arrive in different sizes and at different frequencies. The number of such combinations is large, which is why we see a lot of data...
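As one common ingestion pattern, the sketch below continuously lands JSON files from an upstream producer into a Delta table using Structured Streaming. The schema, paths, and table name are hypothetical, and it assumes a Spark session with Delta Lake configured.

# A streaming ingestion pattern: pick up new JSON files as they arrive and append
# them to a Delta table, tracking progress in a checkpoint location.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_ts", TimestampType()),
])

# Source: a landing folder where upstream producers drop JSON files
stream = (
    spark.readStream
         .schema(schema)          # streaming file sources require an explicit schema
         .json("/landing/iot/")
)

# Sink: append into a Delta table; the checkpoint makes the stream restartable
query = (
    stream.writeStream
          .format("delta")
          .option("checkpointLocation", "/checkpoints/iot_bronze")
          .outputMode("append")
          .toTable("iot_bronze")
)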

Summary

In this chapter, we talked about the importance of a data modeling exercise to organize and persist the data when designing a new ETL use case, so that subsequent data operations can benefit from an optimal balance of performance, cost, efficiency, and quality.

A good data model gives us faster query speeds and reduces the unnecessary I/O caused by expensive wasted scans. A design-first approach forces us to think through the data relationships and not only helps reduce data redundancy but also improves the reuse of pre-computed results, thereby reducing storage and compute costs for big data platforms. This increase in the efficiency of data utilization improves the overall user experience. Having stable base datasets ensures more consistency in the derived datasets further down the pipeline, thereby improving the quality of generated insights.

In the next chapter, we will look at the Delta protocol and the main features that help bring reliability, performance...
