Chapter 2: Data Modeling and ETL

"Poor programmers care about code, and good programmers care about the data structure and the relationships between data."

— Linus Torvalds, the creator of Linux, on the importance of data modeling

In the previous chapter, we introduced the big data ecosystem, the use cases across different industry verticals that consume this data, the common challenges they all face, and their journey toward digitization. We also looked at the trends in compute and storage technologies, along with cloud adoption, that are paving the way for companies to become more data-driven.

Data platforms are continuously evolving to support business analytics use cases, and speed to insight is critical for a business to remain relevant and competitive. Both BI and AI leverage curated data to produce sound insights, but getting to curated data requires some discipline around data layout, modeling, and governance. In this chapter, we will look at ways...

Technical requirements

To follow along with this chapter, make sure you have the code and instructions detailed in this GitHub location: https://github.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/tree/main/Chapter02

Let’s get started!

What is data modeling and why should you care?

When you design software, you start architecting with a paper design of the various components and how they interact. The same is true of big data systems. To get the most value out of data, you need to understand its intrinsic properties and inherent relationships. Data modeling is the process of organizing, representing, and visualizing your data so that it fits the needs of the business processes. There are several functional and nonfunctional requirements around data, technology, and business that should be taken into consideration. Business operational processes and the structure of the data generated by those operations are inputs to the data model. Let's look at some of the advantages of going through data modeling before rushing to implement a data solution.
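As a small illustration of what the output of a modeling exercise can look like, the following sketch expresses a simple star schema as two Delta tables. The table and column names are hypothetical (not taken from the book's examples) and assume a Spark session with Delta Lake configured; the point is only that relationships, here a fact table referencing a dimension by key, are decided up front rather than discovered later.

# A minimal sketch of a dimensional (star-schema) model expressed as Delta tables.
# Table and column names are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Dimension table: one row per customer, describing who they are
spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_id BIGINT,
        name        STRING,
        segment     STRING
    ) USING DELTA
""")

# Fact table: one row per order event, referencing the dimension by its key
spark.sql("""
    CREATE TABLE IF NOT EXISTS fact_orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_ts    TIMESTAMP,
        amount      DECIMAL(10, 2)
    ) USING DELTA
""")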

Advantages of a data modeling exercise

At its core, data modeling is about persisting data and retrieving it in an optimal way. The following lists some...

Understanding metadata – data about data

Metadata is data about data, and it is an important governance aspect exposed through data catalogs. The data value use case centers on the ability to identify key data assets and assess their economic importance to the organization. Let's examine the different aspects of metadata.

Data catalog

A catalog is a tool that houses the metadata and provides the tooling for search and discoverability. Catalogs are often confused with data dictionaries, which are just data artifacts and do not necessarily have the associated tooling to facilitate data search and retrieval.

There are several vendors in this space, and some of the popular ones include Collibra, Alation, and AWS Glue. The data discovery use case is probably the most valuable, as it helps users (data engineers, data analysts, and data scientists) search for, find, and understand data.
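As a minimal example of the kind of metadata a catalog surfaces, the sketch below inspects table-level and column-level metadata through Spark's built-in catalog. It assumes a Spark session with Delta configured and an existing table named fact_orders (the hypothetical table from the earlier sketch).

# Inspecting metadata through Spark's built-in catalog; the table name is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Table-level metadata: which tables the catalog knows about in the current database
for table in spark.catalog.listTables():
    print(table.name, table.tableType, table.isTemporary)

# Column-level metadata: names, types, comments, and table properties
spark.sql("DESCRIBE EXTENDED fact_orders").show(truncate=False)

# For Delta tables, DESCRIBE DETAIL adds format, location, size, and file counts
spark.sql("DESCRIBE DETAIL fact_orders").show(truncate=False)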

Data governance is another important capability, where data lineage is documented in a central place and...

Moving and transforming data using ETL

A data pipeline is an artifact of a data engineering process. It transforms raw data into data that is ready for analytics, which in turn helps solve problems, support decisions, and make our lives more convenient. In some ways, it can be thought of as the stitch between OLTP and OLAP systems. Data pipelines are sometimes referred to as ETL, which stands for extract, transform, load; a common variation is extract, load, transform (ELT). The main difference between the two is whether the incoming data is transformed before it is loaded into the target, or first landed as-is and then transformed in place (data wrangling). Either way, the processing is loosely referred to as ETL, although it is fair to say that ELT is more relevant in the context of data lakes and unstructured data, whereas ETL is typically used for data warehouses. The following diagram shows how ETL bridges the gap between OLTP and OLAP systems:

Figure 2.9 – ETL stitches OLTP and OLAP systems

Data pipelines include...
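To make the ETL flavor concrete, here is a minimal PySpark sketch of an extract-transform-load flow into Delta. The paths, column names, and table names are hypothetical, and it assumes a Spark session with Delta Lake configured.

# Extract raw files, transform them, and load the result into a Delta table.
# Paths and names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Extract: read raw CSV files landed by an upstream (OLTP-side) system
raw = spark.read.option("header", "true").csv("/landing/orders/")

# Transform: fix types, derive a date column, and drop malformed rows
clean = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .withColumn("order_date", F.to_date("order_ts"))
       .dropna(subset=["order_id", "amount"])
)

# Load: write curated data to a Delta table for downstream BI/AI consumers
clean.write.format("delta").mode("append").saveAsTable("fact_orders")

# An ELT variant would land the raw data first and transform it in place afterward:
# raw.write.format("delta").mode("append").saveAsTable("raw_orders")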

How to choose the right data format

Not all tools support all data formats. Every tool reads data off disk in blocks (KB/MB/GB in size), so minimizing these fetches helps improve the speed of access to data. Conversely, a single read for a single record brings back much more data than you may want, so caching it may help with subsequent queries. Different systems have different default block sizes. To choose the right data format, you need to consider several factors, such as the following (a short sketch after this list illustrates the column-pruning point):

  • What is the optimal tradeoff between cost, performance, and throughput considerations of ingestion and access patterns?
  • Are you constrained by storage or memory or CPU or I/O?
  • How large is a file? If your data is not splittable, you lose the parallelism that allows fast queries.
  • How many columns are being stored, and how many columns are used for the analysis?
  • Does your data change over time? If it does, how often does it happen, and how does it change?...
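The following sketch contrasts a row-oriented format with a columnar one for an analysis that touches only a couple of columns; the columnar read can skip the wide payload column entirely. The paths and column names are made up for illustration.

# Write the same data as CSV (row-oriented) and Parquet (columnar), then run a
# narrow aggregation. Only the columnar format lets the scan prune unused columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).select(
    F.col("id"),
    (F.col("id") % 100).alias("customer_id"),
    (F.rand() * 100).alias("amount"),
    F.lit("x" * 200).alias("payload"),   # a wide column the analysis never uses
)

# Row-oriented: every query has to read whole rows, payload included
df.write.mode("overwrite").option("header", "true").csv("/tmp/orders_csv")

# Columnar: a query touching only customer_id and amount skips the payload column
df.write.mode("overwrite").parquet("/tmp/orders_parquet")

spark.read.parquet("/tmp/orders_parquet") \
     .groupBy("customer_id").agg(F.sum("amount").alias("total")).show(5)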

Common big data design patterns

Design patterns provide a common vocabulary for data personas to understand and share design and architecture blueprints. So, given a set of requirements, everyone understands the likely design pattern to apply as a plausible solution. Traditional software engineering design patterns are object-oriented and fall into three categories: creational, structural, and behavioral. Data engineering is not necessarily object-oriented (OO), and its patterns are better articulated around the concepts of data ingestion, transformation, storage, and analytics. In the next few sections, we will look at reusable patterns in each of these areas.

Ingestion

Ingestion refers to all aspects of consolidating data from multiple sources into a target site for further processing and analysis. The sources may use different languages and file formats and arrive in different sizes and at different frequencies. The number of such combinations is large, which is why we see a lot of data...
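As one common ingestion pattern, the sketch below continuously lands JSON files from an upstream producer into a Delta table using Structured Streaming. The schema, paths, and table name are hypothetical, and it assumes a Spark session with Delta Lake configured.

# A streaming ingestion pattern: pick up new JSON files as they arrive and append
# them to a Delta table, tracking progress in a checkpoint location.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_ts", TimestampType()),
])

# Source: a landing folder where upstream producers drop JSON files
stream = (
    spark.readStream
         .schema(schema)          # streaming file sources require an explicit schema
         .json("/landing/iot/")
)

# Sink: append into a Delta table; the checkpoint makes the stream restartable
query = (
    stream.writeStream
          .format("delta")
          .option("checkpointLocation", "/checkpoints/iot_bronze")
          .outputMode("append")
          .toTable("iot_bronze")
)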

Summary

In this chapter, we talked about the importance of a data modeling exercise to organize and persist the data when designing a new ETL use case, so that subsequent data operations can benefit from an optimal balance of performance, cost, efficiency, and quality.

A good data model gives us faster query speeds and reduces the unnecessary I/O caused by expensive wasted scans. A design-first approach forces us to think through the data relationships and not only helps reduce data redundancy but also improves the reuse of pre-computed results, thereby reducing storage and compute costs for big data platforms. This increase in the efficiency of data utilization improves the overall user experience. Having stable base datasets ensures more consistency in the derived datasets further down the pipeline, thereby improving the quality of generated insights.

In the next chapter, we will look at the Delta protocol and the main features that help bring reliability, performance...
