You're reading from Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Product type Book

Published in Oct 2021

Publisher Packt

ISBN-13 9781801077743

Pages 480 pages

Edition 1st Edition

Concepts

Data Processing

Author (1):

Manoj Kukreja

Table of Contents (17) Chapters

Preface

Section 1: Modern Data Engineering and Tools

Chapter 1: The Story of Data Engineering and Analytics

Chapter 2: Discovering Storage and Compute Data Lakes

Chapter 3: Data Engineering on Microsoft Azure

Section 2: Data Pipelines and Stages of Data Engineering

Chapter 4: Understanding Data Pipelines

Chapter 5: Data Collection Stage – The Bronze Layer

Chapter 6: Understanding Delta Lake

Chapter 7: Data Curation Stage – The Silver Layer

Chapter 8: Data Aggregation Stage – The Gold Layer

Section 3: Data Engineering Challenges and Effective Deployment Strategies

Chapter 9: Deploying and Monitoring Pipelines in Production

Chapter 10: Solving Data Engineering Challenges

Chapter 11: Infrastructure Provisioning

Chapter 12: Continuous Integration and Deployment (CI/CD) of Data Pipelines

Other Books You May Enjoy

Understanding data consumption

Before we start verifying the aggregated data, we should look at how our end users will consume data for dashboarding, ML, and AI purposes. Following the architecture laid out for the Electroniz lakehouse, we decided to publish data from both the gold and silver layers.

Publishing data from the gold layer is necessary; otherwise, how would users be able to access aggregated data? But why do we need to publish data from the silver layer? You guessed it – analytics is an ongoing process. In the future, users may want to create new dashboards and ML models to benefit the company, and they will need curated data to build them from.

Important

Publishing data from the gold and silver layers is acceptable because they store data that is in a clean and secure state. The same cannot be said for data in the bronze layer. Publishing raw/unclean data not only pushes the work of standardization, validation, and deduplication onto end users, but it also ends up exposing...
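To give a rough sense of the work that would fall on end users if raw data were published, here is a minimal plain-Python sketch of the three tasks named above: standardization, validation, and deduplication. It is not the Electroniz pipeline. The records, the field names (`order_id`, `email`, `amount`), and the validation rule are all hypothetical and chosen only for illustration.

```python
# Hypothetical raw (bronze-like) records; field names are illustrative only.
raw_rows = [
    {"order_id": "1001", "email": " Alice@Example.com ", "amount": "25.50"},
    {"order_id": "1001", "email": "alice@example.com", "amount": "25.50"},  # duplicate order
    {"order_id": "1002", "email": "BOB@EXAMPLE.COM", "amount": "-4.00"},    # fails validation
    {"order_id": "1003", "email": "carol@example.com", "amount": "12.00"},
]


def standardize(row):
    """Trim whitespace, lowercase the email, and cast the amount to a float."""
    return {
        "order_id": row["order_id"].strip(),
        "email": row["email"].strip().lower(),
        "amount": float(row["amount"]),
    }


def is_valid(row):
    """Assumed rule for this sketch: an amount must not be negative."""
    return row["amount"] >= 0


def deduplicate(rows):
    """Keep only the first row seen for each order_id."""
    seen = set()
    unique = []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            unique.append(row)
    return unique


def curate(rows):
    """Apply standardization, then validation, then deduplication."""
    standardized = [standardize(r) for r in rows]
    valid = [r for r in standardized if is_valid(r)]
    return deduplicate(valid)


clean_rows = curate(raw_rows)
```

Every consumer of raw data would have to repeat some version of these steps, likely with slightly different rules, which is one reason to publish only layers where this work has already been done once.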
