You're reading from Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Product type Book

Published in Oct 2021

Publisher Packt

ISBN-13 9781801077743

Pages 480 pages

Edition 1st Edition

Concepts

Data Processing

Author (1):

Manoj Kukreja

Table of Contents (17) Chapters

Preface

Section 1: Modern Data Engineering and Tools

Chapter 1: The Story of Data Engineering and Analytics

Chapter 2: Discovering Storage and Compute Data Lakes

Chapter 3: Data Engineering on Microsoft Azure

Section 2: Data Pipelines and Stages of Data Engineering

Chapter 4: Understanding Data Pipelines

Chapter 5: Data Collection Stage – The Bronze Layer

Chapter 6: Understanding Delta Lake

Chapter 7: Data Curation Stage – The Silver Layer

Chapter 8: Data Aggregation Stage – The Gold Layer

Section 3: Data Engineering Challenges and Effective Deployment Strategies

Chapter 9: Deploying and Monitoring Pipelines in Production

Chapter 10: Solving Data Engineering Challenges

Chapter 11: Infrastructure Provisioning

Chapter 12: Continuous Integration and Deployment (CI/CD) of Data Pipelines

Other Books You May Enjoy

What this book covers

Chapter 1, The Story of Data Engineering and Analytics, introduces the core concepts of data engineering and the two data processing architectures used in big data: Lambda and Kappa.

Chapter 2, Discovering Storage and Compute Data Lake Architectures, introduces one of the most important concepts in data engineering: segregating the storage and compute layers. Building on this principle, the chapter introduces the idea of building data lakes. Understanding this principle lays the foundation for the modern-day data lake design patterns discussed later in the book.

Chapter 3, Data Engineering on Microsoft Azure, introduces the world of data engineering on the Microsoft Azure cloud platform. It will familiarize you with all the Azure tools and services that play a major role in the Azure data engineering ecosystem. These tools and services will be used throughout the book for all practical examples.

Chapter 4, Understanding Data Pipelines, introduces you to the idea of data pipelines. This chapter deepens your knowledge of the various stages of data engineering and shows how data pipelines improve efficiency by integrating individual components and running them in a streamlined fashion.

Chapter 5, Data Collection Stage – The Bronze Layer, guides us in building a data lake using the Lakehouse architecture. We will start with data collection and the development of the bronze layer.

Chapter 6, Understanding Delta Lake, introduces Delta Lake and helps you quickly explore its main features. Understanding these features is an essential skill for a data engineering professional who would like to build data lakes with data freshness, fast performance, and governance in mind. We will also discuss the Lakehouse architecture in detail.

Chapter 7, Data Curation Stage – The Silver Layer, continues our building of a data lake. The focus of this chapter will be on data cleansing, standardization, and building the silver layer using Delta Lake.

Chapter 8, Data Aggregation Stage – The Gold Layer, continues our building of a data lake. The focus of this chapter will be on data aggregation and building the gold layer.

Chapter 9, Deploying and Monitoring Pipelines in Production, explains how to effectively manage data pipelines running in production. We will explore data pipeline management from an operational perspective and cover security, performance management, and monitoring.

Chapter 10, Solving Data Engineering Challenges, lists the major challenges experienced by data engineering professionals. The chapter covers various use cases, each of which presents a challenge. We will dive deep into handling each challenge effectively, explaining its resolution using code snippets and examples.

Chapter 11, Infrastructure Provisioning, teaches you the basics of infrastructure provisioning using Terraform. Using Terraform, we will provision the cloud resources on Microsoft Azure that are required for running a data pipeline.

Chapter 12, Continuous Integration and Deployment of Data Pipelines, introduces the idea of continuous integration and deployment (CI/CD) of data pipelines. Using the principles of CI/CD, data engineering professionals can rapidly deploy new data pipelines, and changes to existing ones, in a repeatable fashion.
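
To make the bronze, silver, and gold layers from Chapters 5, 7, and 8 a little more concrete, here is a minimal sketch of the idea in plain Python. It is not the book's code and does not use Spark or Delta Lake; the sample records, the field names (store, amount), and the helper functions to_silver and to_gold are all hypothetical, chosen only to show raw data being collected, then cleansed and standardized, then aggregated.

```python
from collections import defaultdict

# Bronze layer: raw records kept exactly as collected (hypothetical sample data).
bronze = [
    {"store": " Toronto ", "amount": "120.50"},
    {"store": "toronto", "amount": "79.50"},
    {"store": "Ottawa", "amount": "n/a"},  # malformed amount
    {"store": "OTTAWA", "amount": "40"},
]


def to_silver(records):
    """Silver layer: cleanse and standardize the raw records."""
    silver = []
    for rec in records:
        try:
            amount = float(rec["amount"])
        except ValueError:
            # Drop rows whose amount cannot be parsed (an assumed cleansing rule).
            continue
        # Standardize store names: trim whitespace and use title case.
        silver.append({"store": rec["store"].strip().title(), "amount": amount})
    return silver


def to_gold(records):
    """Gold layer: aggregate the curated records into totals per store."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["store"]] += rec["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'Toronto': 200.0, 'Ottawa': 40.0}
```

In the book's setting, each layer would be a stored table rather than an in-memory list, but the flow from raw to curated to aggregated data is the same.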

To get the most out of this book

You will need a Microsoft Azure account.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Be sure to stop or delete any Azure resources you created once you have finished running your code, so that your costs are minimized.