Data Engineering with Apache Spark, Delta Lake, and Lakehouse


Product type: Book
Published: Oct 2021
Publisher: Packt
ISBN-13: 9781801077743
Pages: 480
Edition: 1st Edition
Author: Manoj Kukreja

Table of Contents (17 chapters)

Preface
Section 1: Modern Data Engineering and Tools
Chapter 1: The Story of Data Engineering and Analytics
Chapter 2: Discovering Storage and Compute Data Lakes
Chapter 3: Data Engineering on Microsoft Azure
Section 2: Data Pipelines and Stages of Data Engineering
Chapter 4: Understanding Data Pipelines
Chapter 5: Data Collection Stage – The Bronze Layer
Chapter 6: Understanding Delta Lake
Chapter 7: Data Curation Stage – The Silver Layer
Chapter 8: Data Aggregation Stage – The Gold Layer
Section 3: Data Engineering Challenges and Effective Deployment Strategies
Chapter 9: Deploying and Monitoring Pipelines in Production
Chapter 10: Solving Data Engineering Challenges
Chapter 11: Infrastructure Provisioning
Chapter 12: Continuous Integration and Deployment (CI/CD) of Data Pipelines
Other Books You May Enjoy

What this book covers

Chapter 1, The Story of Data Engineering and Analytics, introduces the core concepts of data engineering and the two data processing architectures in big data: Lambda and Kappa.

Chapter 2, Discovering Storage and Compute Data Lake Architectures, introduces one of the most important concepts in data engineering – segregating storage and compute layers. By following this principle, you will be introduced to the idea of building data lakes. An understanding of this key principle will lay the foundation for your understanding of the modern-day data lake design patterns discussed later in the book.

Chapter 3, Data Engineering on Microsoft Azure, introduces the world of data engineering on the Microsoft Azure cloud platform. It will familiarize you with all the Azure tools and services that play a major role in the Azure data engineering ecosystem. These tools and services will be used throughout the book for all practical examples.

Chapter 4, Understanding Data Pipelines, introduces you to the idea of data pipelines. This chapter deepens your knowledge of the various stages of data engineering and shows how data pipelines improve efficiency by integrating individual components and running them in a streamlined fashion.

Chapter 5, Data Collection Stage – The Bronze Layer, guides us in building a data lake using the Lakehouse architecture. We will start with data collection and the development of the bronze layer.

Chapter 6, Understanding Delta Lake, introduces Delta Lake and helps you quickly explore its main features. Understanding these features is an integral skill for a data engineering professional who would like to build data lakes with data freshness, fast performance, and governance in mind. We will also be talking about the Lakehouse architecture in detail.

Chapter 7, Data Curation Stage – The Silver Layer, continues our building of a data lake. The focus of this chapter will be on data cleansing, standardization, and building the silver layer using Delta Lake.

Chapter 8, Data Aggregation Stage – The Gold Layer, continues our building of a data lake. The focus of this chapter will be on data aggregation and building the gold layer.

Chapter 9, Deploying and Monitoring Pipelines in Production, explains how to effectively manage data pipelines running in production. We will explore data pipeline management from an operational perspective and cover security, performance management, and monitoring.

Chapter 10, Solving Data Engineering Challenges, lists the major challenges experienced by data engineering professionals. Various use cases will be covered in this chapter, and a challenge will be posed. We will dive deep into handling the challenge effectively, explaining its resolution using code snippets and examples.

Chapter 11, Infrastructure Provisioning, teaches you the basics of infrastructure provisioning using Terraform. Using Terraform, we will provision the cloud resources on Microsoft Azure that are required for running a data pipeline.

Chapter 12, Continuous Integration and Deployment of Data Pipelines, introduces the idea of continuous integration and deployment (CI/CD) of data pipelines. Using the principles of CI/CD, data engineering professionals can rapidly deploy new data pipelines, or changes to existing ones, in a repeatable fashion.

To get the most out of this book

You will need a Microsoft Azure account.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Do ensure that you close all instances of Azure after you have run your code, so that your costs are minimized.
