
You're reading from Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Product type: Book
Published in: Oct 2021
Publisher: Packt
ISBN-13: 9781801077743
Edition: 1st Edition
Author: Manoj Kukreja

Manoj Kukreja is a Principal Architect at Northbay Solutions who specializes in creating complex Data Lakes and Data Analytics Pipelines for large-scale organizations such as banks, insurance companies, universities, and US/Canadian government agencies. Previously, he worked for Pythian, a large managed service provider, where he led the MySQL and MongoDB DBA group and supported large-scale data infrastructure for enterprises across the globe. With over 25 years of IT experience, he has delivered Data Lake solutions using all major cloud providers, including AWS, Azure, GCP, and Alibaba Cloud. On weekends, he trains groups of aspiring Data Engineers and Data Scientists on Hadoop, Spark, Kafka, and Data Analytics on AWS and Azure Cloud.


Chapter 10: Solving Data Engineering Challenges

In the past few chapters, we learned about the data lakehouse architecture and, through several exercises, how a data engineer builds and deploys the bronze, silver, and gold layers of the lakehouse. Data in the lakehouse grows and changes over time: new data sources are added and existing ones are modified, and the data engineering practice needs to keep up with this growth. Just like anything else in the industry, the role of the data engineer needs to evolve as well. In addition to building and deploying data pipelines, data engineers must deal with several other complicated aspects of the job that we have not addressed so far. They must learn to handle these new challenges.

In this chapter, we will cover the following topics:

  • Schema evolution
  • Sharing data
  • Data governance

Schema evolution

Schema evolution is the technique of adapting to ongoing structural changes in data. As systems mature and add more functionality, schema evolution is inevitable. Therefore, adapting to schema evolution is an extremely important requirement of modern-day pipelines.

Pipelines are customarily developed against the base schemas of tables as they exist at the start of a project. By the time things move into production, however, there is a very high likelihood that the schema of some incoming file or table has changed. But why is this such a big problem?

Important

A data engineer should never make the mistake of assuming that the schema of incoming data will never change. Instead, prepare the pipelines so that they auto-adjust to this evolution.

Let's discuss an example scenario to illustrate this point. Let's assume your pipelines have been deployed in production and that, for a while, you have been ingesting...
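Delta Lake handles this situation through its `mergeSchema` write option, which lets an append add new columns instead of failing the schema check. Independent of Spark, the underlying merge rule can be sketched in plain Python; the helper names below are illustrative, not part of any library:

```python
def merge_schemas(current, incoming):
    """Union of two schemas (column name -> type); existing columns keep their type."""
    merged = dict(current)
    for col, dtype in incoming.items():
        if col not in merged:
            merged[col] = dtype  # a brand-new column is appended to the schema
    return merged

def align_record(record, schema):
    """Project a record onto the evolved schema, filling missing columns with None (null)."""
    return {col: record.get(col) for col in schema}

# Existing table schema, and a new batch that adds a 'country' column
table_schema = {"id": "int", "amount": "double"}
batch_schema = {"id": "int", "amount": "double", "country": "string"}

evolved = merge_schemas(table_schema, batch_schema)
# Old rows remain readable: the new column simply comes back as null
old_row = align_record({"id": 1, "amount": 9.5}, evolved)
```

This is the same behavior you get from a Delta write such as `df.write.format("delta").mode("append").option("mergeSchema", "true")`: new columns are appended, and historical rows return null for them.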

Sharing data

In Chapter 1, The Story of Data Engineering and Analytics, we discussed the power of data, which has enabled organizations to diversify revenue through data monetization. But this dream cannot be realized effectively without sharing data with external parties. In the past, organizations have used several data-sharing mechanisms, such as emails, SFTP, APIs, cloud storage, and hard drives:

Figure 10.23 – The current state of data sharing

Unfortunately, there are several problems related to these data-sharing methods:

  • Complex: These data-sharing mechanisms can be complex to set up and use because they may require exchanging keys or passwords and working with a variety of different tools.
  • Insecure: These mechanisms may not secure data at rest or in transit, meaning a classic man-in-the-middle attack could expose data in cleartext.
  • Tracking: There is no clear method available for effectively tracking who shared data...
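One partial mitigation for the integrity side of the second problem (though not for eavesdropping or the tracking gap) is to publish a content digest alongside the shared file so the receiver can detect in-transit tampering. This generic sketch is not from the book; it simply illustrates the idea with Python's standard `hashlib`:

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Hex digest of the payload; sender publishes this next to the file."""
    return hashlib.sha256(data).hexdigest()

# Sender side: compute and publish the digest with the shared file
payload = b"customer_id,amount\n1001,250.00\n"
published_digest = sha256_of(payload)

# Receiver side: recompute the digest over whatever arrived via SFTP/email/etc.
received = payload
assert sha256_of(received) == published_digest  # tamper check passes
```

A digest only proves the bytes were not altered; it does nothing for confidentiality or for auditing who accessed the data, which is why purpose-built sharing services exist.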

Data governance

I started this book by stating "Every byte of data has a story to tell. The real question is whether the story is being narrated accurately, securely, and efficiently." While organizations are busy harnessing the true power of data, data governance and security frequently get neglected. But they cannot be neglected for long. Regulations such as GDPR and many others are enforcing legal accountability and strict penalties on organizations that fail to meet governance policies related to data privacy, retention, and portability.

An effective path to data governance often starts with an effective method for data discovery, classification, and lineage tracking. This can be a daunting task, considering the vast variety, volume, and velocity of data. Azure Purview is a new, unified governance service that automates tasks such as discovery, classification, and lineage.
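To make the classification task concrete: services like Purview scan values and apply sensitivity labels when patterns match. The toy rule-based classifier below is purely illustrative, assuming two hypothetical rules; it is not how Purview's actual classifiers are implemented:

```python
import re

# Hypothetical, simplified classification rules (not Purview's real classifiers)
RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify(value: str) -> set:
    """Return the set of sensitivity labels whose pattern matches the value."""
    return {label for label, pattern in RULES.items() if pattern.search(value)}

labels = classify("contact: jane.doe@example.com")
```

A real scanner runs rules like these (plus dictionary lookups and confidence scoring) across sampled rows of every registered data source, then records the resulting labels in the catalog.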

Using Azure Purview, the registered data sources can be automatically discovered once or...

Cleaning up Azure resources

To save on costs, you may want to clean up the following Azure resources:

  1. Delete the Azure Purview account; that is, trainingcatalog.
  2. Delete the Azure Data Share account; that is, trainingshare.

Now, let's summarize this chapter.

Summary

This was an extremely important chapter for several reasons. Using a few examples, we learned about the common challenges that are faced in the world of data engineering. Dealing with challenges such as schema evolution and data governance is critical in modern data engineering projects. These challenges have evolved over time, and so have the techniques and tools that make the job of the data engineer a little easier.

In the next chapter, we will look at DevOps essentials for data engineers. DevOps is quickly becoming an essential add-on skill for data engineers, so we will learn how to automatically provision cloud resources using the Infrastructure as Code (IaC) paradigm. Using the power of IaC, data engineers can spin up data pipelines that use cloud resources in a fast, repeatable, and secure fashion.

