
You're reading from Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Product type: Book
Published in: Oct 2021
Publisher: Packt
ISBN-13: 9781801077743
Edition: 1st Edition
Author: Manoj Kukreja

Manoj Kukreja is a Principal Architect at Northbay Solutions who specializes in creating complex Data Lakes and Data Analytics Pipelines for large-scale organizations such as banks, insurance companies, universities, and US/Canadian government agencies. Previously, he worked for Pythian, a large managed service provider, where he led the MySQL and MongoDB DBA group and supported large-scale data infrastructure for enterprises across the globe. With over 25 years of IT experience, he has delivered Data Lake solutions using all major cloud providers, including AWS, Azure, GCP, and Alibaba Cloud. On weekends, he trains groups of aspiring Data Engineers and Data Scientists on Hadoop, Spark, Kafka, and Data Analytics on AWS and Azure Cloud.


Chapter 10: Solving Data Engineering Challenges

In the past few chapters, we learned about the data lakehouse architecture and, through several exercises, how a data engineer builds and deploys the bronze, silver, and gold layers of the lakehouse. Data in the lakehouse grows and changes over time: new data sources are added and existing ones are modified, and the data engineering practice needs to keep up with this growth. Just like anything else in the industry, the role of the data engineer needs to evolve as well. In addition to building and deploying data pipelines, data engineers must deal with several other complicated aspects of the job that we have not addressed so far. They must learn to handle these new challenges.

In this chapter, we will cover the following topics:

  • Schema evolution
  • Sharing data
  • Data governance

Schema evolution

Schema evolution is the technique of adapting to ongoing structural changes in data. As systems mature and add more functionality, schema evolution is inevitable. Therefore, adapting to schema evolution is an extremely important requirement of modern-day pipelines.

Pipelines are customarily developed against the base schemas of tables as they exist at the start of a project. By the time things move into production, however, there is a very high likelihood that the schema of some incoming file or table has changed. But why is this such a big problem?

Important

A data engineer should never make the mistake of assuming that the schema of incoming data will never change. Instead, prepare the pipelines so that they auto-adjust to this evolution.

Let's discuss an example scenario to illustrate this point. Let's assume your pipelines have been deployed in production and that, for a while, you have been ingesting...
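Delta Lake handles this situation through its `mergeSchema` write option, which lets an append add new columns instead of failing the schema check. Independent of Spark, the underlying merge rule can be sketched in plain Python; the helper names below are illustrative, not part of any library:

```python
def merge_schemas(current, incoming):
    """Union of two schemas (column name -> type); existing columns keep their type."""
    merged = dict(current)
    for col, dtype in incoming.items():
        if col not in merged:
            merged[col] = dtype  # a brand-new column is appended to the schema
    return merged

def align_record(record, schema):
    """Project a record onto the evolved schema, filling missing columns with None (null)."""
    return {col: record.get(col) for col in schema}

# Existing table schema, and a new batch that adds a 'country' column
table_schema = {"id": "int", "amount": "double"}
batch_schema = {"id": "int", "amount": "double", "country": "string"}

evolved = merge_schemas(table_schema, batch_schema)
# Old rows remain readable: the new column simply comes back as null
old_row = align_record({"id": 1, "amount": 9.5}, evolved)
```

This is the same behavior you get from a Delta write such as `df.write.format("delta").mode("append").option("mergeSchema", "true")`: new columns are appended, and historical rows return null for them.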

Sharing data

In Chapter 1, The Story of Data Engineering and Analytics, we discussed the power of data, which has enabled organizations to diversify revenue through data monetization. But this dream cannot be realized effectively without sharing data with external parties. In the past, organizations have used several data-sharing mechanisms, such as emails, SFTP, APIs, cloud storage, and hard drives:

Figure 10.23 – The current state of data sharing

Unfortunately, there are several problems related to these data-sharing methods:

  • Complex: These data-sharing mechanisms can be complex to set up and use because they may require exchanging keys or passwords and working with a variety of different tools.
  • Insecure: These mechanisms may not secure data at rest or in transit, meaning a classic man-in-the-middle attack could expose data in cleartext.
  • Tracking: There is no clear method available for effectively tracking who shared data...
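One partial mitigation for the integrity side of the second problem (though not for eavesdropping or the tracking gap) is to publish a content digest alongside the shared file so the receiver can detect in-transit tampering. This generic sketch is not from the book; it simply illustrates the idea with Python's standard `hashlib`:

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Hex digest of the payload; sender publishes this next to the file."""
    return hashlib.sha256(data).hexdigest()

# Sender side: compute and publish the digest with the shared file
payload = b"customer_id,amount\n1001,250.00\n"
published_digest = sha256_of(payload)

# Receiver side: recompute the digest over whatever arrived via SFTP/email/etc.
received = payload
assert sha256_of(received) == published_digest  # tamper check passes
```

A digest only proves the bytes were not altered; it does nothing for confidentiality or for auditing who accessed the data, which is why purpose-built sharing services exist.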

Data governance

I started this book by stating "Every byte of data has a story to tell. The real question is whether the story is being narrated accurately, securely, and efficiently." While organizations are busy harnessing the true power of data, data governance and security frequently get neglected. But they cannot be neglected for long. Regulations such as GDPR and many others are enforcing legal accountability and strict penalties on organizations that fail to meet governance policies related to data privacy, retention, and portability.

An effective path to data governance often starts with an effective method for data discovery, classification, and lineage tracking. This can be a daunting task, considering the vast variety, volume, and velocity of data. Azure Purview is a new, unified governance service that automates tasks such as discovery, classification, and lineage.
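To make the classification task concrete: services like Purview scan values and apply sensitivity labels when patterns match. The toy rule-based classifier below is purely illustrative, assuming two hypothetical rules; it is not how Purview's actual classifiers are implemented:

```python
import re

# Hypothetical, simplified classification rules (not Purview's real classifiers)
RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify(value: str) -> set:
    """Return the set of sensitivity labels whose pattern matches the value."""
    return {label for label, pattern in RULES.items() if pattern.search(value)}

labels = classify("contact: jane.doe@example.com")
```

A real scanner runs rules like these (plus dictionary lookups and confidence scoring) across sampled rows of every registered data source, then records the resulting labels in the catalog.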

Using Azure Purview, the registered data sources can be automatically discovered once or...

Cleaning up Azure resources

To save on costs, you may want to clean up the following Azure resources:

  1. Delete the Azure Purview account; that is, trainingcatalog.
  2. Delete the Azure Data Share account; that is, trainingshare.

Now, let's summarize this chapter.

Summary

This was an extremely important chapter for several reasons. Using a few examples, we learned about the common challenges that are faced in the world of data engineering. Dealing with challenges such as schema evolution and data governance is critical in modern data engineering projects. These challenges have evolved over time, and so have the techniques and tools that make the job of the data engineer a little easier.

In the next chapter, we will look at DevOps essentials for data engineers. DevOps is quickly becoming an essential add-on skill for data engineers, so we will learn how to automatically provision cloud resources using the Infrastructure as Code (IaC) paradigm. Using the power of IaC, data engineers can spin up data pipelines that use cloud resources in a fast, repeatable, and secure fashion.

