You're reading from Data Engineering with AWS - Second Edition

Product type: Book
Published in: Oct 2023
Publisher: Packt
ISBN-13: 9781804614426
Edition: 2nd Edition
Author: Gareth Eagar

Gareth Eagar has over 25 years of experience in the IT industry, starting in South Africa, working in the United Kingdom for a while, and now based in the USA. Having worked at AWS since 2017, Gareth has broad experience with a variety of AWS services, and deep expertise in building data platforms on AWS. While Gareth currently works as a Solutions Architect, he has also worked in AWS Professional Services, helping architect and implement data platforms for global customers. Gareth frequently speaks on data-related topics.

Data Governance, Security, and Cataloging

Data governance and security are some of the most important topics to cover in a book that is all about data. Having the most efficient data pipelines, the fastest data transformations, and the best data consumption tools is not worth much if the data is not kept secure and governed correctly. Data must also be stored and accessed in a way that complies with local laws, and the data needs to be cataloged so that it is discoverable and useful to the organization.

Sadly, it is not uncommon to read about data breaches and poor data handling by organizations. The consequences can include reputational damage to the organization, as well as potentially massive penalties imposed by the government. And once an organization causes harm to its customers (such as exposing them to potential identity theft through a data breach), it is difficult for the organization to regain their trust.

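It is also not uncommon for organizations to find that they have massive quantities of data, but that they are getting very little value from that data since it is siloed, of poor quality, or users...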

Technical requirements

To complete the hands-on exercises included in this chapter, you will need an AWS account where you have access to a user with administrator privileges (as covered in Chapter 1, An Introduction to Data Engineering).

You can find the code files of this chapter in the GitHub repository using the following link: https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter04

The many different aspects of data governance

Data governance is a wide-ranging topic that covers many different aspects, and a Google search for the term is likely to return many different definitions. At its core, though, data governance is the set of things an organization needs to do to ensure the secure, compliant, and effective use of data, from the time the data is created through to when it is archived or deleted.

This includes processes that make the data discoverable, understandable, and usable, while ensuring that the data is of high quality and is protected and secured. This covers all data within an organization; however, for the purposes of this book, we will only be focusing on data governance for analytic data.

Most organizations consist of many different business units or teams, each of which generates its own data and has its own specific requirements for accessing data from other parts of the organization....

Data security, access and privacy

Failing to adequately protect and secure an organization's data, or failing to comply with relevant governance laws, can end up being a very expensive mistake for an organization.

According to an article on CSO Online titled The biggest data breach fines, penalties, and settlements so far (https://www.csoonline.com/article/3410278/the-biggest-data-breach-fines-penalties-and-settlements-so-far.html), penalties and expenses related to data breaches have cost companies over $4.4 billion (and counting).

For example, Equifax, the credit agency firm, had a data breach in 2017 that exposed the personal and financial information of nearly 150 million people. As a result, Equifax agreed to pay at least $575 million in a settlement with several United States government agencies and U.S. states.

But beyond financial penalties, a data breach can also do incalculable damage to an organization's reputation and brand. Once you lose the trust of your customers...

Data quality, data profiling, and data lineage

In this section, we look at three different, but related, concepts: data quality, data profiling, and data lineage. Each of these aspects of data governance is an important tool for ensuring that the data shared within your organization is of high quality, and that teams across your organization can have confidence when accessing and using the data.

Data quality

Having high-quality data is essential for ensuring that an organization is equipped to make the best data-driven decisions, and to be effective in all activities that are data-driven (such as marketing campaigns). There are many different aspects to measuring data quality, and data quality is important in all phases of the data lifecycle. If data in the source production database is not captured correctly, then when that data is copied over to analytical systems, those systems will have incorrect or missing data. For example, if the source system does not enforce that date...
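To make the idea of a data quality check concrete, the following is a minimal sketch in Python, not an approach taken from the book. It assumes hypothetical customer records with a `signup_date` field that is expected to follow the `YYYY-MM-DD` pattern; the record layout, the field name, and the helper functions are all illustrative assumptions. The check separates rows whose date parses as a real calendar date from rows that would carry bad or missing values into an analytical system.

```python
from datetime import datetime

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"customer_id": 1, "signup_date": "2023-10-05"},
    {"customer_id": 2, "signup_date": "05/10/2023"},  # wrong format
    {"customer_id": 3, "signup_date": None},          # missing value
    {"customer_id": 4, "signup_date": "2023-02-30"},  # not a real date
]


def is_valid_date(value, fmt="%Y-%m-%d"):
    """Return True if value is a string that parses as a real date using fmt."""
    if not isinstance(value, str):
        return False
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


def check_date_quality(rows, field):
    """Split rows into (valid, invalid) lists based on one date field."""
    valid, invalid = [], []
    for row in rows:
        target = valid if is_valid_date(row.get(field)) else invalid
        target.append(row)
    return valid, invalid


valid_rows, invalid_rows = check_date_quality(records, "signup_date")
print(f"{len(valid_rows)} valid, {len(invalid_rows)} invalid")
for row in invalid_rows:
    print("invalid:", row)
```

In a real pipeline, a check like this would more likely be expressed as rules in a dedicated data quality tool than as hand-written code, but the principle is the same: catch invalid values before they reach the data's consumers.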

Business and technical data catalogs

You have probably heard about swamps, even if you have never actually been to one. Generally, swamps are known to be wet areas that smell pretty bad, and where some trees and other vegetation may grow, but the area is generally not fit to be used for most purposes (unless, of course, you're an ogre similar to Shrek, and you make your home in the swamp!).

In contrast to a swamp, when most people think about a lake, they picture beautiful scenery with clean water, a beautiful sunset, and perhaps a few ducks gently floating on the water. Most people would hate to find themselves in a swamp if they thought they were going to visit a beautiful lake.

In the world of data lakes, as a data engineer, you want to provide an experience that is much like the pure and peaceful lake described previously, and you want to avoid your users finding that the lake looks more like a swamp. However, if you're not careful, your data lake can become a data swamp,...
