
You're reading from  Modern Data Architecture on AWS

Product type: Book
Published in: Aug 2023
Publisher: Packt
ISBN-13: 9781801813396
Edition: 1st Edition
Author: Behram Irani

Behram Irani is currently a technology leader with Amazon Web Services (AWS) specializing in data, analytics and AI/ML. He has spent over 18 years in the tech industry helping organizations, from start-ups to large-scale enterprises, modernize their data platforms. In the last 6 years working at AWS, Behram has been a thought leader in the data, analytics and AI/ML space; publishing multiple papers and leading the digital transformation efforts for many organizations across the globe. Behram has completed his Bachelor of Engineering in Computer Science from the University of Pune and has an MBA degree from the University of Florida.

Data Governance

A compass provides a clear sense of direction, guiding travelers through unfamiliar terrain. Similarly, data governance establishes a clear direction for an organization’s data management efforts. It sets goals, defines strategies, and outlines the path toward effective data utilization, ensuring that data initiatives align with the organization’s overall objectives. However, the term data governance can also become nebulous and cause friction in many organizations. Sometimes, data leaders throw the kitchen sink at it without breaking governance down into its individual components and understanding how each one is relevant to the organization. A clear data strategy needs to be in place, and that strategy needs to align with the business goals.

As we will uncover in this chapter, data governance is a very broad topic; trying to implement every component of governance at once, without clearly mapping value to the business, will lead to a data platform that...

What is data governance?

Data governance refers to the overall management and implementation of policies, procedures, and standards for ensuring the high quality, integrity, and security of data in an organization.

Data is a business asset and is owned by specific lines of business (LOBs) in an organization. The process of improving data is continuous, as new data is constantly generated and used by businesses for their daily operations. The improvement of data is quantified by data owners, and its impact is linked directly to business outcomes.

Along with the data improvement process, the privacy and security of the data are also key responsibilities of the data owners. If the data is compromised or misused, it can lead to severe consequences for an organization, impacting its operations, reputation, and financial well-being.

The following figure highlights the high-level components of data governance. We will dive into more details as we unfold this topic in this chapter:

...

Data governance on AWS

AWS recognizes data governance as a combination of people, processes, and tools that all work in tandem to ensure that a modern data platform continuously maintains high-quality data securely. Data governance mechanisms are applied throughout the life cycle of the data – right from ingestion until consumption.

The following figure demonstrates the approach of having a governed data platform using a combination of people, processes, and technology:

Figure 14.3 – Data governance operating model on AWS


When the rubber meets the road and you try to apply technology to every type of process you want to govern, you often find it challenging to rely on just a few tools or services. This is because the governance process is broad and diverse, even in terms of technology. Every tool, service, or vendor product focuses on certain components of governance. For example, some tools/vendors focus on data quality, some focus on data...

Data governance using Amazon DataZone

When it comes to making data-driven business decisions, adopting an agile and productive mindset is essential. Rather than relying on a centralized data management platform that provides generalized analytics, organizations are empowering their business units to deliver data products. This approach allows LOBs and organizational units to operate autonomously, taking ownership of their data products from end to end. Meanwhile, the organization as a whole benefits from centralized data discovery, governance, auditing, data privacy, and compliance.

By decentralizing data management and encouraging ownership, organizations can unlock the full potential of their data assets. This shift empowers business units to respond quickly to evolving market needs, drive innovation, and tailor their data products to specific requirements. Simultaneously, the organization maintains oversight through centralized mechanisms that ensure compliance, privacy, and...

Fine-grained access control using AWS Lake Formation

One of the biggest challenges with setting up and operating data lakes on a large scale is to make sure all the data is secure. This challenge arises due to data being all over the place in a data lake, across multiple S3 buckets, and accessible via many cataloged tables. Setting up a unified permission model around who gets access to what portion of the data is not a trivial task. Imagine a very large data lake with thousands of databases and thousands of tables with 10,000 users continuously trying to access the data; to complicate things further, new users are getting onboarded every day and new datasets are constantly getting added to the data lake. Unless there is a robust mechanism to control fine-grained data access across all the datasets, the data lake would become a governance nightmare.

AWS Lake Formation

In a few of the previous chapters, we touched upon AWS Lake Formation as a service that helps in multiple aspects...

Improving data quality using Glue Data Quality

Data quality is one of the most important data governance components, and no organization can ignore it. To be a world-class data-driven organization, the data being used to derive insights needs to yield highly accurate results. However, data analytics platforms collect, process, and consume data from many source systems, each with its own data formats and quality challenges. Therefore, data quality is a high-priority data governance measure that needs to be implemented judiciously.
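To make the idea of a data quality measure concrete, here is a minimal Python sketch of rule-based checks run against a batch of rows inside a pipeline step. It is an illustration only and does not use or reproduce the AWS Glue Data Quality API; the rule helpers (is_complete, is_unique), the evaluate function, and the sample accounts data are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Rule:
    """A named check that takes all rows and returns pass/fail."""
    name: str
    check: Callable[[List[Row]], bool]


def is_complete(column: str) -> Rule:
    # Passes only when no row has a missing (None) value in the column.
    return Rule(
        name=f"is_complete({column})",
        check=lambda rows: all(r.get(column) is not None for r in rows),
    )


def is_unique(column: str) -> Rule:
    # Passes only when every value in the column appears exactly once.
    def check(rows: List[Row]) -> bool:
        values = [r.get(column) for r in rows]
        return len(values) == len(set(values))

    return Rule(name=f"is_unique({column})", check=check)


def evaluate(rows: List[Row], rules: List[Rule]) -> Dict[str, bool]:
    # Run every rule and report the outcome per rule name.
    return {rule.name: rule.check(rows) for rule in rules}


# Hypothetical sample batch with one missing balance and one duplicate ID.
accounts = [
    {"account_id": "A1", "balance": 120.0},
    {"account_id": "A2", "balance": None},
    {"account_id": "A2", "balance": 75.5},
]

results = evaluate(
    accounts,
    [is_complete("account_id"), is_complete("balance"), is_unique("account_id")],
)
```

A pipeline could use a result map like this to decide whether to promote a batch to the next layer of the data lake or to quarantine it for review.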

Glue Data Quality

Recently, AWS introduced another feature in the Glue service that helps with data quality right inside the data pipelines. Let’s discuss AWS Glue Data Quality by bringing up a use case from GreatFin.

Use case for data quality using AWS Glue Data Quality

One of the LOBs for GreatFin has architected a data lake on Amazon S3 and designed all the layers of the data lake. They will bring all the data from all the source...

Sensitive data discovery with Amazon Macie

In the previous section, we saw how AWS Lake Formation helps with access control mechanisms, which is a vital piece of data governance. When certain datasets contain confidential data or sensitive data, you can use Lake Formation to selectively grant access to only certain columns by tagging them accordingly and granting access via those tags.
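The tag-based column access just described can be modeled in a few lines of Python. This is a simplified sketch of the concept, not the Lake Formation API: the COLUMN_TAGS and GRANTS tables, the persona names, and the visible_columns and project helpers are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical tags attached to each column of a customer table.
COLUMN_TAGS: Dict[str, Dict[str, str]] = {
    "customer_id": {"sensitivity": "public"},
    "city": {"sensitivity": "public"},
    "ssn": {"sensitivity": "confidential"},
}

# Hypothetical grants: for each persona, the tag values it may read per tag key.
GRANTS: Dict[str, Dict[str, Set[str]]] = {
    "analyst": {"sensitivity": {"public"}},
    "compliance_officer": {"sensitivity": {"public", "confidential"}},
}


def visible_columns(persona: str) -> List[str]:
    """Return the columns whose tags satisfy every tag key in the persona's grant."""
    allowed = GRANTS.get(persona)
    if not allowed:
        # No grant at all means no access.
        return []
    return [
        column
        for column, tags in COLUMN_TAGS.items()
        if all(tags.get(key) in values for key, values in allowed.items())
    ]


def project(row: Dict[str, str], persona: str) -> Dict[str, str]:
    """Filter a row down to the columns the persona is allowed to see."""
    return {column: row[column] for column in visible_columns(persona) if column in row}


row = {"customer_id": "C-1", "city": "Pune", "ssn": "123-45-6789"}
analyst_view = project(row, "analyst")
```

The design point is that permissions attach to tags rather than to individual columns, so newly added columns become governed as soon as they are tagged, without rewriting per-user grants.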

The big assumption we made was that data stewards of the data lake are already aware of all the confidential data in the data lake, along with its S3 bucket and filename. In a large implementation of a data lake with lots of contributing source systems, finding sensitive data and classifying it accordingly is like finding a needle in a haystack.
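To illustrate why automated discovery helps here, the following Python sketch scans the contents of a few objects for patterns that look like sensitive data and reports which object holds which type. It is a toy, pattern-matching stand-in and not how Amazon Macie is implemented or called; the PATTERNS table, the classify function, and the sample object keys are assumptions made for this example.

```python
import re
from typing import Dict, List

# Hypothetical detectors: a US Social Security number shape and an email shape.
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def classify(objects: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each object key to the sensitive data types found in its text."""
    findings: Dict[str, List[str]] = {}
    for key, text in objects.items():
        hits = [label for label, pattern in PATTERNS.items() if pattern.search(text)]
        if hits:
            findings[key] = hits
    return findings


# Hypothetical object keys and contents standing in for files in a data lake.
sample_objects = {
    "raw/customers.csv": "id,name,ssn\n1,Ana,123-45-6789\n",
    "raw/contacts.csv": "id,email\n1,ana@example.com\n",
    "raw/products.csv": "sku,price\nX1,9.99\n",
}

findings = classify(sample_objects)
```

Findings of this kind give data stewards the bucket and file locations they need before tagging columns and granting access.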

So many use cases require that data assets be classified and tagged accordingly so that accurate permissions can be granted to only the personas who should have access to the data. Doing this also ensures that such sensitive data is tracked as it migrates...

Data collaborations with partners using AWS Clean Rooms

Collaborating on shared datasets while safeguarding the underlying raw data poses a common challenge for companies and their partners. Organizations often encounter data fragmentation across various applications, channels, departments, and partner networks, leading to interoperability and scalability issues. Numerous organizations seek improved methods for managing the collection, storage, and utilization of sensitive raw data while ensuring data privacy.

However, the methods traditionally used to share data with partners can conflict with the objective of data protection. In certain cases, these methods have required companies to share copies of their data with partners and rely on contractual agreements to prevent misuse. Customers, for their part, prefer to minimize data movement to safeguard their information, prevent misuse, and mitigate the risks of data leaks. Consequently, they often opt against...

Data resolution with AWS Entity Resolution

The best way to explain this topic would be to take the example of data at GreatFin, the example company we have been using in this book for use cases. GreatFin has data coming in from multiple LOBs. All LOBs have overlapping customer information. Sometimes, customers update details with one LOB but other LOBs don’t always see that update. This eventually creates a web of conflicting information across the enterprise where a golden version of truth for a customer or any other entity doesn’t exist. This is where inaccuracies arise in the operational systems as well as in the analytical environments. All organizations strive to create a golden or a master copy of their entities.
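The idea of a golden copy can be sketched in Python as follows. This is a deliberately simple, rule-based illustration under stated assumptions, not the AWS Entity Resolution service or its API: records are matched on a normalized email address, and the most recently updated non-empty value wins for each field. The sample records, field names, and the match_key and golden_records helpers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical customer records held by three different LOBs.
records = [
    {"lob": "banking", "email": "Ana.Diaz@Example.com ", "phone": "555-0100",
     "address": "1 Old Rd", "updated": "2023-01-10"},
    {"lob": "insurance", "email": "ana.diaz@example.com", "phone": "",
     "address": "9 New Ave", "updated": "2023-06-02"},
    {"lob": "lending", "email": "raj@example.com", "phone": "555-0199",
     "address": "4 Elm St", "updated": "2023-03-15"},
]


def match_key(record: Dict[str, str]) -> str:
    # Normalize the email so trivially different spellings match.
    return record["email"].strip().lower()


def golden_records(rows: List[Dict[str, str]]) -> Dict[str, Dict[str, object]]:
    """Group rows by match key and merge each group into one golden record."""
    groups: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for row in rows:
        groups[match_key(row)].append(row)

    golden: Dict[str, Dict[str, object]] = {}
    for key, group in groups.items():
        merged: Dict[str, object] = {"email": key}
        # Process oldest first so newer non-empty values overwrite older ones.
        # ISO-8601 date strings sort correctly as plain strings.
        for row in sorted(group, key=lambda r: r["updated"]):
            for field in ("phone", "address"):
                if row[field]:
                    merged[field] = row[field]
        merged["sources"] = sorted(row["lob"] for row in group)
        golden[key] = merged
    return golden


golden = golden_records(records)
```

In this sample, the address updated with one LOB replaces the stale address held by another, while the phone number survives from the older record because the newer one left it blank.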

The following figure highlights the efforts of organizations to create a golden copy of the entity from across multiple sources of data:

Figure 14.41 – Entity resolution process


Let’s introduce the service...

Summary

In this chapter, we looked at a whole range of data governance aspects. First, we laid out what data governance means and why organizations need it to create a world-class modern data platform. We also looked at how AWS views data governance, as defined by a combination of people, processes, and technology. All three aspects need to be aligned for data governance to be effective at an enterprise level.

We also spent quite a bit of effort explaining how a new service, Amazon DataZone, helps refine data governance and helps simplify the whole process across many of the individual analytics services of AWS. DataZone provides a comprehensive way of allowing publishers and subscribers to discover, publish, and subscribe to enterprise-wide data in a distributed manner. This alleviates the burden of creating cumbersome automations and setting up expensive tools to create a self-service analytics platform. In short, Amazon DataZone helps democratize data faster.

After that, we...
