You're reading from Modern Data Architecture on AWS

Product type: Book

Published in: Aug 2023

Publisher: Packt

ISBN-13: 9781801813396

Edition: 1st Edition

Concepts: Data Science

Author (1): Behram Irani

Data Governance

A compass provides a clear sense of direction, guiding travelers through unfamiliar terrain. Similarly, data governance establishes a clear direction for an organization’s data management efforts. It sets goals, defines strategies, and outlines the path toward effective data utilization, ensuring that data initiatives align with the organization’s overall objectives. However, the term data governance can also become nebulous and cause friction in many organizations. Sometimes, data leaders throw the kitchen sink at it without breaking down the individual components of data governance and understanding its relevance to the organization. A clear data strategy needs to be in place, and that strategy needs to align with the business goals.

As we will uncover in this chapter, data governance is a very broad topic; trying to implement every component of governance at once, without clearly mapping value to the business, will lead to a data platform that...

What is data governance?

Data governance refers to the overall management and implementation of policies, procedures, and standards for ensuring the high quality, integrity, and security of data in an organization.

Data is a business asset and is owned by specific lines of business (LOBs) in an organization. Improving data is a continuous process, as new data is constantly generated and used by businesses in their daily operations. Data owners quantify the improvement, and its impact is linked directly to business outcomes.

Along with the data improvement process, the privacy and security of the data are also a key responsibility of the data owners. If data is compromised or misused, the consequences for an organization can be severe, affecting its operations, reputation, and financial well-being.

The following figure highlights the high-level components of data governance. We will dive into more details as we unfold this topic in this chapter:

...

Data governance on AWS

AWS recognizes data governance as a combination of people, processes, and tools that work in tandem to ensure that a modern data platform securely and continuously maintains high-quality data. Data governance mechanisms are applied throughout the life cycle of the data, from ingestion through consumption.

The following figure demonstrates the approach of having a governed data platform using a combination of people, processes, and technology:

Figure 14.3 – Data governance operating model on AWS

When the rubber meets the road and you try to apply technology to every type of process you want to govern, it is often challenging to rely on just a few tools or services. This is because the governance process is broad and diverse, even in terms of technology. Every tool, service, or vendor product focuses on certain components of governance. For example, some tools/vendors focus on data quality, some focus on data...

Data governance using Amazon DataZone

When it comes to making data-driven business decisions, adopting an agile and productive mindset is essential. Rather than relying on a centralized data management platform that provides generalized analytics, organizations are empowering their business units to deliver data products. This approach allows LOBs and organizational units to operate autonomously, taking ownership of their data products from end to end. Meanwhile, the organization as a whole benefits from centralized data discovery, governance, auditing, data privacy, and compliance.

By decentralizing data management and encouraging ownership, organizations can unlock the full potential of their data assets. This shift empowers business units to respond quickly to evolving market needs, drive innovation, and tailor their data products to specific requirements. Simultaneously, the organization maintains oversight through centralized mechanisms that ensure compliance, privacy, and...

Fine-grained access control using AWS Lake Formation

One of the biggest challenges with setting up and operating data lakes on a large scale is to make sure all the data is secure. This challenge arises due to data being all over the place in a data lake, across multiple S3 buckets, and accessible via many cataloged tables. Setting up a unified permission model around who gets access to what portion of the data is not a trivial task. Imagine a very large data lake with thousands of databases and thousands of tables with 10,000 users continuously trying to access the data; to complicate things further, new users are getting onboarded every day and new datasets are constantly getting added to the data lake. Unless there is a robust mechanism to control fine-grained data access across all the datasets, the data lake would become a governance nightmare.

AWS Lake Formation

In a few of the previous chapters, we touched upon AWS Lake Formation as a service that helps in multiple aspects...

Improving data quality using Glue Data Quality

Data quality is one of the most important data governance components, and no organization can ignore it. To be a world-class data-driven organization, the data being used to derive insights needs to yield highly accurate results. However, data analytics platforms collect, process, and consume data from many source systems, each with its own data formats and quality challenges. Therefore, data quality is a high-priority data governance measure that needs to be implemented judiciously.
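To make the idea of a data quality measure concrete, here is a minimal sketch of two common kinds of checks, completeness and uniqueness, written in plain Python. It is an illustration only and does not use or reproduce the AWS Glue Data Quality API; the column names, the sample rows, and the 0.9 threshold are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def completeness(rows: List[Row], column: str) -> float:
    """Return the fraction of rows where `column` has a non-empty value."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(column) not in (None, ""))
    return filled / len(rows)


def is_unique(rows: List[Row], column: str) -> bool:
    """Return True when no non-empty value of `column` appears twice."""
    values = [r.get(column) for r in rows if r.get(column) not in (None, "")]
    return len(values) == len(set(values))


# Hypothetical sample data: one missing email and one duplicated customer_id.
sample: List[Row] = [
    {"customer_id": "C1", "email": "a@example.com"},
    {"customer_id": "C2", "email": None},
    {"customer_id": "C2", "email": "c@example.com"},
    {"customer_id": "C4", "email": "d@example.com"},
]

# Hypothetical rules: each maps a readable name to a pass/fail check.
rules: Dict[str, Callable[[List[Row]], bool]] = {
    "email completeness >= 0.9": lambda rows: completeness(rows, "email") >= 0.9,
    "customer_id is unique": lambda rows: is_unique(rows, "customer_id"),
}

results = {name: check(sample) for name, check in rules.items()}

for name, passed in results.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

Running checks like these inside the pipeline, rather than after the data has landed, is what allows bad records to be caught before they reach consumers.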

Glue Data Quality

Recently, AWS introduced another feature in the Glue service that helps with data quality right inside the data pipelines. Let’s discuss AWS Glue Data Quality by bringing up a use case from GreatFin.

Use case for data quality using AWS Glue Data Quality

One of the LOBs for GreatFin has architected a data lake on Amazon S3 and designed all the layers of the data lake. They will bring all the data from all the source...

Sensitive data discovery with Amazon Macie

In the previous section, we saw how AWS Lake Formation helps with access control mechanisms, which is a vital piece of data governance. When certain datasets contain confidential data or sensitive data, you can use Lake Formation to selectively grant access to only certain columns by tagging them accordingly and granting access via those tags.
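The tag-based idea can be sketched in a few lines of plain Python. This is a simplified in-memory model, not the Lake Formation API: the tag values, column names, and principals below are hypothetical, and a real deployment would define tags and grants through the service itself.

```python
from typing import Dict, List, Set

# Hypothetical classification tag attached to each column of a table.
COLUMN_TAGS: Dict[str, str] = {
    "customer_id": "public",
    "account_balance": "confidential",
    "ssn": "restricted",
}

# Hypothetical grants: each principal is allowed a set of tag values,
# rather than being granted individual columns one by one.
GRANTS: Dict[str, Set[str]] = {
    "analyst": {"public"},
    "risk_officer": {"public", "confidential"},
}


def visible_columns(
    principal: str,
    column_tags: Dict[str, str],
    grants: Dict[str, Set[str]],
) -> List[str]:
    """Return the columns whose tag is covered by the principal's grants."""
    allowed = grants.get(principal, set())
    return [column for column, tag in column_tags.items() if tag in allowed]


for who in ("analyst", "risk_officer", "intern"):
    print(who, visible_columns(who, COLUMN_TAGS, GRANTS))
```

The design point is that permissions follow the tag, so a newly added column only needs the right tag to inherit the correct access rules.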

The big assumption we made was that data stewards of the data lake are already aware of all the confidential data in the data lake, along with its S3 bucket and filename. In a large implementation of a data lake with lots of contributing source systems, finding sensitive data and classifying it accordingly is like finding a needle in a haystack.
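As a rough illustration of what automated discovery involves, the following sketch scans the text content of a few objects for two sensitive-data patterns and reports which objects contain them. It is a toy example built on regular expressions and is not how Amazon Macie is invoked or implemented; the object keys, sample contents, and patterns are assumptions made for the example.

```python
import re
from typing import Dict, List

# Hypothetical detectors: a label mapped to a regular expression.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def classify(objects: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each object key to the sorted labels of the patterns found in it.

    Objects with no findings are left out of the result.
    """
    findings: Dict[str, List[str]] = {}
    for key, text in objects.items():
        hits = sorted(
            label for label, pattern in PATTERNS.items() if pattern.search(text)
        )
        if hits:
            findings[key] = hits
    return findings


# Hypothetical data lake objects, keyed by a made-up S3-style path.
objects = {
    "raw/crm/customers.csv": "id,email\n1,jane@example.com",
    "raw/loans/applicants.csv": "name,ssn\nJ Doe,123-45-6789",
    "raw/ref/branches.csv": "branch,city\n001,Austin",
}

findings = classify(objects)
for key, labels in findings.items():
    print(key, labels)
```

The output of such a scan is exactly the input that tagging needs: a list of locations paired with the kind of sensitive data found there.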

Many use cases require that data assets be classified and tagged accordingly so that accurate permissions can be granted to only the personas who should have access to the data. Doing this also ensures that such sensitive data is tracked as it migrates...

Data collaborations with partners using AWS Clean Rooms

Collaborating on shared datasets while safeguarding the underlying raw data poses a common challenge for companies and their partners. Organizations often encounter data fragmentation across various applications, channels, departments, and partner networks, leading to interoperability and scalability issues. Numerous organizations seek improved methods for managing the collection, storage, and utilization of sensitive raw data while ensuring data privacy.

However, the methods that are traditionally used to utilize data in collaboration with partners can conflict with the objective of data protection. In certain cases, these methods have required companies to share copies of their data with partners and rely on contractual agreements to prevent misuse. Customers, for their part, prefer to minimize data movement to safeguard their information, prevent misuse, and mitigate the risks of data leaks. Consequently, they often opt against...

Data resolution with AWS Entity Resolution

The best way to explain this topic would be to take the example of data at GreatFin, the example company we have been using in this book for use cases. GreatFin has data coming in from multiple LOBs. All LOBs have overlapping customer information. Sometimes, customers update details with one LOB but other LOBs don’t always see that update. This eventually creates a web of conflicting information across the enterprise where a golden version of truth for a customer or any other entity doesn’t exist. This is where inaccuracies arise in the operational systems as well as in the analytical environments. All organizations strive to create a golden or a master copy of their entities.
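The conflict described above can be sketched with a small rule-based example. The code below is a minimal illustration under stated assumptions, not the AWS Entity Resolution API: it assumes that a normalized email address identifies a customer, that each record carries an ISO-formatted `updated` date, and that the most recent non-empty value should win. The field names and sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

Record = Dict[str, str]


def match_key(record: Record) -> str:
    """Normalize the email so that trivially different spellings match."""
    return record["email"].strip().lower()


def golden_records(records: List[Record]) -> Dict[str, Dict[str, object]]:
    """Group records by match key and merge each group into one golden record."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)

    golden: Dict[str, Dict[str, object]] = {}
    for key, recs in groups.items():
        merged: Dict[str, object] = {
            "email": key,
            "sources": sorted(r["lob"] for r in recs),
        }
        # Apply oldest first so newer non-empty values overwrite older ones.
        for rec in sorted(recs, key=lambda r: r["updated"]):
            for field in ("phone", "address"):
                if rec.get(field):
                    merged[field] = rec[field]
        golden[key] = merged
    return golden


# Hypothetical records for the same customer held by two LOBs, plus one other.
records: List[Record] = [
    {"lob": "banking", "email": "Jane.Doe@example.com ", "phone": "555-0100",
     "address": "1 Old St", "updated": "2023-01-10"},
    {"lob": "cards", "email": "jane.doe@example.com", "phone": "",
     "address": "9 New Ave", "updated": "2023-06-02"},
    {"lob": "loans", "email": "sam@example.com", "phone": "555-0199",
     "address": "4 Elm Rd", "updated": "2023-03-15"},
]

golden = golden_records(records)
for key, rec in golden.items():
    print(key, rec)
```

Here the banking and cards records collapse into one customer who keeps the phone number from the older record and the address from the newer one. Real matching is harder, since identifiers are often missing or inconsistent.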

The following figure highlights the efforts of organizations to create a golden copy of the entity from across multiple sources of data:

Figure 14.41 – Entity resolution process

Let’s introduce the service...

Summary

In this chapter, we looked at a whole range of data governance aspects. First, we laid out what data governance means and why organizations need it to create a world-class modern data platform. We also looked at how AWS views data governance, as defined by a combination of people, processes, and technology. All three aspects need to be aligned for data governance to be effective at an enterprise level.

We also spent quite a bit of effort explaining how a new service, Amazon DataZone, helps refine data governance and helps simplify the whole process across many of the individual analytics services of AWS. DataZone provides a comprehensive way of allowing publishers and subscribers to discover, publish, and subscribe to enterprise-wide data in a distributed manner. This alleviates the burden of creating cumbersome automations and setting up expensive tools to create a self-service analytics platform. In short, Amazon DataZone helps democratize data faster.

After that, we...

References

AWS Lake Formation workshop: https://catalog.us-east-1.prod.workshops.aws/workshops/78572df7-d2ee-4f78-b698-7cafdb55135d/en-US
AWS Glue Data Quality blog series: https://aws.amazon.com/blogs/big-data/aws-glue-data-quality-is-generally-available/
Amazon Macie workshop: https://catalog.us-east-1.prod.workshops.aws/workshops/9982e0dc-0ccf-4116-ad12-c053b0ab31c6/en-US
AWS Entity Resolution blog: https://aws.amazon.com/blogs/aws/aws-entity-resolution-match-and-link-related-records-from-multiple-applications-and-data-stores/

Author (1)

Behram Irani

Behram Irani is currently a technology leader with Amazon Web Services (AWS) specializing in data, analytics and AI/ML. He has spent over 18 years in the tech industry helping organizations, from start-ups to large-scale enterprises, modernize their data platforms. In the last 6 years working at AWS, Behram has been a thought leader in the data, analytics and AI/ML space; publishing multiple papers and leading the digital transformation efforts for many organizations across the globe. Behram has completed his Bachelor of Engineering in Computer Science from the University of Pune and has an MBA degree from the University of Florida.