
You're reading from Serverless ETL and Analytics with AWS Glue

Product type: Book
Published in: Aug 2022
Reading level: Expert
Publisher: Packt
ISBN-13: 9781800564985
Edition: 1st Edition
Authors (6):
Vishal Pathak

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey in AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.

Subramanya Vajiraya

Subramanya Vajiraya is a Big Data Cloud Engineer at AWS Sydney specializing in AWS Glue. He obtained his Bachelor of Engineering degree specializing in Information Science & Engineering from NMAM Institute of Technology, Nitte, KA, India (Visvesvaraya Technological University, Belgaum) in 2015 and his Master of Information Technology degree specializing in Internetworking from the University of New South Wales, Sydney, Australia in 2017. He is passionate about helping customers solve challenging technical issues related to their ETL workloads and implementing scalable data integration and analytics pipelines on AWS.

Noritaka Sekiyama

Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. He has 11 years of experience working in the software industry. Based in Tokyo, Japan, he is responsible for implementing software artifacts, building libraries, troubleshooting complex issues, and helping guide customer architectures.

Tomohiro Tanaka

Tomohiro Tanaka is a senior cloud support engineer at AWS. He works to help customers solve their issues and build data lakes across AWS Glue, AWS IoT, and big data technologies such as Apache Spark, Hadoop, and Iceberg.

Albert Quiroga

Albert Quiroga works as a senior solutions architect at Amazon, where he is helping to design and architect one of the largest data lakes in the world. Prior to that, he spent four years working at AWS, where he specialized in big data technologies such as EMR and Athena, and where he became an expert on AWS Glue. Albert has worked with several Fortune 500 companies on some of the largest data lakes in the world and has helped to launch and develop features for several AWS services.

Ishan Gaur

Ishan Gaur has more than 13 years of IT experience in software development and data engineering, building distributed systems and highly scalable ETL pipelines using Apache Spark, Scala, and various ETL tools such as Ab Initio and Datastage. He currently works at AWS as a senior big data cloud engineer and is an SME of AWS Glue. He is responsible for helping customers to build out large, scalable distributed systems and implement them in AWS cloud environments using various big data services, including EMR, Glue, and Athena, as well as other technologies, such as Apache Spark, Hadoop, and Hive.


Chapter 9: Data Sharing

When you build a cloud-native data platform at scale on AWS, you may want to share your data with multiple stakeholders under governance. Today, data sharing is one of the key topics in data democratization, because it lets more people across the business make data-driven decisions. Typically, the data platform is used by different users, such as data engineers, business analysts, and data scientists.

For example, data engineers own and maintain the data platform, business analysts generate a daily report that represents business revenue and end user activities, and data scientists may want to uncover complex data patterns and build a data model for their applications. In such situations, these users can belong to different business units and organizations. For enterprise data platforms, sharing data securely with different organizations under data governance is a common requirement.

In this chapter, you will learn about three common...

Technical requirements

For this chapter, you need the following resources:

  • An AWS account
  • An AWS IAM role
  • The AWS CLI

Overview of data sharing strategies

There are different ways to share data, depending on your organization and use case. At the time of writing, there are three typical strategies for sharing data:

  • Single tenant
  • Hub and spoke
  • Data mesh

In this section, you will learn about each of these strategies and discuss their backgrounds, challenges, and benefits.

Single tenant

Data lakes have become a popular approach for people who want to store and query data in a centralized repository. A data lake allows you to store structured, semi-structured, and unstructured data at any scale. Cloud storage such as Amazon S3 fits well with data lakes because it places no limit on the total amount of data you can store. You do not need to convert your data into a predefined fixed schema in advance. Instead, you can ingest data as-is. When you want to analyze the data, you can convert it into your preferred schema on the fly and then analyze it on top of the data lake.
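To make the idea of converting data into a schema on the fly more concrete, here is a minimal sketch in plain Python. It is an illustration only, not how AWS Glue or any query engine implements it: the sample records, the `SALES_SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical names introduced for this example. The raw records are kept as-is, and a schema is applied only when the data is read.

```python
import json

# Raw records stored as-is (hypothetical sample data); fields vary per record.
RAW_LINES = [
    '{"user_id": "1", "amount": "19.99", "country": "US"}',
    '{"user_id": "2", "amount": "5", "extra": {"coupon": "X1"}}',
    '{"user_id": "3"}',
]

# The schema is chosen at read time: column name -> (type converter, default).
SALES_SCHEMA = {
    "user_id": (int, None),
    "amount": (float, 0.0),
    "country": (str, "unknown"),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each record onto the given schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for column, (convert, default) in schema.items():
            value = record.get(column)
            # Missing fields fall back to the default; extra fields are ignored.
            row[column] = convert(value) if value is not None else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, SALES_SCHEMA)
total_amount = sum(row["amount"] for row in rows)
```

A different consumer could read the same raw lines with a different schema mapping, which is the flexibility that ingesting data as-is gives you.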

...

Sharing data with multiple AWS accounts using S3 bucket policies and Glue catalog policies

In this section, you will learn how to share your data with multiple AWS accounts using an S3 bucket policy and a Glue catalog policy.

When your use case is simple and you want to share your data with a small number of accounts, you can grant data access in S3 bucket policies (https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-policies.html) and metadata access in Glue catalog resource policies (https://docs.aws.amazon.com/glue/latest/dg/glue-resource-policies.html). You will set these up in the following sections.
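As a rough sketch of what these two policies can look like, the following Python snippet builds both policy documents as dictionaries and prints them as JSON. The account IDs, Region, bucket, database, and table names are placeholders assumed for illustration, and the set of actions is one plausible minimum for read-only access rather than the exact policy used later in this chapter.

```python
import json

# Hypothetical identifiers, used only for illustration.
PRODUCER_ACCOUNT_ID = "111111111111"
CONSUMER_ACCOUNT_ID = "222222222222"
REGION = "us-east-1"
BUCKET = "example-producer-bucket"
DATABASE = "example_db"
TABLE = "example_table"


def account_principal(account_id):
    """Return a policy principal element that refers to a whole AWS account."""
    return {"AWS": f"arn:aws:iam::{account_id}:root"}


def build_bucket_policy(bucket, consumer_account_id):
    """Data access: let the consumer account list the bucket and read objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowConsumerRead",
                "Effect": "Allow",
                "Principal": account_principal(consumer_account_id),
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }


def build_catalog_policy(region, producer_account_id, consumer_account_id,
                         database, table):
    """Metadata access: let the consumer account read one table's metadata."""
    prefix = f"arn:aws:glue:{region}:{producer_account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": account_principal(consumer_account_id),
                "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:database/{database}",
                    f"{prefix}:table/{database}/{table}",
                ],
            }
        ],
    }


bucket_policy = build_bucket_policy(BUCKET, CONSUMER_ACCOUNT_ID)
catalog_policy = build_catalog_policy(
    REGION, PRODUCER_ACCOUNT_ID, CONSUMER_ACCOUNT_ID, DATABASE, TABLE
)
print(json.dumps(bucket_policy, indent=2))
print(json.dumps(catalog_policy, indent=2))
```

The printed documents could then be applied in the producer account, for example with the AWS CLI commands `aws s3api put-bucket-policy` and `aws glue put-resource-policy`. Keep in mind that users in the consumer account also need matching permissions in their own IAM policies before they can use the shared resources.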

Scenario 1 – sharing data from one account with another using S3 bucket policies and Glue catalog policies

In the following scenario, there are two accounts – the producer account and the consumer account. Here, the producer account wants to share its table with the consumer account, and the consumer account wants to run SELECT queries against...

Sharing data with multiple AWS accounts using AWS Lake Formation permissions

In this section, you will learn how to share data with multiple AWS accounts using AWS Lake Formation permissions.

Lake Formation permission model

As you learned in the previous section, there are challenges in managing S3 bucket policies and Glue Data Catalog resource policies. AWS Lake Formation is a service designed to overcome those challenges and simplify data platform management. Lake Formation provides a central layer for defining, classifying, tagging, and managing fine-grained access control to the AWS Glue Data Catalog and Amazon S3 locations. The permission model follows an RDBMS-like style, so you can grant permissions on databases, tables, or columns instead of S3 objects. Once you have granted access to tables with Lake Formation permissions, Lake Formation automatically manages both data access and metadata access under the hood, so you don’t need to manually...
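To illustrate the RDBMS-like style of granting permissions, here is a minimal Python sketch that builds the request for a column-level `SELECT` grant to another account. The account IDs, database, table, and column names are hypothetical, and `build_column_grant` is a helper introduced only for this example. The dictionary keys follow the parameters of the Lake Formation `GrantPermissions` API as exposed by boto3; the snippet only constructs the request and does not call AWS.

```python
# Hypothetical identifiers, used only for illustration.
PRODUCER_ACCOUNT_ID = "111111111111"
CONSUMER_ACCOUNT_ID = "222222222222"


def build_column_grant(producer_account_id, consumer_account_id,
                       database, table, columns):
    """Build keyword arguments for a column-level SELECT grant.

    The grant targets named columns of a catalog table, not S3 objects.
    """
    return {
        "CatalogId": producer_account_id,
        # For a cross-account grant, the principal is the consumer account ID.
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {
            "TableWithColumns": {
                "CatalogId": producer_account_id,
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


grant_request = build_column_grant(
    PRODUCER_ACCOUNT_ID,
    CONSUMER_ACCOUNT_ID,
    database="example_db",
    table="example_table",
    columns=["order_id", "order_date"],
)
```

With boto3 installed and credentials for a Lake Formation administrator in the producer account, a request like this could be sent with `boto3.client("lakeformation").grant_permissions(**grant_request)`. Treat this as a sketch under those assumptions rather than the exact steps used in this chapter.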

Summary

In this chapter, you learned about three common data sharing strategies: single-tenant, hub-and-spoke, and data mesh. You also learned how to share data with different accounts using AWS Glue and AWS Lake Formation, as well as the benefits of doing so. At this point, you can design your data sharing model by choosing the strategy that fits your use case. You also gained hands-on skills in building a data sharing mechanism for your data platform.

In the next chapter, you will learn how to manage the data processing pipeline end to end.
