Reader small image

You're reading from  Serverless ETL and Analytics with AWS Glue

Product typeBook
Published inAug 2022
Reading LevelExpert
PublisherPackt
ISBN-139781800564985
Edition1st Edition
Languages
Right arrow
Authors (6):
Vishal Pathak
Vishal Pathak
author image
Vishal Pathak

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey in AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.
Read more about Vishal Pathak

Subramanya Vajiraya
Subramanya Vajiraya
author image
Subramanya Vajiraya

Subramanya Vajiraya is a Big data Cloud Engineer at AWS Sydney specializing in AWS Glue. He obtained his Bachelor of Engineering degree specializing in Information Science & Engineering from NMAM Institute of Technology, Nitte, KA, India (Visvesvaraya Technological University, Belgaum) in 2015 and obtained his Master of Information Technology degree specialized in Internetworking from the University of New South Wales, Sydney, Australia in 2017. He is passionate about helping customers solve challenging technical issues related to their ETL workload and implementing scalable data integration and analytics pipelines on AWS.
Read more about Subramanya Vajiraya

Noritaka Sekiyama
Noritaka Sekiyama
author image
Noritaka Sekiyama

Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. He has 11 years of experience working in the software industry. Based in Tokyo, Japan, he is responsible for implementing software artifacts, building libraries, troubleshooting complex issues and helping guide customer architectures
Read more about Noritaka Sekiyama

Tomohiro Tanaka
Tomohiro Tanaka
author image
Tomohiro Tanaka

Tomohiro Tanaka is a senior cloud support engineer at AWS. He works to help customers solve their issues and build data lakes across AWS Glue, AWS IoT, and big data technologies such Apache Spark, Hadoop, and Iceberg.
Read more about Tomohiro Tanaka

Albert Quiroga
Albert Quiroga
author image
Albert Quiroga

Albert Quiroga works as a senior solutions architect at Amazon, where he is helping to design and architect one of the largest data lakes in the world. Prior to that, he spent four years working at AWS, where he specialized in big data technologies such as EMR and Athena, and where he became an expert on AWS Glue. Albert has worked with several Fortune 500 companies on some of the largest data lakes in the world and has helped to launch and develop features for several AWS services.
Read more about Albert Quiroga

Ishan Gaur
Ishan Gaur
author image
Ishan Gaur

Ishan Gaur has more than 13 years of IT experience in soft ware development and data engineering, building distributed systems and highly scalable ETL pipelines using Apache Spark, Scala, and various ETL tools such as Ab Initio and Datastage. He currently works at AWS as a senior big data cloud engineer and is an SME of AWS Glue. He is responsible for helping customers to build out large, scalable distributed systems and implement them in AWS cloud environments using various big data services, including EMR, Glue, and Athena, as well as other technologies, such as Apache Spark, Hadoop, and Hive.
Read more about Ishan Gaur

View More author details
Right arrow

AWS Glue

On August 14, 2017, AWS released a new service called AWS Glue. AWS Glue is a serverless data integration service. AWS Glue also provides some easy-to-use features that almost eliminate the administrative overhead of infrastructure management and simplify how common data integration tasks can be integrated.

Let’s look at some of the notable components of the AWS Glue feature set:

  • AWS Glue DataBrew: Glue DataBrew is used for data cleansing and enrichment through another GUI. Creating AWS Glue DataBrew Jobs does not require the user to write any source code and the Jobs are created with the help of a GUI.
  • AWS Glue Data Catalog: AWS Glue Data Catalog is a central catalog of metadata that can be used with other AWS services such as Amazon Athena, Amazon Redshift, and Amazon EMR.
  • AWS Glue Connections: Glue Connections are catalog objects that help organize and store connection information to various data stores. AWS Glue Connections can also be created for Marketplace AWS Glue Connectors, which allows you to integrate with third-party data stores, such as Apache Hudi, Google Big Query, and Elastic Search.
  • AWS Glue Crawlers: Crawlers can be used to crawl existing data and populate an AWS Glue Data Catalog with metadata.
  • AWS Glue ETL Jobs: Glue ETL Jobs enables users to extract source data from various data stores, process it, and write output to a data target based on the logic defined in the ETL script. Users can take advantage of Apache Spark-based ETL Jobs to handle their workload in a distributed fashion. Glue also offers Python shell Jobs for ETL workloads; these don’t need distributed processing.
  • AWS Glue Interactive Sessions: Interactive sessions are managed interactive environments that can be used to develop and test AWS Glue ETL scripts.
  • AWS Glue Schema Registry: AWS Glue Schema Registry allows users to centrally control data stream schemas and has integrations with Apache Kafka, Amazon Kinesis, and AWS Lambda.
  • AWS Glue Triggers: AWS Glue Triggers are data catalog objects that allow us to either manually or automatically start executing one or more AWS Glue Crawlers or AWS Glue ETL Jobs.
  • AWS Glue Workflows: Glue Workflows can be used to orchestrate the execution of a set of AWS Glue Jobs and AWS Glue Crawlers using AWS Glue Triggers.
  • AWS Glue Blueprints: Blueprints are useful for creating parameterized workflows that can be created and shared for similar use cases.
  • AWS Glue Elastic Views: Glue Elastic Views helps users replicate the data from one store to another using familiar SQL syntax.

This book will focus on learning about AWS Glue, diving deep into the features listed here, and learning about how these features help solve the data problems of the modern world. We will also learn about the fundamental concepts of AWS LakeFormation, which are important for securely managing and administering the data assets of an organization.

Querying data using AWS

At the beginning of this chapter, we focused on various ways to collect and organize the data from various systems to enable various downstream workloads, such as feature engineering, data exploration, and analytics. While data lakes and data meshes have reduced the entry barrier to democratize data, you may still need to access data from various purpose-built stores.

Today’s applications are built around the microservice architecture, which allows teams to split vertically based on their functionality and scale independently. Organizations may have their two pizza teams working on different microservices. Each of these teams is independent and can pick its own purpose-built data stores to support its application.

In an ideal world, data from all of these purpose-built stores should flow into the data lake, but this might not always be the case. In a world where the speed of decision-making is paramount, data analysts may want to access the data and combine it even before the data starts hydrating the data lake.

This requirement led to the need for modern tools to support querying data across multiple different sources. In the AWS ecosystem, both Amazon Athena and Amazon Redshift allow you to query data across multiple data stores.

While using Amazon Athena to query S3 data cataloged in AWS Glue Catalog is quite common, Amazon Athena can also be used to query data from Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB, Amazon RDS, and JDBC-compliant relational data sources such MySQL and PostgreSQL under the Apache 2.0 license using AWS Lambda-based data source connectors. Athena Query Federation SDK can be used to write a customer connector too. These connectors return data in Apache Arrow format. Amazon Athena uses these connectors and manages parallelism, along with predicate pushdown.

Similarly, Amazon Redshift also supports querying Amazon S3 data through Amazon Redshift Spectrum. Redshift also supports querying data in Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Amazon Aurora MySQL-Compatible Edition through its Query Federation feature. Amazon Redshift offloads part of the computations to the target data stores and uses its parallel processing capabilities for the query’s operation.

To handle the undifferentiated heavy lifting, AWS Glue introduced a new feature called AWS Glue Elastic Views. It allows users to use familiar SQL. It combines and materializes the data from various sources into the target. Since AWS Glue Elastic Views is serverless, users do not have to worry about managing the underlying infrastructure or keeping the target hydrated.

Previous PageNext Page
You have been reading a chapter from
Serverless ETL and Analytics with AWS Glue
Published in: Aug 2022Publisher: PacktISBN-13: 9781800564985
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (6)

author image
Vishal Pathak

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey in AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.
Read more about Vishal Pathak

author image
Subramanya Vajiraya

Subramanya Vajiraya is a Big data Cloud Engineer at AWS Sydney specializing in AWS Glue. He obtained his Bachelor of Engineering degree specializing in Information Science & Engineering from NMAM Institute of Technology, Nitte, KA, India (Visvesvaraya Technological University, Belgaum) in 2015 and obtained his Master of Information Technology degree specialized in Internetworking from the University of New South Wales, Sydney, Australia in 2017. He is passionate about helping customers solve challenging technical issues related to their ETL workload and implementing scalable data integration and analytics pipelines on AWS.
Read more about Subramanya Vajiraya

author image
Noritaka Sekiyama

Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. He has 11 years of experience working in the software industry. Based in Tokyo, Japan, he is responsible for implementing software artifacts, building libraries, troubleshooting complex issues and helping guide customer architectures
Read more about Noritaka Sekiyama

author image
Tomohiro Tanaka

Tomohiro Tanaka is a senior cloud support engineer at AWS. He works to help customers solve their issues and build data lakes across AWS Glue, AWS IoT, and big data technologies such Apache Spark, Hadoop, and Iceberg.
Read more about Tomohiro Tanaka

author image
Albert Quiroga

Albert Quiroga works as a senior solutions architect at Amazon, where he is helping to design and architect one of the largest data lakes in the world. Prior to that, he spent four years working at AWS, where he specialized in big data technologies such as EMR and Athena, and where he became an expert on AWS Glue. Albert has worked with several Fortune 500 companies on some of the largest data lakes in the world and has helped to launch and develop features for several AWS services.
Read more about Albert Quiroga

author image
Ishan Gaur

Ishan Gaur has more than 13 years of IT experience in soft ware development and data engineering, building distributed systems and highly scalable ETL pipelines using Apache Spark, Scala, and various ETL tools such as Ab Initio and Datastage. He currently works at AWS as a senior big data cloud engineer and is an SME of AWS Glue. He is responsible for helping customers to build out large, scalable distributed systems and implement them in AWS cloud environments using various big data services, including EMR, Glue, and Athena, as well as other technologies, such as Apache Spark, Hadoop, and Hive.
Read more about Ishan Gaur