
You're reading from  Serverless ETL and Analytics with AWS Glue

Product type: Book
Published in: Aug 2022
Reading level: Expert
Publisher: Packt
ISBN-13: 9781800564985
Edition: 1st Edition
Authors (6):
Vishal Pathak

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey in AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.

Subramanya Vajiraya

Subramanya Vajiraya is a Big Data Cloud Engineer at AWS Sydney specializing in AWS Glue. He obtained his Bachelor of Engineering degree, specializing in Information Science & Engineering, from NMAM Institute of Technology, Nitte, KA, India (Visvesvaraya Technological University, Belgaum) in 2015, and his Master of Information Technology degree, specializing in Internetworking, from the University of New South Wales, Sydney, Australia in 2017. He is passionate about helping customers solve challenging technical issues related to their ETL workloads and implementing scalable data integration and analytics pipelines on AWS.

Noritaka Sekiyama

Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. He has 11 years of experience working in the software industry. Based in Tokyo, Japan, he is responsible for implementing software artifacts, building libraries, troubleshooting complex issues, and helping guide customer architectures.

Tomohiro Tanaka

Tomohiro Tanaka is a senior cloud support engineer at AWS. He works to help customers solve their issues and build data lakes across AWS Glue, AWS IoT, and big data technologies such as Apache Spark, Hadoop, and Iceberg.

Albert Quiroga

Albert Quiroga works as a senior solutions architect at Amazon, where he is helping to design and architect one of the largest data lakes in the world. Prior to that, he spent four years working at AWS, where he specialized in big data technologies such as EMR and Athena, and where he became an expert on AWS Glue. Albert has worked with several Fortune 500 companies on some of the largest data lakes in the world and has helped to launch and develop features for several AWS services.

Ishan Gaur

Ishan Gaur has more than 13 years of IT experience in software development and data engineering, building distributed systems and highly scalable ETL pipelines using Apache Spark, Scala, and various ETL tools such as Ab Initio and Datastage. He currently works at AWS as a senior big data cloud engineer and is an SME of AWS Glue. He is responsible for helping customers to build out large, scalable distributed systems and implement them in AWS cloud environments using various big data services, including EMR, Glue, and Athena, as well as other technologies, such as Apache Spark, Hadoop, and Hive.


Chapter 14: Machine Learning Integration

Machine learning (ML) is one of the cornerstones of today’s computing for any software-related company. ML models are capable of making predictions or deductions based on past experience, provided as training data. This enables a wide variety of applications with large benefits to any organization.

Because it relies on training data, ML is closely tied to data mining, data processing, and, in general, any kind of extract, transform, load (ETL) process. Training data must be properly cleaned, formatted, and classified before it can be fed to a model, and the quality of this preparation greatly affects the effectiveness of the model itself. Because of this, services such as AWS Glue offer ML-specific features and integrations designed to make ML easier and more effective to use.

Training data preparation is not the only relationship ML has with ETL processes – it can also be used to enhance and provide new transformations within the processes...

Technical requirements

For this chapter, the only requirement is that you have access to this book’s GitHub repository page (https://github.com/PacktPublishing/Serverless-ETL-and-Analytics-with-AWS-Glue).

Glue ML transformations

As mentioned previously, ML is not just an entity that reads the output data from ETL processes, but also one that powers its transformations. ML models enable a wide variety of operations that were not possible before due to computer intelligence limitations.

Because of this, Glue started to offer ML-powered operations with specific purposes under the ML transforms feature. As the name suggests, ML transforms are Glue transforms that are powered by ML models, which must be trained and prepared before the transforms can be used. Once they are ready, they can be called from your ETL job’s code, just like other Glue transforms.

At the time of writing, Glue has only released one ML transform, FindMatches, which will automatically find duplicated records within a dataset. Even though this seems like a simple task (most ETL engines could provide this by simply comparing records and checking if they are equal, or if they share a primary key), ML...
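To illustrate why exact comparison is not enough to find duplicated records, the following is a minimal sketch in plain Python. It does not use or reproduce FindMatches: the records, the field names, the `similarity` helper, and the 0.9 threshold are all hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for a trained model purely for demonstration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records; ids 1 and 2 describe the same person
# but differ by one character in the name.
records = [
    {"id": 1, "name": "Jonathan Smith", "city": "Seattle"},
    {"id": 2, "name": "Jonathon Smith", "city": "Seattle"},
    {"id": 3, "name": "Maria Garcia", "city": "Austin"},
]


def exact_duplicates(rows):
    """Return id pairs whose non-key fields are strictly equal."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(rows, 2)
        if (a["name"], a["city"]) == (b["name"], b["city"])
    ]


def similarity(a, b):
    """Character-level similarity between two records, from 0.0 to 1.0."""
    text_a = f'{a["name"]} {a["city"]}'.lower()
    text_b = f'{b["name"]} {b["city"]}'.lower()
    return SequenceMatcher(None, text_a, text_b).ratio()


def fuzzy_duplicates(rows, threshold=0.9):
    """Return id pairs whose similarity meets the (assumed) threshold."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(rows, 2)
        if similarity(a, b) >= threshold
    ]


if __name__ == "__main__":
    print("exact:", exact_duplicates(records))
    print("fuzzy:", fuzzy_duplicates(records))
```

The exact comparison misses the near-identical pair, while the similarity-based check flags it. A fixed string-similarity threshold like this one is brittle on real data, which is the kind of limitation a trained matching model is meant to address.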

SageMaker integration

Amazon SageMaker is AWS’s primary service for ML development. It provides a set of tools and features that lets users handle all the stages of the ML development pipeline, from data collection and preparation to model deployment and hosting.

Just like any other ML tool, SageMaker relies on the concept of model training to get models up to the accuracy level expected from them. And as we mentioned previously, training ML models usually requires large amounts of data to be prepared and processed. Because of this, SageMaker offers native integration with Apache Spark (https://docs.aws.amazon.com/sagemaker/latest/dg/apache-spark.html), which provides model-training capabilities using an AWS-tailored version of Spark.

One of the most important features SageMaker offers is notebook instances (https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html). A notebook instance is a fully managed EC2 instance that runs Jupyter (https://jupyter.org), a web-based...

Developing ML pipelines with Glue

The combination of SageMaker’s model-hosting features and libraries with Glue’s data preparation and orchestration features allows you to create complex and highly configurable ML pipelines. In this architecture, each service has a different role:

  • Glue is responsible for data handling and orchestration. Data handling includes extraction, processing, preparation, and storage. Orchestration refers to the overall execution of the pipeline itself.
  • SageMaker handles all ML-related tasks such as model creation, training, and hosting.

Several components are critical to this, as follows:

  • Glue workflows are the main form of orchestration in Glue. Workflows allow users to define graph-based chains of crawlers, ETL jobs, and triggers, and to see their execution visually in the web console.
  • Python Shell jobs are a sub-class of Glue ETL jobs that are designed to run plain Python scripts instead of PySpark ones. They...
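The idea of a workflow as a graph-based chain of steps can be sketched in plain Python. This is not the Glue API: the step names and the dependency map below are hypothetical, and the standard library's `graphlib` (Python 3.9+) is used only to show how a dependency graph determines the order in which pipeline steps run.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps that must
# finish before it can start. The split mirrors the roles described above:
# Glue prepares the data, SageMaker trains and hosts the model.
workflow = {
    "crawl_raw_data": set(),                       # Glue crawler
    "prepare_training_data": {"crawl_raw_data"},   # Glue ETL job
    "train_model": {"prepare_training_data"},      # SageMaker training
    "deploy_endpoint": {"train_model"},            # SageMaker hosting
}


def execution_order(graph):
    """Return the steps in an order that respects every dependency.

    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for position, step in enumerate(execution_order(workflow), start=1):
        print(position, step)
```

In a real pipeline, the edges of such a graph would be expressed as triggers between crawlers and jobs rather than as a Python dictionary; the sketch only shows the ordering logic.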

Summary

In this chapter, we discussed ML within AWS Glue. We talked about Glue ML transforms: what they are, how they are trained, and how they can be used. We also discussed Amazon SageMaker and how it can integrate with Glue resources to accelerate the execution of ML code in notebooks. Finally, we analyzed reference architectures and services for ML pipelines using AWS Glue and SageMaker.

These concepts should have given you a complete overview of how Glue can be used for ML purposes, and how Glue can fit into your ML architecture in the AWS cloud. In the next chapter, we will talk about the data lake architecture and designing use cases for real-world scenarios.
