You're reading from Serverless ETL and Analytics with AWS Glue

Product typeBook

Published inAug 2022

Reading LevelExpert

PublisherPackt

ISBN-139781800564985

Edition1st Edition

Languages

Python

Tools

AWS Glue

Concepts

Data Analysis

Authors (6):

Vishal Pathak

Subramanya Vajiraya

Noritaka Sekiyama

Tomohiro Tanaka

Albert Quiroga

Ishan Gaur

View More author details

Preface

These days, organizations have gravitated toward data-driven business. Today, data integration across various data sources has become a key driver for businesses. In the cloud, data integration services such as AWS Glue do the undifferentiated heavy lifting based on serverless infrastructure. AWS Glue helps you to integrate data across different sources and build a data lake at scale in a serverless fashion without maintaining infrastructure.

This book shows you how AWS Glue can be used to solve real-world problems, along with teaching you about data processing, data integration, and building data lakes. It allows you to learn how to perform various aspects of data integration techniques such as data ingestion from various sources, data layout optimization, data and metadata management, and data pipeline management. Further, it covers data analysis use cases such as ad hoc queries, visualization, and real-time analysis using AWS Glue. Additional topics such as CI/CD, data quality validation, data sharing, and data security aspects, such as access control, encryption, auditing, and networking, are also covered. Toward the end, the book focuses on providing various monitoring options and the best practices for tuning, debugging, and troubleshooting.

The book takes you through the AWS Glue features such as jobs, the Data Catalog, crawlers, DataBrew, Glue Studio, custom connectors, and so on, in addition to AWS Lake Formation.

By the end of this book, you will be able to integrate data across different sources and build a data platform for scalable analysis using AWS Glue.

Who this book is for

This book is designed for data engineers, ETL developers, and data analysts who want to understand how AWS Glue can help to solve their business problems. Basic knowledge of AWS data services is assumed. Experience with AWS Glue is also preferred but not required. Even without prior knowledge, you can start learning AWS Glue with the book. Most of the features are accompanied by a walkthrough to help you understand the concepts that are explained in each chapter.

What this book covers

Chapter 1, Data Management – Introduction and Concepts, introduces basic concepts associated with data management.

Chapter 2, Introduction to Important AWS Glue Features, introduces some important AWS Glue features.

Chapter 3, Data Ingestion, describes how to ingest data across multiple data stores.

Chapter 4, Data Preparation, describes typical data preparation use cases with both a GUI-based approach and a source code-based approach using AWS Glue.

Chapter 5, Designing Data Layouts, describes how to optimize data layout on Amazon S3 using AWS Glue.

Chapter 6, Data Management, describes how to manage, clean up, and enrich data using AWS Glue.

Chapter 7, Metadata Management, describes how to populate and maintain metadata based on data using AWS Glue.

Chapter 8, Data Security, describes how to secure your data by access control, encryption, auditing, and network security using AWS Glue.

Chapter 9, Data Sharing, describes how to share your data across multiple accounts to democratize your data lake.

Chapter 10, Data Pipeline Management, describes how to build and orchestrate a data-processing pipeline using AWS Glue.

Chapter 11, Monitoring, describes how to monitor a data lake and AWS Glue components.

Chapter 12, Tuning, Debugging, and Troubleshooting, describes the best practices to tune, debug, and troubleshoot typical use cases.

Chapter 13, Data Analysis, describes common options to analyze data using AWS analytics services.

Chapter 14, Machine Learning Integration, describes how to utilize your data for a machine learning workload.

Chapter 15, Architecting Data Lakes for Real-World Scenarios and Edge Cases, describes end-to-end examples of architecting data lakes.

To get the most out of this book

All walkthroughs will require a web browser (Google Chrome, Mozilla Firefox, Microsoft Edge, or Safari) installed on a computer in order to use AWS Management Console, and you’ll need an AWS account to access the AWS Console and utilize AWS resources. Next to that, you’ll need to install the AWS Command Line Interface (AWS CLI) on a computer to run commands:

Not all the chapters’ walkthroughs require an AWS CLI installation. You’ll be informed in each chapter when you need further requirements.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Serverless-ETL-and-Analytics-with-AWS-Glue. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/fTqGe.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “We used the glueContext.write_dynamic_frame.from_options() method to write the data to Amazon S3.”

A block of code is set as follows:

root

 |-- ColumnA: string (nullable = true)

 |-- ColumnB: string (nullable = true)

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “This can be done by navigating to AWS Glue Studio console | Connectors | Marketplace Connectors and subscribing to Cloudwatch Metrics connector for AWS Glue.”

Tips or Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

The rest of the chapter is locked

You have been reading a chapter from

Serverless ETL and Analytics with AWS Glue

Published in: Aug 2022Publisher: PacktISBN-13: 9781800564985

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at €14.99/month. Cancel anytime

Authors (6)

Vishal Pathak

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey in AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.
Read more about Vishal Pathak

Subramanya Vajiraya

Subramanya Vajiraya is a Big data Cloud Engineer at AWS Sydney specializing in AWS Glue. He obtained his Bachelor of Engineering degree specializing in Information Science & Engineering from NMAM Institute of Technology, Nitte, KA, India (Visvesvaraya Technological University, Belgaum) in 2015 and obtained his Master of Information Technology degree specialized in Internetworking from the University of New South Wales, Sydney, Australia in 2017. He is passionate about helping customers solve challenging technical issues related to their ETL workload and implementing scalable data integration and analytics pipelines on AWS.
Read more about Subramanya Vajiraya

Noritaka Sekiyama

Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. He has 11 years of experience working in the software industry. Based in Tokyo, Japan, he is responsible for implementing software artifacts, building libraries, troubleshooting complex issues and helping guide customer architectures
Read more about Noritaka Sekiyama

Tomohiro Tanaka

Tomohiro Tanaka is a senior cloud support engineer at AWS. He works to help customers solve their issues and build data lakes across AWS Glue, AWS IoT, and big data technologies such Apache Spark, Hadoop, and Iceberg.
Read more about Tomohiro Tanaka

Albert Quiroga

Albert Quiroga works as a senior solutions architect at Amazon, where he is helping to design and architect one of the largest data lakes in the world. Prior to that, he spent four years working at AWS, where he specialized in big data technologies such as EMR and Athena, and where he became an expert on AWS Glue. Albert has worked with several Fortune 500 companies on some of the largest data lakes in the world and has helped to launch and develop features for several AWS services.
Read more about Albert Quiroga

Ishan Gaur

Ishan Gaur has more than 13 years of IT experience in soft ware development and data engineering, building distributed systems and highly scalable ETL pipelines using Apache Spark, Scala, and various ETL tools such as Ab Initio and Datastage. He currently works at AWS as a senior big data cloud engineer and is an SME of AWS Glue. He is responsible for helping customers to build out large, scalable distributed systems and implement them in AWS cloud environments using various big data services, including EMR, Glue, and Athena, as well as other technologies, such as Apache Spark, Hadoop, and Hive.
Read more about Ishan Gaur

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages

You're reading from Serverless ETL and Analytics with AWS Glue

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Share Your Thoughts

Unlock this book and the full library FREE for 7 days

Authors (6)

Et al.

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Mastering Tableau 2023

Building AI Applications with ChatGPT APIs

Building AI Applications with ChatGPT APIs

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

Modern Data Architecture on AWS

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

TinyML Cookbook