You're reading from Data Wrangling on AWS

Product type: Book

Published in: Jul 2023

Publisher: Packt

ISBN-13: 9781801810906

Edition: 1st Edition

Tools

AWS

Concepts

Data Analysis

Authors (3):

Navnit Shukla

Sankar M

Sampat Palani

Working with AWS Glue

In the preceding chapter, we discussed various data storage types, including data warehouses, data lakes, data lakehouses, and data meshes, along with their key differences.

This chapter will explore the distinct components of AWS Glue, providing insight into how they can aid in data wrangling tasks.

After completing this chapter, you will be able to comprehend and define how AWS Glue can be utilized for data wrangling. You will also be capable of explaining the fundamental concepts associated with various AWS Glue features, such as AWS Glue Data Catalog, AWS Glue connections, AWS Glue crawlers, AWS Glue Schema Registry, AWS Glue jobs, AWS Glue development endpoints, AWS Glue interactive sessions, and AWS Glue triggers.

The following topics will be covered in this chapter:

Spark basics
AWS Glue features
Data discovery using AWS Glue
Data ingestion using AWS Glue

What is Apache Spark?

Apache Spark is a unified analytics engine for processing big data. It began in 2009 as a research project at the University of California, Berkeley’s AMPLab. Initially, it was created as a class project to address a limitation of the Hadoop framework: for machine learning use cases, data had to be exchanged between iterations through HDFS. The objective was to design a new framework for fast interactive processing, including machine learning and interactive data analysis, while retaining the implicit data parallelism and fault tolerance of MapReduce and HDFS from the Hadoop framework. Spark incorporates in-memory caching and is optimized for analytics workloads of any size.

Apache Spark was open sourced in 2010 under a BSD license, and in 2013 the project was contributed to the Apache Software Foundation. In 2014, Spark became a top-level Apache project. It has attracted more than 1,700 contributors and more than 30,000 stars on GitHub.

According to...

Data discovery with AWS Glue

One of the unique features that sets AWS Glue apart from other ETL tools is its ability to create a centralized data catalog. This catalog is crucial for performing data discovery and relies on two important components of Glue:

Glue Data Catalog
Glue Data Crawler

AWS Glue Data Catalog

A data catalog is a centralized store of metadata for data held in different data stores, such as data lakes, data warehouses, relational databases, and non-relational databases. The metadata contains information about columns, data formats, locations, and serialization/deserialization mechanisms. Hive Metastore is one of the most popular metadata products used in the industry. However, it stores its metadata in a relational database management system (RDBMS) such as MySQL or PostgreSQL. The problem with using an RDBMS for Hive metadata is managing and maintaining it, especially for production workloads where high availability, scaling, and redundancy must be taken...
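To make the kinds of metadata listed above concrete, the following sketch builds a table definition in the shape that the AWS Glue `create_table` API accepts as `TableInput`. The table name, column names, and S3 path are hypothetical, and the Parquet format classes are one common choice rather than anything this chapter prescribes. The `register_table` helper is an assumption about how you might submit the definition with boto3; it needs AWS credentials and is not called here.

```python
from typing import Dict, List, Tuple


def build_table_input(name: str, columns: List[Tuple[str, str]], location: str) -> Dict:
    """Build a Glue Data Catalog table definition for Parquet data in S3."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            # Column names and Hive-style types
            "Columns": [{"Name": col, "Type": dtype} for col, dtype in columns],
            # Where the underlying data lives
            "Location": location,
            # Data format classes
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            # Serialization/deserialization mechanism
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


def register_table(database: str, table_input: Dict) -> None:
    """Hypothetical helper: write the definition to the Data Catalog (needs AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without boto3 installed

    glue = boto3.client("glue")
    glue.create_table(DatabaseName=database, TableInput=table_input)


# Hypothetical table: two columns stored under an example S3 prefix
orders_table = build_table_input(
    "orders",
    [("order_id", "bigint"), ("order_date", "date")],
    "s3://example-bucket/orders/",
)
```

Keeping the definition as a plain dictionary makes it easy to inspect or version before anything is written to the catalog.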

Data ingestion using AWS Glue ETL

In the previous section, we learned how to use various features of AWS Glue Crawler and AWS Glue Data Catalog to create a centralized data catalog for data discovery. In this section, we will explore the option of using AWS Glue ETL for data ingestion from various data sources, such as data lakes (Amazon S3), databases, streaming sources, and SaaS data stores. Additionally, we will learn how to use job bookmarks to perform incremental data loads from a data lake (Amazon S3) and from JDBC sources.
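As a rough illustration of how a job bookmark is switched on, the sketch below assembles the arguments for a boto3 `start_job_run` call using Glue's `--job-bookmark-option` job parameter. The helper functions and the job name passed to them are hypothetical, not part of any AWS library; `start_incremental_run` requires AWS credentials and is not called here.

```python
from typing import Dict

# Values accepted by Glue's --job-bookmark-option job parameter
BOOKMARK_OPTIONS = {
    "enable": "job-bookmark-enable",
    "disable": "job-bookmark-disable",
    "pause": "job-bookmark-pause",
}


def bookmark_arguments(mode: str = "enable") -> Dict[str, str]:
    """Return the job arguments that select a bookmark mode for a run."""
    if mode not in BOOKMARK_OPTIONS:
        raise ValueError(f"unknown bookmark mode: {mode!r}")
    return {"--job-bookmark-option": BOOKMARK_OPTIONS[mode]}


def start_incremental_run(job_name: str) -> str:
    """Hypothetical helper: start a run with bookmarks enabled (needs AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without boto3 installed

    glue = boto3.client("glue")
    response = glue.start_job_run(JobName=job_name, Arguments=bookmark_arguments("enable"))
    return response["JobRunId"]
```

In a Spark ETL script, bookmarks also depend on the script calling `job.init()` and `job.commit()` and on each source read carrying a `transformation_ctx`, so setting the option alone is not enough to get incremental loads.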

Glue enables users to create ETL jobs using three different types of ETL frameworks: Spark ETL, Spark Streaming, and Python Shell. In the introduction to Glue DataBrew, we learned how AWS Glue has evolved and that AWS Glue Studio is now available to build ETL pipelines.
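When a job is defined through the API, the choice among these three frameworks is expressed by the `Name` field of the job's `Command`. The sketch below maps each framework to its command name and builds a minimal definition for the boto3 `create_job` call. The dictionary keys such as `spark_etl`, the role ARN, and the script path are hypothetical placeholders, and `create_job` as wrapped here requires AWS credentials and is not called.

```python
from typing import Dict

# Glue job command names for the three ETL frameworks
COMMAND_NAMES = {
    "spark_etl": "glueetl",
    "spark_streaming": "gluestreaming",
    "python_shell": "pythonshell",
}


def build_job_definition(name: str, role_arn: str, script_location: str, framework: str) -> Dict:
    """Build the minimal keyword arguments for a Glue create_job request."""
    if framework not in COMMAND_NAMES:
        raise ValueError(f"unknown framework: {framework!r}")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": COMMAND_NAMES[framework],
            "ScriptLocation": script_location,
        },
    }


def create_job(job_definition: Dict) -> str:
    """Hypothetical helper: submit the definition to AWS Glue (needs AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without boto3 installed

    glue = boto3.client("glue")
    return glue.create_job(**job_definition)["Name"]


# Hypothetical Spark ETL job definition
etl_job = build_job_definition(
    "orders-ingest",
    "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "s3://example-bucket/scripts/orders_ingest.py",
    "spark_etl",
)
```

A real job would usually also set fields such as the Glue version and worker configuration; they are left out to keep the focus on the framework choice.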

The AWS Glue user interface allows you to build your ETL pipeline with an interesting feature that converts your UI job into a script, which helps you scale when building similar pipelines or...

Summary

In this chapter, we covered Apache Spark, its connection with AWS Glue, and the various features available in AWS Glue, including AWS Glue Data Catalog for data discovery, AWS Glue Crawler for metadata extraction, and AWS Glue Studio for building UI-based ETL pipelines. We also explored how to use the AWS Glue Marketplace to subscribe to different connectors so that we can extract data from SaaS applications.

In the next chapter, we will discuss another essential service that plays a significant role in the data wrangling and discovery process: Amazon Athena.

Authors (3)

Navnit Shukla

Navnit Shukla is an accomplished Senior Solution Architect with a specialization in AWS analytics. With an impressive career spanning 12 years, he has honed his expertise in databases and analytics, establishing himself as a trusted professional in the field. Currently based in Orange County, CA, Navnit's primary responsibility lies in assisting customers in building scalable, cost-effective, and secure data platforms on the AWS cloud.

Sankar M

Sankar Sundaram has worked in the IT industry since 2007, specializing in databases, data warehouses, and analytics. As a specialized Data Architect, he helps customers build and modernize data architectures, including secure, scalable, and performant data lake, database, and data warehouse solutions. Prior to joining AWS, he worked with multiple customers on implementing complex data architectures.

Sampat Palani

Sam Palani has more than 18 years of experience as a developer, data engineer, data scientist, startup cofounder, and IT leader. He holds a master's in Business Administration with a dual specialization in Information Technology. His professional career spans 5 countries and the financial services, management consulting, and technology industries. He is currently Sr Leader for Machine Learning and AI at Amazon Web Services, where he is responsible for multiple lines of the business, product strategy, and thought leadership. Sam is also a practicing data scientist, a writer with multiple publications, a speaker at key industry conferences, and an active open source contributor. Outside work, he loves hiking, photography, experimenting with food, and reading.
