You're reading from Data Engineering with AWS Cookbook A recipe-based approach to help you tackle data engineering problems with AWS services

Product type Paperback

Published in Nov 2024

Publisher Packt

ISBN-13 9781805127284

Length 528 pages

Edition 1st Edition

Languages

Python

Tools

AWS Glue

Concepts

Cloud Computing

Authors (4):

Trâm Ngọc Phạm

Gonzalo Herreros González

Viquar Khan

Huda Nofal

Table of Contents (16) Chapters

Preface

1. Chapter 1: Managing Data Lake Storage

2. Chapter 2: Sharing Your Data Across Environments and Accounts

3. Chapter 3: Ingesting and Transforming Your Data with AWS Glue

4. Chapter 4: A Deep Dive into AWS Orchestration Frameworks

5. Chapter 5: Running Big Data Workloads with Amazon EMR

6. Chapter 6: Governing Your Platform

7. Chapter 7: Data Quality Management

8. Chapter 8: DevOps – Defining IaC and Building CI/CD Pipelines

9. Chapter 9: Monitoring Data Lake Cloud Infrastructure

10. Chapter 10: Building a Serving Layer with AWS Analytics Services

11. Chapter 11: Migrating to AWS – Steps, Strategies, and Best Practices for Modernizing Your Analytics and Big Data Workloads

12. Chapter 12: Harnessing the Power of AWS for Seamless Data Warehouse Migration

13. Chapter 13: Strategizing Hadoop Migrations – Cost, Data, and Workflow Modernization with AWS

14. Index

15. Other Books You May Enjoy

Running pandas code using AWS Glue for Ray

The pandas library is a highly popular Python library for data manipulation and analysis. It is built on the well-established NumPy library and handles data in a table-like format. It is so widely used among Python analysts and data scientists that it has become a de facto standard, to the point that other libraries implement its interface so that they can run existing pandas code. This is often done to overcome pandas' main limitation: it is a single-process, in-memory library, which limits scalability.

One such pandas-compatible library is Modin. It can run existing pandas code with only a change to the imports, and it scales by delegating execution to an engine such as Dask or Ray. In this recipe, you will see how to run pandas code on Glue for Ray using Modin.
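To make the "change the imports" idea concrete, here is a minimal sketch rather than the recipe's actual script. The function name `total_by_category` and the sample rows are hypothetical, and the fallback to plain pandas is an assumption added only so that the snippet also runs where Modin is not installed.

```python
# Modin mirrors the pandas API, so only the import line differs.
try:
    import modin.pandas as pd  # uses a distributed engine such as Ray when Modin is installed
except ImportError:
    import pandas as pd  # assumption: fall back to plain pandas for local experiments


def total_by_category(rows):
    """Sum the 'amount' column per 'category' (hypothetical example logic)."""
    df = pd.DataFrame(rows)
    totals = df.groupby("category")["amount"].sum()
    # Convert to plain Python types so the result is the same with either library.
    return {str(key): float(value) for key, value in totals.to_dict().items()}


# Hypothetical sample data, for illustration only.
sample_rows = [
    {"category": "books", "amount": 12.5},
    {"category": "games", "amount": 30.0},
    {"category": "books", "amount": 7.5},
]
print(total_by_category(sample_rows))
```

The body of `total_by_category` is ordinary pandas code; nothing in it refers to Modin, which is why existing scripts can usually be moved over by editing the import alone.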

Getting ready

This recipe requires a bash shell with the AWS CLI installed and configured. The GLUE_ROLE_ARN and GLUE_BUCKET environment variables need to be set, as indicated in...