Running pandas code using AWS Glue for Ray
The pandas library is a highly popular Python library for data manipulation and analysis. It builds on the well-established NumPy library and handles data in a table-like format. It is so entrenched among Python analysts and data scientists that it has become a de facto standard, to the point that other libraries implement its interface so that they can run existing pandas code. This is often done to overcome pandas’ main limitation: it is a single-process, in-memory library, which limits scalability.
One such pandas-compatible library is Modin. It can run existing pandas code with only a change to the import statement, and it scales by delegating execution to an engine such as Dask or Ray. In this recipe, you will see how to run pandas code on Glue for Ray using Modin.
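To make the import swap concrete, here is a minimal sketch rather than the recipe's actual script. The sample data and column names are hypothetical, and the fallback to plain pandas is an assumption added so that the snippet still runs where Modin is not installed.

```python
# Swap the import: Modin exposes the pandas API under modin.pandas.
try:
    import modin.pandas as pd  # execution is delegated to an engine such as Ray or Dask
except ImportError:
    import pandas as pd  # assumption: fall back to plain pandas if Modin is missing

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "category": ["a", "b", "a", "c", "b"],
    "amount": [10, 20, 30, 40, 50],
})

# From here on, the code is ordinary pandas code and is the same for both libraries.
totals = df.groupby("category")["amount"].sum()
result = totals.to_dict()
print(result)
```

Everything after the import is unchanged pandas code, which is why an existing script can usually be moved to Modin without touching its logic.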
Getting ready
This recipe requires a bash shell with the AWS CLI installed and configured. The GLUE_ROLE_ARN and GLUE_BUCKET environment variables need to be set, as indicated in...