You're reading from Data Wrangling on AWS

Product type: Book
Published in: Jul 2023
Publisher: Packt
ISBN-13: 9781801810906
Edition: 1st Edition
Tools: AWS
Concepts: Data Analysis

Authors (3):

Navnit Shukla

Sankar M

Sampat Palani


Preface

Welcome to the world of Data Wrangling on AWS! In this comprehensive book, we will explore the exciting field of data wrangling and uncover the immense potential of leveraging Amazon Web Services (AWS) for efficient and effective data manipulation and preparation. Whether you are a data professional, a data scientist, or someone interested in harnessing the power of data, this book will provide you with the knowledge and tools to excel in the realm of data wrangling on the AWS platform.

Data wrangling, also known as data preparation or data munging, is a critical step in the data analysis process. It involves transforming raw data into a clean, structured format that is ready for analysis. With the exponential growth of data and the increasing need for data-driven decision-making, mastering the art of data wrangling has become essential for extracting valuable insights from vast and complex datasets.
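To make the idea of turning raw data into a clean, structured format concrete, here is a minimal sketch that uses only the Python standard library. The sample data, the column names, and the `wrangle` helper are all hypothetical and are not taken from the book; the cleaning rules (trim whitespace, drop rows with a missing amount, normalize name casing, remove duplicate orders) are simply one plausible set of choices.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, inconsistent casing,
# a missing amount, and a duplicated order.
RAW = """order_id, customer ,amount
1001, alice ,19.99
1002,BOB,
1001, alice ,19.99
1003,Carol, 5.00
"""


def wrangle(raw_text):
    """Return a list of cleaned, typed records from messy CSV text."""
    reader = csv.reader(io.StringIO(raw_text))
    header = [name.strip() for name in next(reader)]
    seen_ids = set()
    clean = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = {key: value.strip() for key, value in zip(header, row)}
        if not record["amount"]:
            continue  # drop rows with a missing amount
        order_id = int(record["order_id"])
        if order_id in seen_ids:
            continue  # drop duplicate orders
        seen_ids.add(order_id)
        clean.append(
            {
                "order_id": order_id,
                "customer": record["customer"].title(),
                "amount": float(record["amount"]),
            }
        )
    return clean


clean_rows = wrangle(RAW)
for row in clean_rows:
    print(row)
```

The same steps (standardize, validate, deduplicate, convert types) are what the managed services covered in later chapters are meant to handle at a much larger scale.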

In this book, we will guide you through a series of chapters, each focusing on a specific aspect of data wrangling on AWS. We will explore various AWS services and tools that empower you to efficiently manipulate, transform, and prepare your data for analysis. From AWS Glue and Athena to SageMaker Data Wrangler and QuickSight, we will delve into the powerful capabilities of these services and uncover their potential for unlocking valuable insights from your data.

Throughout the chapters, you will learn how to leverage AWS’s cloud infrastructure and robust data processing capabilities to streamline your data-wrangling workflows. You will discover practical techniques, best practices, and hands-on examples that will equip you with the skills to tackle real-world data challenges and extract meaningful information from your datasets.

So, whether you are just starting your journey in data wrangling or looking to expand your knowledge in the AWS ecosystem, this book is your comprehensive guide to mastering data wrangling on AWS. Get ready to unlock the power of data and unleash its full potential with the help of AWS’s cutting-edge technologies and tools.

Let’s dive in and embark on an exciting journey into the world of data wrangling on AWS!

Who this book is for

Data Wrangling on AWS is designed for a wide range of individuals who are interested in mastering the art of data wrangling and leveraging the power of AWS for efficient and effective data manipulation and preparation. The book caters to the following audience:

Data Professionals: Data engineers, data analysts, and data scientists who work with large and complex datasets and want to enhance their data-wrangling skills on the AWS platform
AWS Users: Individuals who are already familiar with AWS and want to explore the specific services and tools available for data wrangling
Business Analysts: Professionals involved in data-driven decision-making and analysis who need to acquire data-wrangling skills to derive valuable insights from their data
IT Professionals: Technology enthusiasts and IT practitioners who want to expand their knowledge of data wrangling on the AWS platform

While prior experience with data wrangling or AWS is beneficial, the book provides a solid foundation for beginners and gradually progresses to more advanced topics. Familiarity with basic programming concepts and SQL would be advantageous but is not mandatory. The book combines theoretical explanations with practical examples and hands-on exercises, making it accessible to individuals with different backgrounds and skill levels.

What this book covers

Chapter 1, Getting Started with Data Wrangling: In the opening chapter, you will embark on a journey into the world of data wrangling and discover the power of leveraging AWS for efficient and effective data manipulation and preparation. This chapter serves as a solid foundation, providing you with an overview of the key concepts and tools you’ll encounter throughout the book.

Chapter 2, Introduction to AWS Glue DataBrew: In this chapter, you will discover the powerful capabilities of AWS Glue DataBrew for data wrangling and data preparation tasks. This chapter will guide you through the process of leveraging AWS Glue DataBrew to cleanse, transform, and enrich your data, ensuring its quality and usability for further analysis.

Chapter 3, Introducing AWS SDK for pandas: In this chapter, you will be introduced to the versatile capabilities of AWS SDK for pandas (formerly known as AWS Data Wrangler) for data wrangling tasks on the AWS platform. This chapter will provide you with a comprehensive understanding of the library and how it can empower you to efficiently manipulate and prepare your data for analysis.

Chapter 4, Introduction to SageMaker Data Wrangler: In this chapter, you will discover the capabilities of Amazon SageMaker Data Wrangler for data wrangling tasks within the Amazon SageMaker ecosystem. This chapter will equip you with the knowledge and skills to leverage Amazon SageMaker Data Wrangler’s powerful features to efficiently preprocess and prepare your data for machine learning projects.

Chapter 5, Working with Amazon S3: In this chapter, you will delve into the world of Amazon Simple Storage Service (S3) and explore its vast potential for storing, organizing, and accessing your data. This chapter will provide you with a comprehensive understanding of Amazon S3 and how it can be leveraged for effective data management and manipulation.

Chapter 6, Working with AWS Glue: In this chapter, you will dive into the powerful capabilities of AWS Glue, a fully managed extract, transform, and load (ETL) service provided by AWS. This chapter will guide you through the process of leveraging AWS Glue to automate and streamline your data preparation and transformation workflows.

Chapter 7, Working with Athena: In this chapter, you will explore the powerful capabilities of Amazon Athena, a serverless query service that enables you to analyze data directly in Amazon S3 using standard SQL queries. This chapter will guide you through the process of leveraging Amazon Athena to unlock valuable insights from your data, without the need for complex data processing infrastructure.

Chapter 8, Working with QuickSight: In this chapter, you will discover the power of Amazon QuickSight, a fast, cloud-powered business intelligence (BI) service provided by AWS. This chapter will guide you through the process of leveraging QuickSight to create interactive dashboards and visualizations, enabling you to gain valuable insights from your data.

Chapter 9, Building an End-to-End Data-Wrangling Pipeline with AWS SDK for Pandas: In this chapter, you will explore the powerful combination of AWS SDK for pandas (formerly known as AWS Data Wrangler) and pandas, a popular Python library for data manipulation and analysis. This chapter will guide you through the process of using pandas operations together with AWS SDK for pandas to perform advanced data transformations and analysis on your datasets.

Chapter 10, Data Processing for Machine Learning with SageMaker Data Wrangler: In this chapter, you will delve into the world of machine learning (ML) data optimization using the powerful capabilities of Amazon SageMaker Data Wrangler. This chapter will guide you through the process of leveraging SageMaker Data Wrangler to preprocess and prepare your data for ML projects, maximizing the performance and accuracy of your ML models.

Chapter 11, Data Lake Security and Monitoring: In this chapter, you will be introduced to Identity and Access Management (IAM) on AWS and how closely Data Wrangler integrates with AWS’s security features. We will show how you can interact directly with Amazon CloudWatch Logs, query the logs, and return the results as a data frame.

To get the most out of this book

To get the most out of this book, it is helpful to have a basic understanding of data concepts and familiarity with data manipulation techniques. Some prior exposure to programming languages, such as Python or SQL, will also be beneficial, as we will be utilizing these languages for data-wrangling tasks. Additionally, a foundational understanding of cloud computing and AWS will aid in grasping the concepts and tools discussed throughout the book. While not essential, having hands-on experience with AWS services such as Amazon S3 and AWS Glue will further enhance your learning experience. By having these prerequisites in place, you will be able to fully engage with the content and successfully apply the techniques and practices covered in Data Wrangling on AWS.

Software/hardware covered in the book	Operating system requirements
AWS	Windows, macOS, or Linux

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Wrangling-on-AWS. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”

A block of code is set as follows:

import sys
import awswrangler as wr
print(wr.__version__)

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" = "^([\\w\\s.+-]{11})\\s([\\w\\s.+-]{8})\\s([\\w\\s.+-]{9})\\s([\\w]{4})\\s([\\d]{4})\\s([\\d]{4})")
LOCATION 's3://<<location your file/'
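The `input.regex` value in the snippet above describes six fixed-width fields separated by single whitespace characters, with one capture group per table column. The following sketch applies the same pattern in Python to show how a line is split. The sample line and the `parse_line` helper are hypothetical, and Python's `re` module stands in for the Java regex engine used by the SerDe; this particular pattern uses only constructs that should behave the same in both. The backslashes are doubled in the SQL because they sit inside a quoted string, so the Python raw string uses single backslashes.

```python
import re

# Same pattern as the SerDe property, written as a Python raw string.
PATTERN = re.compile(
    r"^([\w\s.+-]{11})\s([\w\s.+-]{8})\s([\w\s.+-]{9})"
    r"\s([\w]{4})\s([\d]{4})\s([\d]{4})"
)

# Hypothetical fixed-width record: the first three fields are padded
# to 11, 8, and 9 characters, followed by three 4-character fields.
SAMPLE_LINE = " ".join(
    [
        "Widget A1".ljust(11),
        "North".ljust(8),
        "2023-01".ljust(9),
        "AB12",
        "0042",
        "1999",
    ]
)


def parse_line(line):
    """Return the six captured fields with padding removed, or None."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return [group.strip() for group in match.groups()]


fields = parse_line(SAMPLE_LINE)
print(fields)
```

A line that does not fit the expected widths does not match at all, which is why `parse_line` returns `None` in that case instead of a partial result.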

Any command-line input or output is written as follows:

git clone https://github.com/aws-samples/aws-database-migration-samples.git
cd aws-database-migration-samples/mysql/sampledb/v1/

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Click on the Upload button.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there. You can also get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/9781801810906

Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits directly to your email.


Authors (3)

Navnit Shukla

Navnit Shukla is an accomplished Senior Solution Architect with a specialization in AWS analytics. With an impressive career spanning 12 years, he has honed his expertise in databases and analytics, establishing himself as a trusted professional in the field. Currently based in Orange County, CA, Navnit's primary responsibility lies in assisting customers in building scalable, cost-effective, and secure data platforms on the AWS cloud.

Sankar M

Sankar Sundaram has been working in the IT industry since 2007, specializing in databases, data warehouses, and analytics. As a specialized Data Architect, he helps customers build and modernize data architectures, including secure, scalable, and performant data lake, database, and data warehouse solutions. Prior to joining AWS, he worked with multiple customers to implement complex data architectures.

Sampat Palani

Sam Palani has over 18 years of experience as a developer, data engineer, data scientist, startup cofounder, and IT leader. He holds a master's in Business Administration with a dual specialization in Information Technology. His professional career spans 5 countries and the financial services, management consulting, and technology industries. He is currently Sr Leader for Machine Learning and AI at Amazon Web Services, where he is responsible for multiple lines of the business, product strategy, and thought leadership. Sam is also a practicing data scientist, a writer with multiple publications, a speaker at key industry conferences, and an active open source contributor. Outside work, he loves hiking, photography, experimenting with food, and reading.