You're reading from Data Wrangling on AWS

Product type: Book
Published in: Jul 2023
Publisher: Packt
ISBN-13: 9781801810906
Edition: 1st Edition
Authors (3):
Navnit Shukla

Navnit Shukla is an accomplished Senior Solution Architect with a specialization in AWS analytics. With an impressive career spanning 12 years, he has honed his expertise in databases and analytics, establishing himself as a trusted professional in the field. Currently based in Orange County, CA, Navnit's primary responsibility lies in assisting customers in building scalable, cost-effective, and secure data platforms on the AWS cloud.

Sankar M

Sankar Sundaram has been working in the IT industry since 2007, specializing in databases, data warehouses, and analytics. As a specialized Data Architect, he helps customers build and modernize data architectures and helps them build secure, scalable, and performant data lake, database, and data warehouse solutions. Prior to joining AWS, he worked with multiple customers on implementing complex data architectures.

Sampat Palani

Sam Palani has over 18 years of experience as a developer, data engineer, data scientist, startup cofounder, and IT leader. He holds a master's in Business Administration with a dual specialization in Information Technology. His professional career spans 5 countries and the financial services, management consulting, and technology industries. He is currently Sr Leader for Machine Learning and AI at Amazon Web Services, where he is responsible for multiple lines of the business, product strategy, and thought leadership. Sam is also a practicing data scientist, a writer with multiple publications, a speaker at key industry conferences, and an active open source contributor. Outside work, he loves hiking, photography, experimenting with food, and reading.

Best practices for data wrangling

Many approaches and tools are available for data wrangling, and the right choice depends on how the wrangling is performed and by whom. For example, if you are working on real-time use cases such as product recommendations or fraud detection, your choice of tool and process will differ considerably from what you would use to build a business intelligence (BI) dashboard that shows sales numbers.

Regardless of the use case you are looking to solve, some standard best practices apply in every case and will make your job as a data wrangler easier.

Identifying the business use case

It’s recommended that you decide which service or tool to use for data wrangling before you write a single line of code. Identifying the business use case is essential here, as it sets the stage for the data wrangling process and makes it easier to identify the services you need. For example, suppose your use case is analyzing HR data for a small organization: you just need to concatenate a few columns, remove a few columns, remove duplicates, remove NULL values, and so on from a small dataset that contains 10,000 records, and only a few users will access the wrangled data. In that case, you don’t need to invest a lot of money in a sophisticated data wrangling tool; you can simply use Excel sheets for your work.
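To make the small-dataset operations above concrete, here is a minimal sketch in plain Python of the same steps you might otherwise do in a spreadsheet. The records and the field names (first_name, last_name, dept, badge_color) are hypothetical and exist only for illustration.

```python
# Hypothetical HR records; all field names and values are illustrative assumptions.
raw_rows = [
    {"first_name": "Ana", "last_name": "Lopez", "dept": "HR", "badge_color": "blue"},
    {"first_name": "Ana", "last_name": "Lopez", "dept": "HR", "badge_color": "blue"},  # duplicate
    {"first_name": "Raj", "last_name": "Patel", "dept": None, "badge_color": "red"},   # NULL value
    {"first_name": "Mei", "last_name": "Chen", "dept": "Finance", "badge_color": "green"},
]


def wrangle(rows):
    """Concatenate name columns, drop an unneeded column, and remove NULLs and duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Remove rows that contain NULL (None) values.
        if any(value is None for value in row.values()):
            continue
        record = {
            # Concatenate two columns into one.
            "full_name": f"{row['first_name']} {row['last_name']}",
            "dept": row["dept"],
            # The badge_color column is intentionally left out (column removal).
        }
        # Remove duplicates by remembering every record already emitted.
        key = tuple(record.items())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


cleaned_rows = wrangle(raw_rows)
for row in cleaned_rows:
    print(row)
```

At this scale, a few lines of script or a spreadsheet do the job equally well, which is the point of the example: the use case, not the tool, drives the choice.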

However, suppose your business use case is processing claims data that you receive from different partners. You need to work with semi-structured files such as JSON or XML, extract only a few fields such as the claim ID and customer information, and perform complex data wrangling operations such as joins, pattern matching with regex, and so on. In that case, you should look to write scripts or subscribe to an enterprise-grade tool for your work.
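As a rough illustration of the scripted route, the following sketch parses claims from a JSON payload and an XML payload, keeps only the claim ID and customer reference, filters IDs with a regex, and joins against a customer lookup. The payloads, the field and element names, the CLM-#### ID format, and the customers table are all assumptions made up for this example, not a real partner format.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical partner payloads; structure and names are assumptions.
json_payload = '[{"claim_id": "CLM-1001", "customer_id": "C1", "notes": "windshield"}]'
xml_payload = """
<claims>
  <claim><id>CLM-1002</id><customer>C2</customer><notes>hail damage</notes></claim>
  <claim><id>BAD-ID</id><customer>C1</customer><notes>malformed id</notes></claim>
</claims>
"""
# Hypothetical customer reference data to join against.
customers = {"C1": "Ana Lopez", "C2": "Mei Chen"}

# Assumed claim ID format: "CLM-" followed by four digits.
CLAIM_ID_PATTERN = re.compile(r"CLM-\d{4}")


def extract_claims(json_text, xml_text):
    """Pull only the claim ID and customer ID out of both payloads."""
    claims = []
    for item in json.loads(json_text):
        claims.append({"claim_id": item["claim_id"], "customer_id": item["customer_id"]})
    root = ET.fromstring(xml_text.strip())
    for node in root.findall("claim"):
        claims.append({"claim_id": node.findtext("id"), "customer_id": node.findtext("customer")})
    return claims


def wrangle_claims(claims, customer_lookup):
    """Keep claims with a well-formed ID and join them to customer names."""
    result = []
    for claim in claims:
        # Regex check: skip claims whose ID is missing or malformed.
        if not CLAIM_ID_PATTERN.fullmatch(claim["claim_id"] or ""):
            continue
        # Inner join: skip claims with no matching customer.
        name = customer_lookup.get(claim["customer_id"])
        if name is None:
            continue
        result.append({"claim_id": claim["claim_id"], "customer_name": name})
    return result


wrangled = wrangle_claims(extract_claims(json_payload, xml_payload), customers)
for row in wrangled:
    print(row)
```

Once the number of partners, formats, and rules grows beyond what a script like this can comfortably handle, an enterprise-grade tool becomes the more practical option.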

Identifying the data source and bringing the right data

After identifying the business use case, it is important to identify which data sources are required to solve it. Knowing the sources will help you choose the services needed to bring in the data, how frequently to ingest it, and where to store it. For example, if you are looking to build a credit card fraud detection solution, you need to bring in credit card transaction data in real time; cleaning and processing the data must also happen in real time, and machine learning inference needs to run on the real-time data.

Similarly, if you are building a sales dashboard, you may need to bring in data from a CRM system such as Salesforce or a transactional datastore such as Oracle, Microsoft SQL Server, and so on.

After identifying the right data sources, it is important to bring in the right data from them, as this will help you solve the business use case and make the data wrangling process easier.

Identifying your audience

When you perform data wrangling, one important aspect is to identify your audience. Knowing your audience will help you identify what kind of data they are looking to consume. For example, marketing teams may have different data wrangling requirements compared to data science teams or business executives.

This will also give you an idea of where to deliver the data. For example, a data science team may need data in an object store such as Amazon S3, business analysts may need data in flat files such as CSV, BI developers may need data in a transactional data store, and business users may need data in applications.
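One way to picture this is a small sketch that serializes the same wrangled records differently depending on who will consume them. The audience names, the sample records, and the choice of JSON Lines for a data science team are assumptions for illustration; the sketch only produces text and does not upload anything to a store such as Amazon S3.

```python
import csv
import io
import json

# Hypothetical wrangled records.
records = [
    {"region": "West", "sales": 1200},
    {"region": "East", "sales": 950},
]


def to_csv(rows):
    """Flat file output, for example for business analysts."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json_lines(rows):
    """One JSON document per line, a common layout for files kept in an object store."""
    return "\n".join(json.dumps(row) for row in rows) + "\n"


# Assumed mapping from audience to output format.
FORMATTERS = {
    "business_analyst": to_csv,
    "data_science": to_json_lines,
}


def export_for(audience, rows):
    """Serialize rows in the format the given audience expects."""
    return FORMATTERS[audience](rows)


csv_text = export_for("business_analyst", records)
jsonl_text = export_for("data_science", records)
print(csv_text)
print(jsonl_text)
```

Keeping the audience-to-format mapping in one place makes it straightforward to add another consumer later without touching the wrangling logic itself.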

With that, we have covered the best practices of data wrangling. Next, we will explore the different options that are available within AWS to perform data wrangling.
