You're reading from Data Wrangling on AWS

Product type: Book
Published in: Jul 2023
Publisher: Packt
ISBN-13: 9781801810906
Edition: 1st Edition
Authors (3):

Navnit Shukla

Navnit Shukla is an accomplished Senior Solution Architect with a specialization in AWS analytics. With a career spanning 12 years, he has honed his expertise in databases and analytics, establishing himself as a trusted professional in the field. Currently based in Orange County, CA, Navnit's primary responsibility lies in assisting customers in building scalable, cost-effective, and secure data platforms on the AWS cloud.

Sankar M

Sankar Sundaram has been working in the IT industry since 2007, specializing in databases, data warehouses, and analytics. As a specialized Data Architect, he helps customers build and modernize data architectures, delivering secure, scalable, and performant data lake, database, and data warehouse solutions. Prior to joining AWS, he worked with multiple customers on implementing complex data architectures.

Sampat Palani

Sam Palani has over 18 years of experience as a developer, data engineer, data scientist, startup co-founder, and IT leader. He holds a master's in Business Administration with a dual specialization in Information Technology. His professional career spans five countries across the financial services, management consulting, and technology industries. He is currently a Senior Leader for Machine Learning and AI at Amazon Web Services, where he is responsible for multiple lines of business, product strategy, and thought leadership. Sam is also a practicing data scientist, a writer with multiple publications, a speaker at key industry conferences, and an active open source contributor. Outside work, he loves hiking, photography, experimenting with food, and reading.

Introducing data wrangling

To become data-driven, whether to provide value to customers or to make more informed business decisions, organizations need to collect large amounts of data from different sources, such as clickstreams, log data, transactional systems, and flat files, and store it as raw data in different data stores, such as data lakes, databases, and data warehouses. Once stored, this data needs to be cleansed, transformed, organized, and joined across sources so that it provides meaningful information to downstream applications, such as machine learning models that generate product recommendations or monitor traffic conditions. Alternatively, it can be used by business or data analysts to extract meaningful business insights:

Figure 1.1: Data pipeline

The 80-20 rule of data analysis

When organizations first collect data from different sources, it is not of much use in its raw form. It is estimated that data scientists spend about 80% of their time cleaning data, leaving only about 20% of their time for analyzing it and creating insights:

Figure 1.2: Work distribution of a data scientist

Now that we understand the basic concept of data wrangling, we'll learn why it is essential and the various benefits we get from it.

Advantages of data wrangling

If we go back to the analogy of oil, when we first extract it, it is in the form of crude oil, which is not of much use. To make it useful, it has to be sent to a refinery, where it is processed in a distillation unit. During distillation, liquids and vapors are separated into petroleum components, called fractions, according to their boiling points. Heavy fractions settle at the bottom, while light fractions rise to the top, as seen here:

Figure 1.3: Crude oil processing

The following figure showcases how oil processing correlates to the data wrangling process:

Figure 1.4: The data wrangling process

Data wrangling brings many advantages:

  • Enhanced data quality: Data wrangling helps improve the overall quality of the data. It involves identifying and handling missing values, outliers, inconsistencies, and errors. By addressing these issues, data wrangling ensures that the data used for analysis is accurate and reliable, leading to more robust and trustworthy results.
  • Improved data consistency: Raw data often comes from various sources or in different formats, resulting in inconsistencies in naming conventions, units of measurement, or data structure. Data wrangling allows you to standardize and harmonize the data, ensuring consistency across the dataset. Consistent data enables easier integration and comparison of information, facilitating effective analysis and interpretation.
  • Increased data completeness: Incomplete data can pose challenges during analysis and modeling. Data wrangling methods allow you to handle missing data by applying techniques such as imputation, where missing values are estimated or filled in based on existing information (see the cleaning and imputation sketch after this list). By dealing with missing data appropriately, data wrangling helps ensure a more complete dataset, reducing potential biases and improving the accuracy of analyses.
  • Facilitates data integration: Organizations often have data spread across multiple systems and sources, making integration a complex task. Data wrangling helps in merging and integrating data from various sources, allowing analysts to work with a unified dataset. This integration facilitates a holistic view of the data, enabling comprehensive analyses and insights that might not be possible when working with fragmented data.
  • Streamlined data transformation: Data wrangling provides the tools and techniques to transform raw data into a format suitable for analysis. This transformation includes tasks such as data normalization, aggregation, filtering, and reformatting. By streamlining these processes, data wrangling simplifies the data preparation stage, saving time and effort for analysts and enabling them to focus more on the actual analysis and interpretation of the results.
  • Enables effective feature engineering: Feature engineering involves creating new derived variables or transforming existing variables to improve the performance of machine learning models. Data wrangling provides a foundation for feature engineering by preparing the data in a way that allows for meaningful transformations. By performing tasks such as scaling, encoding categorical variables, or creating interaction terms (see the feature engineering sketch after this list), data wrangling helps derive informative features that enhance the predictive power of models.
  • Supports data exploration and visualization: Data wrangling often involves exploratory data analysis (EDA), where analysts gain insights and understand patterns in the data before formal modeling. By cleaning and preparing the data, data wrangling enables effective data exploration, helping analysts uncover relationships, identify trends, and visualize the data using charts, graphs, or other visual representations (see the EDA sketch after this list). These exploratory steps are crucial for forming hypotheses, making data-driven decisions, and communicating insights effectively.
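To make the cleaning, consistency, and imputation points above more concrete, here is a minimal pandas sketch. The dataset, column names, and the outlier threshold are hypothetical and exist only for illustration; they are not taken from the book's examples.

```python
import pandas as pd
import numpy as np

# Hypothetical raw data arriving with inconsistent conventions
raw = pd.DataFrame({
    "Customer ID": ["C001", "C002", "C003", "C004"],
    "order_amount": ["120.50", "85", None, "9999"],   # strings, a missing value, an outlier
    "Country": ["usa", "USA", "U.S.A.", "usa"],       # inconsistent naming
})

df = raw.copy()

# Improved consistency: standardize column names and categorical values
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df["country"] = df["country"].replace({"usa": "US", "USA": "US", "U.S.A.": "US"})

# Enhanced quality: enforce numeric types and treat extreme values as missing
df["order_amount"] = pd.to_numeric(df["order_amount"], errors="coerce")
df.loc[df["order_amount"] > 5000, "order_amount"] = np.nan

# Increased completeness: impute missing amounts with the median
df["order_amount"] = df["order_amount"].fillna(df["order_amount"].median())

print(df)
```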
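Similarly, here is a small, hypothetical feature engineering sketch covering z-score scaling and one-hot encoding with plain pandas; scikit-learn transformers such as StandardScaler and OneHotEncoder are a common alternative for the same tasks.

```python
import pandas as pd

# Hypothetical cleaned data ready for feature engineering
df = pd.DataFrame({
    "order_amount": [120.5, 85.0, 102.75, 310.0],
    "country": ["US", "US", "DE", "US"],
})

# Scale the numeric column to zero mean and unit variance (z-score)
df["order_amount_scaled"] = (
    (df["order_amount"] - df["order_amount"].mean()) / df["order_amount"].std()
)

# One-hot encode the categorical column
df = pd.get_dummies(df, columns=["country"], prefix="country")

print(df)
```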

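Finally, here is a quick EDA sketch of the kind of exploration and visualization described in the last point, again using made-up data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset for exploration
df = pd.DataFrame({
    "order_amount": [120.5, 85.0, 102.75, 310.0, 95.0, 150.0],
    "country": ["US", "US", "DE", "US", "DE", "US"],
})

# Summary statistics reveal distribution, spread, and potential outliers
print(df["order_amount"].describe())

# A simple bar chart of average order amount per country
df.groupby("country")["order_amount"].mean().plot(
    kind="bar", title="Average order amount by country"
)
plt.tight_layout()
plt.show()
```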
Now that we have learned about the advantages of data wrangling, let’s understand the steps involved in the data wrangling process.
