
You're reading from  Data Wrangling with R

Product type: Book
Published in: Feb 2023
Publisher: Packt
ISBN-13: 9781803235400
Edition: 1st Edition
Author: Gustavo R Santos

Gustavo R Santos has worked in the technology industry for 13 years, improving processes, analyzing datasets, and creating dashboards. Since 2020, he has worked as a Data Scientist in the retail industry, wrangling, analyzing, visualizing, and modeling data with modern tools such as R, Python, and Databricks. Gustavo also lectures occasionally on Data Science concepts at an online school. He has a background in Marketing, is certified as a Data Scientist by the Data Science Academy Brazil, and is pursuing a specialist MBA in Data Science at the University of São Paulo.

Summary

In this chapter, we learned a little about the history of data wrangling and became familiar with its definition. Data wrangling, also called data munging, is every task performed to transform or enhance data so that it is ready for analysis and modeling.

We also discussed why it is important to wrangle data before modeling it. A model is a simplified representation of reality, and an algorithm is like a student that needs to understand that reality to give us the best answer about the subject matter. If we teach this student with bad data, we cannot expect to receive a good answer. A model is only as good as its input data.

Later in the chapter, we reviewed the benefits of data wrangling and saw that improving the quality of our data leads to faster results and better outcomes.

In the final sections, we reviewed the basic steps of data wrangling and learned more about three of the most commonly used frameworks for Data Science: KDD, SEMMA, and CRISP-DM. I recommend reading more about them to get a holistic view of the life cycle of a Data Science project.

It is important to note how all three frameworks emphasize selecting a representative dataset or subset of data. A nice example is given by Aurélien Géron (Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd edition, 2019, pp. 32-33). Suppose you want to build an app that takes pictures of flowers and then recognizes and classifies them. You could go to the internet and download thousands of pictures; however, they will probably not be representative of the kind of pictures that your model will receive from the app users. As a result, the model could underperform. This example illustrates the garbage in, garbage out idea: if you don’t explore and understand your data thoroughly, you won’t know whether it is good enough for modeling.

The frameworks can lead the way, like a map, as you explore, understand, and wrangle the data to make it ready for modeling, reducing the risk of a frustrating outcome.

In the next chapter, let’s get our hands on R and start coding.
