R Data Mining


Product type Book
Published in Nov 2017
Publisher Packt
ISBN-13 9781787124462
Pages 442 pages
Edition 1st Edition


Chapter 5. How to Address a Data Mining Problem – Data Cleaning and Validation

This chapter is where our real journey begins (Finally! I can hear you exclaiming). We are now familiar enough with R, the data mining process, and its architecture to tackle a real problem.

When I say real problem, I actually mean real: we are going to face something that actually happened and that puzzled a non-trivial number of people in a real company. Of course, we are going to use randomized data and fictitious names here; nevertheless, this will not remove any pathos from the problem. We are shortly going to be immersed in a mystery that actually came up, and we will need to solve it by employing data mining techniques.

I know you may be thinking: OK, don't make it too serious, isn't this something that has already been solved? You would be right, but what if something similar pops up for you some day in the future? What would you do? The mystery we are going to face will not...

On a quiet day


It's a nice day today. You get into your 6th-floor office at Hippalus, Wholesaler Inc, grab a coffee, and sit down at your desk. Suddenly, an email pops up in your inbox: Urgent - profits drop. As soon as the meaning of those three words is clear to you, you realize that everybody in the office has received the same email and they are starting to chat about it. There will be a meeting with the head of the office in 15 minutes. When you get into the meeting room, still missing the coffee you left at your desk, a big chart is projected on the wall:

OK, you have just read Chapter 2, A First Primer on Data Mining - Analysing Your Banking Account Data, so you know it is a line chart and that we are looking at a time series. No title is included, though, so all we can do is observe the incredible drop at the end of the series. Whoa, those are our quarterly cash flows, says the astonished colleague sitting next to you.

How did we get that low? replies the one sitting on your other side. This...

Data cleaning


First of all, we need to actually import the data into our R environment (oh yeah, I was taking for granted that we are going to use R for this, hope you do not mind).

We can leverage our old friend, the rio package, running it on all three of the files we were provided, once we have unzipped them. Take a minute to figure out if you can remember the function needed to perform the task.

Done? OK, find the solution as follows:

library(rio)  # provides import(), which picks a reader from the file extension

cash_flow_report <- import("cash_flow.csv")
customer_list    <- import("customer_list.txt")
stored_data      <- import("stored_data.rds")
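
One function can cover all three files because the reader is chosen from the file extension. The following is a minimal base R sketch of that idea, not rio's actual implementation: import_by_extension is a hypothetical helper, the files are made-up temporary ones, and treating .txt as tab-separated is an assumption made only for this illustration.

```r
# Hypothetical helper: pick a base R reader from the file extension.
import_by_extension <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv = read.csv(path, stringsAsFactors = FALSE),
    txt = read.delim(path, stringsAsFactors = FALSE),  # assumes tab-separated text
    rds = readRDS(path),
    stop("unsupported extension: ", ext)
  )
}

# Made-up data written to temporary files, only to exercise the helper.
toy <- data.frame(id = 1:3, value = c(10.5, 20, 30.25))
csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")
rds_path <- tempfile(fileext = ".rds")
write.csv(toy, csv_path, row.names = FALSE)
write.table(toy, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)
saveRDS(toy, rds_path)

# The same call works for every file type.
from_csv <- import_by_extension(csv_path)
from_txt <- import_by_extension(txt_path)
from_rds <- import_by_extension(rds_path)
```

After importing, a quick str() on each object is a cheap way to confirm that the column types came through as expected.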

Tidy data

Before actually looking at our data, we should define how we want it to be arranged in order to allow for future manipulation and analyses. Currently, one of the most widely adopted frameworks for data arrangement and handling is the so-called tidy data framework. The concepts behind this framework were originally defined by Hadley Wickham, and nowadays come paired with a couple of R packages that help to apply it...
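
To make the idea concrete, here is a minimal sketch of moving a table from a wide layout to a tidy one. It uses base R's reshape() rather than the packages mentioned above, and the table, its column names, and its values are all made up for illustration.

```r
# Made-up wide table: one column per month, so the month is hidden in the
# column names instead of being stored as a value.
wide <- data.frame(
  city     = c("london", "moscow"),
  temp_jan = c(5, -6),
  temp_jul = c(19, 19),
  stringsAsFactors = FALSE
)

# Tidy (long) layout: every row is one city-month record and every column
# is one attribute (city, month, temperature).
long <- reshape(
  wide,
  direction = "long",
  varying   = c("temp_jan", "temp_jul"),
  v.names   = "temperature",
  timevar   = "month",
  times     = c("jan", "jul"),
  idvar     = "city"
)
rownames(long) <- NULL  # drop the composite row names reshape() creates
```

In the long table, adding a new month means adding rows rather than columns, which is what makes later filtering, grouping, and joining straightforward.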

Further references


  • Hadley Wickham's paper, Tidy data, introducing the framework: http://vita.had.co.nz/papers/tidy-data.pdf.
  • Shakespeare's Romeo and Juliet, just because it is so well written. I would suggest you read one of the best editions from the Oxford University Press.
  • A cheat sheet for the other dplyr join functions, which will help you gain a wider view of the possibilities available within dplyr for merging tables: http://stat545.com/bit001_dplyr-cheatsheet.html.

Summary


Our journey has begun within this chapter. Leveraging the knowledge gained within previous chapters, we have started facing a challenge that suddenly appeared: discovering the origin of a heavy loss our company is suffering.

We received some dirty data to be cleaned, and this was the occasion to learn about data cleaning and tidy data. The first is a set of activities to make our data fit the analyses' needs, and the second is a conceptual framework that can be employed to define which structure our data should have to fit those needs. We also learned how to check compliance with the three main rules of tidy data (every row is a record, every column is an attribute, and every table is an entity).

We also learned about data quality and data validation, discovering which metrics define the level of quality of our data and a set of checks that can be employed to assess this quality and spot any needed improvements.
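
As a reminder of what such checks can look like, here is a minimal base R sketch of two simple ones. The choice of metrics (completeness per column and a count of duplicated rows) and the table itself are assumptions made for illustration, not the chapter's exact list.

```r
# Made-up table with one missing value and one fully duplicated row.
records <- data.frame(
  customer_id = c(1, 2, 2, 4),
  amount      = c(100, 80, 80, NA)
)

# Completeness: share of non-missing values in each column.
completeness <- colMeans(!is.na(records))

# Uniqueness: number of rows that repeat an earlier row exactly.
n_duplicates <- sum(duplicated(records))
```

Here completeness is 1 for customer_id and 0.75 for amount, and n_duplicates is 1, which flags both the missing amount and the repeated record for a closer look.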

We applied all these concepts to our data, making our data through the...
