You're reading from The Data Analysis Workshop

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781839211386
Edition: 1st Edition
Authors (3):
Gururajan Govindan

Gururajan Govindan is a data scientist, intrapreneur, and trainer with more than seven years of experience working across domains such as finance and insurance. He is also an author of The Data Analysis Workshop, a book focusing on data analytics. He is well known for his expertise in data-driven decision-making and machine learning with Python.

Shubhangi Hora

Shubhangi Hora is a data scientist, Python developer, and published writer. With a background in computer science and psychology, she is particularly passionate about healthcare-related AI, including mental health. Shubhangi is also a trained musician.

Konstantin Palagachev

Konstantin Palagachev holds a Ph.D. in applied mathematics and optimization, with an interest in operations research and data analysis. He is recognized for his passion for delivering data-driven solutions and expertise in the area of urban mobility, autonomous driving, insurance, and finance. He is also a devoted coach and mentor, dedicated to sharing his knowledge and passion for data science.

8. Analyzing Online Retail II Dataset

Overview

In this chapter, you will search for and deal with missing values, outliers, and anomalies within a given dataset. You will learn how to create new columns from existing columns, conduct exploratory data analysis, and design visualizations. You will also practice summarizing the insights provided by your data. This chapter aims to guide you through various data analysis techniques pertaining to a specific dataset—the Online Retail II dataset—and therefore, a specific domain.

Introduction

In the previous chapter, we analyzed a heart disease dataset and examined the relationships between its features to gain a better understanding of the available data and derive useful insights from it.

This chapter follows a pattern similar to that of the previous chapters and guides you through data-specific analysis in a real-world domain and situation. It targets the retail industry: we will analyze data retrieved from an online retail company to observe patterns and correlations and to evaluate the business more accurately and in more depth.

Note

The dataset we are going to use has been obtained from the UCI repository of datasets and can be found at https://archive.ics.uci.edu/ml/datasets/Online+Retail+II#. To use the dataset in the exercises and activities, you can use the GitHub repo, at https://packt.live/3e7wZxs.

As you have seen in the previous chapters, transforming data into business insights...

Data Cleaning

When you work on online projects or learn from a course, the data you use is often already in perfect form: there are no missing values or outliers, and all the features are accurate and useful. In reality, though, this is almost never the case. There are often many rows of data with inconsistencies that, if used as is, will produce flawed business insights, which could be disastrous if they are actually used to make business decisions.

For example, suppose you're monitoring your shop's most and least active hours. You do this by tracking and storing information about your customers, in particular the time at which they come into the shop. You have been storing the time in 24-hour clock format.

The next day, however, another employee takes over this responsibility and starts storing the time in 12-hour clock format. You suddenly have a column of data that has been stored in two different ways, and now 8:00 can mean either AM or PM. You don't notice this...
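One way to catch an inconsistency like the one in this example is a simple sanity check on the recorded hours. The following is a minimal sketch, not the book's own method: the `visits` DataFrame, its `date` and `arrival` columns, and the sample values are all hypothetical. It relies on the assumption that a shop open in the afternoon should show at least one hour above 12 on any day recorded with the 24-hour clock.

```python
import pandas as pd

# Hypothetical visit log (assumed data): the first day was recorded on a
# 24-hour clock, the second day on a 12-hour clock without an AM/PM marker.
visits = pd.DataFrame({
    "date": ["2020-01-06"] * 3 + ["2020-01-07"] * 3,
    "arrival": ["09:15", "14:30", "20:00", "8:00", "2:45", "11:10"],
})

# Extract the hour part of each "H:MM" / "HH:MM" string as an integer.
visits["hour"] = visits["arrival"].str.split(":").str[0].astype(int)

# A day whose latest recorded hour never exceeds 12 is suspicious: it may
# have been logged in 12-hour format, making every entry ambiguous.
max_hour_per_day = visits.groupby("date")["hour"].max()
suspect_days = max_hour_per_day[max_hour_per_day <= 12].index.tolist()

print(suspect_days)
```

A check like this does not repair the data; it only tells you which days need to be verified with whoever recorded them before you draw conclusions about busy hours.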

Data Preparation and Feature Engineering

Once you have loaded and cleaned your data, you need to prepare it so that it's in a format that you can use to perform data analysis. Along with this, you need to identify features that will help you understand your data better and provide significant insights. These processes involve modifying already existing features and transforming them into new features.

For example, in the previous exercise, we saw that the dataset contains a date column consisting of day, month, and year. We can use this information to determine which months of the year were most popular for the online retail store. To do this, we need to break the date column down into separate columns such as day, month, and year.
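Breaking a date column into its parts can be sketched with the pandas `dt` accessor. This is an illustrative example on made-up rows, assuming the date column is called `InvoiceDate` and is stored as text; the new column names (`year`, `month`, and so on) are our own choice and may differ from those used in the exercise.

```python
import pandas as pd

# Small stand-in for the retail data (sample rows are assumed).
retail = pd.DataFrame({
    "InvoiceDate": [
        "2010-12-01 08:26:00",
        "2011-03-15 14:05:00",
        "2011-12-09 12:50:00",
    ],
})

# Convert the text column to datetime so that the .dt accessor is available.
retail["InvoiceDate"] = pd.to_datetime(retail["InvoiceDate"])

# Derive one new column per date component.
retail["year"] = retail["InvoiceDate"].dt.year
retail["month"] = retail["InvoiceDate"].dt.month
retail["day"] = retail["InvoiceDate"].dt.day
retail["day_of_week"] = retail["InvoiceDate"].dt.day_name()
retail["hour"] = retail["InvoiceDate"].dt.hour

print(retail)
```

With the components in separate columns, questions such as "which month had the most orders?" become simple group-and-count operations.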

When preparing data for machine learning models, categorical features must be transformed into a numerical format so that the models can learn from them. However, since we are just going to be analyzing the data, we can...
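For reference, the numerical transformation mentioned above is often done with one-hot encoding. The sketch below uses `pd.get_dummies` on a hypothetical `Country` column with made-up values; it shows only one common encoding and is not necessarily the approach the book has in mind.

```python
import pandas as pd

# Hypothetical categorical feature (sample values are assumed).
orders = pd.DataFrame({"Country": ["United Kingdom", "France", "United Kingdom"]})

# One-hot encode: each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(orders, columns=["Country"], dtype=int)

print(encoded)
```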

Data Analysis

One important thing to keep in mind while analyzing data is that you need to understand what the data is capable of telling you. Noting down some questions you think your data can answer helps you determine a path for your analysis.

Take a look at the retail DataFrame that we updated in Exercise 8.02, Preparing Our Data; it has 15 features. These features define who bought how much of what in which country at what time on what day in which month in which year. We can use combinations of these features to provide us with a lot of insights that would answer the following questions:

  1. Which customers placed the most and fewest orders?
  2. Which customers spent the most and least money?
  3. Which months were the most and least popular for this online retail store?
  4. Which dates of the month were the most and least popular for this online retail store?
  5. Which days were the most and least popular for this online retail store?
  6. Which hours of the day were most and least...
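Questions like those listed above usually translate into group-and-aggregate operations. Here is a minimal sketch on a tiny made-up DataFrame; the column names `Invoice`, `Customer ID`, `Revenue`, and `month` are assumptions for illustration and may not match the features created in the exercises.

```python
import pandas as pd

# Tiny stand-in for the retail DataFrame (all values are assumed).
retail = pd.DataFrame({
    "Invoice": ["A1", "A1", "A2", "A3", "A4"],
    "Customer ID": [101, 101, 102, 101, 103],
    "Revenue": [20.0, 5.0, 50.0, 10.0, 7.5],
    "month": [1, 1, 1, 1, 2],
})

# Questions 1 and 2: distinct orders and total spend per customer.
orders_per_customer = retail.groupby("Customer ID")["Invoice"].nunique()
spend_per_customer = retail.groupby("Customer ID")["Revenue"].sum()

# Question 3: distinct orders per month.
orders_per_month = retail.groupby("month")["Invoice"].nunique()

print("Most orders:", orders_per_customer.idxmax())
print("Highest spend:", spend_per_customer.idxmax())
print("Lowest spend:", spend_per_customer.idxmin())
print("Most popular month:", orders_per_month.idxmax())
```

Counting distinct invoices rather than rows matters here because one order can span several rows, one per product purchased.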

Summary

In this chapter, we applied various data cleaning, preparation, and analysis techniques to the Online Retail II dataset and observed the importance of these processes. We learned how to decide whether to keep or delete outlier instances and how to break one feature into several features to enhance the analysis. Lastly, we learned how to ask our data the right questions and manipulate it to provide the answers, which is the definition of successful data analysis.

In the following chapter, we will follow a similar path with a different dataset and, thus, a new domain—that of appliance energy consumption. The techniques used depend on the data we have, and so while some of the actions might be repeated, some will be new.
