You're reading from  Big Data Analytics with Java

Product type: Book
Published in: Jul 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781787288980
Edition: 1st Edition
Author (1)
RAJAT MEHTA

The author is a VP (Technical Architect) in technology at JP Morgan Chase in New York. He is a Sun Certified Java Developer and has worked on Java-related technologies for more than 16 years. For the past few years, his role has heavily involved using the big data stack and running analytics on it. He is also a contributor to various open source projects that are available on his GitHub repository and a frequent writer for dev magazines.

Chapter 10. Clustering and Customer Segmentation on Big Data

Up until now, we have only worked with data that was prelabeled, that is, in a supervised setting. Based on that prelabeled data, we trained our machine learning models and predicted our results. But what if the data is not labeled at all and we just get plain data? In that case, can we carry out any useful analysis at all? Extracting structure from an unlabeled dataset is an example of unsupervised learning, where the machine learning algorithm makes deductions or predictions from raw, unlabeled data. One of the most popular approaches to analyzing unlabeled data is to find groups of similar items within the dataset. This grouping has several advantages and use cases, as we will see in this chapter.

In this chapter, we will cover the following topics:

  • The concepts of clustering and types of clustering, including k-means and bisecting k-means clustering

  • Advantages and use cases of clustering

  • Customer segmentation and...

Clustering


A customer who wants to buy a phone from an online e-commerce store would generally type that word into the search box at the top of the site. As soon as the search query is submitted, the search results are displayed below, and on the left-hand side of the page the customer gets a list of categories that might be of interest, based on the search text just entered. These sub-search categories are shown in the following screenshot. How did the search engine figure out these sub-search categories from the searched text alone? This is one of the things clustering is used for. The site's search engine very likely uses some form of clustering technique to group the search results into useful sub-search categories:

As seen in the preceding screenshot, the left-hand side shows the categories (groups) that are generated once the user searches for a term such as car. The left-hand side looks quite relevant as we are seeing sub-categories for car accessories...

Customer segmentation


Customers of any store, whether offline or online (that is, e-commerce), exhibit different buying patterns. Some buy in bulk, while others buy smaller quantities but spread their transactions throughout the year. Some buy big items during festive periods such as Christmas. Figuring out these buying patterns and grouping, or segmenting, the customers accordingly is of the utmost importance to business owners, because it lays out the customers' needs and their relative importance. Owners can then market selectively to the more important customers, giving prime care to the customers that generate the most revenue for the store.
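The idea of ranking customers by the revenue they generate can be sketched in plain Java, without Spark. This is only an illustration under stated assumptions: the class name `CustomerRevenueSketch`, the `Purchase` type, and the sample values are hypothetical and do not come from the book or the dataset.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; all names and sample values are hypothetical. */
final class CustomerRevenueSketch {

    /** One purchase line: who bought, how many units, and at what unit price. */
    static final class Purchase {
        final String customerId;
        final int quantity;
        final double unitPrice;

        Purchase(String customerId, int quantity, double unitPrice) {
            this.customerId = customerId;
            this.quantity = quantity;
            this.unitPrice = unitPrice;
        }
    }

    /** Sums quantity * unitPrice for each customer. */
    static Map<String, Double> revenuePerCustomer(List<Purchase> purchases) {
        Map<String, Double> totals = new HashMap<>();
        for (Purchase p : purchases) {
            totals.merge(p.customerId, p.quantity * p.unitPrice, Double::sum);
        }
        return totals;
    }

    /** Returns customer IDs ordered from highest to lowest total revenue. */
    static List<String> rankByRevenue(Map<String, Double> totals) {
        List<String> ids = new ArrayList<>(totals.keySet());
        ids.sort(Comparator.comparing((String id) -> totals.get(id)).reversed());
        return ids;
    }

    public static void main(String[] args) {
        List<Purchase> purchases = new ArrayList<>();
        purchases.add(new Purchase("C1", 10, 2.0)); // bulk purchase of a cheap item
        purchases.add(new Purchase("C2", 1, 5.0));  // single small purchase
        purchases.add(new Purchase("C1", 2, 5.0));  // repeat purchase by C1
        Map<String, Double> totals = revenuePerCustomer(purchases);
        System.out.println(rankByRevenue(totals)); // prints [C1, C2]
    }
}
```

Total revenue is only one possible signal. A clustering approach would typically combine several such per-customer features rather than sort on a single one.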

Figuring out the buying patterns of the customers from historical data (of their purchase transactions) is easy for an online store as all the transaction data is readily available. Some approaches that people use...

Dataset


For our case study on customer segmentation using clustering, we will use a dataset from the UCI repository of datasets for a UK online retail store. This retail store has shared its data with UCI, and the dataset is freely available on their website. The data consists of the transactions that different customers made on the online retail store. The transactions were made from different countries, and the dataset is reasonably large (hundreds of thousands of rows). Let's go through the attributes of the dataset:

Data exploration


In this section, we will explore the dataset and perform some simple but useful analytics on it.

First, we will create the boilerplate code for Spark configuration and the Spark session:

SparkConf conf = ...
SparkSession session = ...

Next, we will load the dataset and find the number of rows in it:

Dataset<Row> rawData = session.read().csv("data/retail/Online_Retail.csv");
System.out.println("Number of rows --> " + rawData.count());

This will print the number of rows in the dataset as:

Number of rows --> 541909

As you can see, this is not a very small dataset but it is not big data either. Big data can run into terabytes. We have seen the number of rows, so let's look at the first few rows now.

rawData.show();

This will print the result as:

As you can see, this dataset is a list of transactions, including the name of the country from where each transaction was made. But if you look at the columns of the table, Spark has given default names (such as _c0, _c1, and so on) to the dataset columns. In order to provide a schema and better structure...

Clustering for customer segmentation


We will now build a program that uses the k-means clustering algorithm to make five clusters from our transactional dataset.

Before we crunch the data to figure out the clusters, we have made a few important assumptions and deductions regarding the data to preprocess it:

  • We are only going to cluster the data belonging to the United Kingdom, because most of the data in this dataset belongs to the United Kingdom.

  • For any missing or null values, we will simply discard that row of data. This keeps things simple, and we have a good amount of data available for analysis, so dropping a few rows should not have much impact.
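The two preprocessing rules above can be sketched in plain Java on in-memory rows. This is a minimal illustration, not the book's Spark code: the class name `PreprocessSketch`, the three-column row layout (customer ID, quantity, country), and the sample rows are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; the row layout and names are hypothetical. */
final class PreprocessSketch {

    /** Position of the country field in each row (assumed layout for this sketch). */
    static final int COUNTRY_INDEX = 2;

    /** Returns true when any field in the row is null or blank. */
    static boolean hasMissingValue(String[] row) {
        for (String field : row) {
            if (field == null || field.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    /** Keeps only rows that have no missing values and whose country is United Kingdom. */
    static List<String[]> keepCompleteUkRows(List<String[]> rows) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (!hasMissingValue(row) && "United Kingdom".equals(row[COUNTRY_INDEX])) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"A1", "6", "United Kingdom"});  // kept
        rows.add(new String[] {"A2", "24", "France"});         // dropped: other country
        rows.add(new String[] {null, "3", "United Kingdom"});  // dropped: missing customer ID
        System.out.println(keepCompleteUkRows(rows).size());   // prints 1
    }
}
```

Checking for missing values before reading the country field means a row with a null country is discarded rather than compared.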

Let's now start our program. We will first write the boilerplate code for the Spark configuration and the SparkSession:

SparkConf conf = ...
SparkSession session = ...

Next, let's load the data from the file into a dataset:

Dataset<Row> rawData = session.read().csv("data/retail/Online_Retail...

Summary


In this chapter, we learnt about clustering and saw how it helps to organize items into groups, with each group containing items that are similar to one another in some way. Clustering is an example of unsupervised learning, and several popular clustering algorithms ship by default with Apache Spark. We learnt about two clustering approaches. The first is k-means, where items that are close to each other according to a distance measure, such as Euclidean distance, are grouped together. The second is bisecting k-means, which is essentially an improvement on regular k-means clustering and combines hierarchical and k-means clustering. We also applied clustering to a sample retail dataset from UCI. In this case study, we segmented the customers of the website using clustering and tried to figure out the important customers for an online e-commerce store.
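As a reminder of what "closer to each other based on Euclidean distance" means in practice, the following plain-Java sketch shows only the assignment step of k-means: each point is assigned to its nearest centroid. It is not the full algorithm and not Spark's implementation. The class name `NearestCentroidSketch`, the sample points, and the fixed centroids are assumptions made for illustration.

```java
import java.util.Arrays;

/** Illustrative sketch of the k-means assignment step; all names are hypothetical. */
final class NearestCentroidSketch {

    /** Euclidean distance between two points of equal dimension. */
    static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Returns the index of the centroid closest to the given point. */
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double distance = euclideanDistance(point, centroids[c]);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    /** Assigns every point to its nearest centroid and returns the cluster indices. */
    static int[] assign(double[][] points, double[][] centroids) {
        int[] clusters = new int[points.length];
        for (int p = 0; p < points.length; p++) {
            clusters[p] = nearestCentroid(points[p], centroids);
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {9, 8}, {2, 0}};
        double[][] centroids = {{0, 0}, {10, 10}}; // fixed centroids, chosen by hand
        System.out.println(Arrays.toString(assign(points, centroids))); // prints [0, 1, 0]
    }
}
```

A complete k-means run would alternate this step with recomputing each centroid as the mean of its assigned points, repeating until the assignments stop changing.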

In the next...


Attributes of the online retail dataset:

Attribute name | Description
Invoice number | Invoice number; a number uniquely assigned to each transaction
Stock code     | Product (item) code; a 5-digit integral number uniquely assigned to each distinct product
Description    | Product item name
Quantity       | Quantity of items purchased in a single transaction
Invoice date   | Date of the transaction
Unit price     | Price of the item (in pounds)
Customer ID    | Unique ID of the person making the transaction
Country        | Country from where...