You're reading from  Big Data Analytics with Java

Product type: Book
Published in: Jul 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781787288980
Edition: 1st Edition
Author (1)
RAJAT MEHTA

The author is a VP (Technical Architect) in technology at JP Morgan Chase in New York. He is a Sun Certified Java Developer and has worked on Java-related technologies for more than 16 years. For the past few years, his role has heavily involved using the big data stack and running analytics on it. He also contributes to various open source projects, available on his GitHub repository, and writes frequently for developer magazines.

Clustering for customer segmentation


We will now build a program that uses the k-means clustering algorithm to group our transactional dataset into five clusters.
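To make the idea behind k-means concrete, here is a minimal plain-Java sketch of the underlying algorithm (Lloyd's iteration). It is not the Spark-based program built in this section: the class name `KMeansSketch`, the two-feature sample points, and the hand-picked starting centroids are all assumptions made for illustration, and it uses two clusters instead of five to keep the example small.

```java
import java.util.Arrays;

// Illustrative only: a tiny k-means (Lloyd's algorithm) on in-memory points.
// All names and sample values here are hypothetical, not from the book's code.
class KMeansSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Alternates between assigning each point to its nearest centroid and
     * moving each centroid to the mean of its assigned points. Stops when no
     * assignment changes or after maxIterations. Returns the cluster index
     * of every point; the initial centroids are copied, not modified.
     */
    static int[] cluster(double[][] points, double[][] initialCentroids, int maxIterations) {
        int k = initialCentroids.length;
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initialCentroids[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its closest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centroids);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // leave an empty cluster's centroid where it is
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {1, 0.5}, {8, 8}, {9, 9}, {8.5, 9.5}};
        double[][] start = {{0, 0}, {10, 10}};
        System.out.println(Arrays.toString(cluster(points, start, 20)));
        // prints [0, 0, 0, 1, 1, 1]
    }
}
```

The result depends on the starting centroids, which is why library implementations usually choose them with a randomized or seeded strategy rather than by hand as done here.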

Before we compute the clusters, we make a few simplifying assumptions and decisions about how to preprocess the data:

  • We will cluster only the data belonging to the United Kingdom, because most of the rows in this dataset come from the United Kingdom.

  • We will discard any row that contains a missing or null value. This keeps things simple, and because we have a good amount of data available for analysis, dropping a few rows should not have much impact.
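The two preprocessing rules above can be sketched in plain Java, independent of Spark. This is only an illustration under stated assumptions: the class `PreprocessSketch`, the three-column row layout, and the position of the country column are hypothetical and do not reflect the actual layout of the dataset file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: the row layout (id, amount, country) is an assumption.
class PreprocessSketch {

    /** True when every field in the row is present (non-null and non-blank). */
    static boolean isComplete(String[] row) {
        return Arrays.stream(row).allMatch(v -> v != null && !v.trim().isEmpty());
    }

    /** Keeps only complete rows whose country column equals the given country. */
    static List<String[]> preprocess(List<String[]> rows, int countryIndex, String country) {
        return rows.stream()
                .filter(PreprocessSketch::isComplete)              // rule 2: drop rows with missing values
                .filter(row -> country.equals(row[countryIndex]))  // rule 1: keep one country only
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"c1", "12.50", "United Kingdom"},
                new String[]{"c2", null, "United Kingdom"},   // dropped: missing value
                new String[]{"c3", "7.00", "France"},         // dropped: other country
                new String[]{"c4", "3.25", "United Kingdom"});
        for (String[] row : preprocess(rows, 2, "United Kingdom")) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Running the completeness check first means the country comparison never sees a null field.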

Let's now start our program. We first write the boilerplate code that creates the Spark configuration and the SparkSession:

SparkConf conf = ...
SparkSession session = ...

Next, let's load the data from the file into a dataset:

Dataset<Row> rawData = session.read().csv("data/retail/Online_Retail...