Chapter 2. Clustering

This chapter explains the clustering technique in machine learning and its implementation using Apache Mahout.

The K-Means clustering algorithm is explained in detail with both Java and command-line examples (sequential and parallel execution), and other important clustering algorithms, such as Fuzzy K-Means, canopy clustering, and spectral K-Means, are also explored.

In this chapter, we will cover the following topics:

  • Unsupervised learning and clustering

  • Applications of clustering

  • Types of clustering

  • K-Means clustering

  • K-Means clustering with MapReduce

  • Other clustering algorithms

  • Text clustering

  • Optimizing clustering performance

Unsupervised learning and clustering


Information is a key driver for any type of organization. However, with the rapid growth in the volume of data, valuable information may remain hidden and go unnoticed due to the lack of effective data processing and analysis mechanisms.

Clustering is an unsupervised learning mechanism that can find hidden patterns and structures in data by grouping data points that are similar to each other. No prelabeling is required, so you can organize data using clustering with little or no human intervention.

For example, let's say you are given a collection of balls of different sizes without any category labels, such as big and small, attached to them; you should be able to categorize them using clustering by considering their attributes, such as radius and weight, for similarity.

In this chapter, you will learn how to use Apache Mahout to perform clustering using different algorithms.

Applications of clustering


Clustering has many applications in different domains, such as biology, business, and information retrieval. A few of them are described in the following sections.

Computer vision and image processing

Clustering techniques are widely used in the computer vision and image processing domain. Clustering is used for image segmentation in medical image processing for computer-aided diagnosis (CAD). One specific area is breast cancer detection.

In breast cancer detection, a mammogram is clustered into several regions for further analysis. The regions of interest showing signs of breast cancer in the mammogram can be identified using the K-Means algorithm, which is explained later in this chapter.

Image features such as pixels, colors, intensity, and texture are used during clustering.

Types of clustering


Clustering can be divided into different categories based on different criteria.

Hard clustering versus soft clustering

Clustering techniques can be divided into hard clustering and soft clustering based on the cluster's membership.

In hard clustering, a given data point in n-dimensional space only belongs to one cluster. This is also known as exclusive clustering. The K-Means clustering mechanism is an example of hard clustering.

A given data point can belong to more than one cluster in soft clustering. This is also known as overlapping clustering. The Fuzzy K-Means algorithm is a good example of soft clustering. A visual representation of the difference between hard clustering and soft clustering is given in the following figure.

Flat clustering versus hierarchical clustering

In hierarchical clustering, a hierarchy of clusters is built using the top-down (divisive) or bottom-up (agglomerative) approach. This is more informative and accurate than flat clustering, which is...

K-Means clustering


K-Means clustering is a simple and fast clustering algorithm that has been widely adopted in many problem domains. In this chapter, we will explain the K-Means algorithm in detail, as it provides the basis for the other algorithms. K-Means clustering partitions data points into k clusters (represented by cluster centroids) by minimizing the distance from each data point to its cluster centroid.

Let's consider a simple scenario where we need to cluster people based on their size (height and weight are the selected attributes) into different clusters, shown as different colors.

We can plot this problem in two-dimensional space and solve it using the K-Means algorithm.
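To make the algorithm concrete before we turn to Mahout, the following is a minimal, self-contained K-Means sketch in plain Java (not Mahout's implementation) for two-dimensional points such as height and weight; the data values, k = 2, and the iteration limit are hypothetical choices made purely for illustration.

    import java.util.Arrays;
    import java.util.Random;

    public class SimpleKMeans {

        public static void main(String[] args) {
            // Each row is a data point: {height in cm, weight in kg} (hypothetical values)
            double[][] points = {
                {160, 55}, {162, 58}, {165, 60},   // smaller people
                {180, 85}, {183, 90}, {185, 95}    // larger people
            };
            int k = 2;                 // number of clusters
            int maxIterations = 10;
            Random random = new Random(42);

            // Step 1: pick k random data points as the initial centroids
            double[][] centroids = new double[k][];
            for (int c = 0; c < k; c++) {
                centroids[c] = points[random.nextInt(points.length)].clone();
            }

            int[] assignments = new int[points.length];
            for (int iteration = 0; iteration < maxIterations; iteration++) {
                // Step 2: assign every point to its nearest centroid (Euclidean distance)
                for (int p = 0; p < points.length; p++) {
                    double best = Double.MAX_VALUE;
                    for (int c = 0; c < k; c++) {
                        double dx = points[p][0] - centroids[c][0];
                        double dy = points[p][1] - centroids[c][1];
                        double distance = Math.sqrt(dx * dx + dy * dy);
                        if (distance < best) {
                            best = distance;
                            assignments[p] = c;
                        }
                    }
                }
                // Step 3: move each centroid to the mean of the points assigned to it
                for (int c = 0; c < k; c++) {
                    double sumHeight = 0, sumWeight = 0;
                    int count = 0;
                    for (int p = 0; p < points.length; p++) {
                        if (assignments[p] == c) {
                            sumHeight += points[p][0];
                            sumWeight += points[p][1];
                            count++;
                        }
                    }
                    if (count > 0) {
                        centroids[c][0] = sumHeight / count;
                        centroids[c][1] = sumWeight / count;
                    }
                }
            }

            System.out.println("Cluster assignments: " + Arrays.toString(assignments));
            System.out.println("Final centroids: " + Arrays.deepToString(centroids));
        }
    }

After a few iterations, the two centroids settle near the means of the smaller and larger groups, which is exactly the behavior the Mahout implementation provides at scale.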

Getting your hands dirty!

Let's move on to a real implementation of the K-Means algorithm using Apache Mahout. The following are the different ways in which you can run algorithms in Apache Mahout:

  • Sequential

  • MapReduce

You can execute the algorithms using a command line (by calling the correct bin...
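As a rough sketch of what the Java route looks like, the following uses KMeansDriver and RandomSeedGenerator. The exact method signatures vary between Mahout releases, so treat the arguments and the input/output paths below as assumptions to check against the version you are using; the input is expected to be a SequenceFile of VectorWritable vectors.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;
    import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

    public class KMeansJob {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path("people/vectors");                    // hypothetical input path
            Path initialClusters = new Path("people/initial-clusters"); // hypothetical path
            Path output = new Path("people/output");                    // hypothetical output path

            // Pick k = 2 random points as the initial cluster centers
            RandomSeedGenerator.buildRandom(conf, input, initialClusters, 2,
                    new EuclideanDistanceMeasure());

            // Run K-Means: convergence delta 0.01, at most 10 iterations,
            // classify the points once the centroids converge (true),
            // and run as a MapReduce job rather than sequentially (last flag false)
            KMeansDriver.run(conf, input, initialClusters, output,
                    0.01, 10, true, 0.0, false);
        }
    }

Switching the final flag to true runs the same algorithm sequentially in a single JVM, which is convenient for small datasets and local experimentation.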

Distance measure


The clustering problem is based on evaluating the distance between data points. The distance measure is an indicator of how similar the data points are. For any clustering algorithm, you need to decide on the appropriate distance measure for your context. Essentially, the distance measure matters more for accuracy than the number of clusters.

Further, the criteria for choosing the right distance measure depend on the application domain and the dataset, so it is important to understand the different distance measures available in Apache Mahout. A few important distance measures are explained in the following section, each visualized in two dimensions.

The Euclidean distance assesses all the features equally, so it is not suitable if the magnitudes of the possible values vary drastically from one feature to another:

Euclidean distance

Class: org.apache.mahout.common.distance.EuclideanDistanceMeasure

Formula: d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)

Squared...
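As a quick illustration of how a distance measure is used programmatically, the following sketch computes the Euclidean distance between two feature vectors using the EuclideanDistanceMeasure class listed above together with DenseVector from mahout-math; the feature values are hypothetical.

    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class DistanceExample {

        public static void main(String[] args) {
            // Two people represented as {height in cm, weight in kg} (hypothetical values)
            Vector personA = new DenseVector(new double[]{160, 55});
            Vector personB = new DenseVector(new double[]{183, 90});

            EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
            // sqrt((160 - 183)^2 + (55 - 90)^2) ≈ 41.88
            System.out.println("Euclidean distance: " + measure.distance(personA, personB));
        }
    }

The same distance(Vector, Vector) call works for any of Mahout's distance measure classes, which is why the measure can simply be swapped when a different notion of similarity suits the data better.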

K-Means clustering with MapReduce


The key strength of Apache Mahout lies in its ability to scale. This is achieved by implementing machine learning algorithms according to the MapReduce programming paradigm.

If your dataset is small and fits into memory, then you can run Mahout in local mode. If your dataset grows to the point where it no longer fits into memory, then you should consider moving the computation to a Hadoop cluster. A complete guide to Hadoop installation is given in Chapter 5, Apache Mahout in Production.

In this section, we will explain how the K-Means algorithm is implemented in Apache Mahout in a scalable manner.

However, please note that it is not mandatory for you to thoroughly understand the MapReduce concept in order to use the algorithms in your applications. So, you can proceed with this section only if you are interested in understanding the internals.

Let's continue with the previous people size example, with height and weight as features. The data distribution...
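To give a feel for how a single K-Means iteration decomposes into map and reduce phases, here is a conceptual Hadoop-style sketch. It is not Mahout's actual mapper and reducer code (Mahout uses its own classes and a richer cluster representation), so the class names, key types, and centroid-loading details are illustrative only.

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class KMeansIterationSketch {

        // Map phase: each mapper reads a split of the input vectors, finds the nearest
        // of the current centroids (produced by the previous iteration), and emits
        // (centroid id, vector).
        public static class AssignMapper
                extends Mapper<LongWritable, VectorWritable, IntWritable, VectorWritable> {

            private List<Vector> centroids; // loaded in setup() from the previous iteration (omitted here)
            private final EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();

            @Override
            protected void map(LongWritable key, VectorWritable value, Context context)
                    throws IOException, InterruptedException {
                Vector point = value.get();
                int nearest = 0;
                double best = Double.MAX_VALUE;
                for (int i = 0; i < centroids.size(); i++) {
                    double d = measure.distance(point, centroids.get(i));
                    if (d < best) {
                        best = d;
                        nearest = i;
                    }
                }
                context.write(new IntWritable(nearest), value);
            }
        }

        // Reduce phase: each reducer receives every vector assigned to one centroid
        // and emits the new centroid as their mean. The driver repeats the job until
        // the centroids move less than the convergence delta or the maximum number
        // of iterations is reached.
        public static class RecomputeReducer
                extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

            @Override
            protected void reduce(IntWritable clusterId, Iterable<VectorWritable> points, Context context)
                    throws IOException, InterruptedException {
                Vector sum = null;
                int count = 0;
                for (VectorWritable vw : points) {
                    sum = (sum == null) ? vw.get().clone() : sum.plus(vw.get());
                    count++;
                }
                context.write(clusterId, new VectorWritable(sum.divide(count)));
            }
        }
    }

Because each mapper only needs the current centroids and its own split of the data, the assignment step parallelizes across the cluster, which is what gives the MapReduce implementation its scalability.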

Additional clustering algorithms


The K-Means algorithm is a simple and fast algorithm for clustering. However, it has its own limitations in certain scenarios, so here we will explain the other clustering algorithms that are available in Apache Mahout.

Canopy clustering

The accuracy of the K-Means algorithm depends on the number of clusters (K) and the initial cluster centroids, which are generated randomly.

K-Means uses org.apache.mahout.clustering.kmeans.RandomSeedGenerator to determine the initial clusters randomly. However, with this approach, there is no guarantee about the time to converge, so a large dataset might take a long time to converge. Sometimes, premature convergence may occur due to the inability to escape a local optimum.

As a solution, canopy clustering is used as an initial step for K-Means clustering to determine the initial centroids (instead of picking them randomly). This speeds up the clustering process for the K-Means algorithm and provides more accurate...
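For intuition, the following is a minimal plain-Java sketch of the canopy idea rather than Mahout's CanopyDriver API: a remaining candidate point starts a new canopy, points within the loose threshold T1 join that canopy (and may join others), and points within the tight threshold T2 are removed from further consideration. The threshold values and data points are hypothetical; the resulting canopy centers can then seed K-Means.

    import java.util.ArrayList;
    import java.util.List;

    public class CanopySketch {

        static double distance(double[] a, double[] b) {
            double dx = a[0] - b[0], dy = a[1] - b[1];
            return Math.sqrt(dx * dx + dy * dy);
        }

        public static void main(String[] args) {
            double t1 = 20.0;  // loose threshold (hypothetical), t1 > t2
            double t2 = 8.0;   // tight threshold (hypothetical)

            List<double[]> candidates = new ArrayList<>(List.of(
                    new double[]{160, 55}, new double[]{162, 58}, new double[]{165, 60},
                    new double[]{180, 85}, new double[]{183, 90}, new double[]{185, 95}));

            List<double[]> canopyCenters = new ArrayList<>();
            while (!candidates.isEmpty()) {
                // Any remaining candidate becomes the center of a new canopy
                double[] center = candidates.remove(0);
                canopyCenters.add(center);

                // Points within t1 belong to this canopy (they may belong to other canopies too)
                List<double[]> members = new ArrayList<>();
                for (double[] p : candidates) {
                    if (distance(p, center) < t1) {
                        members.add(p);
                    }
                }

                // Points within t2 are strongly covered and removed from further consideration
                candidates.removeIf(p -> distance(p, center) < t2);

                System.out.println("Canopy with " + (members.size() + 1) + " point(s)");
            }

            // The canopy centers can be used as the initial centroids (and count) for K-Means
            System.out.println("Number of canopies (initial k): " + canopyCenters.size());
        }
    }

Because the cheap threshold tests are applied in a single pass, canopy generation is fast, and it removes the need to guess K up front: the number of canopies becomes the number of initial centroids.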

Text clustering


Text clustering is a widely used application of clustering in areas such as records management systems, search, and business intelligence.

The vector space model and TF-IDF

In text clustering, the terms of the documents are considered as features. The vector space model is an algebraic model that maps the terms in a document into n-dimensional linear space.

However, we need to represent the textual information (terms) numerically and create feature vectors from these numerical values in order to evaluate the similarity between data points.

Each dimension of the feature vector represents a separate term. If a particular term is present in the document, then the vector value is set using the Term Frequency (TF) or Term Frequency-Inverse Document Frequency (TF-IDF) calculation. TF indicates how frequently the term appears in a given document. TF-IDF is an improved version of TF that indicates how important a word is to a document; terms that occur in many documents of the collection are weighted down by the inverse document frequency.

In order...
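As a small, self-contained illustration of the TF-IDF weighting described above, the following plain-Java sketch uses the common tf × log(N / df) form on a tiny hypothetical corpus. In practice, Mahout's own vectorization tooling (for example, the seq2sparse job) produces these vectors for you, so treat this purely as an explanation of the calculation.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;

    public class TfIdfSketch {

        public static void main(String[] args) {
            // A tiny hypothetical corpus: each document is a list of already tokenized terms
            List<List<String>> corpus = List.of(
                    List.of("mahout", "clustering", "kmeans"),
                    List.of("mahout", "classification"),
                    List.of("clustering", "text", "clustering"));

            // Document frequency: in how many documents does each term appear?
            Map<String, Integer> df = new HashMap<>();
            for (List<String> document : corpus) {
                for (String term : new HashSet<>(document)) {
                    df.merge(term, 1, Integer::sum);
                }
            }

            // TF-IDF for the term "clustering" in the third document
            List<String> document = corpus.get(2);
            String term = "clustering";
            double tf = (double) Collections.frequency(document, term) / document.size(); // 2 / 3
            double idf = Math.log((double) corpus.size() / df.get(term));                 // log(3 / 2)
            System.out.println("TF-IDF(clustering, doc 3) = " + tf * idf);                // ≈ 0.27
        }
    }

A term such as "clustering" that occurs often in one document but also appears in other documents receives a moderate weight, whereas a term unique to a single document would be weighted more heavily.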

Optimizing clustering performance


A clustering implementation that is free of programming errors may still not give you the expected results. There are many factors to consider, based on the application's context and requirements.

Selecting the right features

Selecting the right features requires a comprehensive knowledge of the applied domain. You can ensure that you select the right features to accurately represent the problem domain using the following techniques:

  • Selecting and weighing features properly

  • Normalizing dimensions when values are not comparable (For example, the number of bedrooms in a house and the land price in dollars are not comparable in their innate forms.)

  • Using effective preprocessing mechanisms (custom Lucene analyzers and noise reduction)

  • Selecting a suitable vector representation (SequentialAccessSparseVector, RandomAccessSparseVector, or DenseVector); see the sketch after this list

  • Normalizing data points to remove the effect of outliers in each dimension
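As referenced in the vector representation item above, the following sketch shows the three representations side by side, assuming the classes from mahout-math; which one is appropriate depends on how sparse the data is and whether it is accessed mostly sequentially or randomly.

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.SequentialAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class VectorRepresentations {

        public static void main(String[] args) {
            // Dense: every dimension is stored; suits small vectors with mostly non-zero values
            Vector dense = new DenseVector(new double[]{160, 55});

            // Random access sparse: hash-based storage; suits vectors that are built up
            // in arbitrary order, for example while counting terms
            Vector randomSparse = new RandomAccessSparseVector(10_000);
            randomSparse.set(42, 3.0);

            // Sequential access sparse: parallel-array storage; suits vectors that are
            // mostly read in order, for example during repeated distance computations
            Vector sequentialSparse = new SequentialAccessSparseVector(10_000);
            sequentialSparse.set(42, 3.0);

            System.out.println("dense size = " + dense.size()
                    + ", sparse value at 42 = " + sequentialSparse.get(42));
        }
    }

For text clustering, the sparse representations are usually the right choice, since a document vector has non-zero entries for only a small fraction of the vocabulary.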

Selecting the right algorithms

An algorithm...

Summary


Clustering is an unsupervised learning mechanism that requires minimal human effort. Clustering has many applications in different areas, such as medical image processing, market segmentation, and information retrieval.

Clustering mechanisms can be divided into different types, such as hard, soft, flat, hierarchical, and model-based clustering, based on different criteria.

Apache Mahout implements different clustering algorithms, which can be accessed sequentially or in parallel (using MapReduce).

The K-Means algorithm is a simple and fast algorithm that is widely applied. However, there are situations that the K-Means algorithm cannot handle well. For such scenarios, Apache Mahout provides other algorithms, such as canopy, Fuzzy K-Means, streaming, and spectral clustering.

Text clustering is an important area of clustering that requires special preprocessing steps, such as stop word removal, stemming, and TF-IDF vector generation. Topic modeling is a special case of...
