You're reading from  Practical Machine Learning Cookbook

Product type: Book
Published in: Apr 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781785280511
Edition: 1st Edition
Author (1)
Atul Tripathi

Atul Tripathi has spent more than 11 years in the fields of machine learning and quantitative finance. He has a total of 14 years of experience in software development and research. He has worked on advanced machine learning techniques, such as neural networks and Markov models. While working on these techniques, he has solved problems related to image processing, telecommunications, human speech recognition, and natural language processing. He has also developed tools for text mining using neural networks. In the field of quantitative finance, he has developed models for Value at Risk, Extreme Value Theorem, Option Pricing, and Energy Derivatives using Monte Carlo simulation techniques.

Chapter 3. Clustering

In this chapter, we will cover the following recipes:

  • Hierarchical clustering - World Bank
  • Hierarchical clustering - Amazon rainforest burned between 1999-2010
  • Hierarchical clustering - gene clustering
  • Binary clustering - math test
  • K-means clustering - European countries protein consumption
  • K-means clustering - foodstuff

Introduction


Hierarchical clustering: Hierarchical clustering is one of the most important methods in unsupervised learning. For a given set of data points, it produces output in the form of a binary tree (dendrogram), in which the leaves represent the data points and the internal nodes represent nested clusters of various sizes. Initially, each object is assigned to its own cluster. All clusters are then evaluated using a pairwise distance matrix built from the distances between them. The pair of clusters with the shortest distance is identified, removed from the matrix, and merged. The distances between the merged cluster and the remaining clusters are then computed, and the distance matrix is updated. This process is repeated until the distance matrix is reduced to a single element.
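The merge procedure described above can be sketched in R with the base functions dist() and hclust(). This is a minimal illustration on five made-up points, not the chapter's own code: the coordinates, the object names (pts, d, hc, groups), and the choice of average linkage are all assumptions made for the example.

```r
# Five invented points in two dimensions (illustration only).
pts <- matrix(c(1.0, 1.0,
                1.2, 0.9,
                5.0, 5.1,
                5.2, 4.9,
                9.0, 1.0),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("a", "b", "c", "d", "e"), c("x", "y")))

# Pairwise distance matrix between all observations.
d <- dist(pts, method = "euclidean")

# Start with one cluster per point and repeatedly merge the closest pair.
# Average linkage is an assumed choice; other linkages are available.
hc <- hclust(d, method = "average")

hc$merge    # one row per merge step; negative entries are single observations
hc$height   # distance at which each merge took place

# Cut the tree into three clusters and draw the dendrogram.
groups <- cutree(hc, k = 3)
print(groups)
plot(hc)
```

With n observations there are always n - 1 merges, which is why hc$merge has four rows here. The linkage method controls how the distance between a merged cluster and the remaining clusters is recomputed, so different choices can produce different trees from the same distance matrix.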

An ordering of the objects is produced by hierarchical clustering. This helps with informative...

Hierarchical clustering - World Bank sample dataset


One of the main goals in establishing the World Bank was to fight and eliminate poverty. Continuously evolving and fine-tuning its policies in an ever-changing world has helped the institution work toward that goal. Success in eliminating poverty is measured by improvement in health, education, sanitation, infrastructure, and the other services needed to improve the lives of the poor. The development gains that secure these goals must be pursued in an environmentally, socially, and economically sustainable manner.

Getting ready

In order to perform hierarchical clustering, we shall be using a dataset collected from the World Bank.

Step 1 - collecting and describing data

The dataset titled WBClust2013 shall be used. It is available in CSV format as WBClust2013.csv. The dataset is in standard format. There are 80 rows of data and 14 variables...

Hierarchical clustering - Amazon rainforest burned between 1999-2010


Between 1999 and 2010, 33,000 square miles (85,500 square kilometers), or 2.8 percent of the Amazon rainforest, burned. This was found by NASA-led research. The main purpose of the research was to measure the extent of fires smoldering under the forest canopy. The research found that these fires destroy a much larger area than the clearing of forest land for agriculture and cattle pasture. Yet no correlation could be established between the fires and deforestation.

The explanation for the lack of correlation between fires and deforestation lay in humidity data from the Atmospheric Infrared Sounder (AIRS) instrument aboard NASA's Aqua satellite. Fire frequency coincided with low night-time humidity, which allowed low-intensity surface fires to continue burning.

Getting ready

In order to perform hierarchical clustering, we shall be using a dataset collected on Amazon rainforest fires from 1999 to 2010...

Hierarchical clustering - gene clustering


Gathering genome-wide expression data is a computationally complex task, and the human brain, with its limitations, cannot make sense of such data unaided. However, the data can be reduced to an easily comprehensible level by subdividing the genes into a smaller number of categories and then analyzing them.

The goal of clustering is to subdivide a set of genes in such a way that similar items fall into the same cluster, whereas dissimilar items fall into different clusters. The important questions are how to define similarity and how the clustered items will be used. Here we shall explore clustering genes and samples using the photoreceptor time series for the two genotypes.
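To make the question of similarity concrete, the following sketch contrasts two dissimilarity measures on a tiny invented expression matrix: Euclidean distance, which is sensitive to magnitude, and one minus the Pearson correlation, which is sensitive to the shape of the profile over time. Both are common choices, but the source does not state which one the recipe uses. The matrix expr and the gene names are hypothetical and are not taken from the GSE4051 data.

```r
# Hypothetical expression matrix: 4 genes (rows) x 6 time points (columns).
expr <- rbind(
  geneA = c(1, 2, 3, 4, 5, 6),
  geneB = c(10, 20, 30, 40, 50, 60),   # same rising shape as geneA, larger scale
  geneC = c(6, 5, 4, 3, 2, 1),
  geneD = c(60, 50, 40, 30, 20, 10)    # same falling shape as geneC, larger scale
)

# Euclidean distance between rows: genes with similar magnitudes are close.
d_euclid <- dist(expr)

# Correlation-based dissimilarity: genes with similar profile shapes are close.
# cor() works on columns, so the matrix is transposed first.
d_cor <- as.dist(1 - cor(t(expr)))

g_euclid <- cutree(hclust(d_euclid, method = "average"), k = 2)
g_cor    <- cutree(hclust(d_cor,    method = "average"), k = 2)

print(g_euclid)   # groups genes by overall expression level
print(g_cor)      # groups genes by rising versus falling profile
```

The same four genes fall into different clusters depending on the measure, which is why the definition of similarity has to be settled before the clustering itself.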

Getting ready

In order to perform hierarchical clustering, we shall be using a dataset collected on mice.

Step 1 - collecting and describing data

The datasets titled GSE4051_data and GSE4051_design shall be used. These are available in the CSV format titled GSE4051_data...

Binary clustering - math test


Tests and examinations are major features of the education system. One advantage of an examination system is that it offers a way to differentiate between good and poor performers. It puts the onus on students to qualify for the next standard, for which they must sit and pass exams, and it makes them responsible for studying on a regular basis. Examinations prepare students to meet future challenges and help them to analyze, reason, and communicate their ideas effectively within a fixed time period. On the other hand, there are drawbacks: slow learners may not perform well in tests, which can create an inferiority complex among students.

Getting ready

In order to perform binary clustering, we shall be using a dataset collected on math tests.

Step 1 - collecting and describing data

The dataset titled math test shall be used. This is available in the TXT format titled math test.txt. The dataset is in standard format. There...
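One way to cluster binary (0/1) responses in R, offered here as a hedged sketch rather than as the recipe's actual method, is to compute the asymmetric binary distance with dist(method = "binary") and pass it to hclust(). The answers matrix below is invented for illustration and is not the math test dataset; the student labels s1 to s4 and question labels q1 to q5 are hypothetical.

```r
# Invented answer sheet: 1 = correct, 0 = incorrect (illustration only).
answers <- rbind(
  s1 = c(1, 1, 1, 0, 0),
  s2 = c(1, 1, 0, 0, 0),
  s3 = c(0, 0, 0, 1, 1),
  s4 = c(0, 0, 1, 1, 1)
)
colnames(answers) <- paste0("q", 1:5)

# Asymmetric binary distance: among the questions that at least one of the
# two students answered correctly, the proportion answered correctly by
# only one of them.
d_bin <- dist(answers, method = "binary")
print(d_bin)

# Agglomerative clustering on the binary distances (complete linkage assumed).
hc_bin <- hclust(d_bin, method = "complete")
groups_bin <- cutree(hc_bin, k = 2)
print(groups_bin)
```

For s1 and s2, three questions were answered correctly by at least one of them and only one of those by just one student, so their distance is 1/3. Students who share no correct answers are at distance 1.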

K-means clustering - European countries protein consumption


Food consumption patterns are of great interest in the fields of medicine and nutrition. Food consumption is correlated with the overall health of an individual, the nutritional value of the food, the economics involved in purchasing a food item, and the environment in which it is consumed. This analysis examines the relationship between meat and other food items in 25 European countries. The data includes measures of red meat, white meat, eggs, milk, fish, cereals, starchy foods, nuts (including pulses and oil-seeds), fruits, and vegetables.

Getting ready

In order to perform K-means clustering, we shall be using a dataset collected on protein consumption for 25 European countries.

Step 1 - collecting and describing data

The dataset titled protein, which is in CSV format, shall be used. The dataset is in standard format. There are 25 rows of data...
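As a rough sketch of what a K-means run looks like in R, the example below calls the base function kmeans() on simulated stand-in data. It does not use the protein dataset. The object names x and fit, the number of clusters, and the nstart value are assumptions chosen for illustration.

```r
# K-means starts from random centres, so fix the seed for repeatability.
set.seed(42)

# Simulated stand-in data: two well-separated groups of 10 points each.
x <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)

# Ask for two clusters; nstart = 20 tries 20 random starts and keeps the best.
fit <- kmeans(x, centers = 2, nstart = 20)

fit$cluster        # cluster label assigned to every row
fit$centers        # coordinates of the two centroids
fit$tot.withinss   # total within-cluster sum of squares (lower is tighter)
```

Unlike hierarchical clustering, K-means needs the number of clusters up front, and its result can depend on the starting centres, which is the reason for using several random starts.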

K-means clustering - foodstuff


Nutrients in the food we consume can be classified by the role they play in building body mass. These nutrients can be divided into either macronutrients or essential micronutrients. Some examples of macronutrients are carbohydrates, protein, and fat, while some examples of essential micronutrients are vitamins and minerals.

Getting ready

Let's get started with the recipe.

Step 1 - collecting and describing data

In order to perform K-means clustering we shall be using a dataset collected on various food items and their respective Energy, Protein, Fat, Calcium, and Iron content. The numeric variables are:

  • Energy
  • Protein
  • Fat
  • Calcium
  • Iron

The non-numeric variable is:

  • Food
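Because variables such as Energy and Iron are measured on very different scales, a common preparatory step before K-means is to separate the numeric columns from the label column and standardize them. The sketch below shows one way to do that with scale(). Only the column names come from the text: the rows and values are invented, and the names food, numeric_cols, and food_scaled are hypothetical. The recipe's own preparation steps may differ.

```r
# Hypothetical rows with invented values; only the column names come from the text.
food <- data.frame(
  Food    = c("item_1", "item_2", "item_3", "item_4"),
  Energy  = c(340, 245, 420, 110),
  Protein = c(20, 21, 15, 23),
  Fat     = c(28, 17, 39, 1),
  Calcium = c(9, 9, 7, 157),
  Iron    = c(2.6, 2.7, 2.0, 1.8)
)

# Keep only the numeric variables for clustering.
numeric_cols <- c("Energy", "Protein", "Fat", "Calcium", "Iron")

# Centre each column to mean 0 and scale it to standard deviation 1,
# so that no single variable dominates the distance calculation.
food_scaled <- scale(food[, numeric_cols])

# Carry the non-numeric label along as row names for later plots.
rownames(food_scaled) <- as.character(food$Food)

round(colMeans(food_scaled), 10)   # column means after scaling
apply(food_scaled, 2, sd)          # column standard deviations after scaling
```

The resulting matrix food_scaled can be passed directly to a clustering function, while the Food labels remain available as row names.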

How to do it...

Let's get into the details.

Step 2 - exploring data

Note

Version info: Code for this page was tested in R version 3.2.3 (2015-12-10).

Load the cluster package:

> library(cluster)

Let's explore the data and understand relationships among the variables. We'll begin by importing the...
