You're reading from Practical Machine Learning Cookbook

Product type: Book

Published in: Apr 2017

Reading level: Intermediate

Publisher: Packt

ISBN-13: 9781785280511

Edition: 1st Edition

Languages: Python

Tools: Apache Spark

Concepts: Machine Learning

Author: Atul Tripathi

Chapter 3. Clustering

In this chapter, we will cover the following recipes:

Hierarchical clustering - World Bank
Hierarchical clustering - Amazon rainforest burned between 1999-2010
Hierarchical clustering - gene clustering
Binary clustering - math test
K-means clustering - European countries protein consumption
K-means clustering - foodstuff

Introduction

Hierarchical clustering: Hierarchical clustering is one of the most important methods in unsupervised learning. For a given set of data points, it produces its output in the form of a binary tree (dendrogram). In the binary tree, the leaves represent the data points, while the internal nodes represent nested clusters of various sizes. In the agglomerative form described here, each object initially forms its own cluster. The clusters are then compared using a pairwise distance matrix. At each step, the pair of clusters with the shortest distance is identified, removed from the matrix, and merged. The distance between the merged cluster and each of the remaining clusters is then evaluated, and the distance matrix is updated. The process repeats until the distance matrix is reduced to a single element.

An ordering of the objects is produced by hierarchical clustering. This helps with informative...
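The merge procedure described in this introduction can be sketched in R with the base functions dist(), hclust(), and cutree(). This is only a minimal sketch under stated assumptions: the five points are made up for illustration, and average linkage is an arbitrary choice, since the text does not name a linkage method.

```r
# Toy data: five 2-D points, invented purely for illustration
pts <- matrix(c(1.0, 1.0,
                1.2, 0.9,
                5.0, 5.1,
                5.2, 4.9,
                9.0, 1.0),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("a", "b", "c", "d", "e"), c("x", "y")))

# Pairwise distance matrix between all points
d <- dist(pts, method = "euclidean")

# Agglomerative clustering: repeatedly merge the two closest clusters
# (average linkage is an assumption, not taken from the text)
hc <- hclust(d, method = "average")

# Each row of hc$merge records one merge; hc$height holds the merge distances
print(hc$merge)
print(hc$height)

# Cut the tree into three clusters
groups <- cutree(hc, k = 3)
print(groups)
```

Calling plot(hc) would draw the corresponding dendrogram, with the points as leaves and the merges as internal nodes.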

Hierarchical clustering - World Bank sample dataset

One of the main goals in establishing the World Bank was to fight and eliminate poverty. Continuously evolving and fine-tuning its policies in an ever-changing world has helped the institution work toward that goal. Success in eliminating poverty is measured by improvement in health, education, sanitation, infrastructure, and other services needed to improve the lives of the poor. The development gains that secure these goals must be pursued in an environmentally, socially, and economically sustainable manner.

Getting ready

In order to perform hierarchical clustering, we shall be using a dataset collected from the World Bank.

Step 1 - collecting and describing data

The dataset titled WBClust2013 shall be used. It is available in CSV format as WBClust2013.csv. The dataset is in standard format. There are 80 rows of data and 14 variables...

Hierarchical clustering - Amazon rainforest burned between 1999-2010

Between 1999 and 2010, 33,000 square miles (85,500 square kilometers), or 2.8 percent of the Amazon rainforest, burned down. This was found by NASA-led research. The main purpose of the research was to measure the extent of fires that smolder under the forest canopy. The research found that forest fires destroy a much larger area than the clearing of forest land for agriculture and cattle pasture. Yet no correlation could be established between the fires and deforestation.

The explanation for the lack of correlation between fires and deforestation lay in humidity data from the Atmospheric Infrared Sounder (AIRS) instrument aboard NASA's Aqua satellite. Fire frequency coincided with low night-time humidity, which allowed the low-intensity surface fires to continue burning.

Getting ready

In order to perform hierarchical clustering, we shall be using a dataset collected on the Amazon rainforest, which burned from 1999-2010...

Hierarchical clustering - gene clustering

Working with genome-wide expression data is a computationally complex task that the human brain, with its limitations, cannot solve unaided. However, the data can be reduced to an easily comprehensible level by subdividing the genes into a smaller number of categories and then analyzing them.

The goal of clustering is to subdivide a set of genes in such a way that similar items fall into the same cluster, whereas dissimilar items fall into different clusters. The important questions to consider are how similarity is defined and how the clustered items will be used. Here we shall explore clustering genes and samples using the photoreceptor time series for the two genotypes.

Getting ready

In order to perform hierarchical clustering, we shall be using a dataset collected on mice.

Step 1 - collecting and describing data

The datasets titled GSE4051_data and GSE4051_design shall be used. These are available in the CSV format titled GSE4051_data...

Binary clustering - math test

Tests and examinations are major features of the education system. One advantage of an examination system is that it offers a way to differentiate between good and poor performers. It puts the onus on students to prepare for the next standard, for which they must appear in and pass exams, and it makes them responsible for studying on a regular basis. Exams prepare students to meet future challenges and help them to analyze, reason, and communicate their ideas effectively within a fixed time period. On the other hand, there are a few drawbacks: slow learners may not perform well in tests, which can create an inferiority complex among students.

Getting ready

In order to perform binary clustering, we shall be using a dataset collected on math tests.

Step 1 - collecting and describing data

The dataset titled math test shall be used. This is available in the TXT format titled math test.txt. The dataset is in standard format. There...

K-means clustering - European countries protein consumption

Food consumption patterns are of great interest in the fields of medicine and nutrition. Food consumption is correlated with the overall health of an individual, the nutritional value of the food, the economics involved in purchasing a food item, and the environment in which it is consumed. This analysis is concerned with the relationship between meat and other food items in 25 European countries. The data includes measures of red meat, white meat, eggs, milk, fish, cereals, starchy foods, nuts (including pulses and oil-seeds), fruits, and vegetables.

Getting ready

In order to perform K-means clustering, we shall be using a dataset collected on protein consumption for 25 European countries.

Step 1 - collecting and describing data

The dataset titled protein, which is in CSV format, shall be used. The dataset is in standard format. There are 25 rows of data...

K-means clustering - foodstuff

Nutrients in the food we consume can be classified by the role they play in building body mass. These nutrients can be divided into macronutrients and essential micronutrients. Examples of macronutrients are carbohydrates, protein, and fat, while examples of essential micronutrients are vitamins and minerals.

Getting ready

Let's get started with the recipe.

Step 1 - collecting and describing data

In order to perform K-means clustering we shall be using a dataset collected on various food items and their respective Energy, Protein, Fat, Calcium, and Iron content. The numeric variables are:

Energy
Protein
Fat
Calcium
Iron

The non-numeric variable is:

Food
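To make the split between numeric and non-numeric variables concrete, the following is a minimal sketch of how a table with these columns could be clustered with kmeans() from base R. Everything in it is an assumption made for illustration: the data frame name food, the item labels, and all values are invented and are not taken from the book's dataset, and the choice of three clusters and of standardising the columns with scale() is likewise hypothetical.

```r
# Hypothetical food table: labels and values are invented for illustration only
food <- data.frame(
  Food    = c("item_a", "item_b", "item_c", "item_d", "item_e", "item_f"),
  Energy  = c(340, 355, 110, 120, 180, 170),
  Protein = c(20, 19, 23, 25, 7, 9),
  Fat     = c(28, 30, 1, 2, 14, 12),
  Calcium = c(9, 10, 25, 27, 157, 150),
  Iron    = c(2.6, 2.5, 0.5, 0.6, 1.8, 1.5),
  stringsAsFactors = FALSE
)

# Only the numeric variables enter the clustering; Food is kept as a label
num_cols <- c("Energy", "Protein", "Fat", "Calcium", "Iron")

# Standardise so that no variable dominates because of its units
scaled <- scale(food[, num_cols])

# K-means with three clusters (an assumed value); nstart repeats the random
# initialisation and keeps the best solution
set.seed(42)
km <- kmeans(scaled, centers = 3, nstart = 25)

# List the food items that fall into each cluster
print(split(food$Food, km$cluster))
```

Keeping the Food column out of the distance computation and using it only to label the result mirrors the distinction the recipe draws between the numeric and non-numeric variables.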

How to do it...

Let's get into the details.

Step 2 - exploring data

Note

Version info: Code for this page was tested in R version 3.2.3 (2015-12-10).

Load the cluster package:

> library(cluster)

Let's explore the data and understand relationships among the variables. We'll begin by importing the...

Author (1)

Atul Tripathi

Atul Tripathi has spent more than 11 years in the fields of machine learning and quantitative finance. He has a total of 14 years of experience in software development and research. He has worked on advanced machine learning techniques, such as neural networks and Markov models. While working on these techniques, he has solved problems related to image processing, telecommunications, human speech recognition, and natural language processing. He has also developed tools for text mining using neural networks. In the field of quantitative finance, he has developed models for Value at Risk, Extreme Value Theorem, Option Pricing, and Energy Derivatives using Monte Carlo simulation techniques.