Machine Learning with Scala Quick Start Guide

Product type: Book
Published: April 2019
Publisher: Packt
ISBN-13: 9781789345070
Pages: 220
Edition: 1st
Authors (2): Md. Rezaul Karim, Ajay Kumar N

Table of Contents (9 chapters)

Preface
Introduction to Machine Learning with Scala
Scala for Regression Analysis
Scala for Learning Classification
Scala for Tree-Based Ensemble Techniques
Scala for Dimensionality Reduction and Clustering
Scala for Recommender System
Introduction to Deep Learning with Scala
Other Books You May Enjoy

Scala for Dimensionality Reduction and Clustering

In the previous chapters, we saw several examples of supervised learning, covering both classification and regression, where we applied supervised techniques to structured and labeled data. However, as mentioned previously, with the rise of cloud computing, IoT, and social media, unstructured data is growing at an unprecedented rate. Collectively, more than 80% of this data is unstructured, and most of it is unlabeled.

Unsupervised learning techniques, such as clustering analysis and dimensionality reduction, are two of the key approaches used in data-driven research and industry settings to find hidden structures in unstructured datasets. Many clustering algorithms have been proposed for this purpose, such as k-means, bisecting k-means, and the Gaussian mixture model (GMM). However, these algorithms cannot perform with high...
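
As a rough illustration of how these algorithms are invoked from Scala, the following sketch fits a k-means model and then a Gaussian mixture model using Spark MLlib's DataFrame-based API. The toy two-dimensional feature vectors, the object name, and the local SparkSession are assumptions made purely for illustration; they are not data or code from this chapter.

import org.apache.spark.ml.clustering.{GaussianMixture, KMeans}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object ClusteringSketch extends App {
  val spark = SparkSession.builder()
    .appName("ClusteringSketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Toy, unlabeled two-dimensional points standing in for real features.
  val data = Seq(
    Vectors.dense(0.1, 0.2), Vectors.dense(0.15, 0.22),
    Vectors.dense(8.0, 9.1), Vectors.dense(8.2, 8.9),
    Vectors.dense(20.0, 21.5), Vectors.dense(19.8, 22.0)
  ).map(Tuple1.apply).toDF("features")

  // k-means: partition the points into k clusters by minimizing within-cluster variance.
  val kmeansModel = new KMeans().setK(3).setSeed(1234L).fit(data)
  kmeansModel.transform(data).show(false)

  // The same DataFrame can be fed to GaussianMixture (or BisectingKMeans)
  // simply by swapping the estimator.
  val gmmModel = new GaussianMixture().setK(3).setSeed(1234L).fit(data)
  gmmModel.transform(data).select("features", "prediction").show(false)

  spark.stop()
}

Because all three estimators share the same DataFrame-based interface, they can be compared on the same features with minimal code changes.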

Technical requirements

Overview of unsupervised learning

In unsupervised learning, an input set is provided to the system during the training phase. In contrast to supervised learning, the input objects are not labeled with their classes. Although the training dataset is labeled in classification analysis, we do not always have that advantage when collecting data in the real world, yet we still want to find important values or hidden structures in the data. At NIPS 2016, Facebook's AI chief, Yann LeCun, introduced his cake analogy:

"If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don't know how to make the cake."

In order to create such a cake, several unsupervised learning tasks, including clustering...

Clustering analysis through examples

One of the most important tasks in clustering analysis is analyzing genomic profiles to attribute individuals to specific ethnic populations, or analyzing nucleotide haplotypes for disease susceptibility. Human ancestry across Asia, Europe, Africa, and the Americas can be separated based on genomic data. Research has shown that the Y-chromosome lineage can be geographically localized, which provides evidence for clustering human alleles by genotype. According to the National Cancer Institute (https://www.cancer.gov/publications/dictionaries/genetics-dictionary/def/genetic-variant):


"Genetic variants are an alteration in the most common DNA nucleotide sequence. The term variant can be used to describe an alteration that may be benign, pathogenic, or of unknown significance. The term variant is increasingly being...

Dimensionality reduction

Since humans are visual creatures, understanding a high-dimensional dataset (anything with more than three dimensions) is effectively impossible. Even for a machine (or rather, our machine learning algorithm), it is difficult to model the non-linearity in correlated, high-dimensional features. This is where dimensionality reduction techniques come to the rescue.

Statistically, dimensionality reduction is the process of reducing the number of random variables under consideration to obtain a low-dimensional representation of the data while preserving as much information as possible.

The overall steps of PCA can be visualized in the following diagram:

PCA and singular-value decomposition (SVD) are the most popular algorithms for dimensionality reduction. Technically, PCA is a statistical technique that's used to emphasize variation and extract the most significant patterns (that is, features...
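
As a minimal sketch of projecting features onto their principal components with Spark MLlib, the snippet below fits a PCA model and inspects how much variance each component explains. The five-dimensional toy vectors and the object name are invented for illustration and are not the chapter's genomics data.

import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object PCASketch extends App {
  val spark = SparkSession.builder()
    .appName("PCASketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Toy five-dimensional feature vectors standing in for a real high-dimensional dataset.
  val data = Seq(
    Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
    Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
    Vectors.dense(6.0, 1.0, 8.0, 9.0, 5.0)
  ).map(Tuple1.apply).toDF("features")

  // Project the five-dimensional features onto the top two principal components.
  val pcaModel = new PCA()
    .setInputCol("features")
    .setOutputCol("pcaFeatures")
    .setK(2)
    .fit(data)

  // Proportion of the total variance explained by each principal component.
  println(pcaModel.explainedVariance)

  pcaModel.transform(data).select("pcaFeatures").show(false)
  spark.stop()
}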

Summary

In this chapter, we discussed some clustering analysis techniques, such as k-means, bisecting k-means, and GMM. We saw a step-by-step example of how to cluster ethnic groups based on their genetic variants. In particular, we used PCA for dimensionality reduction, k-means for clustering, and H2O and ADAM for handling large-scale genomics datasets. Finally, we learned about the elbow and silhouette methods for finding the optimal number of clusters.
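
A minimal sketch of the silhouette part of that model selection is shown below, assuming Spark 2.3 or later (where ClusteringEvaluator is available) and a DataFrame named data with a features vector column like the ones above; the helper name and the range of k values are illustrative assumptions.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.DataFrame

// Fit k-means for each candidate k and score the clustering by its silhouette
// (ClusteringEvaluator uses the silhouette with squared Euclidean distance by default).
def silhouetteByK(data: DataFrame, ks: Seq[Int]): Seq[(Int, Double)] = {
  val evaluator = new ClusteringEvaluator()
  ks.map { k =>
    val model = new KMeans().setK(k).setSeed(1234L).fit(data)
    (k, evaluator.evaluate(model.transform(data)))
  }
}

// Pick the k with the highest silhouette score; for the elbow method, plot the
// within-cluster sum of squared errors (model.computeCost(data) in Spark 2.x) against k instead.
// silhouetteByK(data, 2 to 10).foreach { case (k, s) => println(s"k=$k, silhouette=$s") }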

Clustering is the key to most data-driven applications. Readers can try applying clustering algorithms to higher-dimensional datasets, such as gene expression or miRNA expression data, in order to cluster similar and correlated genes. A great resource is the gene expression cancer RNA-Seq dataset, which is open source. This dataset can be downloaded from the UCI machine learning repository at https://archive.ics.uci.edu/ml/datasets...
