Machine Learning with Scala Quick Start Guide

Product type: Book
Published: April 2019
Publisher: Packt
ISBN-13: 9781789345070
Pages: 220
Edition: 1st
Authors (2): Md. Rezaul Karim, Ajay Kumar N

Table of Contents (9 chapters)

Preface
Introduction to Machine Learning with Scala
Scala for Regression Analysis
Scala for Learning Classification
Scala for Tree-Based Ensemble Techniques
Scala for Dimensionality Reduction and Clustering
Scala for Recommender System
Introduction to Deep Learning with Scala
Other Books You May Enjoy

Scala for Dimensionality Reduction and Clustering

In the previous chapters, we saw several examples of supervised learning, covering both classification and regression, where we applied supervised techniques to structured and labeled data. However, as mentioned previously, with the rise of cloud computing, IoT, and social media, unstructured data is growing at an unprecedented rate. Collectively, more than 80% of this data is unstructured, and most of it is unlabeled.

Unsupervised learning techniques, such as clustering analysis and dimensionality reduction, are two of the key approaches used in data-driven research and industry settings to find hidden structures in unstructured datasets. Many clustering algorithms have been proposed for this purpose, such as k-means, bisecting k-means, and the Gaussian mixture model (GMM). However, these algorithms cannot perform with high...
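
As a rough illustration of how these algorithms are invoked from Scala, the following sketch fits a k-means model and then a Gaussian mixture model using Spark MLlib's DataFrame-based API. The toy two-dimensional feature vectors, the object name, and the local SparkSession are assumptions made purely for illustration; they are not data or code from this chapter.

import org.apache.spark.ml.clustering.{GaussianMixture, KMeans}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object ClusteringSketch extends App {
  val spark = SparkSession.builder()
    .appName("ClusteringSketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Toy, unlabeled two-dimensional points standing in for real features.
  val data = Seq(
    Vectors.dense(0.1, 0.2), Vectors.dense(0.15, 0.22),
    Vectors.dense(8.0, 9.1), Vectors.dense(8.2, 8.9),
    Vectors.dense(20.0, 21.5), Vectors.dense(19.8, 22.0)
  ).map(Tuple1.apply).toDF("features")

  // k-means: partition the points into k clusters by minimizing within-cluster variance.
  val kmeansModel = new KMeans().setK(3).setSeed(1234L).fit(data)
  kmeansModel.transform(data).show(false)

  // The same DataFrame can be fed to GaussianMixture (or BisectingKMeans)
  // simply by swapping the estimator.
  val gmmModel = new GaussianMixture().setK(3).setSeed(1234L).fit(data)
  gmmModel.transform(data).select("features", "prediction").show(false)

  spark.stop()
}

Because all three estimators share the same DataFrame-based interface, they can be compared on the same features with minimal code changes.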

Technical requirements

Overview of unsupervised learning

In unsupervised learning, an input set is provided to the system during the training phase. In contrast to supervised learning, the input objects are not labeled with their classes. Although the training dataset is labeled in classification analysis, we do not always have that advantage when collecting data in the real world, yet we still want to find important values or hidden structures in the data. At NIPS 2016, Facebook's AI chief, Yann LeCun, introduced his cake analogy:

"If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don't know how to make the cake."

In order to create such a cake, several unsupervised learning tasks, including clustering...

Clustering analysis through examples

One of the most important tasks in clustering analysis is analyzing genomic profiles to attribute individuals to specific ethnic populations, or analyzing nucleotide haplotypes for disease susceptibility. Human ancestry across Asia, Europe, Africa, and the Americas can be separated based on genomic data. Research has shown that the Y-chromosome lineage can be geographically localized, which provides evidence for clustering human alleles by genotype. According to the National Cancer Institute (https://www.cancer.gov/publications/dictionaries/genetics-dictionary/def/genetic-variant):


"Genetic variants are an alteration in the most common DNA nucleotide sequence. The term variant can be used to describe an alteration that may be benign, pathogenic, or of unknown significance. The term variant is increasingly being...

Dimensionality reduction

Since humans are visual creatures, understanding a high-dimensional dataset (anything with more than three dimensions) is effectively impossible. Even for a machine (or rather, our machine learning algorithm), it is difficult to model the non-linearity in correlated, high-dimensional features. This is where dimensionality reduction techniques come to the rescue.

Statistically, dimensionality reduction is the process of reducing the number of random variables under consideration to obtain a low-dimensional representation of the data while preserving as much information as possible.

The overall steps of PCA can be visualized in the following diagram:

PCA and singular-value decomposition (SVD) are the most popular algorithms for dimensionality reduction. Technically, PCA is a statistical technique that's used to emphasize variation and extract the most significant patterns (that is, features...
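
As a minimal sketch of projecting features onto their principal components with Spark MLlib, the snippet below fits a PCA model and inspects how much variance each component explains. The five-dimensional toy vectors and the object name are invented for illustration and are not the chapter's genomics data.

import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object PCASketch extends App {
  val spark = SparkSession.builder()
    .appName("PCASketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Toy five-dimensional feature vectors standing in for a real high-dimensional dataset.
  val data = Seq(
    Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
    Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
    Vectors.dense(6.0, 1.0, 8.0, 9.0, 5.0)
  ).map(Tuple1.apply).toDF("features")

  // Project the five-dimensional features onto the top two principal components.
  val pcaModel = new PCA()
    .setInputCol("features")
    .setOutputCol("pcaFeatures")
    .setK(2)
    .fit(data)

  // Proportion of the total variance explained by each principal component.
  println(pcaModel.explainedVariance)

  pcaModel.transform(data).select("pcaFeatures").show(false)
  spark.stop()
}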

Summary

In this chapter, we discussed some clustering analysis techniques, such as k-means, bisecting k-means, and GMM. We saw a step-by-step example of how to cluster ethnic groups based on their genetic variants. In particular, we used PCA for dimensionality reduction, k-means for clustering, and H2O and ADAM for handling large-scale genomics datasets. Finally, we learned about the elbow and silhouette methods for finding the optimal number of clusters.
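
A minimal sketch of the silhouette part of that model selection is shown below, assuming Spark 2.3 or later (where ClusteringEvaluator is available) and a DataFrame named data with a features vector column like the ones above; the helper name and the range of k values are illustrative assumptions.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.DataFrame

// Fit k-means for each candidate k and score the clustering by its silhouette
// (ClusteringEvaluator uses the silhouette with squared Euclidean distance by default).
def silhouetteByK(data: DataFrame, ks: Seq[Int]): Seq[(Int, Double)] = {
  val evaluator = new ClusteringEvaluator()
  ks.map { k =>
    val model = new KMeans().setK(k).setSeed(1234L).fit(data)
    (k, evaluator.evaluate(model.transform(data)))
  }
}

// Pick the k with the highest silhouette score; for the elbow method, plot the
// within-cluster sum of squared errors (model.computeCost(data) in Spark 2.x) against k instead.
// silhouetteByK(data, 2 to 10).foreach { case (k, s) => println(s"k=$k, silhouette=$s") }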

Clustering is the key to most data-driven applications. Readers can try applying clustering algorithms to higher-dimensional datasets, such as gene expression or miRNA expression data, in order to cluster similar and correlated genes. A great resource is the gene expression cancer RNA-Seq dataset, which is open source. This dataset can be downloaded from the UCI machine learning repository at https://archive.ics.uci.edu/ml/datasets...
