Working with Data – Exploratory Data Analysis

In this article by Erik Rodriguez Pacheco, author of the book Unsupervised Learning with R, we aim to explain and apply some techniques for exploratory data analysis: summarization, manipulation, correlation, and data visualization. Adequate knowledge of the data, gained through exploration, is essential in order to apply unsupervised learning algorithms correctly. This statement holds not only for unsupervised learning but also for any effort invested in data mining.

Once we have finalized the business understanding phase, we are clear about the context of the problem and the objectives pursued. We then enter the second phase, the understanding of the data, and we do this through exploratory analysis techniques.

R is a versatile programming language as far as data management is concerned; in this article, we will explore some of its potential for data exploration.


Exploratory data analysis

In any project aimed at knowledge discovery, the exploratory analysis of the data should not be underestimated. It is a very important phase, and it is necessary for us to dedicate a lot of time to it.

For anyone who has worked with data, the use of a methodology might help to understand intuitively what I mean by exploratory analysis. However, before we get to a definition, I would like to explain it with an analogy:

In an editorial process, such as the creation of this book, I develop and propose a lot of material: concepts, topics, examples, code; in short, plenty of information. Much of this information will not appear in the published version of the book, and it is likely that the finished version will not match the order in which it was developed. This would be the equivalent of raw data in a process of knowledge discovery such as unsupervised learning.

The process continues and the data enters the editing stage, in which several actors understand, refine, and verify its consistency and presentation. Exploratory data analysis is what happens during this editing phase: it allows us to understand the relations between variables, to identify initial problems with the data, and to determine whether the original data requires any transformation.

In short, the data begins to tell us a story, and to tell this story, we can make use of visualization techniques, summarization, transformation, and handling of data. In this task, the statistical techniques play an important role, as well as specialized software tools that facilitate our work.

Fundamentals of clustering techniques

Clustering is based on the concepts of similarity and distance, where proximity is determined by a distance function. It allows the generation of clusters, where each group consists of individuals that share common features.

Overall, cluster analysis is similar to classification models, with the difference that the groups are not preset. The goal is to perform a partition of the data into clusters, which may or may not be disjoint.

An important point in clustering techniques is that the groups are not given a priori, and this implies that the person doing the analysis must provide the interpretation of the groups that are found.

There are many methods, and the most popular are based on Hierarchical Classification and on dynamic clouds, also known as K-Means.
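
Since clustering rests entirely on a distance function, a quick illustration may be useful before moving on. The following minimal sketch, using only base R and the built-in iris data, computes the Euclidean distance between the first two observations both by hand and with the dist() function:

# Euclidean distance between the first two Iris observations
x <- unlist(iris[1, 1:4])
y <- unlist(iris[2, 1:4])
sqrt(sum((x - y)^2))

# The same distance computed with dist()
dist(iris[1:2, 1:4], method = "euclidean")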

The K-Means clustering

In very general terms, the K-Means algorithm aims to partition a set of observations into clusters so that each observation belongs to the cluster that possesses the nearest mean.

Although it is a computationally difficult problem, there are very efficient implementations that quickly find a local optimum. In an optimization problem, the optimum is the value that maximizes or minimizes the condition that we are looking for.

Given a set of observations (X1, X2, …, XN), K-Means clustering aims to partition the N observations into K (≤ N) sets S = {S1, S2, …, SK} so as to minimize the within-cluster sum of squares (WCSS):

WCSS = Σ (i = 1, …, K) Σ (x ∈ Si) ||x − μi||²

Here, μi denotes the mean of the points in Si, and the clustering sought is the partition S that minimizes this sum.

The intention of this book is not to go into deep mathematical detail; however, it is important to understand the standard algorithm. Assuming that we are seeking to create three clusters, that is, k = 3, the algorithm chooses three initial means, assigns each observation to the cluster with the nearest mean (the assignment phase), recomputes each mean as the centroid of the observations assigned to it (the update phase), and repeats the last two steps until the assignments no longer change.

The methods of Forgy and Random Partition are the most common initialization approaches. The Forgy method chooses k observations from the data at random and uses these as the initial means. The Random Partition method first assigns a cluster to each observation at random and then proceeds to the update phase, computing the initial mean of each cluster as the centroid of its randomly assigned points. The assignment phase is also referred to as the expectation phase, and the update phase as the maximization phase, making this algorithm a variant of the generalized expectation-maximization algorithm.
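
To make the assignment and update phases more concrete, the following is a minimal sketch of the standard (Lloyd) iteration written in plain R with Forgy-style initialization. The function name lloyd_kmeans is purely illustrative and is not part of any package, and the sketch does not handle edge cases such as empty clusters; in practice, we will rely on the built-in kmeans function:

# A minimal, illustrative implementation of the standard (Lloyd)
# K-Means iteration with Forgy-style initialization.
# Note: empty clusters are not handled; this is a sketch, not a
# replacement for the kmeans() function.
lloyd_kmeans <- function(X, k, iter.max = 100) {
    X <- as.matrix(X)
    # Forgy initialization: choose k observations as the initial means
    centers <- X[sample(nrow(X), k), , drop = FALSE]
    for (i in seq_len(iter.max)) {
        # Assignment (expectation) phase: nearest center for each point
        d2 <- sapply(seq_len(k),
                     function(j) colSums((t(X) - centers[j, ])^2))
        cluster <- max.col(-d2)
        # Update (maximization) phase: centers become the cluster means
        new.centers <- t(sapply(seq_len(k),
                                function(j) colMeans(X[cluster == j, , drop = FALSE])))
        if (max(abs(new.centers - centers)) < 1e-8) break  # converged
        centers <- new.centers
    }
    list(cluster = cluster, centers = centers)
}

# Example call (results should be broadly comparable to kmeans()):
# set.seed(42); str(lloyd_kmeans(iris[1:4], 3))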

In R, there are several packages that allow us to use K-Means. The following is an implementation using the Iris dataset:

# K-Means
# Load the Iris dataset (see chapter 2 for details)
Iris <- iris

# Show the head of the numerical section of the dataset
head(Iris[1:4])

  Sepal.Length Sepal.Width Petal.Length Petal.Width

1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

# Build the standard K-Means model
set.seed(42)
KM.Iris <- kmeans(Iris[1:4], 3, iter.max = 1000,
                  algorithm = "Forgy")

At this point, we have created a model for K-Means clustering and that is stored in the KM.Iris object. We can see some information about the outcome:

 # Get some information about the model built
 # Size of clusters

KM.Iris$size

[1] 62 38 50

# Centers of the three clusters, by variable
KM.Iris$centers

  Sepal.Length Sepal.Width Petal.Length Petal.Width

1     5.901613    2.748387     4.393548    1.433871
2     6.850000    3.073684     5.742105    2.071053
3     5.006000    3.428000     1.462000    0.246000

# Table with cluster counts by species
table(Iris$Species, KM.Iris$cluster)
      
              1  2  3
  setosa      0  0 50
  versicolor 48  2  0
  virginica  14 36  0

We must remember that clustering is not a classification method. In fact, the algorithm does not know the species labels in the Iris dataset; for teaching purposes, they are compared here to see how the clustering model works.

Based on the numerical data of Iris, the K-Means model builds three groups, as we indicated, and then assigns each observation to one of those groups. All plants of the type setosa were assigned to cluster 3, most of the plants in cluster 2 are of the type virginica, and cluster 1 contains nearly all plants of the type versicolor along with part of the virginica plants.

It is interesting to plot the result, as this allows for a better appreciation of the information; to do so, we must reduce the dataset so that it can be represented in a low number of dimensions:

# Translate the data into two and three dimensions using
# multidimensional scaling
Iris.dist <- dist(Iris[1:4])
Iris.mds <- cmdscale(Iris.dist)          # 2 dimensions
Iris.mds3 <- cmdscale(Iris.dist, k = 3)  # 3 dimensions, for the 3D plot

# Open a multiple plots array
par(mfrow = c(1, 2))

# Load or install the scatterplot3d package
suppressWarnings(
    suppressMessages(if (!require(scatterplot3d, quietly = TRUE))
        install.packages("scatterplot3d")))
library("scatterplot3d")

# Set the plotting characters to the numbers 1, 2, 3
chars <- c("1", "2", "3")[as.integer(Iris$Species)]

# Plot a 3D graphic of the points, colored by cluster
g3d <- scatterplot3d(Iris.mds3, pch = chars)
g3d$points3d(Iris.mds3, col = KM.Iris$cluster, pch = chars)

# Plot a 2D graphic
plot(Iris.mds, col = KM.Iris$cluster, pch = chars, xlab = "Index", ylab = "Y")

The graphic expression of the clusters can help us evaluate whether the result makes sense. If we determine that it does, we can also include the results of the cluster analysis in the original dataset. Then we can continue working in R or export the data to other tools:

# Add the cluster assignment to the original dataset
Iris.Cluster <- cbind(Iris, KM.Iris[1])

head(Iris.Cluster[3:6])

 Petal.Length Petal.Width Species cluster

1          1.4         0.2  setosa       3
2          1.4         0.2  setosa       3
3          1.3         0.2  setosa       3
4          1.5         0.2  setosa       3
5          1.4         0.2  setosa       3
6          1.7         0.4  setosa       3

The K-Means function requires that we properly define two important parameters: the number of clusters to use and the algorithm to be used to search for the optimum.

Defining the number of clusters

One of the most frequent questions in relation to the use of K-Means is how to define the number of clusters to be used. Considering the impact that this can have on the outcome of the analysis, we will dedicate part of this article to making some recommendations about how to set this parameter.

Remember that the goal of the clustering algorithm is to minimize the within-cluster sum of squares.

We can make a first approximation using current computational capacity. We can create a loop that generates a large number of K-Means models, in which, on each iteration, the value of K is increased by 1.

We then plot the total within-cluster sum of squares against K; the location of the elbow in the resulting plot suggests a suitable number of clusters for K-Means:

# Run 30 K-Means models, increasing k by 1 on each iteration,
# and store the total within-cluster sum of squares
InerIC <- rep(0, 30)
for (k in 1:30) {
    set.seed(42)
    groups <- kmeans(Iris[1:4], k)
    InerIC[k] <- groups$tot.withinss
}
# Elbow plot
plot(InerIC, col = "blue", type = "b")
abline(v = 4, col = "black", lty = 3)
text(4, 60, "4 Clusters", col = "black", adj = c(0, -0.1), cex = 0.7)

The preceding graph provides a visual approximation. At this point, we can estimate that the appropriate number of clusters is between three and four, which, at first glance, seems to make sense given what we know about the Iris dataset.

We could use many other methods individually; however, I recommend a package that integrates 30 indices to determine the optimal number of clusters: the NbClust package.

The NbClust package provides 30 indices for determining the number of clusters and proposes to the user the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.

The package decides by voting: it counts, across all indices, the number of clusters that each method chose as the best option, so that the number of clusters appearing most frequently, that is, the one that receives the most votes, is considered the best option.

It's a great way to compare indices, and if we do not want to use the package's recommendation, we can reach our own conclusions when using the information that it generates.

In the following, we find the suggested number of clusters directly in R, using the NbClust package:

# Load or install the NbClust Package
suppressWarnings(suppressMessages(if (!require(NbClust,
    quietly = TRUE)) install.packages("NbClust")))
library("NbClust")
# Load the dataset Iris and assign the numerical
# variables to a new data frame 'data'.
Iris <- iris
data <- Iris[, -5]

# Find the best number of clusters using all
# indices
Best <- NbClust(data, diss = NULL, distance = "euclidean",
    min.nc = 2, max.nc = 15, method = "complete", index = "alllong")

By default, once finalized, the NbClust package offers a summary of the main results of the vote of the 30 indices:

*************************************************************
* Among all indices:
* 2 proposed 2 as the best number of clusters
* 15 proposed 3 as the best number of clusters
* 5 proposed 4 as the best number of clusters
* 1 proposed 6 as the best number of clusters
* 1 proposed 14 as the best number of clusters
* 3 proposed 15 as the best number of clusters

                   ***** Conclusion *****

 * According to the majority rule, the best number of
clusters is  3 

*************************************************************

An important thing to note in the preceding example is that we used a wide range of clusters when we called the NbClust function: min.nc = 2, max.nc = 15. This causes more distributed voting. However, in this case, 55 percent of the indices vote for three clusters, which we know is right for the Iris dataset.

The package also has some graphical indices.

The Hubert index is a graphical method for determining the number of clusters; in the Hubert index plot, we seek a knee that corresponds to a significant increase in the value of the measure.

The D index is also a graphical method for determining the number of clusters; in the D index plot, we likewise seek a significant knee corresponding to a significant increase in the value of the measure.

In addition to the default information, NbClust stores valuable information in the object that is created. In our example, we call that object Best. For example, if we want to see the exact vote of each index, we could create a table whose rows show the choice of each index in relation to the number of clusters:

# Build a table with the results of the indices

table(names(Best$Best.nc[1, ]), Best$Best.nc[1, ])
    
             0 1 2 3 4 6 14 15
  Ball       0 0 0 1 0 0  0  0
  Beale      0 0 0 1 0 0  0  0
  CCC        0 0 0 1 0 0  0  0
  CH         0 0 0 0 1 0  0  0
  Cindex     0 0 0 1 0 0  0  0
  DB         0 0 0 1 0 0  0  0
  Dindex     1 0 0 0 0 0  0  0
  Duda       0 0 0 0 1 0  0  0
  Dunn       0 0 0 0 0 0  0  1
  Frey       0 1 0 0 0 0  0  0
  Friedman   0 0 0 0 1 0  0  0
  Gamma      0 0 0 0 0 0  1  0
  Gap        0 0 0 1 0 0  0  0
  Gplus      0 0 0 0 0 0  0  1
  Hartigan   0 0 0 1 0 0  0  0

If we do not want to make a table and prefer visual support, we could make a frequency graph with the counts:

# Set a 1x2 grid for plotting
par(mfrow = c(1, 2))

# Make graphs of the counts
hist(Best$Best.nc[1, ],
     breaks = max(na.omit(Best$Best.nc[1, ])))
barplot(table(Best$Best.nc[1, ]))

Looking at the preceding chart, we can quickly see that most of the indices proposed three clusters as the best alternative for the Iris dataset.

Defining the K-Means cluster algorithm

Clustering models generated by K-Means can use several different algorithms, and the choice affects the overall outcome of the analysis. Considering that not all datasets are equal, it is important to test the algorithms to determine which one fits best.

We won't explain each algorithm in depth, but we will explain how we can choose between them, in a practical way:

## Choosing between 4 algorithms

# Set vectors for storing results
Hartigan <- 0
Lloyd <- 0
Forgy <- 0
MacQueen <- 0

# to make it reproducible
set.seed(42)

# Run 500 K-Means models with 3 clusters and 1000 max
# iterations for each method
for (i in 1:500) {
    KM <- kmeans(Iris[1:4], 3, iter.max = 1000, algorithm = "Hartigan-Wong")
    Hartigan <- Hartigan + KM$betweenss
    KM <- kmeans(Iris[1:4], 3, iter.max = 1000, algorithm = "Lloyd")
    Lloyd <- Lloyd + KM$betweenss
    KM <- kmeans(Iris[1:4], 3, iter.max = 1000, algorithm = "Forgy")
    Forgy <- Forgy + KM$betweenss
    KM <- kmeans(Iris[1:4], 3, iter.max = 1000, algorithm = "MacQueen")
    MacQueen <- MacQueen + KM$betweenss
}

# Build a data frame with results
Methods <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
Results <- as.data.frame(round(c(Hartigan, Lloyd, Forgy,
    MacQueen)/500, 2))
Results <- cbind(Methods, Results)
names(Results) <- c("Method", "Betweenss")

Results

The data frame Results stores the average betweenss calculated over the 500 iterations. Considering that the intention is to maximize the betweenss, the best algorithm for this dataset could be Hartigan-Wong:

 Results
         Method Betweenss
1 Hartigan-Wong    591.14
2         Lloyd    588.47
3         Forgy    589.62
4      MacQueen    589.67
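
As a side note on why betweenss is a sensible criterion: for a fixed dataset, the total sum of squares is constant, so maximizing betweenss is equivalent to minimizing tot.withinss. A quick check with the KM.Iris object built earlier (assuming it is still in the workspace) confirms the relation reported in the kmeans documentation:

# For a fixed dataset, totss = tot.withinss + betweenss
KM.Iris$totss
KM.Iris$tot.withinss + KM.Iris$betweenss
all.equal(KM.Iris$totss, KM.Iris$tot.withinss + KM.Iris$betweenss)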

Alternatives for plotting clusters

In addition to traditional charts, we can use special alternatives for cluster analysis. These variants will help in the analysis of groups and also in the presentation of results:

par(mfrow = c(1, 1))

# Load the Iris data
Iris <- iris

# K-Means clustering with 3 clusters
KM <- kmeans(Iris[1:4], 3, iter.max = 1000, algorithm = "Hartigan-Wong")

# Load or install the cluster package
suppressWarnings(
    suppressMessages(if (!require(cluster, quietly = TRUE))
        install.packages("cluster")))
library("cluster")

# Cluster plot against the first 2 principal components
clusplot(Iris[1:4], KM$cluster, color = TRUE, shade = TRUE,
         labels = 2, lines = 1, main = "Cluster Analysis for Iris")

Another interesting alternative is to use silhouette graphics, through which each cluster is represented based on a comparison of its "tightness" and "separation":

# Load or Install packages
suppressWarnings(suppressMessages(if (!require(HSAUR,
    quietly = TRUE)) install.packages("HSAUR")))
suppressWarnings(suppressMessages(if (!require(cluster,
    quietly = TRUE)) install.packages("cluster")))
library("HSAUR")
library("cluster")

# K-Means Clustering with 3 clusters
KM <- kmeans(Iris[1:4], 3, iter.max = 1000, algorithm = "Hartigan-Wong")

# Dissimilarity Matrix Calculation
diss <- daisy(Iris[1:4])
dE2 <- diss^2

# Silhouette calculation
obj <- silhouette(KM$cluster, dE2)

# Make a silhouette plot
plot(obj, col = c("red", "green", "blue"))

If you want to delve into the construction and interpretation of silhouette graphics, refer to the article at http://www.sciencedirect.com/science/article/pii/0377042787901257.

Hierarchical clustering

Another of the most widely used methods for clustering analysis is Hierarchical Cluster Analysis (HCA). This method, as its name suggests, aims to build a hierarchy of clusters, and generally this is done in one of two ways:

  • Agglomerative methods: These use a bottom-up approach in which each observation begins in its own cluster, and pairs of clusters are merged as one moves up the hierarchical structure.
  • Divisive methods: These use a top-down approach in which all observations begin in one cluster, and splits are performed recursively as one moves down the hierarchical structure.

Agglomerative methods use a recursive algorithm that follows these phases:

  • Find the two closest points in the dataset
  • Link these points and consider them as a single point
  • The process starts again, now using the new dataset that contains the new point

This methodology requires measuring the distance between points. The aim is for the distances between observations in the same cluster to be as small as possible and the distances between clusters to be as large as possible.

In hierarchical clustering, there are two very important parameters in relation to the above: the distance metric and the linkage method.
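
To illustrate these two parameters, the following is a minimal sketch that clusters the numerical Iris variables with the base R functions dist() and hclust(); the choice of Euclidean distance and complete linkage here is only one possible combination (the same one we passed to NbClust earlier), not a recommendation:

# Hierarchical (agglomerative) clustering of the numerical Iris variables
Iris <- iris
d <- dist(Iris[1:4], method = "euclidean")   # distance metric
HC.Iris <- hclust(d, method = "complete")    # linkage method

# Plot the dendrogram and cut the tree into three clusters
plot(HC.Iris, labels = FALSE, hang = -1)
groups <- cutree(HC.Iris, k = 3)

# Compare the hierarchical clusters with the species, as before
table(Iris$Species, groups)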

Summary

In this article, we discussed the importance of exploratory data analysis and introduced clustering techniques, including K-Means and hierarchical clustering.

The content of this article is not intended to be exhaustive or exclusive. There are many approaches, tools, and techniques that can be used in the exploratory phase of analysis. However, we hope that it will help you develop a style of your own, or at least become aware of the different tools available.
