Unsupervised Learning with R

By Erik Rodríguez Pacheco

About this book

The R Project for Statistical Computing provides an excellent platform to tackle data processing, data manipulation, modeling, and presentation. The capabilities of this language, its freedom of use, and a very active community of users make R one of the best tools to learn and implement unsupervised learning.

If you are new to R or want to learn about unsupervised learning, this book is for you. Packed with critical information, this book will guide you through a conceptual explanation and practical examples programmed directly into the R console.

Starting from the beginning, this book introduces you to unsupervised learning and provides a high-level introduction to the topic. We quickly move on to discuss the application of key concepts and techniques for exploratory data analysis. The book then teaches you to identify groups with the help of clustering methods or by building association rules. Finally, it covers alternatives for the treatment of high-dimensional datasets, including dimensionality reduction and feature selection techniques.

By the end of this book, you will be able to implement unsupervised learning and various approaches associated with it in real-world projects.

Publication date: December 2015
Publisher: Packt
Pages: 192
ISBN: 9781785887093

 

Chapter 1. Welcome to the Age of Information Technology

Machine learning is one of the disciplines that is most frequently used in data mining and can be subdivided into two main tasks: supervised learning and unsupervised learning. This book will concentrate mainly on unsupervised learning.

So, let's begin this journey right from the start. This particular chapter aims to introduce you to the unsupervised learning context. We will begin by explaining the concept of data mining and mentioning the main disciplines that we use in data mining.

Next, we will provide a high-level introduction of some key concepts about information theory. Information theory studies the transmission, processing, utilization, and even extraction of information and has been successfully applied in the data mining context.

Additionally, we will introduce CRISP-DM, because it is important to use a specialized methodology for managing knowledge discovery projects.

Finally, we will introduce the software tools that we will use in this book, mentioning some of the reasons why they are highly recommended.

In brief, we will cover the following topics:

  • The data mining concepts

  • Machine learning

    • Supervised learning

    • Unsupervised learning

  • Information Theory

    • Entropy

    • Information gain

  • CRISP-DM

  • Benefits of using R

 

The information age


At present, the amount of data we are able to produce, transmit, and store is growing at an unprecedented rate. Within these large volumes of information, we can find deposits of valuable knowledge waiting to be extracted. However, the main problem is finding that knowledge, and it is reasonable to say that the task will only become more difficult.

Eric Emerson Schmidt, then the chief executive of Google, warned:

"Between the origin of the Earth and 2003 five Exabytes of information were created; today that amount is created every two weeks..."

In this context, it is easy to understand that it is virtually impossible to identify these deposits of knowledge by manual methods, and this makes it necessary to resort to specialized disciplines such as data mining.

Data mining

The term began to be used in the '80s by database developers. Data mining can be defined as the process of discovery of new and meaningful relationships by exploring large volumes of information.

Data mining, as a process oriented toward knowledge discovery, draws on various disciplines, such as mathematics, statistics, artificial intelligence, databases, pattern recognition, and machine learning. Indeed, some of these terms are sometimes considered synonymous, which is in fact incorrect; rather, these disciplines overlap.

The following diagram illustrates some disciplines involved in the process of data mining:

Machine learning

Machine learning is a subfield of computer science, defined as the study and creation of algorithms that are able to learn, in the context of this book, from the relationships within the information contained in a dataset.

In general terms, machine learning can be divided into several categories; the two most common ones are supervised learning and unsupervised learning.

Supervised learning

This machine learning task is carried out by a set of methods that aim to infer a function from training data.

Normally, the training data is composed of a set of observations. Each observation possesses a number of variables, named predictors, and one variable that we want to predict, also known as the label or class. These labels or classes represent the teachers, because the models learn from them.

The ultimate aim of the function created by the model is to extrapolate its behavior to new observations, that is, prediction. This prediction corresponds to the output value of a supervised learning model, which can be numeric, as in the case of a regression problem, or a class, as in the case of classification problems.

To explain the process of supervised learning, we can resort to the following diagram:

For reference, some examples of supervised learning models are:

  • Regression models

  • Neural networks

  • Support vector machines

  • Random forests

  • Boosting algorithms

We can divide the process into two stages, Modeling and Predicting:

In the modeling stage, we start with the raw data that will be used to train the model. Next comes the definition of the variables used to build the model, since it is possible to reduce their number or transform them.

We proceed to train the model, and finally carry out the evaluation. It is important to note that training, construction, and validation of the model form an iterative process aimed at achieving the best possible model, which means we may need to return to a previous step to make adjustments.

The second stage is the prediction. We already have the model and a number of new observations. Using the model that we built and tested, we run a prediction on the new data and generate the results.
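To make these two stages concrete, here is a minimal sketch in the R console; the built-in mtcars dataset and the linear regression model are chosen purely for illustration and are not taken from the book's examples:

    # Modeling stage: train a model on observations whose labels are known
    train <- mtcars[1:25, ]                    # training data: predictors + label
    test  <- mtcars[26:32, ]                   # new observations to predict
    model <- lm(mpg ~ wt + hp, data = train)   # infer a function from the data
    summary(model)                             # evaluate the fitted model

    # Predicting stage: extrapolate the learned function to new observations
    predict(model, newdata = test)             # numeric output: a regression problem

In a classification problem, the workflow is the same, but the output of the prediction step is a class label rather than a number.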

Unsupervised learning

Unsupervised learning, the main objective of this book, is a machine learning task that aims to describe the associations and patterns among a set of input variables. The fundamental difference from supervised learning is that the input data has no class labels, so there is no variable to predict; instead, the methods try to find structure in the data through the relationships it contains.

We could say that unsupervised learning aims to simulate the human learning process, which implies learning without explicit supervision, that is, without the teacher that is present in supervised learning.

In unsupervised learning, we can also speak of two stages, Modeling and Profiting:

In the modeling phase, we take the input data and proceed to apply techniques of feature selection or feature extraction. Once we define the most convenient variables, we proceed to choose the best method of unsupervised learning to solve the problem at hand. For example, it could be a problem of clustering or association rules.

After choosing the method, we proceed to build the model and execute an iterative tuning process until we are satisfied with the results.

In contrast to supervised learning, in which the model's value is derived mostly from prediction, in unsupervised learning the findings obtained during the modeling phase may be enough to fulfill the purpose, in which case the process stops there. For example, if the objective is to group customers, then once the modeling phase is done, we will have a picture of the existing groups, and that alone could be the goal of the analysis.

If the model is to be used afterwards, there is a second stage, in which we take the model and exploit it: we receive new data, run the model that we built on it, and obtain results.
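As a minimal sketch of this flow in the R console (using k-means clustering on the built-in iris measurements purely for illustration; the cluster count and the new observation are arbitrary):

    # Modeling stage: find groups in unlabeled data (no class variable is used)
    inputs <- iris[, 1:4]                   # keep only the numeric input variables
    set.seed(123)                           # make the clustering reproducible
    model  <- kmeans(inputs, centers = 3)   # the number of clusters is tuned iteratively
    model$size                              # profile of the groups that were found

    # Profiting stage: assign a new observation to the existing groups
    new_obs <- c(Sepal.Length = 5.9, Sepal.Width = 3.0,
                 Petal.Length = 4.2, Petal.Width = 1.3)
    # the nearest cluster center determines the group of the new observation
    which.min(colSums((t(model$centers) - new_obs)^2))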

Throughout this book, we will explain many aspects of unsupervised learning in greater depth.

 

Information theory


Information theory, also known as the mathematical theory of communication, was proposed by Claude E. Shannon and Warren Weaver in the late 1940s. It studies the transmission, processing, utilization, and even extraction of information, and has been successfully applied in the data mining context.

Its concepts have been extrapolated to other contexts and are widely used in relation to machine learning and unsupervised learning. Since several examples in this book rely on these concepts, we will explain them in this section.

Information theory defines the degree of dependence between two variables through the concept of mutual information, that is, the information that the two variables share. Mutual information can therefore be considered a measure of the reduction in uncertainty about the value of one variable once we know the value of the other.

In this regard, there are two important concepts that we want to clarify: entropy and information gain.

Entropy

Entropy, also known as average information, gives the mean value of the information carried by a variable. It can be considered a measure of uncertainty, because it measures how pure or impure a variable is. For a binary variable, the entropy ranges from 0, when all instances of the variable have the same value, to 1, when there is an equal number of instances of each value.

Formally, for a variable that takes k possible values with probabilities p1, ..., pk, the entropy can be defined with the help of the following formula:

    H = -(p1 log2 p1 + p2 log2 p2 + ... + pk log2 pk)

Explaining the mathematical concepts of information theory is beyond the scope of this book. However, considering its importance, we will explain the concept of entropy and information gain using an example:

Suppose we have the following dataset consisting of four variables (Color, Size, Shape, and Result) and 16 observations:

Considering that it contains 16 instances, 9 TRUE and 7 FALSE, we proceed to apply the formula of entropy as follows:

    H = -(9/16) log2(9/16) - (7/16) log2(7/16) = 0.9887

The entropy for the example is 0.9887, which makes sense because a 9-to-7 split is almost a coin flip; hence, the entropy is close to 1.
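This value is easy to verify in the R console. The following helper function is an illustrative sketch, not code from the book:

    # entropy of a discrete variable, given its vector of class counts
    entropy <- function(counts) {
      p <- counts / sum(counts)   # convert counts to probabilities
      -sum(p * log2(p))           # Shannon entropy, in bits
    }
    entropy(c(9, 7))              # 0.9887 for the Result variable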

Information gain

When we are trying to decide the relevance of an attribute, we can examine the information gain associated with that attribute. Information gain is usually a good measure for deciding the relevance of an attribute. It is a measure related to entropy and can be defined as the expected reduction in entropy caused by partitioning the observations according to a given feature. In general terms, the expected information gain is the change in information entropy.

We can calculate the expected entropy after splitting on each possible attribute; in other words, the degree to which the entropy would change.

Continuing with the example, we consider the Size variable and proceed to calculate its information gain.

We want to calculate the information gain (entropy reduction), that is, the reduction in uncertainty obtained by using the Size feature.

The first thing to do is calculate the entropy of each subset of the variable. With subset counts consistent with the result below (say, 6 TRUE and 2 FALSE observations for Size = Small, and 3 TRUE and 5 FALSE for Size = Large), we get:

    H(Small) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.8113
    H(Large) = -(3/8) log2(3/8) - (5/8) log2(5/8) = 0.9544

Then we must combine the calculated entropies, weighted by the proportion of observations in each subset. In the example, both Size = Small and Size = Large contain eight of the 16 observations:

    H(Result | Size) = (8/16)(0.8113) + (8/16)(0.9544) = 0.8828

As information gain is, by definition, the change in entropy:

    Gain(Size) = H(Result) - H(Result | Size) = 0.9887 - 0.8828 = 0.1059

So, we gained 0.1059 bits of information about the dataset by choosing the Size feature.
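Assuming the same subset counts as above, the calculation is easy to reproduce in the R console with the entropy() helper defined earlier:

    # information gain of Size = parent entropy - weighted subset entropies
    h_result <- entropy(c(9, 7))                    # entropy of Result
    h_small  <- entropy(c(6, 2))                    # Size = Small subset
    h_large  <- entropy(c(3, 5))                    # Size = Large subset
    h_result - (8/16 * h_small + 8/16 * h_large)    # 0.1059 bits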

 

Data mining methodology and software tools


To conclude this introductory chapter, we consider it important to note two additional points: a suggested methodology for data mining projects and some important aspects of the software that we use in this book.

CRISP-DM

CRISP-DM is an acronym for Cross Industry Standard Process for Data Mining. Although it is a process model for data mining projects in general, it is a good framework to use in unsupervised learning projects. It is not the only existing standard, but it is currently the most widely used.

CRISP-DM is divided into four levels of abstraction, organized hierarchically into tasks ranging from the most general level to the most specific cases, and it organizes the development of a data mining project into a series of six phases:

  • Business understanding: This aims to understand the project objectives and requirements from a business perspective and convert this knowledge into a data mining problem.

  • Data understanding: This aims to get familiar with the data, to identify quality problems, and to gain first insights into the data.

  • Data preparation: This covers all the activities needed to construct the final dataset to feed into the modeling tool. The data preparation phase might include tasks such as attribute selection, data transformation, data cleaning, and any other task considered necessary.

  • Modeling: The modeling techniques are selected, calibrated, and applied.

  • Evaluation: Before proceeding to the final deployment of the model, it is important to perform a more thorough evaluation, reviewing the steps executed for its construction, to be sure that it properly achieves the business objectives.

  • Deployment: This is the exploitation phase of the project.

These phases interact in an ordered process, as shown in the following diagram:

We will not delve into an explanation of each of these phases. Instead, we simply suggest the methodology as a framework. However, if you want to investigate further, there is much information available online.

 

Benefits of using R


This book is based entirely on the use of R—a tool that was originally developed by Robert Gentleman and Ross Ihaka from the Department of Statistics at the University of Auckland in 1993. Its current development is the responsibility of the R Development Core Team.

There are many tools for data mining, so let's take a look at a few of the benefits of using R:

  • R is Free! And not just free, R is open source software.

  • It is probably the tool most used by the scientific community to carry out research, and certainly among those most used by professionals working in data mining.

  • Perhaps one of the best features it has is a giant collaborative repository called CRAN, which currently has more than 7,300 packages for many different purposes. Very few applications have this diversity.

  • It has a very active community along with multiple forums where we can discuss our queries with others and solve our problems.

  • R has great capabilities for information visualization.

  • And much more...

 

Summary


In this chapter, we have contextualized the concept of unsupervised learning relative to machine learning, supervised learning, and information theory.

In addition, we presented a methodology for data mining project management. Finally, we presented the software we plan to use throughout the chapters of this book.

In the next chapter, we will explain some exploratory techniques and thus proceed to execute the next step in the CRISP-DM methodology.

About the Author

  • Erik Rodríguez Pacheco

    Erik Rodríguez Pacheco works as a manager in the business intelligence unit at Banco Improsa in San José, Costa Rica, and has 11 years of experience in the financial industry. He is currently a professor of the business intelligence specialization program at the Instituto Tecnológico de Costa Rica's continuing education programs. Erik is an enthusiast of new technologies, particularly those related to business intelligence, data mining, and data science. He holds a bachelor's degree in business administration from Universidad de Costa Rica, a specialization in business intelligence from the Instituto Tecnológico de Costa Rica, a specialization in data mining from Promidat (Programa Iberoamericano de Formación en Minería de Datos), and a specialization in business intelligence and data mining from Universidad del Bosque, Colombia. He is currently enrolled in an online specialization program in data science from Johns Hopkins University.

    He has served as the technical reviewer of R Data Visualization Cookbook and Data Manipulation with R - Second Edition, both from Packt Publishing.

    He can be reached at https://www.linkedin.com/in/erikrodriguezp.

