Machine learning is one of the disciplines that is most frequently used in data mining and can be subdivided into two main tasks: supervised learning and unsupervised learning. This book will concentrate mainly on unsupervised learning.
So, let's begin this journey right from the start. This particular chapter aims to introduce you to the unsupervised learning context. We will begin by explaining the concept of data mining and mentioning the main disciplines that we use in data mining.
Next, we will provide a high-level introduction to some key concepts of information theory. Information theory studies the transmission, processing, utilization, and extraction of information, and it has been successfully applied in the data mining context.
Additionally, we will introduce CRISP-DM, because it is important to use a specialized methodology for managing knowledge discovery projects.
Finally, we will introduce the software tools that we will use in this book, mentioning some of the reasons why they are highly recommended.
In brief, we will cover the following topics:
The data mining concepts
Benefits of using R
At present, the amount of data we are able to produce, transmit, and store is growing at an unprecedented rate. Within these large volumes of information, we can find deposits of valuable knowledge to be extracted. However, the main problem is finding such information, and it is reasonable to say that this will become increasingly difficult.
Eric Emerson Schmidt, who was the chief executive of Google, warned:
"Between the origin of the Earth and 2003 five Exabytes of information were created; today that amount is created every two weeks..."
In this context, it is easy to understand that it is virtually impossible to identify these deposits of knowledge by manual methods, and this makes it necessary to resort to specialized disciplines such as data mining.
The term data mining began to be used in the 1980s by database developers. Data mining can be defined as the process of discovering new and meaningful relationships by exploring large volumes of information.
Data mining, as a process oriented toward knowledge discovery, draws on various disciplines, such as mathematics, statistics, artificial intelligence, databases, pattern recognition, and machine learning. Some of these terms are sometimes treated as synonyms, which is incorrect; rather, the disciplines overlap.
The following diagram illustrates some disciplines involved in the process of data mining:
Machine learning is a subfield of computer science, defined as the study and creation of algorithms that are able to learn. In the context of this book, they learn from the relationships between the pieces of information contained in a dataset.
In general terms, machine learning can be divided into several categories; the two most common ones are supervised learning and unsupervised learning.
In supervised learning, the training data is normally composed of a set of observations. Each observation has a number of variables, named predictors, and one variable that we want to predict, known as the label or class. The labels play the role of the teacher, because the model learns from them.
The ultimate aim of the function created by the model is to extrapolate its behavior to new observations, that is, prediction. The prediction is the output value of a supervised learning model, which can be numeric, as in a regression problem, or a class, as in a classification problem.
Support vector machines are one example of a supervised learning algorithm.
To explain the process of supervised learning, we can resort to the following diagram:
In the modeling stage, we start with raw data that will be used to train the model. The next step is to define the variables used to build the model, since it is possible to reduce their number or transform them.
We proceed to train the model, and finally carry out the evaluation. It is important to note that training, construction, and validation of the model form an iterative process aimed at achieving the best possible model, which means we may need to return to a previous step to make adjustments.
The second stage is prediction. We already have the model and a number of new observations. Using the model that we built and tested, we run a prediction for the new data and generate the results.
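The two stages just described can be sketched in R. This is only a minimal illustration under stated assumptions: the built-in mtcars dataset, a linear regression with lm(), the 24/8 split, and the variables wt and hp are choices made for the example, not anything prescribed by the text.

```r
# Modeling stage: split the built-in mtcars data into training and test sets
set.seed(42)                                   # make the split reproducible
train_idx <- sample(nrow(mtcars), size = 24)   # 24 of the 32 rows for training (assumed split)
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Train a regression model: mpg is the label, wt and hp are the predictors
model <- lm(mpg ~ wt + hp, data = train)

# Evaluate on the held-out rows with the root mean squared error
rmse <- sqrt(mean((test$mpg - predict(model, newdata = test))^2))

# Prediction stage: apply the model to a new, unseen observation
new_car <- data.frame(wt = 3.0, hp = 120)      # hypothetical new observation
predicted_mpg <- predict(model, newdata = new_car)
```

If the error on the held-out rows is unsatisfactory, we would go back, change the variables or the method, and train again, which is the iterative loop mentioned above.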
Unsupervised learning, the focus of this book, is a machine learning task that aims to describe the associations and patterns among a set of input variables. The fundamental difference from supervised learning is that the input data has no class labels, so there is no variable to predict; instead, the task is to find structure in the data from the relationships within it.
We could say that unsupervised learning aims to simulate the human learning process, which implies learning without explicit supervision, that is, without a teacher as is the case with supervised learning.
In unsupervised learning, we can also speak of two stages, modeling and exploitation:
In the modeling phase, we take the input data and proceed to apply techniques of feature selection or feature extraction. Once we define the most convenient variables, we proceed to choose the best method of unsupervised learning to solve the problem at hand. For example, it could be a problem of clustering or association rules.
After choosing the method, we proceed to build the model and execute an iterative tuning process until we are satisfied with the results.
In contrast to supervised learning, in which the value of the model is derived mostly from prediction, in unsupervised learning the findings obtained during the modeling phase could be enough to fulfill the purpose, in which case the process would stop. For example, if the objective is to group customers, then once the modeling phase is done we will have an idea of the existing groups, and that could be the goal of the analysis.
If the model is to be used afterward, there is a second stage, in which we exploit the model we already have. We receive new data, run the model on it, and get results.
Throughout this book, we will explain many aspects of unsupervised learning in greater depth.
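As a rough sketch of both stages in R, the following clusters the built-in iris measurements with kmeans() and then assigns a new observation to the nearest group. The dataset, the choice of three clusters, the helper assign_cluster, and the new observation are all assumptions made for illustration.

```r
# Modeling stage: cluster the numeric columns of the built-in iris data
set.seed(42)
features <- scale(iris[, 1:4])                 # standardize the input variables
model <- kmeans(features, centers = 3, nstart = 10)   # 3 clusters is an assumed choice

# The groups found may already be the goal of the analysis
cluster_sizes <- table(model$cluster)

# Second stage: assign a new observation to its nearest cluster center
# (assign_cluster is a hypothetical helper written for this example)
assign_cluster <- function(x, centers) {
  distances <- apply(centers, 1, function(center) sum((x - center)^2))
  unname(which.min(distances))
}

new_flower <- c(5.9, 3.0, 5.1, 1.8)            # hypothetical new observation
# Apply the same scaling that was used on the training data
new_scaled <- (new_flower - attr(features, "scaled:center")) /
  attr(features, "scaled:scale")
new_cluster <- assign_cluster(new_scaled, model$centers)
```

Note that no label column is used at any point: the groups come only from the relationships among the four measurements.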
Information theory, also known as the mathematical theory of communication or the mathematical theory of information, was proposed by Claude E. Shannon and Warren Weaver in the late 1940s.
Information theory has been extrapolated to other contexts and is widely used in relation to machine learning and unsupervised learning. Because several examples in this book rely on some of its concepts, we explain them in this section.
Information theory defines the degree of dependence between two variables based on the concept of mutual information, that is, the information that is common to both variables. It can therefore be considered a measure of the reduction of uncertainty about the value of one variable once we know the other.
In relation to the above, there are two important concepts that we want to clarify: entropy and information gain.
Entropy, also known as average information, gives the mean amount of information carried by a variable. It can be considered a measure of uncertainty, or of how pure or impure a variable is. For a binary variable measured in bits, entropy ranges from 0, when all instances have the same value, to 1, when there is an equal number of instances of each value.
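In symbols, and assuming the usual notation in which H(X) is the entropy of X and H(X | Y) is the entropy of X that remains once Y is known, this reduction of uncertainty can be written as:

$$I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$

Mutual information is symmetric, and it is zero when the two variables are independent.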
Formally, the entropy can be defined with the help of the following formula:
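Assuming the usual Shannon definition with base-2 logarithms (so that entropy is measured in bits), for a variable X that takes n distinct values, where p_i is the proportion of instances with the i-th value:

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

By convention, a value with p_i = 0 contributes nothing to the sum.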
Explaining the mathematical concepts of information theory is beyond the scope of this book. However, considering their importance, we will explain the concepts of entropy and information gain using an example:
Suppose we have the following dataset consisting of four variables (Color, Size, Shape, and Result) and 16 observations:
Since the Result variable contains 16 instances, 9 TRUE and 7 FALSE, we apply the entropy formula as follows:
The entropy for the example is 0.9887, which makes sense because proportions of 7/16 and 9/16 are almost a coin flip; hence, the entropy is close to 1.
When we are trying to decide the relevance of an attribute, we can examine the information gain associated with it, which is usually a good measure of relevance. Information gain is related to entropy and can be defined as the expected reduction in entropy caused by partitioning the observations according to a feature. In general terms, the expected information gain is the change in information entropy.
Continuing with the example, we consider the variable Size and proceed to calculate its information gain.
We want to calculate the information gain (entropy reduction), that is, the reduction in uncertainty about Result obtained by using the feature Size.
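Written out with base-2 logarithms, and taking the proportions 9/16 and 7/16 from the counts above, the calculation is:

$$H(\text{Result}) = -\frac{9}{16}\log_2\frac{9}{16} - \frac{7}{16}\log_2\frac{7}{16} \approx 0.4669 + 0.5218 = 0.9887$$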
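The same value can be checked in R. The function name entropy below is our own helper for this sketch, not a function from base R or from a specific package.

```r
# Shannon entropy (in bits) of a vector of observed values
entropy <- function(x) {
  p <- table(x) / length(x)      # relative frequency of each value
  p <- p[p > 0]                  # drop empty levels (0 * log2(0) is treated as 0)
  -sum(p * log2(p))
}

# 16 observations: 9 TRUE and 7 FALSE, as in the example
result <- c(rep(TRUE, 9), rep(FALSE, 7))
h_result <- entropy(result)      # approximately 0.9887
```

A vector in which every value is the same gives 0, and a perfectly balanced TRUE/FALSE vector gives 1, which are the two extremes for a binary variable.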
The first thing to do is calculate the entropy of each subset of the variable:
Then we must weight each subset's entropy by its proportion of the observations and add the results. In the example, both Size = Small and Size = Large contain eight observations:
Since information gain is, by definition, the change in entropy:
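With eight of the 16 observations in each subset, the weighted sum takes the following form, where each term on the right is the entropy of Result within one subset (the numeric values depend on the dataset table):

$$H(\text{Result} \mid \text{Size}) = \frac{8}{16}\,H(\text{Result} \mid \text{Size}=\text{Small}) + \frac{8}{16}\,H(\text{Result} \mid \text{Size}=\text{Large})$$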
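That is, the gain is the entropy of Result minus the weighted entropy of Result within each value of Size. A minimal R sketch of the calculation follows. The helpers entropy and information_gain are our own, and the TRUE/FALSE counts inside each size group are invented for illustration; only the totals (9 TRUE, 7 FALSE, eight observations per size) come from the example.

```r
# Shannon entropy (in bits) of a vector of observed values
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of a feature with respect to a target variable
information_gain <- function(target, feature) {
  subsets <- split(target, feature)                    # target values per feature level
  weights <- sapply(subsets, length) / length(target)  # proportion of observations
  conditional <- sum(weights * sapply(subsets, entropy))
  entropy(target) - conditional                        # reduction in entropy
}

# Hypothetical data: the split within each size is an assumption
size   <- rep(c("Small", "Large"), each = 8)
result <- c(rep(TRUE, 6), rep(FALSE, 2),   # Small: 6 TRUE, 2 FALSE (assumed)
            rep(TRUE, 3), rep(FALSE, 5))   # Large: 3 TRUE, 5 FALSE (assumed)
gain <- information_gain(result, size)
```

A feature whose subsets are purer than the whole dataset yields a positive gain; a feature that leaves the TRUE/FALSE proportions unchanged in every subset yields a gain of zero.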
To conclude this introductory chapter, we consider it important to note two additional points: a suggested methodology for data mining projects and some important aspects of the software that we use in this book.
CRISP-DM is an acronym for Cross Industry Standard Process for Data Mining. Although it is a process model for data mining projects in general, it is a good framework to use in unsupervised learning projects. It is not the only existing standard, but it is currently the most widely used.
CRISP-DM is divided into four levels of abstraction, organized hierarchically in tasks ranging from the most general level to the most specific cases. It organizes the development of a data mining project into a series of six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
These phases interact in an ordered process, as shown in the following diagram:
We will not delve into an explanation of each of these phases. Instead, we simply suggest the methodology as a framework. However, if you want to investigate further, there is much information available online.
This book is based entirely on the use of R, a tool that was originally developed by Robert Gentleman and Ross Ihaka from the Department of Statistics at the University of Auckland in 1993. Its current development is the responsibility of the R Development Core Team.
There are many tools for data mining, so let's take a look at a few of the benefits of using R:
R is Free! And not just free, R is open source software.
It is probably the tool most used by the scientific community to carry out research, and one of the most widely used by professionals working in data mining.
Perhaps one of the best features it has is a giant collaborative repository called CRAN, which currently has more than 7,300 packages for many different purposes. Very few applications have this diversity.
It has a very active community along with multiple forums where we can discuss our queries with others and solve our problems.
R has great capacities for information visualization.
And much more.
In this chapter, we have contextualized the concept of unsupervised learning relative to machine learning, supervised learning, and the theory of information.
In addition, we presented a methodology for data mining project management. Finally, we presented the software we plan to use throughout the chapters of this book.
In the next chapter, we will explain some exploratory techniques and thus proceed to execute the next step in the CRISP-DM methodology.