Chapter 2. Data Mining Techniques Used in Recommender Systems
Though the primary objective of this book is to build recommender systems, a walkthrough of commonly used data-mining techniques is a necessary step before jumping into building them. In this chapter, you will learn about popular data preprocessing, data-mining, and model-evaluation techniques commonly used in recommender systems. The first section of the chapter explains how a data analysis problem is solved, followed by data preprocessing steps such as similarity measures and dimensionality reduction. The rest of the chapter deals with data-mining techniques and how to evaluate them.
Similarity measures include:
Euclidean distance
Cosine distance
Pearson correlation
Dimensionality reduction techniques include:
Principal component analysis (PCA)
Data-mining techniques include:
k-means clustering
Support vector machines
Decision trees
Bagging, boosting, and random forests
Solving a data analysis problem
Any data analysis problem involves a series of steps such as:
Identifying a business problem.
Understanding the problem domain with the help of a domain expert.
Identifying data sources and data variables suitable for the analysis.
Preprocessing or cleansing the data: identifying missing values, distinguishing quantitative from qualitative variables, applying transformations, and so on.
Performing exploratory analysis to understand the data, mostly through visual graphs such as box plots or histograms.
Performing basic statistics such as mean, median, modes, variances, standard deviations, correlation among the variables, and covariance to understand the nature of the data.
Dividing the data into training and testing datasets, then fitting models to the training dataset with machine-learning algorithms, using cross-validation techniques.
Validating the model on the test data to evaluate how it performs on new data. If needed, improving the model based on the results of the validation...
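The last two steps can be sketched in Python with scikit-learn (an assumed toolchain; the iris dataset and logistic regression model here are illustrative choices, not part of the text above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Load a sample dataset and hold out 30% as the test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training set only
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on the full training set, then validate on the held-out test data
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
```

Note that cross-validation uses only the training portion; the test set stays untouched until the final validation step, exactly as the workflow above prescribes.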
Data preprocessing techniques
Data preprocessing is a crucial step in any data analysis problem. A model's accuracy depends mostly on the quality of the data. In general, data preprocessing involves data cleansing, transformations, identifying missing values, and deciding how they should be treated. Only preprocessed data can be fed to a machine-learning algorithm. In this section, we focus on two families of preprocessing techniques that are widely used in recommender systems: similarity measures (such as Euclidean distance, cosine distance, and the Pearson coefficient) and dimensionality-reduction techniques such as principal component analysis (PCA). Besides PCA, singular value decomposition (SVD) and subset feature selection methods can also reduce the dimensions of a dataset, but we limit our study to PCA.
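As a minimal sketch of PCA in Python with scikit-learn (the synthetic rating-like matrix below is an assumption for illustration), we reduce five correlated features to two components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples, 5 features, but only 2 underlying factors
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Two components capture essentially all the variance of this rank-2 data
explained = pca.explained_variance_ratio_.sum()
```

Because the five columns were generated from only two underlying factors, two principal components retain almost all of the information in the original matrix.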
As discussed in the previous chapter, every recommender system works on the concept of similarity between items or users...
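The three similarity measures listed earlier can be sketched in a few lines of Python with NumPy (the two rating vectors are hypothetical):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance: straight-line distance between two vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_sim(a, b):
    """Cosine similarity: cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    return cosine_sim(a - a.mean(), b - b.mean())

# Hypothetical ratings by two users over the same five items
u1 = np.array([5.0, 3.0, 4.0, 4.0, 2.0])
u2 = np.array([4.0, 3.0, 5.0, 3.0, 1.0])
```

Note the relationship the last function exploits: the Pearson correlation is just the cosine similarity computed after subtracting each user's mean rating, which is why it is often preferred when users rate on different personal scales.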
In this section, we will look at commonly used data-mining algorithms, such as k-means clustering, support vector machines, decision trees, bagging, boosting, and random forests. Evaluation techniques such as cross-validation, regularization, the confusion matrix, and model comparison are explained in brief.
Cluster analysis is the process of grouping objects so that objects in one group are more similar to each other than to objects in other groups.
An example would be identifying and grouping clients with similar booking activities on a travel portal, as shown in the following figure.
In the preceding example, each group is called a cluster, and each member (data point) of the cluster behaves in a manner similar to its group members.
Cluster analysis is an unsupervised learning method. In supervised methods, such as regression analysis, we have input variables and response variables, and we fit a statistical model to the input variables to predict the response variable. In unsupervised learning methods, however, we do not have any response variable to predict; we only have input variables. Instead of fitting a model to predict a response, we simply try to find patterns within the dataset. There are three popular clustering algorithms...
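A minimal k-means sketch in Python with scikit-learn (the two synthetic "client activity" groups are an assumption, loosely echoing the travel-portal example above):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two hypothetical groups of clients: low vs. high booking activity
low = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))
high = rng.normal(loc=[8, 8], scale=0.5, size=(50, 2))
X = np.vstack([low, high])

# No response variable is involved: k-means looks only for structure
# in the input data, as an unsupervised method should
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Sort the centers by the first coordinate for a stable ordering
centers = km.cluster_centers_[np.argsort(km.cluster_centers_[:, 0])]
```

The fitted cluster centers land close to the two group means, recovering the hidden grouping without ever being told which client belongs where.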
Decision trees are simple, fast, tree-based supervised learning algorithms for solving classification problems. Though not very accurate compared to other methods such as logistic regression, this algorithm comes in handy when dealing with recommender systems.
Let's define decision trees with an example. Imagine a situation where you have to predict the class of a flower based on features such as petal length, petal width, sepal length, and sepal width. We will apply the decision tree methodology to solve this problem:
Consider the entire data at the start of the algorithm.
Now, choose a suitable question/variable to divide the data into two parts. In our case, we chose to split the data on petal length > 2.45 versus petal length <= 2.45. This separates the flower class setosa from the rest of the classes.
Now, further divide the data having petal length > 2.45 on the same variable, with petal length < 4.5 versus petal length >= 4.5, as shown in the following image.
This splitting of the...
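The procedure above can be sketched with scikit-learn's built-in iris data, which contains exactly these four flower measurements (the depth limit of two is an illustrative choice mirroring the two splits described):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Limit the tree to two levels of splits, as in the walkthrough above
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned rules; the tree splits on petal measurements,
# much like the petal-length thresholds described in the text
print(export_text(tree, feature_names=iris.feature_names))
```

Even at depth two, such a tree classifies the iris training data with high accuracy, which is why the greedy question-by-question splitting works so well here.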
In data mining, we use ensemble methods, which means combining multiple learning algorithms to obtain better predictive results than any single learning algorithm could achieve on a given statistical problem. This section provides an overview of popular ensemble methods such as bagging, boosting, and random forests.
Bagging, also known as bootstrap aggregating, is designed to improve the stability and accuracy of machine-learning algorithms. It helps avoid overfitting and reduces variance. It is mostly used with decision trees.
Bagging involves randomly generating bootstrap samples from the dataset, training a model on each sample independently, and then making predictions by aggregating or averaging the individual models' responses:
For example, consider a dataset (Xi, Yi), where i = 1 … n, containing n data points.
Now, randomly draw B bootstrap samples (sampling with replacement) from the original dataset.
Next, train a regression/classification model on each of the B samples independently....
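The steps above map directly onto scikit-learn's BaggingClassifier, whose default base model is a decision tree (B = 50 and the iris dataset are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# B = 50 bootstrap samples, one decision tree trained on each (the
# default base estimator); class predictions are aggregated by
# majority vote across the 50 trees
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
test_accuracy = bag.score(X_test, y_test)
```

Because each tree sees a different bootstrap sample, their individual errors tend to cancel out in the vote, which is the variance reduction bagging is designed for.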
Evaluating data-mining algorithms
In the previous sections, we saw various data-mining techniques used in recommender systems. In this section, you will learn how to evaluate models built using those techniques. The ultimate goal for any data analytics model is to perform well on future data. This objective can be achieved only if we build a model that is efficient and robust during the development stage.
While evaluating any model, the most important things we need to consider are as follows:
Underfitting, also known as bias, is a scenario in which the model performs poorly even on the training data. This means that we have fit an insufficiently flexible model to the data. For example, say the data is distributed non-linearly and we fit it with a linear model. From the following image, we see that the data is non-linearly distributed. Assume that we have fitted a linear model (orange...
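Underfitting can also be demonstrated numerically (a sketch with synthetic quadratic data; the dataset and the R-squared thresholds are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=0.3, size=100)  # non-linear data

# A straight line underfits this curved data (high bias)
linear = LinearRegression().fit(x, y)
r2_linear = linear.score(x, y)

# Adding a squared feature lets the model capture the curvature
x_quad = PolynomialFeatures(degree=2).fit_transform(x)
quad = LinearRegression().fit(x_quad, y)
r2_quad = quad.score(x_quad, y)
```

The linear model's R-squared stays near zero on its own training data, the defining symptom of underfitting, while the quadratic model fits well.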
In this chapter, you learned about popular data preprocessing techniques, data-mining techniques, and evaluation techniques commonly used in recommender systems. In the next chapter, you will learn about the recommender systems introduced in Chapter 1, Getting Started with Recommender Systems, in more detail.