Chapter 2. Data Mining Techniques Used in Recommender Systems
Though the primary objective of this book is to build recommender systems, a walkthrough of commonly used data-mining techniques is a necessary step before jumping into building them. In this chapter, you will learn about popular data preprocessing, data-mining, and model-evaluation techniques commonly used in recommender systems. The first section of the chapter explains how a data analysis problem is solved, followed by data preprocessing steps such as similarity measures and dimensionality reduction. The rest of the chapter deals with data-mining techniques and how to evaluate them.
Similarity measures include:
Euclidean distance
Cosine distance
Pearson correlation
Dimensionality reduction techniques include:
Principal component analysis (PCA)
Data-mining techniques include:
k-means clustering
Support vector machines
Decision trees
Bagging, boosting, and random forests
Solving a data analysis problem
Any data analysis problem involves a series of steps such as:
Identifying a business problem.
Understanding the problem domain with the help of a domain expert.
Identifying data sources and data variables suitable for the analysis.
Preprocessing or cleansing the data: identifying missing values, distinguishing quantitative from qualitative variables, applying transformations, and so on.
Performing exploratory analysis to understand the data, mostly through visual graphs such as box plots or histograms.
Performing basic statistics such as mean, median, modes, variances, standard deviations, correlation among the variables, and covariance to understand the nature of the data.
Dividing the data into training and testing datasets, then fitting models to the training dataset with machine-learning algorithms, using cross-validation techniques.
Validating the model on the test data to evaluate how it performs on new data. If needed, improving the model based on the results of the validation...
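The last two steps can be sketched in Python with scikit-learn (an assumed toolchain; the iris dataset and logistic regression model here are illustrative choices, not part of the text above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Load a sample dataset and hold out 30% as the test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training set only
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on the full training set, then validate on the held-out test data
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
```

Note that cross-validation uses only the training portion; the test set stays untouched until the final validation step, exactly as the workflow above prescribes.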
Data preprocessing techniques
Data preprocessing is a crucial step in any data analysis problem. A model's accuracy depends mostly on the quality of the data. In general, data preprocessing involves data cleansing, transformations, identifying missing values, and deciding how they should be treated. Only preprocessed data can be fed to a machine-learning algorithm. In this section, we focus on two families of preprocessing techniques that are widely used in recommender systems: similarity measures (such as Euclidean distance, cosine distance, and the Pearson coefficient) and dimensionality-reduction techniques such as principal component analysis (PCA). Besides PCA, singular value decomposition (SVD) and subset feature selection methods can also reduce the dimensions of a dataset, but we limit our study to PCA.
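As a minimal sketch of PCA in Python with scikit-learn (the synthetic rating-like matrix below is an assumption for illustration), we reduce five correlated features to two components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples, 5 features, but only 2 underlying factors
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Two components capture essentially all the variance of this rank-2 data
explained = pca.explained_variance_ratio_.sum()
```

Because the five columns were generated from only two underlying factors, two principal components retain almost all of the information in the original matrix.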
As discussed in the previous chapter, every recommender system works on the concept of similarity between items or users...
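The three similarity measures listed earlier can be sketched in a few lines of Python with NumPy (the two rating vectors are hypothetical):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance: straight-line distance between two vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_sim(a, b):
    """Cosine similarity: cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    return cosine_sim(a - a.mean(), b - b.mean())

# Hypothetical ratings by two users over the same five items
u1 = np.array([5.0, 3.0, 4.0, 4.0, 2.0])
u2 = np.array([4.0, 3.0, 5.0, 3.0, 1.0])
```

Note the relationship the last function exploits: the Pearson correlation is just the cosine similarity computed after subtracting each user's mean rating, which is why it is often preferred when users rate on different personal scales.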
In this section, we will look at commonly used data-mining algorithms, such as k-means clustering, support vector machines, decision trees, bagging, boosting, and random forests. Evaluation techniques such as cross-validation, regularization, the confusion matrix, and model comparison are explained in brief.
Cluster analysis is the process of grouping objects so that objects in one group are more similar to each other than to objects in other groups.
An example would be identifying and grouping clients with similar booking activities on a travel portal, as shown in the following figure.
In the preceding example, each group is called a cluster, and each member (data point) of the cluster behaves in a manner similar to its group members.
Cluster analysis is an unsupervised learning method. In supervised methods, such as regression analysis, we have input variables and response variables, and we fit a statistical model to the input variables to predict the response variable. In unsupervised learning methods, however, we do not have any response variable to predict; we only have input variables. Instead of fitting a model to predict a response, we simply try to find patterns within the dataset. There are three popular clustering algorithms...
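A minimal k-means sketch in Python with scikit-learn (the two synthetic "client activity" groups are an assumption, loosely echoing the travel-portal example above):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two hypothetical groups of clients: low vs. high booking activity
low = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))
high = rng.normal(loc=[8, 8], scale=0.5, size=(50, 2))
X = np.vstack([low, high])

# No response variable is involved: k-means looks only for structure
# in the input data, as an unsupervised method should
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Sort the centers by the first coordinate for a stable ordering
centers = km.cluster_centers_[np.argsort(km.cluster_centers_[:, 0])]
```

The fitted cluster centers land close to the two group means, recovering the hidden grouping without ever being told which client belongs where.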
Decision trees are simple, fast, tree-based supervised learning algorithms for solving classification problems. Though not very accurate compared to other methods such as logistic regression, this algorithm comes in handy when dealing with recommender systems.
Let's define decision trees with an example. Imagine a situation where you have to predict the class of a flower based on features such as petal length, petal width, sepal length, and sepal width. We will apply the decision tree methodology to solve this problem:
Consider the entire data at the start of the algorithm.
Now, choose a suitable question/variable to divide the data into two parts. In our case, we chose to split the data on petal length > 2.45 versus petal length <= 2.45. This separates the flower class setosa from the rest of the classes.
Now, further divide the data having petal length > 2.45 on the same variable, with petal length < 4.5 versus petal length >= 4.5, as shown in the following image.
This splitting of the...
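The procedure above can be sketched with scikit-learn's built-in iris data, which contains exactly these four flower measurements (the depth limit of two is an illustrative choice mirroring the two splits described):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Limit the tree to two levels of splits, as in the walkthrough above
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned rules; the tree splits on petal measurements,
# much like the petal-length thresholds described in the text
print(export_text(tree, feature_names=iris.feature_names))
```

Even at depth two, such a tree classifies the iris training data with high accuracy, which is why the greedy question-by-question splitting works so well here.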
In data mining, we use ensemble methods, which means combining multiple learning algorithms to obtain better predictive results than any single learning algorithm could achieve on a given statistical problem. This section provides an overview of popular ensemble methods such as bagging, boosting, and random forests.
Bagging, also known as bootstrap aggregating, is designed to improve the stability and accuracy of machine-learning algorithms. It helps avoid overfitting and reduces variance. It is mostly used with decision trees.
Bagging involves randomly generating bootstrap samples from the dataset, training a model on each sample independently, and then making predictions by aggregating or averaging the individual models' responses:
For example, consider a dataset (Xi, Yi), where i = 1 … n, containing n data points.
Now, randomly draw B bootstrap samples (sampling with replacement) from the original dataset.
Next, train a regression/classification model on each of the B samples independently....
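The steps above map directly onto scikit-learn's BaggingClassifier, whose default base model is a decision tree (B = 50 and the iris dataset are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# B = 50 bootstrap samples, one decision tree trained on each (the
# default base estimator); class predictions are aggregated by
# majority vote across the 50 trees
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
test_accuracy = bag.score(X_test, y_test)
```

Because each tree sees a different bootstrap sample, their individual errors tend to cancel out in the vote, which is the variance reduction bagging is designed for.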
Evaluating data-mining algorithms
In the previous sections, we saw various data-mining techniques used in recommender systems. In this section, you will learn how to evaluate models built using those techniques. The ultimate goal for any data analytics model is to perform well on future data. This objective can be achieved only if we build a model that is efficient and robust during the development stage.
While evaluating any model, the most important things we need to consider are as follows:
Underfitting, also known as bias, is a scenario in which the model performs poorly even on the training data. This means that we have fit an insufficiently flexible model to the data. For example, say the data is distributed non-linearly and we fit it with a linear model. From the following image, we see that the data is non-linearly distributed. Assume that we have fitted a linear model (orange...
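Underfitting can also be demonstrated numerically (a sketch with synthetic quadratic data; the dataset and the R-squared thresholds are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=0.3, size=100)  # non-linear data

# A straight line underfits this curved data (high bias)
linear = LinearRegression().fit(x, y)
r2_linear = linear.score(x, y)

# Adding a squared feature lets the model capture the curvature
x_quad = PolynomialFeatures(degree=2).fit_transform(x)
quad = LinearRegression().fit(x_quad, y)
r2_quad = quad.score(x_quad, y)
```

The linear model's R-squared stays near zero on its own training data, the defining symptom of underfitting, while the quadratic model fits well.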
In this chapter, you learned about popular data preprocessing techniques, data-mining techniques, and evaluation techniques commonly used in recommender systems. In the next chapter, you will learn about the recommender systems introduced in Chapter 1, Getting Started with Recommender Systems, in more detail.