Chapter 5. Collaborative Filtering

"People who bought this also bought these."

When you shop online, the site recommends items to you based on your shopping history and the products that you have looked at and probably liked. The system that generates these recommendations is known as a recommendation engine. Several state-of-the-art algorithms are used to build recommendation engines; in this chapter, two such families of algorithms will be discussed, namely "collaborative filtering" and "content-based filtering".

Objective


After reading this chapter, you will understand how some sites recommend items based on the products you have rated and on your browsing history. You will also understand the mathematics behind collaborative filtering and how it can be applied to your own problem domain.

Different recommendation algorithms you will learn


The following algorithms will be discussed at length:

  • General non-personalized recommendations for a category of items

  • User-user collaborative filtering

  • User-user collaborative filtering with variations on similarity measures

  • Item-item collaborative filtering

  • Item-item collaborative filtering with variations

You will also learn how to evaluate recommendation engines using several evaluation metrics for different purposes, such as prediction accuracy, ranking accuracy in the context of top-N recommendations, and so on.

Vocabulary of collaborative filtering


The design of collaborative filters is influenced by two factors, which are as follows:

  • Users (the people for whom the recommendation is being provided)

  • Items (the products for which the recommendation is being provided)

Based on these two entities, there are two major varieties of collaborative filtering methods. The first of these methods takes the similarity between users to recommend items. This is known as User-User collaborative filtering or User k-Nearest Neighbors.

If the number of users of a recommender system is denoted by m and the number of items by n, and if m >> n (m is much greater than n), then user-user collaborative filtering suffers from performance problems. In that case, item-item collaborative filtering (which relies on the similarity between items) is often implemented instead.
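At the heart of user-user filtering is a similarity measure between two users' rating vectors. As a minimal F# sketch, assuming ratings are kept as in-memory maps from item id to rating (the data and the cosineSimilarity function are illustrative, not the book's code), the similarity could be computed like this:

```fsharp
// Ratings for two hypothetical users: item id -> rating
let ratingsOfA = Map.ofList [1, 4.0; 2, 5.0; 3, 3.0]
let ratingsOfB = Map.ofList [1, 5.0; 2, 4.0; 4, 2.0]

// Cosine similarity computed over the items both users have rated
let cosineSimilarity (a : Map<int, float>) (b : Map<int, float>) =
    let keysOf m = m |> Map.toSeq |> Seq.map fst |> Set.ofSeq
    let common = Set.intersect (keysOf a) (keysOf b)
    if Set.isEmpty common then 0.0
    else
        let dot   = common |> Seq.sumBy (fun i -> a.[i] * b.[i])
        let normA = common |> Seq.sumBy (fun i -> a.[i] * a.[i]) |> sqrt
        let normB = common |> Seq.sumBy (fun i -> b.[i] * b.[i]) |> sqrt
        dot / (normA * normB)

printfn "similarity = %f" (cosineSimilarity ratingsOfA ratingsOfB)
```

Pearson correlation, adjusted cosine, and the other variations on similarity measures listed above can be swapped in by replacing this function.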

In the next few sections, these two algorithms will be discussed at length.

Baseline predictors


Before delving into true collaborative filtering, let's look at some baseline predictors that can predict ratings for new users who haven't rated anything yet; for such users, it is almost impossible to determine a neighborhood. For them, a basic baseline rating can be the average of all ratings. The problem of applying collaborative filtering to predict ratings for new users is referred to as Cold Start in the collaborative filtering literature.

The baseline predictor is normally denoted by $b_{u,i}$ for user $u$ and item $i$. The base case, where the baseline is set equal to the global average $\mu$ of all ratings, is given by the following formula:

$$b_{u,i} = \mu$$

However, this can be optimized using the average of that user's ratings for other items (if any are available), or the average rating for that particular item (given by all other users). In these two cases, the baseline predictor is calculated using the following formulae, where $I_u$ is the set of items rated by user $u$ and $U_i$ is the set of users who have rated item $i$:

$$b_{u,i} = \bar{r}_u = \frac{1}{|I_u|}\sum_{j \in I_u} r_{u,j} \qquad\text{and}\qquad b_{u,i} = \bar{r}_i = \frac{1}{|U_i|}\sum_{v \in U_i} r_{v,i}$$

Note that the subscript is used to denote...
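A minimal F# sketch of these baseline predictors, assuming the ratings are held as an array of (user, item, rating) triples (the data and names are illustrative, not the book's code), could look like this:

```fsharp
// Rating triples: (userId, itemId, rating)
let ratings = [| (1, 10, 4.0); (1, 11, 3.0); (2, 10, 5.0); (3, 12, 2.0) |]

// Global mean of all ratings
let globalMean = ratings |> Array.averageBy (fun (_, _, r) -> r)

// Baseline from the user's own average rating, falling back to the global mean
let userBaseline userId =
    match ratings |> Array.filter (fun (u, _, _) -> u = userId) with
    | [||] -> globalMean
    | own  -> own |> Array.averageBy (fun (_, _, r) -> r)

// Baseline from the item's average rating, falling back to the global mean
let itemBaseline itemId =
    match ratings |> Array.filter (fun (_, i, _) -> i = itemId) with
    | [||]    -> globalMean
    | forItem -> forItem |> Array.averageBy (fun (_, _, r) -> r)

printfn "global=%f  user 1=%f  item 10=%f" globalMean (userBaseline 1) (itemBaseline 10)
```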

Item-item collaborative filtering


As in the user-user approach, similarities can be computed and used to recommend items, except that here the similarities are between items rather than users. The idea is to locate similar items based on who is purchasing them. This approach is generally taken when the user base grows large. So, if the number of users is denoted by M and the number of items by N, then when M >> N (read: M is much greater than N) this approach yields good results. Most e-commerce sites use this algorithm to recommend items to users.

Amazon's "People who bought this also bought these" is a simple output of this approach. In a very simple form, devoid of any details about how ratings are persisted, you can think of it as an inverted index from items to the users who bought them:

Items | Users
I1    | U1, U2, U3, U5
I2    | U3, U1
I3    | U1, U5

Now, if a new user buys item I2, then the system will list all the other items that have been purchased by U3 and U1, which are I1 and I3. On the other hand, if the new user buys item I3, then the system will...
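A minimal F# sketch of this inverted-index style of lookup, assuming the purchase table above is held in memory (names and data shapes are illustrative, not the book's code), might be:

```fsharp
// Item -> users who bought it, mirroring the table above
let boughtBy =
    Map.ofList [
        "I1", set ["U1"; "U2"; "U3"; "U5"]
        "I2", set ["U3"; "U1"]
        "I3", set ["U1"; "U5"] ]

// "People who bought this also bought": every other item purchased by
// at least one of the users who purchased the given item
let alsoBought item =
    match Map.tryFind item boughtBy with
    | None -> Set.empty
    | Some users ->
        boughtBy
        |> Map.filter (fun other buyers ->
            other <> item && not (Set.isEmpty (Set.intersect buyers users)))
        |> Map.toSeq
        |> Seq.map fst
        |> Set.ofSeq

printfn "%A" (alsoBought "I2")   // set ["I1"; "I3"]
```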

Top-N recommendations


So far, the discussion has been about prediction. A recommendation is nothing but a sorting of items by their predicted rating values. So, for a new user who has rated a few items, the system calculates rating predictions for several other items and then presents them as a list sorted in descending order of predicted rating.
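As a sketch, once a rating predictor is available, producing a top-N list is just predict, sort, and take; the predictRating function below is an assumed placeholder, not the book's implementation:

```fsharp
// Assumed placeholder: plug in any rating predictor from this chapter
let predictRating (userId : int) (itemId : int) : float =
    float ((userId + itemId) % 5) + 1.0   // dummy prediction for illustration

// Top-N recommendation: predict, sort descending, take the first N items
let topN n userId candidateItems =
    candidateItems
    |> List.map (fun item -> item, predictRating userId item)
    |> List.sortByDescending snd
    |> List.truncate n

printfn "%A" (topN 3 1 [10 .. 20])
```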

Evaluating recommendations


How good a collaborative filtering system is can be broadly determined by measuring three types of accuracy metrics, namely:

  • Prediction Accuracy Metrics

    These measures help us understand how accurately the recommender predicts ratings. They work by calculating the differences between the actual ratings of previously rated items and the ratings estimated by the recommender system.

  • Decision Support Metrics (a.k.a. Confusion Matrix)

    These measures are computed from the confusion matrix, in the same way the performance of a supervised learning algorithm is evaluated, and indicate how well the recommender separates the items a user will like from those they will not (a sketch computing precision and recall follows this list).

  • Ranking Accuracy Metrics

    These metrics are used to find out how well the recommender has placed the items in the final recommended list.
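For example, precision and recall are typical decision support metrics; a minimal F# sketch, assuming illustrative sets of recommended and relevant items, is shown below:

```fsharp
// Items the recommender returned for a user, and the items the user actually liked
let recommended = set [1; 2; 3; 4; 5]
let relevant    = set [2; 3; 7; 9]

let hits = Set.intersect recommended relevant |> Set.count |> float

// Precision: fraction of recommended items that were relevant
let precision = hits / float (Set.count recommended)
// Recall: fraction of relevant items that were recommended
let recall    = hits / float (Set.count relevant)

printfn "precision = %f, recall = %f" precision recall
```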

Prediction accuracy

These metrics help us to understand how good the predicted ratings are. Here are some of the prediction accuracy metrics that are frequently used:

Note

$\hat{r}_{u,i}$ denotes the predicted rating for user $u$ on item $i$, and $r_{u,i}$ is the actual rating. So, the closer the value of these metrics is to zero, the better the prediction algorithm...
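Two prediction accuracy metrics that are commonly used for this purpose are the mean absolute error (MAE) and the root mean squared error (RMSE); the following minimal F# sketch, using illustrative data, computes both:

```fsharp
// Pairs of (actual rating, predicted rating) for previously rated items
let pairs = [| (4.0, 3.5); (3.0, 3.0); (5.0, 4.2); (2.0, 2.8) |]

// Mean absolute error: average magnitude of the prediction errors
let mae =
    pairs |> Array.averageBy (fun (actual, predicted) -> abs (actual - predicted))

// Root mean squared error: square root of the average squared error
let rmse =
    pairs
    |> Array.averageBy (fun (actual, predicted) -> (actual - predicted) ** 2.0)
    |> sqrt

printfn "MAE = %f, RMSE = %f" mae rmse
```

MAE penalizes all errors equally, while RMSE penalizes larger errors more heavily.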

Ranking accuracy metrics


A third view of the task of a recommender system is that it ranks all items with respect to a user (or ranks all user-item pairs), such that higher-ranked recommendations are more likely to be relevant to the user. Individual rating predictions may be incorrect, but as long as the order is captured correctly, ranking accuracy measures will credit the system with high accuracy.

Prediction-rating correlation

If the variance of one variable can be explained by the variance of another, the two variables are said to correlate. Let $i_1, i_2, \ldots, i_n$ be items, let $x_1, x_2, \ldots, x_n$ be their true ranks, and let $y_1, y_2, \ldots, y_n$ be the ranks predicted by the recommender system (that is, $x_k$ is the true rank of the $k$-th item and $y_k$ is its predicted rank). Let $\bar{x}$ be the mean of the $x_k$ and $\bar{y}$ be the mean of the $y_k$. Spearman's correlation is defined as follows:

$$\rho = \frac{\sum_{k}(x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k}(x_k - \bar{x})^2}\;\sqrt{\sum_{k}(y_k - \bar{y})^2}}$$

The following code finds the coefficient:

This produces the following output:

val p : float = 0.9338995047...
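A minimal F# sketch of this computation, assuming two illustrative rank arrays xs and ys as sample data (so the value produced will differ from the output above), could look like the following:

```fsharp
// True ranks and predicted ranks for the same items (illustrative sample data)
let xs = [| 1.0; 2.0; 3.0; 4.0; 5.0 |]
let ys = [| 2.0; 1.0; 3.0; 5.0; 4.0 |]

// Spearman's correlation: Pearson correlation computed on the ranks
let spearman (xs : float[]) (ys : float[]) =
    let mx = Array.average xs
    let my = Array.average ys
    let cov  = Array.map2 (fun x y -> (x - mx) * (y - my)) xs ys |> Array.sum
    let varX = xs |> Array.sumBy (fun x -> (x - mx) ** 2.0)
    let varY = ys |> Array.sumBy (fun y -> (y - my) ** 2.0)
    cov / sqrt (varX * varY)

let p = spearman xs ys
printfn "%f" p
```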

Working with real movie review data (MovieLens)


You can download the MovieLens 100K dataset (for collaborative filtering) from http://files.grouplens.org/datasets/movielens/ml-100k/.

This dataset contains 100,000 movie ratings given by 943 users for 1,682 movies. The ratings are stored in the u.data file at http://files.grouplens.org/datasets/movielens/ml-100k/u.data.

Each user has rated at least 20 movies. The users and items are numbered consecutively from 1, and the data is randomly ordered. Each record is a tab-separated list of:

user id | item id | rating | timestamp

The timestamps are in Unix seconds since 1/1/1970 UTC.

The following C# program translates this data into an F# array so that it can be fed to the collaborative filtering algorithms implemented earlier in the chapter:

https://gist.github.com/sudipto80/606418978f4a86fe93aa

Once you generate this array, you can then plug this into the algorithms described earlier.
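If you prefer to do this step directly in F#, a minimal sketch that parses u.data into an array of (user id, item id, rating) tuples might look like this; the local file path is an assumption:

```fsharp
open System.IO

// Parse the tab-separated MovieLens file: user id, item id, rating, timestamp
let loadRatings (path : string) =
    File.ReadAllLines path
    |> Array.map (fun line ->
        let parts = line.Split('\t')
        int parts.[0], int parts.[1], float parts.[2])

// Assumed local path to the downloaded u.data file
let ratings = loadRatings "u.data"
printfn "Loaded %d ratings" ratings.Length
```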

Summary


In this chapter, the most commonly used memory-based approaches to recommendation were discussed. There are several other approaches to building recommender systems that were not discussed here, such as model-based approaches and hybrid recommender systems, which take cues from several other recommendation algorithms to produce their final recommendations. However, we hope this chapter gave you a useful introduction and hands-on guide to implementing these collaborative filtering ideas. All source code is available at https://gist.github.com/sudipto80/7002e66350ca7b2a7551.
