In this article, by **Saleem Ansari**, the author of the book Building a Recommendation Engine with Scala, we will discuss content-based recommendation.

The main idea in content-based recommendation system is to recommend items to a customer X similar to previous items rated highly by the same customer X. Notice in this definition that we find "similar" items, which means we need to have a measure of similarity between items. To measure similarity of two items, we decode item features and then apply a similarity function.

What is a similarity function? A similarity function takes two items (or their feature representations) and returns a value that indicates degree of similarity of two items. Now, there are many different kinds of similarity functions because there are different ways we can represent an item.

*(For more resources related to this topic, see here.)*

We can represent an item as a set of features. For example, an item has the color red, is square shaped, and is made in India. Another item could be colored red, square shaped, and made in Italy. So, we can represent these items as two sets:

- = {color red, square-shaped, made in India }
- = { color red, square-shaped, made in Italy}

So, how similar are these two items? We will see that later in this article.

We need to have a common representation that is well understood by a similarity function. Also, there can be thousands or millions of features that we can extract for an item. For this, we can use a vector-based representation for an item. Notice that this is different from a set-based representation. So, for and , with vector representation how do we now calculate the similarity?

## Similarity measures

To answer these questions, we have different similarity (or dissimilarity/distance) metrics. Some of these we have already discussed earlier, but let's visit them again here. We will discuss following similarity metrics:

- Pearson correlation
- Euclidean distance
- Cosine measure
- Spearman correlation
- Tanimoto coefficient
- Log likelihood test

### Pearson correlation

Pearson correlation value is in the range -1.0 to 1.0, where 1.0 indicates very high correlation or high similarity and -1.0 indicates opposite or high dis-similarity. As we had seen in an earlier article, Pearson correlation of two series is the ratio of their covariance to the product of their variances. For a detailed review of Pearson correlation, read this Wikipedia article.

#### Challenges with Pearson correlation

Pearson correlation expects the series to have same length, and it doesn't take into account the number of features in which two series preferences overlap. Pearson correlation is also undefined if either of the series of feature values are all identical because that makes the variance zero. Therefore, the result is undefined.

### Euclidean distance

Euclidean distance is the geometric distance between two *n*-dimensional points. So, given two series of feature values, we can consider both as vectors and calculate the geometric distance. Note that this is a distance metric, but we want a similarity measure. So, to convert it into a similarity metric, we can use following formula:

Here is Euclidean similarity and = Euclidean distance.

#### Challenges with Euclidean distance

It only works when the two feature vectors are of same length. Features with larger values affect the similarity measure more than the features with smaller values.

### Cosine measure

Cosine measure or cosine similarity is the cosine of angle between two vectors (or points with respect to origin). The value of cosine ranges from -1.0 to 1.0.

Also note that the meaning of cosine measure value is exactly similar to Pearson correlation. So, we can also use Pearson correlation to the same affect. Therefore, same challenges apply to cosine measure too.

### Spearman correlation

Essentially, Spearman correlation is applicable when we have features in a vector that can be ranked somehow (for example, time and rating both have an implicit ordering). Well, this measure works only if there is an implicit ordering of features. This free video gives a very simple and concise description of calculating Spearman correlation.

### Tanimoto coefficient

Tanimoto coefficient, also known as Jackard coefficent, is the measure of overlap between two sets.

TC = intersection of preferences / union of preferences

Here, is the number of feature overlap, and is the joint number of features. The value of this measure is always from 0 to 1. So, it is easy to convert into a distance measure using the following formula:

Note that this measure is defined for sets, so we need to make an appropriate representation for our item features to make it work. For example, we can have item vector with 1s wherever a feature is present. So, for *item1* and *item2* examples that we saw earlier, we can have following representation:

*= [1,1,1,0]**= [1,1,0,1]*

Here, the universal set would be *[1,1,1,1]* representing [color red, square-shaped, made in India, made in Italy]. So, for *item1* and *item2*, the similarity measure is .

### Log likelihood test

It is quite similar to Tanimoto coefficient, which measures an overlap. However, log likelihood test, is an expression of how unlikely will users have so much overlap, given the total number of items and the number of items each user has a preference for.

Two similar users are likely to rate a movie common to them. However, two dissimilar users are unlikely to rate a common movie. Therefore, the more unlikely the two users would rate the same movie, and still rate the same movie, the more similar two users should be. The resulting value can be interpreted as a probability that an overlap isn't just due to chance. You may also want to refer to this video for a nice explanation.

# Content-based recommendation steps

We follow these steps to arrive at a mode to make content-based recommendations:

- Compute vectors to describe items.
- Build profiles of user preferences.
- Predict user interest in items.

First, we take our items dataset and identify the features we want to encode for each item. Next, we generate a **pretend item** for a user, based on user's interaction with items. We can use user's activity with items such as clicks, likes, purchases, and reviews. So, essentially each user now encoded is with the same features, and is represented same as other items in the dataset. Therefore, we have a set of feature vectors for all items, and also a pretend feature vector for the target user.

*User -> Likes -> Item profile*

Now, the idea is simple. Apply a similarity function for the user features with all the items, sort them by most similar at the top. For example:

Let item feature vectors be , and let the pretend-item for user be . Also, let our similarity function be . So, in a pseudo-code form, finding top K similar items would be:

Or:

This gives us top K items that are most similar to the pretend-item we created for our target user. And this is exactly what we would recommend to the user. However, this is an expensive operation. If you can afford to perform these matches, then it is perfectly fine. However, when the data size is huge, this won't be an option.

To enhance it, we should understand what exactly is happening here. We will discuss that in next section. Can you think about it on your own for now?

Note here that the key ideas are:

- To model items according to relevant attributes
- To infer user likes by modeling user as set of item features, and then use the model built above to make recommendations

Let's now discuss how clustering can give us some performance benefits.

## Clustering for performance

In our items and user preceding example, we have to scan all the items every time we find top K recommendations for a user. First, we need to choose a similarity function. Then, we are essentially finding the items nearest to the user. From these items, we are choosing only the K nearest items. However, we really don't need to perform similarity calculation of user with all the items every time. We can preprocess items and cluster them together such that the most similar items are always grouped together. This is where we will use K-Means clustering algorithm. K-Means algorithm gives us a nice model to find closest set of points to a given point using cluster centroids. Here is how it will look like:

In this figure, there are three users labeled as A, B, and C. These three points are the pretend-points based on their interaction with the system. First, we cluster all the items into three clusters Hexagon, Star, and Ring. Once we have done this, we can find which cluster a user likely belongs to and only recommend items from that cluster.

We will run our implementation of this recommendation approach on Amazon dataset. So, first let's extract the data for all the items from MongoDB, into a CSV file:

`$ mongo amazon_dataset --quiet < ../scripts/item-features.js > datasets/content-based-dataset-new.csv`

Since we need to extract each item as a set of features, we have following attributes for each item:

- Title
- Group
- Sales rank
- Average rating
- Categories

Let's look at a sample entry:

**ASIN**: 078510870X**Title**: Ultimate Marvel Team-Up**Group**: Book**Sales rank**: 612475**Average rating**: 3.5**Categories**: Books::Subjects::Children's Books::Literature::Science Fiction, Fantasy, Mystery and Horror::Comics and Graphic Novels::Books::Subjects::Comics and Graphic Novels::Publishers::Marvel::Books::Subjects::Teens::Science Fiction & Fantasy::Fantasy::Books::Subjects::Comics and Graphic Novels::Graphic Novels::Superheroes

val dim = math.pow(2, 10).toInt

val hashingTF = new HashingTF(dim)

Notice how we create a user-defined function to extract values and create a vector:

Now, based on the limited and mixed data (both numeric and textual), we would still like to be able to extract numeric features. The reason being that K-Means (or any distance based clustering) works only with numeric data. Here is what we will do:

- We will extract title terms, group, and categories terms.
- We will calculate term frequencies of all these terms using HashingTF.
- We will create a vector out of these term frequencies, and append two features sales rank and average rating to this vector.

For this, we will create a Spark **UDF** (**user defined function**), that will operate very nicely on a data frame. Once we have converted every item into a vector representation, we learn a K-Means model, evaluate it, update its hyper parameters, and so on. So, finally when we have obtained a good K-Means model, we will label each of the items with the cluster ID to which they belong:

First, we load the CSV data into a data frame, and then transform it into feature vectors using the UDF we defined earlier. Next, we will train the model and assign cluster IDs to all the items. Later, to make recommendations, we will also need to store the K-Means model to disk. The bad news is this feature is only available in Spark 1.4, and we are using Spark 1.3 (see SPARK-5986). But don't worry, we only need to save the K centroids (see *spark/pull/4951*).

Once we have a K-Means model, and the many clusters of items, we are only left with one taskâ€”picking a user and making item recommendations:

```
$ sbt 'set fork := true' 'run-main chapter06.ContentBasedRSExample'
[info] asin clusterID
[info] 0827229534 4
[info] 0313230269 1
[info] B00004W1W1 4
[info] 1559362022 4
[info] 1859677800 3
[info] B000051T8A 4
...
[info] 1577943082 4
[info] 0486220125 9
[info] B00000AU3R 0
[info] 0231118597 5
[info] 0375709363 9
[info] 0939165252 5
[info] Within Set Sum of Squared Errors = 2.604849376963733E14
[success] Total time: 28 s, completed 31 Jul, 2015 9:36:54 PM
```

Some points to note here are:

- Spark 1.3 doesn't have a mechanism to store K-Means model to disk. This feature is available in Spark 1.4 and is pretty easy to use.
- To convert users to pretend-items, you have many options such as:
- Taking an aggregate sum of the user's items feature vectors
- Taking an average of the user's item feature vectors
- Taking a weighted sum (using average ratings)
- Taking a weighted sum (using distance of an item from its cluster centroid)

However, keep in mind that if you do scaling/normalization on feature vectors while learning the models, you also need to perform same scaling/normalization operation on pretend item vectors too.

# Summary

In this article, we discussed content-based recommendation and implemented the recommendation algorithmâ€”content-based recommender using K-Means clustering.

# Resources for Article:

**Further resources on this subject:**

- RESTServices with Finagle and Finch [article]
- Getting Started with Apache Spark DataFrames [article]
- Docker in Production [article]