31. Mar 2012 07:03
No comments

Collaboratively filtered recommender systems

Recommender systems weirdly influence the way we browse and experience content online. Whether it is advertisement, shopping or social suggestions, there barely is a way around being recommended something by such systems. A wide range of systems and algorithms to provide full suggestions and recommendation frameworks such as Mahout are available. To get a better understanding we want to use a simple example and take it through becoming a recommender system from scratch.


First of all, to recommend something one has to know the recipient of a recommendation and the subject to be recommended. There are different solutions as to how you can gather the needed information on what a user might desire. The easiest of which is simply asking him, for example via a like button or a collection of favorites. There is a few issues with asking the user for his opinion:

  • additional work required by the user
  • taste may change and previously made input could invalidate
  • exposure of the intend to recommend

However it does provide a certain level of transparency and gives the user the opportunity to control his own recommendations (something that might not always be desirable, think of adverts for example). I always favored the magical behind the scenes recommendation. Just by being on a web page a user exposes a lot of information about himself. For example:

  • hovering hesitation before clicks
  • time spend looking at pages
  • amount of pages looked at per session
  • amount of active and passive participation
  • ...

There is a lot one can measure about a user, without even asking the user to provide anything. Now we want to look into how such information can be made into recommendations.

The sociological aspect

A big part of providing recommendations is the argument of whether past user actions could lead to any assumptions on possible future events, triggered by the same or a similar user. As mentioned above, the fact that a user liked or bought an item last week does not mean that he has intend to buy a similar item in the future. In fact the exact opposite might be the case, and a user might be fully satisfied with his purchase and never need any item of a similar kind again. It is also unclear whether a user would go on and buy another item just because 50% of all other users, who looked at the item he currently looks at, have done so. Further more it is questionable whether users can be grouped by the items they purchase and later on be matched to users who made similar purchases.

The amount of argumentation seems unmeasurable and goes back to questions as fundamental as free will. But we don't want to discuss such deep matters. In this example we make the assumption that two (or more) users who behave similarly are likely to like the same things. We base this on the idea that people who enjoy each others company often tend to subconsciously copy each others habits. Something that can often be observed in movement or resting patterns or even with smaller ticks.  Translating the idea into the web, leads us to the assumption that two users who have similar behavioral pattern throughout browsing the web, are likely to desire the same recommendation. Whether or not this is true is not part of this article and as stated before goes far beyond the intended technological purpose of this article.

Gathering behavioral data

The gathering of behavioral data is nothing new, in fact it is highly unlikely that your behavior has never been captured by a web page. A vast amount of web pages measure your every move to improve their selling performance. Systems such as Omniture/Adobe site analytics use measurements of your behavior to allow their clients to optimize their web page accordingly. For our purpose we want to concentrate on 3 very simple behavioral measures. We chose three as it makes it easy to visualize results. Even with more than 3 measure one could still visualize in the same manner, but would simply have to provide choice as to which measure to be visualized. The measures we make are the basis of our strategy. They define whether we are performing an item based or a user based recommendation.

Item based

In an item based recommendation items are recommended depending on their relation to other items. For example if an item was often bought together with another. Or if an item was rated similarly well as another, by a set of users (see Slope One). Item based recommendations can still be driven by behavioral data, for example by the amount of times an item was hovered across before it was clicked.

User based

In a user based recommendation we try to find similarities in users and assume they keep having a similar taste. Such that when one user buys an item, we can suggest it to another similar user. In a simplified two user example, one could argue these assumptions are likely to be inaccurate. Which is why we don't just compare two users with each other, but instead collect user into cluster and smaller sub-cluster to allow us to make assumptions on entire groups with similar attributes. Just like a school yard, where cliques of friends gather in smaller groups and discuss different topics in different groups of peers with similar interests.

Our example

For this example we are using a simple and non commercial online library. Users are able to borrow books and browse the content in an interactive manner. It is important that users don't just get their books and leave right away, we want to keep a user browsing around as the longer he browses the more behavioral data we can gather.

As clarified earlier we are interested in how to use gathered data, rather than the sociological assumptions behind the correlation of data. So for the sake of our example we will be using the following three measures:

  • how many different items a user looked at over all
  • the average time a user looks at items
  • how often a user looks at items

Our example measures are very broad, usually we would want to try to measure as specific as possible. For example, we would want to concentrate measuring a user's behavior just before each of his purchases rather than just his overall behavior. But let's keep it simple for now.

Structuring the data

As we are planning to make assumptions on relations between users and correlations of their behavior, it is important that we structure the data to allow fast and simple retrieval of the relations. The schema used for our gathered data is the most important part, as it plays a vital role in the accuracy we can provide. Complex and heavy relations will require large and long lasting queries. Such queries would slow down the processing of data and calculation of the correlations, which might mean we can't recalculate as frequent as we might want to. Again, in this example we will keep the schema as small as possible for simplicities sake, as our intention is to ease understanding rather than high performance.

Our Data Structure

For our simple example we will require:

  • a table of user information

  • a table for the behavioral attributes of a user to to book relation

  • and a table to store the results of our calculations

On a side note, storing the user registration date already provides us with one way of getting a set of user into a group, assuming that users of similar attributes would take a similar amount of time to get to the page and sign up. Users of the same social group might join at similar times as they might have shared the idea beforehand.

Gathering the distances

We have already defined the kind of recommendation by the data we gathered. Now we have to define how we will measure the differences. We have deliberately chosen no more than 3 measures to allow an easily visualizable cluster. In fact it might be of interest to you to provide the user with this visualization, to provide full transparency of why an item has been recommended or why a user is similar. Before we visualize the results, we should calculate the distances/differences between users, such that we can only show the closest users and don't have to transfer vast amounts of data to the client system.


As mentioned earlier, these operations are likely to be costly ones. We will want to do this over night or on separate cold instances of the data. How this could be handled in production will be discussed later, for the moment we will concentrate on how we calculate and what we calculate with. First we will need to query the data we gathered earlier. In a system with a huge amount of users you will want to optimize your queries to retrieve larger amounts of data rather than having one query for each user, but for our example this should be enough.

Now we need to get another user's data, using the same query with another user_id (again, on a larger scale you would want to keep the number of  queries low to avoid transactions waiting for locks and to keep load on your data layer as low as possible).

We mentioned earlier that there are many ways of calculating the difference/distance between two user behaviors, item ratings, user to item relations and so on. We mentioned Slope-One as a way to calculate purely by differences of values in different columns of a table. In our scenario we want to think in matters of distance in a n-dimensional cluster. One way of achieving a sense of distance is by comparing the cosine of the angle at absolute 0, in a triangle drawn between the absolute 0 and the coordinates of our two points. However, should either of the two sides gain or lose in length (due to increase or decrease of all three parameters) the angle would not change, thus the cosine would not change either. So our assumption of distance would stay the same. Even though, as clearly shown below the actual distance/difference between the two points (users) has changed.

diagram showing the usage of the cosine to establish the distance between two points

diagram showing the usage of the cosine to falsy establish the distance between two points

So this technique allows us to take a slightly fuzzy approach to finding distances, and thus could be a possibility to find completely new recommendations on top of existing ones. Those recommendations could come through users with larger differences. After a recognized long period of failed recommendations, this might be a way to freshen up the cluster and provide users with the opportunity to try something new, something different.

Of course, we don't always want to make literally far fetched suggestions. So we need a distance more accurate, something that can be provided by the Euclidean distance which allows us to find the distance of two points in a n dimensional (here 3 dimensional) space.

the euclidian distance for N dimensions

Where q is the first user, p the second and n is the dimension for each user. So q1 and p1 would be dimension x, so the value of x for each user. Which in this example is "how many different items a user looked at over all". Then q2 and p2 are the next dimension and thus the next measure and so on. This works for n dimensions, so it is not restricted to a 3 dimensional approach, thus it would work with as many measures as you like. The assumption is, with an increased amount of measures we can equally increase the accuracy of the recommendation. The distance between the two users can now be stored in our recommendations table.

Presenting recommendations

Now we know how close each user is to another. So we can start asking for the user closest to the one getting a recommendation. To get the user closest to the current one, we need to find the user with the minimal distance of all distances to the current user, within the recommendations table.

This is where we would benefit from storing users in smaller sub-cluster. As your user data grows you could store groups of users either by certain attributes (such as their registration date) or simply by their distances. Thus allowing you to perform faster queries on smaller datasets.

We have yet to present a decent recommendation. It really depends on the system to create a logical recommendation from the data at hand. In a social network you would now have two users whom you know to be similar so you could recommend one to the other. In a sales scenario you could recommend an item, either of the users have bought, to the other one and so on. The query from above returns not just a single distance, but all distances to other users, starting with the smallest one. So you have a range of items you could suggest, all the items the closest user has bought, a range of the items the second closest bought and so on.

Visualizing recommendations

a 3d representation of the collaboratively filtered recommendation in our example
a 3d representation of the collaboratively filtered recommendation in our example

As our recommendations were made in a 3 dimensional space, we could visualize that space, with the users in it. Even if we had more than 3 dimensions we could allow a user to chose 3 criteria to be shown in the 3D diagram. Of course there is a range of ways to visualize multidimensional content in a browser, such as Google Motion Charts where you could even use the timeline as another dimension to visualize your results. Or you could use Java Applets to render 3 dimensional objects. I have chosen a small Javascript library that allows me to draw standard ascii characters in a simulated 3D space.


Possibly the greatest challenge around recommendations is performance. Massive calculations of relations between all items or all users will require a huge amount of processing power and time. This topic is a massive topic on it's own so we will only be able to broadly mention a few things to keep in mind.

Cachable URLs

As a huge user base and vast amount of items will hinder you from continuously calculating, your data will only change over night or even less frequently. That allows you to cache returned sorted differences between users considerably long. Think of a good url scheme for your recommender to make sure that multiple user can hit the same urls when asking for the same set of data. Also make sure not to ask for the same piece of data twice in the same session.

Bulk measurements

Don't try to measure your user behavior in real time. You are only calculating over night, so you have no benefit from sending every newly collected behavior straight through to your DBs. Gather the collected data on the client system, and wait until a certain amount is gathered before sending it off. Of course sometimes you don't know when a user leaves, but with the massive amount of data you will gather, the loss of 1 or 2 user behaviors will make little difference. Don't forget, you can also store data in a cookie and send it once it has reached the cookies maximum size. This way the user can even walk away with his data and you can continue collecting when he comes back.

Group queries

Try to group your queries to ask for smaller bulks rather than making one query per user. This way you can access the data from memory, thus allowing faster calculations. It also means a smaller amount of transactions, thus less waiting for locks.


As soon as your data has reached a certain size you need to separate it into sub cluster, as otherwise your queries will become far too heavy, as massive tables will need to be sorted and run through functions like minimal(). To avoid that, you will want to re-arrange your data into sub cluster. By what criteria you split data depends on your system, but you can always separate by distance between items or users. If you start your sub-clusters early, it will become a lot easier to spread your data across multiple machines once you require increased capacity.

Scheduled Calculations

As this will be your heaviest operation, make sure to avoid peak times. Ideally you would have a cold replica of all your data on which you could perform the calculations even throughout the day, those results could then simply be transferred onto a hot instance of the DB.

Books about the topic „collaborative filtering“

The following books are all about the topic "collaborative filtering" and are highly recommended. I have not read all of these books, but a good number of them and some of them I used for my research as well.
Order now »
Advanced Analytics with Spark, 2e
Uri Laserson, O′Reilly
Order now »
Word of Mouse: The Marketing Power of Collaborative Filtering
John Riedl, Warner Books
Order now »
Collaborative filtering: A Clear and Concise Reference
Gerardus Blokdyk, CreateSpace Independent Publishing Platform
Order now »
Collaborative Filtering: Using Machine Learning and Statistical Techniques
Xiaoyuan Su, LAP LAMBERT Academic Publishing

Comments about the topic „Collaboratively filtered recommender systems“

If you like you can leave a comment about the topic and exchange with other reads. In order to comment you need to login and then you can start immediately.
Login now to comment