This is to try and answer the "how to" part of the question for those who want to practically implement sparse-SVD recommendations or inspect the source code for the details. You can use off-the-shelf FOSS software to model sparse-SVD, for example `vowpal wabbit`, `libFM`, or `redsvd`.
`vowpal wabbit` has 3 implementations of "SVD-like" algorithms (each selectable by one of 3 command-line options). Strictly speaking, these should be called "approximate, iterative matrix factorization" rather than pure classic SVD, but they are closely related to SVD. You may think of them as a very computationally efficient, approximate SVD-factorization of a sparse (mostly zeroes) matrix.
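As a rough sketch of the underlying idea (my notation, not vw's exact objective): the sparse ratings matrix is approximated by the product of two low-rank factor matrices, and a missing rating is predicted from the dot product of the corresponding user and movie factor vectors:

$$R \approx U V^\top, \qquad \hat{r}_{u,m} = \sum_{k=1}^{N} U_{u,k}\, V_{m,k}$$

where $U$ is (number of users x N), $V$ is (number of movies x N), and the factors are fitted iteratively to minimize the error on the observed (non-zero) ratings only.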
Here's a full, working recipe for doing Netflix-style movie recommendations with `vowpal wabbit` and its "low-ranked quadratic" (`--lrq`) option, which seems to work best for me.

Data-set format file `ratings.vw` (each rating on one line, by user and movie):
5 |user 1 |movie 37
3 |user 2 |movie 1019
4 |user 1 |movie 25
1 |user 3 |movie 238
...
Where the 1st number is the rating (1 to 5 stars), followed by the ID of the user who rated and the ID of the movie that was rated.
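If your raw ratings happen to be in a plain CSV, a one-liner like the following can produce this format (the file name and the user,movie,rating column order are assumptions here; adjust to your data):

awk -F',' '{printf "%s |user %s |movie %s\n", $3, $1, $2}' raw_ratings.csv > ratings.vw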
Test data is in the same format but can (optionally) omit the ratings column:
|user 1 |movie 234
|user 12 |movie 1019
...
Ratings are optional because, in order to evaluate/test predictions, we need ratings to compare the predictions to. If we omit the ratings, `vowpal wabbit` will still predict the ratings but won't be able to estimate the prediction error (predicted values vs. actual values in the data).
To train, we ask `vowpal wabbit` to find a set of N latent interaction factors between users and the movies they like (or dislike). You may think of this as finding common themes where similar users rate a subset of movies in a similar way, and using these common themes to predict how a user would rate a movie he hasn't rated yet.
`vw` options and arguments we need to use:

- `--lrq <x><y><N>`: finds "low-ranked quadratic" latent factors.
- `<x><y>`: "um" means cross the u[sers] and m[ovie] name-spaces in the data set. Note that only the 1st letter of each name-space is used with the `--lrq` option.
- `<N>`: `N=14` below is the number of latent factors we want to find.
- `-f model_filename`: write the final model into `model_filename`.
So a simple full training command would be:
vw --lrq um14 -d ratings.vw -f ratings.model
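On real data you will usually want more than one pass over the ratings. A variant using vw's standard cache/passes options and the optional dropout regularization for `--lrq` would look roughly like this (the number of passes is just an illustrative value, not a tuned one):

vw --lrq um14 --lrqdropout -c --passes 20 -d ratings.vw -f ratings.model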
Once we have the `ratings.model` model file, we can use it to predict additional ratings on a new data set `more_ratings.vw`:
vw -i ratings.model -d more_ratings.vw -p more_ratings.predicted
The predictions will be written to the file `more_ratings.predicted`.
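If `more_ratings.vw` does contain the true ratings in the first column, you can compute the MAE (mean absolute error, $\frac{1}{n}\sum_i |\hat{r}_i - r_i|$) of the predictions yourself. A small bash sketch, assuming there are no tags in the data so each line of the predictions file is just a number:

paste -d' ' <(cut -d' ' -f1 more_ratings.vw) more_ratings.predicted |
  awk '{d = $2 - $1; if (d < 0) d = -d; s += d} END {print "MAE:", s/NR}'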
Using `demo/movielens` in the `vowpalwabbit` source tree, I get ~0.693 MAE (Mean Absolute Error) after training on 1 million user/movie ratings (`ml-1m.ratings.train.vw`) with 14 latent factors (meaning that the SVD middle matrix is a 14x14 matrix) and testing on the independent test set `ml-1m.ratings.test.vw`. How good is 0.69 MAE? For the full range of possible predictions, including the unrated (0) case [0 to 5], a 0.69 error is ~13.8% (0.69/5.0) of the full range, i.e. about 86.2% accuracy (1 - 0.138).
You can find examples and a full demo for a similar data set (movielens), with documentation, in the `vowpal wabbit` source tree on GitHub.
Notes:

- The `movielens` demo uses several options I omitted (for simplicity) from my example: in particular `--loss_function quantile`, `--adaptive`, and `--invariant` (a combined invocation is sketched after these notes).
- The `--lrq` implementation in `vw` is much faster than `--rank`, in particular when storing and loading the models.
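Putting those options together, a fuller training command would look roughly like this (my reconstruction, not the exact command the demo runs; the output model name is just an example):

vw --lrq um14 --loss_function quantile --adaptive --invariant -d ml-1m.ratings.train.vw -f movielens.model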
Credits:

- The `--rank` vw option was implemented by Jake Hofman.
- The `--lrq` vw option (with optional dropout) was implemented by Paul Mineiro.
- vowpal wabbit (aka vw) is the brain child of John Langford.