Homework II
Due Thursday, November 1.
Problem A
In this problem, you will use matrix factorization methods. Here
rectools provides wrappers to the fast and powerful
recosystem package. We'll use the MovieLens data (original 100K
version), without covariates.
Details:
- Use Nonnegative Matrix Factorization (NMF), fitting with
trainReco() and predicting with the provided method for generic
predict().
- Split your data into training and test sets, as in Hwk I. Follow
those instructions exactly, except with MovieLens instead of InstEval.
- For a range of values of the rank r, fit NMF on the training
set and predict on the test set, calculating MAPE. Graph MAPE against
r.
- Repeat the previous step, but in this case predict on the training
set. Plot this curve on the same graph as the test-set curve, i.e. two
curves on one graph.
- Write about your results -- prose and graphs -- in
ProblemA.tex.
- Place your code in a file ProblemA.R that you will include in
your submission. Your .tex file, image files and the resulting .pdf
file will also be in your submission.
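The bookkeeping for the two MAPE curves can be sketched as follows. This is only a sketch under stated assumptions: MAPE is taken to mean mean absolute prediction error (use the Hwk I definition if it differs), the rating is assumed to be the third column, and fitFn and predFn are hypothetical stand-ins for whatever you wrap around trainReco() and predict(). None of the helper names below come from rectools.

```r
# Mean absolute prediction error (assumed meaning of MAPE).
# NA predictions are dropped, which is itself an assumption.
mape <- function(pred, actual) mean(abs(pred - actual), na.rm = TRUE)

# Hypothetical helper: for each rank r, fit on the training set and
# record MAPE on both the test set and the training set.
#   fitFn(trainSet, r)   returns a fitted model   (stand-in, not rectools)
#   predFn(fit, newData) returns predicted ratings (stand-in, not rectools)
mapeByRank <- function(trainSet, testSet, ranks, fitFn, predFn) {
  t(sapply(ranks, function(r) {
    fit <- fitFn(trainSet, r)
    c(rank  = r,
      test  = mape(predFn(fit, testSet),  testSet[[3]]),
      train = mape(predFn(fit, trainSet), trainSet[[3]]))
  }))
}

# Draw both curves on one graph.
plotMape <- function(res) {
  matplot(res[, "rank"], res[, c("test", "train")], type = "b",
          pch = 1:2, lty = 1:2, col = 1:2,
          xlab = "rank r", ylab = "MAPE")
  legend("topright", legend = c("test set", "training set"),
         pch = 1:2, lty = 1:2, col = 1:2)
}
```

Passing the fitting and prediction steps in as functions keeps the loop independent of the exact trainReco() arguments, which you should check in the rectools documentation.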
Problem B
When working with RS data, one must be very careful with user and item
IDs, as they may not be consecutive. This can be disastrous if
not accounted for.
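To see why this matters, here is a toy illustration (made-up data, not InstEval) of what happens when the IDs themselves are used as row indices. The data frame, the column names and the helper idsAreConsecutive() are all hypothetical.

```r
# Toy ratings: three users whose IDs are not consecutive.
ratings <- data.frame(userID = c(3, 7, 10, 3),
                      itemID = c(1, 2, 1, 2),
                      rating = c(4, 5, 2, 3))

# Naive approach: treat the IDs themselves as row/column indices.
naive <- matrix(NA_real_,
                nrow = max(ratings$userID),
                ncol = max(ratings$itemID))
naive[cbind(ratings$userID, ratings$itemID)] <- ratings$rating

# Rows with no ratings at all belong to user IDs that do not exist.
phantomRows <- which(rowSums(!is.na(naive)) == 0)

# Hypothetical check: TRUE if the distinct IDs are exactly 1, 2, ..., n.
idsAreConsecutive <- function(ids) {
  u <- sort(unique(ids))
  all(u == seq_along(u))
}
```

Here the matrix has 10 rows for only 3 real users, so any factorization would be fit mostly to nonexistent users.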
- Demonstrate that we run into major problems with the InstEval data using NMF
if we mistakenly assume consecutive IDs.
- Extend trainReco() and the associated method for the generic
predict() to check for, and correct, this problem. This will
entail adding another component to the return value of the former, and
another argument to the latter.
- Name your .tex file ProblemB.tex,
and place your R code in a file ProblemB.R.
Include these, your image files and the resulting .pdf
file in your submission.
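One possible shape for the correction is sketched below, assuming user IDs are in column 1 and item IDs in column 2. All function names here are hypothetical and are not part of rectools. The idea is to map the original IDs to consecutive codes before fitting, keep the maps, and use them again at prediction time.

```r
# The sorted distinct IDs serve as the lookup table: position = new code.
makeIdMap <- function(ids) sort(unique(ids))

# Original IDs -> consecutive codes 1..n; IDs absent from the map give NA.
toConsecutive <- function(ids, idMap) match(ids, idMap)

# Consecutive codes -> original IDs.
toOriginal <- function(codes, idMap) idMap[codes]

# Hypothetical preprocessing step: recode both ID columns and return the
# maps alongside the data, so they can be stored with the fitted object.
remapRatings <- function(ratings) {
  userMap <- makeIdMap(ratings[[1]])
  itemMap <- makeIdMap(ratings[[2]])
  ratings[[1]] <- toConsecutive(ratings[[1]], userMap)
  ratings[[2]] <- toConsecutive(ratings[[2]], itemMap)
  list(ratings = ratings, userMap = userMap, itemMap = itemMap)
}
```

Test-set IDs would then be passed through toConsecutive() with the training maps before prediction; an NA result flags a user or item that never appeared in training.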
Problem C
Here you will write a function, intended for use with the small
MovieLens data, that determines, for each user, her favorite genre.
Here are the details:
- The call form will be simply favGenre(), with no arguments.
- The function must not assume that any MovieLens data is in memory at
the time of the call; it must load whatever it needs itself.
- The return value will be a vector of numeric genre codes, in the
range 0 through 18. The ith element will be the favorite
genre of user i. If for a given user there is a tie for most
frequently-rated genre, take the lowest ID. (Ideally, one would
randomize, but I'd like submission of homework to be deterministic if
possible.)
- Place your code in a file ProblemC.R.
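The counting and tie-breaking step can be sketched as follows, assuming the data has already been read and reshaped into one (user, genre code) pair per rated genre. The function name favFromPairs() and its arguments are hypothetical, and reading the MovieLens files is not shown. The sketch also assumes user IDs are 1, 2, ..., n, so that the ith row of the table is user i.

```r
# userID: user ID for each (rating, genre) pair
# genre:  genre code (0 through nGenres - 1) for each pair
favFromPairs <- function(userID, genre, nGenres = 19) {
  # Rows: users in sorted ID order; columns: genre codes 0..nGenres-1,
  # including genres that a user never rated (count 0).
  counts <- table(factor(userID),
                  factor(genre, levels = 0:(nGenres - 1)))
  # which.max() returns the position of the FIRST maximum, so ties go
  # to the lowest genre code; subtract 1 to convert position to code.
  fav <- apply(counts, 1, which.max) - 1
  unname(fav)
}

# Toy check: user 2 has a tie between genres 3 and 7.
favFromPairs(userID = c(1, 1, 1, 2, 2, 3),
             genre  = c(5, 5, 2, 7, 3, 18))
```

Fixing the factor levels at 0 through 18 is what makes the column position line up with the genre code, even when some genres never appear in the data.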