Homework II

Due Thursday, November 1.

Problem A

In this problem, you will use matrix factorization methods. Here rectools provides wrappers to the fast and powerful recosystem package. We'll use the MovieLens data (original 100K version), without covariates.

Details:

(a) Use Nonnegative Matrix Factorization (NMF), fitting with trainReco() and predicting with the provided method for generic predict().

(b) Split your data into training and test sets, as in Hwk I. Follow those instructions exactly, except with MovieLens instead of InstEval.

(c) For a range of values of the rank r, fit NMF on the training set and predict on the test set, calculating MAPE. Graph MAPE against r.

(d) Repeat part (c), but in this case predict on the training set. Graph, plotting this curve on the same graph as (c), i.e. two curves on one graph.

Write about your results -- prose and graphs -- in ProblemA.tex.

Place your code in a file ProblemA.R that you will include in your submission. Your .tex file, image files and the resulting .pdf file will also be in your submission.
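
As a rough illustration of the bookkeeping for the rank sweep, here is a minimal base-R sketch. It assumes MAPE means mean absolute prediction error. The names mape() and plotBoth() are hypothetical helpers, not part of rectools, and the toy ratings are made up rather than taken from MovieLens.

```r
# Mean absolute prediction error (the assumed meaning of MAPE here);
# NA predictions are ignored.
mape <- function(pred, actual) mean(abs(pred - actual), na.rm = TRUE)

# Toy check with made-up ratings, not MovieLens output.
actual <- c(4, 3, 5, 2)
pred <- c(3.5, 3, 4, 3)
toyMAPE <- mape(pred, actual)

# Hypothetical helper: two curves on one graph. plot() draws the first
# curve, lines() adds the second to the same axes.
plotBoth <- function(ranks, testMAPE, trainMAPE) {
  plot(ranks, testMAPE, type = "b",
       ylim = range(c(testMAPE, trainMAPE)),
       xlab = "rank r", ylab = "MAPE")
  lines(ranks, trainMAPE, type = "b", lty = 2)
  legend("topright", legend = c("test", "training"), lty = c(1, 2))
}
```

Here plotBoth() would be called with one MAPE value per rank in each of the last two arguments. Setting ylim from both vectors keeps the second curve from being clipped.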

Problem B

When working with RS data, one must be very careful with user and item IDs, as they may not be consecutive. This can be disastrous if not accounted for.

Demonstrate that we run into major problems with the InstEval data using NMF if we mistakenly assume consecutive IDs.

Extend trainReco() and the associated method for the generic predict() to check for, and correct, this problem. This will entail adding another component to the return value of the former, and another argument to the latter.

Name your .tex file ProblemB.tex, and place your R code in a file ProblemB.R. Include these, your image files and the resulting .pdf file in your submission.
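
One possible way to detect and repair nonconsecutive IDs is sketched below in base R. The helpers isConsecutive() and makeConsecutive() are hypothetical names, not rectools functions, and the toy IDs are made up rather than taken from InstEval. How the lookup table is stored in the trainReco() return value and passed to predict() is left open.

```r
# Hypothetical check: do the distinct IDs form exactly 1, 2, ..., k?
isConsecutive <- function(ids) {
  u <- sort(unique(ids))
  all(u == seq_along(u))
}

# Hypothetical helper: map arbitrary IDs to consecutive integers 1..k.
# Returns the recoded IDs plus the lookup table needed at prediction time.
makeConsecutive <- function(ids) {
  lookup <- sort(unique(ids))          # original IDs, in increasing order
  list(newIDs = match(ids, lookup),    # position in lookup = new ID
       lookup = lookup)
}

# Toy IDs with gaps (made up).
userIDs <- c(3, 7, 7, 12, 3)
m <- makeConsecutive(userIDs)
m$newIDs                      # 1 2 2 3 1

# New data must be translated with the same lookup; IDs not seen in
# training come back as NA and need separate handling.
newIDs <- match(c(12, 5), m$lookup)   # 3 NA
```

Keeping the lookup alongside the fitted object is what lets predictions be requested in terms of the original IDs.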

Problem C

Here you will write a function, intended for use with the small MovieLens data, that determines, for each user, her favorite genre. Here are the details:

The call form will be simply favGenre(), with no arguments.

At the time of the call, the function must not assume that any MovieLens data is currently in memory.

The return value will be a vector of numeric genre codes, in the range 0 through 18. The i-th element will be the favorite genre of user i, i.e. the genre that user has rated most frequently. If for a given user there is a tie for most frequently rated genre, take the lowest ID. (Ideally, one would randomize, but I'd like submission of homework to be deterministic if possible.)

Place your code in a file ProblemC.R.
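
The tie-breaking rule can be illustrated with a small base-R sketch. It assumes a user-by-genre matrix of rating counts has already been built. The matrix counts below is made-up toy data, not MovieLens, and the name is hypothetical.

```r
# Toy user-by-genre count matrix: rows are users, columns correspond to
# genre codes 0..18 (made-up counts).
counts <- matrix(0, nrow = 3, ncol = 19)
counts[1, c(2, 5)] <- c(4, 9)    # user 1: clear winner in column 5 (code 4)
counts[2, c(3, 8)] <- c(6, 6)    # user 2: tie between codes 2 and 7
counts[3, 19] <- 1               # user 3: only genre code 18

# which.max() returns the index of the FIRST maximum, so ties resolve to
# the lowest column; subtracting 1 converts 1-based columns to codes 0..18.
fav <- unname(apply(counts, 1, which.max)) - 1
fav   # 4 2 18
```

Because which.max() always picks the first maximum, the result is deterministic from run to run.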