Homework II

Due Monday, February 24

Problem A

Here you will analyze the MovieLens data, 100K version, predicting movie rating. The assignment is largely open-ended, subject to some rules:

You'll use only lm() and NMF matrix factorization methods, and in general are allowed to use only methods from our course. You may run lars() instead of lm() if you wish.

You must use at least some covariates (age, gender, genre etc.), and compare their effectiveness to using none. You must also try different rank values in the NMF case.

In all cases you present in your writeup, you must run both approaches, lm() and NMF.

For NMF, you must use the recosystem package. It uses R6 classes, which are quite different from S3 and S4. You are welcome to look at the wrappers in the rectools package for guidance, but your code must use recosystem directly. (Note: This does not constitute an endorsement of R6 on my part.)
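As a rough orientation to the R6 calling style, here is a minimal sketch of fitting one NMF model with recosystem and pulling out the factor matrices. The function name fitNMF and its arguments are hypothetical, the option values are placeholders rather than recommendations, and the sketch assumes user and movie IDs are 1-based integers.

```r
# Sketch only; assumes the recosystem package is installed.
# fitNMF, users, movies, ratings and rank are hypothetical names.
fitNMF <- function(users, movies, ratings, rank) {
  r <- recosystem::Reco()   # R6 object: methods are invoked as r$method()
  # index1 = TRUE tells recosystem that IDs start at 1 rather than 0
  trn <- recosystem::data_memory(users, movies, rating = ratings, index1 = TRUE)
  # dim is the rank; nmf = TRUE requests nonnegative factors
  r$train(trn, opts = list(dim = rank, nmf = TRUE, niter = 20, verbose = FALSE))
  # out_memory() returns the factors as R matrices instead of writing files
  fac <- r$output(recosystem::out_memory(), recosystem::out_memory())
  # recosystem calls the factors P (users x rank) and Q (movies x rank)
  list(W = fac$P, H = t(fac$Q))
}
```

Calling fitNMF() once per candidate rank (say, inside lapply()) and scoring each fit on your own holdout set is one way to organize the rank comparison.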

If you use ZIP code as a feature, I would suggest using only the first 2 digits.
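One possible way to do the truncation is sketched below. The helper name zip2 is hypothetical, and the sketch assumes the ZIP column is character or factor; converting it to numeric first would drop leading zeros.

```r
# Keep only the first two characters of each ZIP code.
# Assumption: zip is character or factor, not numeric.
zip2 <- function(zip) {
  factor(substr(as.character(zip), 1, 2))
}

zip2(c("95616", "95618", "02139"))
```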

Make a complete report, stating what you did, why you did it, and what outcome ensued. Use graphics to help illustrate your narrative.

Negative results are just as important as positive ones. If you find that, say, genre is not a useful predictor, state your results and venture an opinion as to why it wasn't helpful.

In the end, you will choose one final best model for each of the linear and NMF approaches. Below is how to package them. Make sure to follow the directions to the letter, as the TA will write scripts to automate the process, and you don't want the TA in a bad mood when his scripts crash. :-)

The TA will form the data frame u.big from p.30, with 19 extra columns for genre. For the latter, he will use the order in u.genre, with the genre names listed there. He will then partition the data into secret training and test/holdout sets of size 95,000 and 5,000. He will read and unpack a file HwkII.tar that you submit. That file will contain files HwkII.R and WH.RData. (NO subdirectories in your .tar file!)

HwkII.R will contain functions with call forms
```
lmFinalModel(u.big.tst)
nmfFinalModel(u.big.tst)
```
The first of these two will have your best-model coefficients hardwired in (which you will have found previously; don't include the code for that computation in this function). The second will read your best-model W and H from the saved file. Note that the coefficients, W and H are what you got from YOUR training data.

Both functions will use the TA's u.big.tst (NOT yours), extract the necessary columns, create any new ones, etc. Then they will predict the ratings in u.big.tst using your best model, and return the Mean Absolute Prediction Error (MAPE).

WH.RData will contain your best W and H matrices, named 'W' and 'H', and saved via R's save() function.
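A minimal sketch of the save/load round trip follows. The matrices here are random placeholders, and the shapes (W as users x rank, H as rank x movies) are an assumption about how you orient your factors. The sketch writes to a temporary directory; your submission would use the file name WH.RData.

```r
# Placeholder factor matrices; in practice these come from your recosystem fit.
W <- matrix(runif(6), nrow = 3)   # 3 users, rank 2
H <- matrix(runif(8), nrow = 2)   # rank 2, 4 movies

outFile <- file.path(tempdir(), "WH.RData")
save(W, H, file = outFile)

# load() restores objects under the names they were saved with, here 'W' and 'H'.
chk <- new.env()
load(outFile, envir = chk)
ls(chk)
```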

The TA will generate his own training and test/holdout datasets, and call the two functions above on his test set. Your code must make use of u.big from p.30 of our text, modified so that it has an extra column at the end, 'genre'.

Example: Say your best linear model uses user ID, movie ID, age, gender and 2-digit ZIP as features. You would write your lmFinalModel() to:
- Extract the columns for user and movie ID, age, gender, rating, and ZIP from u.big.tst.
- Replace ZIP by the first two digits.
- Use your hardwired coefficients to predict the ratings in u.big.tst, and compute MAPE.
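The steps above can be sketched as follows, cut down to age and gender for brevity (the user ID, movie ID and ZIP terms would be handled the same way, with one coefficient per dummy variable). The coefficient values are invented for illustration, and the column names 'age', 'gender' and 'rating' are assumptions about how u.big is set up.

```r
lmFinalModel <- function(u.big.tst) {
  # HYPOTHETICAL hardwired coefficients, for illustration only;
  # replace with the coef() values from your own fitted model.
  b0 <- 3.4
  bAge <- 0.004
  bMale <- -0.02
  # Assumed column names: 'age', 'gender' (coded "M"/"F"), 'rating'.
  male <- as.numeric(u.big.tst$gender == "M")
  preds <- b0 + bAge * u.big.tst$age + bMale * male
  # MAPE: mean absolute difference between predicted and actual ratings
  mean(abs(preds - u.big.tst$rating))
}
```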

Example: Your NMF best-model function nmfFinalModel() would then do the same thing, except that instead of using hardwired coefficients, it would load the saved W and H matrices and use them to predict u.big.tst.
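A minimal sketch of such a function is shown below. It assumes W is users x rank and H is rank x movies, that user and movie IDs index the rows of W and the columns of H directly, and that the test columns are named 'user', 'movie' and 'rating'; all of these are assumptions to adapt to your own setup. The optional second argument whFile is a hypothetical convenience for testing, and the required call form nmfFinalModel(u.big.tst) still works.

```r
nmfFinalModel <- function(u.big.tst, whFile = "WH.RData") {
  e <- new.env()
  load(whFile, envir = e)   # restores 'W' and 'H' into e
  # Assumed column names: 'user', 'movie', 'rating'.
  u <- u.big.tst$user
  m <- u.big.tst$movie
  # Predicted rating for each (user, movie) pair: dot product of row u of W
  # with column m of H.
  preds <- rowSums(e$W[u, , drop = FALSE] * t(e$H[, m, drop = FALSE]))
  # MAPE: mean absolute difference between predicted and actual ratings
  mean(abs(preds - u.big.tst$rating))
}
```

This sketch does not handle users or movies that are absent from the training data; you would need to decide how to predict for those cases.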

EVALUATION AND CREDIT:
- There is a lot of work to be done here, so I will count this as a double assignment.
- If you do only the minimum -- linear and matrix factorization approaches, with at least one covariate -- but you do well in answering the TA's questions during interactive grading, you should get an A grade on this assignment. However, this is really a warmup for the Term Project, so you should go much deeper.
- In addition to the TA's grade, I will read each of your reports, making comments that should be helpful for your term project, and will give Extra Credit to those that explore this data thoroughly and write a clear report.
- There will also be Extra Credit given to the three teams with the best prediction results, as measured on the TA's holdout set. (This Extra Credit will be independent of the one I might give.)
- It is REQUIRED that your report contain a section titled, "Who Did What," explaining the contribution of each team member.
Enjoy! You'll learn a lot here.