Due Monday, February 24
Here you will analyze the MovieLens data, 100K version, predicting movie rating. The assignment is largely open-ended, subject to some rules:
The TA will form the data frame u.big from p.30, with 19 extra columns for genre. For the latter, he will use the order in u.genre, with the genre names listed there. He will then partition the data into secret training and test/holdout sets of sizes 95,000 and 5,000, respectively. He will read and unpack a file HwkII.tar that you submit. That file will contain the files HwkII.R and WH.RData. (NO subdirectories in your .tar file!)
HwkII.R will contain functions with call forms
lmFinalModel(u.big.tst)
nmfFinalModel(u.big.tst)
The first of these two will have your best-model coefficients hardwired in (you will have found these previously; don't include the code that computes them in this function). The second will read your best-model W and H from the saved file. Note that the coefficients, W and H are what you got from YOUR training data.
They will use the TA's u.big.tst (NOT yours), extract the necessary columns, create any new ones, etc. Then they will predict the ratings in u.big.tst using your best model, and return the Mean Absolute Prediction Error (MAPE).
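As a minimal sketch of the last step, MAPE is the mean of the absolute differences between predicted and actual ratings. The helper name mape() and the toy numbers are assumptions made for illustration only; they are not part of the required interface and are not MovieLens results.

```r
# Hypothetical helper (not part of the required interface):
# mean absolute prediction error between two numeric vectors.
mape <- function(predicted, actual) {
  mean(abs(predicted - actual))
}

# Toy check with made-up numbers.
pred   <- c(3.5, 4.0, 2.0)
actual <- c(4, 4, 1)
toyMape <- mape(pred, actual)  # (0.5 + 0 + 1) / 3 = 0.5
print(toyMape)
```

Each of your two functions would end by returning a value computed along these lines from its own predictions and the ratings in u.big.tst.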
WH.RData will contain your best W and H matrices, named 'W' and 'H', and saved via R's save() function.
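A minimal sketch of the save/load round trip, using tiny made-up matrices in place of your real factors. The temporary directory is an assumption made so the sketch does not overwrite anything; for submission the file would simply be WH.RData.

```r
# Toy stand-ins; your real W and H come from your own NMF fit.
W <- matrix(c(1, 2, 3, 4), nrow = 2)
H <- matrix(c(0.5, 0.1, 0.2, 0.3), nrow = 2)

# Save both objects under the names 'W' and 'H'.
whFile <- file.path(tempdir(), "WH.RData")  # assumption: temp location for the demo
save(W, H, file = whFile)

# load() restores the objects under their saved names; loading into a
# fresh environment shows that the names 'W' and 'H' come from the file.
e <- new.env()
load(whFile, envir = e)

# The product of the two factors approximates the ratings matrix.
approxRatings <- e$W %*% e$H
print(dim(approxRatings))
```

Inside nmfFinalModel(), a plain load("WH.RData") would place W and H in the function's own environment, since load() defaults to the calling environment.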
The TA will generate his own training and test/holdout datasets, and call the two functions above with his own partitioned data. Your code must make use of u.big from p.30 of our text, modified so that it has an extra column at the end, 'genre'.
Example: Say your best linear model uses user ID, movie ID, age, gender and 2-digit ZIP as features. You would write your lmFinalModel() to: