Homework 2

Due Monday, February 7, 11:59 pm

Submission Rules and Advice

In this homework, you are asked to write up a report on what you did, why you did it, and what the results were. Your writeup must be in LaTeX. Include your .tex and .pdf files, as well as any image files, in your submission, along with your R files. Images must be generated in R, and the associated code included.

In this and subsequent assignments, you are required to use qeML for all basic prediction operations (linear model, RFs, etc.). For matrix factorization (coming soon), use rectools.

As before, if asked to write a general function, its code cannot be tailored to a particular dataset, say MovieLens.

Problem 1

As promised, this is a "gentler" version of the original Problem 1.

Develop a class 'virtualRatings' via which users (i.e. those who write code using this class) will have the illusion that they are accessing a physical ratings matrix stored in memory. This illusion will come from the fact that they access the ratings in the typical "a[i,j]" fashion. Here are the details:

Example:

> iDF <- rbind(c(2,5,1),c(3,5,4),c(2,5,2),c(6,1,5))
> iDF
     [,1] [,2] [,3]
[1,]    2    5    1
[2,]    3    5    4
[3,]    2    5    2
[4,]    6    1    5
> ex <- virtualRatings(iDF)
> ex
$inputDF
     [,1] [,2] [,3]
[1,]    2    5    1
[2,]    3    5    4
[3,]    2    5    2
[4,]    6    1    5

attr(,"class")
[1] "virtualRatings"

> ex[3,5]
[1] 4
> ex[5,5]
[1] NA
> ex[2,5]
[1] 1
Warning message:
In `[.virtualRatings`(ex, 2, 5) : multiple instances found
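
The behavior in the example can be obtained with R's S3 mechanism, by defining a method for the `[` operator on the class. What follows is only a minimal sketch of one possible shape, not the required design. It assumes, as the example suggests but the text does not state, that the input columns are (user ID, item ID, rating), and that when a (user, item) pair occurs more than once the first matching rating is returned with a warning.

```r
# Constructor: wrap the input in an S3 object of class 'virtualRatings'.
# Assumption: the columns of inputDF are (user ID, item ID, rating).
virtualRatings <- function(inputDF) {
   obj <- list(inputDF = inputDF)
   class(obj) <- "virtualRatings"
   obj
}

# Indexing method: x[i, j] looks up the rating user i gave item j.
`[.virtualRatings` <- function(x, i, j) {
   df <- x$inputDF
   # rows in which both the user ID and the item ID match
   hits <- which(df[, 1] == i & df[, 2] == j)
   if (length(hits) == 0) return(NA)   # no such rating recorded
   if (length(hits) > 1) warning("multiple instances found")
   df[hits[1], 3]                      # first match, by assumption
}
```

Since the method is named for the class, a call such as ex[3,5] is dispatched to it automatically, and no r x s matrix is ever formed.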

Problem 2

Apply collaborative filtering to the following datasets:

Here you will use only linear or logistic models, but you definitely are allowed to use side information (including generating new variables from the original data). See how well you can do!

But keep in mind that we are working with limited data, both in size and in richness of information. Results will likely be modest.
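
As an illustration of generating new variables from the original data, here is a small sketch in base R. The toy data frame and its column names (user, item, rating) are made up for this example and are not one of the assigned datasets. The two derived columns are per-user and per-item mean ratings, which could then be supplied as predictors to the qeML linear or logistic model function.

```r
# Toy ratings data, invented purely for illustration.
ratings <- data.frame(
   user   = c(1, 1, 2, 2, 3),
   item   = c(10, 11, 10, 12, 11),
   rating = c(4, 5, 2, 3, 5)
)

# ave() computes a group mean and returns it aligned with the rows.
# New variable 1: the mean rating given by each row's user.
ratings$userMean <- ave(ratings$rating, ratings$user)
# New variable 2: the mean rating received by each row's item.
ratings$itemMean <- ave(ratings$rating, ratings$item)
```

Note that these means include the row's own rating. In a real analysis they should be computed from the training set only, so that holdout ratings do not leak into the predictors.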

Extra Credit

Let A denote the ratings matrix, r x s, for r users and s items. Keep in mind that r might be, say, in the hundreds of millions, and s in the hundreds of thousands. The product rs, the number of elements of A, could be huge! We probably don't want to store A in memory, and maybe not even on disk.

One solution is to generate the elements of A on demand, keeping a cache of the ones requested so far. Of course, this could be quite sophisticated, but we will keep it simple here.
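
The generate-on-demand idea can be sketched with an R environment serving as the cache. This is a simple illustration under stated assumptions, not a prescribed design: makeCachedA and genFtn are hypothetical names, and genFtn(i, j) stands for whatever user-supplied function produces the element A[i,j].

```r
# Return an accessor that computes A[i,j] on demand and caches each result.
# genFtn(i, j) is a hypothetical function that generates element A[i,j].
makeCachedA <- function(genFtn) {
   cache <- new.env(hash = TRUE)   # maps "i,j" string keys to values
   function(i, j) {
      key <- paste(i, j, sep = ",")
      # generate the element only if it has not been requested before
      if (!exists(key, envir = cache, inherits = FALSE)) {
         assign(key, genFtn(i, j), envir = cache)
      }
      get(key, envir = cache, inherits = FALSE)
   }
}
```

Only the elements actually requested are ever stored, so memory use grows with the number of distinct requests rather than with rs.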