Homework 2

Due Monday, February 7, 11:59 pm

Submission Rules and Advice

In this homework, you are asked to write up a report on what you did, why you did it, and what the results were. Your writeup must be in LaTeX. Include your .tex and .pdf files, as well as any image files, in your submission, along with your R files. Images must be generated in R, and the associated code included.

In this and subsequent assignments, you are required to use qeML for all basic prediction operations (linear model, RFs, etc.). For matrix factorization (coming soon), use rectools.

As before, if asked to write a general function, its code cannot be tailored to MovieLens, say.

Problem 1

As promised, this is a "gentler" version of the original Problem 1.

Develop a class 'virtualRatings' via which users (i.e. those who write code using this class) will have the illusion that they are accessing a physical ratings matrix stored in memory. This illusion will come from the fact that they access the ratings in the typical "a[i,j]" fashion. Here are the details:

The data will be read-only and uncached.

The class will have just one component, inputDF, which will also be the sole argument of the constructor function, virtualRatings().

The inputDF argument will be a 3-column matrix/data frame in the usual (userID, itemID, rating) format.

In a[i,j], there will be exactly one i and exactly one j; e.g. a[5,] is not allowed.

The return value of `[.virtualRatings`() will be the rating that the specified user gave to the specified item. If there is none, NA is returned. If there is more than one, the first instance will be returned, but the function will also call the built-in R function warning() with an appropriate message.

Example:

> iDF <- rbind(c(2,5,1),c(3,5,4),c(2,5,2),c(6,1,5))
> iDF
     [,1] [,2] [,3]
[1,]    2    5    1
[2,]    3    5    4
[3,]    2    5    2
[4,]    6    1    5
> ex <- virtualRatings(iDF)
> ex
$inputDF
     [,1] [,2] [,3]
[1,]    2    5    1
[2,]    3    5    4
[3,]    2    5    2
[4,]    6    1    5

attr(,"class")
[1] "virtualRatings"

> ex[3,5]
[1] 4
> ex[5,5]
[1] NA
> ex[2,5]
[1] 1
Warning message:
In `[.virtualRatings`(ex, 2, 5) : multiple instances found
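
The duplicate case in the example depends on the fact that warning() signals a condition but does not stop execution, so a function can warn and still return a value. Here is a minimal sketch of that behavior alone, using a hypothetical helper firstOrNA() that is not part of the assignment:

```r
# firstOrNA() is a hypothetical helper, for illustration only.
# It returns NA for an empty vector, otherwise the first element,
# and issues a warning if there was more than one element.
firstOrNA <- function(x) {
   if (length(x) == 0) return(NA)
   if (length(x) > 1) warning('multiple instances found')
   x[1]
}

# The value is still returned even though a warning was issued
val <- suppressWarnings(firstOrNA(c(1, 2)))
# Capture the warning text instead of the value
msg <- tryCatch(firstOrNA(c(1, 2)), warning = function(w) conditionMessage(w))
# Empty input: NA, with no warning
empty <- firstOrNA(numeric(0))
```

In interactive use one would simply call the function; suppressWarnings() and tryCatch() are used here only to show the two outcomes separately.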

Problem 2

Apply collaborative filtering to the following datasets:

MovieLens, 100K version, as we have been using.

InstEval, included with rectools. Use v.1.0.7, calling getInstEval().

House Voting, from the UCI Machine Learning Repository. Here the "users" are Members of Congress, the "items" are bills, and the "ratings" are Yes or No votes. Not every MC voted on every bill, so it's a classical collaborative filtering problem.
Note carefully: You must first convert the data to (user ID, item ID, rating, side information) format. Make sure to include your code for this, as well as a test for a couple of cases.
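
A minimal sketch of the conversion step, using a small made-up table in place of the real file. It assumes a wide layout with one row per Member of Congress, a party column, and one column per bill coded 'y', 'n' or '?' (no recorded vote); check this against the actual UCI file before relying on it. The names toyVotes and toLongFormat() are hypothetical.

```r
# Toy stand-in for the voting data; the layout and coding are assumptions.
toyVotes <- data.frame(
   party = c('democrat', 'republican', 'democrat'),
   bill1 = c('y', 'n', '?'),
   bill2 = c('n', '?', 'y'),
   stringsAsFactors = FALSE
)

# Convert a wide vote table to (userID, itemID, rating, side info) rows.
# voteCols: indices of the vote columns; sideCols: side-information columns.
toLongFormat <- function(wide, voteCols, sideCols) {
   votes <- as.matrix(wide[, voteCols, drop = FALSE])
   # (row, column) positions of the votes actually cast
   idx <- which(votes != '?', arr.ind = TRUE)
   long <- data.frame(
      userID = idx[, 1],
      itemID = idx[, 2],
      rating = as.integer(votes[idx] == 'y')   # 1 = Yes, 0 = No
   )
   # attach each user's side information to that user's rows
   side <- wide[long$userID, sideCols, drop = FALSE]
   out <- cbind(long, side)
   rownames(out) <- NULL
   out
}

longVotes <- toLongFormat(toyVotes, voteCols = 2:3, sideCols = 1)
longVotes
```

A spot check in the spirit of the required test: user 3 did not vote on item 1, so longVotes should contain no row with userID 3 and itemID 1.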

Here you will use only linear or logistic models, but you definitely are allowed to use side information (including generating new variables from the original data). See how well you can do!

But keep in mind that we are working with limited data, both in size and in richness of information. Results will likely be modest.

Extra Credit

Let A denote the ratings matrix, r x s, for r users and s items. Keep in mind that r might be, say, in the hundreds of millions, and s in the hundreds of thousands. The product rs could be huge! We probably don't want to store A in memory, and maybe not even on disk.
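
To get a feel for the scale, here is a rough calculation. The values of r and s are assumed, taken at the low end of the ranges mentioned above, and each entry is assumed to be stored as an 8-byte double.

```r
# Illustrative sizes only; r and s are assumed values.
r <- 1e8                   # users
s <- 1e5                   # items
nEntries <- r * s          # number of cells in A
bytes <- 8 * nEntries      # 8 bytes per double-precision value
tb <- bytes / 1e12         # size in (decimal) terabytes
tb
```

Even at these low-end values, a dense A would need about 80 terabytes.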

One solution is to generate the elements of A on demand, keeping a cache of the ones requested so far. Of course, this could be quite sophisticated, but we will keep it simple here.

You will develop an R S3 class 'ratingsCache'.

There are many tutorials on S3. Here is one by someone I highly respect. R also offers more advanced OOP with S4 and R6 classes, but we will use S3, preferred by many, maybe most, serious R developers. In brief: To create an S3 class, one makes an R list with the desired components, then bestows a class name on that list.
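
The "list plus class name" recipe can be sketched as follows. The class 'point2d' and its constructor are hypothetical, chosen only for illustration.

```r
# Hypothetical class 'point2d', for illustration only.
point2d <- function(x, y) {
   obj <- list(x = x, y = y)   # list with the desired components
   class(obj) <- 'point2d'     # bestow the class name
   obj
}

p <- point2d(3, 4)
class(p)                 # the class name set in the constructor
p$x                      # components are accessed as in any list
inherits(p, 'point2d')   # TRUE
```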

One can do operator overloading in S3 on R generic functions, e.g. create a class 'abc' and then define plot.abc(). Whenever plot() is called on an object of this class, the call will be dispatched to this function.
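
A small sketch of this dispatch mechanism, using the built-in generic summary() rather than plot() so that no graphics device is needed. The method body is made up for illustration.

```r
# Toy object of class 'abc'
u <- list(v = 2)
class(u) <- 'abc'

# Method for class 'abc'; R calls it whenever summary() is
# applied to an object of that class
summary.abc <- function(object, ...) {
   paste('abc object with v =', object$v)
}

s1 <- summary(u)   # dispatched to summary.abc()
s1
```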

Note that "In R, everything is an object, and all operations are functions," e.g. +:
```
> 3+8
[1] 11
> '+'(3,8)
[1] 11
```
Operations can thus be overloaded:
```
> u <- list(v=2)
> class(u) <- 'abc'
> v <- list(v=5)
> class(v) <- 'abc'
> '+.abc' <- function(a,b) a$v * b$v
> u + v
[1] 10
```
Assume the user and item IDs start at 1 and are consecutive, but the code will be otherwise general.

The class will consist of these components:
- inputDF: Data frame consisting of user ID, item ID, rating and possible side information (latter not directly used here).
- Aij: These are the cached values. Every time a query is made on the ratings matrix, Aij will be checked first. If present, the cached value is returned. Otherwise, it is determined (by going through the input file), added to the cache, and returned; if the rating is not in the input file, NA is returned.
Aij will be an R list, element i of which will be a vector of all the cached values for user i. The vector will have its elements named according to the ID of item j.

For instance, say Aij[['3']]['8'] = 2. That means that user 3 has given item 8 a rating of 2. It also means that there has previously been a query for this i and j.

Aij[['k']] won't exist until the first query for user k comes in. Similarly, Aij[['k']]['m'] won't exist until the rating for that user-item combination is queried.
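
The layout of Aij described above can be tried out on its own, outside any class. This sketch starts from an empty list() rather than NULL, a choice made here only to keep the first assignment unambiguous; the ratings shown are made up.

```r
# Start from an empty cache (a choice made for this sketch)
Aij <- list()

# First query for user 3: item 8, found to have rating 2
Aij[['3']] <- c('8' = 2)
# Later query for user 3: item 11, found to have rating 5
Aij[['3']]['11'] <- 5

Aij[['3']]['8']              # cached rating, named by item ID
is.null(Aij[['7']])          # TRUE: no query yet for user 7
'9' %in% names(Aij[['3']])   # FALSE: user 3, item 9 not yet cached
```

Checking names() rather than the value itself distinguishes "not yet queried" from "queried, and the rating is NA".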

You will overload the R subscripting operator! Use this kind of approach:

> w <- rbind(1:3,6:8)
> w[2,1]
[1] 6
> '['(w,2,1)
[1] 6

> a <- list(x=3, y=8)
> class(a) <- 'z'
> '[.z' <- function(obj,i,j) i*obj$x - j*obj$y
> a[1,3]
[1] -21

Include a function named ratingsCache() that acts as a constructor. It will: create an R list that will serve as the class object; set the inputDF component of that list to the input data frame (e.g. ml100); initialize Aij to NULL; and bestow the class name.

Include code where you demonstrate the above on the MovieLens data: You create an instance ml of the class from the MovieLens data, then do, say,
```
> ml[196,242]  # should return 3
> ml[244,51]  # should return 2
```
Remember, the point is to give the illusion that we are actually accessing the matrix in memory!