Homework 2
Due Monday, February 7, 11:59 pm
Submission Rules and Advice
In this homework, you are asked to write up a report on what you did,
why you did it, and what the results were. Your writeup must be in
LaTeX. Include your .tex and .pdf files, as well as any
image files, in your submission, along with your R files. Images must
be generated in R, and the associated code included.
In this and subsequent assignments, you are required to use
qeML for all basic
prediction operations (linear model, RFs, etc.). For matrix
factorization (coming soon), use rectools.
As before, if asked to write a general function, its code cannot
be tailored to MovieLens, say.
Problem 1
As promised, this is a "gentler" version of the original Problem 1.
Develop a class 'virtualRatings' via which users (i.e. those who
write code using this class) will have the illusion that they are
accessing a physical ratings matrix stored in memory. This illusion
will come from the fact that they access the ratings in the typical
"a[i,j]" fashion. Here are the details:
- The data will be read-only and uncached.
- The class will have just one component, inputDF, which will
also be the sole argument of the constructor function,
virtualRatings().
- The inputDF argument will be a 3-column matrix/data frame in
the usual (userID, itemID, rating) format.
- In a[i,j], there will be exactly one i and exactly one j; e.g.
a[5,] is not allowed.
- The return value of `[.virtualRatings()` will be the rating
that the specified user gave to the specified item. If there is none,
NA is returned. If there is more than one, the first instance will be
returned, but the function will also call the built-in R function
warning() with an appropriate message.
Example:
> iDF <- rbind(c(2,5,1),c(3,5,4),c(2,5,2),c(6,1,5))
> iDF
[,1] [,2] [,3]
[1,] 2 5 1
[2,] 3 5 4
[3,] 2 5 2
[4,] 6 1 5
> ex <- virtualRatings(iDF)
> ex
$inputDF
[,1] [,2] [,3]
[1,] 2 5 1
[2,] 3 5 4
[3,] 2 5 2
[4,] 6 1 5
attr(,"class")
[1] "virtualRatings"
> ex[3,5]
[1] 4
> ex[5,5]
[1] NA
> ex[2,5]
[1] 1
Warning message:
In `[.virtualRatings`(ex, 2, 5) : multiple instances found
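The lookup rule above (no match gives NA, several matches give the first one plus a warning) can be tried out as a plain function before it is wired into the subscripting method. The following is only a minimal sketch: lookupRating() is a hypothetical helper name, not part of the required interface, and it assumes the user ID, item ID and rating sit in columns 1, 2 and 3.

```r
# Hypothetical helper illustrating the lookup rule only;
# assumes columns are (userID, itemID, rating), in that order.
lookupRating <- function(ratings, i, j) {
  # indices of rows where both the user ID and the item ID match
  hits <- which(ratings[, 1] == i & ratings[, 2] == j)
  if (length(hits) == 0) return(NA)                    # no such rating
  if (length(hits) > 1) warning("multiple instances found")
  ratings[hits[1], 3]                                  # first matching rating
}

iDF <- rbind(c(2,5,1), c(3,5,4), c(2,5,2), c(6,1,5))
lookupRating(iDF, 3, 5)   # one match
lookupRating(iDF, 5, 5)   # no match, so NA
```

The same body works for a matrix or a data frame, since both support the [row, column] indexing used here.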
Problem 2
Apply collaborative filtering to the following datasets:
- MovieLens, 100K version, as we have been using.
- InstEval, included with rectools. Use v.1.0.7, calling
getInstEval().
- House Voting,
from the UCI Machine Learning Repository. Here the "users" are Members
of Congress, the "items" are bills, and the "ratings" are Yes or No
votes. Not every MC voted on every bill, so it's a classical collaborative
filtering problem.
Note carefully: You must first convert the data to (user ID, item
ID, rating, side information) format. Make sure to include your code for
this, as well as a test for a couple of cases.
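As an illustration of the conversion step, here is a rough sketch of reshaping a "wide" vote table (one row per member, one column per bill) into the long format. It runs on a small made-up table rather than the real file. The function name wideToLong(), the column names, and the coding of votes as "y", "n" and "?" (for no recorded vote) are assumptions to check against the actual data.

```r
# Toy stand-in for the wide vote table; names and codes are assumptions.
votes <- data.frame(
  party = c("democrat", "republican", "democrat"),
  bill1 = c("y", "n", "?"),
  bill2 = c("?", "y", "n"),
  stringsAsFactors = FALSE
)

# Hypothetical converter: sideCols gives the side-information columns,
# every other column is treated as one item (bill).
wideToLong <- function(wide, sideCols = 1, missingCode = "?") {
  voteCols <- setdiff(seq_len(ncol(wide)), sideCols)
  out <- NULL
  for (k in seq_along(voteCols)) {
    v <- wide[[voteCols[k]]]
    keep <- which(v != missingCode)        # members who actually voted
    if (length(keep) == 0) next
    out <- rbind(out, data.frame(
      userID = keep,                       # row number serves as user ID
      itemID = k,                          # position among the vote columns
      rating = as.integer(v[keep] == "y"), # 1 = Yes, 0 = No
      wide[keep, sideCols, drop = FALSE]   # carry side information along
    ))
  }
  rownames(out) <- NULL
  out
}

long <- wideToLong(votes)
long
```

Missing votes are simply dropped, which is what makes the result a sparse ratings table rather than a full matrix.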
Here you will use only linear or logistic models, but you definitely are
allowed to use side information (including generating new variables from
the original data). See how well you can do!
But keep in mind that we are working with limited data, both in size and in
richness of information. Results will likely be modest.
Extra Credit
Let A denote the ratings matrix, r x s, for r users and s items. Keep
in mind that r might be, say, in the hundreds of millions, and s in the
hundreds of thousands. The product rs could be huge! We probably don't
want to store A in memory, and maybe not even on disk.
One solution is to generate the elements of A on demand, keeping a cache
of the ones requested so far. Of course, this could be quite
sophisticated, but we will keep it simple here.
- You will develop an R S3 class 'ratingsCache'.
- There are many tutorials on S3. Here
is one by someone I highly respect. R also offers more advanced OOP
with S4 and R6 classes, but we will use S3, preferred by many, maybe
most, serious R developers. In brief: To create an S3 class, one makes an R
list with the desired components, then bestows a class name on that
list.
One can do operator overloading in S3 on R generic functions, e.g.
create a class 'abc' and then define plot.abc(). Whenever
plot() is called on an object of this class, the call will be
dispatched to this function.
Note that "In R, everything is an object, and all operations
are functions," e.g. +:
> 3+8
[1] 11
> '+'(3,8)
[1] 11
Operations can thus be overloaded:
> u <- list(v=2)
> class(u) <- 'abc'
> v <- list(v=5)
> class(v) <- 'abc'
> '+.abc' <- function(a,b) a$v * b$v
> u + v
[1] 10
- Assume the user and item IDs start at 1 and are consecutive, but the
code will be otherwise general.
- The class will consist of these components:
- inputDF: Data frame consisting of user ID, item ID,
rating and possible side information (latter not directly used here).
- Aij: These are the cached values. Every time a query is
made on the ratings matrix, Aij will be checked first. If
present, the cached value is returned. Otherwise, it is determined
(by going through the input file), added to the cache, and returned;
if the rating is not in the input file, NA is returned.
- Aij will be an R list, element i of which will be a
vector of all the cached values for user i. The
vector will have its elements named according to the ID of item
j.
- For instance, say Aij[['3']]['8'] = 2. That means that user 3 has
given item 8 a rating of 2. It also means that there has previously
been a query for this i and j.
- Aij[['k']] won't exist until the first query for user k
comes in. Similarly, Aij[['k']]['m'] won't exist
until the rating for that user-item combination is queried.
- You will overload the R subscripting operator! Use this kind of
approach:
> w <- rbind(1:3,6:8)
> w[2,1]
[1] 6
> '['(w,2,1)
[1] 6
> a <- list(x=3, y=8)
> class(a) <- 'z'
> '[.z' <- function(obj,i,j) i*obj$x - j*obj$y
> a[1,3]
[1] -21
- Include a function named ratingsCache() that acts as a
constructor. It will: create an R list that will serve as the class
object; set the inputDF component of that list to the input data
frame (e.g. ml100); initialize Aij to NULL;
and bestow the class name.
- Include code where you demonstrate the above on the MovieLens data:
You create an instance ml of the class from the MovieLens data, then do,
say,
> ml[196,242] # should return 3
> ml[244,51] # should return 2
Remember, the point is to give the illusion that we are actually
accessing the matrix in memory!
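The Aij layout described above (a list with one named vector per user, names being item IDs) can be explored on its own before building the class. The sketch below is only an illustration of that data structure, not the required implementation. cachePut() and cacheGet() are hypothetical helper names, and the cache starts as list() rather than NULL to keep the indexing behavior simple.

```r
# Hypothetical helpers for a list-of-named-vectors cache.
# Keys are character versions of the user and item IDs.
cachePut <- function(cache, i, j, r) {
  ui <- as.character(i)
  ij <- as.character(j)
  v <- cache[[ui]]     # NULL if user i has not been queried yet
  v[ij] <- r           # creates or extends the named vector
  cache[[ui]] <- v
  cache                # R copies on modify, so return the updated cache
}

cacheGet <- function(cache, i, j) {
  v <- cache[[as.character(i)]]
  ij <- as.character(j)
  if (is.null(v) || !(ij %in% names(v))) return(NULL)  # cache miss
  v[[ij]]                                              # cached value, may be NA
}

cache <- cachePut(list(), 3, 8, 2)
cache[['3']]['8']        # the entry for user 3, item 8
cacheGet(cache, 3, 9)    # NULL: this pair has not been queried
```

Returning NULL on a miss keeps it distinct from a cached NA, which records that the pair was already looked up and has no rating. Note also that cachePut() returns a modified copy, so code that updates the cache inside a subscripting method has to arrange for that update to persist.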