Homework I

Due Thursday, October 18

Notes on Submission Packages

In any homework assignment, put all your files in a single .tar file, with NO SUBDIRECTORIES. When the TA's grading script unpacks the file, all the files should land in the directory from which she issued the unpack command. The name of the .tar file must be as specified in the course Syllabus.

Homework problems will typically ask you to both write code and do data experiments using your code. The writeups for such experiments must be in LaTeX (see Syllabus), submitted in a .tex file. There are tons of tutorials on the Web; mine is quick.

If you have any .tex, .pdf or image files, include them in your submission.

Be sure to use exactly the same function names, file names etc. as in the specs!

Unless otherwise stated, your R code file will consist only of functions, no executable code. In grading your code, the TA will simply call R's source() function from the R command line, and she will then run her own test code that calls your functions.
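To see why top-level executable code gets in the way, here is a minimal sketch of the grading workflow just described. The file and the function addOne() are hypothetical stand-ins, not part of any assignment; a temporary file plays the role of your submitted code file.

```r
# Hypothetical stand-in for a submitted code file: it only defines a
# function and runs nothing at top level.
codeFile <- tempfile(fileext = ".R")
writeLines("addOne <- function(x) x + 1", codeFile)

# Grader's side: load the definitions, then call them from separate test code.
source(codeFile)
addOne(2)
```

Anything your file executes at top level would run during the source() call, before the test code starts.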

Problem A

In this problem, you will extend the k-nearest neighbor capability of the rectools package, in two ways:

In finding our predicted value for a new user/item combination, the function predict.usrData() normally finds the average rating among the near neighbors of that U/I combination. You will extend the function to give the analyst the option of using other measures of location (fancy term for mean, median etc.; things like standard deviation are called measures of dispersion) such as the median, mode or a custom function written by the analyst herself.

(I use the term "analyst" here instead of "user" to distinguish from the latter term in the context of "users and items" in RSs. The analyst is the one who writes the code to call predict.usrData(), actually calling predict(), the generic function.)

You will write a function that updates the database created by formUserData() when new data becomes available.

Details:

The top lines of your revised predict.usrData() will be

```
predict.usrData <- function(origData,newData,newItem,
      k,wtcovs=NULL,wtcats=NULL,locMeasure=mean)
```

The function will use whatever location measure the analyst specifies. (If none is specified, then the default is to use mean().) For instance, the analyst may issue the call
```
predict(oD,nD,nI,k,locMeasure=median) 
```
in which case the built-in R function median() will be used.
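The mechanism here is just that R functions can be passed as arguments. A minimal sketch, in which the helper summarizeNeighbors() and the ratings vector are made up for illustration and are not part of rectools:

```r
# Hypothetical helper, not part of rectools: summarize the ratings of the
# near neighbors with whatever location measure is passed in.
summarizeNeighbors <- function(neighborRatings, locMeasure = mean) {
  locMeasure(neighborRatings)
}

ratings <- c(2, 4, 4, 5)                          # made-up neighbor ratings
summarizeNeighbors(ratings)                       # default mean(): 3.75
summarizeNeighbors(ratings, locMeasure = median)  # median(): 4
```

How the location measure gets applied inside the real predict.usrData() is up to you; the sketch only shows the default-argument pattern.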

An example of a custom function supplied by the analyst might be, say, the average of the minimum and maximum values:
```
minMaxAvg <- function(x) (min(x) + max(x)) / 2
```
Do NOT put this in your file, as it is an example of what the analyst might supply, not you. However, DO offer the analyst the choice of using the mode as a location measure: Include in your file a function with call form
```
vecmode(x)
```
where x is a numeric vector. The term mode means the most frequent value, e.g. 8 in the vector (5,12,13,8,3,4,5,8,6,8). In the case of a tie, return the largest value among those involved in the tie. Try to make good use of R constructs.
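To make the definition concrete, here is a small illustration of the frequency counts involved, using R's table(). It is not a vecmode() implementation, and the second vector y is made up to show a tie.

```r
x <- c(5, 12, 13, 8, 3, 4, 5, 8, 6, 8)
counts <- table(x)   # frequency of each distinct value; names are the values
counts[["8"]]        # 3, the highest count, so the mode is 8
counts[["5"]]        # 2

# A made-up tie: 2 and 7 each occur twice.
y <- c(2, 7, 7, 2, 4)
tab <- table(y)
tied <- as.numeric(names(tab)[tab == max(tab)])
tied                 # 2 7; per the spec, vecmode(y) should return 7
```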

The database update function will have call form

```
updateUserData(usrData,newData)
```

Here usrData is the existing database, and newData is a data frame of the same type as input to formUserData(). The return value is the new database. Note that newData may include both new items for existing users, and new users. Try to make good use of R constructs.

Here is an example for updateUserData():

```
> d <- data.frame(uID=c(1,2,1),iID=c(5,1,8),rat=c(3,3,5))
> d
  uID iID rat
1   1   5   3
2   2   1   3
3   1   8   5
> db <- formUserData(d)
> ndt <- 
+       data.frame(uID=c(5,2),iID=c(3,6),rat=c(4,1))
> ndt
  uID iID rat
1   5   3   4
2   2   6   1
> newdb <- updateUserData(db,ndt)
> db[[2]]
$userID
[1] "2"

$itms
[1] 1

$ratings
1 
3 

attr(,"class")
[1] "usrDatum"
> newdb[[2]]
$userID
[1] "2"

$itms
[1] 1 6

$ratings
1   
3 1 

attr(,"class")
[1] "usrDatum"
> newdb[[3]]
$userID
[1] "5"

$itms
[1] 3

$ratings
[1] 4

attr(,"class")
[1] "usrDatum"
```
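The example contains both cases mentioned earlier: user 2 already exists and gains an item, while user 5 is new. A minimal base-R sketch of telling the two cases apart from the input data frames alone; it does not touch the rectools database objects, and the variable names are my own:

```r
d   <- data.frame(uID = c(1, 2, 1), iID = c(5, 1, 8), rat = c(3, 3, 5))
ndt <- data.frame(uID = c(5, 2),    iID = c(3, 6),    rat = c(4, 1))

existingUsers <- unique(d$uID)
isNewUser <- !(ndt$uID %in% existingUsers)
ndt$uID[isNewUser]    # 5: a new user
ndt$uID[!isNewUser]   # 2: an existing user with a new item

# One data frame of new rows per user, keyed by user ID.
byUser <- split(ndt, ndt$uID)
names(byUser)         # "2" "5"
```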

Place your code in a file ProblemA.R that you will include in your submission.

Problem B

Compare the mean, median and mode as location measures on the InstEval data.

Details:

Split your data into training and test sets:

```
set.seed(9999)  # so we are all using the same random numbers
testidxs <- sample(1:nrow(ivl),1000)
testset <- ivl[testidxs,]
trainset <- ivl[-testidxs,]
```

You will call formUserData() on the training set and then predict on the test set. (See forthcoming Chapter 3.)

Use the following criteria as your measures of prediction accuracy: mean absolute prediction error (MAPE); probability of guessing exactly correctly (PGEC). For PGEC with median and mean, round off the predicted value (with round()).
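As a sketch of what the two criteria compute, assuming the predictions and the actual ratings are held in two numeric vectors of equal length. The function names mape() and pgec() and the numbers below are made up for illustration:

```r
# Mean absolute prediction error: average distance between predicted
# and actual ratings.
mape <- function(pred, actual) mean(abs(pred - actual))

# Probability of guessing exactly correctly: proportion of rounded
# predictions that equal the actual rating.
pgec <- function(pred, actual) mean(round(pred) == actual)

pred   <- c(3.4, 4.6, 2.0, 1.2)   # made-up predictions
actual <- c(3, 5, 2, 2)           # made-up actual ratings
mape(pred, actual)   # (0.4 + 0.4 + 0 + 0.8) / 4 = 0.4
pgec(pred, actual)   # rounded predictions 3, 5, 2, 1 match 3 of 4: 0.75
```

With the mode as location measure, the predictions are already rating values, so round() leaves them unchanged.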

For each location measure and each accuracy criterion, generate a graph of the criterion against k, the number of nearest neighbors. You must use either ggplot2 or lattice; there are many tutorials on the Web for these.
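One possible shape for such a graph in ggplot2 is sketched below. It assumes you have collected your results in a data frame with columns k and value; that layout and the function name plotCriterion() are my own choices for illustration, not a requirement.

```r
# Hypothetical helper: plot one accuracy criterion against k.
# 'results' is assumed to have numeric columns 'k' and 'value'.
plotCriterion <- function(results, criterionName) {
  ggplot2::ggplot(results, ggplot2::aes(x = k, y = value)) +
    ggplot2::geom_line() +
    ggplot2::geom_point() +
    ggplot2::labs(x = "k (number of nearest neighbors)", y = criterionName)
}
```

You would call it once per location measure and criterion, e.g. plotCriterion(res, "MAPE") with res being your own results data frame, and could write each plot to an image file with ggplot2::ggsave().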

You must use LaTeX for your writeup, presenting the graphs with commentary.

Name your .tex file ProblemB.tex, and place your R code in a file ProblemB.R. Include these, your image files and the resulting .pdf file in your submission.