Due Thursday, October 18
In this problem, you will extend the k-nearest neighbor capability of the rectools package, in two ways:
(I use the term "analyst" here instead of "user" to distinguish it from the latter term in the context of "users and items" in RSs. The analyst is the one who writes the code that calls predict.usrData(), actually by calling the generic function predict().)
Details:
predict.usrData <- function(origData,newData,newItem, k,wtcovs=NULL,wtcats=NULL,locMeasure=mean)
predict(oD,nD,nI,k,locMeasure=median)
in which case the built-in R function median() will be used.
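To illustrate how a function-valued argument such as locMeasure can be applied, here is a minimal sketch. The helper summarizeNeighbors() and the ratings vector are hypothetical, made up for illustration; they are not part of rectools and are not the required implementation.

```r
# Hypothetical helper (not part of rectools): collapse the ratings of the
# k nearest neighbors into a single predicted value, using whatever
# location measure the analyst passes in.
summarizeNeighbors <- function(nbrRatings, locMeasure = mean) {
  locMeasure(nbrRatings)
}

nbrRatings <- c(3, 5, 4, 1, 5)  # made-up neighbor ratings

predMean <- summarizeNeighbors(nbrRatings)                         # default: mean()
predMedian <- summarizeNeighbors(nbrRatings, locMeasure = median)  # median()
```

Because functions are ordinary objects in R, any function that maps a numeric vector to a single number can be supplied in the same way.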
An example of a custom function supplied by the analyst might be, say, the average of the minimum and maximum values:
minMaxAvg <- function(x) (min(x) + max(x)) / 2
Do NOT put this in your file, as it is an example of what the analyst might supply, not you. However, DO offer the analyst the choice of using the mode as a location measure: Include in your file a function with call form
vecmode(x)
where x is a numeric vector. The term mode means the most frequent value, e.g. 8 in the vector (5,12,13,8,3,4,5,8,6,8). In the case of a tie, return the largest value among those involved in the tie. Try to make good use of R constructs.
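One possible way to realize the tie-breaking rule is sketched below. It is only an illustration of the behavior described above, using the base R functions unique(), match() and tabulate(); other approaches, e.g. based on table(), would work as well.

```r
# Sketch of a mode function: most frequent value, with ties broken
# in favor of the largest tied value.
vecmode <- function(x) {
  vals <- unique(x)                   # distinct values, in order of appearance
  counts <- tabulate(match(x, vals))  # counts[i] = frequency of vals[i]
  max(vals[counts == max(counts)])    # largest among the most frequent values
}

m1 <- vecmode(c(5, 12, 13, 8, 3, 4, 5, 8, 6, 8))  # 8 occurs three times
m2 <- vecmode(c(3, 7, 7, 3, 1))                   # 3 and 7 tie; the larger wins
```

Working on the values themselves, rather than on the character names produced by table(), avoids converting numbers to strings and back.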
updateUserData(usrData,newData)
Here usrData is the existing database, and newData is a data frame of the same type as input to formUserData(). The return value is the new database. Note that newData may include both new items for existing users, and new users. Try to make good use of R constructs.
Here is an example for updateUserData():
> d <- data.frame(uID=c(1,2,1),iID=c(5,1,8),rat=c(3,3,5))
> d
  uID iID rat
1   1   5   3
2   2   1   3
3   1   8   5
> db <- formUserData(d)
> ndt <-
+ data.frame(uID=c(5,2),iID=c(3,6),rat=c(4,1))
> ndt
  uID iID rat
1   5   3   4
2   2   6   1
> newdb <- updateUserData(db,ndt)
> db[[2]]
$userID
[1] "2"

$itms
[1] 1

$ratings
1
3

attr(,"class")
[1] "usrDatum"
> newdb[[2]]
$userID
[1] "2"

$itms
[1] 1 6

$ratings
1
3 1

attr(,"class")
[1] "usrDatum"
> newdb[[3]]
$userID
[1] "5"

$itms
[1] 3

$ratings
[1] 4

attr(,"class")
[1] "usrDatum"
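The merging logic in the example above can be sketched without rectools as follows. Everything here is an assumption made for illustration: makeDatum() and updateSketch() are hypothetical names, the records carry only the three fields visible in the output above (the real usrDatum objects may hold more, and their ratings may be named), and newData is assumed to have its columns in the order user ID, item ID, rating.

```r
# Hypothetical stand-in for a usrDatum-like record (not the rectools constructor).
makeDatum <- function(uid, itms, ratings)
  structure(list(userID = as.character(uid), itms = itms, ratings = ratings),
            class = "usrDatum")

# Sketch: merge new ratings into an existing list of user records.
updateSketch <- function(usrData, newData) {
  ids <- vapply(usrData, function(u) u$userID, character(1))
  # Process the new rows one user at a time.
  for (grp in split(newData, newData[[1]])) {
    uid <- as.character(grp[[1]][1])
    pos <- match(uid, ids)
    if (is.na(pos)) {
      # New user: append a fresh record.
      usrData[[length(usrData) + 1]] <- makeDatum(uid, grp[[2]], grp[[3]])
      ids <- c(ids, uid)
    } else {
      # Existing user: append the new items and their ratings.
      usrData[[pos]]$itms <- c(usrData[[pos]]$itms, grp[[2]])
      usrData[[pos]]$ratings <- c(usrData[[pos]]$ratings, grp[[3]])
    }
  }
  usrData
}

# Same data as in the example above, built with the stand-in constructor.
db <- list(makeDatum(1, c(5, 8), c(3, 5)), makeDatum(2, 1, 3))
ndt <- data.frame(uID = c(5, 2), iID = c(3, 6), rat = c(4, 1))
newdb <- updateSketch(db, ndt)
```

Since R passes arguments by value, db itself is left unchanged and the updated database exists only in the return value, which matches the db[[2]] versus newdb[[2]] comparison in the example.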
Compare the use of the mean, median and mode as location measures on the InstEval data.
Details:
set.seed(9999) # so we are all using the same random numbers
testidxs <- sample(1:nrow(ivl),1000)
testset <- ivl[testidxs,]
trainset <- ivl[-testidxs,]
You will call formUserData() on the training set and then predict on the test set. (See forthcoming Chapter 3.)