Homework 3

Due Thursday, February 24, 11:59 pm

Submission Rules and Advice

You must write a report for each of Problems 1 and 2, named Problem1.pdf and Problem2.pdf. Submit the .tex and .pdf files, along with image files, if any (which must be generated in R). Put your code in the files Problem1.R and Problem2.R.

In this and subsequent assignments, you are required to use qeML for all basic prediction operations (linear model, RFs, etc.) or the wrapped base-R functions, e.g. lm(), or specific packages stated in the homework statement. For matrix factorization, use only the functions covered in our course.

As before, if you are asked to write a general function, its code cannot be tailored to a particular dataset, say MovieLens.

Problem 1

Here you will explore an approach to model selection, specifically feature selection. It probably would be helpful to first review the material on this from our course, including the blog posts on this topic.

The most common method of feature selection in linear (or generalized linear) models is to simply look at p-values. This is correctly viewed as a poor, crude method of feature selection. And in recent years p-values themselves have come under heavy criticism, with many of us having been critical from way back. However, one of the main problems with using p-values in the feature selection context suggests a way to make it much better, explained below.

Here is what those p-values in the lm() output mean: for each coefficient, the statistical hypothesis H0: βi = 0 is tested. The p-value measures whether the estimated βi is far enough from 0 for us to believe that βi is nonzero. Specifically, the p-value is a "what if" measure: we ask, what if βi actually IS 0? What then would be the probability of its estimate straying as far from 0 as it does here, or further?
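In R, these p-values appear in the "Pr(>|t|)" column of the coefficient matrix returned by summary(). A quick look, using the built-in mtcars data purely for illustration:

fit <- lm(mpg ~ wt + hp, data = mtcars)
# the last column of the coefficient matrix holds the p-values
summary(fit)$coefficients[, 'Pr(>|t|)']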

Not only is this rather convoluted reasoning, it doesn't answer the question of interest to us. What we want to know is: is the true βi far enough from 0 to justify using feature i in our model? These are very different questions. For instance, the true βi could be nonzero yet too small for feature i to be useful as a predictor; remember, we want to keep the number of features small if possible, to avoid overfitting. Note that with large enough n, the p-value will be tiny even if βi is small. (All else staying equal, for any nonzero βi the p-value goes to 0 as n goes to infinity, since the standard error of the estimate shrinks at the rate 1/√n.)
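A quick simulation illustrates the point; the true coefficient of 0.01 below is arbitrary, chosen to be far too small to matter in prediction:

set.seed(9999)  # arbitrary seed, for reproducibility
for (n in c(100, 10000, 1000000)) {
   x <- rnorm(n)
   y <- 0.01 * x + rnorm(n)  # beta is nonzero but tiny
   pval <- summary(lm(y ~ x))$coefficients['x', 'Pr(>|t|)']
   cat('n =', n, ' p-value =', pval, '\n')
}

As n grows, the p-value collapses toward 0, even though the feature is of essentially no predictive value.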

It is customary to reject a statistical hypothesis if the p-value falls below a preset cutoff, typically 0.05. But again, there is no particular reason that that cutoff, called α, is well calibrated to our question of whether to retain feature i in our model.

But that very defect of p-values suggests a way toward improvement: We try many values of the cutoff α, and choose the one that results in the best prediction on the holdout set.

With these considerations in mind, write a function that implements the above scheme. The example below illustrates the intended interface and output, and a sketch of one possible implementation follows the example output.

Then try your function on the pef data.

Example: the mlb data, with columns Position, Height, Weight and Age; here Weight is the outcome variable.

> library(regtools)
> head(mlb)
        Position Height Weight   Age
1        Catcher     74    180 22.99
2        Catcher     74    215 34.69
3        Catcher     72    210 30.78
4  First_Baseman     72    210 35.43
5  First_Baseman     73    188 35.71
6 Second_Baseman     69    176 29.39
> mlbd <- factorsToDummies(mlb,omitLast=T)
> mlbd <- as.data.frame(mlbd)
> lmAlpha(mlbd,'Weight',50,250)
holdout set has  250 rows
holdout set has  250 rows
holdout set has  250 rows
...
[[1]]
[[1]]$vars
[1] "Height"

[[1]]$testAcc
[1] 14.09343
attr(,"stderr")
[1] 0.07803801


[[2]]
[[2]]$vars
[1] "Height" "Age"

[[2]]$testAcc
[1] 13.5378
attr(,"stderr")
[1] 0.08276051


[[3]]
[[3]]$vars
[1] "Height"             "Age"                "Position.Shortstop"

[[3]]$testAcc
[1] 13.3291
attr(,"stderr")
[1] 0.07131034


[[4]]
[[4]]$vars
[1] "Height"                  "Age"
[3] "Position.Shortstop"      "Position.Second_Baseman"

[[4]]$testAcc
[1] 13.42597
attr(,"stderr")
[1] 0.1098794


[[5]]
[[5]]$vars
[1] "Height"                  "Age"
[3] "Position.Shortstop"      "Position.Second_Baseman"
[5] "Position.First_Baseman"

[[5]]$testAcc
[1] 13.28826
attr(,"stderr")
[1] 0.1001254


[[6]]
[[6]]$vars
[1] "Height"                  "Age"
[3] "Position.Shortstop"      "Position.Second_Baseman"
[5] "Position.First_Baseman"  "Position.Relief_Pitcher"
  
[[6]]$testAcc
[1] 13.12534
attr(,"stderr")
[1] 0.08009306
  
  
[[7]]
[[7]]$vars
[1] "Height"                  "Age"
[3] "Position.Shortstop"      "Position.Second_Baseman"
[5] "Position.First_Baseman"  "Position.Relief_Pitcher"
[7] "Position.Catcher"
  
[[7]]$testAcc
[1] 13.20249
attr(,"stderr")
[1] 0.07688103

[[8]]
[[8]]$vars
[1] "Height"                    "Age"
[3] "Position.Shortstop"        "Position.Second_Baseman"
[5] "Position.First_Baseman"    "Position.Relief_Pitcher"
[7] "Position.Catcher"          "Position.Starting_Pitcher"

[[8]]$testAcc
[1] 13.15968
attr(,"stderr")
[1] 0.06916993


[[9]]
[[9]]$vars
[1] "Height"                    "Age"
[3] "Position.Shortstop"        "Position.Second_Baseman"
[5] "Position.First_Baseman"    "Position.Relief_Pitcher"
[7] "Position.Catcher"          "Position.Starting_Pitcher"
[9] "Position.Outfielder"

[[9]]$testAcc
[1] 13.00936
attr(,"stderr")
[1] 0.07490062
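For concreteness, here is a minimal sketch of one possible lmAlpha() implementation, consistent with the example output above. It assumes qeML's qeLin() for the holdout evaluation; the argument names and internal design are illustrative only, not requirements.

library(qeML)

lmAlpha <- function(data, yName, nReps, holdoutSize) {
   # fit the full model once, to obtain a p-value for each feature
   fullFit <- lm(as.formula(paste(yName, '~ .')), data = data)
   pvals <- summary(fullFit)$coefficients[-1, 'Pr(>|t|)']  # skip intercept
   # the sorted p-values are the only cutoffs alpha that matter:
   # cutoff number k retains exactly the k most significant features
   featureOrder <- names(sort(pvals))
   res <- list()
   for (k in seq_along(featureOrder)) {
      vars <- featureOrder[1:k]
      subData <- data[, c(vars, yName)]
      # mean absolute prediction error, averaged over nReps random
      # holdout sets, with a standard-error attribute attached (as
      # regtools::replicMeans() would produce)
      accs <- replicate(nReps,
         qeLin(subData, yName, holdout = holdoutSize)$testAcc)
      acc <- mean(accs)
      attr(acc, 'stderr') <- sd(accs) / sqrt(nReps)
      res[[k]] <- list(vars = vars, testAcc = acc)
   }
   res
}

Note that only the sorted p-values need be considered as cutoffs; any α falling between two consecutive sorted p-values retains exactly the same feature set.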

Problem 2

Here you will get some experience with matrix factorization methods, using the Book Crossing dataset.