DESCRIPTION

Students:  Please keep in mind the OMSI rules.  Save your files often,
make sure OMSI fills your entire screen at all times, etc.  Remember
that clicking CopyQtoA will copy the entire question box to the answer
box.

In questions involving code which will PARTIALLY be given to you in the
question specs, you may need to add new lines.  There may not be
information given as to where the lines should be inserted.

MAKE SURE TO RUN THE CODE IN PROBLEMS THAT INVOLVE CODE!

QUESTION

(Text answer, 25 points) Suppose we have weather data, say temperatures, 
on d days for each of s sites, stored in an s x d matrix or data frame.  
Some of the entries are NAs.  We could cast this as a recommender systems 
issue. Explain what "users," "items" and "ratings" would be in this setting.
Also, if we were to convert to (user ID, item ID, rating) format, how
many rows would that new data frame have?


QUESTION

(Text answer, 25 points) A lot of ML is about optimization.  As we know 
from calculus, we can optimize a quantity by setting its derivative/derivatives 
to 0 and solving.  But that depends on whether the quantity HAS A
DERIVATIVE.  What is the situation in that regard for the LASSO
vs. ridge regression?  No lengthy answer required, just a
statement as to which, if any, of these two methods can be
computed using calculus, and why or why not.

QUESTION

(Text answer, 25 points) Consider a LASSO approach, but based on the 
l-infinity norm, which is max_i |x_i|.  Would that typically result in a sparse
solution?  Best to refer to coordinates of points in your answer.
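
For reference, the norm definition above can be checked numerically.  This
is only a small sketch; the vector x is a made-up example, not data from
the course, and the l1 norm is shown alongside purely for comparison.

```r
# Toy vector, chosen only for illustration (an assumption, not course data)
x <- c(0.5, -2, 0, 1.5)

lInf <- max(abs(x))   # l-infinity norm: largest absolute coordinate
l1 <- sum(abs(x))     # l1 norm, the one used by the usual LASSO

print(c(lInf = lInf, l1 = l1))
```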

QUESTION -ext .R -run 'Rscript ./omsi_answer4.R'

(R code answer, 25 points) We had a simulation aimed at illustrating p-hacking.

sim <- function() {
   rhos <- replicate(1000,{
      u <- rnorm(50);
      v <- rnorm(50);
      x <- u;
      y <- 0.2 * u + v;
      cor(x,y)})
   plot(density(rhos))
}

It showed that even if the population correlation coefficient is a small
number like 0.2, the sample estimate could be much larger.  If we are
estimating many correlations, over different pairs of variables, we risk
p-hacking. 
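
As a rough numeric companion to the density plot, the sketch below reruns
the same data-generating model and summarizes the sample correlations.
The helper name simRhos, its default arguments and the seed are
assumptions made for illustration; they are not part of the original
simulation.

```r
# Hypothetical helper: same model as sim(), but returns the sample
# correlations instead of plotting their density
simRhos <- function(nReps = 1000, n = 50) {
   replicate(nReps, {
      u <- rnorm(n)
      v <- rnorm(n)
      x <- u
      y <- 0.2 * u + v
      cor(x, y)
   })
}

set.seed(1)   # arbitrary seed, for reproducibility only
rhos <- simRhos()

# Population correlation implied by the model:
# cov(x,y) = 0.2, var(x) = 1, var(y) = 0.2^2 + 1
popRho <- 0.2 / sqrt(0.2^2 + 1)
print(popRho)

# Spread of the sample estimates around the population value
print(quantile(rhos, c(0.05, 0.5, 0.95)))
print(range(rhos))
```

The population value here is slightly below 0.2 because y has variance
1.04 rather than 1.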

Fill in the gaps in this function, which will use simulation to find the
probability of getting one or more "accidentally" high correlations.  We
have p variables (e.g. 2), a sample size of n (e.g. 50), a threshold
accidentLevel defining an "accidental correlation" (e.g. > 0.6), and the
number of samples to generate, nSamp (e.g. 500).  In this model, every pair
of variables has the same true population correlation, rho.

simProbAccident <- function(p,n,rho,accidentLevel,nSamp)
{
   simout <- replicate(nSamp,
   {
      v <- matrix(rnorm(p*n),ncol=p);
      w <- matrix(nrow=n,ncol=p);
      # generate correlated variables

      corMat <- cor(w);   # element i,j is rhoHat[i,j]
      diag(corMat) <- 0; 
      
    })
   # calculate a probability

}

set.seed(9999)
simProbAccident(2,50,0.2,0.4,500)  # 0.06
simProbAccident(10,50,0.2,0.4,500)  # 0.478