DESCRIPTION

Students: Please keep in mind the OMSI rules. Save your files often, make sure OMSI fills your entire screen at all times, etc.

Remember that clicking CopyQtoA will copy the entire question box to the answer box.

In questions involving code which will PARTIALLY be given to you in the question specs, you may need to add new lines. There may not be information given as to where the lines should be inserted.

MAKE SURE TO RUN THE CODE IN PROBLEMS THAT INVOLVE CODE!

QUESTION
(Text answer, 25 points)

Suppose we have weather data, say temperatures, on d days for each of s sites, stored in an s x d matrix or data frame. Some of the entries are NAs. We could cast this as a recommender systems issue. Explain what "users," "items" and "ratings" would be in this setting. Also, if we were to convert to (user ID, item ID, rating) format, how many rows would that new data frame have?

QUESTION
(Text answer, 25 points)

A lot of ML is about optimization. As we know from calculus, we can optimize a quantity by setting its derivative/derivatives to 0 and solving. But that depends on whether the quantity HAS A DERIVATIVE. What is the situation in that regard for the LASSO vs. ridge regression? No lengthy answer required, just a statement as to which, if any, of these two methods can be computed using calculus, and why or why not.

QUESTION
(Text answer, 25 points)

Consider a LASSO approach, but based on the l-infinity norm, which is max_i |x_i|. Would that typically result in a sparse solution? Best to refer to coordinates of points in your answer.

QUESTION -ext .R -run 'Rscript ./omsi_answer4.R'
(R code answer, 25 points)

We had a simulation aimed at illustrating p-hacking.

sim <- function()
{
   rhos <- replicate(1000, {
      u <- rnorm(50)
      v <- rnorm(50)
      x <- u
      y <- 0.2 * u + v
      cor(x,y)
   })
   plot(density(rhos))
}

It showed that even if the population correlation coefficient is a small number like 0.2, the sample estimate could be much larger.

If we are estimating many correlations, over different pairs of variables, we risk p-hacking. Fill in the gaps in this function, which will use simulation to find the probability of getting one or more "accidentally" high correlations. We have p variables (e.g. 2), a sample size of n (e.g. 50), a definition of an "accidental correlation" (e.g. > 0.6), and the number of samples to generate (e.g. 500). The true population correlation, which all variables have in this model, is rho.

simProbAccident <- function(p,n,rho,accidentLevel,nSamp)
{
   simout <- replicate(nSamp, {
      v <- matrix(rnorm(p*n),ncol=p)
      w <- matrix(nrow=n,ncol=p)
      # generate correlated variables

      corMat <- cor(w)  # element i,j is rhoHat[i,j]
      diag(corMat) <- 0

   })
   # calculate a probability

}

set.seed(9999)
simProbAccident(2,50,0.2,0.4,500)  # 0.06
simProbAccident(10,50,0.2,0.4,500)  # 0.478
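
One possible way to fill the gaps is sketched below. It is an illustration under stated assumptions, not the official solution. The name simProbAccidentSketch is hypothetical. The sketch assumes an equicorrelated model (unit variances, every pair of variables having population correlation rho, with rho between -1/(p-1) and 1) and builds the correlated variables from v with a Cholesky factor. It also assumes an "accident" means any off-diagonal sample correlation exceeding accidentLevel, without taking absolute values. Since the intended construction of w is not given, the values it returns need not match the 0.06 and 0.478 shown in the question.

```r
# Hypothetical completion of simProbAccident; not the official solution.
simProbAccidentSketch <- function(p,n,rho,accidentLevel,nSamp)
{
   # Assumed model: unit variances, every pairwise correlation equal to rho
   sigma <- matrix(rho,nrow=p,ncol=p)
   diag(sigma) <- 1
   r <- chol(sigma)  # upper triangular, t(r) %*% r equals sigma
   simout <- replicate(nSamp, {
      v <- matrix(rnorm(p*n),ncol=p)  # independent N(0,1) columns
      w <- v %*% r  # rows of w now have covariance matrix sigma
      corMat <- cor(w)  # element i,j is rhoHat[i,j]
      diag(corMat) <- 0  # ignore the trivial self-correlations
      max(corMat) > accidentLevel  # TRUE if any pair is "accidentally" high
   })
   mean(simout)  # proportion of samples with at least one accident
}

set.seed(9999)
simProbAccidentSketch(2,50,0.2,0.4,500)
simProbAccidentSketch(10,50,0.2,0.4,500)
```

With p = 10 there are 45 distinct pairs rather than 1, so the second call should return a noticeably larger probability than the first; that gap is the p-hacking risk the question describes.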