Homework I

Due Monday, February 3

Notes on Submission Packages

In any Homework assignment, include all the files in the same .tar file, NO SUBDIRECTORIES. When the TA's grading script unpacks the file, all the files should be in the directory from which she issued the unpack command. The name of the .tar file must be as specified in the course Syllabus.

Homework problems will typically ask you to both write code and do data experiments using your code. The writeups for such experiments must be in LaTeX (see Syllabus), submitted in a .tex file. There are tons of tutorials on the Web; mine is quick.

Your .tex, .pdf and image files (if any) must be included in your submission.

Work must be done entirely in R, and except for packages that I specify, limited to R packages available on CSIF.

Image files, if any, must be generated using R (base R, lattice or ggplot2). You are required to include the code that generates them in your submitted R code file.
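
As an illustration of generating an image file from code, here is a minimal sketch using base R graphics. The function name savePlotExample() and the file name are placeholders, not part of the specs, and the PDF device is just one possible choice of output format.

```r
# Sketch: write a base-R plot to a file from inside a function.
# savePlotExample() and the file name are hypothetical, for illustration.
savePlotExample <- function(outfile) {
  pdf(outfile)               # open a PDF graphics device
  on.exit(dev.off())         # close the device on exit, which writes the file
  plot(1:10, (1:10)^2,
       xlab = "x", ylab = "x squared",
       main = "Example plot")
  invisible(outfile)
}

outfile <- file.path(tempdir(), "example_plot.pdf")
savePlotExample(outfile)
file.exists(outfile)
```

Note that lattice and ggplot2 plots generally need an explicit print() call when they are created inside a function.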

Concerning file placement: When grading your work, the TA will run your code from her OMSI directory (or equivalent), with the same data file structure as in your OMSI directory. For instance, ml-100k/ will be a subdirectory of the directory from which she runs your R code.

Be sure to use exactly the same function names, file names etc. as in the specs!

Unless otherwise stated, your R code file will consist only of functions, no executable code. In grading your code, the TA will simply call R's source() function from the R command line, and she will then run her own test code that calls your functions.

You are expected to be resourceful! E.g., find out on your own how to do the graphs. There is tons of material on the Web.
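
To illustrate the functions-only convention, here is a small sketch of what a grading session might look like. The function addOne() and the file name Example.R are made up for this example; the file is written to a temporary directory only so that the sketch is self-contained.

```r
# Sketch of the functions-only convention. addOne() and Example.R
# are placeholders, not names from the specs.
codefile <- file.path(tempdir(), "Example.R")

# Contents of a functions-only file: definitions, no top-level computation.
writeLines(c(
  "addOne <- function(x) {",
  "  x + 1",
  "}"
), codefile)

# A grader's session might then load the file and call the function.
source(codefile)
addOne(41)
```

Since source() runs everything in the file, any stray top-level code (plots, prints, setwd() calls) would execute on the grader's machine too, which is why only function definitions belong there.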

Note: You will often need to do data conversion. E.g. though the forest cover data below is numeric, it is arranged as a data frame. You may need to call as.matrix() to convert it to a matrix.
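
A minimal sketch of such a conversion, on a toy data frame rather than the real data:

```r
# Toy stand-in for a numeric data set that was read in as a data frame.
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

is.matrix(df)            # a data frame is a list of columns, not a matrix
m <- as.matrix(df)       # all-numeric columns give a numeric matrix
is.matrix(m)
is.numeric(m)

# Matrix operations such as crossprod() expect a matrix.
crossprod(m)             # same as t(m) %*% m
```

If any column is non-numeric, as.matrix() yields a character matrix instead, so it is worth checking is.numeric() on the result.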

Problem A

This problem involves PCA, including calculating the principal components (PCs) and graphing them, similar to the "genetic map of Europe" example. In the latter, researchers took genomic measurements, "genotype data from 197,146 loci in 1,387 individuals," then graphed the first two principal components, color-coding the people by country. The result looks strikingly like a geographic map of Europe!

One implication of this is that if we wanted to predict nationality from genes, instead of needing 197,146 variables ("features"), we'd do pretty well with only two (PC1 and PC2).

Let's see if we can do as well on another dataset.

Download the UCI Forest Cover dataset.

Compute the principal components. Exclude V55, which is the cover type.

Graph the cumulative proportion of variance accounted for by the first k PCs against k, for k = 1,2,...,54. Comment on how many might be "enough." Also state why I put quotation marks in that last sentence.
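
The following is a sketch of one way to get proportions of variance out of prcomp(), on simulated data rather than the forest cover data. The variable names are arbitrary, and this is not meant as the required solution.

```r
# Sketch on simulated data (not the forest cover data).
set.seed(1)
n <- 500
z <- rnorm(n)
# Five columns; the first three are driven by a common factor z.
x <- cbind(z + rnorm(n, sd = 0.3),
           z + rnorm(n, sd = 0.3),
           z + rnorm(n, sd = 0.3),
           rnorm(n),
           rnorm(n))

pcout <- prcomp(x)                 # centers the columns by default
pcvars <- pcout$sdev^2             # variance of each principal component
cumprop <- cumsum(pcvars) / sum(pcvars)

plot(seq_along(cumprop), cumprop, type = "b", ylim = c(0, 1),
     xlab = "number of PCs",
     ylab = "cumulative proportion of variance")
```

By default prcomp() does not rescale the columns; whether to set scale. = TRUE is a separate decision that this sketch leaves open.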

Graph a scatter plot of PC2 vs. PC1, color-coded by cover type. To avoid overplotting, what I call the "black screen problem," plot only a random sample of 10000 of the 580K cases.
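
Here is a sketch of the subsampling and color-coding idea, using R's built-in iris data as a stand-in (four numeric columns plus a class label). The subsample size of 100 is arbitrary, chosen only because iris has 150 rows.

```r
# Sketch using the built-in iris data, not the forest cover data.
set.seed(1)
x <- as.matrix(iris[, 1:4])
cls <- iris$Species

pcs <- prcomp(x)$x                 # matrix of PC scores, one row per case

# Plot only a random subset of rows to limit overplotting.
idx <- sample(nrow(pcs), 100)

plot(pcs[idx, 1], pcs[idx, 2],
     col = as.integer(cls[idx]),   # factor codes 1, 2, 3 index the palette
     pch = 20, cex = 0.6,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(cls),
       col = seq_along(levels(cls)), pch = 20)
```

Small plotting symbols (pch = 20 with a reduced cex) also help when many points overlap.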

Comment on how well the different cover types are separating here, and how well we might fare in predicting cover type from just PC1 and PC2.

Place your code in ProblemA.R.

Problem B

Here you will acquire skill in R data tools. It also will illustrate the all-important point: Know your data -- explore the data before embarking on any analysis.

Use the MovieLens 100K dataset, in files u.data and u.item. Create data frames named 'ratings' and 'movies'.
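
As a sketch of reading files in these two layouts: u.data is tab-separated, while u.item uses "|" as the separator. The tiny stand-in files, their contents, the URLs and the column names below are all made up so that the example runs anywhere; only the separators reflect the real files.

```r
# Sketch: reading files laid out like u.data and u.item.
# File contents and column names are placeholders, not the real data.
tmpdir <- tempdir()
datafile <- file.path(tmpdir, "u.data.example")
itemfile <- file.path(tmpdir, "u.item.example")

# u.data layout: user id, item id, rating, timestamp, tab-separated.
writeLines(c("1\t10\t4\t880000000",
             "2\t20\t3\t880000001"), datafile)

# u.item layout: fields separated by "|"; only 3 genre flags shown here.
writeLines(c("10|First Movie (1995)|01-Jan-1995||http://example.com/10|0|0|1",
             "20|Someone's Movie (1995)|01-Jan-1995||http://example.com/20|0|1|0"),
           itemfile)

ratingsEx <- read.table(datafile, sep = "\t", header = FALSE,
                        col.names = c("user", "movie", "rating", "timestamp"))

# quote = "" keeps apostrophes in titles from being read as quote marks;
# comment.char = "" keeps "#" in a title from starting a comment.
moviesEx <- read.table(itemfile, sep = "|", header = FALSE,
                       quote = "", comment.char = "",
                       stringsAsFactors = FALSE)
dim(ratingsEx)
dim(moviesEx)
```

If read.table() complains about invalid characters in the real u.item, the fileEncoding argument (e.g. "latin1") may be worth trying; treat that as a suggestion to check rather than a guarantee.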

There are 19 individual genres, including the 'unknown' "genre" (see u.genre). Number them 1,2,...,19, in alphabetical order.

Write code to create a new data frame, singles, with 20 columns and the same number of rows as movies. Column 1 will be the movie ID, and column i, for i = 2,...,20, will be 1 or 0, according to whether the movie in this row includes genre i-1 in its genre list.

Write code to do a join of ratings and singles, producing a data frame named 'all'. Use R's merge() function. The data frame all will have all the columns of the two input data frames, with rows matched on movie ID.

For instance, the fourth line in ratings is for user 1 and movie 47. So, in the output data frame, there will be a row showing user 1, movie 47, the rating, and the 19 booleans for the genres.
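
A toy sketch of such a join with merge(). The data frames ratingsEx and singlesEx, their column names and their values are invented for illustration, with only two genre columns.

```r
# Toy stand-ins; all names and values here are placeholders.
ratingsEx <- data.frame(user   = c(1, 1, 2, 3),
                        movie  = c(10, 20, 10, 30),
                        rating = c(5, 3, 4, 2))
singlesEx <- data.frame(movie = c(10, 20, 30),
                        g1    = c(1, 0, 0),
                        g2    = c(0, 1, 1))

# Join on the shared key column "movie": each rating row picks up
# the genre indicators of its movie.
allEx <- merge(ratingsEx, singlesEx, by = "movie")
allEx
```

By default merge() puts the key column first and sorts the result by key, so row order will differ from the input. If the key has different names in the two inputs, the by.x and by.y arguments handle that.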

Familiarize yourself with R's table() function, which is quite versatile; see Appendix below. Then pick 3-4 genres, and write code to do a two-way table of genre against rating. Comment on whether it appears that some genres tend to get higher ratings than others. (No formal statistical analysis for now.)
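
As a sketch of a two-way table of an indicator against a rating, again on invented data (the names allEx and g1 are placeholders):

```r
# Toy data: one 0/1 genre indicator and a rating per row.
allEx <- data.frame(g1     = c(1, 1, 0, 0, 1, 0, 1, 0),
                    rating = c(5, 4, 3, 4, 5, 2, 3, 4))

# Two-way table: rows are the indicator values, columns the ratings.
tbl <- table(allEx$g1, allEx$rating, dnn = c("g1", "rating"))
tbl

# Raw counts are hard to compare when groups differ in size;
# row proportions put each row on the same scale.
prop.table(tbl, margin = 1)
```

The dnn argument only labels the dimensions in the printed output; the counts are the same without it.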

Place your code in ProblemB.R.

Appendix: R tables

> set.seed(9999)
> m <- matrix(sample(1:3,24,replace=T),ncol=3)
> md <- as.data.frame(m)
> md
  V1 V2 V3
1  2  1  3
2  1  2  3
3  1  3  3
4  3  1  3
5  2  3  1
6  2  2  2
7  2  2  2
8  3  2  1
> table(md[,2])
1 2 3 
2 4 2 
> table(md[,1:2])
   V2
V1  1 2 3
  1 0 1 1
  2 1 2 1
  3 1 1 0
> table(md)
, , V3 = 1

   V2
V1  1 2 3
  1 0 0 0
  2 0 0 1
  3 0 1 0

, , V3 = 2

   V2
V1  1 2 3
  1 0 0 0
  2 0 2 0
  3 0 0 0

, , V3 = 3

   V2
V1  1 2 3
  1 0 1 1
  2 1 0 0
  3 1 0 0

Note that a 1-way table is an R vector (with element names), a 2-way table is an R matrix (with row and column names), a 3-way table is a 3-D R array (row, column, layer names), then 4-D and so on. As such, we can extract elements, etc.:

> table(md[,2]) / sum(table(md[,2]))
   1    2    3 
0.25 0.50 0.25 
> table(md)[2,2,2]
[1] 2
> table(md)[3,1,2]
[1] 0
> table(md)[1,2,3]
[1] 1
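
Since a table is an array, array tools such as apply() work on it too. The sketch below enters the md values printed above by hand rather than regenerating them: the default behavior of sample() changed in R 3.6.0, so set.seed(9999) may not reproduce the same md on a newer R. The names tbl3 and tbl12 are arbitrary.

```r
# The md values from the transcript above, entered by hand so that the
# result does not depend on the random number generator.
md <- data.frame(V1 = c(2, 1, 1, 3, 2, 2, 2, 3),
                 V2 = c(1, 2, 3, 1, 3, 2, 2, 2),
                 V3 = c(3, 3, 3, 3, 1, 2, 2, 1))
tbl3 <- table(md)

dim(tbl3)                  # a 3 x 3 x 3 array
dimnames(tbl3)$V2          # level names along the second dimension

# Sum over V3 to recover the two-way table of V1 against V2.
tbl12 <- apply(tbl3, c(1, 2), sum)
tbl12

# margin.table() does the same collapsing and keeps the table class.
margin.table(tbl3, c(1, 2))
```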