Homework I
Due Monday, February 3
Notes on Submission Packages
- In every homework assignment, include all the files in a single
.tar file, with NO SUBDIRECTORIES. When the TA's grading script unpacks
the file, all the files should land in the directory from which the TA
issued the unpack command. The name of the .tar file must be as
specified in the course Syllabus.
- Homework problems will typically ask you both to write code and to do
data experiments using your code. The writeups for such experiments
must be in LaTeX (see Syllabus), submitted in a .tex file. There are
tons of LaTeX tutorials on the Web; mine is quick.
- Your .tex and .pdf files, and any image files, must be included in
your submission.
- Work must be done entirely in R, and except for packages that I
specify, limited to R packages available on CSIF.
- Image files, if any, must be generated using R (base R,
lattice or ggplot2). You are required to include the code that
generates them in your submitted R code file.
- Concerning file placement: When grading your work, the TA will run
your code from the TA's own OMSI directory (or equivalent), with the
same data file structure as in your OMSI directory. For instance,
ml-100k/ will be a subdirectory of the directory from which the TA
runs your R code.
- Be sure to use exactly the same function names, file names etc. as
in the specs!
- Unless otherwise stated, your R code file will consist only of
function definitions, with no executable code at the top level. In
grading your code, the TA will simply call R's source() function from
the R command line, and will then run test code that calls your
functions.
- You are expected to be resourceful! For example, find out on your own
how to do the graphs; there is tons of material on the Web.
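The "functions only" rule and the source() step can be illustrated with a small sketch. Everything here is hypothetical: the temporary file stands in for your code file, and the function name colMeansOf is made up, not taken from any spec.

```r
# Hypothetical illustration of the "functions only" rule; the function
# name and the temporary file are assumptions, not part of any spec.
codeFile <- tempfile(fileext = ".R")
writeLines(c(
  "# file contents: definitions only, nothing runs when sourced",
  "colMeansOf <- function(d) {",
  "  colMeans(as.matrix(d))",
  "}"
), codeFile)

# What a grader would do: load the definitions, then call them
# from separate test code.
source(codeFile)
res <- colMeansOf(data.frame(a = 1:3, b = c(2, 4, 6)))
print(res)
```

Because the file contains only definitions, sourcing it produces no output and has no side effects beyond creating the functions.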
Note: You will often need to do data conversion. E.g. though the forest cover
data below is numeric, it is arranged as a data frame. You may need to
call as.matrix() to convert it to a matrix.
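As a minimal sketch of such a conversion, on a made-up data frame rather than the forest cover data:

```r
# Toy numeric data frame standing in for real data (values are made up).
d <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))

# Matrix operations such as %*% need a matrix, so convert first.
m <- as.matrix(d)
print(is.matrix(m))

xtx <- t(m) %*% m   # 2 x 2 cross-product matrix
print(xtx)
```

The column names of the data frame carry over as the column names of the matrix.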
Problem A
This problem involves principal components analysis (PCA): computing
the principal components (PCs) and graphing them, similar to the
"genetic map of Europe" example. In the latter, researchers took
genomic measurements, "genotype data from 197,146 loci in 1,387
individuals," then graphed the first two principal components,
color-coding the people by countries. The result looks strikingly
like a geographic map of Europe!
One implication of this is that if we wanted to predict nationality from
genes, instead of needing 197,146 variables ("features"), we'd do pretty
well with only two (PC1 and PC2).
Let's see if we can do as well on another dataset.
- Download the
UCI Forest Cover dataset.
- Compute the principal components. Exclude V55, which is the
cover type.
- Graph the cumulative proportion of total variance accounted for by
the first k PCs against k, for k = 1,2,...,54. Comment on how many PCs
might be "enough." Also state why I put quotation marks around that
word.
- Graph a scatter plot of PC2 vs. PC1, color-coded by cover type.
To avoid overplotting, what I call the "black screen problem," plot only
a random sample of 10000 of the 580K cases.
- Comment on how well the different cover types are separating here, and
how well we might fare in predicting cover type from just PC1 and PC2.
- Place your code in ProblemA.R.
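The general shape of the computations above can be sketched on R's built-in iris data, which is NOT the forest cover data and is used here only so that the example runs without any download. The variable names, the seed, the sample size of 100 and the temporary PDF file are all arbitrary choices for illustration.

```r
# Sketch on the built-in iris data, not the forest cover data.
x <- as.matrix(iris[, 1:4])   # numeric columns only; Species is the class label
pcs <- prcomp(x)              # centers each column by default; no scaling here

# Cumulative proportion of variance accounted for by the first k PCs.
cumProp <- cumsum(pcs$sdev^2) / sum(pcs$sdev^2)

# Write both graphs to a PDF file (one page each); the file name is arbitrary.
plotFile <- tempfile(fileext = ".pdf")
pdf(plotFile)
plot(seq_along(cumProp), cumProp, type = "b",
     xlab = "number of PCs", ylab = "cumulative proportion of variance")

# PC2 vs. PC1 on a random subsample, colored by class.
set.seed(1)                   # arbitrary seed, for a repeatable sample
idx <- sample(nrow(x), 100)
plot(pcs$x[idx, 1], pcs$x[idx, 2],
     col = as.integer(iris$Species[idx]),
     xlab = "PC1", ylab = "PC2")
dev.off()
```

Whether to standardize the variables first (prcomp's scale. argument) is a judgment call that the sketch does not settle; it affects which variables dominate the leading PCs.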
Problem B
Here you will acquire skill in R data tools. The problem will also
illustrate an all-important point: Know your data. Explore the data
before embarking on any analysis.
- Use the MovieLens 100K dataset, in files u.data and
u.item. Create data frames named 'ratings' and 'movies'.
- There are 19 individual genres, including the 'unknown' "genre"
(see u.genre). Number them 1,2,...,19, in alphabetical order.
- Write code to create a new data frame, singles, with 20
columns and the same number of rows as movies. Column 1 will be the
movie ID, and column i, i = 2,...,20, will be 1 or 0, according to
whether the movie in this row includes genre i-1 in its genre list.
- Write code to do a join of ratings and
singles, producing a data frame named 'all'. Use R's merge()
function. The data frame all will contain the columns of both
input data frames, with rows matched on the movie ID.
For instance, the fourth line in ratings is for user 1
and movie 47. So, in the output data frame, there will be a row showing
user 1, movie 47, the rating, and the 19 genre indicators (each 1 or 0).
- Familiarize yourself with R's table() function, which
is quite versatile; see Appendix below. Then pick 3-4 genres, and write
code to do a two-way table of genre against rating. Comment on whether
it appears that some genres tend to get higher ratings than others. (No
formal statistical analysis for now.)
- Place your code in ProblemB.R.
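The join and the two-way table can be sketched on toy data. This is only an illustration: the values, the column names and the two genre columns are made up, the real files have a different layout, and the result is called joined here rather than the name the problem requires.

```r
# Toy stand-ins for the real data frames; all values are made up.
ratings <- data.frame(user = c(1, 1, 2, 3),
                      movie = c(10, 20, 10, 30),
                      rating = c(5, 3, 4, 2))
singles <- data.frame(movie = c(10, 20, 30),
                      Action = c(1, 0, 1),
                      Comedy = c(0, 1, 1))

# Join on the shared movie ID column; each rating row picks up the
# genre indicators of its movie.
joined <- merge(ratings, singles, by = "movie")
print(joined)

# Two-way table: genre indicator (rows) against rating (columns).
tbl <- table(joined$Action, joined$rating)
print(tbl)
```

Note that merge() keeps a single copy of the key column, so the result has one movie column rather than two.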
Appendix: R tables
> set.seed(9999)
> m <- matrix(sample(1:3,24,replace=T),ncol=3)
> md <- as.data.frame(m)
> md
  V1 V2 V3
1  2  1  3
2  1  2  3
3  1  3  3
4  3  1  3
5  2  3  1
6  2  2  2
7  2  2  2
8  3  2  1
> table(md[,2])
1 2 3
2 4 2
> table(md[,1:2])
   V2
V1  1 2 3
  1 0 1 1
  2 1 2 1
  3 1 1 0
> table(md)
, , V3 = 1
   V2
V1  1 2 3
  1 0 0 0
  2 0 0 1
  3 0 1 0
, , V3 = 2
   V2
V1  1 2 3
  1 0 0 0
  2 0 2 0
  3 0 0 0
, , V3 = 3
   V2
V1  1 2 3
  1 0 1 1
  2 1 0 0
  3 1 0 0
Note that a 1-way table is an R vector (with element names),
a 2-way table is an R matrix (with row and column names),
a 3-way table is a 3-D R array (row, column, layer names),
then 4-D and so on. As such, we can extract elements, etc.:
> table(md[,2]) / sum(table(md[,2]))
   1    2    3
0.25 0.50 0.25
> table(md)[2,2,2]
[1] 2
> table(md)[3,1,2]
[1] 0
> table(md)[1,2,3]
[1] 1
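Since a table is an array, the usual matrix tools apply to it as well. Here is a small sketch along those lines. The data frame is entered by hand from the printout of md above, because the algorithm behind sample() changed in R 3.6.0 and the seed may not reproduce the same values in a current R. The name tbl is arbitrary.

```r
# The data frame printed above, typed in by hand so that the example
# does not depend on the random number generator.
md <- data.frame(V1 = c(2, 1, 1, 3, 2, 2, 2, 3),
                 V2 = c(1, 2, 3, 1, 3, 2, 2, 2),
                 V3 = c(3, 3, 3, 3, 1, 2, 2, 1))

tbl <- table(md[, 1:2])       # 2-way table, V1 by V2
print(dim(tbl))               # matrix dimensions
print(rowSums(tbl))           # counts for each value of V1
print(prop.table(tbl, 1))     # each row rescaled to sum to 1
print(tbl["2", "2"])          # index by name rather than by position
```

Indexing by name, as in the last line, tends to be safer than indexing by position when some values might be absent from the data.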