Homework I

Due Monday, February 3

Notes on Submission Packages

Note: You will often need to do data conversion. For example, although the forest cover data below is numeric, it is stored as a data frame, so you may need to call as.matrix() to convert it to a matrix.
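As a minimal sketch of such a conversion, the snippet below uses a small made-up data frame (the name df and its values are hypothetical, not the assignment data). Keep in mind that if any column is non-numeric, as.matrix() returns a character matrix instead of a numeric one.

```r
# Hypothetical all-numeric data frame standing in for the real data
df <- data.frame(x = c(1.5, 2.0, 3.5), y = c(10, 20, 30))

is.matrix(df)        # FALSE: a data frame is a list of columns
m <- as.matrix(df)   # convert to a numeric matrix
is.matrix(m)         # TRUE
dim(m)               # 3 rows, 2 columns

# Matrix operations now apply, e.g. the cross-product t(m) %*% m
xtx <- crossprod(m)
```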

Problem A

This problem involves principal component analysis (PCA), including computing the principal components and graphing them, similar to the "genetic map of Europe" example. In that example, researchers took genomic measurements, "genotype data from 197,146 loci in 1,387 individuals," then graphed the first two principal components, color-coding the individuals by country. The result looks strikingly like a geographic map of Europe!

One implication of this is that if we wanted to predict nationality from genes, instead of needing 197,146 variables ("features"), we'd do pretty well with only two (PC1 and PC2).
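The general workflow, computing the principal components and then plotting the first two with color-coding by group, might be sketched as follows. This is only an illustration under stated assumptions: R's built-in iris data stands in for a real dataset, prcomp() is one of several ways to compute PCA in R, and the variable names are made up for the example.

```r
# Assumption: the built-in iris data stands in for a real dataset
x <- as.matrix(iris[, 1:4])      # numeric features only
pca <- prcomp(x, scale. = TRUE)  # center and scale, then compute PCs

pcs <- pca$x[, 1:2]              # scores on PC1 and PC2

# Proportion of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scatter plot of the first two PCs, color-coded by group
plot(pcs, col = as.integer(iris$Species), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)
```

If the groups separate well in this plot, the first two components alone carry much of the information needed to tell the groups apart.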

Let's see if we can do as well on another dataset.

Problem B

Here you will acquire skill in R data tools. The problem will also illustrate an all-important point: Know your data. Explore the data before embarking on any analysis.
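A first look at a dataset might go along these lines. This is a rough sketch, not a required procedure; the built-in mtcars data is an assumption standing in for the assignment's data, and the variable names are illustrative.

```r
# Assumption: built-in mtcars stands in for the assignment's data
d <- mtcars

dim(d)            # number of rows and columns
str(d)            # type of each column
summary(d$mpg)    # range and quartiles of one variable

n_missing <- colSums(is.na(d))   # missing values per column
cyl_counts <- table(d$cyl)       # frequency of each distinct value
```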

Appendix: R tables

> set.seed(9999)
> m <- matrix(sample(1:3,24,replace=TRUE),ncol=3)
> md <- as.data.frame(m)
> md
  V1 V2 V3
1  2  1  3
2  1  2  3
3  1  3  3
4  3  1  3
5  2  3  1
6  2  2  2
7  2  2  2
8  3  2  1
> table(md[,2])
1 2 3 
2 4 2 
> table(md[,1:2])
   V2
V1  1 2 3
  1 0 1 1
  2 1 2 1
  3 1 1 0
> table(md)
, , V3 = 1

   V2
V1  1 2 3
  1 0 0 0
  2 0 0 1
  3 0 1 0

, , V3 = 2

   V2
V1  1 2 3
  1 0 0 0
  2 0 2 0
  3 0 0 0

, , V3 = 3

   V2
V1  1 2 3
  1 0 1 1
  2 1 0 0
  3 1 0 0

Note that a 1-way table behaves like an R vector (with element names), a 2-way table like an R matrix (with row and column names), a 3-way table like a 3-D R array (with row, column, and layer names), and so on for 4-D and higher. As such, we can extract elements, do arithmetic, etc.:

> table(md[,2]) / sum(table(md[,2]))
   1    2    3 
0.25 0.50 0.25 
> table(md)[2,2,2]
[1] 2
> table(md)[3,1,2]
[1] 0
> table(md)[1,2,3]
[1] 1
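
Because tables are arrays, they can also be indexed by level name and collapsed over dimensions. The sketch below rebuilds md from the values printed above (so it does not depend on the random number generator) and uses apply() as one possible way to sum over dimensions; the names tbl, v1_counts and v1v2 are illustrative only.

```r
# Rebuild md from the values printed above
md <- data.frame(V1 = c(2, 1, 1, 3, 2, 2, 2, 3),
                 V2 = c(1, 2, 3, 1, 3, 2, 2, 2),
                 V3 = c(3, 3, 3, 3, 1, 2, 2, 1))
tbl <- table(md)

dim(tbl)             # one dimension per variable

# Index by level name (character strings) rather than by position
tbl["2", "2", "2"]

# Sum over V2 and V3 to get the 1-way counts for V1
v1_counts <- apply(tbl, 1, sum)

# Sum over V3 only; this reproduces table(md[,1:2])
v1v2 <- apply(tbl, c(1, 2), sum)
```

Here positions and level names happen to coincide, since the levels are 1, 2 and 3. In general they differ, and indexing by name is the safer choice.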