Homework 1

Due Thursday, January 20, 11:59 pm

Extremely Important Submission Rules

As noted, you will be graded on your homework interactively in your group. However, the TA will pre-screen your submission, running an R script. You must follow the rules exactly, OR RISK NOT GETTING CREDIT. Here are the rules:

Your submission will be in the form of a .tar file, with naming convention as in our course syllabus.

You will make sure that when the TA's script unpacks that file, your .R files will be there, in the same directory from which the script ran. No subdirectories.

Your code file names are Problem1.R, Problem2.R, and Problem3.R. You must have no other files.

The code in those files will be directly runnable; upon being read by source(), your code will execute, including plots if any.

Remember, your code will run on the TA's machine, or possibly CSIF, NOT YOUR MACHINE. So your code cannot use any special environment that you have on your machine. You may use any packages that are in the R installation on CSIF, as well as those listed below, but no others. And avoid the temptation to use a package that does not appreciably reduce your development/debugging time; package bloat is the root of all evil. :-)

Here are details on the TA's script (he could share it with you if he wants):

In the same directory from which the script is run, the TA will have the MovieLens data, in the Hwk1.RData file. The contents are the data frames ml100 and ml100kpluscovs from our book.

For the graphics problems, the base R plotting facilities are fine. But the TA will make sure the ggplot2 and lattice packages are loaded too, if you wish to use them. (I generally like the colors in the latter more.)

The script, entirely R code, will be something like this:

library(ggplot2)
library(lattice) 
library(rectools)
library(regtools)

load('Hwk1.RData')

for (f in dir(pattern='*.tar')) {  
   system(paste('tar xf',f))
   for (i in 1:3) {
      source(paste0('Problem',i,'.R'))
      readline('hit Enter when ready')
   }
}

Generality

A Note on the Role of Programming

ECS 172 is not a programming course. You should find the coding in our course to be quite straightforward, not challenging. The programming serves as a tool for data analysis, which is our goal.

Problem 1

Here you will explore how long individual user i waits between making ratings, and the same thing for all users collectively. Let's call the means of those quantities W_i and W, respectively.

Write a function with call form

waitTimes(rawData)

where rawData is a data frame whose first column is user ID and second is timestamp, e.g. the first and fourth columns of ml100 in our book. The return value is an R list, with elements individs and overall, the first being a vector consisting of the W_i (in the order of user ID, so that for instance element 22 is W_22) and the second consisting of W.

Note: The first wait time is from the first timestamp to the second.

In order to make sure you know a variety of R constructs, there will be two requirements:

Use this R idiom for finding differences of successive items in a vector, making use of shifts:
```
> x <- c(3:5,12,13)
> x
[1]  3  4  5 12 13
> xLeft <- x[2:5]
> xLeft
[1]  4  5 12 13
> xLeft - x[1:4]
[1] 1 1 7 1
```
This vectorizes the differencing operation, much faster than a loop.
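The same shift idiom can be written without hard-coding the vector length. Here is a minimal sketch; the helper name succDiffs is hypothetical, not part of the assignment:

```r
# Hypothetical helper: differences of successive elements via shifted copies.
succDiffs <- function(x) {
   n <- length(x)
   if (n < 2) return(x[0])   # no successive pairs, so return an empty vector
   x[2:n] - x[1:(n-1)]       # shifted-left copy minus the original, vectorized
}

x <- c(3:5,12,13)
print(succDiffs(x))
```

Base R's diff() computes the same thing, which makes it handy as a check, but the requirement here is to use the shift idiom.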

Write (and use) a function with this call form:
```
mergeEm(listOfVecs)
```
Here listOfVecs is an R list, each element of which is a vector sorted in ascending order. The vectors need not be of the same length. The return value is the merge of all the individual vectors, again in sorted order.

Note that this is to be a general function, usable in R in general. None of the code should be specific to this assignment.
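One way to test a general mergeEm() is to compare its output with a reference result computed by other means. The sketch below only builds such a reference. The use of sort(unlist()) is purely a check, and is presumably not the kind of implementation intended, since it ignores the fact that the inputs are already sorted. The names testList and expected are hypothetical.

```r
# Small test input: each vector is already sorted, and the lengths differ.
testList <- list(c(1,4,9), c(2,3), c(5))

# Reference answer, obtained by flattening and re-sorting.
expected <- sort(unlist(testList))
print(expected)

# With your own mergeEm() defined, a check could look like this:
# stopifnot(identical(mergeEm(testList), expected))
```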

Have your code conduct and print a test of mergeEm() on a very small dataset. Have your code run waitTimes() on the full ml100kpluscovs data, and print out the first 10 elements of each of W_i and W; of course, waitTimes() must make use of mergeEm().

Problem 2

Are modern movie titles longer or shorter than in the past? Is there any time trend? You'll explore that here, and gain some experience with string processing.

The file u.item in the MovieLens .zip file contains information about the movies, including title and year of release. Write and run code to plot mean title length (in words) against release year.

You'll need to keep the apostrophes in the titles from being treated as quote characters, using the option quote="" in your call to read.table().
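To see what quote="" does, here is a small sketch on made-up lines in the style of u.item. The '|' separator and the three-field layout are assumptions for illustration; the real file has more fields.

```r
# Made-up lines in the style of u.item ('|'-separated); not actual file contents.
sampleLines <- c("1|Someone's Movie (1990)|01-Jan-1990",
                 "2|Another Title (1985)|01-Jan-1985",
                 "3|It's a Third One (1990)|01-Jan-1990")

# quote="" turns off quote processing, so apostrophes are ordinary characters.
items <- read.table(text=sampleLines, sep='|', quote="", stringsAsFactors=FALSE)
print(items$V2)
```

Without quote="", read.table() would treat each apostrophe as the start or end of a quoted string, and the fields would no longer line up.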

Two strings will be considered words if they are separated by a blank or other delimiter, e.g. a comma.
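Counting words and averaging by year might be sketched as follows. The titles and years are made up, and the sketch assumes the release year has already been separated from the title; whether the year counts as a word is left to you.

```r
titles <- c("Someone's Movie", "One, Two, Three", "Solo")   # made-up titles
years <- c(1990, 1985, 1990)                                # made-up release years

# Split on runs of whitespace and/or commas, then count the pieces per title.
words <- strsplit(titles, '[[:space:],]+')
nWords <- lengths(words)
print(nWords)

# Mean title length within each release year.
meanLen <- tapply(nWords, years, mean)
print(meanLen)
```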

The output is just the plot against time.

Problem 3

Do different movie genres have different rating patterns? You will explore this here.

Recall from your probability course that the density function f_X(t) of a continuous random variable X is a function that can be integrated to obtain probabilities. But the function itself, loosely speaking, gives relative frequencies. If say f_X(28.8) is large, then X often occurs near 28.8, or rarely so if the density is small there.

In other words, a density is like a histogram, and in fact a histogram is actually a sample estimate of the population density.

A nicer estimator is called a kernel density estimator. Its salient virtue is that it is a smooth curve, not choppy like a histogram. Since we are interested in multiple variables, for multiple genres, the graph will be a lot less cluttered if we use kernel estimators. The analog of the histogram bin width is something called the bandwidth, but we'll just take the default value.
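As a small illustration of the difference, the sketch below draws a histogram and a kernel estimate for the same simulated data. The data are made up as a stand-in for mean ratings.

```r
set.seed(1)                          # reproducible simulated data
x <- rnorm(500, mean=3.5, sd=0.5)    # made-up stand-in for mean ratings

d <- density(x)                      # kernel estimate, default bandwidth
print(d$bw)                          # the bandwidth actually used

hist(x, freq=FALSE)                  # histogram on the density scale
lines(d$x, d$y)                      # smooth curve overlaid on the same axes
```

The object returned by density() holds the curve as paired vectors x and y, which is what makes it easy to overlay several curves on one graph.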

Write a function with call form

plotDensities(inputDF,xName,grpName)

Arguments:

inputDF: Input data frame.

xName: name of the column in inputDF that we want to draw "histograms" of

grpName: name of the column in inputDF to be used for grouping; the function will group the xName data, and then draw a "histogram" for each group; in our case, the groups are the genres; the code will assume the grouping column is an R factor, showing the first genre of that movie

The estimated densities will all be plotted on the same graph, in different colors, and a legend showing which curve is which will be displayed.

The variable X here is mean rating. It turns out that there are 757 movies having genre 8. After randomization due to movies possibly having more than one genre, let's say there are 622 movies of genre 8. So, you will plot an estimated density based on those 622 mean ratings. One could draw a histogram of those 622 numbers, but it's nicer to draw a smooth curve. And you will do the same for each of the genres. (So you can see why plotting histograms would not be good; they would partially obscure each other.)

If in any row of the data there are multiple genres, choose one at random, by calling sample(). Set the other 1s in that row to 0s.
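For a single row of 0/1 indicators, the random choice could be sketched like this. The helper name keepOneGenre is hypothetical.

```r
# Hypothetical helper: keep a single 1 in a 0/1 indicator vector, chosen at random.
keepOneGenre <- function(ind) {
   ones <- which(ind == 1)
   if (length(ones) > 1) {
      keep <- sample(ones, 1)   # safe here: ones has at least 2 elements
      ind[ones] <- 0
      ind[keep] <- 1
   }
   ind
}

set.seed(2)
row <- c(0,1,0,1,1)
print(keepOneGenre(row))
```

The length check matters because sample(n, 1) with a single number n draws from 1:n rather than returning n itself.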

Use either base R, ggplot2 or lattice. There are tons of tutorials on the Web. If you don't use base R, remember to call library() or require() at the start of your function.

Use base R's density() function or an equivalent. These draw smooth curves instead of histograms.

To use base R, call plot() on the first curve, then lines() for each of the rest. You'll first need to find the max Y coordinate among all the curves, and set ylim accordingly in your call to plot(). You can use 1,2,3,... for the default colors, or your own choice, but the function must be GENERAL.
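The plot()/lines() pattern with a common ylim might look like the following sketch. It uses two simulated groups with made-up names, and is not the general function the problem asks for.

```r
set.seed(3)
# Simulated data for two made-up groups.
dens <- list(a=density(rnorm(200, 3, 0.4)),
             b=density(rnorm(200, 4, 0.8)))

# Common axis limits, so that no curve is clipped.
yMax <- max(sapply(dens, function(d) max(d$y)))
xRng <- range(sapply(dens, function(d) range(d$x)))

# First curve sets up the axes; the rest are added with lines().
plot(dens[[1]]$x, dens[[1]]$y, type='l', col=1,
     xlim=xRng, ylim=c(0,yMax), xlab='x', ylab='density')
for (i in seq_along(dens)[-1])
   lines(dens[[i]]$x, dens[[i]]$y, col=i)
legend('topright', legend=names(dens), col=seq_along(dens), lty=1)
```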

Note that this is to be a general function, usable in R in general. None of the code should be specific to this assignment.

The genres are given in the data as indicator variables, 0s and 1s (called a dummy variable in stat and econ, and one-hot in machine learning). You'll need to convert them.
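One possible conversion from indicator columns to a factor is sketched below. The column names are made up, and the sketch assumes each row has exactly one 1.

```r
# Made-up indicator columns; each row is assumed to have exactly one 1.
dummies <- data.frame(Action=c(1,0,0), Comedy=c(0,1,0), Drama=c(0,0,1))

# Index of the column holding the 1 in each row.
idx <- max.col(as.matrix(dummies), ties.method='first')

# Map column indices to names and store the result as a factor.
genre <- factor(names(dummies)[idx], levels=names(dummies))
print(genre)
```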

With 19 genres, you may find the colors hard to distinguish. If so, just do 8 genres. Also, regarding the fact that some movies are of multiple genres: the curve for genre i should be based on all movies having that genre, even though two different curves may then be derived from somewhat overlapping sets of movies.