Homework 1

Due Thursday, January 20, 11:59 pm

Extremely Important Submission Rules

As noted, you will be graded on your homework interactively in your group. However, the TA will pre-screen your submission, running an R script. You must follow the rules exactly, OR RISK NOT GETTING CREDIT. Here are the rules:

Here are details on the TA's script (he could share it with you if he wants):

Generality

A Note on the Role of Programming

ECS 172 is not a programming course. You should find the coding in our course to be quite straightforward, not challenging. The programming serves as a tool for data analysis, which is our goal.

Problem 1

Here you will explore how long individual user i waits between making ratings, and the same thing for all users collectively. Let's call the means of those quantities Wi and W, respectively.

Write a function with call form

waitTimes(rawData)

where rawData is a dataframe whose first column is user ID and second is timestamp, e.g. the first and fourth columns of ml100 in our book. The return value is an R list, with elements individs and overall, the first being a vector consisting of the Wi (in the order of user ID) and the second consisting of W.

Note: The first wait time is from the first timestamp to the second.
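As a rough illustration of what a "wait time" is, here is a minimal sketch on a made-up data frame. The data values and the names toy, byUser and meanWaits are hypothetical, and the sketch deliberately does not use mergeEm() or compute W; it only shows how gaps between consecutive timestamps can be obtained for each user.

```r
# Toy data (hypothetical): user ID in column 1, timestamp in column 2
toy <- data.frame(userID = c(1, 2, 1, 2, 1),
                  timestamp = c(100, 50, 130, 90, 190))

# Group the timestamps by user; the result is ordered by user ID
byUser <- split(toy$timestamp, toy$userID)

# For each user, sort the timestamps, take the gaps between
# consecutive ones with diff(), and average the gaps
meanWaits <- sapply(byUser, function(ts) mean(diff(sort(ts))))
print(meanWaits)
```

For user 1 the sorted timestamps are 100, 130, 190, so the waits are 30 and 60, with mean 45.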

In order to make sure you know a variety of R constructs, there will be two requirements:

Have your code conduct and print a test of mergeEm() on a very small dataset. Have your code run waitTimes() on the full ml100kpluscov data, and print out the first 10 elements of each of Wi and W; of course, waitTimes() must make use of mergeEm().

Problem 2

Are modern movie titles longer or shorter than in the past? Is there any time trend? You'll explore that here, and gain some experience with string processing.

The file u.item in the MovieLens .zip file contains information about the movies, including title and year of release. Write and run code to plot mean title length (in words) against release year.

You'll need to keep the apostrophes in titles from being treated as quote characters, using the option quote="" in your call to read.table().
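A small sketch of the effect of quote="" follows. The two input lines are made up, and the "|" separator is an assumption about the file layout that you should check against the actual u.item file.

```r
# Two made-up lines in an assumed "|"-separated layout
lines <- c("1|Schindler's List", "2|Devil's Own, The")

# quote = "" turns off quote processing, so apostrophes are read
# as ordinary characters rather than as string delimiters
movies <- read.table(text = lines, sep = "|", quote = "",
                     stringsAsFactors = FALSE)
print(movies)
```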

Two strings will be considered separate words if they are separated by a blank, other whitespace, or punctuation such as a comma.
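One possible way to count words under this rule is sketched below, on made-up titles and years. The regular expression treats any run of whitespace and commas as a single separator; that choice, and the names titles, years, nWords and meanLen, are assumptions for illustration only.

```r
# Made-up titles and release years
titles <- c("Toy Story", "Postman, The", "Four Weddings and a Funeral")
years <- c(1995, 1994, 1994)

# Split each title on runs of whitespace and/or commas
words <- strsplit(titles, "[[:space:],]+")

# Number of words in each title
nWords <- lengths(words)

# Mean title length for each release year
meanLen <- tapply(nWords, years, mean)
print(meanLen)
```

The resulting vector has one mean per year, with the years as names, which is the form needed for plotting mean length against year.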

The output is just the plot against time.

Problem 3

Do different movie genres have different rating patterns? You will explore this here.

Recall from your probability course that the density function fX(t) of a continuous random variable X is a function that can be integrated to obtain probabilities. But the function itself, loosely speaking, gives relative frequencies. If say fX(28.8) is large, then X often occurs near 28.8, or rarely so if the density is small there.
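To make the point about integration concrete, here is a small sketch using a normal density. The choice of a normal distribution with mean 28.8 and standard deviation 2 is purely hypothetical.

```r
# Hypothetical density: normal with mean 28.8 and sd 2
f <- function(t) dnorm(t, mean = 28.8, sd = 2)

# P(27.8 < X < 29.8) is the integral of the density over that interval
p <- integrate(f, lower = 27.8, upper = 29.8)$value
print(p)

# The density is higher near 28.8 than far away, so X falls
# near 28.8 more often than near 35
print(c(f(28.8), f(35)))
```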

In other words, a density is like a histogram, and in fact a histogram is actually a sample estimate of the population density.

A nicer estimator is called a kernel density estimator. Its salient virtue is that it is a smooth curve, not choppy like a histogram. Since we are interested in multiple variables, for multiple genres, the graph will be a lot less cluttered if we use kernel estimators. The analog of the histogram bin width is something called the bandwidth, but we'll just take the default value.
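In R, a kernel density estimate can be computed with the built-in density() function, which chooses a default bandwidth if none is given. The sketch below uses simulated data as a stand-in for real ratings; the variable names are hypothetical.

```r
# Simulated data standing in for 200 mean ratings
set.seed(1)
x <- rnorm(200, mean = 3.5, sd = 0.5)

# Kernel density estimate with the default bandwidth
dens <- density(x)
print(dens$bw)   # the bandwidth that was chosen

# The estimate is a curve evaluated on a grid of points;
# like any density, the area under it is approximately 1
area <- sum(dens$y) * diff(dens$x[1:2])
print(area)
```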

Write a function with call form

plotDensities(inputDF,xName,grpName)

Arguments:

The estimated densities will all be plotted on the same graph, in different colors, and a legend showing which curve is which will be displayed.
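A minimal sketch of overlaying several estimated densities with a legend is shown here, using a made-up data frame with two groups. It is not an implementation of plotDensities(); the column names x and grp and the other variable names are assumptions for illustration.

```r
# Made-up data: a numeric column and a grouping column
set.seed(2)
toyDF <- data.frame(x = c(rnorm(100, mean = 3), rnorm(100, mean = 4)),
                    grp = rep(c("a", "b"), each = 100))

# One kernel density estimate per group
groups <- unique(toyDF$grp)
dens <- lapply(groups, function(g) density(toyDF$x[toyDF$grp == g]))

# Axis ranges wide enough to hold every curve
xr <- range(sapply(dens, function(d) range(d$x)))
yr <- range(sapply(dens, function(d) range(d$y)))

# Draw the first curve, then add the others in different colors
plot(dens[[1]], xlim = xr, ylim = yr, col = 1,
     main = "Estimated densities by group", xlab = "x")
for (i in seq_along(dens)[-1]) lines(dens[[i]], col = i)

# Legend showing which color belongs to which group
legend("topright", legend = groups, col = seq_along(groups), lty = 1)
```

Computing the axis ranges first matters because the first plot() call fixes the plotting region; curves added later with lines() would otherwise be clipped.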

The variable X here is mean rating. It turns out that there are 757 movies having genre 8. After randomization due to movies possibly having more than one genre, let's say there are 622 movies of genre 8. So, you will plot an estimated density based on those 622 mean ratings. One could draw a histogram of those 622 numbers, but it's nicer to draw a smooth curve. And you will do the same for each of the genres. (So you can see why plotting histograms would not be good; they would partially obscure each other.)

The genres are given in the data as indicator variables, 0s and 1s (called dummy variables in stat and econ, and one-hot in machine learning). You'll need to convert them.

With 19 genres, you may find the colors hard to distinguish. If so, just do 8 genres. Also, note that some movies are of multiple genres; the curve for genre i should be based on all movies having that genre, even though two different curves may be derived from somewhat overlapping sets of movies.
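One possible way to convert indicator columns into a single grouping column, while letting a multi-genre movie contribute to each of its genres, is to stack the data into "long" form. The sketch below uses a made-up data frame; the column names and values are hypothetical, and this is only one reading of the conversion step.

```r
# Made-up data: one row per movie, with two genre indicator columns
toy <- data.frame(meanRating = c(3.2, 4.1, 2.8),
                  Action = c(1, 0, 1),
                  Comedy = c(0, 1, 1))
genreCols <- c("Action", "Comedy")

# For each genre, keep the movies flagged with that genre and label
# them; a movie with several genres appears once per genre
long <- do.call(rbind, lapply(genreCols, function(g) {
  idx <- toy[[g]] == 1
  data.frame(meanRating = toy$meanRating[idx],
             genre = rep(g, sum(idx)))
}))
print(long)
```

Here the third movie is both Action and Comedy, so its mean rating of 2.8 appears in both groups.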