ECS 132 Term Project
- Progress report: Due Monday, March 8, 11:59 pm, in the form
of an e-mail message to me. Must state the contribution of each group
member so far.
- Submission: Scheduled final exam day (no written final),
Wednesday, March 17, 11:59 pm. NO LATE SUBMISSIONS.
In Problem B, you will be using my regtools package.
There are various ways to install it on your computer, e.g. as follows
for a Unix-family system:
git clone http://github.com/matloff/regtools
R CMD INSTALL -l ~/R regtools
and then add this line to your R startup file, ~/.Rprofile:
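The startup line itself did not survive in this copy. A minimal sketch of one line that would work, assuming the package was installed into ~/R as in the command above, is the following; the line originally intended may differ.

```r
# Assumption: regtools was installed with `R CMD INSTALL -l ~/R`, so ~/R
# must be on the library search path before library(regtools) can find it.
.libPaths(c("~/R", .libPaths()))
```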
Note that regtools in turn requires several other packages to be
installed. If you don't have them already installed, you'll get an
error message telling you which ones you need.
For Problem B, you will also need to download the
Porto taxi trip data. We will use only train.csv, which you
can read using read.csv().
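As a starting point, the sketch below reads a file in the layout of train.csv and derives a trip duration for each row. It assumes the layout documented for this dataset, in which POLYLINE holds one GPS point per 15 seconds of travel. To stay self-contained it writes a three-row stand-in file with made-up coordinates, and the helper name tripSeconds is hypothetical. For the real file, point read.csv() at train.csv, perhaps with the nrows argument while experimenting.

```r
# Three-row stand-in for train.csv (made-up values, real column names).
toy <- data.frame(
  TRIP_ID   = c("T1", "T2", "T3"),
  CALL_TYPE = c("A", "B", "C"),
  POLYLINE  = c("[[-8.61,41.14],[-8.62,41.15],[-8.63,41.16]]",
                "[[-8.60,41.10]]",
                "[]"),
  stringsAsFactors = FALSE
)
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

trips <- read.csv(tmp, stringsAsFactors = FALSE)

# Hypothetical helper: each GPS point is "[lon,lat]", so the number of
# points is the number of "[" characters minus the one outer bracket.
tripSeconds <- function(polyline) {
  nPoints <- lengths(regmatches(polyline,
                                gregexpr("[", polyline, fixed = TRUE))) - 1
  # Assumption: 15 seconds between consecutive points; no points -> NA.
  ifelse(nPoints >= 1, 15 * (nPoints - 1), NA)
}

trips$duration <- tripSeconds(trips$POLYLINE)
```

Rows with an empty POLYLINE come out as NA, which is one of the data idiosyncrasies you will need to decide how to handle.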
Problem A
("Scavenger hunt.") Each person in your group must find, and write a
report on, a research poster somewhere in a campus building. The poster
must use p-values or equivalents, such as a statement like "The
confidence interval for the difference between two means does not
contain 0."
Broadening due to pandemic: Many of you are not in Davis, so you
may instead use a research paper found online. It still must have a UCD
professor as one of the authors.
In a paragraph or two, state the scientific topic being researched, and
discuss how using confidence intervals instead of significance tests
might result in a more insightful analysis, and discuss the possible dangers
that reliance on p-values might bring in this case.
Each section must have the name of
the team member who did the analysis for this example and wrote this
section of your report. Again, note that there must be a different
example written by each team member. The sections for Problem A
should be titled accordingly.
You may wish to cite authority here, e.g.
this announcement and its links.
Be sure to include the URL for the paper you analyze.
Problem B
As noted, this problem will involve analysis of the Porto taxi trip
data. Here are some general rules:
- Note that many of the parts of this problem will be rather
open-ended, using terms like explore and investigate.
But when possible, formal statistical inference methods must be used.
- So, there is flexibility here, but note that (a) you are restricted to
using the methods of our course and (b) you are restricted to using R
throughout, including for generating graphs.
- For graphics, use either (i) base-R graphics, (ii) the
lattice package or (iii) the ggplot2 package. Our book
contains a tutorial on ggplot2, but there are tons of examples of all
three on the Web. There are no specific graphical tasks delineated
below, but you'll definitely need good pictures to write a good,
professional quality report.
- The tapply() function is one of the most powerful in R
(also, the related split()). You may find it useful.
Do not use another package or statistical method without getting my approval.
- The UCI data repository is quite famous in the machine learning
world. Thus, its datasets tend to have many analyses on the Web. You
can probably find some for this Porto taxi data, and there are many,
many analyses of the various NYC taxi datasets on the Web. You are very
welcome to browse through these, but if you use anything that you find
there, you must cite it in your report.
- It is imperative that you start early. There are many
things to be done that you may not realize at the outset. You'll need
to figure out how the software works, which can sometimes be challenging
in the case of graphics. You'll encounter various idiosyncrasies in the
data, and will need to figure out how to handle them. You'll find
yourself wondering, "Well, in actuality, what does that
statistical method do?" This is a rather large dataset; some
computations may have long run times. Etc. We are happy to help you
through problems you encounter, but clearly you need to allocate time to
deal with these issues.
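For readers who have not used tapply() and split(), here is a minimal sketch on a made-up data frame; the column names callType and duration are illustrative, not taken from the dataset.

```r
# Made-up trip durations (seconds) for two call types.
trips <- data.frame(
  callType = c("A", "B", "A", "B", "A"),
  duration = c(300, 600, 500, 450, 700)
)

# tapply(): apply a function to 'duration' within each level of 'callType'.
meanByType <- tapply(trips$duration, trips$callType, mean)

# split(): the same grouping, returned as a list of vectors that can be
# processed further, e.g. with sapply().
byType  <- split(trips$duration, trips$callType)
nByType <- sapply(byType, length)
```

Both results are indexed by group name, so meanByType["A"] and nByType["A"] refer to the same group.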
Here are your tasks:
- Let T denote trip duration. Explore finding a model for
the density f_T from one of our density families.
- Let B denote the proportion of time a driver is busy,
i.e. actually driving rather than waiting for the next fare.
Explore finding a model for the density f_B from one of our density families.
- Investigate whether the type of call used to summon the taxi makes
much difference in mean trip time.
- Develop models for predicting trip time from other variables, both
the explicit ones and also trip distance. (This task has
many subparts, and will probably take up most of your time.)
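To make the density-fitting and call-type tasks concrete, here is a minimal sketch on simulated data. The gamma family, the method-of-moments fit, and the large-sample interval for a difference of two means are each just one possible route, chosen here for illustration. The data and the names dur and callType are made up, and the real analysis must justify its own choices.

```r
set.seed(1)
# Simulated stand-in for trip durations (seconds) and call types.
n <- 2000
callType <- sample(c("A", "B"), n, replace = TRUE)
dur <- rgamma(n, shape = 2, rate = 1 / 300) + ifelse(callType == "A", 30, 0)

# Method of moments for a gamma density: mean = shape / rate and
# variance = shape / rate^2, so rate = mean / var, shape = mean^2 / var.
m <- mean(dur)
v <- var(dur)
rateHat  <- m / v
shapeHat <- m^2 / v

# Numeric check of fit: empirical vs. fitted quartiles (a histogram with
# the fitted density overlaid via curve() is the graphical analogue).
probs <- c(0.25, 0.5, 0.75)
fitCheck <- rbind(empirical = quantile(dur, probs),
                  fitted    = qgamma(probs, shape = shapeHat, rate = rateHat))

# Approximate 95% confidence interval for the difference in mean duration
# between call types A and B (large-sample, unequal variances).
a <- dur[callType == "A"]
b <- dur[callType == "B"]
diffHat <- mean(a) - mean(b)
se <- sqrt(var(a) / length(a) + var(b) / length(b))
ci <- diffHat + c(-1.96, 1.96) * se
```

Reporting the interval ci, rather than only whether it excludes 0, shows how large the difference in means plausibly is, which is the question the call-type task asks.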
Details of the prediction task:
- First try a linear model.
- Note that "linear" means "linear in the beta parameters." You
can and should also try polynomial models, using qePolyLin().
- Find an approximate 95% confidence interval for E(Y | X = u)
(X, u vectors), your choice of u, for your straight line model.
- Then try two machine learning models, chosen from
In each case, you may wish to try some nondefault values for the
hyperparameters. (Some of the models have many hyperparameters, some of
them very esoteric; you need not learn them all.)
I suggest you use
the rough draft of my forthcoming book to learn what these methods
do. Or, there are many good, usually brief, tutorials on the Web.
I will give
brief overviews of ML methods in class.
- Compare the predictive power of the methods you used above (several
linear models plus two ML).
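The linear-model steps above can be sketched in base R as follows. This is only an illustration on simulated data: lm() and poly() stand in for the regtools wrappers such as qePolyLin(), whose arguments are not shown here. The variable names, the point u, and the use of mean absolute prediction error on a holdout set are assumptions made for the example.

```r
set.seed(2)
# Simulated stand-in: trip time (seconds) versus trip distance (km).
n <- 1000
dist <- runif(n, 0.5, 15)
time <- 120 + 90 * dist + 2 * dist^2 + rnorm(n, sd = 60)
d <- data.frame(time = time, dist = dist)

# Hold out 20% of the rows for assessing predictive power.
testIdx <- sample(n, 200)
trn <- d[-testIdx, ]
tst <- d[testIdx, ]

# Straight-line model and a degree-2 polynomial model; both are linear
# in the beta parameters.
fitLin  <- lm(time ~ dist, data = trn)
fitPoly <- lm(time ~ poly(dist, 2), data = trn)

# Approximate 95% confidence interval for E(time | dist = u), here u = 5.
u <- data.frame(dist = 5)
ciMean <- predict(fitLin, newdata = u, interval = "confidence", level = 0.95)

# Compare the models by mean absolute prediction error on the holdout set.
mape <- function(fit) mean(abs(tst$time - predict(fit, newdata = tst)))
errs <- c(linear = mape(fitLin), poly2 = mape(fitPoly))
```

The same holdout rows and the same error measure can be reused for the machine learning models, so that all methods are compared on equal terms.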
Important General Rules
PLEASE FOLLOW THESE RULES 100%!
- Groups that put in a reasonable amount of time -- and thought! --
almost always receive an A or A+ grade on the project. Groups that
do not complete the project usually get a D grade. PLEASE START EARLY.
- As explained in class, groups that do good work on the project
receive a bonus in their course grades, beyond what their quiz
and homework grades would give. The boost is always at least one
notch (e.g. B to B+) and typically two notches (e.g. B to A-). E.g. a
student could have strictly B work in the homework and quizzes and
yet still get an A- in the course.
- A+ grades are very possible, and can have a significant impact
on your course grade, letters of recommendation, knighthoods,
marriage prospects, coronations, etc.
- Technical content of the work (correctness, thoroughness etc.).
- Adherence to instructions.
- Professional quality of the work: Clear, engaging writing,
using correct grammar; it need not (should not) be pretentious, but
avoid being too colloquial ("the mean was kinda low"). Presentation
need not be fancy, but graphs and tables should be used where helpful.