Blog, ECS 189G, Winter 2020

Tuesday, March 17, 11:40 am

I'm mailing out the job interview grades. I have two for which I apparently misspelled or mis-heard the student's name. If you do not receive your grade and did participate, let me know.

Tuesday, March 17, 9:50 am

Please note that my handin username is "matloff". See syllabus.

Monday, March 16, 8:10 pm

I've been asked to extend the due date of our project, as Covid-19 has caused a disruption to many students' lives.

I'm very reluctant to do this (see below), but am going ahead and changing the due date to Thursday at noon. Sharp!

This is not ideal. The original due date gave everyone 22 days, certainly more than enough. Some students are leaving campus after tomorrow. Others, understandably, feel they've done enough, and don't want to feel pressed to work past tomorrow on the project.

Also very important: It takes a long time for me to grade the projects, and I also have ECS 145, so it's always a struggle for me to submit my course grades to the Registrar by the deadline. This extension will of course compound that problem.

In light of that last point, as soon as you submit your final version, please let me know, so I can start grading it.

As before, I'm looking forward to reading your work, which I'm sure will be interesting and well-executed.

Saturday, March 14, 2:25 pm

To ECS 145 and 189G students:

In submitting your Term Projects, make sure -- ABSOLUTELY, TOTALLY, POSITIVELY, EXTREMELY SURE -- that you get the file name right, with proper e-mail addresses etc. In grading the ECS 145 group quiz just now, two of the submissions were wrong. One used commas instead of periods and the other had only the submitter's name, not the other three team members. Fortunately I caught it in time, and changed the file names by hand, but potentially SEVEN students could have had an F grade on this quiz.

Wednesday, March 11, 10:50 pm

Earlier I wrote here that any group that would like feedback from me on their Hwk II reports should e-mail me their PDFs. I had thought I finished all the reports that were sent to me, but today I learned of one that I had apparently overlooked. If you did not receive feedback from me in response to a PDF you sent me, please let me know. (Note: The same member of your group who sent me the PDF should send me the new message.)

Tuesday, March 10, 11:20 pm

I fixed a couple of typos in today's supplement, and added some clarifying phrasing. Please use this version.

Tuesday, March 10, 7:30 pm

Should have mentioned today: There is a shortcut to finding matrix powers.

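Say we want to find a power of M with exponent at least 20. We do this (note that in R, K^2 would square the elements of K individually, so we use the matrix product %*%):

K <- M; K <- K %*% K; K <- K %*% K; K <- K %*% K; K <- K %*% K; K <- K %*% K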

This gives us power 32.
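
Here is a minimal sketch of the repeated-squaring idea in base R, assuming M is a square numeric matrix. The helper name matPowerOf2 is hypothetical, not from any package. It uses %*% throughout, since ^ applied to an R matrix works element by element.

```r
# Hypothetical helper: raise M to the power 2^nSquarings by repeated squaring.
matPowerOf2 <- function(M, nSquarings) {
   K <- M
   for (i in seq_len(nSquarings)) K <- K %*% K   # matrix product, not K^2
   K
}

M <- matrix(c(1, 1, 0, 1), nrow = 2)   # columns (1,1) and (0,1)
K <- matPowerOf2(M, 5)                 # M^32, using only 5 multiplications
K
```

Five squarings replace the 31 multiplications that a naive loop would need.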

Tuesday, March 10, 11:35 pm

Just added our last supplement.

Tuesday, March 10, 9:45 pm

Just in case my 5:40 post was not explicit enough: Yes, we will have class tomorrow; yes, I will hold the Job Interview session as planned; and yes, we will have our Group Quiz on Friday.

Tuesday, March 10, 6:10 pm

In the Term Project specs, there is no requirement, or even recommendation, that you use the k-NN routines in either rectools or regtools. Depending on what you'd like the user to see -- again, it's open-ended, and I want you to produce something useful enough that some people will see it in your GitHub repo and make use of it -- you might use or modify one of the above routines or write your own.

Tuesday, March 10, 5:40 pm

As you have probably heard, all in-person final exams are canceled. But since our class didn't have a final anyway, there is no effect on us.

Tuesday, March 10, 5:25 pm

You may find this useful. The partykit call

predict(partyObj,xval,type='node')

returns the node number that xval lands in.

E.g.

library(regtools)
library(partykit)
data(mlb)
mlb <- mlb[,c(4:6)]  # Height, Weight, Age
ctout <- ctree(Weight ~ .,data=mlb)
predict(ctout,mlb[1,],type='node')
predict(ctout,data.frame(Height=71,Age=22.5),type='node')

prints the node number for each of the two cases.

This also can be used to determine which rows of the original dataset are used in which nodes for prediction of future new cases. E.g.

> nodeRows <- split(1:nrow(mlb),predict(ctout,type='node'))
> names(nodeRows)
 [1] "4"  "5"  "8"  "9"  "10" "14" "15" "16" "19"
[10] "20" "21"
> nodeRows[[1]]
 [1]   6  15  73  75  77  84 105 139 175 219 273
[12] 279 308 331 343 347 349 435 437 444 447 449
[23] 515 550 562 580 614 654 677 682 712 752 757
[34] 785 790 810 815 850 854 883 894 922 928 961
[45] 996
> mean(mlb$Weight[nodeRows[[1]]])
[1] 176.9556

Any future case that lands in node 4 will have his weight predicted to be 176.9556.

Tuesday, March 10, 8:55 am

As reiterated in class yesterday, your code in the Term Project must be general. Do not tailor it to the example datasets. If a dataset's file is not in the form your general code expects, the user must do his/her own preprocessing before calling your functions.

For instance, for the MOOCs dataset in our specs, the user must divide the rating range into intervals, to make it categorical, before calling ratingProbsFit().
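
One way a user might do that preprocessing is with base R's cut(). This is only a sketch: the ratings below are made up, and the 0-10 scale and the choice of 5 equal-width intervals are assumptions, not taken from the MOOCs data or the specs.

```r
# Illustrative continuous ratings; the real ones would come from the user's file.
ratings <- c(0.5, 2.3, 4.9, 5.0, 7.7, 9.8, 10)

# Divide the range [0,10] into 5 intervals of equal width.
# include.lowest = TRUE keeps a rating of exactly 0 in the first interval.
ratingCat <- cut(ratings, breaks = seq(0, 10, by = 2), include.lowest = TRUE)

table(ratingCat)          # counts per interval
as.integer(ratingCat)     # interval codes 1,...,5
```

The result is an R factor, so the ratings are now categorical.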

Also: Note that the central goal is to produce a set of probabilities. If I am considering watching movie 324, I want to know: What is the probability that I would rate it a 5? A 4? Etc.

Monday, March 9, 8:25 pm

Here is an example of ctree() where the outcome/response variable is categorical.

The dataset is prgeng, included in regtools. It's data on programmers and engineers in Silicon Valley in 2000. We will predict occupation, of which there are 6 categories.

library(regtools)
library(partykit)
data(prgeng)
head(prgeng)
# check the class of each column
for (i in 1:ncol(prgeng)) print(class(prgeng[,i]))
ctout <- ctree(occ ~ .,data=prgeng)
ctout

Here are the first few lines of output:

> ctout

Model formula:
occ ~ age + educ + sex + wageinc + wkswrkd

Fitted party:
[1] root
|   [2] educ in 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
|   |   [3] sex <= 1
|   |   |   [4] age <= 35.15345
|   |   |   |   [5] wageinc <= 49500: 101 (n = 1355, err = 63.2%)
|   |   |   |   [6] wageinc > 49500: 102 (n = 441, err = 63.3%)
|   |   |   [7] age > 35.15345

There are 79 nodes in all. We see above that nodes 5 and 6 are two of the leaf nodes. In node 5, we guess occupation code 101, and 102 for node 6.

Friday, March 6, 9:30 pm

In recosystem's data_memory() function, the index1=TRUE argument is crucial; it deals with the fact that indices start at 1 in R, while they begin at 0 in C, the latter being the language in which the core of recosystem is written.

One student misspelled the argument as 'index', so recosystem assumed that indices started at 0, and the code produced NAs. I think this is a good chance for a learning experience. Let's think about what happened as a result of the typo.

It's easy to understand that having the positions off by 1 would result in NAs. But you might wonder why it even ran at all. Shouldn't R have emitted an error message like "index: invalid argument name"? Ordinarily yes, but the problem is that the last formal argument of data_memory() is '...'. That is intended to accommodate various other arguments, so anything is allowed, including 'index'.
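
The effect of a trailing '...' can be seen with a small stand-in function. This is a sketch only: f() and the misspelling 'indx1' are hypothetical, and nothing here calls recosystem itself.

```r
# Hypothetical function: a named argument with a default, and '...' as the
# last formal argument.
f <- function(x, index1 = FALSE, ...) {
   extras <- list(...)          # anything unmatched lands here
   list(index1 = index1, extraNames = names(extras))
}

r1 <- f(1:3, index1 = TRUE)   # intended call: index1 is TRUE

# Misspelled call. The misspelling is chosen so that it is not a prefix of
# 'index1', since R allows abbreviating arguments listed before '...'.
r2 <- f(1:3, indx1 = TRUE)

r1$index1       # TRUE
r2$index1       # FALSE: the default was silently used
r2$extraNames   # "indx1" was absorbed by '...', with no error
```

Without the '...' in its formals, the second call would fail with an "unused argument" error.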

Computers are subtle. :-)

Friday, March 6, 11:50 am

Two items:

No quiz next Tuesday. Group Quiz next Friday in lecture!

I will hold the final Job Interview session next Wednesday, 3:30-5. You can also do your JI at the end of any of my office hours.

Thursday, March 5, 11:05 pm

Be sure to download and print our latest supplement.

Thursday, March 5, 7:15 pm

I was asked today what criterion to use for accuracy of a model, in the Term Project. Again, it's largely an open-ended project, so you can choose your own accuracy criterion/criteria. But the following may be helpful to you:

Consider our example from the US census. Say we predict occupation 102, using a logistic model. For each person in our test set, we will get an estimated P(occ = 102 | age, gender, education etc.). If we sum that probability over all people in the test set, we will get an expected total number of occupation 102 people in the test set, under the logit model.
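
A sketch of that computation, on simulated data rather than the census data. The variable names and the simulation settings are made up for illustration, and comparing the expected count with the actual count is an assumed way of using the quantity, not something stated in the specs.

```r
set.seed(1)
n <- 2000
age <- runif(n, 20, 60)
# Simulated indicator of being in the occupation of interest
isOcc <- rbinom(n, 1, plogis(-3 + 0.05 * age))
d <- data.frame(age = age, isOcc = isOcc)

trainIdx <- sample(n, 1500)
train <- d[trainIdx, ]
test <- d[-trainIdx, ]

fit <- glm(isOcc ~ age, data = train, family = binomial)
probs <- predict(fit, test, type = "response")   # estimated P(occ | age)

expectedCount <- sum(probs)      # expected number in the test set under the model
actualCount <- sum(test$isOcc)   # number actually observed in the test set
c(expected = expectedCount, actual = actualCount)
```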

Thursday, March 5, 6:40 pm

Ignore the 5:25 message. It was intended for ECS 145.

Thursday, March 5, 5:25 pm

Download and print the new supplement on R classes and environments. Consider it an official part of our course materials, and bring it to lecture tomorrow.

Tuesday, March 3, 9:30 pm

I've been reading your Hwk II reports in the order you submitted them, giving feedback that will help on your Term Project. It will be a few days before I get through all of them.

Tuesday, March 3, 11:00 am

Regarding the Windows problem mentioned earlier, a student reports that he solved it by placing a copy of his .Rprofile file (with the .libPaths() call to set search path) in his OMSI directory.

Monday, March 2, 10:35 pm

Here's a recap of my comments today about p-values:

Background:

Basic idea: We set a null hypothesis H₀, e.g. β_age = 0 in predicting MovieLens rating, or p = 0.5, where p is the true population proportion of voters who support Andy. We ask, "If H₀ were true, what is the probability of getting data at least as extreme as we see here?" E.g. if Melissa's sample of 100 voters has 78 supporting Andy, how likely would that be if p were actually 0.5? That probability is the p-value.

The terminology used is that p < 0.05 is called "significant," p < 0.01 is "highly significant" etc.

It was first developed by Sir Ronald Fisher a century ago. There were objections even then, but Fisher prevailed. The methodology is quite entrenched today. So, even though people teaching statistics know there are problems, they teach it anyway.

What are the problems?
- H₀ is asking the wrong question, and indeed, is a priori false. Age has some effect on movie ratings, but possibly minuscule. We want to know if the effect is substantial, not whether it is 0.000000000000000...
- In a large sample, we will get such an accurate estimate of the given parameter, e.g. β_age or p, that the significance test will detect even tiny departures from H₀, thus pouncing on meaningless effects and declaring them "significant."
- Confidence intervals are much more informative and relevant to our goals. The CI for β_age, (0.001,0.006) shows the age effect to be small. A CI for p of (0.46,0.78) would be informative for Andy; it's wide, telling him that the sample was too small for a reliable estimate of p, but the fact that the CI mostly has p > 0.5 should be of some encouragement to him.
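
Both points can be tried out with base R's binom.test(). The first call uses the Melissa example above; the second uses a made-up sample of a million voters, chosen only to illustrate the large-sample issue.

```r
# Melissa's sample: 78 of 100 voters support Andy; H0: p = 0.5
bt <- binom.test(78, 100, p = 0.5)
bt$p.value     # probability, under H0, of data at least this extreme
bt$conf.int    # 95% confidence interval for p

# Made-up large sample: 50.2% support among 1,000,000 voters
big <- binom.test(502000, 1000000, p = 0.5)
big$p.value    # "significant," though p is barely different from 0.5
big$conf.int   # the interval shows how small the departure from 0.5 is
```

In the second case the test flags a departure from H₀, while the confidence interval makes clear that the departure is tiny.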

As mentioned, people have known this all along. No statistician would be surprised by the above statements, and few if any would disagree. But no one did anything about it until the 2016 ASA Policy Statement. It was highly critical of p-values, and though it stopped short of saying they never should be used, it gave no examples of good use cases. See also the article in Nature, though again with somewhat muted tones.

Finally, note Raffi's point, quite correct, that the use of confidence intervals, while much better than significance testing, is still no cure for p-hacking. There are ways to adjust the CIs for that purpose, though.

Monday, March 2, 1:35 pm

Two news items:

We have a new supplement, on significance tests.

The official list of datasets that you will choose from for your project is now in the project specs.

Monday, March 2, 12:45 pm

We will have another Job Interview session this Thursday, 4:00-5:30 pm.

Sunday, March 1, 7:30 pm

I just noticed that I had not made the Lee and Seung paper world-readable in the supplements directory. Fixed now.

Sunday, March 1, 12:00 pm

As I've said, to me the Term Project IS the course, and accordingly, I structure the grading such that a good project makes a world of difference in your course grade.

For that reason, I stated that I would make my own comments on your Homework II writeups (in addition to the regular grade you get from the TA), so that you can use my feedback to do a super Term Project.

Just now I started writing a script to extract the PDF reports from each of your .tar files, but then I realized it actually is just easier for you to send me your PDF files in e-mail. That way the PDFs are already extracted :-) and I can send you my comments simply in my e-mail reply. Please send me your Homework II PDF as soon as possible.

Sunday, March 1, 9:50 am

I've now sent out the grades for Quiz 4. Some remarks on the individual questions:

Question 1: I generally try to place the easier questions in the earlier parts of a quiz, but this one actually turned out to be the most difficult question on the quiz. None of (i), (ii) and (iii) is correct; the new user will have a new user ID, not among the levels of the R factor for user ID that had been used in lm() earlier. So predict.lm() would generate an error.
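
The error can be reproduced on toy data. This is a sketch only; the data frame below is made up and is not the data from the quiz.

```r
# Toy ratings data: 3 users, user ID stored as an R factor
d <- data.frame(user = factor(c(1, 1, 2, 2, 3, 3)),
                rating = c(4, 5, 2, 3, 5, 4))
fit <- lm(rating ~ user, data = d)

# A new user, ID 4, is not among the factor levels seen by lm()
newUser <- data.frame(user = factor(4))
res <- tryCatch(predict(fit, newUser),
                error = function(e) conditionMessage(e))
res   # the error message, rather than a prediction
```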

Question 2: This was straightforward, very similar to Homework II.

Question 3: This was a straightforward implementation of the ALS algorithm in the textbook.

Saturday, February 29, 1:45 pm

I've updated our MustInstall file.

Saturday, February 29, 8:25 am

I know some students have been disappointed with their grades on the quizzes. As I've said, that will be greatly compensated for with the Job Interview and a good Term Project (which will in turn be aided by feedback I give on your Hwk II), but I've decided to add one more "compensator":

Instead of dropping your lowest 2 quizzes, I will drop the lowest 3.

Please note again the value of the Job Interview in your ultimate course grade.

For the remaining quizzes:

Make SURE you are adept at using all software presented in our class, e.g. formUserData() in rectools, as well as the function kNN() in regtools.

As I mentioned yesterday, there are issues with using covariates together with user/item IDs. You will not be asked to write any code for this situation. However, you may be asked to predict rating from covariates without user/item IDs.

Make SURE you are able to load the libraries in an OMSI context. Note that I posted something about this last night. By the way, our upcoming material makes use of the functions in lme4, not just the data.

Friday, February 28, 9:50 pm

Based on an interesting observation made by a student, I've simplified the specs for the Term Project, removing the parts regarding covariates.

Friday, February 28, 8:50 pm

Several items:

We will have no class on Wednesday, March 4. I have a medical appointment, made a couple months ago, that can't be changed.

Please make sure you have the partykit package, which provides ctree(), installed and in your R search path.

Some Windows users have reported not being able to load lme4 from within OMSI. If this happened to you, please make a file a.R with contents
```
library(lme4)
data(InstEval)
print(InstEval[1,])
```
and then, from a terminal window, not OMSI, run
```
Rscript a.R
```
and let me know the results.

Monday, February 24, 11:30 pm

Our Term Project is ready!

Sunday, February 23, 11:05 pm

Please download and print our latest supplement.

Sunday, February 23, 4:15 pm

I mentioned at the start of the quarter that our course assumes NO prior background in machine learning. I just did a spot check of how some students are doing in the quizzes, comparing those who have such background with those who don't. The answer basically is, no difference. What does count, instead, is good math intuition, which I also stated at the start of the course.

As discussed before, that's why I have so many mechanisms to compensate for weaker quiz grades: dropping the lowest 2 quizzes; the Job Interview; and heavy bonus for a good Term Project.

Again, my goal is to turn you all into good data analysts, with these rather open-ended, real-world projects. In a real job interview, you should be able to do well!

Sunday, February 23, 1:25 pm

There are a couple of typos in Supp02022020.pdf, p.26: (74,1.0) should be (74,180) and co2 should be hw.

Saturday, February 22, 9:20 pm

I've added a new supplement, to be discussed on Monday. It's rather dense, so don't try to read it in detail yet, but you might glance through it before Monday.

Saturday, February 22, 3:10 pm

In the InstEval data, the instructor IDs are not consecutive, which could cause problems. Here's how to convert the IDs to a new R factor with levels 1,2,3,...:

# ids originally an R factor; temp convert to nums, then back to factor
ids <- as.factor(as.numeric(ids))
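
A small illustration on made-up IDs (not the InstEval data), showing that as.numeric() applied to a factor returns the internal level codes rather than the original labels:

```r
# Nonconsecutive IDs, stored as a factor with levels "3", "7", "12"
ids <- factor(c(3, 7, 12, 7))
as.numeric(ids)                        # internal codes: 1 2 3 2
ids <- as.factor(as.numeric(ids))
levels(ids)                            # "1" "2" "3"
```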

Saturday, February 22, 10:35 am

Please keep in mind that you are responsible on quizzes for the material in the R tutorial in the appendix of our textbook.

Friday, February 21, 10:25 pm

The latest supplement shows the details of the ALS method in the case of partially known A. As mentioned, more supplements coming over the weekend.

Friday, February 21, 7:55 pm

Please note the condition stated in Hwk I (for all assignments),

Work must be done entirely in R, and except for packages that I specify, limited to R packages available on CSIF.

and one in Hwk II:

You'll use only lm() and NMF matrix factorization methods, and in general are allowed to use only methods from our course. You may run lars() instead of lm() if you wish.

In conversations with some students, I've found that they were not adhering to these requirements. If you have already done most of your project using these "forbidden" things, go ahead and submit them. But please note the following:

You will not be eligible for Extra Credit for having one of the most accurate models. After all, it would not be fair to the other groups who have followed the rules.

Please make sure that you follow the rules in the Term Project.

Thursday, February 20, 11:20 pm

The next official Job Interview session will be February 26, 3:30-4:30. However, if there are not many students asking about homework in office hours this Friday and Monday, I can do a few Job Interviews then. Again, there will be other sessions after Feb. 26.

Tuesday, February 18, 10:20 pm

A student reported to me that his team's saved model takes up 1.5 GB. He was worried this would be a problem, and indeed it would. With 20+ groups submitting their work, this would quickly overwhelm the TA's disk quota on CSIF.

So, we will do things this way: You will NOT submit a saved model, but still must submit FULL code, including preparatory operations on the data, e.g. outlier removal. You will still write a thorough report, etc.

You will also indicate, in a README file, whether you wish to be considered for the Extra Credit for the three most accurate teams. The TA will only run these, again on his own secret cross-validation partition.

Everything else remains the same. Again, remember that you must submit your FULL code, even if you are not entering the "competition."

Tuesday, February 18, 8:20 pm

Our first Job Interview session will be held this Thursday from 2 to 3 pm. I will hold other sessions at various times in the next couple of weeks, so that everyone who wishes to participate will be accommodated.

Tuesday, February 18, 6:00 pm

There is a new supplement. Print it and bring it to class Friday.

Tuesday, February 18, 12:50 pm

By popular demand, I have extended the due date for Hwk II (a lot). Note, though, that I will probably be assigning the Term Project in the next few days.

BTW, no, I did not have a typo in my 11:45 post last night. :-)

Monday, February 17, 11:45 pm

The TA will make sure his holdout set contains no IDs not in the training set.

Monday, February 17, 4:55 pm

As mentioned, our first Job Interview session will be this coming Thursday, time to be announced.

Some details on the workings of the Job Interview are available here.

Monday, February 17, 12:00 pm

A couple of people have asked me about using PCA for dimension reduction with categorical variables (after conversion to dummies). A few comments:

Remember, PCA merely finds "interesting" linear combinations of variables. The linear combinations that have small variance are uninteresting, as they are essentially constants.

There is nothing inherently wrong with using PCA on dummy variables. Linear combinations are linear combinations.

However, two cases should be delineated:
- Dummies that come from different categorical variables. E.g. a survey consisting of a number of Yes/No questions.
- Dummies arising from the same categorical variable. E.g. dummies coming from ZIP code.
In the first case, PCA works in the same way as with continuous variables.

The second case is more complicated. Let's say for instance that everyone has a favorite planet, so we have 9 classes of people. Say we have dummies P₁ through P₉.
- We know for sure that one linear combination of the dummies will have 0 variance, since P₁ + ... + P₉ = 1. So the smallest eigenvalue will be 0.0.
- But say almost no one likes Uranus. Funny name, never in the news, etc. P₁ + ... + P₆ + P₈ + P₉ will approximately be 1. So the second-smallest eigenvalue will be close to 0.0.
- So, at the very least, PCA will identify dummies that we should consider deleting. In the planet example, we might delete Uranus and one of the others, say Pluto.
- In addition, there may be interesting linear combinations of nondummies with dummies, even if the latter come from the same categorical variable.
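
The first two points can be checked on simulated data. This is a sketch under made-up assumptions: 5000 people, planet 7 (Uranus) chosen with probability 0.002, and the other eight planets equally popular.

```r
set.seed(1)
n <- 5000
# Favorite planet, 1 through 9; planet 7 (Uranus) is chosen very rarely
probs <- c(rep(0.998 / 8, 6), 0.002, rep(0.998 / 8, 2))
planet <- factor(sample(1:9, n, replace = TRUE, prob = probs), levels = 1:9)

# One dummy column per planet (no intercept, so all 9 dummies are kept)
P <- model.matrix(~ planet - 1)

# PCA eigenvalues = eigenvalues of the sample covariance matrix
ev <- sort(eigen(cov(P), symmetric = TRUE)$values)
round(ev, 4)   # smallest is 0 (up to roundoff); second-smallest is near 0
```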

By the way, one can also do PCA after forming interaction terms, thus accounting for nonmonotonic relations (e.g. income vs. age).

All this serves to show, once again, that ML is an art, not a science. There are no good rules of the form "If this, do that and then the other thing." If someone claims such, view with a highly skeptical eye.

Sunday, February 16, 4:15 pm

As mentioned, the regtools package contains functions factorToDummies() and factorsToDummies().

Saturday, February 15, 4:15 pm

Tentatively I will begin the "job interviews" next Thursday. I'll have several sessions during the next couple of weeks, so everyone who wants to do this will be accommodated.

Your "interview" will last 5 minutes or less. Typically I'll ask a question like "Tell me about..." and then ask followup questions.

You'll get a letter grade. If it is better than your lowest quiz grade (AFTER deleting the lowest 2), it will replace the latter. Otherwise, it will have no effect; it cannot harm your course grade, only help or be neutral.

Below are the job interview grades from when I taught this course in Fall 2018. (There were 37 students enrolled in the class.)

> table(z[,2])

 A A- A+  B B- B+  C C- 
 7  8  1  1  1  7  1  1 
> sum(table(z[,2]))
[1] 27

Saturday, February 15, 12:50 pm

Had a typo in my last post (now corrected). The TWO lowest quizzes are dropped.

Saturday, February 15, 10:15 am

A note on our remaining quizzes:

We just completed the 6th week of class. As announced last night, we will not have a quiz next week, the 7th week.

We will have quizzes in the discussion section in the 8th and 9th week.

As explained at the start of the quarter, we have a Group Quiz on the last day of lecture, which will be Friday of the 10th week.

We will probably have a regular quiz in the discussion section of the 10th week.

So, we will have either 3 or 4 more quizzes, for a total of 6 or 7. Adding in Quiz 0, that means a total of 7 or 8.

And the implications for your course grade:

As you know, the 2 lowest of those 7 or 8, in terms of letter grade, will be dropped.

In my experience, the fewer the number of quizzes, the more advantageous it is to a student, as it accentuates the student's good quizzes, including Quiz 0.

As I have explained before, in the mathematical courses that I teach (this one, and ECS 132 and 256), I realize that some students will not do very well on the quizzes, due to not having strong mathematical intuition. Yet I want to encourage them to learn the material and become good, strong data analysts, so I have mechanisms by which they can earn good grades -- dropping the lowest 2 quizzes; the optional "job interview"; and giving a major bonus in the course grade due to having a good Term Project. Extra Credit on Homework II will additionally serve in the same manner as the latter.

Saturday, February 15, 12:25 am

To give everyone a chance to catch up, no quiz or discussion section next week. More supplements coming, though.

Friday, February 14, 10:40 pm

A couple of items:

Please note that in the Homework, your predictions are merely to fill out the A matrix, not to deal with entirely new users. New users do come in, in practice, and we'll be covering the latter issue later in the current chapter, but you will not be dealing with it.

In forming training and holdout/test sets, it's possible that the latter will contain users or movies not in the former. Just remove these.
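
A minimal sketch of that removal in base R. The data frames and the column names userID, itemID and rating are illustrative only.

```r
# Toy data; column names userID, itemID, rating are illustrative
train <- data.frame(userID = c(1, 1, 2, 3), itemID = c(10, 11, 10, 12),
                    rating = c(4, 3, 5, 2))
test <- data.frame(userID = c(1, 2, 4, 3), itemID = c(11, 13, 10, 12),
                   rating = c(5, 4, 3, 1))

# Keep only holdout rows whose user AND item both appear in the training set
keep <- test$userID %in% train$userID & test$itemID %in% train$itemID
test <- test[keep, ]
test
```

Here the row with item 13 and the row with user 4 are dropped, since neither appears in the training set.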

In case my wording today in class was not clear, here it is in print:

In the decomposition A ≈ WH, the rows of H form an approximate basis for the span of the rows of A.

(Recall that the span of a set of vectors is the set of all possible linear combinations of those vectors.)
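
A small numerical check of this statement, using made-up matrices and the exact case A = WH for clarity. With real data the product only approximates A, so the rows of A would lie only approximately in the span.

```r
W <- matrix(c(1, 0, 2, 1,
              0, 1, 1, 3), nrow = 4)        # 4 x 2
H <- matrix(c(1, 0,
              0, 1,
              2, 5), nrow = 2)              # 2 x 3, rows (1,0,2) and (0,1,5)
A <- W %*% H                                # 4 x 3

A[3, ]                                      # row 3 of A ...
W[3, 1] * H[1, ] + W[3, 2] * H[2, ]         # ... is this combination of H's rows

# Stacking A on top of H does not raise the rank above that of H,
# so every row of A lies in the span of the rows of H.
qr(rbind(A, H))$rank == qr(H)$rank
```

Row i of A is the linear combination of the rows of H whose coefficients are in row i of W.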

Friday, February 14, 9:50 am

Note that I'll be adding recosystem to packages you must have installed on quizzes, and others later.

I heard that some people had trouble loading the InstEval data in a past quiz. Make SURE this won't happen in future quizzes. Also, note that whatever fix you come up with, your code must work on both your machine and mine, the latter when I grade your quiz. You may consider putting a call to tryCatch() in your code.

Thursday, February 13, 4:10 pm

I was asked just now what the Homework specs mean when they say you must use at least some covariates. Here is the answer.

Look at the worked-out lm() example. There we predicted rating from just user ID and movie ID. We could have added the age covariate, with the call

lm(V3 ~ V1+V2+age,data=rats)

(Replace 'rats' by the name of the data frame, etc.) Now we are predicting rating from user ID, movie ID and age. Age is an example of what is called a covariate in stat and side information in ML.

Thursday, February 13, 3:30 pm

I have placed a new item in the Supplements/ directory, Supp02132020.pdf.

Again, all materials in the Supplements/ directory are official parts of our course. There is also an optional item, the entire revised book. I'm providing these supplements piecemeal, and you may wish to see the whole thing in integrated form, e.g. for following links. I'll be updating this as I go along.

Monday, February 10, 11:10 pm

A student just asked me if they should switch to P/NP in our course, as tonight is the deadline for that. A few comments, important even if you are not considering P/NP:

As I said in an earlier blog post, one student had been under the misimpression that this course is an "automatic A" as long as one has a good Term Project. That is not correct.

At the start of the quarter I showed how a student in ECS 132 last quarter did get an A in the course in spite of approximately a C average on the quizzes. That was due to his having a strong Term Project AND a strong Job Interview AND enough good quiz grades that his quiz average was good after dropping the lowest two.

There were many students in 132 with similar situations -- but far from ALL.

Most students are doing well in the quizzes so far, but a few are doing poorly and are worried, understandably so, and are asking me what they can do to improve.

By far, the most important answer is to understand the "Why?" of everything. Question 1 of Quiz 2 is a perfect example of that. If you missed it, you should ask yourself how you could have approached learning our course material differently, in order that you would have gotten Question 1 right.

In tomorrow's quiz, there will be two questions that deal with the "Why?", qualitative (i.e. non-coding) in nature. The Job Interview will be ALL about the "Why?".

The "Why?" is what you hopefully carry with you and use when you leave school, not arcane details of some computer language.

I really enjoyed today's lecture, very interactive, with some people speaking up who've been silent up to now. Please continue! I really hope for more discussions like this.

On the other hand: The questions today were mainly on how polyFit() worked, which was not the focus of that example; it was supposed to be on illustrating overfitting. I wrote that example with the view that the interested reader would look up the details of polyFit() offline. Still, yes, it was quite beneficial to discuss that function, especially as it might apply to your Homework and Term Project, but "Don't lose sight of the forest for the trees."

Monday, February 10, 10:40 pm

An alert student pointed out to me that a single column for genre will not suffice in u.big. You'll need 19. I've changed the specs accordingly.

Monday, February 10, 8:00 pm

Please make sure you have read the course syllabus carefully, especially this passage regarding the interactive homework grading:

You must be prepared to speak cogently about the ENTIRE assignment. In particular, if you worked on one part of the assignment and Johnnie worked on another, it is NOT acceptable to answer the TA’s question about Johnnie’s part by saying “Oh, I don’t know about that part, because Johnnie did it.”

Sunday, February 9, 12:10 pm

Solutions for Quiz 2 are now on our Web page. Let me know if you have any questions on the solutions, or on the grading of your quiz.

Saturday, February 8, 9:40 pm

Please note that I have added a bullet point to the Homework, stating "It is REQUIRED that your report contain a section titled, 'Who Did What,' explaining the contribution of each team member."

Saturday, February 8, 9:00 pm

For quizzes etc., make sure you know how to create and use S3 classes (Sec. A.10.3 of our book).
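
As a quick refresher, here is a minimal S3 sketch. The class name "rating" and the function names are made up for illustration and are not taken from the book.

```r
# Constructor: an S3 object is typically a list with a class attribute
newRating <- function(user, item, rating) {
   obj <- list(user = user, item = item, rating = rating)
   class(obj) <- "rating"
   obj
}

# Method for the generic print(), dispatched on class "rating"
print.rating <- function(x, ...) {
   cat("user", x$user, "rated item", x$item, "as", x$rating, "\n")
   invisible(x)
}

r <- newRating(3, 324, 5)
print(r)                 # dispatches to print.rating()
inherits(r, "rating")
```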

Saturday, February 8, 12:40 pm

I have moved the Supplements directory to here. Please keep in mind that all such material is to be considered official course material, eligible for quizzes, the "job interview" and so on.

Saturday, February 8, 9:35 am

On quizzes, be sure you know the R functions solve() and t(), for matrix inverse and matrix transpose.
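
A short example of both functions on a made-up 2x2 matrix:

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)
Ainv <- solve(A)          # matrix inverse
A %*% Ainv                # identity matrix, up to roundoff
t(A)                      # transpose

# solve(A, b) solves the linear system A x = b directly
b <- c(1, 2)
x <- solve(A, b)
A %*% x                   # recovers b
```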

Friday, February 7, 11:40 pm

I've put in a detailed procedure for running your code for the Homework. I've also elaborated on grading and credit issues for this assignment. Read this new version as soon as possible, as it may affect how you do the work.

Friday, February 7, 8:30 pm

I've placed material on the LASSO in our Supplements directory. Remember, the files in that directory are considered our official course materials.

'Cp' is Mallows' C_p, a variation on the minimal sum of squares. One popular strategy is to choose λ to be the one that minimizes C_p.

The excerpt is from my book, Statistical Regression and Classification: from Linear Models to Machine Learning, CRC, 2017.

Thursday, February 6, 10:35 pm

URL in the 1:25 pm corrected.

Thursday, February 6, 1:25 pm

Our TA Runtian told me he had been taught that one should always scale one's data before applying PCA. This is indeed a common recommendation, but it oversimplifies the situation.

This inspired me to write the third in my Clearing the Confusion series. Please read this and consider it part of our course materials.

Monday, February 3, 9:50 pm

Due to a meeting over in the Genomics Center, I'll be leaving my office hour at 4 pm this Friday.

Monday, February 3, 5:10 pm

Say you look at 3 genres. Your table to explore whether different genres have different rating patterns should be 3x5.
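
Such a table can be formed with table(). This is a sketch on simulated data; the genre names and the ratings are made up.

```r
# Toy data: 3 genres, ratings 1 through 5; names are illustrative
set.seed(1)
genre <- sample(c("Comedy", "Drama", "Horror"), 300, replace = TRUE)
rating <- factor(sample(1:5, 300, replace = TRUE), levels = 1:5)

tbl <- table(genre, rating)       # counts: one row per genre, one column per rating
dim(tbl)
prop.table(tbl, margin = 1)       # each row gives that genre's rating distribution
```

Comparing the rows of the second table shows whether the rating patterns differ by genre.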

Monday, February 3, 4:55 pm

Every quiz covers the material from the start of the course to the Monday preceding the quiz. Actually it's the Friday preceding the quiz, but sometimes something from Monday is helpful.

Sunday, February 2, 12:35 pm

Reminder: Make sure to be ready to use the InstEval data.

Sunday, February 2, 12:25 pm

There is a new supplement waiting for you. Print it out before Wednesday's lecture and bring it to lecture and quizzes. Remember, supplements are official parts of our course materials.

Friday January 31, 12:25 pm

As I mentioned in class, we will have a number of supplements, since I am basically writing the textbook as we go along. They will all (only one so far) be available in here.

Friday January 31, 11:50 am

Please note that the lme4 package, required for our class, contains the InstEval dataset. Be ready to use it in Homework and on Quizzes.

Thursday, January 30, 8:25 am

A couple of comments:

In all datasets we use this quarter, the official versions will be the ones listed in the MustInstall. All Homework, Quizzes and the Term Project must use those versions.

In case the various versions of the MovieLens data caused problems, I apologize for the confusion. I should never rely on Kaggle. :-) At any rate, I am extending the due date to Monday.

I will make a second Homework assignment, probably around the middle of next week. Then around the seventh week, I will assign the Term Project, which as you know counts both as a (double) Homework assignment and a substitute for a final exam. You will have sufficient background by then to get started right away, which of course I strongly recommend. Remember, to me the Term Project is the COURSE.

Some of the phrasing in the current assignment is somewhat open-ended ("Comment on..."). This is a data analytic course, and some degree of open-endedness is an integral part of that; this will be especially true of the Term Project. The TA will grade it liberally. Please keep in mind, though, that aside from the correctness of your code and graphs, the main part of your Homework grades will focus on how well you do at the interactive grading sessions.

Wednesday, January 29, 9:25 am

When sending me e-mail, please remember to put "[ecs 189g]" in the Subject line.

Tuesday, January 28, 7:20 am

The questions and answers for Quiz 1 are now on our Web site.

If you believe you were misgraded on a quiz problem, please feel free to bring this up with me. Make sure to read the solution first, and send me an e-mail query.

Monday, January 27, 8:25 pm

Comments:

I was asked what column names to set in the singles data frame, e.g. 1,2,3,... or the actual genre names. It's up to you, but of course the more professional thing to do would be to use genre names. Names are really important in R; e.g. they automatically create nice labels in graphs.

As noted in the specs, you are expected to be resourceful, e.g. in finding out how to make your graphs (in this Homework and later). There are tons and tons of tutorials on the Web; just plug in 'base-R graphics', 'lattice graphics' or 'ggplot2' into Google.

I use all three. Base-R is easier when I need something quickly; ggplot2 is more difficult but is highly versatile, and allows storage of intermediate results, very valuable; for my taste, lattice has the best colors.

For really advanced stuff, the world's foremost expert on graphics in R is Paul Murrell (R Graphics, 3rd ed., CRC Press 2018); you won't need anything like that for our course, but if you're interested in graphics for data science, he's the top guy.

I recommend running table() on V55 in the Forest Cover data. You'll see that the first two cover types dominate, so a partial check on your PCA plot would be that two colors dominate. You may need to take a subsample of, say, 500 to see this.
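
The check suggested in the last comment can be sketched as follows. The data here is a made-up stand-in, not the Forest Cover file: the class proportions, the three feature columns and the sample sizes are all assumptions, with only the column name V55 taken from the real data.

```r
set.seed(1)
n <- 5000

# Made-up class labels: 7 classes, with the first two dominating
coverType <- sample(1:7, n, replace = TRUE,
                    prob = c(0.36, 0.49, 0.06, 0.01, 0.02, 0.03, 0.03))

# Made-up numeric features whose means shift with the class
fake <- data.frame(V1  = rnorm(n, mean = coverType),
                   V2  = rnorm(n, mean = -coverType),
                   V3  = rnorm(n),
                   V55 = coverType)

print(table(fake$V55))   # class counts; two classes should dominate

# Subsample 500 rows, run PCA on the features, color the points by class
idx <- sample(nrow(fake), 500)
sub <- fake[idx, ]
pcs <- prcomp(sub[, c("V1", "V2", "V3")])$x[, 1:2]
if (interactive()) plot(pcs, col = sub$V55, pch = 20)
print(table(sub$V55))    # the subsample should show the same dominance
```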

Monday, January 27, 5:50 pm

As you know, a couple of weeks ago, some people were having trouble accessing the Kaggle data, so I found another source. I changed the MustInstall file accordingly, as well as the file for Hwk I.

But a couple of students have pointed out that u.genre already lists the genres, and that there are 19, not 20 as in the other version.

Please read the current specs, slightly revised. Sorry for the inconvenience!

Sunday, January 26, 8:05 pm

Recall that when I first wrote the homework specs and the MustInstall file, we were still having trouble obtaining the full MovieLens data, due to issues with Kaggle. But later I found the data in the GroupLens site itself. Accordingly:

A few days ago, I updated the MustInstall file, and blogged about the update.

Just now I updated the file names in Problem B of our Homework.

Just now I also updated the MustInstall file to have entries for our Forest Cover data, the subject of Problem A.

Obviously we will continue to accumulate more and more datasets as the quarter progresses, and I will update MustInstall as the need arises, announcing it when I do.

Sunday, January 26, 1:15 pm

I am about to e-mail your grades on Quiz 1. A few points:

Those who did not heed the blog post of Saturday, January 18, 4:10 pm, wasted precious time writing a loop to find which element of a vector is largest. :-(

Doing PCA on some data and then removing some columns is not the same as removing some columns first and then doing PCA.

abs(which.max()) is not the same as which.max(abs()).

Re Question 4: The purpose of this question was to measure your understanding of matrix partitioning. The solution was w[i,] %*% h, i.e. the product of row i of w and h. The expression (w %*% h)[i,] was not an acceptable answer. The question called for computing only row i of the product, NOT the entire matrix. As noted in class several times, in this field we are typically working with very large matrices (note that the MovieLens people call the matrix with 100,000 rows "small"), so we absolutely want to avoid calculating a huge product if possible.

In problems like Question 4, the answer must be general. Any submitted solution that contained specific numbers, e.g. 1, 2 and 3, got an automatic 0.
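
The points above can be illustrated with a short sketch. All the numbers are made up, the matrices w and h merely borrow their names from Question 4, and the PCA part shows only one reading of the "removing columns" remark, using the built-in mtcars data.

```r
# which.max() versus abs(): a made-up vector
x <- c(3, -7, 5)
print(which.max(x))        # position of the largest element, no loop needed
print(which.max(abs(x)))   # position of the element largest in absolute value
print(abs(which.max(x)))   # an index is already positive, so abs() changes nothing

# PCA then drop columns, versus drop columns then PCA
pcaThenDrop <- prcomp(mtcars)$x[, 1:2]          # PCA on everything, keep 2 columns
dropThenPca <- prcomp(mtcars[, 1:3])$x[, 1:2]   # keep 3 columns, then PCA
print(head(pcaThenDrop, 3))
print(head(dropThenPca, 3))

# Row i of a product without forming the whole product
set.seed(1)
w <- matrix(rnorm(12), nrow = 4)   # 4 x 3
h <- matrix(rnorm(6), nrow = 3)    # 3 x 2
i <- 2
rowI     <- w[i, ] %*% h           # only row i is computed
wasteful <- (w %*% h)[i, ]         # same values, after computing every row
print(all.equal(as.vector(rowI), wasteful))
```

Note that the last part is written in terms of i, not a specific row number, in keeping with the requirement that the answer be general.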

Sunday, January 26, 12:00 pm

Concerning file placement in the homework: When grading your work, the TA will run your code from his OMSI directory (or equivalent), with the same data file structure as in your OMSI directory. For instance, ml-100k/ will be a subdirectory of the directory from which he runs your R code.

Note: Homework I due date is Thursday.

Sunday, January 26, 8:25 am

In your Homework problems, limit your code to base R and a small number of packages that I specifically allow. For graphics, the latter means lattice or ggplot2. Do you see why? The TA will run your code, and if 20 groups install say 3 packages each, that means that he would have to install 60 packages on his machine to grade your work, a highly unreasonable burden.

Saturday, January 25, 4:00 pm

A student asked me the other day if I had an opinion on the perennial debate as to whether R or Python is better for Data Science. (Note the qualifier.) I replied that I have an essay on that very topic. Comments welcome!

Friday, January 24, 9:50 am

To test whether you have the correct MovieLens dataset, run the code on p.30 of our book, and verify that you replicate the output. Make sure you understand every step!

Friday, January 24, 9:30 am

I want to give you a little more time to develop your R skill, and to make sure you have your data and R packages set up correctly. So, I am postponing Quiz 2 until the week after next. We will not hold discussion section next week.

Friday, January 24, 8:55 am

One technical skill you'll need in our class is facility with R factor variables. Again, see our fasteR tutorial. In particular, R packages vary as to whether they expect data in factor or indicator variable (often referred to as dummy variable) form, so you need to be able to convert between the two. The regtools package has utilities for that.

Make sure you are really adept at operations as in this example:

> library(lme4)  # MustInstall.html
> data(package='lme4')
Arabidopsis    Arabidopsis
               clipping/fertilization data
Dyestuff       Yield of dyestuff by batch
Dyestuff2      Yield of dyestuff by batch
InstEval       University Lecture/Instructor
               Evaluations by Students at
               ETH
...
> data(InstEval)
> head(InstEval)
  s    d studage lectage service dept y
1 1 1002       2       2       0    2 5
2 1 1050       2       1       1    6 2
3 1 1582       2       2       0    2 5
4 1 2050       2       2       1    3 3
5 2  115       2       1       0    5 2
6 2  756       2       1       0    5 4
> dept <- InstEval$dept 
> class(dept)
[1] "factor"
> levels(dept)
 [1] "15" "5"  "10" "12" "6"  "7"  "4"  "8"  "9" 
[10] "14" "1"  "3"  "11" "2" 
# 14 levels, '1', '2' etc. [note which one they skipped :-) ]
> library(regtools)  # MustInstall.html
> deptIVs <- factorToDummies(dept,'dept')
> head(deptIVs)
     dept.15 dept.5 dept.10 dept.12 dept.6 dept.7
[1,]       0      0       0       0      0      0
[2,]       0      0       0       0      1      0
[3,]       0      0       0       0      0      0
[4,]       0      0       0       0      0      0
[5,]       0      1       0       0      0      0
[6,]       0      1       0       0      0      0
     dept.4 dept.8 dept.9 dept.14 dept.1 dept.3
[1,]      0      0      0       0      0      0
[2,]      0      0      0       0      0      0
[3,]      0      0      0       0      0      0
[4,]      0      0      0       0      0      1
[5,]      0      0      0       0      0      0
[6,]      0      0      0       0      0      0
     dept.11
[1,]       0
[2,]       0
[3,]       0
[4,]       0
[5,]       0
[6,]       0
# 13 dummies, by default (why?)
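
For comparison, here is a base-R sketch of the round trip between factor and dummy-variable form. It is not how regtools does it; the use of model.matrix() and max.col(), and the made-up dept values, are assumptions for illustration. Unlike the 13-column default above, this version keeps one column for every level.

```r
dept <- factor(c("5", "2", "5", "11", "2"))   # made-up values

# factor -> dummies: one 0/1 column per level (the -1 drops the intercept)
dummies <- model.matrix(~ dept - 1)
colnames(dummies) <- levels(dept)
print(dummies)

# dummies -> factor: find the column holding the 1 in each row
back <- factor(colnames(dummies)[max.col(dummies)], levels = levels(dept))
print(back)
print(identical(back, dept))
```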

Thursday, January 23, 5:45 pm

R's tapply() function is one of the most often-used in R. You'll find that it is quite useful in recommender systems.

Make sure you've reviewed the tapply() examples in our fasteR tutorial before Quiz 2.
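
A minimal sketch of tapply() on recommender-style data. The tiny ratings data frame and its column names are made up for illustration.

```r
# Made-up ratings: one row per (user, item, rating) triple
ratings <- data.frame(user   = c(1, 1, 2, 2, 2, 3),
                      item   = c(10, 20, 10, 30, 40, 20),
                      rating = c(5, 3, 4, 2, 3, 1))

# Split the ratings by user, then apply mean() to each group
userMeans <- tapply(ratings$rating, ratings$user, mean)
print(userMeans)

# Same idea by item: how many ratings did each item receive?
itemCounts <- tapply(ratings$rating, ratings$item, length)
print(itemCounts)
```

The result is indexed by the group names, so for instance userMeans['2'] gives the mean rating given by user 2.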

Thursday, January 23, 5:45 pm

To make sure we are all on the same page: When I grade future quizzes, I will first do

% cd ~/omsi/

This directory will contain OmsiGui.py etc. It is the same one from which you run that program on your own machine.

I will launch R in that directory, then do

> source('Grading/AutoGradeOMSI.R')
> grader()

My grading script will then be running in that same directory, i.e. the one containing OmsiGui.py etc. In that directory, I will have files like ml-100k/u.data, exactly the ones specified in MustInstall.html. Note the directory here, ml-100k/.

My script will then run your code, in that environment.
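
One way to check that your own setup matches this layout is sketched below, to be run from R launched in the OMSI directory. The column names are labels chosen here for illustration, not something the file itself supplies, and the tab separator is an assumption about the u.data format.

```r
# Path relative to the directory R was launched from, as in the grading setup
dataPath <- file.path("ml-100k", "u.data")

if (file.exists(dataPath)) {
  # Assumed layout: tab-separated, no header line
  ratings <- read.table(dataPath, header = FALSE, sep = "\t",
                        col.names = c("user", "item", "rating", "timestamp"))
  print(head(ratings))
} else {
  message("Cannot find ", dataPath, " under ", getwd(),
          "; check that R was started in the OMSI directory")
}
```

Because the path is relative, code written this way runs unchanged on any machine that has the same directory structure.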

Thursday, January 23, 9:15 am

Following up on my blog post last night, in most classes I teach I try to remember to post previous grade distributions. As you know, this course is under development, but I did teach it in Fall 2018. Here are the course grades:

> table(z$V21)

 A A- A+  B B- B+  C C+ 
13  5  5  6  1  2  2  3

The grades were pretty generous, I think, but not an "automatic A," as that one student had thought.

Wednesday, January 22, 6:15 pm

I received e-mail today from a student who referred to getting an "automatic A" in this class. Needless to say, I was shocked to see this.

What I stated the first day of class is that I give an extra bonus for a very good Term Project. I then showed how a student in ECS 132 last quarter got an A in the class in spite of his mediocre quiz grades, due to a good Term Project and a good "Job Interview." It is NOT the case that I gave "automatic As" in that class, nor will it be in our class.

Monday, January 20, 7:30 pm

The quiz results were for Quiz 0, not Quiz 1.

Monday, January 20, 7:20 pm

Behold!

% wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

Monday, January 20, 2:10 pm

E-mailing Quiz 0 grades as I write this.

Saturday, January 18, 4:10 pm

Important information for Tuesday's quiz:

Any quiz will cover the entire course material through the Monday preceding.

I've simplified the MustInstall file. If the data is a .csv file or similar, do not use save(). Just make sure the file is in your OMSI directory, as shown in MustInstall.

R functions and constructs you may find useful on Quiz 1: max(), which.max(), abs(), sqrt(), as.matrix(), matrix subsetting (e.g. m[1:2,c(5,8,9)]).

As noted earlier, the more you can work on Hwk I before the quiz, the better prepared you'll be for it.
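
A quick sketch exercising several of the constructs listed above, on a made-up data frame (the values carry no meaning):

```r
# Made-up 3 x 10 data frame, filled column by column with 1..30
df <- data.frame(matrix(1:30, nrow = 3))

m <- as.matrix(df)             # data frame -> matrix
sub <- m[1:2, c(5, 8, 9)]      # rows 1-2, columns 5, 8 and 9
print(sub)

print(max(sub))                # largest element of the submatrix
print(which.max(sub[1, ]))     # position of the largest element in row 1 of sub
print(sqrt(abs(-16)))          # abs() and sqrt() work elementwise
```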

Friday, January 17, 9:55 pm

The second and final problem in Homework I is now on the Web.

The more you can work on this assignment before Tuesday's quiz, the better prepared you'll be (though I anticipate that that quiz will turn out to be one of the easier ones). This is especially true of Problem A, but please note that the material on Problem B may be on the quiz; the R functions, e.g. table(), will be considered an official part of the course.

Friday, January 17, 7:20 pm

I've now updated the MustInstall file consistent with my 4:10 pm posting.

Friday, January 17, 4:10 pm

Regarding downloading the MovieLens data:

We need the full data, including the covariate information on demographics, genre etc.

The MovieLens people do make the data publicly available WITHOUT that information, and they have strict requirements about redistribution. (Presumably the Kaggle site has permission.) So I can't make the data available myself.

The following worked for me:
- Plug "MovieLens Kaggle" into Google.
- Choose the 100K version, taking one to https://www.kaggle.com/prajitdatta/movielens-100k-dataset.
- Click Data, then Download.
- Log in.
- Click Download again, and data starts flowing.
For the time being, we will use the data without covariate information, in ml-latest-small.zip, specifically the file ratings.csv. BE READY TO READ THIS FILE ON NEXT TUESDAY'S QUIZ. But you will need the full dataset later.

Again, if you have any problem installing data or packages, let me know immediately, rather than waiting to see me in class.

Friday, January 17, 3:40 pm

I've updated the MustInstall file to specify which one of the three Turkish Evaluation files I want you to put on your laptop.

BTW, if you ever have trouble loading a dataset, inform me by e-mail right away instead of waiting to see me in class.

Thursday, January 16, 11:35 pm

Problem I of Homework I is ready.

Thursday, January 16, 9:15 pm

I just added two items to the MustInstall file, one a dataset and the other the data.table package for data selection, filtering and so on. This will be needed because some of our datasets will be large.

Make sure to install AND TEST these downloads well before Quiz 1.

Wednesday, January 15, 8:25 pm

One student told me he was unable to download the Kaggle MovieLens data. I've changed the URL slightly now, and was able to download from two separate Kaggle accounts. When you click on "Data" then "Download (15 MB)", it will invite you to create an account.

I've simplified the instructions in MustInstall.html regarding installation of datasets.

Wednesday, January 15, 3:25 pm

Regarding prereqs for this course: As noted in the course flyer, you need to have had a calculus-based course in probability. Key words and phrases are density function; cumulative distribution function; expected value; exponential distribution; etc. STA 13 or AP Statistics is of essentially no value for this course. I do expect that you've forgotten the material and will review it in the next chapter, but you need to have spent a whole quarter pondering the material, not a few days of learning it from scratch. Of course, if you have really first-rate mathematical intuition, it might be OK to learn it now from scratch, but I don't recommend it.

Tuesday, January 14, 11:00 pm

Our syllabus is ready.

Tuesday, January 14, 9:55 pm

I will be maintaining a list showing which data files and R packages you are required to install on your machine for quizzes. The first quiz you'll need this for will be next week (and all subsequent quizzes), so you have some time, but I recommend starting this as soon as possible. It is not the kind of thing you should leave for the last minute.

This list will grow over time.

Tuesday, January 14, 3:15 pm

Our TA told me that Quiz 0 went well overall, but a couple of students could not run R, likely due to not having their search path configured properly. Needless to say, this will be a grave handicap in future quizzes.

The flyer for our course lists ECS 30 as prereq. That course is now ECS 36A, which includes Unix tools in its topic coverage. Thus you should be very familiar with the concept of a search path. If not, you are not prepared for this course, and will need to learn these things quickly.

Similarly, if you haven't run OMSI in both instructor and student roles as the docs state, with an R problem, then you are not ready for future quizzes.

Tuesday, January 14, 12:10 am

IMPORTANT NOTE: You are no longer responsible for Section 2.3.6; it's just too complicated.

Also, note that I am continually revising the book, here. You are still responsible only for the original version, but you may find the improved phrasing helpful. In particular, I think you will find the revised Section 2.3.5 much clearer, and I recommend that you read it as a supplement.

Monday, January 9, 9:30 pm

Here is a dramatic use of PCA:

It comes from this scientific paper.

This is drastic dimension reduction, with (in our book's notation) p = 500568 (that many genes for each person) and s = 2. We usually wouldn't reduce that much, but the picture shows that a surprisingly large amount of information is retained in just 2 principal components.

Each person is one point on the graph, plotted according to their PC1 and PC2 values. In addition, they've been color-coded by country of residence/ancestry. Amazingly, this rather well reproduces the geography of Europe.

Thursday, January 9, 8:10 pm

If you added the class after the first lecture, make sure you go to our class info page, and its links to our blog, the pre-quarter heads-up and the flyer.

Thursday, January 9, 8:10 pm

As I mentioned earlier, our course syllabus is not quite ready. Should be ready in a couple of days, and I'll announce it here.

Thursday, January 9, 6:10 pm

Our office hours are now posted on our class info Web page.

Wednesday, January 8, 3:55 pm

Note again that you will need Python 2.7 on your laptop in order to use OMSI. Conversion of OMSI to Python 3 should be simple, but we probably will not get to it this quarter.

Tuesday, January 7, 1:20 pm Tuesday, January 7, 10:50 am

You'll need to be an expert on OMSI, our online exam system, before discussion section of next week. We will have Quiz 0 then, whose sole purpose is to ensure that students are ready to use OMSI in the subsequent quizzes.

This should be an automatic A+. And for most students in ECS 132 last quarter, it was:

> z <- read.table('Quiz0Grades')
> table(z[,7])

  A  A+   B   C   D  D+   F
  1 106  20   1   9   5   1

You can avoid getting any grade lower than A+ by simply (a) reading the docs carefully and (b) actually running OMSI, playing the roles of both student and instructor. In (b), make sure your experiment has both Python and R questions (could be just printing out 2+2); the R question is especially important, to make sure you don't have search path problems.

You will need both Python 2.7 and Python 3 on your laptop. As of January 1, Py 2.7 is officially deprecated. OMSI is written in Python 2.7, but according to the ECS 189G TA Runtian Wang, only a few print and exception statements need to be changed. We'll switch sometime later in the quarter. For now, it's 2.7.

Make sure to adhere to our OMSI rules.

Tuesday, January 7, 10:25 am

Outline of our course:

Overview of recommender system (RS) methods.

Review of linear algebra, then matrix methods in RS.

Review of probability, then statistical methods in RS.

Nearest-neighbor methods in RS.

Introduction to predictive methods in ML, and application to RS.

Introduction to NLP sentiment analysis, and application to RS.

Google PageRank.

Case studies.

Also: You can get an idea of what the homework will be like by looking at the Fall 2018 assignments.

Tuesday, January 7, 7:45 am

Lecture begins tomorrow! Don't forget to bring your textbook.

Keep in mind that the textbook is very incomplete. I was writing it when I taught the course in Fall 2018. That quarter the campus closed down for two full weeks, due to the Sonoma County fires and concern that the smoke would produce problems for some students. So, I will be adding supplements to the book throughout this quarter.

Monday, January 6, 5:55 pm

According to the Registrar site, our class' scheduled final exam slot is Tuesday, March 17, 6:00-8:00 p.m. We have no written final, but the Term Project deadline is 11:59 pm on that date.

As noted in class, I will assign your Term Project about three weeks prior to that deadline. As also noted, the Project IS the class, extremely important. The generous grading scheme reflects that.