Blog, ECS 189G, Winter 2020

Tuesday, March 17, 11:40 am

I'm mailing out the job interview grades. I have two for which I apparently misspelled or mis-heard the student's name. If you do not receive your grade and did participate, let me know.

Tuesday, March 17, 9:50 am

Please note that my handin username is "matloff". See syllabus.

Monday, March 16, 8:10 pm

Saturday, March 14, 2:25 pm

To ECS 145 and 189G students:

In submitting your Term Projects, make sure -- ABSOLUTELY, TOTALLY, POSITIVELY, EXTREMELY SURE -- that you get the file name right, with proper e-mail addresses etc. In grading the ECS 145 group quiz just now, two of the submissions were wrong. One used commas instead of periods and the other had only the submitter's name, not the other three team members. Fortunately I caught it in time, and changed the file names by hand, but potentially SEVEN students could have had an F grade on this quiz.

Wednesday, March 11, 10:50 pm

Earlier I wrote here that any group that would like feedback from me on their Hwk II reports should e-mail me their PDFs. I had thought I finished all the reports that were sent to me, but today I learned of one that I had apparently overlooked. If you did not receive feedback from me in response to a PDF you sent me, please let me know. (Note: The same member of your group who sent me the PDF should send me the new message.)

Tuesday, March 10, 11:20 pm

I fixed a couple of typos in today's supplement, and added some clarifying phrasing. Please use this version.

Tuesday, March 10, 7:30 pm

Should have mentioned today: There is a shortcut to finding matrix powers.

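Say we want to find a power of M with exponent at least 20. We do this (using %*%, since in R the expression K^2 squares a matrix element by element rather than multiplying it by itself):

K <- M; K <- K %*% K; K <- K %*% K; K <- K %*% K; K <- K %*% K; K <- K %*% K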

This gives us power 32.
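
Here is a minimal sketch of the repeated-squaring idea, checked against brute-force multiplication. The 2x2 matrix M is made up purely for illustration.

```r
# made-up 2x2 matrix, for illustration only
M <- matrix(c(1, 1, 1, 0), nrow = 2)

# square repeatedly: after j squarings, K holds M to the power 2^j
K <- M
for (j in 1:5) K <- K %*% K   # %*% is the matrix product; K^2 would be elementwise

# brute-force check: multiply by M 31 more times to get the 32nd power
chk <- M
for (j in 1:31) chk <- chk %*% M

all.equal(K, chk)
```

Five matrix products replace 31, and the saving grows quickly with the exponent.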

Tuesday, March 10, 11:35 pm

Just added our last supplement.

Tuesday, March 10, 9:45 pm

Just in case my 5:40 post was not explicit enough: Yes, we will have class tomorrow; yes, I will hold the Job Interview session as planned; and yes, we will have our Group Quiz on Friday.

Tuesday, March 10, 6:10 pm

In the Term Project specs, there is no requirement, or even recommendation, that you use the k-NN routines in either rectools or regtools. Depending on what you'd like the user to see -- again, it's open-ended, and I want you to produce something useful enough that some people will see it in your GitHub repo and make use of it -- you might use or modify one of the above routines or write your own.

Tuesday, March 10, 5:40 pm

As you have probably heard, all in-person final exams are canceled. But since our class didn't have a final anyway, there is no effect on us.

Tuesday, March 10, 5:25 pm

You may find this useful. The partykit call

predict(partyObj,xval,type='node')

returns the node number that xval lands in.

E.g.

library(regtools)
library(partykit)  # provides ctree()
data(mlb)
mlb <- mlb[,c(4:6)]
ctout <- ctree(Weight ~ .,data=mlb)
predict(ctout,mlb[1,],type='node')
predict(ctout,data.frame(Height=71,Age=22.5),type='node')

outputs

 1
14 
1 
8 

This also can be used to determine which rows of the original dataset are used in which nodes for prediction of future new cases. E.g.

> nodeRows <- split(1:nrow(mlb),predict(ctout,type='node'))
> names(nodeRows)
 [1] "4"  "5"  "8"  "9"  "10" "14" "15" "16" "19"
[10] "20" "21"
> nodeRows[[1]]
 [1]   6  15  73  75  77  84 105 139 175 219 273
[12] 279 308 331 343 347 349 435 437 444 447 449
[23] 515 550 562 580 614 654 677 682 712 752 757
[34] 785 790 810 815 850 854 883 894 922 928 961
[45] 996
> mean(mlb$Weight[nodeRows[[1]]])
[1] 176.9556

Any future case that lands in node 4 will have his weight predicted to be 176.9556.

Tuesday, March 10, 8:55 am

As reiterated in class yesterday, your code in the Term Project must be general. Do not tailor it to the example datasets. If a dataset's file is not in the form of your general code, that means the user must do his/her own preprocessing before calling your functions.

For instance, with the MOOCs dataset in our specs, the user must divide the rating range into intervals, to make it categorical, before calling ratingProbsFit().

Also: Note that the central goal is to produce a set of probabilities. If I am considering watching movie 324, I want to know: What is the probability that I would rate it a 5? A 4? Etc.
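
To make the form of the output concrete, here is a small sketch that turns a set of ratings into a probability for each rating value. The ratings vector is made up, and using relative frequencies is just one simple possibility, not a required method.

```r
# hypothetical ratings (1-5) given to one movie by users deemed similar to me
nearRatings <- c(5, 4, 4, 3, 5, 4, 2, 5, 4, 4)

# factor() with explicit levels keeps rating values that never occurred,
# so they show up with estimated probability 0
probs <- prop.table(table(factor(nearRatings, levels = 1:5)))
probs
```

The result has one entry per possible rating, and the entries sum to 1.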

Monday, March 9, 8:25 pm

Here is an example of ctree() where the outcome/response variable is categorical.

The dataset is prgeng, included in regtools. It's data on programmers and engineers in Silicon Valley in 2000. We will predict occupation, of which there are 6 categories.

library(regtools) 
library(partykit) 
data(prgeng)
head(prgeng)
for (i in 1:ncol(prgeng)) print(class(prgeng[,i])) 
ctout <- ctree(occ ~ .,data=prgeng) 
ctout

Here are the first few lines of output:

> ctout

Model formula:
occ ~ age + educ + sex + wageinc + wkswrkd

Fitted party:
[1] root
|   [2] educ in 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
|   |   [3] sex <= 1
|   |   |   [4] age <= 35.15345
|   |   |   |   [5] wageinc <= 49500: 101 (n = 1355, err = 63.2%)
|   |   |   |   [6] wageinc > 49500: 102 (n = 441, err = 63.3%)
|   |   |   [7] age > 35.15345

There are 79 nodes in all. We see above that nodes 5 and 6 are two of the leaf nodes. In node 5, we guess occupation code 101, and 102 for node 6.

Friday, March 6, 9:30 pm

In recosystem's data_memory() function, the index1=TRUE argument is crucial. It deals with the fact that indices start at 1 in R, while they begin at 0 in C, the latter being the language in which the core of recosystem is written.

One student misspelled the argument as 'index', so the code produced NAs, thinking that things started at 0. I think this is a good chance for a learning experience. Let's think about what happened as a result of the typo.

It's easy to understand that having the positions off by 1 would result in NAs. But you might wonder why it even ran at all. Shouldn't R have emitted an error message like "index: invalid argument name"? Ordinarily yes, but the problem is that the last formal argument of data_memory() is '...'. That is intended to accommodate various other arguments, so anything is allowed, including 'index'.
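
The effect is easy to reproduce in a toy setting. In the sketch below, toyShift() and strict() are made-up stand-ins, not recosystem's actual code.

```r
# toyShift() is a made-up stand-in, NOT recosystem's actual code
toyShift <- function(idx, ...) {
   opts <- list(...)                     # '...' accepts any named argument at all
   index1 <- isTRUE(opts[["index1"]])    # exact-name lookup; absent means FALSE
   if (index1) idx - 1 else idx          # convert 1-based to 0-based only if asked
}

toyShift(1:3, index1 = TRUE)   # recognized: returns 0 1 2
toyShift(1:3, index = TRUE)    # typo silently absorbed by '...': returns 1 2 3

# without '...', R rejects an unknown argument name; 'indx' is used here
# because R lets callers abbreviate names, so 'index' would match 'index1'
strict <- function(idx, index1 = FALSE) if (index1) idx - 1 else idx
res <- tryCatch(strict(1:3, indx = TRUE), error = function(e) conditionMessage(e))
res
```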

Computers are subtle. :-)

Friday, March 6, 11:50 am

Two items:

Thursday, March 5, 11:05 pm

Be sure to download and print our latest supplement.

Thursday, March 5, 7:15 pm

I was asked today what criterion to use for accuracy of a model, in the Term Project. Again, it's largely an open-ended project, so you can choose your own accuracy criterion/criteria. But the following may be helpful to you:

Consider our example from the US census. Say we predict occupation 102, using a logistic model. For each person in our test set, we will get an estimated P(occ = 102 | age, gender, education etc.). If we sum that probability over all people in the test set, we will get the expected total number of occupation 102 people in the test set, under the logit model; averaging instead gives the expected proportion.
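
Here is a sketch of that computation on simulated data. The variable names and all numbers are made up, standing in for the census data; the point is only the comparison of the model-based expected count with the actual count.

```r
set.seed(1)
# simulated stand-in for the census data; all names and numbers are made up
n <- 2000
age <- runif(n, 20, 65)
is102 <- rbinom(n, 1, plogis(-3 + 0.05 * age))   # 1 if "occupation 102"
d <- data.frame(age, is102)

trn <- d[1:1500, ]    # training set
tst <- d[1501:n, ]    # test set

fit <- glm(is102 ~ age, data = trn, family = binomial)
p <- predict(fit, tst, type = "response")   # estimated P(occ = 102 | age), per test case

sum(p)            # expected number of 102s in the test set, under the model
sum(tst$is102)    # actual number, for comparison
mean(p)           # expected proportion
mean(tst$is102)   # actual proportion
```

A large gap between the expected and actual counts would suggest the model's probabilities are poorly calibrated.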

Thursday, March 5, 6:40 pm

Ignore the 5:25 message. It was intended for ECS 145.

Thursday, March 5, 5:25 pm

Download and print the new supplement on R classes and environments. Consider it an official part of our course materials, and bring it to lecture tomorrow.

Tuesday, March 3, 9:30 pm

I've been reading your Hwk II reports in the order you submitted them, giving feedback that will help on your Term Project. It will be a few days before I get through all of them.

Tuesday, March 3, 11:00 am

Regarding the Windows problem mentioned earlier, a student reports that he solved it by placing a copy of his .Rprofile file (with the .libPaths() call to set search path) in his OMSI directory.

Monday, March 2, 10:35 pm

Here's a recap of my comments today about p-values:

Background:

As mentioned, people have known this all along. No statistician would be surprised by the above statements, and few if any would disagree. But no one did anything about it until the 2016 ASA Policy Statement. It was highly critical of p-values, and though it stopped short of saying they never should be used, it gave no examples of good use cases. See also the article in Nature, though again with somewhat muted tones.

Finally, note Raffi's point, quite correct, that the use of confidence intervals, while much better than significance testing, is still no cure for p-hacking. There are ways to adjust the CIs for that purpose, though.

Monday, March 2, 1:35 pm

Two news items:

  • We have a new supplement, on significance tests.
  • The official list of datasets that you will choose from for your project is now in the project specs.

    Monday, March 2, 12:45 pm

    We will have another Job Interview session this Thursday, 4:00-5:30 pm.

    Sunday, March 1, 7:30 pm

    I just noticed that I had not made the Lee and Seung paper world-readable in the supplements directory. Fixed now.

    Sunday, March 1, 12:00 pm

    As I've said, to me the Term Project IS the course, and accordingly, I structure the grading such that a good project makes a world of difference in your course grade.

    For that reason, I stated that I would make my own comments on your Homework II writeups (in addition to the regular grade you get from the TA), so that you can use my feedback to do a super Term Project.

    Just now I started writing a script to extract the PDF reports from each of your .tar files, but then I realized it actually is just easier for you to send me your PDF files in e-mail. That way the PDFs are already extracted :-) and I can send you my comments simply in my e-mail reply. Please send me your Homework II PDF as soon as possible.

    Sunday, March 1, 9:50 am

    I've now sent out the grades for Quiz 4. Some remarks on the individual questions:

    Saturday, February 29, 1:45 pm

    I've updated our MustInstall file.

    Saturday, February 29, 8:25 am

    I know some students have been disappointed with their grades on the quizzes. As I've said, that will be greatly compensated for with the Job Interview and a good Term Project (which will in turn be aided by feedback I give on your Hwk II), but I've decided to add one more "compensator":

    Instead of dropping your lowest 2 quizzes, I will drop the lowest 3.

    Please note again the value of the Job Interview in your ultimate course grade.

    For the remaining quizzes:

    Friday, February 28, 9:50 pm

    Based on an interesting observation made by a student, I've simplified the specs for the Term Project, removing the parts regarding covariates.

    Friday, February 28, 8:50 pm

    Several items:

    Monday, February 24, 11:30 pm

    Our Term Project is ready!

    Sunday, February 23, 11:05 pm

    Please download and print our latest supplement.

    Sunday, February 23, 4:15 pm

    I mentioned at the start of the quarter that our course assumes NO prior background in machine learning. I just did a spot check of how some students are doing in the quizzes, comparing those who have such background with those who don't. The answer basically is, no difference. What does count, instead, is good math intuition, which I also stated at the start of the course.

    As discussed before, that's why I have so many mechanisms to compensate for weaker quiz grades: dropping the lowest 2 quizzes; the Job Interview; and heavy bonus for a good Term Project.

    Again, my goal is to turn you all into good data analysts, with these rather open-ended, real-world projects. In a real job interview, you should be able to do well!

    Sunday, February 23, 1:25 pm

    There are a couple of typos in Supp02022020.pdf, p.26: (74,1.0) should be (74,180) and co2 should be hw.

    Saturday, February 22, 9:20 pm

    I've added a new supplement, to be discussed on Monday. It's rather dense, so don't try to read it in detail yet, but you might glance through it before Monday.

    Saturday, February 22, 3:10 pm

    In the InstEval data, the instructor IDs are not consecutive, which could cause problems. Here's how to convert the IDs to a new R factor with levels 1,2,3,...:

    # ids originally an R factor; temp convert to nums, then back to factor
    ids <- as.factor(as.numeric(ids))
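
    A small check of what that conversion does, using made-up IDs. Note the contrast in the last line, which recovers the original numbers rather than the level codes.

```r
# made-up, non-consecutive instructor IDs stored as a factor
ids <- factor(c(1002, 1050, 1582, 1050, 2050))

as.numeric(ids)      # internal level codes: 1 2 3 2 4
newIds <- as.factor(as.numeric(ids))
levels(newIds)       # "1" "2" "3" "4"

# contrast: this recovers the ORIGINAL numbers instead of the codes
as.numeric(as.character(ids))
```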
    
    Saturday, February 22, 10:35 am

    Please keep in mind that you are responsible on quizzes for the material in the R tutorial in the appendix of our textbook.

    Friday, February 21, 10:25 pm

    The latest supplement shows the details of the ALS method in the case of partially known A. As mentioned, more supplements coming over the weekend.

    Friday, February 21, 7:55 pm

    Please note the condition stated in Hwk I (for all assignments),

    Work must be done entirely in R, and except for packages that I specify, limited to R packages available on CSIF.

    and one in Hwk II:

    You'll use only lm() and NMF matrix factorization methods, and in general are allowed to use only methods from our course. You may run lars() instead of lm() if you wish.

    In conversations with some students, I've found that they were not adhering to these requirements. If you have already done most of your project using these "forbidden" things, go ahead and submit them. But please note the following:

    Thursday, February 20, 11:20 pm

    The next official Job Interview session will be February 26, 3:30-4:30. However, if there are not many students asking about homework in office hours this Friday and Monday, I can do a few Job Interviews then. Again, there will be other sessions after Feb. 26.

    Tuesday, February 18, 10:20 pm

    A student reported to me that his team's saved model takes up 1.5Gb. He was worried this would be a problem, and indeed it would. With 20+ groups submitting their work, this would quickly overwhelm the TA's disk quota on CSIF.

    So, we will do things this way: You will NOT submit a saved model, but still must submit FULL code, including preparatory operations on the data, e.g. outlier removal. You will still write a thorough report, etc.

    You will also indicate, in a README file, whether you wish to be considered for the Extra Credit for the three most accurate teams. The TA will only run these, again on his own secret cross-validation partition.

    Everything else remains the same. Again, remember that you must submit your FULL code, even if you are not entering the "competition."

    Tuesday, February 18, 8:20 pm

    Our first Job Interview session will be held this Thursday from 2 to 3 pm. I will hold other sessions at various times in the next couple of weeks, so that everyone who wishes to participate will be accommodated.

    Tuesday, February 18, 6:00 pm

    There is a new supplement. Print it and bring it to class Friday.

    Tuesday, February 18, 12:50 pm

    By popular demand, I have extended the due date for Hwk II (a lot). Note, though, that I will probably be assigning the Term Project in the next few days.

    BTW, no, I did not have a typo in my 11:45 post last night. :-)

    Monday, February 17, 11:45 pm

    The TA will make sure his holdout set contains no IDs not in the training set.

    Monday, February 17, 4:55 pm

    As mentioned, our first Job Interview session will be this coming Thursday, time to be announced.

    Some details on the workings of the Job Interview are available here.

    Monday, February 17, 12:00 pm

    A couple of people have asked me about using PCA for dimension reduction with categorical variables (after conversion to dummies). A few comments:

    By the way, one can also do PCA after forming interaction terms, thus accounting for nonmonotonic relations (e.g. income vs. age).

    All this serves to show, once again, that ML is an art, not a science. There are no good rules of the form "If this, do that and then the other thing." If someone claims such, view with a highly skeptical eye.

    Sunday, February 16, 4:15 pm

    As mentioned, the regtools package contains functions factorToDummies() and factorsToDummies().

    Saturday, February 15, 4:15 pm

    Tentatively I will begin the "job interviews" next Thursday. I'll have several sessions during the next couple of weeks, so everyone who wants to do this will be accommodated.

    Your "interview" will last 5 minutes or less. Typically I'll ask a question like "Tell me about..." and then ask followup questions.

    You'll get a letter grade. If it is better than your lowest quiz grade (AFTER deleting the lowest 2), it will replace the latter. Otherwise, it will have no effect; it cannot harm your course grade, only help or be neutral.

    Below are the job interview grades from when I taught this course in Fall 2018. (There were 37 students enrolled in the class.)

    > table(z[,2])
    
     A A- A+  B B- B+  C C- 
     7  8  1  1  1  7  1  1 
    > sum(table(z[,2]))
    [1] 27
    
    Saturday, February 15, 12:50 pm

    Had a typo in my last post (now corrected). The TWO lowest quizzes are dropped.

    Saturday, February 15, 10:15 am

    A note on our remaining quizzes:

    And the implications for your course grade:

    As I have explained before, in the mathematical courses that I teach (this one, and ECS 132 and 256), I realize that some students will not do very well on the quizzes, due to not having strong mathematical intuition. Yet I want to encourage them to learn the material and become good, strong data analysts, so I have mechanisms by which they can earn good grades -- dropping the lowest 2 quizzes; the optional "job interview"; and giving a major bonus in the course grade due to having a good Term Project. Extra Credit on Homework II will additionally serve in the same manner as the latter.

    Saturday, February 15, 12:25 am

    To give everyone a chance to catch up, no quiz or discussion section next week. More supplements coming, though.

    Friday, February 14, 10:40 pm

    A couple of items:

    Friday, February 14, 9:50 am

    Note that I'll be adding recosystem to packages you must have installed on quizzes, and others later.

    I heard that some people had trouble loading the InstEval data in a past quiz. Make SURE this won't happen in future quizzes. Also, note that whatever fix you come up with, your code must work on both your machine and mine, the latter when I grade your quiz. You may consider putting a call to tryCatch() in your code.
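
    One way the tryCatch() suggestion might look is sketched below. The helper tryLoaders(), the file path and the fallback data are all hypothetical; in practice the loaders would be your own ways of obtaining the data on different machines.

```r
# tryLoaders() is a hypothetical helper: run each loader in turn and
# return the first result obtained without an error
tryLoaders <- function(loaders) {
   for (ld in loaders) {
      res <- suppressWarnings(tryCatch(ld(), error = function(e) NULL))
      if (!is.null(res)) return(res)
   }
   stop("no loader succeeded")
}

# example: the file name and the fallback are made up for illustration
dat <- tryLoaders(list(
   function() read.csv("no/such/dir/InstEval.csv"),   # fails on this machine
   function() data.frame(s = 1:2, y = c(5, 2))        # fallback that works
))
dat
```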

    Thursday, February 13, 4:10 pm

    I was asked just now what the Homework specs mean when they say you must use at least some covariates. Here is the answer.

    Look at the worked-out lm() example. There we predicted rating from just user ID and movie ID. We could have added the age covariate, with the call

    lm(V3 ~ V1+V2+age,data=rats) 
    

    (Replace 'rats' by the name of the data frame, etc.) Now we are predicting rating from user ID, movie ID and age. Age is an example of what is called a covariate in stat and side information in ML.

    Thursday, February 13, 3:30 pm

    I have placed a new item in the Supplements/ directory, Supp02132020.pdf.

    Again, all materials in the Supplements/ directory are official parts of our course. There is also an optional item, the entire revised book. I'm providing these supplements piecemeal, and you may wish to see the whole thing in integrated form, e.g. for following links. I'll be updating this as I go along.

    Monday, February 10, 11:10 pm

    A student just asked me if they should switch to P/NP in our course, as tonight is the deadline for that. A few comments, important even if you are not considering P/NP:

    Monday, February 10, 10:40 pm

    An alert student pointed out to me that a single column for genre will not suffice in u.big. You'll need 19. I've changed the specs accordingly.

    Monday, February 10, 8:00 pm

    Please make sure you have read the course syllabus carefully, especially this passage regarding the interactive homework grading:

    You must be prepared to speak cogently about the ENTIRE assignment. In particular, if you worked on one part of the assignment and Johnnie worked on another, it is NOT acceptable to answer the TA’s question about Johnnie’s part by saying “Oh, I don’t know about that part, because Johnnie did it.”

    Sunday, February 9, 12:10 pm

    Solutions for Quiz 2 are now on our Web page. Let me know if you have any questions on the solutions, or on the grading of your quiz.

    Saturday, February 8, 9:40 pm

    Please note that I have added a bullet point to the Homework, stating "It is REQUIRED that your report contain a section titled, 'Who Did What,' explaining the contribution of each team member."

    Saturday, February 8, 9:00 pm

    For quizzes etc., make sure you know how to create and use S3 classes (Sec. A.10.3 of our book).
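
    As a quick refresher, here is a minimal sketch of an S3 class with a constructor and a print method. The class name 'rating' and its fields are made up for illustration.

```r
# constructor for a made-up S3 class 'rating'
rating <- function(user, item, score) {
   obj <- list(user = user, item = item, score = score)
   class(obj) <- "rating"
   obj
}

# print method, found by print() through S3 dispatch on the class attribute
print.rating <- function(x, ...) {
   cat("user", x$user, "rated item", x$item, "as", x$score, "\n")
   invisible(x)
}

r <- rating(3, 324, 5)
r                       # autoprints via print.rating()
class(r)
inherits(r, "rating")
```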

    Saturday, February 8, 12:40 pm

    I have moved the Supplements directory to here. Please keep in mind that all such material is to be considered official course material, eligible for quizzes, the "job interview" and so on.

    Saturday, February 8, 9:35 am

    On quizzes, be sure you know the R functions solve() and t(), for matrix inverse and matrix transpose.
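
    A short sketch of both functions, on made-up numbers. The second part uses them together to compute least-squares coefficients from the usual matrix formula.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # made-up invertible matrix
Ainv <- solve(A)      # matrix inverse
A %*% Ainv            # identity matrix, up to roundoff
t(A)                  # transpose

# typical use: least-squares coefficients via (X'X)^{-1} X'y
X <- cbind(1, c(1, 2, 3, 4))
y <- c(2.1, 3.9, 6.2, 7.8)
b <- solve(t(X) %*% X) %*% t(X) %*% y
b
```

    The vector b should agree with coef(lm(y ~ X[,2])).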

    Friday, February 7, 11:40 pm

    I've put in a detailed procedure for running your code for the Homework. I've also elaborated on grading and credit issues for this assignment. Read this new version as soon as possible, as it may affect how you do the work.

    Friday, February 7, 8:30 pm

    I've placed material on the LASSO in our Supplements directory. Remember, the files in that directory are considered our official course materials.

    'Cp' is Mallows' Cp, a variation on the minimal sum of squares. One popular strategy is to choose λ to be the one that minimizes Cp.

    The excerpt is from my book, Statistical Regression and Classification: from Linear Models to Machine Learning, CRC, 2017.

    Thursday, February 6, 10:35 pm

    URL in the 1:25 pm corrected.

    Thursday, February 6, 1:25 pm

    Our TA Runtian told me he had been taught that one should always scale one's data before applying PCA. This is indeed a common recommendation, but it oversimplifies the situation.

    This inspired me to write the third in my Clearing the Confusion series. Please read this and consider it part of our course materials.
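
    The sketch below shows only the mechanics of how scaling changes a PCA result; it is not a summary of the essay's argument. The data and variable names are made up.

```r
set.seed(1)
# made-up data: two independent variables on very different scales
x <- cbind(height_mm = rnorm(100, 1700, 100),
           weight_kg = rnorm(100, 70, 10))

pcRaw <- prcomp(x)                     # unscaled: the high-variance column dominates
pcScaled <- prcomp(x, scale. = TRUE)   # each column divided by its sd first

pcRaw$rotation[, 1]      # first component, raw data
pcScaled$rotation[, 1]   # first component, scaled data
```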

    Monday, February 3, 9:50 pm

    Due to a meeting over in the Genomics Center, I'll be leaving my office hour at 4 pm this Friday.

    Monday, February 3, 5:10 pm

    Say you look at 3 genres. Your table to explore whether different genres have different rating patterns should be 3x5.
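
    A sketch of how such a table might be formed with table(). The data frame is simulated, with made-up genre names, so only the shape of the result matters here.

```r
set.seed(1)
# made-up data: one row per rating, with the movie's genre and the rating (1-5)
d <- data.frame(
   genre = sample(c("Comedy", "Drama", "Horror"), 300, replace = TRUE),
   rating = factor(sample(1:5, 300, replace = TRUE), levels = 1:5)
)

tbl <- table(d$genre, d$rating)   # rows = genres, columns = rating values
dim(tbl)
prop.table(tbl, 1)                # each row: that genre's rating distribution
```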

    Monday, February 3, 4:55 pm

    Every quiz covers the material from the start of the course to the Monday preceding the quiz. Actually it's the Friday preceding the quiz, but sometimes something from Monday is helpful.

    Sunday, February 2, 12:35 pm

    Reminder: Make sure to be ready to use the InstEval data.

    Sunday, February 2, 12:25 pm

    There is a new supplement waiting for you. Print it out before Wednesday's lecture and bring it to lecture and quizzes. Remember, supplements are official parts of our course materials.

    Friday, January 31, 12:25 pm

    As I mentioned in class, we will have a number of supplements, since I am basically writing the textbook as we go along. They will all (only one so far) be available in here.

    Friday, January 31, 11:50 am

    Please note that the lme4 package, required for our class, contains the InstEval dataset. Be ready to use it in Homework and on Quizzes.

    Thursday, January 30, 8:25 am

    A couple of comments:

    Wednesday, January 29, 9:25 am

    When sending me e-mail, please remember to put "[ecs 189g]" in the Subject line.

    Tuesday, January 28, 7:20 am

    The questions and answers for Quiz 1 are now on our Web site.

    If you believe you were misgraded on a quiz problem, please feel free to bring this up with me. Make sure to read the solution first, and send me an e-mail query.

    Monday, January 27, 8:25 pm

    Comments:

    Monday, January 27, 5:50 pm

    As you know, a couple of weeks ago, some people were having trouble accessing the Kaggle data, so I found another source. I changed the MustInstall file accordingly, as well as the file for Hwk I.

    But a couple of students have pointed out that u.genre already lists the genres, and that there are 19, not 20 as in the other version.

    Please read the current specs, slightly revised. Sorry for the inconvenience!

    Sunday, January 26, 8:05 pm

    Recall that when I first wrote the homework specs and the MustInstall file, we were still having trouble obtaining the full MovieLens data, due to issues with Kaggle. But later I found the data in the GroupLens site itself. Accordingly:

    Sunday, January 26, 1:15 pm

    I am about to e-mail your grades on Quiz 1. A few points:

    Sunday, January 26, 12:00 pm

    Concerning file placement in the homework: When grading your work, the TA will run your code from his OMSI directory (or equivalent), with the same data file structure as in your OMSI directory. For instance, ml-100k/ will be a subdirectory of the directory from which he runs your R code.

    Note: Homework I due date is Thursday.

    Sunday, January 26, 8:25 am

    In your Homework problems, limit your code to base R and a small number of packages that I specifically allow. For graphics, the latter means lattice or ggplot2. Do you see why? The TA will run your code, and if 20 groups install say 3 packages each, that means that he would have to install 60 packages on his machine to grade your work, a highly unreasonable burden.

    Saturday, January 25, 4:00 pm

    A student asked me the other day if I had an opinion on the perennial debate as to whether R or Python is better for Data Science. (Note the qualifier.) I replied that I have an essay on that very topic. Comments welcome!

    Friday, January 24, 9:50 am

    To test whether you have the correct MovieLens dataset, run the code on p.30 of our book, and verify that you replicate the output. Make sure you understand every step!

    Friday, January 24, 9:30 am

    I want to give you a little more time to develop your R skill, and to make sure you have your data and R packages set up correctly. So, I am postponing Quiz 2 until the week after next. We will not hold discussion section next week.

    Friday, January 24, 8:55 am

    One technical skill you'll need in our class is facility with R factor variables. Again, see our fasteR tutorial. In particular, R packages vary as to whether they expect data in factor or indicator variable (often referred to as dummy variable) form, so you need to be able to convert between the two. The regtools package has utilities for that.

    Make sure you are really adept at operations as in this example:

    > library(lme4)  # MustInstall.html
    > data(package='lme4')
    Arabidopsis    Arabidopsis
                   clipping/fertilization data
    Dyestuff       Yield of dyestuff by batch
    Dyestuff2      Yield of dyestuff by batch
    InstEval       University Lecture/Instructor
                   Evaluations by Students at
                   ETH
    ...
    > data(InstEval)
    > head(InstEval)
      s    d studage lectage service dept y
    1 1 1002       2       2       0    2 5
    2 1 1050       2       1       1    6 2
    3 1 1582       2       2       0    2 5
    4 1 2050       2       2       1    3 3
    5 2  115       2       1       0    5 2
    6 2  756       2       1       0    5 4
    > dept <- InstEval$dept 
    > class(dept)
    [1] "factor"
    > levels(dept)
     [1] "15" "5"  "10" "12" "6"  "7"  "4"  "8"  "9" 
    [10] "14" "1"  "3"  "11" "2" 
    # 14 levels, '1', '2' etc. [note which one they skipped :-) ]
    > library(regtools)  # MustInstall.html
    > deptIVs <- factorToDummies(dept,'dept')
    > head(deptIVs)
         dept.15 dept.5 dept.10 dept.12 dept.6 dept.7
    [1,]       0      0       0       0      0      0
    [2,]       0      0       0       0      1      0
    [3,]       0      0       0       0      0      0
    [4,]       0      0       0       0      0      0
    [5,]       0      1       0       0      0      0
    [6,]       0      1       0       0      0      0
         dept.4 dept.8 dept.9 dept.14 dept.1 dept.3
    [1,]      0      0      0       0      0      0
    [2,]      0      0      0       0      0      0
    [3,]      0      0      0       0      0      0
    [4,]      0      0      0       0      0      1
    [5,]      0      0      0       0      0      0
    [6,]      0      0      0       0      0      0
         dept.11
    [1,]       0
    [2,]       0
    [3,]       0
    [4,]       0
    [5,]       0
    [6,]       0
    # 13 dummies, by default (why?)
    
    Thursday, January 23, 5:45 pm

    R's tapply() function is one of the most often-used in R. You'll find that it is quite useful in recommender systems.

    Make sure you've reviewed the tapply() examples in our fasteR tutorial before Quiz 2.
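
    As a reminder of the pattern, here is a tiny sketch on made-up ratings data: tapply() splits the first argument into groups according to the second, then applies the function to each group.

```r
# made-up ratings: user ID, item ID, rating
r <- data.frame(user = c(1, 1, 2, 2, 2, 3),
                item = c(10, 11, 10, 12, 13, 11),
                rating = c(5, 3, 4, 2, 3, 1))

userMeans <- tapply(r$rating, r$user, mean)     # mean rating given by each user
itemCounts <- tapply(r$rating, r$item, length)  # number of ratings per item
userMeans
itemCounts
```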

    Thursday, January 23, 5:45 pm

    To make sure we are all on the same page: When I grade future quizzes, I will first do

    % cd ~/omsi/
    

    This directory will contain OmsiGui.py etc. It is the same one from which you run that program on your own machine.

    I will launch R in that directory, then do

    > source('Grading/AutoGradeOMSI.R')
    > grader()
    

    My grading script will then be running in that same directory, i.e. the one containing OmsiGui.py etc. In that directory, I will have files like ml-100k/u.data, exactly the ones specified in MustInstall.html. Note the directory here, ml-100k/.

    My script will then run your code, in that environment.

    Thursday, January 23, 9:15 am

    Following up on my blog post last night, in most classes I teach I try to remember to post previous grade distributions. As you know, this course is under development, but I did teach it in Fall 2018. Here are the course grades:

    > table(z$V21)
    
     A A- A+  B B- B+  C C+ 
    13  5  5  6  1  2  2  3 
    

    The grades were pretty generous, I think, but not an "automatic A," as that one student had thought.

    Wednesday, January 22, 6:15 pm

    I received e-mail today from a student who referred to getting an "automatic A" in this class. Needless to say, I was shocked to see this.

    What I stated the first day of class is that I give an extra bonus for a very good Term Project. I then showed how a student in ECS 132 last quarter got an A in the class in spite of his mediocre quiz grades, due to a good Term Project and a good "Job Interview." It is NOT the case that I gave "automatic As" in that class, nor will it be in our class.

    Monday, January 20, 7:30 pm

    The quiz results were for Quiz 0, not Quiz 1.

    Monday, January 20, 7:20 pm

    Behold!

    % wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
    
    Monday, January 20, 2:10 pm

    E-mailing Quiz 0 grades as I write this.

    Saturday, January 18, 4:10 pm

    Important information for Tuesday's quiz:

    Friday, January 17, 9:55 pm

    The second and final problem in Homework II is now on the Web.

    The more you can work on this assignment before Tuesday's quiz, the better prepared you'll be (though I anticipate that that quiz will turn out to be one of the easier ones). This is especially true of Problem A, but please note that the material on Problem B may be on the quiz; the R functions, e.g. table(), will be considered an official part of the course.

    Friday, January 17, 7:20 pm

    I've now updated the MustInstall file consistent with my 4:10 pm posting.

    Friday, January 17, 4:10 pm

    Regarding downloading the MovieLens data:

    Friday, January 17, 3:40 pm

    I've updated the MustInstall file to specify which one of the three Turkish Evaluation files I want you to put on your laptop.

    BTW, if you ever have trouble loading a dataset, inform me by e-mail right away instead of waiting to see me in class.

    Thursday, January 16, 11:35 pm

    Problem I of Homework I is ready.

    Thursday, January 16, 9:15 pm

    I just added two items to the MustInstall file, one a dataset and the other the data.table package for data selection, filtering and so on. This will be needed because some of our datasets will be large.

    Make sure to install AND TEST these downloads well before Quiz 1.

    Wednesday, January 15, 8:25 pm

    One student told me he was unable to download the Kaggle MovieLens data. I've changed the URL slightly now, and was able to download from two separate Kaggle accounts. When you click on "Data" then "Download (15 MB)", it will invite you to create an account.

    I've simplified the instructions in MustInstall.html regarding installation of datasets.

    Wednesday, January 15, 3:25 pm

    Regarding prereqs for this course: As noted in the course flyer, you need to have had a calculus-based course in probability. Key words and phrases are density function; cumulative distribution function; expected value; exponential distribution; etc. STA 13 or AP Statistics is of essentially no value for this course. I do expect that you've forgotten the material and will review it in the next chapter, but you need to have spent a whole quarter pondering the material, not a few days of learning it from scratch. Of course, if you have really first-rate mathematical intuition, it might be OK to learn it now from scratch, but I don't recommend it.

    Tuesday, January 14, 11:00 pm

    Our syllabus is ready.

    Tuesday, January 14, 9:55 pm

    I will be maintaining a list showing which data files and R packages you are required to install on your machine for quizzes. You'll first need this for next week's quiz, and then for all subsequent quizzes. So you have some time, but I recommend starting as soon as possible. It is not the kind of thing you should leave for the last minute.

    This list will grow over time.

    Tuesday, January 14, 3:15 pm

    Our TA told me that Quiz 0 went well overall, but a couple of students could not run R, likely due to not having their search path configured properly. Needless to say, this will be a grave handicap in future quizzes.

    The flyer for our course lists ECS 30 as prereq. That course is now ECS 36A, which includes Unix tools in its topic coverage. Thus you should be very familiar with the concept of a search path. If not, you are not prepared for this course, and will need to learn these things quickly.

    Similarly, if you haven't run OMSI in both instructor and student roles as the docs state, with an R problem, then you are not ready for future quizzes.

    Tuesday, January 14, 12:10 am

    IMPORTANT NOTE: You are no longer responsible for Section 2.3.6; it's just too complicated.

    Also, note that I am continually revising the book, here. You are still responsible only for the original version, but you may find the improved phrasing helpful. In particular, I think you will find the revised Section 2.3.5 much clearer, and I recommend that you read it as a supplement.

    Monday, January 9, 9:30 pm

    Here is a dramatic use of PCA:

    It comes from this scientific paper.

    This is drastic dimension reduction, with (in our book's notation) p = 500568 (that many genes for each person) and s = 2. We usually wouldn't reduce that much, but the picture shows that a surprisingly large amount of information is retained in just 2 principal components.

    Each person is one point on the graph, plotted according to their PC1 and PC2 values. In addition, they've been color-coded by country of residence/ancestry. Amazingly, this reproduces the geography of Europe rather well.
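
    To try this kind of reduction yourself, here is a minimal sketch using R's built-in prcomp() on simulated data (not the genetic data from the paper). The sizes n and p, and the two hidden factors used to generate the data, are made up for illustration.

```r
set.seed(9999)
n <- 200   # number of "people" (made up)
p <- 50    # number of variables (made up; far fewer than 500568)

# simulate data driven by 2 hidden factors plus noise
f <- matrix(rnorm(n * 2), nrow = n)     # n x 2 hidden factors
w <- matrix(rnorm(2 * p), nrow = 2)     # 2 x p weights
x <- f %*% w + matrix(rnorm(n * p, sd = 0.5), nrow = n)

pcout <- prcomp(x)         # principal components analysis
pcs <- pcout$x[, 1:2]      # keep s = 2 components, PC1 and PC2

# proportion of total variance retained by the first 2 components
propVar <- sum(pcout$sdev[1:2]^2) / sum(pcout$sdev^2)
print(propVar)

# one point per row, as in the genetics picture:
# plot(pcs, xlab = 'PC1', ylab = 'PC2')
```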

    Thursday, January 9, 8:10 pm

    If you added the class after the first lecture, make sure you go to our class info page, and follow its links to our blog, the pre-quarter heads-up and the flyer.

    Thursday, January 9, 8:10 pm

    As I mentioned earlier, our course syllabus is not quite ready. Should be ready in a couple of days, and I'll announce it here.

    Thursday, January 9, 6:10 pm

    Our office hours are now posted on our class info Web page.

    Wednesday, January 8, 3:55 pm

    Note again that you will need Python 2.7 on your laptop in order to use OMSI. Conversion of OMSI to Python 3 should be simple, but we probably will not get to it this quarter.

    Tuesday, January 7, 1:20 pm Tuesday, January 7, 10:50 am

    You'll need to be an expert on OMSI, our online exam system, before next week's discussion section. We will have Quiz 0 then, whose sole purpose is to ensure that students are ready to use OMSI in the subsequent quizzes.

    This should be an automatic A+. And for most students in ECS 132 last quarter, it was:

    > z <- read.table('Quiz0Grades')
    > table(z[,7])
    
      A  A+   B   C   D  D+   F
      1 106  20   1   9   5   1
    

    You can avoid getting any grade lower than A+ by simply (a) reading the docs carefully and (b) actually running OMSI, playing the roles of both student and instructor. In (b), make sure your experiment has both Python and R questions (could be just printing out 2+2); the R question is especially important, to make sure you don't have search path problems.

    You will need both Python 2.7 and Python 3 on your laptop. As of January 1, Py 2.7 is officially deprecated. OMSI is written in Python 2.7, but according to the ECS 189G TA Runtian Wang, only a few print and exception statements need to be changed. We'll switch sometime later in the quarter. For now, it's 2.7.

    Make sure to adhere to our OMSI rules.

    Tuesday, January 7, 10:25 am

    Outline of our course:

    Also: You can get an idea of what the homework will be like by looking at the Fall 2018 assignments.

    Tuesday, January 7, 7:45 am

    Lecture begins tomorrow! Don't forget to bring your textbook.

    Keep in mind that the textbook is very incomplete. I was writing it when I taught the course in Fall 2018. That quarter the campus closed down for two full weeks, due to the Camp Fire in Butte County and concern that the smoke would cause problems for some students. So, I will be adding supplements to the book throughout this quarter.

    Monday, January 6, 5:55 pm

    According to the Registrar site, our class's scheduled final exam slot is Tuesday, March 17, 6:00-8:00 p.m. We have no written final, but the Term Project deadline is 11:59 pm on that date.

    As noted in class, I will assign your Term Project about three weeks prior to that deadline. As also noted, the Project IS the class; it is extremely important. The generous grading scheme reflects that.