# Blog, ECS 256, Winter 2016, Norm Matloff

## Thursday, March 10, 5:40 pm

I really was hoping that everything would go smoothly today with students' use of handin etc. That turned out not to be the case. Worse, I asked everyone at 2:30 to try saving at that time, so that any trouble spots could be discovered early and quickly remedied. That didn't turn out to be the case either.

Sigh. I will console myself with the knowledge that everyone turned in real .tar files, e.g. not .zip renamed to .tar, and that the file structure will be OK. I've not checked yet, but I'm crossing my fingers...

I won't impose penalties on any of you, but if you have ANY suggestions to get students to follow directions, I'd sure appreciate them.

I gather that the exam itself was more difficult than I had thought. I'll post solutions later, but here is an outline:

• 1. Straightforward simulation. Generate nrow(indata) * nnoise random variables from N(0,σ^2), apply matrix(.,ncol=nnoise) and then use cbind() to tack on the new data to the old. Or, use loops if you wish. Call lm().
• 2. Use Var(Y) = E[Var(Y|X)] + Var[E(Y|X)]. Since we are given that Var(Y|X) = X etc., and we know (i.e. look up in the book) that Var(X) = 1/12, we're done in 1.5 minutes. With integrals it does take a while, but still not bad.
• 3. Straightforward problem similar to the homework, no surprises.
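For the simulation in Problem 1, the code might be sketched as follows; this is only a sketch, with `indata`, `nnoise`, and `sig` as hypothetical argument names and mtcars as a stand-in data set:

```r
# sketch of the Problem 1 simulation: generate nrow(indata) * nnoise values
# from N(0, sig^2), shape them with matrix(ncol = nnoise), cbind() them onto
# the original data, then call lm() on the augmented data
addnoise <- function(indata, nnoise, sig) {
   noise <- matrix(rnorm(nrow(indata) * nnoise, sd = sig), ncol = nnoise)
   colnames(noise) <- paste0("noise", 1:nnoise)  # names chosen for illustration
   cbind(indata, noise)
}

set.seed(9999)                    # arbitrary seed, for reproducibility
aug <- addnoise(mtcars, 3, 1.0)   # mtcars standing in for the quiz's data
fit <- lm(mpg ~ ., data = aug)
```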

## Thursday, March 10, 8:50 am

Recall that I said earlier, "I'll be basing your homework grade on three things: Nick's prescreening assessment; my own reading of your homework submissions; and today's interactive grading." I will be doing that reading in the next couple of days, hopefully finishing tomorrow, at which point I will give you your homework grade.

## Wednesday, March 9, 6:25 pm

Faculty interview talks start tomorrow.

A reminder that our first NLP faculty candidate talk will be tomorrow at 3:10pm in 1131 Kemper Hall:

Dr. Ndapa Nakashole, from Carnegie Mellon University

Title: Knowledge Graph Construction using Machine Reading Methods

Abstract:

Knowledge graphs such as NELL, Freebase, and YAGO provide a means to address the knowledge bottleneck in artificial intelligence domains such as natural language processing, computer vision, and robotics. State-of-the-art knowledge graphs have accumulated large amounts of beliefs about real-world entities using machine reading methods. Current machine readers have been successful at populating such knowledge graphs by means of pattern detection, a shallow way of machine reading which leverages the redundancy of large corpora to capture language patterns. However, machine readers still lack the ability to fully understand language. In the pursuit of the much harder goal of language comprehension, knowledge graphs present an opportunity for a virtuous circle: the accumulated knowledge can be used to improve machine readers; in turn, advanced reading methods can be used to populate knowledge graphs with beliefs expressed using complex and potentially ambiguous language. In this talk, I will elaborate on this virtuous circle, starting with building knowledge graphs, followed by results on using them for machine reading.

Bio:

Ndapa Nakashole is a post-doctoral fellow at Carnegie Mellon University. Her research interests include machine reading, natural language processing, machine learning and data mining. She works with professor Tom Mitchell on using machine learning to build computer systems that intelligently process and understand human language. She received her PhD from Saarland University, Germany.

## Monday, March 7, 8:05 pm

Keep in mind that this coming Thursday, March 10, we will have our second quiz.

• You must have a laptop computer, with R and LaTeX installed and tested.
• Your .tar file will be named in the same manner as in your homework, but with just your own e-mail address, not those of your teammates. It must be timestamped on or before 3:00 p.m. on handin on CSIF, in the directory 256quiz2.
• NO LATE SUBMISSIONS. Save your work to handin frequently, so that if time runs out, you at least still have something submitted.
• In case the WiFi goes down (and ONLY in that case), bring a USB key (memory stick) so that you can submit your file to me that way. The same timestamp restriction applies.
• Your .tar file must contain one .tex file, one .pdf file (output of the .tex file), and one R file for each problem requiring code to be written. The R files will be named x.R, where x is the problem number, e.g. 2.R for Problem 2. Even if a problem is purely R, your code listing must be in the .tex file, in addition to the .R file.

The quiz will cover the statistics half of the course.

## Monday, March 7, 6:05 pm

In our textbook, I make a big point of the folly of significance testing. I also note that even though everyone knows it has problems, they use it anyway, which is bizarre.

Today the American Statistical Association finally recognized that, pretty amazing. By the way, the current ASA president is Jessica Utts, a former UCD professor. Here is the ASA's statement, signed by Jessica and the ASA executive director:

Dear Member,

Today, the American Statistical Association Board of Directors issued a statement on p-values and statistical significance. We intend the statement, developed over many months in consultation with a large panel of experts, to draw renewed and vigorous attention to changing research practices that have contributed to a reproducibility crisis in science.

"Widespread use of 'statistical significance' (generally interpreted as 'p < 0.05') as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process," says the ASA statement (in part). By putting the authority of the world's largest community of statisticians behind such a statement, we seek to begin a broad-based discussion of how to more effectively and appropriately use statistical methods as part of the scientific reasoning process.

In short, we envision a new era, in which the broad scientific community recognizes what statisticians have been advocating for many years. In this "post p < .05 era," the full power of statistical argumentation in all its nuance will be brought to bear to advance science, rather than making decisions simply by reducing complex models and methods to a single number and its relationship to an arbitrary threshold. This new era would be marked by radical change to how editorial decisions are made regarding what is publishable, removing the temptation to inappropriately hunt for statistical significance as a justification for publication. In such an era, every aspect of the investigative process would have its appropriate weight in the ultimate decision about the value of a research contribution...

This is the first time the ASA has spoken so publicly about a fundamental part of statistical theory and practice. We urge you to share this statement with appropriate colleagues and spread the word via social media...

## Wednesday, March 2, 8:35 am

The summation in Equation (12.60) should go from i = 1 to 3.

## Tuesday, March 1, 8:35 pm

We discussed the LASSO in class today. I stated that the procedure is interesting, but is no panacea. You may find a post to my statistics blog of interest in this regard.

## Tuesday, March 1, 8:35 pm

I made an Extra Credit offer to my ECS 132 class. The offer applies to you in ECS 256 as well.

## Monday, February 29, 8:35 pm

The interactive grading went reasonably well today.

I know some students were disappointed with their performance. Though I had said in class and here in the blog that the interactive grading would be "conversational," "like an interview," "like a PhD oral exam--you just talk, and I ask questions based on what you say," I think some students just didn't expect it to work this way. But really everyone did at least OK, some much more than OK; don't worry about it.

By the way, I will be announcing a pretty major Extra Credit offer, either this evening or tomorrow.

One important thing that came up in today's grading sessions: Most students found it difficult to express themselves on technical issues. On a number of occasions, I said, "Pretend you are a consultant and you are talking to a client. How would you explain what XYZ is?", e.g. the Kolmogorov-Smirnov band that you coded in Homework II. It's not just that they had problems explaining technical things in general, but also they were not able to put themselves into the shoes of the client, i.e. understand what the client would want to know and how to explain it to him/her in terms the client could understand. This clearly is a skill that everyone should strive to perfect. Anybody can achieve excellence in this, even those who are not native speakers of English.

## Sunday, February 28, 12:25 pm

The interactive grading will be in my office, Kemper 3053.

## Sunday, February 28, 9:35 am

Remember, our interactive grading sessions will be held tomorrow (Leap Day!), from 3 to 6:30. By now, you should have gotten notification from your group leader as to your time slot.

The style will be conversational. I will ask questions of each person in the group, along the lines of "Why this?" and "How that?" As noted in class, each member of the group must be able to answer questions about any problem in the assignment. Both Homework I and Homework II will be covered.

## Friday, February 26, 4:45 pm

You'll recall that I've said several times, both here and in class, that estimating mixtures is hard, due to convergence problems (including converging to the "wrong" solution), issues with the latency (hiddenness) of some variables, and so on.

In that light, you'll understand why I laughed out loud today when I saw this quote from one of my favorite statisticians, Larry Wasserman of CMU:

I have decided that mixtures, like tequila, are inherently evil and should be avoided at all costs.

He also said that mixtures are the "twilight zone" (i.e. Land of the Weird) of statistics.

This is all reported in this posting to R-Bloggers, an extremely valuable newsletter for anyone who does a lot of data analysis. My own blog is archived there too.

Getting back to that mixture posting, what makes it even more interesting is that the posting cites a paper saying that the Method of Moments is a better way to estimate mixtures than MLE! (But of course, there are problems there too.)

## Thursday, February 25, 9:45 am

I have decided not to assign Homework III, for the following reasons:

• We are simply running out of time. I would not want to have interactive grading after March 10, our last day of class, and working backwards from that point, time is just too tight.
• I believe that many students are not spending enough time on the reading. One of the factors (though certainly not the only one) underlying that is likely that people have been focusing on the homework.

I would have liked to have had a third assignment, but the first two have been quite substantial, and certainly achieve the goals I had.

There are still some groups that have not submitted their time preferences for interactive grading to me. Please do so TODAY.

## Wednesday, February 24, 12:10 pm

I wish to reiterate that estimation of mixtures is hard. Some students have mentioned to me that they are getting negative values for the mixing coefficients, for instance, and this is not surprising. (It will probably go away if you take a much larger sample.)

This might be improved somewhat by taking an alternating approach similar to that of the EM algorithm: Treat the mixing coefficients as correct and estimate the distribution family parameters in odd-numbered iterations, and do the opposite in the even-numbered ones. But it is not a panacea.

Note by the way that the power law family is very unstable, in the sense that we usually have very small values but sometimes have really large ones, the "Black Swan" notion. Accordingly, that makes estimation of the gamma parameter difficult as well.

Remember that in this project we're pretending to develop a tool that could be used by people who choose to use these "dangerous" models such as mixtures and power laws. We are NOT endorsing the use of such models.

## Wednesday, February 24, 12:00 pm

Several people have asked me about standard errors. What are they, and how are they derived?

The answer to the second question is the more complex one. Where does gmm() get its standard errors, for instance? This relies on the Central Limit Theorem, and something called the "delta method." Here is an outline of the derivation.

First, note that sample moments (say noncentral ones) are made up of sums, so the Central Limit Theorem says they are approximately normally distributed. In fact, if we have a vector of k sample moments, its approximate distribution will be k-variate normal.

Second, we know that any linear function of a normally distributed random vector will again be normally distributed. In general, the estimators coming out of a Method of Moments operation will not be linear functions of the sample moments, but we can approximate them by linear functions by expanding in a first-degree Taylor series. For large n, theta-hat will be close enough to the true population parameter vector theta that the nonlinear terms become negligible. We thus can get the approximate covariance matrix of theta-hat from the covariance matrix of the sample moments.
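To make the outline concrete, here is a toy delta-method calculation; the example (an exponential rate estimated by 1/xbar) is my own illustration, not how gmm() is implemented internally:

```r
# toy illustration of the delta method (not gmm() internals):
# X ~ Exp(rate = 2), and lambda = 1/E(X) is estimated by the nonlinear
# function 1/xbar of the (asymptotically normal) sample mean
set.seed(2016)
n <- 10000
x <- rexp(n, rate = 2)
xbar <- mean(x)
lamhat <- 1 / xbar                        # nonlinear function of xbar
se.xbar <- sd(x) / sqrt(n)                # CLT standard error of xbar
se.lamhat <- abs(-1 / xbar^2) * se.xbar   # first-degree Taylor: |g'(xbar)| * SE(xbar)
```

Here g(t) = 1/t, so the first-degree Taylor expansion gives SE of lamhat as |g'(xbar)| times the SE of xbar, in line with the outline above.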

## Tuesday, February 23, 9:00 pm

It would be good for you to draw a picture of the approximation in (4.46). There you'll see that c/(gamma-1) is an underestimate of the series. In fact, the greater gamma is, the more serious the error. Thus for larger gamma, the true c is considerably less than gamma-1.

Thus when you calculate the first moment of this distribution, I recommend that you use a better method to find c. One quick way would be to simply keep summing k^(-gamma), k = 1,2,3,... until the sum doesn't change much, and then take c to be the reciprocal of the sum.
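A sketch of that summation approach (the stopping tolerance is my own choice):

```r
# normalizing constant c for the discrete power law p(k) = c * k^(-gamma),
# k = 1,2,3,...: keep summing k^(-gamma) until the terms are negligible,
# then take the reciprocal of the sum (convergence is slow for gamma near 1)
plawc <- function(gamma, tol = 1e-10) {
   tot <- 0
   k <- 1
   repeat {
      term <- k^(-gamma)
      tot <- tot + term
      if (term < tol * tot) break
      k <- k + 1
   }
   1 / tot
}
```

For gamma = 2 the series sums to pi^2/6, so plawc(2) should come out near 6/pi^2, about 0.608, well below the crude gamma - 1 = 1.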

## Tuesday, February 23, 4:00 pm

Unfortunately, it has just come to my attention that my specs for the CRAN problem in Homework II are missing an important ingredient -- a function specifying the pmf/pdf for the parametric family being fit.

One group told me that in their code, the user is prompted for this. I will definitely accept that, but of course it is not good design practice. It would be better to specify the function in an argument gd in the mmfit call, and I've updated the specs accordingly.

Sorry for this omission. It would have been nice if I had learned of it earlier, but since it is so late, I've extended the due date accordingly.

Here are some tips, in response to some questions I've gotten:

• For your examples, you can generate random data from the specified family if you wish, but it would be better to use real data. Note that that is what I did in the book example. I used the real geyser data, fit a beta model to it, and then graphed to assess the fit.
• For the power law example, there are some nice data sets in the powerlaw package on CRAN.
• Note that using the visual approach to fit assessment as we are doing here is just an informal procedure; there is no formal way to make a decision "Good fit, bad fit" based on this. A formal way would be to do a goodness-of-fit test (chi-squared, Kolmogorov-Smirnov etc.), but significance tests can be highly misleading, as we have discussed.
• You are supposed to derive the first moment of the power law family. Let me know if you are having trouble with this.
• Note that R's functions curve() and integrate() can be useful.
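On that last tip, here is the sort of thing curve() and integrate() are good for, with a Beta(2,5) sample as a made-up stand-in for real data and its fitted model:

```r
# visual fit assessment: histogram of the data with the fitted density
# overlaid via curve(), plus an integrate() sanity check on the density
set.seed(99)
x <- rbeta(1000, 2, 5)                    # stand-in for real data
hist(x, freq = FALSE, breaks = 20)        # empirical distribution
curve(dbeta(x, 2, 5), add = TRUE)         # fitted parametric density on top
totmass <- integrate(function(u) dbeta(u, 2, 5), 0, 1)$value  # should be near 1
```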

## Monday, February 22, 11:05 pm

In my hint this afternoon on Math Problem A, I forgot to mention (a) use iterated expectations and (b) review the material on p.367 and the related emphasis in class that λM is a random variable (because M is).

## Monday, February 22, 10:55 pm

In denoting multiplication in your homework and quizzes, please use '*' only for program code, NOT in math expressions. For the latter use juxtaposition, \cdot etc.

## Monday, February 22, 10:50 pm

We need to start interactive grading for the homework. Since we are getting a rather late start (my fault), we'll combine the grading for Homework sets I and II.

The grading will consist of 30 minutes per group. If possible, I'd like to have most or all of the sessions next Monday, February 29, between 3 and 6:30 pm. (So the last session might be during 6:00-6:30.) Please discuss with your team, and appoint ONE person to communicate with me on your time preferences during that Monday time block. Remember, all members of the team must be present.

## Monday, February 22, 9:10 pm

In the homework, I am only asking for the discrete power law.

Nick's excellent Rcpp tutorial is on my Web page.

## Monday, February 22, 5:15 pm

Notes on the current homework:

• The CRAN problem mentions the notion of empirical cdf and the Kolmogorov-Smirnov method, and says they are discussed in the full version of the book. Presumably most of you know that one can do searches within PDF files, and have found the needed material. But a group asked me about it today, and they apparently didn't know one can search PDFs. Keep this in mind, as it obviously is valuable (not just in our course)!

The sections are 20.9 and 21.2.2.

• The same group also spoke of the power law as "a relation between two variables," which they got from the Wikipedia entry. They then concluded that the argument x in our CRAN problem for the power law will be a matrix rather than a vector.

I consider Wikipedia to be a wonderful resource, and I use it a lot myself. But it can be wrong, or misleading. (Books can be wrong or misleading too.) Unfortunately, Wikipedia's phrasing here gave these students the wrong impression.

In that Wikipedia entry, one of the "two variables" they refer to is not a variable at all; it is a frequency. In other words, the power law is simply a family of probability mass functions, as in our book, or a similar family of density functions (not in our book).

In the Wikipedia example of the sizes of earthquakes, for instance, the horizontal axis is earthquake size and the vertical axis is the density value for that size.

• In Math Problem A, your derivation will at some point have sums with terms of the form

P(Xn = i) × something

When you take the limit, the factor P(Xn = i) will become πi.

## Sunday, February 21, 2:45 pm

• The solutions are here.
• Problems 1(a) and 2 were meant to be very easy, while 1(b) and 3 were meant to be challenging.
• I have not stated how much of your course grade the quizzes will count. One reason is that I want to be able to account for disparate backgrounds of the Stat and non-Stat students. I want very much to teach this course in a manner that will be useful to both groups; this is a tough task, so I want to take background into consideration in the grading.

Definitely the interactive homework grades (Hwk I grading coming soon!) will play a larger role in the course grades. But yes, the two quizzes will be factors too.

• As judged by median/mean, the Stat students tended to do better than the non-Stat ones. HOWEVER, that was only at the low end; at the high end and the middle, the two groups were roughly similar. In other words, the disparate backgrounds of the two groups were much less of a factor than one might guess. [Note: that word is disparate, not desperate. :-) ]
• Apart from the above, the following needs to be said: A disturbingly large number of students failed to heed directions. My blog post of Tuesday, February 2, 8:25 pm said,

You will submit your work in a .tar file to 256quiz1 on CSIF. The file will contain your LaTeX document in a file Quiz1.tex, with its output in Quiz1.pdf, as well as your R code, in a file Quiz1.R. No subdirectories, please.

I repeated these requirements in class multiple times in the days leading up to the quiz. It was your responsibility to know how to do these things, e.g. form .tar files. And yet instead of .tar I got .zip and even .rar; I got corrupted .pdf files; several people didn't know how to use handin; and many people had subdirectories!

If a student does not take the time to "take care of business" on even such mundane things as the proper way to submit quizzes, one must speculate that the student did not give careful thought to our textbook reading and so on. For instance, I've gotten rather few questions from students about the material in the textbook.

Please excuse my bluntness here, but an exam is supposed to give you useful feedback, so please treat my comments as stemming from my concern for your welfare as students.

If you did submit your quiz in full compliance with the instructions, please notify me, and I will give you Extra Credit. But everyone is expected to be fully compliant in Quiz 2.

## Saturday, February 20, 11:05 am

I am currently grading your Quiz 1 papers. Again, sorry for the delay, but right now I am being further delayed by the failure of some students to comply with directions, which is causing my grading script to crash. Please note, for example, that .zip files, .rar files and so on are not .tar files! I will not impose penalties, but I would highly appreciate it if you all strictly follow directions in Quiz 2. Thanks in advance.

## Friday, February 19, 9:50 pm

A student sent me a query regarding the beta distribution example of gmm() in our book, specifically the computation of m2. Why, he asks, do I use

(x-meanb)^2

rather than

(x-mean(x))^2


Actually, the latter would work too, but the former should be a little better, as follows.

Look at (15.23), p.270. It is exactly in analogy with the population mean: We average over our sample instead of averaging over the population, and we use the sample mean instead of the population mean. BUT if we knew the population mean μ, we could use it instead of X-bar, and get a more accurate estimate of σ^2. In fact, then there would be no controversy regarding dividing by n vs. n-1, since the estimate with 1/n would be unbiased. (Make sure you see this.)

In the case brought up by the student, we don't know μ but we do know it in terms of α and β, and this is useful.
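A quick simulation illustrates the point about the known mean (toy numbers of my own choosing):

```r
# compare two variance estimators, each dividing by n (not n-1):
# with the true mean mu = 0 known, the estimator is exactly unbiased;
# with the sample mean plugged in, it is biased low by the factor (n-1)/n
set.seed(256)
n <- 10
knownmu  <- replicate(50000, mean((rnorm(n) - 0)^2))
samplemu <- replicate(50000, { x <- rnorm(n); mean((x - mean(x))^2) })
mean(knownmu)    # near sigma^2 = 1
mean(samplemu)   # near (n-1)/n = 0.9
```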

## Friday, February 19, 10:30 am

In my post yesterday mentioning "my mmf() function," I of course meant "my mm() function."

## Thursday, February 18, 5:45 pm

News on Hwk II:

• Someone pointed out that 2/23 is a Tuesday, not a Monday. So, I have changed the due date to Tuesday. However, please note that Hwk III is coming soon.
• I believe I mentioned in class that one generally needs a very large sample for good results on mixtures (and that nonlinear procedures like this can be very sensitive to starting values). The user chooses the starting values; if it doesn't converge, he/she must try new ones (or maybe suspect that the model is incorrect).
• I'm getting queries on the built-in cases, but don't lose sight of the user-specified case.
• One group found a bug in gmm(). Nick reported it to the maintainer of the package, and his response was

From: "Pierre Chausse"
Date: Feb 17, 2016 13:50
Subject: Re: gmm Bug - Single Parameter Estimation
To: "Nick Ulle"
Thanks for pointing that out. It is so rare that I have never tested it. I will fix it when I have time. I might just remove the optimize option and leave the option of using Brent which works well. You did not use Brent properly in gmm(). t0, like in optim, is a scalar and lower and upper must be provided as separate arguments. The following works well:

gmm(exp_constraint, x, 5, lower = 0, upper = 10, method = "Brent")

Pierre

You also may wish to try my mmf() function, linked to in the homework specs.

I hope to get your quizzes graded tonight. Apologies for the delay.

## Wednesday, February 17, 10:55 pm

I see now that I forgot to state in Math Problem A that we're assuming that the stationary distribution etc. exist. I've now added such language.

## Sunday, February 14, 9:45 pm

The specs for Homework II are now complete.

## Sunday, February 14, 8:55 pm

I just made a couple of clarifications on Hwk II.

## Sunday, February 14, 4:15 pm

Note that the due date for Homework II, February 23, should be considered FIRM. Requests for an extension of one day will be considered upon evidence that the requesting group is nearly done.

There will be one math problem in the assignment, also due February 23. It should be on the Web later today.

## Sunday, February 14, 3:45 pm

The CRAN project Hwk II is now on the Web!

## Thursday, February 4, 11:25 pm

I've placed today's quiz, with solutions, in the Exams/ directory of our class Web site.

## Tuesday, February 2, 9:25 pm

If you submitted Stage 2 for the homework in a manner that overwrote Stage 1, please resubmit Stages 1 and 2 together in a single file.

## Tuesday, February 2, 8:25 pm

Concerning Thursday's quiz:

• Coverage for this quiz will be on probability topics only, no statistics. That means discrete- and continuous-time Markov chains, the exponential distribution family, random vectors and multivariate distributions.
• As previously noted, you'll need a laptop with R and LaTeX installed (and tested).
• During the exam, you may access anything on the Internet -- but no communication with any people. If you wish to bring books -- probability textbooks, dictionaries, comic books -- that is fine.
• You will submit your work in a .tar file to 256quiz1 on CSIF. The file will contain your LaTeX document in a file Quiz1.tex, with its output in Quiz1.pdf, as well as your R code, in a file Quiz1.R. No subdirectories, please.

## Monday, February 1, 2:15 pm

Thursday's quiz will not cover the statistics material, i.e. Chapter 13 and later. That material will be covered on the second quiz.

## Thursday, January 28, 9:15 pm

I'm getting questions on Problem C. I think it would be helpful to look at a specific example state of transition rates.

There are various ways to label the states. What I'll use here will be (i,j,k), for i ordinary calls currently in service, j calls at the manager (1 being served currently, j-1 in line) and k calls in the arrivals queue. One could set up an equivalent double-subscripting system, as in the problem specs, but this one is easier to discuss here. (And you are welcome to use the triple-subscript system if you wish.)

Say s = 6, and consider the state (6,3,2). Suppose one of the two servers finishes his/her call, and takes the one at the head of the arrivals queue. The latter now has only 1 call. However, say the server finds (immediately, according to assumption) that it is actually a complex call, and hands it off to the manager, so that there are now 4 calls at the manager. But that frees the server who did the handoff, and now that server can take another call from the arrivals queue, further reducing the queue length to 0. BUT this call too might be complex! If so, we end up at state (5,5,0).

Now, at what rate are we going from state (6,3,2) to (5,5,0)? For one particular server, we have a rate of σq^2, but there are 6 servers for which the above events could occur, for a total rate of 6σq^2.

Note that though the above description of what transpires in a transition from (6,3,2) to (5,5,0) involves several events, they all occur instantaneously (under our assumptions). A key point is that only one exponentially distributed random variable is involved, so the issue discussed in class (that the probability of two independent continuous random variables producing events at exactly the same time is 0) does not apply here.

As mentioned before, this is a very messy, intricate problem, with lots of cases. You will NOT be docked points for overlooking a case here and there. However, if your simulation results don't come out reasonably close to your analytical ones, you should try to find possible errors either in your math or your code.

## Thursday, January 28, 9:45 am

Please keep in mind that we will have our first quiz on February 4. (Our second and last quiz will be on the last day of class, March 10.) Note the following.

• You will need a laptop with R and LaTeX installed (and tested!).
• You will package your quiz in a .tar file consisting of your .tex, .pdf and .R files, as with the homework.
• You will submit your .tar file on the CSIF machines via handin, in the directory 256quiz1. Please try this ahead of time if you have not used handin before (e.g. if one of your teammates has been the one submitting the homework). CHECK THAT IT WORKS.
• The coverage of the quiz will include everything up to and including the lecture of February 2.
• Needless to say, during the quiz, you must not communicate with any person in or outside the class, either verbally or electronically.

## Tuesday, January 26, 11:20 pm

A question arose today as to exactly what is meant in Problem C by saying the determination of whether a call is complex is made "upon call arrival." I agree that this was ambiguous. What I meant was the time at which the call is screened by a server. In particular, a call must wait in the big queue, finally be looked at by a server when the call reaches the head of the queue and a server becomes free, and only then have its type determined by that server.

By the way, this problem illustrates how quickly a model can become complex. In its description, it seems simple, but the Markov chain is rather intricate, with a lot of cases.

## Monday, January 25, 8:20 pm

Recall that we will have two quizzes in our course. I've decided that the first one will be on February 4.

You will use your laptops in the quiz. The laptop must have LaTeX and R installed. If either of these conditions presents a problem for you, please let me know as soon as possible.

## Monday, January 18, 4:20 pm

Math Problem A, first two bullets, concerns the final contents of a box, after we start a new one.

## Sunday, January 17, 11:00 pm

Stage 2 is ready, due February 1. There is quite a bit here, more than it seems. Start now!

## Sunday, January 17, 9:35 pm

I am preparing the second (and last) stage of the homework now, ready later this evening. (I'll announce it then.)

Homework II will involve mixture distributions, including clustering.

Homework III, the last one, will involve something called random effects models.

## Thursday, January 14, 10:25 pm

I made a couple of slight clarifications in the math problem in the current homework.

Today I mentioned the Method of Stages in class. The goal is to be able to handle non-exponential distributions in Markov models. Here is a summary:

• We can approximate any continuous distribution by a discrete one. For instance, we can approximate a U(0,1) random variable U by V, which takes on the values 0.01,0.02,...,0.99,1, with probability 0.01 each.
• We can approximate a constant by a sum of i.i.d. random variables, in particular exponentially-distributed ones. For instance, we can approximate the constant 5.0 by a sum of 100 i.i.d. exponentially-distributed random variables with mean 0.05; since the variance of that sum is 100 (0.05^2) = 0.25, i.e. a standard deviation of 0.5, which is small relative to the mean 5.0, the sum is approximately constant. With 1000 summands it is even more accurate, etc.
• Now consider, say, the Machine Repair example, with repair time being exponentially distributed with mean 8.0. Suppose repair time were instead the constant 8.0. That would make the model non-Markovian. But we could replace that repair time with a sum of 100 exponential random variables, each having mean 0.08. We would then add 100 artificial states to the model (actually 200, for use when both machines are down), and voila!, we now have a Markovian system.
• If the repair time had a uniform distribution, we would first approximate it by a discrete distribution as above, and then approximate each of the values of the latter distribution by a sum of exponentials. So, we could make the system Markovian here too.
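The claim in the second bullet is easy to check by simulation:

```r
# check: a sum of 100 i.i.d. exponentials with mean 0.05 is nearly the
# constant 5.0 (mean 5, standard deviation sqrt(100 * 0.05^2) = 0.5)
set.seed(111)
sums <- replicate(10000, sum(rexp(100, rate = 1/0.05)))
mean(sums)   # close to 5
sd(sums)     # close to 0.5
```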

## Wednesday, January 13, 1:45 pm

For various reasons, I've decided to extend the due date for the homework, so that Stage 1 will be due on January 25. However, if you can get Stage 1 done by the original January 19 date, I will make a note in my records of this.

The second and final stage will probably be due a few days after the 25th. It will consist of the continuous-time completion of the CRAN project, and a couple more math problems.

## Tuesday, January 12, 3:45 pm

The following is a summary of (and partly an extension of) a conversation I had with some students after class today, concerning the current CRAN project.

The user U has a Markov chain he/she wants to analyze. If it is a chain with finite state space, U can simply supply your function with an R matrix, the transition matrix. However, in the infinite case, this cannot be done (and even in the finite case it might be cumbersome). So your function will provide U with an alternative, which is to specify the matrix via a function, pijdef(). What is the nature of that function?

Consider an element pij of the transition matrix. That can be considered to be a function of i and j; pijdef() formalizes this notion. Consider, for instance, random walk on 1,2,3,... The function might look like this:

pijdef <- function(i,j) {
   if (i == 1 && j == 2) return(1)
   if (abs(i-j) != 1) return(0)
   0.5
}


Now, how to find π in the infinite case? The answer is to find the approximate value through an iterative process. First truncate the chain to the first k states, say k = 10, and find π for that chain. Then do the same for a larger truncated version, say k = 20, and find π for it. If the two π vectors are close enough, then use the second one as your final approximation for the full chain. If not, increase k even more.
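As a rough sketch of this truncate-and-compare idea (the function and argument names here are my own, not part of the project spec), one iteration might look like this:

```r
# Sketch, not the assigned solution: solve for the stationary vector pi
# of the chain truncated to states 1..k, with transition probabilities
# supplied elementwise by pijdef(i,j).
findpitrunc <- function(pijdef, k) {
   p <- outer(1:k, 1:k, Vectorize(pijdef))  # k x k truncated matrix
   p <- p / rowSums(p)                      # renormalize rows after truncation
   # solve pi P = pi together with sum(pi) = 1, as one linear system
   a <- rbind(t(p) - diag(k), rep(1, k))
   qr.solve(a, c(rep(0, k), 1))
}
```

One would call this for, say, k = 10 and then k = 20, pad the shorter vector with zeros, and compare the two against the chosen convergence criterion.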

The hardest thing in creating a package for others to use is to anticipate what users might want -- and what they might prefer not to be bothered with. It's up to you whether to ask the user to supply the initial k, or not to involve the user in this at all. You may also want to ask the user, perhaps optionally, for a convergence criterion ε.

You'll need at least one test case, one with known answer. I suggest using birth-and-death models. These are easy to understand and use. There is plenty on the Web on this; see for instance this one on the discrete-time case.
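For concreteness, here is one possible known-answer test case of that sort (my own example, not part of the spec): a discrete-time birth-death chain on 1,2,3,... with up-probability 0.4 and down-probability 0.6, reflecting at state 1. Detailed balance gives π(i+1)/π(i) = 0.4/0.6, so π is geometric with ratio 2/3, which your truncated-chain output can be checked against.

```r
# Hypothetical test-case chain: birth-death on 1,2,3,... ;
# known stationary distribution pi_i proportional to (2/3)^(i-1).
bdpij <- function(i, j) {
   if (i == 1 && j == 1) return(0.6)   # "death" at state 1 reflects back
   if (j == i + 1) return(0.4)         # birth
   if (j == i - 1) return(0.6)         # death
   0
}
```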

## Monday, January 11, 5:45 pm

You must write up your mathematical solutions in LaTeX, clearly and in detail, doing a good professional job. Don't just write down equations; explain them.

## Saturday, January 9, 7:55 pm

Stage 1 of the homework is now complete, due January 19. You should start as soon as possible.

## Saturday, January 9, 3:15 pm

I now have the first part of Homework I ready, consisting of the CRAN project. I may make some small clarifications and corrections, but it's basically ready to go. Let me know if you have any questions.

## Friday, January 8, 10:35 pm

Scooped! (Another good word for nonnative speakers of English.)

I said in class that I didn't know of any R packages that dealt with general Markov chains. Well, I was 42 days out of date. :-) The package markovchain was released on November 26. By the way, I found out about it through my subscription to R-bloggers, a free service that I highly recommend to you. See Joe Rickert's article Getting Started with Markov Chains.

The package looks pretty good, but it only does part of what I will be having you do in your first CRAN project, which as you know will be on Markov chain analysis. On the other hand, it does cover some material not in our course, such as on recurrent and transient states; I have a chapter on that for our book, but chose not to include it.

It is common for one R package to make use of another. (Indeed, it is rare that a package does NOT do this.) You may wish to have your package include this one. Note, though, that this one includes C++ code, making it a little harder to install, a reason you may want NOT to include it.

I expect to have the formal specs writeup for your project in the next day or two.

## Friday, January 8, 10:45 am

I've added a bit more detail to the CRAN introduction that I announced here last night.

## Thursday, January 7, 11:40 pm

Please read my quick introduction to R package development. I will be posting (at least a rough draft of) your first package homework in the next day or so.

## Thursday, January 7, 9:15 pm

As I mentioned in class today, Equation (6.29) was not quite right. I've fixed it in the non-class version of the book. That one is always up-to-date, as I continually revise the book (but the official text for our course is still this one, as before).

## Thursday, January 7, 9:15 am

In case you are wondering, here is the grade distribution from the last time I taught this class, winter 2014. (Dipak Ghosal and I alternate teaching the course.) That class was smaller than this year's, with 29 students. The grade breakdown was

A+ 6
A 11
A- 7
B+ 5


## Thursday, January 7, 9:10 am

Reminder: Tomorrow (say 5 pm) is the deadline for submitting your group information. If you have formed a group, send the membership list. If you wish to be assigned to a group, so state. I said on Tuesday that you should send this information to me, but instead, please send it to Nick, our TA.

## Tuesday, January 5, 6:10 pm

Our TA, Nick Ulle, will hold office hours Wednesdays, 10-12, in 3106 Kemper.

## Tuesday, January 4, 12:30 pm

Our classroom is 205 Olson, not 205 Wellman as our Web page originally stated.