\chapter{Discrete Random Variables}
\label{dis}

This chapter will introduce entities called {\it discrete random
variables}.  Some properties will be derived for means of such
variables, with most of these properties actually holding for random
variables in general.  Well, all of that seems abstract to you at this
point, so let's get started.

\section{Random Variables}

\begin{definition}
A {\bf random variable} is a numerical outcome of 
our experiment.
\end{definition}

For instance, consider our old example in which we roll two dice, with X
and Y denoting the number of dots we get on the blue and yellow dice,
respectively.  Then X and Y are random variables, as they are numerical
outcomes of the experiment.  Moreover, X+Y, 2XY, sin(XY) and so on are
also random variables.

In a more mathematical formulation, with a formal sample space defined,
a random variable would be defined to be a real-valued function whose
domain is the sample space.

\section{Discrete Random Variables}

In our dice example, the random variable X could take on six values
in the set \{1,2,3,4,5,6\}.  This is a finite set.

In the ALOHA example, $X_1$ and $X_2$ each take on values in the set
\{0,1,2\}, again a finite set.\footnote{We could even say that $X_1$
takes on only values in the set \{1,2\}, but if we were to look at many
epochs rather than just two, it would be easier not to make an
exceptional case.}

Now think of another experiment, in which we toss a coin until we get
heads.  Let N be the number of tosses needed.  Then N can take on 
values in the set \{1,2,3,...\}  This is a countably infinite
set.\footnote{This is a concept from the fundamental theory of
mathematics.  Roughly speaking, it means that the set can be assigned an
integer labeling, i.e. item number 1, item number 2 and so on.  The set
of positive even numbers is countable, as we can say 2 is item number 1,
4 is item number 2 and so on.  It can be shown that even the set of all
rational numbers is countable.}

Now think of one more experiment, in which we throw a dart at the
interval (0,1), and assume that the place that is hit, R, can take on
any of the values between 0 and 1.  This is an uncountably infinite set.

We say that X, $X_1$, $X_2$ and N are {\bf discrete} random variables,
while R is {\bf continuous}.  We'll discuss continuous random variables
in a later chapter.

\section{Independent Random Variables}
\label{indepdef}

We already have a definition for the independence of events; what about
independence of random variables?  Here it is:

\begin{quote}

Random variables X and Y are said to be {\bf independent} if for any
sets I and J, the events \{X is in I\} and \{Y is in J\} are
independent, i.e.  P(X is in I and Y is in J) = P(X is in I) P(Y is in
J). 

\end{quote}

Sounds innocuous, but the notion of independent random variables is
absolutely central to the field of probability and statistics, and will
pervade this entire book.  

\section{Expected Value}
\label{expval}

\subsection{Generality---Not Just for \underline{Discrete} Random
Variables}

The concepts and properties introduced in this section form the very
core of probability and statistics.  {\bf Except for some specific
calculations, these apply to both discrete and continuous random
variables.}

The properties developed for {\it variance}, defined later in this
chapter, also hold for both discrete and continuous random variables.

\subsubsection{What Is It?}

The term ``expected value'' is one of the many misnomers one encounters
in tech circles.  The expected value is actually not something we
``expect'' to occur.  On the contrary, it's often pretty unlikely.

For instance, let H denote the number of heads we get in tossing a coin
1000 times.  The expected value, you'll see later, is 500 (i.e. the
mean).  Yet P(H = 500) turns out to be about 0.025.  In other words, we
certainly should not ``expect'' H to be 500.

In spite of being misnamed, expected value plays an absolutely central
role in probability and statistics.

\subsection{Definition}

Consider a repeatable experiment with random variable X.  We say that
the {\bf expected value} of X is the long-run average value of X, as we
repeat the experiment indefinitely.

In our notebook, there will be a column for X.  Let $X_i$ denote the
value of X in the i$^{th}$ row of the notebook.  Then the long-run
average of X is 

\begin{equation}
\label{longrunavg}
\lim_{n \rightarrow \infty} \frac{X_1+...+X_n}{n}
\end{equation}

Suppose for instance our experiment is to toss 10 coins.  Let X denote
the number of heads we get out of 10.  We might get four heads in the
first repetition of the experiment, i.e. $X_1 = 4$, seven heads in the
second repetition, so $X_2 = 7$, and so on.  Intuitively, the long-run
average value of X will be 5.  (This will be proven below.) Thus we say
that the expected value of X is 5, and write E(X) = 5.

\subsection{Computation and Properties of Expected Value}

Continuing the coin toss example above, let $K_{in}$ be the number of
times the value i occurs among $X_1,...,X_n$, i = 0,...,10, n =
1,2,3,...  For instance, $K_{4,20}$ is the number of times we get four
heads, in the first 20 repetitions of our experiment.  Then

\begin{eqnarray}
E(X) &=& \lim_{n \rightarrow \infty} \frac{X_1+...+X_n}{n} \\
&=& \lim_{n \rightarrow \infty} \frac{0 \cdot K_{0n}+1 \cdot K_{1n}+2 \cdot
K_{2n}...+10 \cdot K_{10,n}}{n} \\
&=& \sum_{i=0}^{10} i \cdot \lim_{n \rightarrow \infty} \frac{K_{in}}{n}
\end{eqnarray}

To understand that second equation, suppose when n = 5 we have
2, 3, 1, 2 and 1 for our values of $X_1,X_2,X_3,X_4,X_5,$.  Then we can
group the 2s together and group the 1s together, and write

\begin{equation}
2 +3 + 1 + 2 + 1 = 2 \times 2 + 2 \times 1 + 1 \times 3
\end{equation}

But $\lim_{n \rightarrow \infty} \frac{K_{in}}{n}$ is the long-run
fraction of the time that X = i.  In other words, it's P(X = i)!
So,

\begin{equation}
E(X) = \sum_{i=0}^{10} i \cdot P(X = i)
\end{equation}

So in general we have a

{\bf Property A:} The expected value of a discrete random variable X which
takes value in the set A is 

\begin{equation}
\label{a}
E(X) = \sum_{c \in A} c P(X = c)
\end{equation}

Note that (\ref{a}) is the formula we'll use.  The preceding equations
were derivation, to motivate the formula.  Note too that \ref{a} is not
the {\it definition} of expected value; that was in \ref{longrunavg}.
It is quite important to distinguish between all of these, in terms of
goals.

It will be shown in Section \ref{binom} that in our example above in
which X is the number of heads we get in 10 tosses of a coin,

\begin{equation}
P(X = i) =  \binom{10}{i} 0.5^i (1-0.5)^{10-i}  
\end{equation}

So

\begin{equation}
E(X) =  \sum_{i=0}^{10} i \binom{10}{i} 0.5^i (1-0.5)^{10-i}  
\end{equation}

It turns out that E(X) = 5.

For X in our dice example, 

\begin{equation}
E(X) = \sum_{c=1}^6 c \cdot \frac{1}{6} = 3.5
\end{equation}

It is customary to use capital letters for random variables, e.g. X
here, and lower-case letters for values taken on by a random variable,
e.g. c here.  Please adhere to this convention.  

By the way, it is also customary to write EX instead of E(X), whenever
removal of the parentheses does not cause any ambiguity.  An example in
which it would produce ambiguity is $E(U^2)$.  The expression $EU^2$
might be taken to mean either $E(U^2)$, which is what we want, or
$(EU)^2$, which is not what we want.

For S = X+Y in the dice example,

\begin{equation}
\label{es}
E(S) = 
2 \cdot \frac{1}{36} +
3 \cdot \frac{2}{36} +
4 \cdot \frac{3}{36} +
...
12 \cdot \frac{1}{36} = 7
\end{equation}

In the case of N, tossing a coin until we get a head:

\begin{equation}
E(N) = \sum_{c=1}^\infty c \cdot \frac{1}{2^c} =  2
\end{equation}

(We will not go into the details here concerning how the sum of this
particular infinite series is computed.)

Some people like to think of E(X) using a center of gravity analogy.
Forget that analogy!  Think notebook!  {\bf Intuitively, E(X) is the
long-run average value of X among all the lines of the notebook.}  So
for instance in our dice example, E(X) = 3.5, where X was the number of
dots on the blue die, means that if we do the experiment thousands of
times, with thousands of lines in our notebook, the average value of X
in those lines will be about 3.5.   With S = X+Y, E(S) = 7.  This means
that in the long-run average in column S in Table \ref{dicenotebook2} is
7.

\begin{table}
\begin{center}

\begin{tabular}{|l|l|l|l|}
\hline
notebook line & outcome & blue+yellow = 6? & S \\ \hline 
\hline
1 & blue 2, yellow 6 & No & 8 \\ \hline 
2 & blue 3, yellow 1 & No & 4 \\ \hline 
3 & blue 1, yellow 1 & No & 2 \\ \hline 
4 & blue 4, yellow 2 & Yes & 6 \\ \hline 
5 & blue 1, yellow 1 & No & 2 \\ \hline 
6 & blue 3, yellow 4 & No & 7 \\ \hline 
7 & blue 5, yellow 1 & Yes & 6 \\ \hline 
8 & blue 3, yellow 6 & No & 9 \\ \hline 
9 & blue 2, yellow 5 & No & 7 \\ \hline 
\end{tabular}

\end{center}
\caption{Expanded Notebook for the Dice Problem}
\label{dicenotebook2} 
\end{table}

Of course, by symmetry, E(Y) will be 3.5 too, where Y is the number of
dots showing on the yellow die.  That means we wasted our time
calculating in Equation (\ref{es}); we should have realized beforehand
that E(S) is $2 \times 3.5 = 7$. 

In other words: 

{\bf Property B:} For any random variables U and V, the expected value of
a new random variable D = U+V is the sum of the expected values of U and
V:

\begin{equation}
\label{eofsum} 
E(U+V) = E(U) + E(V)
\end{equation}

Note carefully that U and V do NOT need to be independent random
variables for this relation to hold.  You should convince yourself of
this fact intuitively {\bf by thinking about the notebook notion.}
Say we look at 10000 lines of the notebook, which has columns for the
values of U, V and U+V.  It makes no difference whether we average U+V
in that column, or average U and V in their columns and then
add---either way, we'll get the same result.

While you are at it, use the notebook notion to convince yourself of the
following:

{\bf Property C:}  For any random variable U and any constants {\it a} and
{\it b},

\begin{equation} 
\label{eoflin}
E(aU+b) = a E(U) + b
\end{equation}

Note also that this implies that for any constant {\it b}, we have

\begin{equation}
E(b) = b
\end{equation}

For instance, say U is temperature in Celsius.  Then the temperature in
Fahrenheit is $W = \frac{9}{5} U + 32$.  So, W is a new random variable,
and we can get its expected value from that of U by using (\ref{eoflin})
with $a = \frac{9}{5}$ and b = 32.

Another important point:

{\bf Property D:} If U and V {\it are} independent, then

\begin{equation}
\label{factoring}
E(UV) = EU \cdot EV
\end{equation}


In the dice example, for instance, let D denote the product of the
numbers of blue dots and yellow dots, i.e. D = XY.  Then

\begin{equation}
E(D) = 3.5^2 = 12.25
\end{equation}

Equation (\ref{factoring}) doesn't have an easy ``notebook proof.'' It
is proved in Section \ref{theyfactor}.

Consider a function g() of one variable, and let W = g(X).  W is then a
random variable too.  Say X takes on values in A, as in (\ref{a}).
Then W takes on values in $B = \{g(c): c \epsilon A \}$.  Define

\begin{equation}
A_d = \{c: c \in A, g(c) = d\}
\end{equation}

Then 

\begin{equation}
P(W = d) = P(X \in A_d) 
\end{equation}

so

\begin{eqnarray}
E[g(X)] &=& E(W) \\
&=& \sum_{d \in B} d P(W = d) \\ 
&=& \sum_{d \in B} d \sum_{c \in A_d} P(X = c) \\
&=& \sum_{c \in A} g(c) P(X = c) 
\end{eqnarray}

{\bf Property E:}

If E[g(X)] exists, then

\begin{equation}
\label{egofx}
E[g(X)] = \sum_{c} g(c) \cdot P(X = c)
\end{equation}

where the sum ranges over all values c that can be taken on by X.

For example, suppose for some odd reason we are interested in finding
$E(\sqrt{X})$, where {\bf X} is the number of dots we get when we roll
one die.  Let $W = \sqrt{X})$.  Then {\bf W} is another random variable,
and is discrete, since it takes on only a finite number of values.  (The
fact that most of the values are not integers is irrelevant.)  We want
to find $EW$.

Well, $W$ is a function of X, with $g(t) = \sqrt{t}$.  So,
(\ref{egofx}) tells us to make a list of values that W and
take on, i.e.  $\sqrt{1}, \sqrt{2}, ..., \sqrt{6}$, and a list of the
corresponding probabilities for {\bf X}, which are all $\frac{1}{6}$.
Substituting into (\ref{egofx}), we find that

\begin{equation}
E(\sqrt{X}) = \frac{1}{6} \sum_{i=1}^6 \sqrt{i}
\end{equation}

\subsection{``Mailing Tubes''}
\label{mailingtubes2}

{\bf The properties of expected value discussed above are key to the
entire remainder of this book.  You should notice immediately when you
are in a setting in which they are applicable.  For instance, if you see
the expected value of the sum of two random variables, you should
instinctively think of (\ref{eofsum}) right away.}  

As discussed in Section \ref{mailingtubes1}, these properties are
``mailing tubes.'' For instance, (\ref{eofsum}) is a ``mailing
tube''--make a mental note to yourself saying, ``If I ever need to find
the expected value of the sum of two random variables, I can use
(\ref{eofsum}).''  Similarly, (\ref{egofx}) is a mailing tube;
tell yourself, ``If I ever see a new random variable that is a function
of one whose probabilities I already know, I can find the expected value
of the new random variable using (\ref{egofx}).''

You will encounter ``mailing tubes'' throughout this book.  For
instance, (\ref{varcu}) below is a very important ``mailing tube.''
Constatly remind yourself---``Remember the `mailing tubes'!''

\subsection{Casinos, Insurance Companies and ``Sum Users,'' Compared
to Others}

The expected value is intended as a {\bf measure of central tendency},
i.e.  as some sort of definition of the probablistic ``middle'' in the
range of a random variable.  There are various other such measures one
can use, such as the {\bf median}, the halfway point of a distribution,
and today they are recognized as being superior to the mean in certain
senses.  For historical reasons, the mean plays an absolutely central
role in probability and statistics.  Yet one should understand its
limitations.

({\bf Warning:}  The concept of the mean is likely so ingrained in your
consciousness that you simply take it for granted that you know what the
mean means, no pun intended.  But try to take a step back, and think of
the mean afresh in what follows.)

First, the term {\it expected value} itself is a misnomer.  We
do not \underline{expect} W to be 91/6 in this last example; in fact, it
is impossible for W to take on that value.  

Second, the expected value is what we call the {\bf mean} in everyday
life.  And the mean is terribly overused.  Consider, for example, an
attempt to describe how wealthy (or not) people are in the city of
Davis.  If suddenly Bill Gates were to move into town, that would skew
the value of the mean beyond recognition.  

But even without Gates, there is a question as to whether the mean has
that much meaning.  After all, what is so meaningful about summing our
data and dividing by the number of data points?  The median has an easy
intuitive meaning, but although the mean has familiarity, one would be
hard pressed to justify it as a measure of central tendency.

What, for example, does Equation (\ref{longrunavg}) mean in the context
of people's heights in Davis?  We would sample a person at random and
record his/her height as $X_1$.  Then we'd sample another person, to get
$X_2$, and so on.  Fine, but in that context, what would
(\ref{longrunavg}) mean?  The answer is, not much.  So the significance
of the mean height of people in Davis would be hard to explain.

For a casino, though, (\ref{longrunavg}) means plenty.  Say X is the
amount a gambler wins on a play of a roulette wheel, and suppose
(\ref{longrunavg}) is equal to \$1.88.  Then after, say, 1000 plays of
the wheel (not necessarily by the same gambler), the casino knows from
\ref{longrunavg} it will have paid out a total about about \$1,880.  So
if the casino charges, say \$1.95 per play, it will have made a profit
of about \$70 over those 1000 plays.  It might be a bit more or less
than that amount, but the casino can be pretty sure that it will be
around \$70, and they can plan their business accordingly.

The same principle holds for insurance companies, concerning how much
they pay out in claims.  With a large number of customers, they know
(``expect''!) approximately how much they will pay out, and thus can
set their premiums accordingly.  Here the mean has a tangible, practical
meaning.

The key point in the casino and insurance companies examples is that
they are interested in {\it totals}, such as {\it total} payouts on a
blackjack table over a month's time, or {\it total} insurance claims
paid in a year.  Another example might be the number of defectives in a
batch of computer chips; the manufacturer is interested in the {\it
total} number of defectives chips produced, say in a month.  Since the
mean is by definition a {\it total} (divided by the number of data
points), the mean will be of direct interest to casinos etc.

By contrast, in describing how wealthy people of a town are, the total
height of all the residents is not relevant.  Similarly, in describing
how well students did on an exam, the sum of the scores of all the
students doesn't tell us much.  (Unless the professor gets \$10 for each
point in the exam scores of each of the students!) A better description
for heights and examp scores might be the median height or score.

Nevertheless, the mean has certain mathematical properties, such as
(\ref{eofsum}), that have allowed the rich development of the fields of
probability and statistics over the years.  The median, by contrast,
does not have nice mathematical properties.  In many cases, the mean
won't be too different from the median anyway (barring Bill Gates moving
into town), so you might think of the mean as a convenient substitute
for the median.  The mean has become entrenched in statistics, and we
will use it often.

\section{Variance}
\label{variance}

As in Section \ref{expval}, the concepts and properties introduced in
this section form the very core of probability and statistics.  {\bf
Except for some specific calculations, these apply to both discrete and
continuous random variables.}

\subsection{Definition}

While the expected value tells us the average value a random variable
takes on, we also need a measure of the random variable's
variability---how much does it wander from one line of the notebook to
another?  In other words, we want a measure of {\bf dispersion}.  The
classical measure is {\bf variance}, defined to be the mean squared
difference between a random variable and its mean:

\begin{definition}  For a random variable U for which the expected
values written below exist, the {\bf variance} of U is defined to be

\begin{equation}
Var(U) = E[(U-EU)^2]
\end{equation}

\end{definition}

For X in the die example, this would be

\begin{equation}
\label{dievar} 
Var(X) = E[(X-3.5)^2]
\end{equation}

Remember what this means:  We have a random variable {\bf X}, and we're
creating a new random variable, $W = (X-3.5)^2$, which is a function of
the old one.  We are then finding the expected value of that new random
variable W.

In the notebook view, $E[(X-3.5)^2]$ is the long-run average of the W
column:

\begin{tabular}{|r|r|r|}
\hline
line & X & W \\ \hline 
\hline
1 & 2 & 2.25 \\ \hline 
2 & 5 & 2.25 \\ \hline 
3 & 6 & 6.25 \\ \hline 
4 & 3 & 0.25 \\ \hline 
5 & 5 & 2.25 \\ \hline 
6 & 1 & 6.25 \\ \hline 
\end{tabular}

To evaluate this, apply (\ref{egofx}) with $g(c) = (c-3.5)^2$:

\begin{equation}
Var(X) = \sum_{c=1}^6 (c-3.5)^2 \cdot \frac{1}{6} = 2.92
\end{equation}

You can see that variance does indeed give us a measure of dispersion.
In the expression $Var(U) = E[(U-EU)^2]$, if the values of U are mostly
clustered near its mean, then $(U-EU)^2$ will usually be small, and thus
the variance of U will be small; if there is wide variation in U, the
variance will be large.

The properties of E() in (\ref{eofsum}) and (\ref{eoflin}) can be used
to show:

{\bf Property F:}

\begin{equation} 
\label{varuformula}
Var(U) = E(U^2) - (EU)^2
\end{equation}

The term $E(U^2)$ is again evaluated using (\ref{egofx}). 

Thus for example, if X is the number of dots which come up when we roll
a die.   Then, from (\ref{varuformula}), 

\begin{equation}
Var(X) = E(X^2) - (EX)^2
\end{equation}

Let's find that first term (we already know the second is 3.5).  From
(\ref{egofx}),

\begin{equation}
E(X^2) =  \sum_{i=1}^6 i^2 \cdot \frac{1}{6} = \frac{91}{6}
\end{equation}

Thus $Var(X) = E(X^2) - (EX)^2 = \frac{91}{6} - 3.5^2$

Remember, though, that (\ref{varuformula}) is a shortcut formula for
finding the variance, not the {\it definition} of variance.

An important behavior of variance is: 

{\bf Property G:}

\begin{equation}
\label{varcu}
Var(cU) = c^2 Var(U)
\end{equation}

for any random variable U and constant c. It should make sense to you:
If we multiply a random variable by 5, say, then its average squared
distance to its mean should increase by a factor of 25. 

Let's prove (\ref{varcu}).  Define V = cU.  Then

\begin{eqnarray}
Var(V) &=& E[(V-EV)^2] \textrm{  (def.)} \\ 
&=& E\{[cU - E(cU)]^2\} \textrm{  (subst.)} \\
&=& E\{[cU - cEU]^2\} \textrm{  ((\ref{eoflin}))} \\
&=& E\{c^2 [U - EU]^2\} \textrm{  (algebra)} \\
&=& c^2 E\{[U - EU]^2\} \textrm{  ((\ref{eoflin}))} \\
&=& c^2 Var(U) \textrm{  (def.)}
\end{eqnarray}

Shifting data over by a constant does not change the amount of variation
in them:

{\bf Property H:}

\begin{equation}
\label{affinevar}
Var(U+d) = Var(U)
\end{equation}

for any constant d.

Intuitively, the variance of a constant is 0---after all, it never
varies!  You can show this formally using (\ref{varuformula}):

\begin{equation}
Var(c) = E(c^2) - [E(c)]^2 = c^2 - c^2 = 0
\end{equation}

The square root of the variance is called the {\bf standard deviation}.

Again, we use variance as our main measure of dispersion for historical
and mathematical reasons, not because it's the most meaningful measure.
The squaring in the definition of variance produces some distortion, by
exaggerating the importance of the larger differences.  It would be more
natural to use the {\bf mean absolute deviation} (MAD), $E(|U-EU|)$.
However, this is less tractable mathematically, so the statistical
pioneers chose to use the mean squared difference, which lends itself to
lots of powerful and beautiful math, in which the Pythagorean Theorem
pops up in abstract vector spaces.  (See Section \ref{elegant} for
details.)

{\bf As with expected values, the properties of variance discussed
above, and also in Section \ref{covar} below, are key to the entire
remainder of this book.  You should notice immediately when you are in a
setting in which they are applicable.  For instance, if you see the
variance of the sum of two random variables, you should instinctively
think of (\ref{varsum}) right away.}

\subsection{Central Importance of the Concept of Variance}

No one needs to be convinced that the mean is a fundamental descriptor
of the nature of a random variable.  But the variance is of central
importance too, and will be used constantly throughout the remainder of
this book.

The next section gives a quantitative look at our notion of variance as
a measure of dispersion.  

\subsection{Intuition Regarding the Size of Var(X)}

{\it A billion here, a billion there, pretty soon, you're talking real
money}---attribted to the late Senator Everitt Dirksen, replying to a
statement that some federal budget item cost ``only'' a billion dollars

\bigskip

Recall that the variance of a random variable X is suppose to be a measure
of the dispersion of X, meaning the amount that X varies from one
instance (one line in our notebook) to the next.  But if Var(X) is, say,
2.5, is that a lot of variability or not?  We will pursue this question
here.

\subsubsection{Chebychev's Inequality}

This inequality states that for a random variable X with mean $\mu$ and
variance $\sigma^2$, 

\begin{equation}
\label{cheb}
P(|X - \mu| \geq c \sigma) \leq \frac{1}{c^2}
\end{equation}

In other words, X strays more than, say, 3 standard deviations from its
mean at most only 1/9 of the time.  This gives some concrete meaning to
the concept of variance/standard deviation.

You've probably had exams in which the instructor says something like
``An A grade is 1.5 standard deviations above the mean.''  Here c in
(\ref{cheb}) would be 1.5.

We'll prove the inequality in Section \ref{chebproof}.

\subsubsection{The Coefficient of Variation}

Continuing our discussion of the magnitude of a variance, look at our
remark following (\ref{cheb}):

\begin{quote}
In other words, X does not often stray more than, say, 3 standard
deviations from its mean.  This gives some concrete meaning to the
concept of variance/standard deviation.
\end{quote}

Or, think of the price of, say, widgets.  If the price hovers around a
\$1 million, but the variation around that figure is only about a
dollar, you'd say there is essentially no variation.  But a variation
of about a dollar in the price of a hamburger would be a lot.

These considerations suggest that any discussion of the size of Var(X)
should relate to the size of E(X).  Accordingly, one often looks at the
{\bf coefficient of variation}, defined to be the ratio of the standard
deviation to the mean:

\begin{equation}
\textrm{coef. of var.} = \frac{\sqrt{Var(X)}}{EX}
\end{equation}

This is a scale-free measure (e.g. inches divided by inches), and serves
as a good way to judge whether a variance is large or not.

\section{Indicator Random Variables, and Their Means and Variances}
\label{indicator}

\begin{definition}

A random variable that has the value 1 or 0, according to whether a
specified event occurs or not is called an {\bf indicator random
variable} for that event.  

\end{definition}

You'll often see later in this book that the notion of an indicator
random variable is a very handy device in certain derivations.  But for
now, let's establish its properties in terms of mean and variance.

\begin{quote}

{\bf Handy facts:}  Suppose X is an indicator random variable for the
event A.  Let p denote P(A).  Then 

\begin{equation}
\label{eofxeqp}
E(X) = p
\end{equation}

\begin{equation}
\label{varofeqp1p}
Var(X) = p(1-p)
\end{equation}

\end{quote}

This two facts are easily derived.  In the first case we have, using our
properties for expected value,

\begin{equation}
EX = 1 \cdot P(X = 1) + 0 \cdot P(X = 0) = P(X = 1) = P(A) = p
\end{equation}

The derivation for Var(X) is similar (use (\ref{varuformula})).

\section{A Combinatorial Example}
\label{combex}

A committee of four people is drawn at random from a set of six men and
three women.  Suppose we are concerned that there may be quite a gender
imbalance in the membership of the committee.  Toward that end, let M
and W denote the numbers of men and women in our committee, and let D =
M-W.  Let's find E(D), in two different ways.

D can take on the values 4-0, 3-1, 2-2 and 1-3, i.e. 4, 2, 0 and -2.
So from (\ref{a})

\begin{equation}
\label{ed}
ED = 
-2 \cdot P(D = -2) +
0 \cdot P(D = 0) +
2 \cdot P(D = 2) +
4 \cdot P(D = 4) 
\end{equation}

Now, using reasoning along the lines in Section \ref{comb}, we have

\begin{equation}
P(D = -2) = P(M = 1 \textrm{ and } W = 3) =
\frac
{\binom{6}{1} \binom{3}{3}}
{\binom{9}{4}} 
\end{equation}

After similar calculations for the other probabilities in (\ref{ed}),
we find the $ED = \frac{4}{3}$.  

Note what this means: If we were to perform this experiment many times,
i.e. choose committees again and again, on average we would have a
little more than one more man than women on the committee.

Now let's use our ``mailing tubes'' to derive ED a different way:

\begin{eqnarray}
ED &=& E(M - W) \\ 
&=& E[ M - (4-M)] \\
&=& E(2 M - 4) \\
&=& 2 EM - 4 ~~ (\textrm{from } (\ref{eoflin})) 
\end{eqnarray}

Now, let's find EM by using indicator random variables.  Let $G_i$
denote the indicator random variable for the event that the i$^{th}$
person we pick is male, i = 1,2,3,4.  Then

\begin{equation}
\label{massum}
M = G_1 + G_2 + G_3 + G_4
\end{equation}

so 

\begin{eqnarray}
EM &=& E(G_1 + G_2 + G_3 + G_4) \\ 
&=& EG_1 + EG_2 + EG_3 + EG_4 ~~  [\textrm{ from } (\ref{eofsum})] \\
&=& P(G_1 = 1) + P(G_2 = 1) + P(G_3 = 1) + P(G_4 = 1) ~~ [\textrm{ from } (\ref{eofxeqp})]
\label{4g}
\end{eqnarray}

Note carefully that the second equality here, which uses (\ref{eofsum}),
is true in spite of the fact that the $G_i$ are not independent.
Equation (\ref{eofsum}) does \underline{not} require independence.

Another key point is that, due to symmetry, $P(G_i = 1)$ is the same for
all i.  same expected value.  (Note that we did not write a {\it
conditional} probability here.)  To see this, suppose the six men that
are available for the committee are named Alex, Bo, Carlo, David,
Eduardo and Frank.  When we select our first person, any of these men
has the same chance of being chosen (1/9).  {\it But that is also true
for the second pick.}  Think of a notebook, with a column named ``second
pick.''  In some lines, that column will say Alex, in some it will say
Bo, and so on, and in some lines there will be women's names.  But in
that column, Bo will appear the same fraction of the time as Alex, due
to symmetry, and that will be the same fraction is for, say, Alice,
again 1/9.

Now,

\begin{equation}
P(G_1 = 1) = \frac{6}{9} = \frac{2}{3}
\end{equation}

Thus 

\begin{equation}
ED = 2 \cdot (4 \cdot \frac{2}{3}) -4  = \frac{4}{3}
\end{equation}

\section{A Useful Fact}
\label{usefulfact}

\label{mingcpage}
For a random variable X, consider the function

\begin{equation}
g(c) = E[(X-c)^2]
\end{equation}

Remember, the quantity $E[(X-c)^2]$ is a number, so g(c) really is a
function, mapping a real number c to some real output.  

We can ask the question, What value of c minimizes g(c)?  To answer that
question, write:

\begin{equation}
\label{gofc}
g(c) = E[(X-c)^2] = E(X^2 -2cX + c^2) = E(X^2) - 2c EX + c^2
\end{equation}

where we have used the various properties of expected value derived in
recent sections.

Now differentiate with respect to c, and set the result to 0.
Remembering that $E(X^2)$ and EX are constants, we have

\begin{equation}
0 = -2 EX + 2c
\end{equation}

so the minimizing c is c = EX!  

In other words, the minimum value of $E[(X-c)^2]$ occurs at c = EX.

Moreover:  Plugging c = EX into (\ref{gofc}) shows that the minimum value
of g(c) is $E(X-EX)^2]$ , which is Var(X)!

\section{Covariance} 

This is a topic we'll cover fully in Chapter \ref{mul}, but at least
introduce here.

A measure of the degree to which U and V vary together is their {\bf
covariance},

\begin{equation}
Cov(U,V) = E[(U-EU)(V-EV)]
\end{equation}

Except for a divisor, this is essentially {\bf correlation}.  If U is
usually large at the same time V is small, for instance, then you can
see that the covariance between them witll be negative.  On the other
hand, if they are usually large together or small together, the
covariance will be positive.

Again, one can use the properties of E() to show that

\begin{equation}
\label{covshortcut}
Cov(U,V) = E(UV) - EU \cdot EV
\end{equation}

Also 

\begin{equation}
\label{genvarsum}
Var(U+V) = Var(U) + Var(V) + 2 Cov(U,V)
\end{equation}

Suppose U and V are independent.  Then (\ref{factoring}) and
(\ref{covshortcut}) imply that Cov(U,V) = 0.  In that case,

\begin{equation}
\label{varsum}
Var(U+V) = Var(U) + Var(V) 
\end{equation}

By the way, (\ref{varsum}) is actually the Pythagorean Theorem in a
certain esoteric, infinite-dimesional vector space (related to a similar
remark made earlier).  This is pursued in Section \ref{elegant} for
the mathematically inclined.

\section{Expected Value, Etc. in the ALOHA Example}

Finding expected values etc. in the ALOHA example is straightforward.
For instance, 

\begin{equation}
EX_1 = 0 \cdot P(X_1 = 0) + 1 \cdot P(X_1 = 1) + 2 \cdot P(X_1 = 2)
= 1 \cdot 0.48 + 2 \cdot 0.52 = 1.52
\end{equation}

Here is R code to find various values approximately by simulation:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
# finds E(X1), E(X2), Var(X2), Cov(X1,X2)
sim <- function(p,q,nreps) {
   sumx1 <- 0
   sumx2 <- 0
   sumx2sq <- 0
   sumx1x2 <- 0
   for (i in 1:nreps) {
      numsend <- 0
      for (i in 1:2)
         if (runif(1) < p) numsend <- numsend + 1
      if (numsend == 1)  X1 <- 1
      else X1 <- 2
      numactive <- X1
      if (X1 == 1 && runif(1) < q) numactive <- numactive + 1
      if (numactive == 1)
         if (runif(1) < p) X2 <- 0
         else X2 <- 1
      else {  # numactive = 2
         numsend <- 0
         for (i in 1:2)
            if (runif(1) < p) numsend <- numsend + 1
         if (numsend == 1) X2 <- 1
         else X2 <- 2
      }
      sumx1 <- sumx1 + X1
      sumx2 <- sumx2 + X2
      sumx2sq <- sumx2sq + X2^2
      sumx1x2 <- sumx1x2 + X1*X2
   }
   # print results
   meanx1 <- sumx1 /nreps
   cat("E(X1):",meanx1,"\n")
   meanx2 <- sumx2 /nreps
   cat("E(X2):",meanx2,"\n")
   cat("Var(X2):",sumx2sq/nreps - meanx2^2,"\n")
   cat("Cov(X1,X2):",sumx1x2/nreps - meanx1*meanx2,"\n")
}
\end{Verbatim}

As a check on your understanding so far, you should find at least one of
these values by hand, and see if it jibes with the simulation output.

\section{Distributions}
\label{dstrdef}

The idea of the {\bf distribution} of a random variable is central to
probability and statistics.

\begin{definition}
Let U be a discrete random variable.  Then the
distribution of U is simply a list of all the values U takes on, and 
their associated probabilities: 
\end{definition}

{\bf Example:}  Let X denote the number of dots one gets in rolling a
die.  Then the values X can take on are 1,2,3,4,5,6, each with
probability 1/6.  So

\begin{equation}
\textrm{distribution of } X =
\{
(1,\frac{1}{6}),
(2,\frac{1}{6}),
(3,\frac{1}{6}),
(4,\frac{1}{6}),
(5,\frac{1}{6}),
(6,\frac{1}{6})
\}
\end{equation}

{\bf Example:}  Recall the ALOHA example. There $X_1$ took on the values
1 and 2, with probabilities 0.48 and 0.52, respectively.  So, 

\begin{equation}
\textrm{distribution of } X_1 =
\{
(0,0.00),
(1,0.48),
(2,0.52)
\}
\end{equation}

{\bf Example:}  Recall our example in which N is the number of tosses of a
coin needed to get the first head.  N can take on the values 1,2,3,...,
the probabilities of which we found earlier to be 1/2, 1/4, 1/8,...  So,

\begin{equation}
\textrm{distribution of } N =
\label{geomdie}
\{
(1,\frac{1}{2}),
(2,\frac{1}{4}),
(3,\frac{1}{8}),
...
\}
\end{equation}

It is common to express this in functional notation:  

\begin{definition}
The {\bf probability mass function} (pmf) of a 
discrete random variable V, denoted $p_V$, as

\begin{equation}
p_V(k) = P(V = k)
\end{equation}

for any value k which V can take on.
\end{definition}

(Please keep in mind the notation.  It is customary to use the
lower-case p, with a subscript consisting of the name of the random
variable.)

Note that $p_V()$ is just a function, like any function (with integer
domain) you've had in your previous math courses.  For each input value,
there is an output value.

\subsection{Example:  Toss Coin Until First Head}

In (\ref{geomdie}),

\begin{equation}
\label{geompre}
p_N(k) = \frac{1}{2^k}, k = 1,2,...
\end{equation}

\subsection{Example:  Sum of Two Dice}

In the dice example, in which S = X+Y,

\begin{equation}
\label{dicesum}
p_S(k) = 
\begin{cases}
\frac{1}{36}, & k = 2 \cr
\frac{2}{36}, & k = 3 \cr
\frac{3}{36}, & k = 4 \cr
... \cr
\frac{1}{36}, & k = 12
\end{cases}
\end{equation}

It is important to note that there may not be some nice closed-form
expression for $p_V$ like that of (\ref{geompre}).  There was no such
form in (\ref{dicesum}), nor is there in our ALOHA example for
$p_{X_1}$ and $p_{X_2}$.

\subsection{Example:  Watts-Strogatz Random Graph Model}
\label{strogatz}

Random graph models are used to analyze many types of link systems, such
as power grids, social networks and even movie stars.  The following is
a variation on a famous model of that type, due to Duncan Watts and
Steven Strogatz.

We have a graph of n nodes (e.g. each node is a person).\footnote{The
word {\it graph} here doesn't mean ``graph'' in the sense of a picture.
Here we are using the computer science sense of the word, meaning a
system of vertices and edges.  It's common to call those {\it nodes} and
{\it links}.} Think of them as being linked in a circle, so we already
have n links.  One can thus reach any node in the graph from any other,
by following the links of the circle.  (We'll assume all links are
bidirectional.)

We now randomly add k more links (k is thus a parameter of the model),
which will serve as ``shortcuts.''  There are$\binom{n}{2} = n(n-1)/2$
possible links between nodes, but remember, we already have n of those
in the graph, so there are only $n(n-1)/2 - n = n^2/2 - 3n/2$
possibilities left.  We'll be forming k new links, chosen at random from
those $n^2/2 - 3n/2$ possibilities.

Let M denote the number of links that attached to a particular node, known
as the {\bf degree} of a node.  M is a random variable (we are choosing
the shortcut links randomly), so we can talk of its pmf, $p_M$, termed
the {\bf degree distribution}, which we'll calculate now.

Well, $p_M(r)$ is the probability that this node has r links.  Since the
node already had 2 links before the shortcuts were constructed, $p_M(r)$
is the probability that r-2 of the k shortcuts attach to this node.

This problem is similar in spirit to (though admittedly more difficult to
think about than) kings-and-hearts example of Section \ref{hearts}.
Other than the two neighboring links in the original circle and the
``link'' of a node to itself, there
are n-3 possible shortcut links to attach to our given node.  We're
interested in the probability that r-2 of them are chosen, and that
k-(r-2) are chosen from the other possible links.  Thus our probability
is:

\begin{equation}
p_M(r) =
\frac
{\binom{n-3}{r-2}
\binom{n^2/2-3n/2-(n-3)}{k-(r-2)}}
{\binom{n^2/2-3n/2}{k}} =
\frac
{\binom{n-2}{r-2}
\binom{n^2/2-5n/2+3}{k-(r-2)}}
{\binom{n^2/2-3n/2}{k}} 
\end{equation}

\section{Parameteric Families of pmfs}

Consider plotting the curves sin(ct).  For each c, we get the familiar
sine function.  For larger c, the curve is more ``squished'' and for c
strictly between 0 and 1, we get a broadened sine curve.  So we have a
family of sine curves of different proportions.  We say the family is
{\bf indexed} by the {\bf parameter} c, meaning, each c gives us a
different member of the family, i.e. a different curve.

Probability mass functions, and in the next chapter, probability density
functions, can also come in families, indexed by one or more parameters.
In fact, we just had an example above, in Section \ref{strogatz}.  Since
we get a different function $p_M$ for each different value of k, that
was a parametric family of pmfs, indexed by k.

Some parametric families of pmfs have been found to be so useful over
the years that they've been given names.  We will discuss some of those
families here.  But remember, they are famous just because they have
been found useful, i.e. that they fit real data well in various
settings.  {\bf Do not jump to the conclusion that we always ``must''
use pmfs from some family.}

\subsection{The Geometric Family of Distributions}
\label{geom}

Recall our example of tossing a coin until we get the first head, with N
denoting the number of tosses needed.  In order for this to take k
tosses, we need k-1 tails and then a head.  Thus

\begin{equation}
p_N(k) = \large (1 - \frac{1}{2} \large )^{k-1} \cdot \frac{1}{2}, k = 1,2,...
\end{equation}

We might call getting a head a ``success,'' and refer to a tail as a
``failure.''  Of course, these words don't mean anything; we simply
refer to the outcome of interest as ``success.''

Define M to be the number of rolls of a die needed until the number 5
shows up.  Then

\begin{equation}  
p_M(k) = \left (1 - \frac{1}{6} \right )^{k-1} \frac{1}{6}, k = 1,2,...
\end{equation}

reflecting the fact that the event \{M = k\} occurs if we get k-1 non-5s
and then a 5.  Here ``success'' is getting a 5.

The tosses of the coin and the rolls of the die are known as {\bf
Bernoulli trials}, which is a sequence of independent events.  We call
the occurrence of the event {\bf success} and the nonoccurrence {\bf
failure} (just convenient terms, not value judgments).  The associated
indicator random variable are denoted $B_i$, i = 1,2,3,...  So $B_i$ is
1 for success on the i$^{th}$ trial, 0 for failure, with success
probability p.  For instance, p is 1/2 in the coin case, and 1/6 in the
die example.  

In general, suppose the random variable W is defined to be the number 
of trials needed to get a success in a sequence of Bernoulli trials.
Then

\begin{equation}
\label{geompmf}
p_W(k) = (1-p)^{k-1} p, k = 1,2,...
\end{equation}

Note that there is a different distribution for each value of p, so we
call this a {\bf parametric family} of distributions, indexed by the
parameter p.  We say that W is {\bf geometrically distributed} with
parameter p.\footnote{Unfortunately, we have overloaded the letter {\it
p} here, using it to denote the probability mass function on the left
side, and the unrelated parameter p, our success probability on the
right side.  It's not a problem as long as you are aware of it, though.}

It should make good intuitive sense to you that

\begin{equation}
\label{eofgeom}
E(W) = \frac{1}{p}
\end{equation}

This is indeed true, which we will now derive.  First we'll need some
facts (which you should file mentally for future use as well):

{\bf Properties of Geometric Series:}

\begin{itemize}

\item [(a)] For any $t \neq 1$ and any nonnegative integers $r \leq s$,

\begin{equation}
\label{finitegeom}
\sum_{i=r}^s t^i = t^r \frac{1-t^{s-r+1}}{1-t}
\end{equation}

This is easy to derive for the case r = 0, using mathematical induction.
For the general case, just factor out $t^{s-r}$.

\item [(b)] For $|t| < 1$,

\begin{equation}
\label{infgeom}
\sum_{i=0}^{\infty} t^i = \frac{1}{1-t}
\end{equation}

To prove this, just take r = 0 and let $s \rightarrow \infty$ in
(\ref{finitegeom}).

\item [(b)] For $|t| < 1$,

\begin{equation}
\label{derivgeom}
\sum_{i=1}^{\infty} i t^{i-1} = \frac{1}{(1-t)^2}
\end{equation}

This is derived by applying $\frac{d}{dt}$ to
(\ref{infgeom}).\footnote{To be more carefully, we should differentiate
(\ref{finitegeom}) and take limits.}

\end{itemize}

Deriving (\ref{eofgeom}) is then easy, using (\ref{derivgeom}):

\begin{eqnarray}
EW &=& \sum_{i=1}^{\infty} i (1-p)^{i-1} p \\ 
&=& p \sum_{i=1}^{\infty} i (1-p)^{i-1} \\ 
&=& p \cdot \frac{1}{[1-(1-p)]^2} \\
&=& \frac{1}{p}  \label{meangeom}
\end{eqnarray}

Using similar computations, one can show that

\begin{equation}
\label{vargeom}
Var(W) = \frac{1-p}{p^2}
\end{equation}

We can also find a closed-form expression for the quantities $P(W \leq
m)$, m = 1,2,...  (This has a formal name $F_W(m)$ , as will be seen
later in Section \ref{presentsaproblem}.)  For any positive integer m we
have

\begin{eqnarray}
F_W(m) &=& P(W \leq m) \\
&=& 1 - P(W > m) \\
&=& 1 - P(\textrm{the first m trials are all failures}) \\
&=& 1 -  (1-p)^{m}
\label{geomcdf}
\end{eqnarray}

By the way, if we were to think of an experiment involving a geometric
distribution in terms of our notebook idea, the notebook would have an
infinite number of columns, one for each $B_i$.  Within each row of the
notebook, the $B_i$ entries would be 0 until the first 1, then NA (``not
applicable'' after that).

\subsubsection{R Functions} 

You can simulate geometrically distributed random variables via R's {\bf
rgeom()} function.  Its first argument specifies the number of such
random variables you wish to generate, and the second is the success
probability p.

For example, if you run

\begin{Verbatim}[fontsize=\relsize{-2}]
> y <- rgeom(2,0.5)
\end{Verbatim}

then it's simulating tossing a coin until you get a head ({\bf y[1]}) and
then tossing the coin until a head again ({\bf y[2]}).  Of course, you
could simulate on your own, say using {\bf sample()} and {\bf while()},
but R makes is convenient for you.

Here's the full set of functions for a geometrically distributed random
variable X with success probability p:

\begin{itemize}

\item {\bf dgeom(i,p)}, to find $P(X = i)$

\item {\bf pgeom(i,p)}, to find $P(X \leq i)$

\item {\bf qgeom(q,p)}, to find c such that $P(X \leq c) = q$

\item {\bf rgeom(n,p)}, to generate n variates from this geometric
distribution

\end{itemize}

{\bf Important note:}  Some books define geometric distributions slightly
differently, as the number of failures before the first success, rather
than the number of trials to the first success.  Thus for example in
calling {\bf dgeom()}, subtract 1 from the value used in our definition.

\subsubsection{Example:  a Parking Space Problem}
\label{parking}

Suppose there are 10 parking spaces per block on a certain street.  You
turn onto the street at the start of one block, and your destination is
at the start of the next block.  You take the first parking space you
encounter.  Let D denote the distance of the parking place you find from
your destination, measured in parking spaces.  Suppose each space is
open with probability 0.15, with the spaces being independent.  Find ED.

To solve this problem, you might at first think that D follows a
geometric distribution. {\bf But don't jump to conclusions!}  Actually
this is not the case, given that D is a somewhat complicated distance.
But clearly D is a function of N, where the latter denotes the number of
parking spaces you see until you find an empty one---and N {\it is}
geometrically distributed.

As noted, D is a funciton of N: 

\begin{equation}
D = 
\begin{cases}
11-N,  & N \leq 10 \cr
N-11,  & N  > 10 \cr
\end{cases}
\end{equation}

Since D is a function of N, we can use (\ref{egofx}):

\begin{equation}
\label{expectdist}
ED = \sum_{i=1}^{10} (11-i) 0.85^{i-1} 0.15 +
\sum_{i=11}^{\infty} (i-11) 0.85^{i-1} 0.15 
\end{equation}

This can now be evaluated using the properties of geometric series
presented above.

Alternatively, here's how we could find the result by simulation:

\begin{lstlisting}[numbers=left]
parksim <- function(nreps) {
   # do the experiment nreps times, recording the values of N
   nvals <- rgeom(nreps,0.2) + 1
   # now find the values of D
   dvals <- ifelse(nvals <= 10,11-nvals,nvals-11)
   # return ED
   return(mean(dvals))
}
\end{lstlisting}

Note the vectorized addition and recycling (Section
\ref{improvingsimcode} in the line

\begin{lstlisting}
nvals <- rgeom(nreps,0.2) + 1
\end{lstlisting}

The call to {\bf ifelse()} is another instance of R's vectorization, a
vectorized if-then-else.  The first argument evaluates to a vector of
TRUE and FALSE values.  For each TRUE, the corresponding element of {\bf
dvals} will be set to the corresponding element of the vector {\bf
11-nvals} (again involving vectorized addition and recycling), and for
each false, the element of {\bf dvals} will be set to the element of
{\bf nvals-11}.

\subsection{The Binomial Family of Distributions}
\label{binom}

A geometric distribution arises when we have Bernoulli trials with
parameter p, with a variable number of trials (N) but a fixed number of
successes (1).  A {\bf binomial distribution} arises when we have the
opposite---a fixed number of Bernoulli trials (n) but a variable number
of successes (say X).\footnote{Note again the custom of using capital
letters for random variables, and lower-case letters for constants.}

For example, say we toss a coin five times, and let X be the number of
heads we get.  We say that X is binomially distributed with parameters n
= 5 and p = 1/2.  Let's find P(X = 2).  There are many orders in which
that could occur, such as HHTTT, TTHHT, HTTHT and so on.  Each order has
probability $0.5^2(1-0.5)^3$, and there are $\binom{5}{2}$ orders.  Thus

\begin{equation}
P(X = 2) = \binom{5}{2} 0.5^2(1-0.5)^3 = \binom{5}{2} / 32 = 5/16 
\end{equation}

For general n and p,

\begin{equation}
P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}  
\end{equation}

So again we have a parametric family of distributions, in this case a
family having two parameters, n and p.

Let's write X as a sum of those 0-1 Bernoulli variables we used in the
discussion of the geometric distribution above:

\begin{equation}
X = \sum_{i=1}^{n} B_i
\end{equation}

where $B_i$ is 1 or 0, depending on whether there is success on the
$i^{th}$ trial or not.  Note again that the $B_i$ are indicator random
variables (Section \ref{indicator}),  so 

\begin{equation}
EB_i = p
\end{equation}

and 

\begin{equation}
Var(B_i) = p(1-p)
\end{equation}

Then the reader should use our earlier properties of E() and Var() in
Sections \ref{expval} and \ref{variance} to fill in the details in the
following derivations of the expected value and variance of a binomial
random variable:

\begin{equation}
\label{binmean}
EX = E(B_1+...,+B_n) = EB_1 + ... + EB_n = np
\end{equation}

and from (\ref{varsum}),

\begin{equation}
\label{binvar}
Var(X) = Var(B_1+...,+B_n) = Var(B_1) + ... + Var(B_n) = np(1-p)
\end{equation}

Again, (\ref{binmean}) should make good intuitive sense to you.

\subsubsection{R Functions} 

Relevant functions for a binomially  distributed random variable X for k
trials and with success probability p are:

\begin{itemize}

\item {\bf dbinom(i,k,p)}, to find $P(X = i)$

\item {\bf pbinom(i,k,p)}, to find $P(X \leq i)$

\item {\bf qbinom(q,k,p)}, to find c such that $P(X \leq c) = q$

\item {\bf rbinom(n,k,p)}, to generate n independent values of X

\end{itemize}

\subsubsection{Example:  Flipping Coins with Bonuses}
\label{bonusflip}

A game involves flipping a coin k times.  Each time you get a head, you
get a bonus flip, not counted among the k.  (But if you get a head from
a bonus flip, that does not give you its own bonus flip.) Let X denote
the number of heads you get among all flips, bonus or not.  Let's find
the distribution of X.

As with the parking space example above, we should be careful not to
come to hasty conclusions.  The situation here ``sounds'' binomial, but
X, based on a variable number of trials, doesn't fit the definiiton of
binomial.

But let Y denote the number of heads you obtain through
nonbonus flips.  Y then has a binomial distribution with parameters k
and 0.5.  To find the distribution of X, we'll condition on Y.  

We will as usual ask, ``How can it happen?'', but we need to take extra
care in forming our sums, recognizing constraints on Y:

\begin{itemize}

\item $Y \geq X/2$

\item $Y \leq X$

\item $Y \leq k$

\end{itemize}

Keeping those points in mind, we have

\begin{eqnarray}
p_X(m) &=& P(X = m) \\ 
&=& \sum_{i=\textrm{ceil}(m/2)}^{\min(m,k)} 
   P(X = m \textrm{ and } Y = i) \\
&=& \sum_{i=\textrm{ceil}(m/2)}^{\min(m,k)} 
   P(X = m | Y = i) ~ P(Y = i)\\
&=& \sum_{i=\textrm{ceil}(m/2)}^{\min(m,k)} 
   \binom{i}{m-i} 0.5^i \binom{k}{i} 0.5^k \\
&=& 0.5^k \sum_{i=\textrm{ceil}(m/2)}^{\min(m,k)} 
   \frac{k!}{m!(2i-m)!(k-i)!} 0.5^i
\end{eqnarray}

There doesn't seem to be much further simplification possible here.

\subsubsection{Example:  Analysis of Social Networks}
\label{socnets}

Let's continue our earlier discussion from Section \ref{strogatz}.

One of the earliest---and now the simplest---models of social networks
is due to Erd\"{o}s and Renyi.  Say we have n people (or n Web sites,
etc.), with $\binom{n}{2}$ potential links between pairs.  (We are
assuming an undirected graph here.)  In this model, each potential link
is an actual link with probability p, and a nonlink with probability
1-p, with all the potential links being independent.

Recall the notion of degree distribution from Section \ref{strogatz}.
Clearly the degree distribution here for a single node is binomial with
parameters n-1 and p.  But consider k nodes, and the total T of their
degrees. Let's find the distribution of T.

That distribution is again binomial, but the number of trials is not $k
\binom{n-1}{2}$, due to overlap.  There are $\binom{k}{2}$ potential
links among these k nodes, and each of the k nodes has $\binom{n-k}{2}$
potential links to the ``outside world,'' i.e. to the remaining n-k
nodes.  So, the distribution of T is binomial with

\begin{equation}
k \binom{n-k}{2} + \binom{k}{2}
\end{equation}

trials and success probability p.

\subsection{The Negative Binomial Family of Distributions}
\label{negbin}

Recall that a typical example of the geometric distribution family
(Section \ref{geom}) arises as N, the number of tosses of a coin needed
to get our first head.  Now generalize that, with N now being the number
of tosses needed to get our $r^{th}$ head, where r is a fixed value.
Let's find P(N = k), k = r, r+1, ...  For concreteness, look at the case
r = 3, k = 5.  In other words, we are finding the probability that it
will take us 5 tosses to accumulate 3 heads.

First note the equivalence of two events: 

\begin{equation}
\{N = 5\} = \{2 {\rm ~heads~in~the~first~} 4 {\rm ~tosses~and~head~on~the~}
5^{th} {\rm ~toss} \}
\end{equation}

That event described before the ``and'' corresponds to a binomial
probability:

\begin{equation}
P(2 {\rm ~heads~in~the~first~4~tosses}) = \binom{4}{2}
\left (\frac{1}{2} \right )^{4}
\end{equation}

Since the probability of a head on the $k^{th}$ toss is 1/2 and the
tosses are independent, we find that

\begin{equation}
P(N = 5) =  \binom{4}{2} \left (\frac{1}{2} \right )^5 = \frac{3}{16}
\end{equation}

The negative binomial distribution family, indexed by
parameters r and p, corresponds to random variables which count the
number of independent trials with success probability p needed until we
get r successes.  The pmf is

\begin{equation}
\label{negbinpmf}
P(N = k) =  \binom{k-1}{r-1} (1-p)^{k-r} p^r, k = r, r+1, ...
\end{equation}

We can write

\begin{equation}
N = G_1+...+G_r
\end{equation}

where $G_i$ is the number of tosses between the successes numbers i-1
and i.  But each $G_i$ has a geometric distribution!  Since the mean of
that distribution is 1/p, we have that

\begin{equation}
E(N) = r \cdot \frac{1}{p}
\end{equation}

In fact, those r geometric variables are also independent, so we know
the variance of N is the sum of their variances:

\begin{equation}
Var(N) = r \cdot \frac{1-p}{p^2}
\end{equation} 

\subsection{The Poisson Family of Distributions}
\label{poisfam}

Another famous parametric family of distributions is the set of {\bf
Poisson Distributions}.  

This family is a little different from the geometric, binomial and
negative binomial families, in the sense that in those cases there were
qualitative descriptions of the settings in which such distributions
arise.  Geometrically distributed random variables, for example occur as
the number of Bernoulli trials needed to get the first success.

By contrast, the Poisson family does not really have this kind of
qualitative description.\footnote{Some such descriptions are possible in
the Poisson case, but they are complicated and difficult to veryify.}.
It is merely something that people have found to be a reasonably
accurate model of actual data.  We might be interested, say, in the
number of disk drive failures in periods of a specified length of time.
If we have data on this, we might graph it and it looks like the pmf
form below, then we might adopt it as our model.

The pmf is

\begin{equation}
\label{poispmf}
P(X = k) = \frac{e^{- \lambda} \lambda^k}{k!}, k = 0,1,2,...
\end{equation}

It turns out that

\begin{equation}
EX = \lambda
\end{equation}

\begin{equation}
Var(X) = \lambda
\end{equation}

The derivations of these facts are similar to those for the geometric
family in Section \ref{geom}.  One starts with the Maclaurin series
expansion for $e^t$:

\begin{equation}
e^t = \sum_{i=0}^{\infty} \frac{t^i}{i!}
\end{equation}

and finds its derivative with respect to t, and so on.  The details are
left to the reader.

\label{bankex}
The Poisson family is very often used to model count data.  For example,
if you go to a certain bank every day and count the number of customers
who arrive between 11:00 and 11:15 a.m., you will probably find that that
distribution is well approximated by a Poisson distribution for some
$\lambda$.

There is a lot more to the Poisson story than we see in this short
section.  We'll return to this distribution family in Section
\ref{connpoi}. 

\subsubsection{R Functions}

Relevant functions for a Poisson distributed random variable X
with parameter lambda are:

\begin{itemize}

\item {\bf dpois(i,lambda)}, to find $P(X = i)$

\item {\bf ppois(i,lambda)}, to find $P(X \leq i)$

\item {\bf qpois(q,lambda)}, to find c such that $P(X \leq c) = q$

\item {\bf rpois(n,lambda)}, to generate n independent values of X

\end{itemize}

\subsection{The Power Law Family of Distributions}

Here

\begin{equation}
p_X(k) = c k^{-\gamma}, ~ k = 1,2,3,...
\end{equation}

It is required that $\gamma > 1$, as otherwise the sum of probabilities
will be infinite.  For $\gamma$ satisfying that condition, the value c
is chosen so that that sum is 1.0:

\begin{equation}
1.0 = \sum_{k=1}^{\infty} c k^{-\gamma} \approx
c \int_1^{\infty} k^{-\gamma} ~ dk = c/(\gamma - 1)
\end{equation}

so $c \approx \gamma -1$.

Here again we have a parametric family of distributions, indexed by the
parameter $\gamma$.

The power law family is an old-fashioned model (an old-fashioned term
for {\it distribution} is {\it law}), but  there has been a resurgence
of interest in it in recent years.  Analysts have found that many types
of social networks in the real world exhibit approximately power law
behavior in their degree distributions.

For instance, in a famous study of the Web (A. Barabasi and R. Albert,
Emergence of Scaling in Random Networks, {\it Science}, 1999, 509-512),
degree distribution on the Web (a directed graph, with incoming links
being the ones of interest here) it was found that the number of links
leading to a Web page has an approximate power law distribution with
$\gamma =2.1$.  The number of links leading out of a Web page was found
to be approximately power-law distributed, with $\gamma = 2.7$.

Much of the interest in power laws stems from their {\bf fat tails}, a
term meaning that values far from the mean are more likely under a power
law than they would be under a normal distribution with the same mean.
In recent popular literature, values far from the mean have often been
called {\bf black swans}.  The financial crash of 2008, for example, is
blamed by some on the ignorance by {\bf quants} (people who develop
probabilistic models for guiding investment) in underestimating the
probabilities of values far from the mean.

Some examples of real data that are, or are not, fit well by power law
models are given in the paper {\it Power-Law Distributions in Empirical
Data}, by A. Clauset, C. Shalizi and M. Newman, at
\url{http://arxiv.org/abs/0706.1062}.  Methods for estimating the
parameter $\gamma$ are discussed and evaluated.

A variant of the power law model is the {\bf power law with exponential
cutoff}, which essentially consists of a blend of the power law and a
geometric distribution.  Here

\begin{equation}
p_X(k) = c k^{-\gamma} q^k
\end{equation}

This now is a two-parameter family, the parameters being $\gamma$ and q.
Again c is chosen so that the pmf sums to 1.0.

This model is said to work better than a pure power law for some types
of data.  Note, though, that this version does not really have the fat
tail property, as the tail decays exponentially now.

\section{Recognizing Some Parametric Distributions When You See Them}

Three of the discrete distribution families we've considered here arise
in settings with very definite structure, all dealing with independent
trials:

\begin{itemize}

\item the binomial family gives the distribution of the number of
successes in a fixed number of trials

\item the geometric family gives the distribution of the number of
trials needed to obtain the first success

\item the negative binomial family gives the distribution of the number of
trials needed to obtain the k$^{th}$ success

\end{itemize}

Such situations arise often, hence the fame of these distribution
families.

By contrast, the Poisson and power law distributions have no underlying
structure.  They are famous for a different reason, that it has been
found empirically that they provide a good fit to many real data sets.

In other words, the Poisson and power law distributions are typically
fit to data, in attempt to find a good model, whereas in the binomial,
geometric and negative binomial cases, the fundamental nature of the
setting implies one of those distributions.

{\bf You should make a strong effort to get to the point at which you
automatically recognize such settings when your encounter them}.

\subsection{Example:  a Coin Game}
\label{coingame}

{\it Life is unfair}---former President Jimmie Carter

\bigskip

Consider a game played by Jack and Jill. Each of them tosses a coin many
times, but Jack gets a head start of two tosses. So by the time Jack has
had, for instance, 8 tosses, Jill has had only 6; when Jack tosses for
the 15$^{th}$ time, Jill has her 13$^{th}$ toss; etc.

Let $X_k$ denote the number of heads Jack has gotten through his
k$^{th}$ toss, and let $Y_k$ be the head count for Jill at that same
time, i.e.  among only k-2 tosses for her. (So, $Y_1 = Y_2 = 0$.)
Let's find the probability that Jill is winning after the k$^{th}$
toss, i.e. $P(Y_6 > X_6)$.

Your first reaction might be, ``Aha, binomial distribution!''  You would
be on the right track, but the problem is that you would not be thinking
precisely enough.  Just WHAT has a binomial distribution?  The answer is
that both $X_6$ and $Y_6$ have binomial distributions, both with p =
0.5, but n = 6 for $X_6$ while n = 4 for $Y_6$.

Now, as usual, ask the famous question, ``How can it happen?''  How can
it happen that  $Y_6 > X_6$?  Well, we could have, for example, $Y_6 = 3$
and $X_6 = 1$, as well as many other possibilities.  Let's write it
mathematically:

\begin{equation}
\label{ygx}
P(Y_6 > X_6) = \sum_{i=1}^4 \sum_{j=0}^{i-1} P(Y_6 = i \textrm{ and } X_6 = j)
\end{equation}

Make SURE your understand this equation.

Now, to evaluate $P(Y_6 = i \textrm{ and } X_6 = j)$, we see the ``and''
so we ask whether $Y_6$ and $X_6$ are independent.  They in fact are;
Jill's coin tosses certainly don't affect Jack's.  So,

\begin{equation}
P(Y_6 = i \textrm{ and } X_6 = j) =
P(Y_6 = i) \cdot P(X_6 = j)
\end{equation}

It is at this point that we finally use the fact that $X_6$ and $Y_6$
have binomial distributions.  We have

\begin{equation}
\label{y6}
P(Y_6 = i) = \binom{4}{i} 0.5^i (1-0.5)^{4-i}  
\end{equation}

and

\begin{equation}
\label{x6}
P(X_6 = j) = \binom{6}{j} 0.5^j (1-0.5)^{6-j}  
\end{equation}

We would then substitute (\ref{y6}) and (\ref{x6}) in (\ref{ygx}).  We
could then evaluate it by hand, but it would be more convenient to use
R's {\bf dbinom()} function:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
prob <- 0
for (i in 1:4)
   for (j in 0:(i-1))
      prob <- prob + dbinom(i,4,0.5) * dbinom(j,6,0.5)
print(prob)
\end{Verbatim}

We get an answer of about 0.17.  If Jack and Jill were to play this game
repeatedly, stopping each time after the 6$^{th}$ toss, then Jill
would win about 17\% of the time.

\subsection{Example:  Tossing a Set of Four Coins}

Consider a game in which we have a set of four coins.  We keep tossing
the set of four until we have a situation in which exactly two of them
come up heads.  Let N denote the numbr of times we must toss the set of
four coins.

For instance, on the first toss of the set of four, the outcome might be
HTHH.  The second might be TTTH, and the third could be THHT.  In the
situation, N = 3.

Let's find P(N = 5).  Here we recognize that N has a geometric
distribution, with ``success'' defined as getting two heads in our set
of four coins.  What value does the parameter p have here?  

Well, p is P(X = 2), where X is the number of heads we get from a toss
of the set of four coins.  We recognize that X is binomial!  Thus

\begin{equation}
p = \binom{4}{2} 0.5^4 = \frac{3}{8}
\end{equation}

Thus using the fact that N has a geometric distribution,

\begin{equation}
P(N = 5) = (1-p)^4 p = 0.057
\end{equation}

\subsection{Example:  the ALOHA Example Again}
\label{alohaagain}

As an illustration of how commonly these parametric families arise,
let's again look at the ALOHA example.  Consider the general case, with
transmission probability p, message creation probability q, and m
network nodes.  We will not restrict our observation to just two epochs.

Suppose $X_i = m$, i.e. at the end of epoch i all nodes have a message
to send.  Then the number which attempt to send during epoch i+1 will be
binomially distributed, with parameters m and p.\footnote{Note that
this is a conditional distribution, given $X_i = m$.}  For instance, the
probability that there is a successful transmission is equal to the
probability that exactly one of the m nodes attempts to send,

\begin{equation}
\label{onesends}
\binom{m}{1} p (1-p)^{m-1} = mp(1-p)^{m-1}
\end{equation}

Now in that same setting, $X_i = m$, let K be the number of epochs it
will take before some message actually gets through.  In other words, we
will have $X_i = m$, $X_{i+1} = m$, $X_{i+2} = m$,...  but finally
$X_{i+K-1} = m-1$.  Then K will be geometrically distributed, with
success probability equal to (\ref{onesends}).

There is no Poisson distribution in this example, but it is central to
the analysis of Ethernet, and almost any other network.  We will discuss
this at various points in later chapters.

\section{A Preview of Markov Chains} 
\label{markovintro}

Here we introduce Markov chains, a topic covered in much more detail in
Chapter \ref{mar}.  

The basic idea is that we have random variables $X_1, X_2, ...$, with
the index representing time.  Each one can take on any value in a given
set, called the {\bf state space}; $X_n$ is then the {\bf state} of the
system at time n.

The key aspect that we assume the {\bf Markov property}, which in rough
terms can be described as:

\begin{quote}
The probabilities of future states, given the present state and the past
state, depends only on the present state; the past is irrelevant.
\end{quote}

In formal terms:

\begin{equation}
P(X_{t+1}=s_{t+1}|X_{t}=s_{t},X_{t-1}=s_{t-1},\ldots ,X_{0}=s_{0})=P(X_{t+1}=s_{t+1}|X_{t}=s_{t})
\end{equation}

We define $p_{ij}$ to be the probability of going from state i to state
j in one time step.  This forms a matrix P, whose row i, column j
element is $p_{ij}$, which is called the {\bf transition matrix}.

\subsection{Example:  Die Game}

As our first example of Markov chains, consider the following game.  One
repeatedly rolls a die, keeping a running total.  Each time the total
exceeds 10, we receive one dollar, and continue playing, resuming where
we left off, mod 10.  Say for instance we have a total of 8, then roll a
5.  We receive a dollar, and now our total is 3.

This process clearly satisfies the Markov property.  If our current
total is 6, for instance, then the probability that we next have a total
of 9 is 1/6, {\it regardless of what happend our previous rolls}.  We
have $p_{25}$, $p_{72}$ and so on all equal to 1/6, while for instance
$p_{29} = 0$.  Here's the code to find P:

\begin{lstlisting}[numbers=left]
p <- matrix(rep(0,100),nrow=10)
onesixth <- 1/6
for (i in 1:10) {
   for (j in 1:6) {
      k <- i + j
      if (k > 10) k <- k - 10 
      p[i,k] <- onesixth
   }
}
\end{lstlisting}

\subsection{Long-Run State Probabilities}

Let $N_{it}$ denote the number of times we have visited state i during
times 1,...,t.  Than as discussed in Section \ref{discpi}, in typical
applications

\begin{equation}
\label{limntcpy}
\pi_{i}=\lim_{t\rightarrow \infty }\frac{N_{it}}{t}
\end{equation}
                 
exists for each state i.  Under a couple more conditions, we have the
stronger result,

\begin{equation}
\label{limprob}
\lim_{t\rightarrow \infty }P(X_{t}=i)=\pi_{i}
\end{equation}

These quantities $\pi_i$ are typically the focus of analyses of Markov
chains.

In Chapter \ref{mar} it is shown that the $\pi_i$ are easy to
find (in the case of finite state spaces, the subject of this section
here), by solving the matrix equation

\begin{equation}
(I-P') \pi = 0
\end{equation}

subject to the constraint

\begin{equation}
\sum_{i}\pi_{i}=1
\end{equation}

Here I is the identity matrix, and ' denotes matrix transpose.  R code
to do all this (after some algebraic manipulations), {\bf findpi1()}, is
provided in Section \ref{solvbal}, reproduced here for convenience:

\begin{lstlisting}[numbers=left]
findpi1 <- function(p) {
   n <- nrow(p)
   imp <- diag(n) - t(p)  # I-P
   imp[n,] <- rep(1,n)
   rhs <- c(rep(0,n-1),1)
   pivec <- solve(imp,rhs)
   return(pivec)
}
\end{lstlisting}

Consider the die game example above.  Guess what!  All the $\pi_i$ turn
out to be 1/10.  In retrospect, this should be obvious.  If we were to
draw the states 1 through 10 as a ring, with 1 following 10, it should
be clear that all the states are completely symmetric.

\subsection{Example:  3-Heads=in-a-Row Game}

How about the following game?  We keep tossing a coin until we get three
consecutive heads.  What is the expected value of the number of tosses
we need?

We can model this as a Markov chain with states 0, 1, 2 and 3, where
state i means that we have accumulated i consecutive heads so far. 
If we simply stop playing the game when we reach state 3, that state
would be known as an {\bf absorbing state}, one that we never leave.

We could proceed on this basis, but to keep things elementary, let's
just model the game as being played repeatedly, as in the die game
above.  You'll see that that will still allow us to answer the original
question.  Note that now that we are taking that approach, it will
suffice to have just three states, 0, 1 and 2.

Clearly we have transition probabilities such as $p_{01}$, $p_{12}$,
$p_{10}$ and so on all equal to 1/2.  Note from state 2 we can only go
to state 0, so $p_{20} = 1$.

Here's the code below.  Of course, since R subscripts start at 1 instead
of 0, we must recode our states as 1, 2 an 3.

\begin{Verbatim}[fontsize=\relsize{-2}]
p <- matrix(rep(0,9),nrow=3)
onehalf <- 1/2
p[1,1] <- onehalf
p[1,2] <- onehalf
p[2,3] <- onehalf
p[2,1] <- onehalf
p[3,1] <- 1
findpi1(p)
\end{Verbatim}

It turns out that

\begin{equation}
\pi = (0.5714286, 0.2857143, 0.1428571)
\end{equation}

So, in the long run, about 57.1\% of our tosses will be done while in state
0, 28.6\% while in state 1, and 14.3\% in state 2.

Now, look at that latter figure.  Of the tosses we do while in state 2,
half will be heads, so half will be wins.  In other words, about 0.071
of our tosses will be wins.  And THAT figure answers our original
question, through the following reasoning:

Think of, say, 10000 tosses.  There will be about 710 wins sprinkled
among those 10000 tosses.  Thus the average number of tosses between wins
will be about 10000/710 = 14.1.  In other words, the expected time until
we get three consecutive heads is about 14.1 tosses.

\subsection{Example:  ALOHA}

Consider our old friend, the ALOHA network model.  (You
may wish to review the statement of the model in Section
\ref{basicprobcomp} before continuing.)  The key point in that system is
that it was ``memoryless,'' in that the probability of what happens at
time k+1 depends only on the state of the system at time k.

For instance, consider what might happen at time 6 if $X_5 = 2$.  Recall
that the latter means that at the end of epoch 5, both of our two
network nodes were active.  The possibilities for $X_6$ are then

\begin{itemize}

\item $X_6$ will be 2 again, with probability $p^2 + (1-p)^2$

\item $X_6$ will be 1, with probability $2p(1-p)$

\end{itemize}

The central point here is that the past history of the system---i.e. the
values of $X_1, X_2, X_3, X_4 \textrm{ and } X_5$---don't have any impact.
We can state that precisely:

\begin{quote}

The quantity

\begin{equation}
\label{condp6}
P(X_6 = j | X_1 = i_1, X_2 = i_2, X_3 = i_3, X_4 = i_4, X_5 = i)
\end{equation}

\noindent 
does not depend on $i_m, m = 1,...,4$.  Thus we can write (\ref{condp6})
simply as $P(X_6 = j | X_5 = i)$.  

\end{quote}

Furthermore, that probability is the same as $P(X_9 = j | X_8 = i)$ and
in general $P(X_{k+1} = j | X_k = i)$.  We denote this probability by
$p_{ij}$, and refer to it as the {\bf transition probability} from state
i to state j.  

Since this is a three-state chain, the $p_{ij}$ form a 3x3 matrix:

\begin{equation}
P = 
\left (
\begin{array}{ccc}
(1-q)^2 + 2q(1-q)p & 
   2q(1-q)(1-p) + 2q^2p(1-p) & 
   q^2 [p^2+(1-p)^2] \\
(1-q) p & 
   2qp(1-p) + (1-q)(1-p) & 
   q[p^2+(1-p)^2] \\
0 & 
   2p(1-p) & 
   p^2+(1-p)^2
\end{array}
\right )
\end{equation}

For instance, the element in row 0, column 2, $p_{02}$, is
$q^2[p^2+(1-p)^2$, reflecting the fact that to go from state 0 to state
2 would require that both inactive nodes become active (which has
probability $q^2$, and then either both try to send or both refrain from
sending (probability $p^2+(1-p)^2$.

For the ALOHA example here, with p = 0.4 and q = 0.3, the solution is
$\pi_0 = 0.47$, $\pi_1 = 0.43$ and $\pi_2 = 0.10$.

So we know that in the long run, about 47\% of the epochs will have no
active nodes, 43\% will have one, and 10\% will have two.  From this we
see that the long-run average number of active nodes is

\begin{equation}
0 \cdot 0.47 + 1 \cdot 0.43 + 2 \cdot 0.10 = 0.63
\end{equation}

\subsection{Example:  Bus Ridership Problem}

Consider the bus ridership problem in Section \ref{busridership}.  Make
the same assumptions now, but add a new one:  There is a maximum
capacity of 20 passengers on the bus.

The random variables $L_i$, i = 1,2,3,... form a Markov chain.  Let's
look at some of the transition probabilities:

\begin{equation}
p_{00} = 0.5 
\end{equation}

\begin{equation}
p_{01} = 0.4 
\end{equation}

\begin{equation}
p_{20} = (0.2)^2 (0.5) = 0.02
\end{equation}

\begin{equation}
p_{20,20} = (0.8)^20 + 
20 \cdot (0.2) (0.8)^{19} (0.4) + 
190 \cdot (0.2)^2 (0.8^{18} (0.1)
\end{equation}

After finding the $\pi$ vector as above, we can find quantities such as
the long-run average number of passengers on the bus,

\begin{equation}
\sum_{i=0}^{20} \pi_i i
\end{equation}

and the long-run average number of would-be passengers who fail to board
the bus,

\begin{equation}
1 \cdot [\pi_{19} (0.1) + \pi_{20}(0.4)] + 2 \cdot [\pi_{20} (0.1)]
\end{equation}

\subsection{An Inventory Model}

Consider the following simple inventory model.  A store
has 1 or 2 customers for a certain item each day, with probabilities p
and q (p+1 = 1).  Each customer is allowed to buy only 1 item.

When the stock on hand reaches 0 on a day, it is replenished to r
items immediately after the store closes that day.

If at the start of a day the stock is only 1 item and 2 customers wish
to buy the item, only one customer will complete the purchase, and the
other customer will leave emptyhanded.

Let $X_n$ be the stock on hand at the end of day n ({\it after}
replenishment, if any).  Then $X_1, X_2,...$ form a Markov chain, with
state space 1,2,...,r.

Let's write a function {\bf inventory(p,q,r)} that returns the
$\pi$ vector for this Markov chain.  It will call {\bf findpi1()},
similarly to the two code snippets on page \pageref{codesnippets}.

\begin{lstlisting}
inventory <- function(p,q,r) {
   tm <- matrix(rep(0,r^2),nrow=r)
   for (i in 3:r) {
      tm[i,i-1] <- p
      tm[i,i-2] <- q
   }
   tm[2,1] <- p
   tm[2,r] <- q
   tm[1,r] <- 1
   return(findpi1(tm))
}
\end{lstlisting}


\section{A Cautionary Tale}

\subsection{Trick Coins, Tricky Example}

Suppose we have two trick coins in a box. They look identical, but one
of them, denoted coin 1, is heavily weighted toward heads, with a 0.9
probability of heads, while the other, denoted coin 2, is biased in the
opposite direction, with a 0.9 probability of tails.  Let $C_1$ and
$C_2$ denote the events that we get coin 1 or coin 2, respectively.

Our experiment consists of choosing a coin at random from the box, and
then tossing it n times. Let $B_i$ denote the outcome of the $i^{th}$
toss, i = 1,2,3,..., where $B_i = 1$ means heads and $B_i = 0$ means
tails.  Let $X_i = B_1+...+B_i$, so $X_i$ is a count of the number of
heads obtained through the $i^{th}$ toss.

The question is: ``Does the random variable $X_i$ have a binomial
distribution?''  Or, more simply, the question is, ``Are the random
variables $B_i$ independent?'' To most people's surprise, the answer is
No (to both questions).  Why not?

The variables $B_i$ are indeed 0-1 variables, and they have a common
success probability.  But they are not independent!  Let's see why they
aren't.

Consider the events $A_i = \{B_i=1\}$, i = 1,2,3,...  In fact, just look
at the first two.  By definition, they are independent if and only if 

\begin{equation}
\label{notindep}
P(A_1\textrm{~and~}A_2) = P(A_1)P(A_2)
\end{equation}

First, what is $P(A_1)$?   {\bf \Large Now, wait a minute!}  Don't
answer, ``Well, it depends on which coin we get,'' because this is NOT a
conditional probability.  Yes, the {\it conditional} probabilities
$P(A_1 | C_1)$ and $P(A_1 | C_2)$ are 0.9 and 0.1, respectively, but the
{\it unconditional} probability is $P(A_1) = 0.5$.  You can deduce that
either by the symmetry of the situation, or by 

\begin{equation}
P(A_1) = P(C_1) P(A_1 |
C_1) + P(C_2) P(A_1 | C_2) = (0.5) (0.9) + (0.5) (0.1) = 0.5
\end{equation}

You should think of all this in the notebook context.  Each line of the
notebook would consist of a report of three things:  which coin we get;
the outcome of the first toss; and the outcome of the second toss.
(Note by the way that in our experiment we don't know which coin we get,
but conceptually it should have a column in the notebook.)  If we do this
experiment for many, many lines in the notebook, about 90\% of the lines
in which the coin column says ``1'' will show Heads in the second
column.  But 50\% of the lines {\it overall} will show Heads in that
column.

So, the right hand side of Equation (\ref{notindep}) is equal to 0.25.
What about the left hand side?

\begin{eqnarray}
P(A_1\textrm{ and }A_2) & = & P(A_1\textrm{ and }A_2\textrm{ and
}C_1)+P(A_1\textrm{ and }A_2\textrm{ and }C_2)\textrm{ }\label{lhs} \\
 & = & P(A_1\textrm{ and }A_2|C_1)P(C_1)+P(A_1\textrm{ and}A_2|C_2)
 P(C_2)\\
 & = & (0.9)^2(0.5)+(0.1)^{2}(0.5)\\
 & = & 0.41
\end{eqnarray}

Well, 0.41 is not equal to 0.25, so you can see that the events are not
independent, contrary to our first intuition.  And that also means that
$X_i$ is not binomial.

\subsection{Intuition in Retrospect} 

To get some intuition here, think about what would happen if we tossed
the chosen coin 10000 times instead of just twice. If the tosses were
independent, then for example knowledge of the first 9999 tosses should
not tell us anything about the 10000th toss. But that is not the case at
all. After 9999 tosses, we are going to have a very good idea as to
which coin we had chosen, because by that time we will have gotten
about 9000 heads (in the case of coin $C_1$) or about 1000 heads
(in the case of $C_2$). In the former case, we know that the
10000th toss is likely to be a head, while in the latter case it is
likely to be tails. \textbf{In other words, earlier tosses do indeed
give us information about later tosses, so the tosses aren't
independent.}

\subsection{Implications for Modeling}

The lesson to be learned is that independence can definitely be a tricky
thing, not to be assumed cavalierly.   And in creating probability
models of real systems, we must give very, very careful thought to the
conditional and unconditional aspects of our models----it can make a
huge difference, as we saw above.  Also, the conditional aspects often
play a key role in formulating models of nonindependence. 

This trick coin example is just that---tricky---but similar situations
occur often in real life.  If in some medical study, say, we sample
people at random from the population, the people are independent of each
other.  But if we sample {\it families} from the population, and then
look at children within the families, the children within a family are
not independent of each other.

% \section{More on the ``Notebook'' Idea}
% 
% Consider a multistage experiment, with stages of identical form.  A good
% example of this was our ALOHA model in Section \ref{aloha}.  There we
% looked at two epochs, which we are calling ``stages'' here.  Each line
% of our notebook (Table \ref{alohanotebook}) consisted of the 2-tuple
% $(X_1,X_2)$ (plus information derived from this 2-tuple).
% 
% Another example would be an experiment in which we toss a die five
% times.  Let $X_i$ denote the outcome of the $i^{th}$ roll of the die.
% Each line of our notebook would consist of the 5-tuple
% $(X_1,X_2,X_3,X_4,X_5)$.
% 
% We could also consider an experiment in which we toss a die infinitely
% many times.  Or, since this may seem rather artificial, consider the
% experiment in which the ALOHA network runs for infinitely many epochs,
% which is realistic because the network is an ongoing process.  Here
% each line of the notebook would consist of the infinite-tuple
% $(X_1,X_2,X_3,\ldots)$. 
% 
% With this in mind, let's return to our discussion of the Strong Law of
% Large Numbers (SLLN) from Section \ref{reconcil}.  While Equation
% (\ref{slln}) made good intuitive sense, it was incomplete.  In what
% sense does the limit exist?  The actual SLLN says that the limit exists
% ``with probability 1.''  This is a very subtle issue, but be patient and
% you should feel more comfortable with it.
% 
% For concreteness, let's use the example of tossing the die infinitely
% many times.  Let $X_i$ denote the outcome of the $i^{th}$ toss of the
% die, i = 1,2,3,...   Recall that E(X) = 3.5.  Then the mathematically
% precise statement of the SLLN is then
% 
% \begin{equation}
% \label{sllnprecise}
% P \left [\lim_{n \rightarrow \infty} \frac{X_1+...+X_n}{n} = 3.5 \right ] = 1 
% \end{equation}
% 
% In light of our concept of a multistage experiment, that means again
% that each line of our notebook consists of the infinite-tuple
% $(X_1,X_2,X_3,\ldots)$.  We have infinitely many lines in our notebook,
% reflecting the fact that we are tossing the die infinitely many times,
% {\it infinitely many times}!  Even though (\ref{sllnprecise}) looks
% abstract, all it's saying is the following.
% 
% In each line of the notebook, the long-run average of all the $X_i$ values
% in that line is either 3.5---or not.  So imagine a new column added to
% the notebook, labeld ``limit exists?'', with the entries being Yes or
% No.  What (\ref{sllnprecise}) says is, as we go from line to line in the
% notebook, the long-run fraction of lines which say Yes is 1.
% 
% No doubt about it, this is pretty abstract.  But it has practical
% implications.  For example, suppose we wish to write a program
% simulating the ALOHA system, and we wish to find the long-run average
% number of nodes which have a message to send.  What (\ref{sllnprecise})
% says is that one line of the notebook suffices; we can simulate the
% system until some large number of epochs, say 100000, and compute the
% average number of ready messages among those 100 epochs. 

\section{Why Not Just Do All Analysis by Simulation?}

Now that computer speeds are so fast, one might ask why we need to do
mathematical probability analysis; why not just do everything by
simulation?  There are a number of reasons:

\begin{itemize}

\item Even with a fast computer, simulations of complex systems can take
days, weeks or even months.

\item Mathematical analysis can provide us with insights that may not
be clear in simulation.

\item Like all software, simulation programs are prone to bugs.  The
chance of having an uncaught bug in a simulation program is reduced by
doing mathematical analysis for a special case of the system being
simulated.  This serves as a partial check.

\item Statistical analysis is used in many professions, including
engineering and computer science, and in order to conduct meaningful,
\underline{useful} statistical analysis, one needs a firm understanding
of probability principles.

\end{itemize}

An example of that second point arose in the computer security research
of a graduate student at UCD, C. Senthilkumar, who was working on a way
to more quickly detect the spread of a malicious computer worm.  He was
evaluating his proposed method by simulation, and found that things
``hit a wall'' at a certain point.  He wasn't sure if this was a real
limitation; maybe, for example, he just wasn't running his simulation on
the right set of parameters to go beyond this limit.  But a mathematical
analysis showed that the limit was indeed real.

\section{Proof of Chebychev's Inequality}
\label{chebproof}

To prove (\ref{cheb}), let's first state and prove Markov's Inequality:
For any nonnegative random variable Y,

\begin{equation}
\label{mi}
P(Y \geq d) \leq \frac{EY}{d}
\end{equation}

To prove (\ref{mi}), let Z be the indicator random variable for the
event $Y \geq d$ (Section \ref{indicator}). 

Now note that

\begin{equation}
Y \geq d Z
\end{equation}

To see this, just think of a notebook, say with d = 3.  Then the
notebook might look like Table \ref{ygeqz}.

\begin{table}
\begin{center}

\begin{tabular}{|r|r|r|r|}
\hline
notebook line & Y & dZ & $Y \geq dZ$? \\ \hline
\hline
1 & 0.36 & 0 & yes \\ \hline
2 & 3.6 & 3 & yes \\ \hline
3 & 2.6 & 0 & yes \\ \hline
\end{tabular}

\end{center}
\caption{Illustration of Y and Z}
\label{ygeqz}
\end{table}

So

\begin{equation}
\label{eyez}
EY \geq d EZ
\end{equation}

(Again think of the notebook.  The long-run average in the Y column will
be $\geq$ the corresponding average for the dZ column.

The right-hand side of (\ref{eyez}) is $d P(Y \geq d)$, so (\ref{mi})
follows.

Now to prove (\ref{cheb}), define

\begin{equation}
Y = (X-\mu)^2
\end{equation}

and set $d = c^2 \sigma^2$.  Then (\ref{mi}) says

\begin{equation}
\label{miout}
P [ (X-\mu)^2 \geq c^2 \sigma^2 ]
\leq 
\frac
{E [(X-\mu)^2]}
{c^2 \sigma^2}
\end{equation}

Since

\begin{equation}
(X-\mu)^2 \geq c^2 \sigma^2 \textrm{ if and only if }
|X-\mu| \geq c \sigma
\end{equation}

the left-hand side of (\ref{miout}) is the same as the left-hand side of
(\ref{cheb}).  The numerator of the right-hand size of (\ref{miout}) is
simply Var(X), i.e. $\sigma^2$, so we are done.

\section{Reconciliation of Math and Intuition (optional section)} 
\label{reconcil}

Here is a more theoretical definition of probability, as opposed to the
intuitive ``notebook'' idea in this book.  The definition is an
abstraction of the notions of events (the sets A in $\cal W$ below) and
probabilities of those events (the values of the function P(A)):

\begin{definition}
Let S be a set, and let $\cal W$ be a collection of
subsets of S.  Let P be a real-valued function on $\cal W$.  Then S,
$\cal W$ and P form a {\bf probability space} if the following
conditions hold:

\begin{itemize}

\item $S \in \cal W$.

\item $\cal W$ is closed under complements (if a set is in $\cal W$,
then the set's complement with respect to S is in $\cal W$ too) and
under unions of countably many members of $\cal W$.

\item $P(A) \geq 0$ for any A in $\cal W$.

\item If $A_1, A_2,... \in \cal W$ and the $A_i$ are pairwise disjoint,
then

\begin{equation}
P(\cup_i A_i) = \sum_i P(A_i)
\end{equation}

\end{itemize}

A {\bf random variable} is any function $X: S \rightarrow \cal
R$.\footnote{The function must also have a property called {\bf
measurability}, which we will not discuss here.}

\end{definition}

Using just these simple axioms, one can prove (with lots of heavy math)
theorems like the Strong Law of Large Numbers:

\begin{theorem}
Consider a random variable U, and a sequence of independent random
variables $U_1, U_2, ...$ which all have the same distribution as U,
Then

\begin{equation}
\label{slln}
\lim_{n \rightarrow \infty} \frac{U_1+...+U_n}{n} = E(U) \textrm{ with
probability 1}
\end{equation}
\end{theorem}

In other words, the average value of U in all the lines of the notebook
will indeed converge to EU.

\startproblemset

\oneproblem
Consider a game in which one rolls a single die until one accumulates a
total of at least four dots. Let $X$ denote the number of rolls needed.
Find $P(X \leq 2)$ and $E(X)$.

\oneproblem
Recall the committee example in Section \ref{combex}.  Suppose now,
though, that the selection protocol is that there must be at least one
man and at least one woman on the committee.  Find $E(D)$ and $Var(D)$.

\oneproblem
Suppose a bit stream is subject to errors, with each bit having
probability p of error, and with the bits being independent. Consider a
set of four particular bits. Let X denote the number of erroneous bits
among those four.

\begin{itemize}

\item [(a)] Find P(X = 2) and EX.

\item [(b)] What famous parametric family of distributions does the
distribution of X belong to?

\item [(c)] Let Y denote the maximum number of consecutive erroneous bits.
Find P(Y = 2) and Var(Y). 

\end{itemize}

\oneproblem
Derive (\ref{vargeom}).

\oneproblem
Finish the computation in (\ref{expectdist}).

\oneproblem
Derive the facts that for a Poisson-distributed random variable X with
parameter $\lambda$, $EX = Var(X) = \lambda$.  Use the hints in Section
\ref{poisfam}.

\oneproblem
A civil engineer is collecting data on a certain road.  She needs to
have data on 25 trucks, and 10 percent of the vehicles on that road are
trucks.  State the famous parametric family that is relevant here, and
find the probability that she will need to wait for more than
200 vehicles to pass before she gets the needed data.

\oneproblem
In the ALOHA example:

\begin{itemize}

\item [(a)] Find $E(X_1)$ and $Var(X_1)$, for the case p = 0.4, q = 0.8.
You are welcome to use quantities already computed in the text, e.g.
$P(X_1 = 1) = 0.48$, but be sure to cite equation numbers.

\item [(b)] Find P(collision during epoch 1) for general p, q.

\end{itemize}

\oneproblem
Our experiment is to toss a nickel until we get a head, taking
X rolls, and then toss a dime until we get a head, taking Y tosses.
Find:

\begin{itemize}

\item [(a)] Var(X+Y).

\item [(b)] Long-run average in a ``notebook'' column labeled $X^2$.

\end{itemize}

\oneproblem
Consider the game in Section \ref{coingame}.  Find $E(Z)$ and $Var(Z)$,
where $Z = Y_6 - X_6$.

\oneproblem
Say we choose six cards from a standard deck, one at a time WITHOUT
replacement.  Let $N$ be the number of kings we get.  Does $N$ have a
binomial distribution?  Choose one: (i)  Yes.  (ii) No, since trials are
not independent.  (iii) No, since the probability of success is not
constant from trial to trial.  (iv) No, since the number of trials is not
fixed.  (v) (ii) and (iii).  (iv) (ii) and (iv).  (vii) (iii) and (iv).  

\oneproblem
Suppose we have n independent trials, with the probability
of success on the i$^{th}$ trial being $p_i$.  Let $X$ = the number of
successes.  Use the fact that ``the variance of the sum is the sum of
the variance'' for independent random variables to derive $Var(X)$.

\oneproblem
Prove Equation (\ref{varuformula}).

\oneproblem
Show that if $X$ is a nonnegative-integer valued random variable, then

\begin{equation}
EX = \sum_{i=1}^{\infty} P(X \geq i)
\end{equation}

Hint:  Write $i = \sum_{j=1}^i 1$, and when you see an iterated sum,
reverse the order of summation.

\oneproblem
Suppose we toss a fair time n times, resulting in $X$ heads.  Show that
the term {\it expected value} is a misnomer, by showing that

\begin{equation}
\lim_{n \rightarrow \infty} P(X = n/2) = 0
\end{equation}

Use Stirling's approximation, 

\begin{equation}
k! \approx \sqrt{2 \pi k} \left ( \frac{k}{e} \right )^k
\end{equation}

\oneproblem
Suppose X and Y are independent random variables with standard
deviations 3 and 4, respectively.

\begin{itemize}

\item [(a)] Find Var(X+Y).

\item [(b)] Find Var(2X+Y).

\end{itemize}

\oneproblem
Fill in the blanks in the following simulation, which finds
the approximate variance of N, the number of rolls of a die needed to
get the face having just one dot.

\begin{Verbatim}[fontsize=\relsize{-2}]
onesixth <- 1/6
sumn <- 0
sumn2 <- 0
for (i in 1:10000) {
   n <- 0
   while(TRUE) {
      ________________________________________
      if (______________________________ < onesixth) break
   }
   sumn <- sumn + n
   sumn2 <- sumn2 + n^2
}
approxvarn <- ____________________________________________
cat("the approx. value of Var(N) is ",approx,"\n")
\end{Verbatim}

\oneproblem
Let X be the total number of dots we get if we roll three
dice.  Find an upper bound for $P(X \geq 15)$, using our course
materials.

\oneproblem
Suppose X and Y are independent random variables, and let Z = XY. Show
that $Var(Z) = E(X^2)   E(Y^2) - [E(X)]^2   [E(Y)]^2$. 

\oneproblem
This problem involves a very simple model of the Web. (Far more complex
ones exist.)

Suppose we have n Web sites. For each pair of sites i and j, $i \neq j$,
there is a link from i to j with probability p, and no link (in that
direction) with probability 1-p. Let $N_i$ denote the number of sites
that site i is linked to; note that $N_i$ can range from 0 to n-1.
Also, let $M_{ij}$ denote the number of outgoing links that i and j have
in common, not counting the one between them, if any. Assume
that each site forms its outgoing links independently of the others.

Say n = 10, p = 0.2. Find the followinga:

\begin{itemize}

\item [(a)] $P(N_1 = 3)$

\item [(b)] $P(N_1 = 3 \textrm{ and } N_2 = 2)$

\item [(c)] $Var(N_1)$

\item [(d)] $Var(N_1 + N_2)$

\item [(e)] $P(M_{12} = 4)$

\end{itemize}

Note: There are some good shortcuts in some of these problems, making
the work much easier. But you must JUSTIFY your work.

\oneproblem
Let X denote the number of heads we get by tossing a coin 50 times.
Consider Chebychev's Inequality for the case of 2 standard deviations.
Compare the upper bound given by the inequality to the exact probability.

\oneproblem
Suppose the number N of cars arriving during a given time period at a
toll booth has a Poisson distribution with parameter $\lambda$. Each 
car has a probability p of being in a car pool. Let M be the number of
car-pool cars that arrive in the given period. Show that M also has a Poisson   
distribution, with parameter $p \lambda$. (Hint: Use the Maclaurin series for   
$e^x$.) 

\oneproblem
Consider a three-sided die, as on page \pageref{threesideddiepage}.

\begin{itemize}

   \item [(a)] (10) State the value of $p_X(2)$.

   \item [(b)] (10) Find EX and Var(X).

   \item [(c)] (15) Suppose you win \$2 for each dot.  Find EW,
   where W is the amount you win.

\end{itemize}

\oneproblem
Consider the parking space problem in Section \ref{parking}.  Find
Var(M), where M is the number of empty spaces in the first block, and
Var(D).

\oneproblem
Suppose X and Y are independent, with variances 1 and 2,
respectively.  Find the value of c that minimizes Var[cX + (1-c)Y].

\oneproblem
In the cards example in Section \ref{hearts}, let H denote the number of
hearts. Find EH and Var(H).

\oneproblem
In the bank example in Section \ref{bankex}, suppose you observe the
bank for n days. Let X denote the number of days in which at least 2 
customers entered during the 11:00-11:15 observation period. Find P(X = k).

\oneproblem
Find $E(X^3)$, where X has a geometric distribution with parameter p.

\oneproblem
Supppose we have a nonnegative random variable X, and define a new
random variable Y, which is equal to X if X $>$ 8 and equal to 0
otherwise.  Assume X takes on only a finite number of values (just a
mathematical nicety, not really an issue).  Which one of the following
is true:

\begin{itemize}

\item [(i)] $EY \leq EX$.

\item [(ii)] $EY \geq EX$.

\item [(iii)] Either of $EY$ and $EX$ could be larger than the other,
depending on the situation.

\item [(iv)] $EY$ is undefined.

\end{itemize}

\oneproblem
Say we roll two dice, a blue one and a yellow one.  Let B
and Y denote the number of dots we get, respectively.  Now let G denote
the indicator random variable for the event S = 2.  Find E(G).

\oneproblem Suppose $I_1, I_2 \textrm{ and } I_3$ are independent
indicator random variables, with $P(I_j = 1) = p_j$, j = 1,2,3.  Find
the following in terms of the $p_j$, writing your derivation with
reasons in the form of mailing tube numbers. 

\oneproblem Consider the ALOHA example, Section \ref{alohaagain} .
Write a call to the built-in R function {\bf dbinom()} to evaluate
(\ref{onesends}) for general m and p.

\oneproblem Consider the bus ridership example, Section
\ref{busridership}.  Suppose upon arrival to a certain stop, there are 2
passengers.  Let A denote the number of them who choose to alight at
that stop.

\begin{itemize}

\item [(a)] State the parametric family that the distribution of A
belongs to.

\item [(b)] Find $p_A(1)$ and $F_A(1)$, writing each answer in
decimal expression form e.g. $12^8 \cdot 0.32 + 0.3333$.

\end{itemize}

\oneproblem
Suppose you have a large disk farm, so heavily used that the lifetimes L
are measured in months. They come from two different factories, in
proportions q and 1-q. The disks from factory i have geometrically
distributed lifetime with parameter $p_i$, i = 1,2. Find Var(L) in terms of
q and the $p_i$. 

