\chapter{Continuous Probability Models}
\label{chap:contin}

There are other types of random variables besides the discrete ones you
studied in Chapter \ref{dis}.  This chapter will cover another major
class, {\it continuous random variables}.  It is for such random
variables that the calculus prerequisite for this book is needed.

\section{A Random Dart}
\label{dart}

Imagine that we throw a dart at random at the interval (0,1).  Let D
denote the spot we hit.  By ``at random'' we mean that all subintervals
of equal length are equally likely to get hit.  For instance, the
probability of the dart landing in (0.7,0.8) is the same as for
(0.2,0.3), (0.537,0.637) and so on.

Because of that randomness, 

\begin{equation}
\label{vminusu}
P(u \leq D \leq v) = v-u
\end{equation}

for any case of $0 \leq u < v \leq 1$.

The first crucial point to note is that 

\begin{equation}
\label{itszero}
P(D = c) = 0
\end{equation}

for any individual point c.  This may seem counterintuitive, but it can
be seen in a couple of ways:

\begin{itemize}

\item Take for example the case c = 0.3.  Then

\begin{equation}
\label{1921}
P(D = 0.3) \leq P( 0.29 \leq D \leq 0.31) = 0.02
\end{equation}

the last equality coming from (\ref{vminusu}).  

So, $P(D = 0.3) \leq 0.02$.  But we can replace 0.29 and 0.31 in
(\ref{1921}) by 0.299 and 0.301, say, and get $P(D = 0.3) \leq 0.002$.  
So, $P(D = 0.3)$ must be smaller than any positive number, and thus it's
actually 0.

\item Reason that there are infinitely many points, and if they all had
some nonzero probability w, say, then the probabilities would sum to
infinity instead of to 1; thus they must have probability 0.

\end{itemize}

Remember, we have been looking at probability as being the long-run
fraction of the time an event occurs, in infinitely many repetitions of
our experiment.  So (\ref{itszero}) doesn't say that D = c can't occur;
it merely says that it happens so rarely that the long-run fraction of
occurrence is 0.

All this may still sound odd to you, but remember, this is an
idealization.  D actually cannot be just any old point in (0,1).  Our
dart has nonzero thickness, our measuring instrument has only finite
precision, and so on.  So it really is an idealization, though an
extremely useful one.  It's like the assumption of ``massless string''
in physics analyses; there is no such thing, but it's a good
approximation to reality.

\section{But Equation (\ref{itszero}) Presents a Problem}
\label{presentsaproblem}

But Equation (\ref{itszero}) presents a problem for us in defining the
term {\bf distribution} for variables like this.  In Section
\ref{dstrdef}, we defined this for a discrete random variable Y as a
list of the values Y takes on, together with their probabilities.  But
that would be impossible here---all the probabilities of individual
values here are 0.

Instead, we define the distribution of a random variable W which puts 0
probability on individual points in another way.  To set this up, we
first must define a key function:

\begin{definition}
For any random variable W (including discrete ones),
its {\bf cumulative distribution function} (cdf), $F_W$, is defined by

\begin{equation}
\label{cdf}
F_W(t) = P(W \leq t), -\infty < t < \infty
\end{equation}
\end{definition}

(Please keep in mind the notation.  It is customary to use capital F to
denote a cdf, with a subscript consisting of the name of the random
variable.)

What is t here?  It's simply an argument to a function.  The function
here has domain $(-\infty, \infty)$, and we must thus define that
function for every value of t.  This is a simple point, but a crucial
one.

For an example of a cdf, consider our ``random dart'' example above.  We
know that, for example for t = 0.23,

\begin{equation}
F_D(0.23) = P(D \leq 0.23) = P(0 \leq D \leq 0.23) = 0.23
\end{equation}

Also, 

\begin{equation}
F_D(-10.23) = P(D \leq -10.23) = 0
\end{equation}

and

\begin{equation}
F_D(10.23) = P(D \leq 10.23) = 1
\end{equation}

In general for our dart,

\begin{equation}
\label{unifcdf}
F_D(t) = 
\begin{cases}
0, & \text{if $t \leq 0$} \\
t, & \text{if $0 < t < 1$} \\ 
1, & \text{if $t \geq 1$} \\
\end{cases}
\end{equation}

Here is the graph of $F_D$: 

\includegraphics[width=4in]{FD.pdf}

The cdf of a discrete random variable is defined as in Equation
(\ref{cdf}) too.  For example, say Z is the number of heads we get from
two tosses of a coin.  Then

\begin{equation}
\label{binomcdf}
F_Z(t) =
\begin{cases}
0, & \text{if $t <  0$} \\
0.25, & \text{if $0 \leq t <  1$} \\
0.75, & \text{if $1 \leq t <  2$} \\
1, & \text{if $t \geq 2$} \\
\end{cases}
\end{equation}

For instance, $F_Z(1.2) = P(Z \leq 1.2) = P(Z = 0 \textrm{ or } Z = 1) =
0.25 + 0.50 = 0.75$.  (Make sure you confirm this!)  $F_Z$ is graphed
below.  

\includegraphics[width=4in]{FZ.pdf}

The fact that one cannot get a noninteger number of heads is what makes
the cdf of Z flat between consecutive integers.          

In the graphs you see that $F_D$ in (\ref{unifcdf}) is continuous while
$F_Z$ in (\ref{binomcdf}) has jumps.  For this reason, we call random
variables like D---ones which have 0 probability for individual
points---{\bf continuous random variables}. 

Students sometimes ask, ``What is t?"  The answer is that it's simply
the argument of a mathematical function, just like the role of t in,
say, $g(t) = sin(\pi t), -\infty < t < \infty$.  $F_Z()$ is a function,
just like this g(t) or the numerous functions that you worked with in
calculus.  Each input yields an ouput; the input 1.2 yields the output
0.75 in the case of $F_Z()$ while the input 1 yields the output 0 in the
case of g(t).

At this level of study of probability, most random variables are either
discrete or continuous, but some are not.  

\section{Density Functions}

{\fbox {\parbox{6.5in}{
Intuition is key here.  Make SURE you develop a good intuitive 
understanding of density functions, as it is vital in being able to 
apply probability well.  We will use it a lot in our course.  
}}}

\subsection{Motivation, Definition and Interpretation}
\label{densitymotivation}

OK, now we have a name for random variables that have probability 0 for
individual points---``continuous''---and we have solved the problem of
how to describe their distribution.  Now we need something which will be
continuous random variables' analog of a probability mass function.  
(The reader may wish to review pmfs in Section \ref{dstrdef}.)

Think as follows.  From (\ref{cdf}) we can see that for a discrete
random variable, its cdf can be calculated by summing is pmf.  Recall
that in the continuous world, we integrate instead of sum.
So, our continuous-case analog of the pmf should be
something that integrates to the cdf.  That of course is the derivative
of the cdf, which is called the {\bf density}:  

\begin{definition} 
(Oversimplified from a theoretical math point of view.) Consider a 
continuous random variable W.  Define

\begin{equation}
\label{dens}
f_W(t) = \frac{d}{dt} F_W(t), -\infty < t < \infty
\end{equation}

wherever the derivative exists.  The function $f_W$ is called the {\bf
density} of W.
\end{definition}

(Please keep in mind the notation.  It is customary to use lower-case f
to denote a density, with a subscript consisting of the name of the
random variable.)

Recall from calculus that an integral is the area under the curve,
derived as the limit of the sums of areas of rectangles drawn at the
curve, as the rectangles become narrower and narrower.  Since the
integral is a limit of sums, its symbol $\int$ is shaped like an S. 

Now look at Figure \ref{pm}, depicting a density function $f_X$.  (It so
happens that in this example, the density is an increasing function, but
most are not.) A rectangle is drawn, positioned horizontally at $1.3 \pm
0.1$, and with height equal $f_X(1.3)$.  The area of the rectangle
approximates the area under the curve in that region, which in turn is a
probability:

\begin{eqnarray}
2 (0.1) f_X(1.3) &\approx& \int_{1.2}^{1.4} f_X(t) ~ dt \textrm{\ \ \ \ (rect.
   approx. to slice of area)}\\ 
&=& F_X(1.4) - F_X(1.2) \textrm{\ \ \ \ ($f_X = F_X'$)}\\
&=&  P(1.2 < X \leq 1.4) \textrm{\ \ \ \ (def. of $F_X$)}\\
&=&  P(1.2 < X < 1.4) \textrm{\ \ \ \  (prob. of single pt. is 0)}
\end{eqnarray}

In other words, for any density $f_X$ at any point t, and for small
values of c,

\begin{equation}
2c f_X(t) \approx P(t-c < X < t+c)
\end{equation}

Thus we have:

\begin{quote}

{\bf Interpretation of Density Functions}

For any density $f_X$ and any two points r and s,

\begin{equation}
\frac{P(r-c < X < r+c)}{P(s-c < X < s+c)} \approx \frac{f_X(r)}{f_X(s)}
\end{equation}

So, X will take on values in regions in which $f_X$ is large much more
often than in regions where it is small, with the ratio of frequencies
being proportion to the values of $f_X$.

\end{quote}

\begin{figure}
\centerline{
\includegraphics[width=4in]{PlusMinusC.pdf}
}
\caption{Approximation of Probability by a Rectangle}
\label{pm}
\end{figure}

For our dart random variable D, $f_D(t) = 1$ for t in (0,1), and it's 0
elsewhere.\footnote{The derivative does not exist at the points 0 and 1,
but that doesn't matter.}  Again, $f_D(t)$ is NOT P(D = t), since the
latter value is 0, but it is still viewable as a ``relative
likelihood.''  The fact that $f_D(t) = 1$ for all t in (0,1) can be
interpreted as meaning that all the points in (0,1) are equally likely
to be hit by the dart.  More precisely put, you can view the constant
nature of this density as meaning that all subintervals of the same
length within (0,1) have the same probability of being hit.

Note too that if, say, X has the density in the previous paragraph, then
$f_X(3) = 6/15 = 0.4$ and thus $P(1.99 < X < 2.01) \approx 0.008$.
Using our notebook viewpoint, think of many repetitions of the
experiment, with each line in the notebook recording the value of X in
that repetition.  Then in the long run, about 0.8\% of the lines would
have X in (1.99,2.01).

The interpretation of the density is, as seen above, via the relative
heights of the curve at various points.  The absolute heights are not
important.  Think of what happens when you view a histogram of grades on
an exam.  Here too you are just interested in relative heights.  (In a
later unit, you will see that a histogram is actually an estimate for a
density.)

\subsection{Properties of Densities}

Equation (\ref{dens}) implies 

{\bf Property A:}

\begin{equation}
\label{ntab}
P(a < W \leq b) = F_W(b) - F_W(a) = \int_{a}^{b} f_W(t) ~ dt
\end{equation}

Since P(W = c) = 0 for any single point c, this also means:

{\bf Property B:}

\begin{equation}
P(a < W \leq b) = 
P(a \leq W \leq b) = 
P(a \leq W < b) = 
P(a < W < b) = 
\int_{a}^{b} f_W(t) ~ dt
\end{equation}

This in turn implies: 

{\bf Property C:}

\begin{equation}
\label{total1}
\int_{-\infty}^{\infty} f_W(t) ~ dt = 1
\end{equation}

Note that in the above integral, $f_W(t)$ will be 0 in various ranges of
t corresponding to values W cannot take on.  For the dart example, for
instance, this will be the case for $t < 0$ and $t > 1$.

What about E(W)?  Recall that if W were discrete, we'd have 

\begin{equation}
E(W) = \sum_{c} c p_W(c)
\end{equation}

where the sum ranges overall all values c that W can take on.  If for
example W is the number of dots we get in rolling two dice, c will range
over the values 2,3,...,12.

So, the analog for continuous W is:

{\bf Property D:}

\begin{equation}
E(W) = \int_t t f_W(t) ~ dt
\end{equation}

where here t ranges over the values W can take on, such as the interval
(0,1) in the dart case.  Again, we can also write this as

\begin{equation}
E(W) = \int_{-\infty}^{\infty}  t f_W(t) ~ dt
\end{equation}

in view of the previous comment that $f_W(t)$ might be 0 for various
ranges of t.

And of course, 

\begin{equation}
E(W^2) = \int_t t^2 f_W(t) ~ dt
\end{equation}

and in general, similarly to (\ref{egofx}):

{\bf Property E:}

\begin{equation}
\label{eofgofw}
E[g(W)] = \int_t g(t) f_W(t) ~ dt
\end{equation}

Most of the properties of expected value and variance stated previously
for discrete random variables hold for continuous ones too:

{\bf Property F:}

Equations (\ref{eofsum}), (\ref{eoflin}), (\ref{factoring}),
(\ref{varuformula}), (\ref{varcu}), (\ref{affine}) still hold in the
continuous case.

\subsection{A First Example}

Consider the density function equal to 2t/15 on the
interval (1,4), 0 elsewhere.  Say X has this density.  Here are some
computations we can do: 

\begin{equation}
\label{ex2t15}
EX = \int_{1}^{4} t \cdot 2t/15 ~ dt = 2.8
\end{equation}

\begin{equation}
P(X > 2.5) = \int_{2.5}^{4} 2t/15 ~ dt = 0.65
\end{equation}

\begin{equation}
F_X(s) = \int_{1}^{s} 2t/15 ~ dt = \frac{s^2-1}{15} ~~~~ \textrm{for s in
(1,4) (cdf is 0 for t $<$ 1, and 1 for t $>$ 4)}
\end{equation}

\begin{eqnarray}
Var(X) &=& E(X^2) - (EX)^2 ~~~~ \textrm{(from (\ref{varuformula})}\\ 
&=& \int_{1}^{4} t^2 2t/15 ~ dt - 2.8^2 ~~~~ \textrm{(from (\ref{ex2t15}))} \\
&=& 5.7
\end{eqnarray}

\begin{eqnarray}
P(\textrm{tenths digit of X is even}) 
&=& \sum_{i=0}^{28} 
P \left [ 1+i/10 < X < 1+(i+1)/10 \right ] \\
&=& \sum_{i=0}^{28}
\int_{1+i/10}^{1+(i+1)/10} 2t/15 ~ dt \\
&=& ... \textrm{ (integration left to the reader)}
\end{eqnarray}

Suppose L is the lifetime of a light bulb (say in years), with the
density that X has above.  Let's find some quantities in that context:

{\bf  Proportion of bulbs with lifetime less than
the mean lifetime:}

\begin{equation}
P(L < 2.8) = \int_{1}^{2.8} 2t/15 ~ dt = (2.8^2 - 1)/15
\end{equation}

{\bf Mean of 1/L:}

\begin{equation}
E(1/L) = \int_{1}^{4} \frac{1}{t} \cdot 2t/15 ~ dt = \frac{2}{5}
\end{equation}

{\bf In testing many bulbs, mean number of bulbs that it takes to find
two that have lifetimes longer than 2.5:}

Use (\ref{negbinpmf}) with k = 2 and p = 0.65.

\section{Famous Parametric Families of Continuous Distributions}

\subsection{The Uniform Distributions}

\subsubsection{Density and Properties}
\label{unifprops} 

In our dart example, we can imagine throwing the dart at the interval
(q,r) (so this will be a two-parameter family).  Then to be a uniform
distribution, i.e. with all the points being ``equally likely,'' the
density must be constant in that interval.  But it also must integrate
to 1 [see (\ref{total1}).  So, that constant must be 1 divided by the
length of the interval:

\begin{equation}
f_D(t) = \frac{1}{r-q} 
\end{equation}

for t in (q,r), 0 elsewhere.  

It easily shown that $E(D) = \frac{q+r}{2}$ and $Var(D) = \frac{1}{12} (r-q)^2$.

The notation for this family is U(q,r).

\subsubsection{R Functions}

Relevant functions for a uniformally  distributed random variable X 
on (r,s) are:

\begin{itemize}

\item {\bf punif(q,r,s)}, to find $P(X \leq q)$

\item {\bf qunif(q,r,s)}, to find c such that $P(X \leq c) = q$

\item {\bf runif(n,r,s)}, to generate n independent values of X

\end{itemize}

\subsubsection{Example:  Modeling of Disk Performance}

Uniform distributions are often used to model computer disk requests.
Recall that a disk consists of a large number of concentric rings,
called {\bf tracks}.  When a program issues a request to read or write a
file, the {\bf read/write head} must be positioned above the track of
the first part of the file.  This move, which is called a {\bf seek},
can be a significant factor in disk performance in large systems, e.g. a
database for a bank.

\label{defrag}
If the number of tracks is large, the position of the read/write head,
which I'll denote at X, is like a continuous random variable, and 
often this position is modeled by a uniform distribution.  This 
situation may hold just before a defragmentation operation.  After 
that operation, the files tend to be bunched together in the central 
tracks of the disk, so as to reduce seek time, and X will not have a
uniform distribution anymore.

Each track consists of a certain number of {\bf sectors} of a given
size, say 512 bytes each.  Once the read/write head reaches the proper
track, we must wait for the desired sector to rotate around and pass
under the read/write head.  It should be clear that a uniform
distribution is a good model for this {\bf rotational delay}.

\subsubsection{Example:  Modeling of Denial-of-Service Attack}

In one facet of computer security, it has been found that a uniform
distribution is actually a warning of trouble, a possible indication of
a {\bf denial-of-service attack}.  Here the attacker tries to
monopolize, say, a Web server, by inundating it with service requests.
According to the research of David Marchette,\footnote{{\it Statistical
Methods for Network and Computer Security}, David J. Marchette, Naval
Surface Warfare Center,
\url{rion.math.iastate.edu/IA/2003/foils/marchette.pdf}.} attackers
choose uniformly distributed false IP addresses, a pattern not normally
seen at servers.

\subsection{The Normal (Gaussian) Family of Continuous Distributions}  
\label{normalfam}

These are the famous ``bell-shaped curves,'' so called because their
densities have that shape.\footnote{Note that other parametric families,
notably the Cauchy, also have bell shapes.  The difference lies in the
rate at which the tails of the distribution go to 0.  However, due to
the Central Limit Theorem, to be presented below, the normal family is
of prime interest.}

\subsubsection{Density and Properties}

{\bf Density and Parameters:}

The density for a normal distribution is

\begin{equation}
\label{nointegral}
f_W(t) = \frac{1}{\sqrt{2\pi} \sigma} ~ e^{- 0.5 \left (\frac{t-\mu}{\sigma}
\right )^2}, -\infty < t < \infty
\end{equation}

Again, this is a two-parameter family, indexed by the parameters $\mu$
and $\sigma$, which turn out to be the mean\footnote{Remember, this is a
synonym for expected value.} and standard deviation  $\mu$ and $\sigma$,
The notation for it is $N(\mu,\sigma^2)$ (it is customary to state
the variance $\sigma^2$ rather than the standard deviation).

\label{affine}
{\bf Closure Under Affine Transformation:}

The family is closed under affine transformations, meaning that
if X has the distribution $N(\mu,\sigma^2)$, then Y = cX + d has the
distribution $N(c\mu+d,c^2\sigma^2)$, i.e. Y too has a normal
distribution.  

Consider this statement carefully.  It is saying much more than simply
that Y has mean $c\mu + d$ and variance $c^2\sigma^2$, which would
follow from (\ref{affine}) {\it even if X did not have a normal
distribution}.  The key point is that this new variable Y is also a
member of the normal family, i.e. its density is still given by
(\ref{nointegral}), now with the new mean and variance.

Let's derive this.  For convenience, suppose $c > 0$.  Then

\begin{eqnarray}
F_Y(t) &=& P(Y \leq t)  ~~ (\textrm{definition of } F_Y)  \\ 
&=& P(cX + d \leq t) ~~ (\textrm{definition of Y}) \\
&=& P \left ( X \leq \frac{t-d}{c} \right ) ~~ (\textrm{algebra}) \\
&=& F_X \left (\frac{t-d}{c} \right )~~ (\textrm{definition of } F_X)
\label{xcdf}
\end{eqnarray}

Therefore

\begin{eqnarray}
f_Y(t) &=& \frac{d}{dt} F_Y(t) ~~ (\textrm{definition of } f_Y) \\
&=& \frac{d}{dt} F_X \left (\frac{t-d}{c} \right ) ~~ (\textrm{from } 
(\ref{xcdf})) \\      
&=& f_X \left ( \frac{t-d}{c} \right ) \cdot \frac{d}{dt} \frac{t-d}{c}
~~ (\textrm{definition of } f_X \textrm{ and the Chain Rule}) \\
&=& \frac{1}{c} \cdot 
\frac{1}{\sqrt{2\pi} \sigma} ~ e^{- 0.5 
\left (\frac{\frac{t-d}{c}-\mu}{\sigma} \right )^2} 
~~ (\textrm{from } (\ref{nointegral}) \\
&=& \frac{1}{\sqrt{2\pi} (c \sigma)} ~ 
e^{- 0.5 \left (\frac{t-(c\mu +d)}{c\sigma} \right )^2} ~~ (\textrm{algebra})
\end{eqnarray}

That last expression is the $N(c\mu+d,c^2\sigma^2)$ density, so we are
done!

{\bf Closure Under Independent Summation}

If X and Y are independent random variables, each having a normal
distribution, then their sum S = X + Y also is normally distributed.

This is a pretty remarkable phenomenon, not true for most other
parametric families.  If for instance X and Y each with, say, a U(0,1)
distribution, then the density of S turns out to be triangle-shaped, NOT
another uniform distribution.  (This can be derived using the methods of
Section \ref{convolution}.)

Note that if X and Y are independent and normally distributed, then the
two properties above imply that cX + dY will also have a normal
distribution, for any constants c and d.

{\bf Evaluating the Normal cdf}

The function in (\ref{nointegral}) does not have a closed-form
indefinite integral.  Thus probabilities involving normal random
variables must be approximated.  Traditionally, this is done with a
table for the cdf of N(0,1).  This one table is sufficient for the
entire normal family, because if X has the distribution
$N(\mu,\sigma^2)$ then

\begin{equation}
\frac{X-\mu}{\sigma}
\end{equation}

has a N(0,1) distribution too, due to the affine transformation closure
property discussed above.  

By the way, the N(0,1) cdf is traditionally denoted by $\Phi$.  
As noted, traditionally it has played a central role, as one could
transform any probability involving some normal distribution to an
equivalent probability involving N(0,1).  One would then use a table of
N(0,1) to find the desired probability.

Nowadays, probabilities for any normal distribution, not just N(0,1),
are easily available by computer.  In the R statistical package, the
normal cdf for any mean and variance is available via the function {\bf
pnorm()}.  The signature is

\begin{Verbatim}[fontsize=\relsize{-2}]
pnorm(q,mean=0,sd=1)
\end{Verbatim}

This returns the value of the cdf evaluated at {\bf q}, for a normal
distribution having the specified mean and standard deviation (default
values of 0 and 1).

We can use {\bf rnorm()} to simulate normally distributed random
variables.  The call is 

\begin{Verbatim}[fontsize=\relsize{-2}]
rnorm(n,mean=0,sd=1)
\end{Verbatim}

which returns a vector of {\bf n} random variates from the specified
normal distribution.

We'll use both methods in our first couple of examples below.

\subsubsection{Example:  Network Intrusion}
\label{netintrude}

As an example, let's look at a simple version of the network intrusion
problem.  Suppose we have found that in Jill's remote logins to a certain
computer, the number of disk sectors she reads or writes X has a normal
distribution has a mean of 500 and a standard deviation of 15.  Say our
network intrusion monitor finds that Jill---or someone posing as
her---has logged in and has read or written 535 sectors.  Should we be
suspicious?  

To answer this question, let's find $P(X \geq 535)$:  Let $Z =
(X-500)/15$.  From our discussion above, we know that Z has a N(0,1)
distribution, so

\begin{equation}
P(X \geq 535) = P \left (Z \geq \frac{535-500}{15} \right ) 
= 1 - \Phi(35/15) = 0.01
\end{equation}

Again, traditionally we would obtain that 0.01 value from a N(0,1) cdf
table in a book.  With R, we would just use the function {\bf pnorm()}:

\begin{Verbatim}[fontsize=\relsize{-2}]
> 1 - pnorm(535,500,15)
[1] 0.009815329
\end{Verbatim}

Anyway, that 0.01 probability makes us suspicious.  While it {\it could}
really be Jill, this would be unusual behavior for Jill, so we start to
suspect that it isn't her.  Of course, this is a very crude analysis,
and real intrusion detection systems are much more complex, but you can
see the main ideas here.

\subsubsection{Example:  Class Enrollment Size}

After years of experience with a certain course, a university has found
that online pre-enrollment in the course is approximately normally
distributed, with mean 28.8 and standard deviation 3.1.  Suppose that in
some particular offering, pre-enrollment was capped at 25, and it hit
the cap.  Find the probability that the actual demand for the course was
at least 30.

Note that this is a conditional probability!  Evaulate it as follows.
Let N be the actual deman.  Then the key point is that we are given that
$N \geq 25$, so

\begin{eqnarray}
P(N \geq 30 | N \geq 25) 
&=&  
\frac
{P(N \geq 30 \textrm{ and } N \geq 25)}
{P(N \geq 25)} ~~~~ ((\ref{genand})) \\
&=&  
\frac
{P(N \geq 30)}
{P(N \geq 25)} \\
&=& 
\frac
{1 -\Phi \left [ (30-28.8)/3.1 \right ] }
{1 -\Phi \left [ (25-28.8)/3.1 \right ] } \\
&=& 0.39
\end{eqnarray}

Sounds like it may be worth moving the class to a larger room before
school starts.

Since we are approximating a discrete random variable by a continuous
one, it might be more accurate here to use a {\bf correction for
continuity}, described in Section \ref{correctcontin}.

\subsubsection{The Central Limit Theorem}
\label{theclt}

The Central Limit Theorem (CLT) says, roughly speaking, that a random
variable which is a sum of many components will have an approximate
normal distribution.  So, for instance, human weights are approximately
normally distributed, since a person is made of many components.  The
same is true for SAT test scores,\footnote{This refers to the raw
scores, before scaling by the testing company.} as the total score is
the sum of scores on the individual problems.

There are many versions of the CLT.  The basic one requires that the
summands be independent and identically distributed:\footnote{A more
mathematically precise statement of the theorem is given in Section
\ref{formalclt}.}

\begin{theorem}

Suppose $X_1, X_2, ...$ are independent random variables, all having the
same distribution which has mean m and variance $v^2$.  Form the new
random variable $T = X_1+...+X_n$.  Then for large n, the distribution
of T is approximately normal with mean nm and variance $nv^2$.

\end{theorem}

The larger n is, the better the approximation, but typically n = 20 or
even n = 10 is enough.

\subsubsection{Example:  Cumulative Roundoff Error}

Suppose that computer roundoff error in computing the square roots of
numbers in a certain range is distributed uniformly on (-0.5,0.5), and
that we will be computing the sum of n such square roots.  Suppose we
compute a sum of 50 square roots.  Let's find the approximate
probability that the sum is more than 2.0 higher than it should be.
(Assume that the error in the summing operation is negligible compared
to that of the square root operation.)

Let $U_1,...,U_{50}$ denote the errors on the individual terms in the
sum.  Since we are computing a sum, the errors are added too, so our
total error is

\begin{equation}
T = U_1 + ... + U_{50}
\end{equation}

By the Central Limit Theorem, T has an approximately normal
distribution, with mean 50 EU and variance 50 Var(U), where U is a
random variable having the distribution of the $U_i$.  From Section
\ref{unifprops}, we know that 

\begin{equation}
EU = (-0.5+0.5) / 2 = 0, ~~ Var(U) = \frac{1}{12} [0.5-(-0.5)]^2 =
\frac{1}{12}
\end{equation}

So, the approximate distribution of T is N(0,50/12).  We can then use R to
find our desired probability:

\begin{lstlisting}
> 1 - pnorm(2,mean=0,sd=sqrt(50/12))
[1] 0.1635934
\end{lstlisting}

\subsubsection{Example:  Bug Counts}

As an example, suppose the number of bugs per 1,000 lines of code has a
Poisson distribution with mean 5.2.  Let's find the probability of
having more than 106 bugs in 20 sections of code, each 1,000 lines long.
We'll assume the different sections act independently in terms of bugs.

Here $X_i$ is the number of bugs in the i$^{th}$ section of code, and T
is the total number of bugs.  Since each $X_i$ has a Poisson
distribution, $m = v^2 = 5.2$.  So, T is approximately distributed
normally with mean and variance $20 \times 5.2$.  So, we can find the
approximate probability of having more than 106 bugs:

\begin{Verbatim}[fontsize=\relsize{-2}]
> pnorm(106,20*5.2,sqrt(20*5.2))
[1] 0.5777404
\end{Verbatim}

\subsubsection{Example:  Coin Tosses}
\label{correctcontin}

Binomially distributed random variables, though discrete, also are
approximately normally distributed.  Here's why:

Say T has a binomial distribution with n trials.  Then we  
can write T as a sum of indicator random variables (Section
\ref{indicator}):

\begin{equation}
T = T_1+...+T_n
\end{equation}

where $T_i$ is 1 for a success and 0 for a failure on the i$^{th}$
trial.  Since we have a sum of independent, identically distributed
terms, the CLT applies.  Thus we use the CLT if we have binomial
distributions with large n.

For example, let's find the approximate probability of getting more than
12 heads in 20 tosses of a coin.  X, the number of heads, has a binomial
distribution with n = 20 and p = 0.5  Its mean and variance are then
np = 10 and np(1-p) = 5.  So, let $Z = (X-10)/\sqrt{5}$, and write

\begin{equation}
\label{gt12}
P(X > 12) = P(Z > \frac{12-10}{\sqrt{5}}) 
\approx 1 - \Phi(0.894) = 0.186
\end{equation}

Or:

\begin{Verbatim}[fontsize=\relsize{-2}]
> 1 - pnorm(12,10,sqrt(5))
[1] 0.1855467
\end{Verbatim}

The exact answer is 0.132.  Remember, the reason we could do this was
that X is approximately normal, from the CLT.  This is an approximation
of the distribution of a discrete random variable by a continuous one,
which introduces additional error.

We can get better accuracy by using the {\bf correction of continuity},
which can be motivated as follows.  As an alternative to (\ref{gt12}),
we might write

\begin{equation}
P(X > 12) = P( X \geq 13) = P(Z > \frac{13-10}{\sqrt{5}}) 
\approx 1 - \Phi(1.342) = 0.090
\end{equation}

That value of 0.090 is considerably smaller than the 0.186 we got from
(\ref{gt12}).  We could ``split the difference'' this way:

\begin{equation}
P(X > 12) = P( X \geq 12.5) = P(Z > \frac{12.5-10}{\sqrt{5}}) 
\approx 1 - \Phi(1.118) = 0.132
\end{equation}

(Think of the number 13 ``owning'' the region between 12.5 and 13.5, 14
owning the part between 13.5 and 14.5 and so on.) Since the exact answer
to seven decimal places is 0.131588, the strategy has improved accuracy
substantially.  

The term {\it correction for continuity} alludes to the fact that we
are approximately a discrete distribution by a continuous one.

\subsubsection{Museum Demonstration}

Many science museums have the following visual demonstration of the CLT.

There are many balls in a chute, with a triangular array of r rows of pins
beneath the chute.  Each ball falls through the rows of pins, bouncing
left and right with probability 0.5 each, eventually being collected
into one of r bins, numbered 0 to r.  A ball will end up in bin i if it
bounces rightward in i of the r rows of pins, i = 0,1,...,r.  Key point:

\begin{quote}
Let X denote the bin number at which a ball ends up.  X is the number of
rightward bounces (``successes'') in r rows (``trials'').  Therefore X
has a binomial distribution with n = r and p = 0.5
\end{quote}

Each bin is wide enough for only one ball, so the balls in a bin will
stack up.  And since there are many balls, the height of the stack in
bin i will be approximately proportional to P(X = i).  And since the
latter will be approximately given by the CLT, the stacks of balls will
roughly look like the famous bell-shaped curve!

There are many online simulations of this museum demonstration, such as
\url{http://www.mathsisfun.com/data/quincunx.html}.  By collecting the
balls in bins, the apparatus basically simulates a histogram for $X$,
which will then be approximately bell-shaped.

\subsubsection{Optional topic:  Formal Statement of the CLT}
\label{formalclt}

\begin{definition}
A sequence of random variables $L_1, L_2, L_3,...$ {\bf converges in
distribution} to a random variable $M$ if

\begin{equation}
\lim_{n \rightarrow \infty} P(L_n \leq t) = P(M \leq t), \textrm{ for
all t}
\end{equation}
\end{definition}

Note by the way, that these random variables need not be defined on the
same probability space.

The formal statement of the CLT is:

\begin{theorem}
Suppose $X_1, X_2, ...$ are independent random variables, all having the
same distribution which has mean m and variance $v^2$.  Then

\begin{equation}
\label{cltquant}
Z = 
\frac
{X_1+...+X_n -n m}
{v \sqrt{n}}
\end{equation}

converges in distribution to a N(0,1) random variable.  

\end{theorem}

\subsubsection{Importance in Modeling}
\label{normalimp}

Needless to say, there are no random variable in the real world that are
exactly normally distributed.  In addition to our comments at the
beginning of this chapter that no real-world random variable has a
continuous distribution, there are no practical applications in which a
random variable is not bounded on both ends.  This contrasts with normal
distributions, which extend from $-\infty$ to $\infty$.

Yet, many things in nature do have approximate normal distributions,
normal distributions play a key role in statistics.  Most of the
classical statistical procedures assume that one has sampled from a
population having an approximate distributions.  This should come as no
surprise, knowing the CLT.  In addition, the CLT tells us in many of
these cases the quantities used for statistical estimation are
approximately normal, even if the data they are calculated from do not.

\subsection{The Chi-Square Family of Distributions}

\subsubsection{Density and Properties}

Let $Z_1, Z_2,..., Z_k$ be independent N(0,1) random variables.  Then
the distribution of 

\begin{equation}
Y = Z_1^2+...+Z_k^2
\end{equation}

is called {\bf chi-square with k degrees of freedom}.  We write such a
distribution as $\chi_k^2$.  Chi-square is a one-parameter family of
distributions, and arises quite frequently in statistical applications,
as will be seen in future chapters.

It turns out that chi-square is a special case of the gamma family in
Section \ref{gammafam} below, with r = k/2 and $\lambda = 0.5$.

The R functions {\bf dchisq()}, {\bf pchisq()}, {\bf qchisq()} and {\bf
rchisq()} give us the density, cdf, quantile function and random number
generator for the chi-squared family.  The second argument in each case
is the number of degrees of freedom.  The first argument is the argument
to the corresponding math function in all cases but {\bf rchisq()}, in
which it is the number of random variates to be generated.

For instance, to get the value of $f_X(5.2)$ for a chi-squared random
variable having 3 degrees of freedom, we make the following call:

\begin{lstlisting}
> dchisq(5.2,3)
[1] 0.06756878
\end{lstlisting}

\subsubsection{Example:  Error in Pin Placement}

Consider a machine that places a pin in the middle of a flat,
disk-shaped object. The placement is subject to error. Let $X$ and $Y$ be
the placement errors in the horizontal and vertical directions,
respectively, and let $W$ denote the distance from the true center to the
pin placement. Suppose $X$ and $Y$ are independent and have normal
distributions with mean 0 and variance 0.04. Let's find $P(W > 0.6)$. 

Since a distance is the square root of a sum of squares, this sounds
like the chi-squared distribution might be relevant.  So, let's first
convert the problem to one involving squared distance:

\begin{equation}
P(W > 0.6) = P(W^2 > 0.36)
\end{equation}

But $W^2 = X^2 + Y^2$, so

\begin{equation}
P(W > 0.6) = P(X^2 + Y^2 > 0.36)
\end{equation}

This is not quite chi-squared, as that distribution involves the sum of
squares of independent N(0,1) random variables.  But due to the normal
family's closure under affine transformations (page \pageref{affine}),
we know that X/0.2 and Y/0.2 do have N(0,1) distributions.  So write

\begin{equation}
P(W > 0.7) = P[(X/0.2)^2 + (Y/0.2)^2 > 0.36/0.2^2]
\end{equation}

Now evaluate the right-hand side:

\begin{lstlisting}
> 1 - pchisq(0.36/0.04,2)
[1] 0.01110900
\end{lstlisting}

\subsubsection{Importance in Modeling}

This distribution is used widely in statistical applications.  As will
be seen in our chapters on statistics, many statistical methods involve a
sum of squared normal random variables.\footnote{The motivation for the
term {\it degrees of freedom} will be explained in those chapters too.}

\subsection{The Exponential Family of Distributions}
\label{exponfam}

Please note:  We have been talking here of parametric families of
distributions, and in this section will introduce one of the most
famous, the family of exponential distributions.  This should not be
confused, though, with the term {\it exponential family} that arises in
mathematical statistics, which includes exponential distributions but is
much broader.

\subsubsection{Density and Properties}
\label{expon}

The densities in this family have the form

\begin{equation}
\label{expdens}
f_W(t) = \lambda e^{-\lambda t}, 0 < t < \infty
\end{equation}

This is a one-parameter family of distributions. 

After integration, one finds that $E(W) = \frac{1}{\lambda}$ and $Var(W)
= \frac{1}{\lambda^2}$.  You might wonder why it is customary to index
the family via $\lambda$ rather than $1/\lambda$ (see (\ref{expdens})),
since the latter is the mean.  But this is actually quite natural, for
the reason cited in the following subsection.

\subsubsection{R Functions}

Relevant functions for a uniformally  distributed random variable X 
with parameter lambda are

\begin{itemize}

\item {\bf pexp(q,lambda)}, to find $P(X \leq q)$

\item {\bf qexp(q,lambda)}, to find c such that $P(X \leq c) = q$

\item {\bf rexp(n,lambda)}, to generate n independent values of X

\end{itemize}

\subsubsection{Example:  Refunds on Failed Components}

Suppose a manufacturer of some electronic component finds that its
lifetime L is exponentially distributed with mean 10000 hours. They give a
refund if the item fails before 500 hours. Let $M$ be the number of items
they have sold, up to and including the one on which they make the first
refund. Let's find $EM$ and $Var(M)$.  

First, notice that $M$ has a geometric distribution!  It is the number of
independent trials until the first success, where a ``trial'' is one component,
``success'' (no value judgment, remember) is giving a refund, and the
success probability is

\begin{equation}
P(L < 500) = \int_{0}^{500} 0.0001 e^{-0.0001 t} ~ dt = 0.05
\end{equation}

Then plug p = 0.05 into (\ref{meangeom}) and (\ref{vargeom}). 

\subsubsection{Example:  Overtime Parking Fees}

A certain public parking garage charges parking fees of \$1.50 for the
first hour, and \$1 per hour after that.  Suppose parking times T are
exponentially distributed with mean 1.5 hours.  Let W denote the total
fee paid.  Let's find E(W) and Var(W).

The key point is that W is a function of T:

\begin{equation}
\label{ag}
W = 
\begin{cases}
1.5 T, & \text{if $T \leq 1$} \\
1.5 + 1 \cdot (T-1) = T + 0.5, & \text{if $T > 1$} 
\end{cases}
\end{equation}

That's good, because we know how to find the expected value of a
function of a continuous random variable, from (\ref{eofgofw}).
Defining g() as in (\ref{ag}) above, we have

\begin{equation}
EW = \int_{0}^{\infty} g(t) ~ \frac{1}{1.5} e^{-\frac{1}{1.5}t} dt
= \int_{0}^{1} 1.5 t ~ \frac{1}{1.5} e^{-\frac{1}{1.5}t} dt +
\int_{1}^{\infty} (t+0.5) ~ \frac{1}{1.5} e^{-\frac{1}{1.5}t} dt
\end{equation}

The integration is left to the reader.

Now, what about Var(W)?  As is often the case, it's easier to use
(\ref{varuformula}), so we need to find $E(W^2)$.  The above integration
becomes

\begin{equation}
E(W^2) = \int_{0}^{\infty} g^2(t) ~ \frac{1}{1.5} e^{-\frac{1}{1.5}t} dt
= \int_{0}^{1} 1.5^2 t ~ \frac{1}{1.5} e^{-\frac{1}{1.5}t} dt +
\int_{1}^{\infty} (t+0.5)^2 ~ \frac{1}{1.5} e^{-\frac{1}{1.5}t} dt
\end{equation}

After evaluating this, we subtract $(EW)^2$, giving us the variance of
W.

\subsubsection{Connection to the Poisson Distribution Family}
\label{connpoi}

Suppose the lifetimes of a set of light bulbs are independent and
identically distributed {\bf (i.i.d.)}, and consider the following
process.  At time 0, we install a light bulb, which burns an amount of
time $X_1$.  Then we install a second light bulb, with lifetime $X_2$.
Then a third, with lifetime $X_3$, and so on.  

Let 

\begin{equation}
\label{ti}
T_r = X_1+...+X_r
\end{equation}

denote the time of the $r^{th}$ replacement.  Also, let N(t) denote the
number of replacements up to and including time t.  Then it can be shown
that if the common distribution of the $X_i$ is exponentially
distributed, the N(t) has a Poisson distribution with mean $\lambda t$.
And the converse is true too:  If the $X_i$ are independent and
identically distributed and N(t) is Poisson, then the $X_i$ must have
exponential distributions.  In summary:

\begin{theorem}
Suppose $X_1, X_2, ...$ are i.i.d. nonnegative continuous random
variables.  Define

\begin{equation}
T_r = X_1+...+X_r
\end{equation}

and 

\begin{equation}
N(t) = \max\{k: T_k \leq t\}
\end{equation}

Then the distribution of N(t) is Poisson with parameter $\lambda t$ for
all t if and only the $X_i$ have an exponential distribution with
parameter $\lambda$.

\end{theorem}

In other words, N(t) will have a Poisson distribution if and only if the
lifetimes are exponentially distributed.  

\begin{proof}

{\it ``Only if'' part:}

The key is to notice that the event $X_1 > t$ is exactly equivalent to
$N(t) = 0$.  If the first light bulb has lasts longer than t, then the
count of burnouts at time t is 0, and vice versa.  Then

\begin{eqnarray}
\label{onlyif}
P(X_1 > t) &=& P[N(t) = 0] ~~ \textrm{(see above equiv.) } \\
&=& \frac{(\lambda t)^0}{0!} \cdot e^{-\lambda t} ~~ ((\ref{poispmf})\\
&=& e^{-\lambda t} 
\end{eqnarray}

Then

\begin{equation} 
f_{X_1}(t) = \frac{d}{dt} (1 - e^{-\lambda t}) =
\lambda e^{-\lambda t} 
\end{equation}

That shows that $X_1$ has an exponential distribution, and since the
$X_i$ are i.i.d., that implies that all of them have that distribution.

{\it ``If'' part:}

We need to show that if the $X_i$ are exponentially distributed with
parameter $\lambda$, then for u nonnegative and each positive integer k,

\begin{equation}
\label{poispmfcopy}
P[N(u) = k] = \frac{(\lambda t)^k e^{-\lambda t}}{k!}
\end{equation}

The proof for the case k = 0 just reverses (\ref{onlyif}) above.  The
general case, not shown here, notes that $N(u) \leq k$ is equivalent to
$T_{k+1} > u$.  The probability of the latter event can be found be
integrating (\ref{erldens}) from u to infinity.  One needs to perform
k-1 integrations by parts, and eventually one arrives at
(\ref{poispmfcopy}), summed from 1 to k, as required.

\end{proof}
  
The collection of random variables N(t) $t \geq 0$, is called a {\bf
Poisson process}.

% \begin{equation}
% \label{ntmean}
% E[N(t)] = \lambda t
% \end{equation}

The relation $E[N(t)] = \lambda t$ says that replacements are occurring
at an average rate of $\lambda$ per unit time.  Thus $\lambda$ is called
the {\bf intensity parameter} of the process.  It is because of this
``rate'' interpretation that makes $\lambda$ a natural indexing
parameter in (\ref{expdens}).

\subsubsection{Importance in Modeling}

Many distributions in real life have been found to be approximately
exponentially distributed.  A famous example is the lifetimes of air
conditioners on airplanes.  Another famous example is interarrival
times, such as customers coming into a bank or messages going out onto a
computer network.  It is used in software reliability studies too.

Exponential distributions are the only continous ones that are
``memoryless.''  This point is pursued in Chapter \ref{haz}.  Due to
this property, exponential distributions play a central role in Markov
chains (Chapter \ref{mar}).  

\subsection{The Gamma Family of Distributions}
\label{gammafam}

\subsubsection{Density and Properties}

Recall Equation (\ref{ti}), in which the random variable $T_r$ was
defined to be the time of the $r^{th}$ light bulb replacement.  $T_r$ is
the sum of r independent exponentially distributed random variables with
parameter $\lambda$.  The distribution of $T_r$ is called an {\bf Erlang}
distribution, with density

\begin{equation}
\label{erldens}
f_{T_r}(t) = \frac{1}{(r-1)!} \lambda^r t^{r-1} e^{-\lambda t}, ~ t > 0
\end{equation}

This is a two-parameter family.

Again, it's helpful to think in ``notebook'' terms.  Say r = 8.  Then we
watch the lamp for the durations of eight lightbulbs, recording $T_8$,
the time at which the eighth burns out.  We write that time in the first
line of our notebook.  Then we watch a new batch of eight bulbs, and
write the value of $T_8$ for those bulbs in the second line of our
notebook, and so on.  Then after recording a very large number of lines
in our notebook, we plot a histogram of all the $T_8$ values.  The point
is then that that histogram will look like (\ref{erldens}).

then

We can generalize this by allowing r to take noninteger values, by
defining a generalization of the factorial function:

\begin{equation}
\Gamma(r) = \int_{0}^{\infty} x^{r-1} e^{-x}  ~ dx
\end{equation}

This is called the gamma function, and it gives us the gamma family of
distributions, more general than the Erlang:

\begin{equation}
f_W(t) = \frac{1}{\Gamma(r)} \lambda^r t^{r-1} e^{-\lambda t}, ~ t > 0
\end{equation}

(Note that $\Gamma(r)$ is merely serving as the constant that makes the
density integrate to 1.0.  It doesn't have meaning of its own.)

This is again a two-parameter family, with r and $\lambda$ as
parameters.

A gamma distribution has mean $r/\lambda$ and variance $r/\lambda^2$.
In the case of integer r, this follows from (\ref{ti}) and the fact that
an exponentially distributed random variable has mean and variance
$1/\lambda$ and variance $1/\lambda^2$, and it can be derived in
general.  Note again that the gamma reduces to the exponential when r =
1.  

Recall from above that the gamma distribution, or at least the Erlang,
arises as a sum of independent random variables.  Thus the Central Limit
Theorem implies that the gamma distribution should be approximately
normal for large (integer) values of r.  We see in Figure \ref{gammas}
that even with r = 10 it is rather close to normal.

It also turns out that the chi-square distribution with d degrees of
freedom is a gamma distribution, with r = d/2 and $\lambda = 0.5$.

\subsubsection{Example:  Network Buffer}

Suppose in a network context (not our ALOHA example), a node does not
transmit until it has accumulated five messages in its buffer. Suppose
the times between message arrivals are independent and  exponentially
distributed with mean 100 milliseconds.  Let's find the probability that
more than 552 ms will pass before a transmission is made, starting with
an empty buffer. 

Let $X_1$ be the time until the first message arrives, $X_2$ the time
from then to the arrival of the second message, and so on.  Then the
time until we accumulate five messages is $Y = X_1+...+X_5$.  Then from
the definition of the gamma family, we see that Y has a gamma
distribution with r = 5 and $\lambda = 0.01$.  Then

\begin{equation}
P(Y > 552) = 
\int_{552}^{\infty}
\frac{1}{4!} 0.01^5 t^4 e^{-0.01t} ~ dt
\end{equation}

This integral could be evaluated via repeated integration by parts, but
let's use R instead:

\begin{Verbatim}[fontsize=\relsize{-2}]
> 1 - pgamma(552,5,0.01)
[1] 0.3544101
\end{Verbatim}

\subsubsection{Importance in Modeling}

As seen in (\ref{ti}), sums of exponentially distributed random
variables often arise in applications.  Such sums have gamma
distributions.

You may ask what the meaning is of a gamma distribution in the case of
noninteger r.  There is no particular meaning, but when we have a real
data set, we often wish to summarize it by fitting a parametric family
to it, meaning that we try to find a member of the family that
approximates our data well.

In this regard, the gamma family provides us with densities which rise
near t = 0, then gradually decrease to 0 as t becomes large, so the
family is useful if our data seem to look like this.  Graphs of some
gamma densities are shown in Figure \ref{gammas}.

\begin{figure}
\centerline{
\includegraphics[width=5.0in]{Gamma.pdf}
}
\caption{Various Gamma Densities}
\label{gammas}
\end{figure}

\subsection{The Beta Family of Distributions}
\label{betafamily}

As seen in Figure \ref{gammas}, the gamma family is a good choice to
consider if our data are nonnegative, with the density having
a peak near 0 and then gradually tapering off to the right.  What about
data in the range (0,1)?  The beta family provides a very flexible
model for this kind of setting, allowing us to model many different
concave up or concave down curves.

The densities of the family have the following form:

\begin{equation}
\frac{\Gamma(\alpha+\lambda)}{\Gamma(\alpha) \Gamma(\lambda)}
(1-t)^{\alpha-1}t^{\beta-1}
\end{equation}

There are two parameters, $\alpha$ and $\beta$.  Here are two
possibilities.

\includegraphics[width=4in]{Beta.pdf}

% bt <- function(t) {
%    b <- gamma(alpha+beta) / (gamma(alpha) * gamma(beta)) 
%    b <- b * (1-t)^(alpha-1) * t^(beta-1)
% }
% 
% alpha <- 2
% beta <- 2
% curve(bt,0,1)
% 
% alpha <- 0.5
% beta <- 0.5
% curve(bt,0,1,add=TRUE)

The mean and variance are

\begin{equation}
\frac{\alpha}{\alpha+\beta}
\end{equation}

and

\begin{equation}
\frac
{\alpha \beta}
{(\alpha+\beta)^2 (\alpha+\beta+1)}
\end{equation}

\section{Choosing a Model}

The parametric families presented here are often used in the real world.
As indicated previously, this may be done on an empirical basis.  We
would collect data on a random variable X, and plot the frequencies of
its values in a histogram.  If for example the plot looks roughly like
the curves in Figure \ref{gammas}, we could choose this as the family
for our model.

Or, our choice may arise from theory.  If for instance our knowledge of
the setting in which we are working says that our distribution is
memoryless, that forces us to use the exponential density family.

In either case, the question as to which member of the family we choose
to will be settled by using some kind of procedure which finds the
member of the family which best fits our data.  We will discuss this in
detail in our chapters on statistics, especially Chapter \ref{chap:mod}.

Note that we may choose not to use a parametric family at all.  We may
simply find that our data does not fit any of the common parametric
families (there are many others than those presented here) very well.
Procedures that do not assume any parametric family are termed {\bf
nonparametric}.

\section{A General Method for Simulating a Random Variable}
\label{genrannumgen}

Suppose we wish to simulate a random variable X with cdf $F_{X}$ for
which there is no R function. This can be done via $F_X^{-1}(U)$, where
U has a U(0,1) distribution.  In other words, we call {\bf runif()} and
then plug the result into the inverse of cdf of X. Here ``inverse" is in
the sense that, for instance, squaring and ``square-rooting," exp() and
ln(), etc. are inverse operations of each other.

For example, say X has the density 2t on (0,1). Then $F_X(t) = t^2$, so
$F^{-1}(s) = s^{0.5}$.  We can then generate X in R as {\bf
sqrt(runif(1))}.  Here's why:

For brevity, denote $F_X^{-1}$ as G and $F_X$ as H.  Our generated
random variable is G(U).  Then

\begin{eqnarray}
P[G(U)\leq t] \nonumber \\
& = & P[U\leq G^{-1}(t)] \nonumber \\
& = & P[U\leq H(t)] \nonumber \\
& = & H(t)
\end{eqnarray}

In other words, the cdf of G(U) is $F_X$!  So, G(U) has the same
distribution as X.

Note that this method, though valid, is not necessarily practical, since
computing $F_X^{-1}$ may not be easy.

\section{``Hybrid'' Continuous/Discrete Distributions}
\label{mixed}

A random variable could have a distribution that it partly discrete and
partly continuous.  Recall our first example, from Section \ref{dart},
in which D is the position that a dart hits when thrown at the interval
(0,1).  Suppose our measuring instrument is broken, and registers any
value of D past 0.8 as being equal to 0.8.  Let W denote the actual
value recorded by this instrument.  

Then P(W = 0.8) = 0.2, so W is not a continuous random variable, in
which every point has mass 0.  On the other hand, P(W = t) = 0 for every
t before 0.8, so W is not discrete either.

In the advanced theory of probability, some very odd mixtures, beyond
this simple discrete/continuous example, can occur, though primarily of
theoretical interest.

\startproblemset

\oneproblem
Fill in the blanks, in the following statements about
continuous random variables.  Make sure to use our book's notation.

\begin{itemize}

\item [(a)] $\frac{d}{dt} P(X \leq t) = 
\textrm{ \_\_\_\_\_\_\_\_\_\_\_\_\_\_}$

\item [(b)] $P(a < X < b) =  
\textrm{ \_\_\_\_\_\_\_\_\_\_\_\_\_\_} -
\textrm{ \_\_\_\_\_\_\_\_\_\_\_\_\_\_}$

\end{itemize}

\oneproblem
Suppose $X$ has a uniform distribution on (-1,1), and let $Y = X^2$.
Find $f_Y$.

\oneproblem
In the network intrusion example in Section \ref{netintrude}, suppose 
X is not normally distributed, but instead has a uniform distribution on
(450,550).  Find $P(X \geq 535)$ in this case.

\oneproblem
Suppose X has an exponential distribution with parameter
$\lambda$.  Show that $EX = 1/\lambda$ and $Var(X) = 1/\lambda^2$.

\oneproblem
Suppose $f_X(t) = 3t^2$ for t in (0,1) and is zero elsewhere.  Find 
$F_X(0.5)$ and $E(X)$.

\oneproblem
Suppose light bulb lifetimes X are exponentially distributed
with mean 100 hours.  

\begin{itemize}

\item [(a)] Find the probability that a light bulb burns out before
25.8 hours.

\end{itemize}

In the remaining parts, suppose we have two light bulbs.  We install the
first at time 0, and then when it burns out, immediately replace it with
the second.

\begin{itemize}

\item [(b)] Find the probability that the first light bulb lasts less
than 25.8 hours and the lifetime of the second is more than 120 hours.

\item [(c)] Find the probability that the second burnout occurs after
time 192.5.

\end{itemize}

\oneproblem
Suppose for some continuous random variable X, $f_X(t)$ is equal to 2(1-t)
for t in (0,1) and is 0 elsewhere.

\begin{itemize}

\item [(a)] Why is the constant here 2? Why not, say, 168? 

\item [(b)] Find $F_X(0.2)$ and Var(X). 

\item [(c)] Using the method in Section \ref{genrannumgen}, write an 
R function, named {\bf oneminust()}, that generates a random variate
sampled from this distribution. Then use this function to verify your 
answers in (b) above. 

\end{itemize}

\oneproblem
The company Wrong Turn Criminal Mismanagement makes predictions every
day. They tend to err on the side of overpredicting, with the error
having a uniform distribution on the interval (-0.5,1.5). Find the
following: 

\begin{itemize}

\item [(a)] The mean and variance of the error. 

\item [(b)] The mean of the absolute error.

\item [(c)] The probability that exactly two errors are greater than 0.25 in
absolute value, out of 10 predictions. Assume predictions are
independent. 

\end{itemize}


\oneproblem
``All that glitters is not gold,'' and not every bell-shaped density is
normal. The family of Cauchy distributions, having density

\begin{equation}
f_X(t) = \frac{1}{\pi c}
\frac{1}{1 + (\frac{t-b}{c})^2}, ~ \infty < t < \infty
\end{equation}

is bell-shaped but definitely not normal.  

Here the parameters b and c correspond to mean and standard deviation in
the normal case, but actual neither the mean nor standard deviation
exist for Cauchy distributions. The mean's failure to exist is due to
technical problems involving the theoretical definition of integration.
In the case of variance, it does not exist because there is no mean, but
even more significantly, $E[(X-b)^2] = \infty$.  

However, a Cauchy distribution does have a median, b, so we'll use that
instead of a mean. Also, instead of a standard deviation, we'll use as
our measure of dispersion the interquartile range, defined (for any
distribution) to be the difference between the 75th and 25th
percentiles.

We will be investigating the Cauchy distribution that has b = 0 and c = 1.

\begin{itemize}

\item [(a)] Find the interquartile range of this Cauchy distribution. 

\item [(b)] Find the normal distribution that has the same median and interquartile range as this Cauchy distribution. 

\item [(c)] Use R to plot the densities of the two distributions on the same
graph, so that we can see that they are both bell-shaped, but different. 

\end{itemize}

\oneproblem Consider the following game. A dart will hit the random
point $Y$ in (0,1) according to the density $f_Y(t) = 2t$. You must
guess the value of $Y$.  (Your guess is a constant, not random.) You
will lose \$2 per unit error if Y is to the left of your guess, and will
lose \$1 per unit error on the right. Find best guess in terms of
expected loss.

\oneproblem
Fill in the blank:  Density functions for continuous random
variables are analogs of the
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
functions that are used for discrete random variables.

\oneproblem
Suppose for some random variable W, $F_W(t) = t^3$ for $0 <
t < 1$, with $F_W(t)$ being 0 and 1 for $t < 0$ and $t > 0$, respectively.
Find $f_W(t)$ for $0 < t < 1$.

\oneproblem
Suppose X has a binomial distribution with parameters n and
p.  Then X is approximately normally distributed with mean np and
variance np(1-p).  For each of the following, answer either A or E, for
``approximately'' or ``exact,'' respectively:

\begin{itemize}

\item [(a)] the distribution of X is normal

\item [(b)] E(X) is np

\item [(c)] Var(X) is np(1-p)

\end{itemize}

\oneproblem
Consider the density $f_Z(t) = 2t/15$ for $1 < t < 4$ and 0 elsewhere.
Find the median of Z, as well as Z's third moment, $E(Z^3)$, and its
third central moment, $E[(Z-EZ)^3]$.

\oneproblem
Suppose X has a uniform distribution on the interval
(20,40), and we know that X is greater than 25.  What is the probability
that X is greater than 32?

\oneproblem
Suppose U and V have the 2t/15 density on (1,4).  Let N
denote the number of values among U and V that are greater than 1.5,
so N is either 0, 1 or 2.  Find Var(N).

\oneproblem
Find the value of $E(X^4)$ if X has an N(0,1) distribution.  (Give your
answer as a number, not an integral.)


