\chapter{General Statistical Estimation and Inference}
\label{chap:est} 

In the last chapter, we often referred to certain estimators as being
``natural.''  For example, if we are estimating a population mean, an
obvious choice of estimator would be the sample mean.  But in many
applications, it is less clear what a ``natural'' estimate for a
parameter of interest would be.\footnote{Recall that we are using the
term {\it parameter} to mean any population quantity, rather an an index
into a parametric family of distributions.}  We will present general
methods for estimation in this section.

We will also discuss advanced methods of inference.

\section{General Methods of Parametric Estimation}
\label{genest}

Let's begin with a simple motivating example.

\subsection{Example:  Guessing the Number of Raffle Tickets Sold}
\label{raffle}

You've just bought a raffle ticket, and find that you have ticket number
68.  You check with a couple of friends, and find that their numbers are
46 and 79.  Let c be the total number of tickets.  How should we
estimate c, using our data 68, 46 and 79?

It is reasonable to assume that each of the three of you is equally
likely to get assigned any of the numbers 1,2,...,c.  In other words,
the numbers we get, $X_i$, i = 1,2,3 are uniformly distributed on the
set \{1,2,...,c\}.  We can also assume that they are independent; that's
not exactly true, since we are sampling without replacement, but for
large c---or better stated, for n/c small---it's close enough.  

So, we are assuming that the $X_i$ are independent and identically
distributed---famously written as {\bf i.i.d.} in the statistics
world---on the set \{1,2,...,c\}.  How do we use the $X_i$ to estimate
c?

\subsection{Method of Moments}
\label{mome}

One approach, an intuitive one, would be to reason as follows.  Note
first that 

\begin{equation}
\label{exinc}
E(X) = \frac{c+1}{2}
\end{equation}

Let's solve for c:

\begin{equation}
\label{cinex}
c = 2 EX - 1
\end{equation}

We know that we can use

\begin{equation}
\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\end{equation}

to estimate EX, so by (\ref{cinex}), $2 \overline{X} - 1$ is an
intuitive estimate of c.  Thus we take our estimator for c to be

\begin{equation}
\label{cinexhat}
\widehat{c} = 2 \overline{X} - 1
\end{equation}

This estimator is called the Method of Moments estimator of c.

% \checkpoint

Let's step back and review what we did:

\begin{itemize}

\item We wrote our parameter as a function of the population mean EX of
our data item X.  Here, that resulted in (\ref{cinex}).

\item In that function, we substituted our sample mean $\overline{X}$ for EX,
and substituted our estimator $\widehat{c}$ for the parameter c, yielding
(\ref{cinexhat}).  We then solved for our estimator.

\end{itemize}

We say that an estimator $\widehat{\theta}$ of some parameter $\theta$ is
{\bf consistent} if 

\begin{equation}
\lim_{{n} \rightarrow \infty} \widehat{\theta} = \theta
\end{equation}

where n is the sample size.  In other words, as the sample size grows,
the estimator eventually converges to the true population value.

Of course here $\overline{X}$ is a consistent estimator of EX .  Thus you can
see from (\ref{cinex}) and (\ref{cinexhat}) that $\widehat{c}$ is a
consistent estimator of c.  In other words, the Method of Moments
generally gives us consistent estimators.

What if we have more than one parameter to estimate?  We generalize what
we did above:

\begin{itemize}

\item Suppose we are estimating a parametric distribution with
parameters $\theta_1,...,\theta_r$.  

\item Let $\eta_i$ denote the i$^{th}$ {\bf moment} of X, $E(X^i)$.

\item For i = 1,...,r we write $\eta_i$ as a function $g_i$ of all the
$\theta_k$.

\item For i = 1,...,r set

\begin{equation}
\widehat{\eta_i} = \frac{1}{n} \sum_{j=1}^n X_j^i
\end{equation}

\item Substitute the $\widehat{\theta_k}$ in the $g_i$ and then solve for
them.

\end{itemize}

In the above example with the raffle, we had r = 1, $\theta_1 = c$,
$g_1(c) = (c+1)/2$ and so on.  A two-parameter example will be given
below.

\subsection{Method of Maximum Likelihood}
\label{mle}

Another method, much more commonly used, is called the {\bf Method of
Maximum Likelihood}.  In our example above, it means asking the
question, ``What value of c would have made our data---68, 46, 79---most
likely to happen?''  Well, let's find what is called the {\bf
likelihood}, i.e. the probability of our particular data values occurring:

\begin{equation}
\label{1stlike}
L = P(X_1 = 68, X_2 = 46, X_3 = 79) = 
\begin{cases} 
(\frac{1}{c})^3, & \text{if $c \geq 79$} \\ 
0, & \text{otherwise} \\ 
\end{cases} 
\end{equation}

Now keep in mind that c is a fixed, though unknown constant.  It is not
a random variable.  What we are doing here is just asking ``What if''
questions, e.g. ``If c were 85, how likely would our data be?  What
about c = 91?''

Well then, what value of c maximizes (\ref{1stlike})?  Clearly, it is c
= 79.  Any smaller value of c gives us a likelihood of 0.  And for c
larger than 79, the larger c is, the smaller (\ref{1stlike}) is.  So,
our maximum likelihood estimator (MLE) is 79.  In general, if our sample
size in this problem were n, our MLE for c would be

\begin{equation}
\check{c} = \max_i X_i
\end{equation}

\subsection{Example:  Estimation the Parameters of a Gamma Distribution}
\label{gammamle}

As another example, suppose we have a random sample $X_1,...,X_n$ from 
a gamma distribution.

\begin{equation}
\label{gamma}
f_X(t) = \frac{1}{\Gamma(c)} \lambda^c t^{c-1} e^{-\lambda t}, ~ t > 0
\end{equation}

for some unknown $c$ and $\lambda$.  How do we estimate $c$ and
$\lambda$ from the $X_i$?

\subsubsection{Method of Moments}

Let's try the Method of Moments, as follows.  We have two population
parameters to estimate, c and $\lambda$, so we need to involve two
moments of X.  That could be EX and $E(X^2)$, but here it would more
conveniently be EX and Var(X).  We know from our previous unit on
continuous random variables, Chapter \ref{chap:contin}, that 

\begin{equation}
EX = \frac{c}{\lambda}
\end{equation}

\begin{equation}
Var(X) = \frac{c}{\lambda^2}
\end{equation}

In our earlier notation, this would be r = 2, $\theta_1 = c$ , $\theta_2
= \lambda$ and $g_1(c,\lambda) = c/\lambda$ and $g_2(c,\lambda) =
c/\lambda^2$.

Switching to sample analogs and estimates, we have

\begin{equation}
\frac{\widehat{c}}{\widehat{\lambda}} = \overline{X}
\end{equation}

\begin{equation}
\frac{\widehat{c}}{\widehat{\lambda} ^ 2} = s^2
\end{equation}

Dividing the two quantities yields

\begin{equation}
\widehat{\lambda} = \frac{\overline{X}}{s^2}
\end{equation}

which then gives

\begin{equation}
\widehat{c} = \frac{\overline{X}^2}{s^2}
\end{equation}

\subsubsection{MLEs}

What about the MLEs of c and $\lambda$?  Remember, the $X_i$ are
continuous random variables, so the likelihood function, i.e. the analog
of (\ref{1stlike}), is the product of the density values:

\begin{eqnarray}
L &=& 
   \Pi_{i=1}^n 
   \left [
   \frac{1}{\Gamma(c)} \lambda^c {X_i}^{c-1} e^{-\lambda {X_i}} 
   \right ] \\ 
&=& [\lambda^c/\Gamma(c)]^n 
\left ( \Pi_{i=1}^n X_i \right )^{c-1}
e^{-\lambda \sum_{i=1}^n X_i}
\end{eqnarray}

In general, it is usually easier to maximize the log likelihood
(and maximizing this is the same as maximizing the original likelihood):

\begin{equation}
\label{2ndlike}
l = (c-1) \sum_{i=1}^n \ln(X_i) - \lambda \sum_{i=1}^n X_i 
+ nc \ln(\lambda) - n \ln(\Gamma(c))
\end{equation}

One then takes the partial derivatives of (\ref{2ndlike}) with respect
to c and $\lambda$, and sets the derivatives to zero.  The solution
values, $\check{c}$ and $\check{\lambda}$, are then the MLEs of c and
$\lambda$.  Unfortunately, in this case, these equations do not have
closed-form solutions.  So the equations must be solved numerically.
(In fact, numerical methods are needed even more in this case, because
finding the derivative of $\Gamma(c)$ is not easy.)  

\subsubsection{R's mle() Function}

R provides a function, {\bf mle()}, for finding MLEs in mathematically
intractable situations such as the one in the last section.  Here's an
example in the that context.  We'll simulate some data from a gamma
distribution with given parameter values, then pretend we don't know
those, and find the MLEs from the data:

\begin{Verbatim}[fontsize=\relsize{-2}]
x <- rgamma(100,shape=2)  # Erlang, r = 2
n <- length(x)

ll <- function(c,lambda) {
   loglik <- (c-1) * sum(log(x)) - sum(x)*lambda + n*c*log(lambda) -
      n*log(gamma(c))
   return(-loglik)
}

summary(mle(minuslogl=ll,start=list(c=2,lambda=2)))
Maximum likelihood estimation

Call:
mle(minuslogl = ll, start = list(c = 1, lambda = 1))

Coefficients:
       Estimate Std. Error
c      1.993399  0.1770996
lambda 1.027275  0.1167195

-2 log L: 509.8227 
\end{Verbatim}

How did this work?  The main task we have is to write a function that
calculates negative the log likelihood, with that function's arguments
will be the parameters to be estimated.  (Note that in R, {\bf log()}
calculates the natural logarithm by default.)  Fortunately for us, {\bf
mle()} calculates the derivatives numerically too, so we didn't need to
specify them in the log likelihood function.  (Needless to say, this
function thus cannot be used in a problem in which derivatives cannot be
used, such as the lottery example above.)

We also need to supply {\bf mle()} with initial guesses for the
parameters.  That's done in the {\bf start} argument.  I more or less
arbitrarily chose 1.0 for these values.  You may have to experiment,
though, as some sets of initial values may not result in convergence.

The standard errors of the estimated parameters are also printed out,
enabling the formation of confidence intervals and significance tests.
See for instance Section \ref{stderrest}.  In fact, you can get the
estimated covariance matrix for the vector of estimated parameters.  In
our case here:

\begin{Verbatim}[fontsize=\relsize{-2}]
> mleout <- mle(minuslogl=ll,start=list(c=2,lambda=2))
Warning messages:
1: In log(lambda) : NaNs produced
2: In log(lambda) : NaNs produced
3: In log(lambda) : NaNs produced
> solve(mleout@details$hessian)
                c     lambda
c      0.08434476 0.04156666
lambda 0.04156666 0.02582428
\end{Verbatim}

By the way, there were also some warning messages, due to the fact that
during the iterative maximization process, some iterations generated
guesses for $\check{\lambda}$ were 0 or near it, causing problems with
{\bf log()}.

\subsection{More Examples}

Suppose $f_W(t) = ct^{c-1}$ for t in (0,1), with the density
being 0 elsewhere, for some unknown $c > 0$.  We have a random sample
$W_1,...,W_n$ from this density.  

Let's find the Method of Moments estimator.

\begin{equation}
EW = \int_{0}^{1} t ct^{c-1}~ dt = \frac{c}{c+1}
\end{equation}

So, set

\begin{equation}
\overline{W} = \frac{\widehat{c}}{\widehat{c}+1}
\end{equation}

yielding

\begin{equation}
\label{chatmom}
\widehat{c} = \frac{\overline{W}}{1-\overline{W}}
\end{equation}

What about the MLE?

\begin{equation}
L = \Pi_{i=1}^n c W_i^{c-1} 
\end{equation} 

so

\begin{equation}
l = n \ln{c} + (c-1) \sum_{i=1}^{n} \ln{W_i}
\end{equation}

Then set

\begin{equation}
0 = \frac{n}{\widehat{c}} + \sum_{i=1}^{n} \ln{W_i}
\end{equation}

and thus

\begin{equation}
\label{widehatc}
\widehat{c} = -\frac{1}{\frac{1}{n} \sum_{i=1}^{n} \ln{W_i}}
\end{equation}

% Note that we don't need the delta method to find a confidence interval
% here, as we can rewrite (\ref{chatmom}) as
% 
% \begin{equation}
% \widehat{c} = \frac{1}{\frac{1}{\overline{W}}-1}
% \end{equation}

As in Section \ref{mle}, not every MLE can be determined by taking
derivatives.  Consider a continuous analog of the example in that
section, with $f_W(t) = \frac{1}{c}$ on (0,c), 0 elsewhere, for some
$c > 0$. 

The likelihood is

\begin{equation}
\left ( \frac{1}{c} \right )^n
\end{equation}

as long as

\begin{equation}
c \geq \max_i W_i
\end{equation}

and is 0 otherwise.  So,

\begin{equation}
\label{unifmle}
\widehat{c} = \max_i W_i
\end{equation}

as before.

Now consider a different problem.  Suppose the random variable X is
equal to 1, 2 and 3, with probabilities c, c and 1-2c.  The value c is
thus a population parameter.  We have a random sample $X_1,...,X_n$ from
this population.  Let's find the Method of Moments Estimator of c, and
its bias.

First,

\begin{equation}
EX = c \cdot 1 + c \cdot 2 + (1-2c) \cdot 3 = 3 - 3c
\end{equation}

Thus 

\begin{equation}
c = (3-EX)/3
\end{equation}

and so set

\begin{equation}
\hat{c} = (3-\overline{X})/3
\end{equation}

Next,

\begin{eqnarray}
E \hat{c} &=& E[(3-\overline{X})/3) \\ 
&=& \frac{1}{3} \cdot (3 - E\overline{X}) \\
&=& \frac{1}{3} [ 3 - EX ] \\
&=& \frac{1}{3} [ 3 - (3-3c) ] \\
&=& c
\end{eqnarray}

So, the bias is 0.


\subsection{What About Confidence Intervals?}

Usually we are not satisfied with simply forming estimates (called {\bf
point estimates}).  We also want some indication of how accurate these
estimates are, in the form of confidence intervals ({\bf interval
estimates}).

In many special cases, finding confidence intervals can be done easily
on an {\it ad hoc} basis.  Look, for instance, at the Method of Moments
Estimator in Section \ref{mome}.  Our estimator (\ref{cinexhat}) is a
linear function of $\overline{X}$, so we easily obtain a confidence
interval for $c$ from one for $EX$.

Another example is (\ref{widehatc}).  Taking the limit as $n \rightarrow
\infty$ the equation shows us (and we could verify) that 

\begin{equation}
\label{cintermsofw}
c = \frac{1}{E[\ln{W}]}
\end{equation}

Defining $X_i = \ln{W_i}$ and $\overline{X} = (X_1+...+X_n)/$, we can
obtain a confidence interval for $EX$ in the usual way.  We then see
from (\ref{cintermsofw}) that we can form a confidence interval for $c$
by simply taking the reciprocal of each endpoint of the interval, and
swapping the left and right endpoints.

What about in general?  For the Method of Moments case, our estimators
are functions of the sample moments, and since the latter are formed
from sums and thus are asymptotically normal, the delta method (Section
\ref{delta}) can be used to show that our estimators are asymptotically
normal and to obtain asymptotic variances for them.

There is a well-developed asymptotic theory for MLEs, which under
certain conditions shows asymptotic normality with a certain asymptotic
variance, thus enabling confidence intervals.  The theory also
establishes that MLEs are in a certain sense optimal among all
estimators.  We will not pursue this here, but will note that {\bf mle()}
does give standard errors for the estimates, thus enabling the
formation of confidence intervals.

\section{Bias and Variance}
\label{bv}

The notions of {\bf bias} and {\bf variance} play central roles in the
evaluation of goodness of estimators.

\subsection{Bias}

\begin{definition}
Suppose $\hat{\theta}$ is an estimator of $\theta$.
Then the {\bf bias} of $\hat{\theta}$ is 

\begin{equation}
\textrm{bias} = E(\hat{\theta}) - \theta
\end{equation}

If the bias is 0, we say that the estimator is {\bf unbiased}.
\end{definition}

It's very important to note that, in spite of the perjorative-sounding
name, bias is not an inherently bad property for an estimator to have.
Indeed, most good estimators are at least slightly biased.  We'll
explore this in the next section.

\subsection{Why Divide by n-1 in $s^2$?}  
\label{whynot}

It should be noted that it is customary in (\ref{s2}) to divide by n-1
instead of n, for reasons that are largely historical.  Here's the
issue:

If we divide by n, as we have been doing, then it turns out that
$s^2$ is biased.

\begin{equation}
\label{s2biased}
E(s^2) = \frac{n-1}{n} \cdot \sigma^2
\end{equation}

Think about this in the Davis people example, once again in the notebook
context.  Remember, here n is 1000, and each line of the notebook
represents our taking a different random sample of 1000 people.  Within
each line, there will be entries for  $W_1$ through $W_{1000}$, the
weights of our 1000 people, and for $\overline{W}$ and $s$.  For convenience,
let's suppose we record that last column as $s^2$ instead of $s$.

Now, say we want to estimate the population variance $\sigma^2$.  As
discussed earlier, the natural estimator for it would be the sample
variance, $s^2$.  What (\ref{s2biased}) says is that after looking at an
infinite number of lines in the notebook, the average value of $s^2$
would be just...a...little...bit...too...small.  All the $s^2$ values
would average out to $0.999 \sigma^2$, rather than to $\sigma^2$.  We
might say that $s^2$ has a little bit more tendency to underestimate 
$\sigma^2$ than to overestimate it.

So, (\ref{s2biased}) implies that $s^2$ is a biased estimator of the
population variance $\sigma^2$, with the amount of bias being

\begin{equation}
\frac{n-1}{n} \cdot \sigma^2 - \sigma^2 = -\frac{1}{n} \cdot \sigma^2
\end{equation}

Let's prove (\ref{s2biased}).  As before, let W be a random variable
distributed as the population, and let $W_1,...,W_n$ be a random sample
from that population.  So, $EW_i = \mu$ and $Var(W_i) - \sigma^2$, where
again $\mu$ and $\sigma^2$ are the population mean and variance.

It will be more convenient to work with $ns^2$ than $s^2$, since it will
avoid a lot of dividing by n.  So, write

\begin{eqnarray}
ns^2 &=& \sum_{i=1}^n (W_i - \overline{W})^2 ~~~~ \textrm{(def.)} \\
&=& \sum_{i=1}^n 
\left [
(W_i - \mu) + (\mu - \overline{W})
\right ]^2 ~~~~ \textrm{(alg.)} \\
&=& 
\sum_{i=1}^n (W_i - \mu)^2 +
2 (\mu - \overline{W}) \sum_{i=1}^n (W_i - \mu) +
n (\mu - \overline{W})^2 ~~~~ \textrm{(alg.)}
\end{eqnarray}

But that middle sum is 

\begin{equation}
\sum_{i=1}^n (W_i - \mu) =
\sum_{i=1}^n W_i - n \mu =
n \overline{W} - n \mu
\end{equation}

So,

\begin{equation}
\label{intermed}
ns^2 = \sum_{i=1}^n (W_i - \mu)^2  - n (\overline{W} - \mu)^2 
\end{equation}

Now let's take the expected value of (\ref{intermed}).  First,

\begin{eqnarray}
E \left (
\sum_{i=1}^n (W_i - \mu)^2
\right )
&=& \sum_{i=1}^n E[(W_i - \mu)^2] ~~~~ \textrm{(E is lin.)} \\ 
&=& \sum_{i=1}^n E[(W_i - EW_i)^2] ~~~~ (W_i \textrm{ distr. as pop.}) \\ 
&=& \sum_{i=1}^n Var(W_i) ~~~~ \textrm{(def. of Var())} \\ 
&=& \sum_{i=1}^n \sigma^2 ~~~~ (W_i \textrm{ distr. as pop.}) \\ 
&=& n \sigma^2
\end{eqnarray}

Also,

\begin{eqnarray}
E[(\overline{W} - \mu)^2] &=& 
   E[(\overline{W} - E\overline{W})^2] ~~~~ ((\ref{barmean})) \\ 
&=& Var(\overline{W}) ~~~~ \textrm{ (def. of Var())} \\
&=& \frac{\sigma^2}{n} ~~~~ (\ref{oneovernpopvar})
\end{eqnarray}

Applying these last two findings to (\ref{intermed}), we get
(\ref{s2biased}).

% Write the first term as
% 
% \begin{eqnarray}
% E \left ( \frac{1}{n}\sum_{i=1}^n W_i^2 \right ) &=&  
% \frac{1}{n}E \sum_{i=1}^n W_i^2 ~~ \textrm{  (constants factor out of E())} \\
% &=& \frac{1}{n} \cdot n E(W^2) ~~ \textrm{  (each $W_i$ has distr. of W)} \\
% &=& E(W^2) \\
% &=& Var(W) + (EW)^2 \\
% &=& \sigma^2 + \mu^2
% \end{eqnarray}
% 
% Continuing to work from (\ref{alts2}) and using (\ref{oneovernpopvar}), write
% 
% \begin{equation}
% E[\overline{W}^2] = Var(\overline{W}) + [E(\overline{W})]^2 = \frac{1}{n} \sigma^2 +
% \mu^2
% \end{equation}
% 
% Now using all this in (\ref{alts2}), we get
% 
\begin{equation}
E(s^2) = \frac{n-1}{n} \sigma^2
\end{equation}

% \checkpoint

The earlier developers of statistics were bothered by this bias, so they
introduced a ``fudge factor'' by dividing by n-1 instead of n in
(\ref{s2}).  We will call that $\tilde{s}^2$:

\begin{equation}
\tilde{s}^2 = 
\frac{1}{n-1} \sum_{i=1}^{n} (W_i-\overline{W})^2 
\end{equation}

This is the ``classical'' definition of sample variance, in which we
divide by n-1 instead of n.  

But we will use n.  After all, when n is large---which is
what we are assuming by using the Central Limit Theorem in the entire
development so far---it doesn't make any appreciable difference.
Clearly it is not important in our Davis example, or our bus simulation
example.

Moreover, speaking generally now rather than necessarily for the case of
$s^2$ there is no particular reason to insist that an estimator be
unbiased anyway.  An alternative estimator may have a little bias but
much smaller variance, and thus might be preferable.  And anyway, even
though the classical version of $s^2$, i.e. $\tilde{s}^2$, is an
unbiased estimator for $\sigma^2$, $s$ is not an unbiased estimator for
$\sigma$, the population standard deviation.  In other words,
unbiasedness is not such an important property.

The R functions {\bf var()} and {\bf sd()} calculate the versions of
$s^2$ and $s$, respectively, that have a divisor of n-1.

\subsubsection{Example of Bias Calculation}

Let's find the bias of the estimator (\ref{unifmle}).

The bias is $E\widehat{C} - c$.  To get $E\widehat{c}$ we need the
density of that estimator, which we get as follows:

\begin{eqnarray}
P(\widehat{c} \leq t) &=& P(\textrm{all} ~ W_i \leq t)  ~~ (\textrm{definition})\\
&=& \left ( \frac{t}{c} \right )^n ~~ (\textrm{density of } W_i)
\end{eqnarray}

So,

\begin{equation}
f_{\widehat{c}}(t) = \frac{n}{c^n}  t^{n-1}
\end{equation}

Integrating against t, we find that

\begin{equation}
\label{notbad}
E\widehat{c} = \frac{n}{n+1} ~ c
\end{equation}

So the bias is c/(n+1), not bad at all.

\subsection{Tradeoff Between Variance and Bias}
\label{msesection}

Consider a general estimator Q of some population value b.  Then a
common measure of the quality (of course there are many others) of the
estimator Q is the {\bf mean squared error} (MSE),

\begin{equation}
\label{mse}
E[(Q-b)^2]
\end{equation}

Of course, the smaller the MSE, the better.

One can break (\ref{mse}) down into variance and (squared) bias
components, as follows:\footnote{In reading the following derivation,
keep in mind that EQ and b are constants.}

\begin{eqnarray}
MSE(Q) &=& E[(Q-b)^2] \textrm{  (definition)} \\
&=& E[\{(Q-EQ) + (EQ - b)\}^2] \textrm{  (algebra)} \\
&=& E[(Q-EQ)^2] + 2 E \left [ (Q-EQ) (EQ-b) \right ] + E[ (EQ - b)^2 ]
\textrm{  (alg., E props.)} \\
&=& E[(Q-EQ)^2] + E[ (EQ - b)^2 ] \textrm{  (factor out constant EQ-b)} \\
&=&  Var(Q) + (EQ - b)^2 
\textrm{  (def. of Var(), fact that EQ-b is const.)} 
\label{mseformula}\\
&=& \textrm{variance + squared bias} 
\label{vb2}
\end{eqnarray}

In other words, in discussing the accuracy of an estimator---especially
in comparing two or more candidates to use for our estimator---the
average squared error has two main components, one for variance and one
for bias.  In building a model, these two components are often at odds
with each other; we may be able to find an estimator with smaller bias
but more variance, or vice versa.

We also see from (\ref{vb2}) that a little bias in an estimator
may be quite tolerable, as long as the variance is low.  This is good,
because as mentioned earlier, most estimators are in fact biased.

These point will become central in Chapters \ref{chap:mod} and
\ref{chap:linreg}.

% \checkpoint

\section{More on the Issue of Independence/Nonindependence of Samples}

In Section \ref{indepsams}, we derived confidence intervals for the
difference between two population means (or proportions).  The
derivation depended crucially on the fact that the two sample means,
$\overline{X}$ and $\overline{Y}$, were independent.  This in turn
stemmed from the fact that the corresponding sample data sets were
separate.  

On the other hand, in Section \ref{depsams}, we had an example in which
the two sample means, $\overline{X}$ and $\overline{Y}$, were not
independent, as they came from the same set of kids.  The confidence
intervals derived in Section \ref{indepsams} were thus invalid, and new
ones were derived, based on differences.

Note that in both cases, the observations {\it within} a sample were
also independent.  In the example of children's heights in Section
\ref{depsams}, for instance, the fact that Mary was chosen as the first
child in the sample had no effect on whether Jane was chosen as the
second one.  This was important for the derivations too, as they used
(\ref{oneovernpopvar}), which assumed independence.

In this section, we will explore these points further, with our aim
being to state the concepts in precise random variable terms.  

As our concrete example, consider an election survey, in a small city.
Say there are equal numbers of men and women in the city, 5000 each.  We
wish to estimate the population proportion of people who plan to vote
for candidate A.  We take a random sample of size n from the population,
Define the following:

\begin{itemize}

\item Let V denote the indicator variable for the event that the person
plans to vote for A.  

\item We might be interested in differences between men and women in A's
support, so let G be 1 for male, 2 for female.  

\item Let $p$ denote the population proportion of people who plan to vote
for A.

\item Let $p_1$ and $p_2$ denote the population proportions planning to
vote for A, among men and women respectively.  Note that

\begin{equation}
\label{pp1p2}
p = 0.5 p_1 + 0./5 p_2
\end{equation}

\item Denote our data by $(V_1,G_1),...,(V_n,G_n)$, recording both the
planned vote and gender for each person in the sample.

\item For convenience, relabel the data by gender, with
$M_1,...,M_{N_1}$ and $F_1,...,F_{N_2}$ denoting the planned votes of
the men and women.

\end{itemize}

Clearly, the male data and female data are independent.  The fact that
Jack is chosen in the male sample has no impact on whether Jill is
chosen in the female one.

But what about data {\it within} a gender group?  For example, are $M_1$
and $M_2$, the planned votes of the first two men in our male sample,
independent?  Or are they correlated, since these two people have the
same gender?

The answer is that $M_1$ and $M_2$ are indeed independent.  The first
man could be any of the 5000 men in the city, with probability 1/5000
each, and the same is true of the second man.  Moreover, the choice of
the first man has no effect at all on the choice of the second one.
(Remember, in random samples we sample {\it with} replacement.)

Our estimate of $p$ is our usual sample proportion,

\begin{equation}
\widehat{p} = \frac{V_1+...+V_n}{n}
\end{equation}

Then we can use (\ref{propci}) to find a confidence interval for $p$.
But again, the reader might question this, saying something like, ``What
if $G_1$ and $G_2$ are both 1, i.e. the first two people in our sample
are both men?  Won't $V_1$ and $V_2$ then be correlated?''  The answer
is no, because the reader would be referring to the conditional
distribution of V given G, whereas our use of (\ref{propci}) does not
involve gender, i.e. it concerns the unconditional distribution of V.

This point is subtle, and is difficult for the beginning modeler to
grasp.  It is related to issues in our first discussions of probability
in Chapter \ref{probcalc}.  In the ALOHA model there, for instance,
beginning students who are asked to find $P(X_2 = 1)$ often object,
``Well, it depends on what $X_1$ is.''  That is incorrect thinking,
because they are confusing $P(X_2 = 1)$ with $P(X_2 = 1 | X_1 = i)$.
That confusion is resolved by thinking in ``notebook'' terms, with
$P(X_2 = 1)$ meaning the long-run proportion of notebook lines in which
$X_2 = 1$, \textit{regardless} of the value of $X_1$.  In our case
here, the reader must avoid confusing $P(V = 1)$ (which is $p$) with
$P(V = 1 | G = i)$ (which is $p_i$).

Continuing this point a bit more, note that our $\widehat{p}$ above {\it
is} an unbiased estimate of $p$:

\begin{eqnarray}
\label{phatunb}
E \widehat{p} &=& E(V_1) ~~~~ ((\ref{barmean})) \\ 
&=& P(V_1 = 1) ~~~~ ((\ref{eofxeqp})) \\
&=& P(G_1 = 1) P(V_1 = 1 | G_1 = 1)
+ P(G_1 = 2) P(V_1 = 1 | G_1 = 2) ~~~~ (\textrm{Chapter } \ref{probcalc}) \\
&=& 0.5 p_1 + 0.5 p_2 \\
&=& p ~~~~ ((\ref{pp1p2})) 
\end{eqnarray}

Due to the independence of the male and female samples, we can use
(\ref{2indepsamp}) to find a confidence interval for $p_1 - p_2$, so
that we can compare male and female support of A.  Note by the way that
$\overline{M}$ will be an unbiased estimate of $p_1$, with a similar
statement holding for the women.

Now, contrast all that with a different kind of sampling, as follows.
We choose a gender group at random, and then {\it sample n people from
\underline{that} gender group}.  Let R denote the group chosen, so that
$G_i = R$ for all i.  So, what about the answers to the above questions
in this new setting?

Conditionally on R, the $V_i$ and again independent, using the same
argument as we used to show that $M_1$ and $M_2$ were independent above.
And (\ref{phatunb}) still works, so our $\widehat{p}$ is still unbiased.

However:  The $V_i$ are no longer unconditionally independent:

\begin{equation}
P(V_1 = 1 \textrm{ and } V_2 = 1) = 0.5 p_1^2 + 0.5 p_2^2
\end{equation}

(the reader should fill in the details, with a conditioning
argument like that in (\ref{phatunb})), while

\begin{equation}
P(V_1 = 1) \cdot P(V_2 = 1) = p^2 = (0.5 p_1 + 0.5 p_2)^2
\end{equation}

So, 

\begin{equation}
P(V_1 = 1 \textrm{ and } V_2 = 1) \neq
P(V_1 = 1) \cdot P(V_2 = 1) 
\end{equation}

and thus $V_1$ and $V_2$ are not unconditionally independent.

This setting is very common.  We might, for instance, choose k trees
at random, and then collect data on r leaves in each tree.

\section{Nonparametric Distribution Estimation} 
\label{nonpardensest}

Here we will be concerned with estimating distribution functions and
densities in settings in which we do not assume our distribution belongs
to some parametric model.

\subsection{The Empirical cdf}
\label{ecdf}

Recall that $F_X$, the cdf of $X$, is defined as

\begin{equation}
F_X(t) = P(X \leq t), ~ -\infty < t < \infty
\end{equation}

Define its sample analog, called the {\bf empirical distribution
function}, by

\begin{equation}
\widehat{F}_X(t) = \frac{\# ~of~ X_i ~in~(-\infty,t)}{n}
\end{equation}

In other words, $F_X(t)$ is the proportion of X that are below t in the
population, and $\widehat{F}_X(t)$ is the value of that proportion in our
sample.  $\widehat{F}_X(t)$ estimates $F_X(t)$ for each t.

Graphically, $\widehat{F}_X$ is a step function, with jumps at the values of
the $X_i$.  Specifically, let $Y_j$, j = 1,...,n denote the sorted
version of the $X_I$.\footnote{A common notation for this is
$Y_j = X_{(j)}$, meaning that $Y_j$ is the $j^{th}$ smallest of the
$X_i$.  These are called the {\bf order statistics} of our sample.}
Then

\begin{equation}
\widehat{F}_X(t) = 
\begin{cases}
0, & \text{for $t < Y_1$} \\
\frac{j}{n}, & \text{for $Y_j \leq t < Y_{j+1}$} \\
1, & \text{for $t > Y_n$} \\
\end{cases} 
\end{equation}

Here is a simple example.  Say n = 4 and our data are 4.8, 1.2, 2.2 and
6.1.  We can plot the empirical cdf by calling R's {\bf ecdf()}
function:

\begin{Verbatim}[fontsize=\relsize{-2}]
> plot(ecdf(x))
\end{Verbatim}

Here is the graph:

\includegraphics[width=3.0in]{EmpCDF.pdf} 

Consider the Bus Paradox example again.  Recall that $W$ denoted the
time until the next bus arrives.  This is called the {\bf forward
recurrence time}.  The {\bf backward recurrence time} is the time since
the last bus was here, which we will denote by $R$.  

Suppose we are interested in estimating the density of $R$, $f_R()$,
based on the sample data $R_1,...,R_n$ that we gather in our simulation
in Section \ref{bus}, where n = 1000.  How can we do
this?\footnote{Actually, one can prove that R has an exponential
distribution.  However, here we'll pretend we don't know that.}

We could, of course, assume that $f_R$ is a member of some parametric
family of distributions, say the two-parameter gamma family.  We would
then estimate those two parameters as in Section \ref{genest}, and
possibly check our assumption using goodness-of-fit procedures,
discussed in our unit on modeling, Chapter \ref{chap:mod}.  On the other
hand, we may wish to estimate $f_R$ without making any parametric
assumptions.  In fact, one reason we may wish to do so is to visualize
the data in order to search for a suitable parametric model.

If we do not assume any parametric model, we have in essence change our
problem from estimating a finite number of parameters to an
infinite-parameter problem; the ``parameters'' are the values of $f_X(t)$ 
for all the different values of t.  Of course, we probably are willing
to assume {\it some} structure on $f_R$, such as continuity, but then we
still would have an infinite-parameter problem.

We call such estimation {\bf nonparametric}, meaning that we don't use a
parametric model.  However, you can see that it is really 
infinite-parametric estimation.

Again discussed in our unit on modeling, Chapter \ref{chap:mod}, the more
complex the model, the higher the variance of its estimator.  {\bf So,
nonparametric estimators will have higher variance than parametric
ones.}  The nonparametric estimators will also generally have smaller
bias, of course.

\subsection{Basic Ideas in Density Estimation}

Recall that 

\begin{equation}
f_R(t) = \frac{d}{dt} F_R(t) =  \frac{d}{dt} P(R \leq t)
\end{equation}

From calculus, that means that

\begin{eqnarray}
\label{diffquot}
f_R(t) &\approx& \frac{P(R \leq t+h) - P(R \leq t-h)}{2h} \\
&=& \frac{P(t-h < R \leq t+h)}{2h}
\end{eqnarray}

if h is small.  We can then form an estimate $\widehat{f}_R(t)$ by plugging
in sample analogs in the right-hand side of (\ref{diffquot}):

\begin{eqnarray}
\widehat{f}_R(t) &\approx& \frac{\#(t-h,t+h))/n}{2h} \\ 
&=& \frac{\#(t-h,t+h))}{2hn}  
\label{prehisto}
\end{eqnarray}

where the notation $\#(a,b)$ means the number of $R_i$ in the interval
(a,b).

There is an important issue of how to choose the value of h here, but
let's postpone that for now.  For the moment, let's take 

\begin{equation}
\label{h100}
h = \frac{\max_i R_i - \min_iR_i}{100}
\end{equation}

i.e. take h to be 0.01 of the range of our data.

At this point, we'd then compute (\ref{prehisto}) at lots of different
points t.  Although it would seem that theoretically we must compute
(\ref{prehisto}) at infinitely many such points, the graph of the
function is actually a step function.  Imagine t moving to the right,
starting at $\min_iR_i$.  The interval $(t-h,t+h)$ moves along with it.
Whenever the interval moves enough to the right to either pick up a new
$R_i$ or lose one that it had had,  (\ref{prehisto}) will change value,
but not at any other time.  So, we only need to evaluate the function at
about $2n$ values of t.

\subsection{Histograms} 

If for some reason we really want to save on computation, let's say that
we first break the interval $(\min_iR_i, \max_iR_i$ into 100
subintervals of size h given by (\ref{h100}).  We then compute
(\ref{prehisto}) only at the midpoints of those intervals, and pretend
that the graph of $\widehat{f}_R(t)$ is constant within each
subinterval.  Do you know what we get from that?  A histogram!  Yes, a
histogram is a form of density estimation.  (Usually a histogram merely
displays counts.  We do so here too, but we have scaled things so that
the total area under the curve is 1.)

Let's see how this works with our Bus Paradox simulation.  We'll use R's
{\bf hist()} to draw a histogram.  First, here's our simulation code:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
doexpt <- function(opt) {
   lastarrival <- 0.0
   while (TRUE) {
      newlastarrival = lastarrival + rexp(1,0.1)
      if (newlastarrival > opt)
         return(opt-lastarrival)
      else lastarrival <- newlastarrival
   }
}

observationpt <- 240
nreps <- 10000
waits <- vector(length=nreps)
for (rep in 1:nreps) waits[rep] <- doexpt(observationpt)
hist(waits)
\end{Verbatim}

Note that I used the default number of intervals, 20.  Here is the
result:

\includegraphics[width=4.0in]{HistDefault.pdf} 

The density seems to have a shape like that of the exponential
parametric family.  (This is not surprising, because it {\it is}
exponential, but remember we're pretending we don't know that.)

Here is the plot with 100 intervals:

\includegraphics[width=4.0in]{Hist100.pdf} 

Again, a similar shape, though more raggedy.  

\subsection{Kernel-Based Density Estimation}

No matter what the interval width is, the histogram will consist of a
bunch of rectanges, rather than a curve.  That is basically because, for
any particular value of t, $\widehat{f_X(t)}$, depends only on the $X_i$
that fall into that interval.  We could get a smoother result if we used
all our data to estimate $f_X(t)$ but put more weight on the data that
is closer to t.  One way to do this is called {\bf kernel-based} density
estimation, which in R is handled by the function {\bf density()}.  

We need a set of weights, more precisely a weight function k, called the
{\bf kernel}.  Any nonnegative function which integrates to 1---i.e. a
density function in its own right---will work.  Our estimator is then

\begin{equation}
\label{kernel}
\widehat{f_R}(t) = \frac{1}{nh} \sum_{i=1}^{n} k \left ( \frac{t-R_i}{h} \right )
\end{equation}

To make this idea concrete, take k to be the uniform density on (-1,1),
which has the value 0.5 on (-1,1) and 0 elsewhere.  Then (\ref{kernel})
reduces to (\ref{prehisto}).  Note how the parameter h, called the {\bf
bandwidth}, continues to control how far away from to t we wish to go
for data points.

But as mentioned, what we really want is to include all data points, so
we typically use a kernel with support on all of $(-\infty, \infty)$.
In R, the default kernel is that of the N(0,1) density.  The bandwidth h
controls how much smoothing we do; smaller values of h place heavier
weights on data points near t and much lighter weights on the distant
points.  The default bandwidth in R is taken to the the standard
deviation of k.

% \checkpoint

For our data here, I took the defaults:

\begin{Verbatim}[fontsize=\relsize{-2}]
plot(density(r))
\end{Verbatim}

The result is seen in Figure \ref{h3}. 

\begin{figure}[tb] 
\centerline{
\includegraphics[width=4.0in]{Hist3.pdf} 
}
\caption{Kernel estimate, default bandwidth}
\label{h3}  
\end{figure}

\begin{figure}[tb] 
\centerline{
\includegraphics[width=4.0in]{Hist4.pdf}
}
\caption{Kernel estimate, bandwidth 0.5}
\label{h4}   
\end{figure}

I then tried it with a bandwidth of 0.5.  See Figure \ref{h4}.  This
curve oscillates a lot, so an analyst might think 0.5 is too small.
(We are prejudiced here, because we know the true population density
is exponential.)

\subsection{Proper Use of Density Estimates}

There is no good, practical way to choose a good bin width or bandwdith.
Moreover, there is also no good way to form a reasonable confidence band
for a density estimate. 

So, density estimates should be used as exploratory tools, not as firm
bases for decision making.  You will probably find it quite unsettling
to learn that there is no exact answer to the problem.  But that's real
life!  

\section{Slutsky's Theorem}
\label{slutsky}

(The reader should review Section \ref{formalclt} before continuing.)

Since one generally does not know the value of $\sigma$ in
(\ref{preci}), we replace it by $s$, yielding (\ref{theci}).  Why was
that legitimate?  

The answer depends on the theorem below.  First, we need a definition.

\begin{definition}
We say that a sequence of random variables $L_n$ {\bf converges in
probability} to the random variable $L$ if for every $\epsilon > 0$,

\begin{equation}
\lim_{{n} \rightarrow \infty} P(|L_n -L| > \epsilon)  = 0
\end{equation}

\end{definition}

This is a little weaker than convergence with probability 1, as in the
Strong Law of Large Numbers (SLLN, Section \ref{reconcil}).  Convergence
with probability 1 implies convergence in probability but not vice
versa.

So for example, if $Q_1, Q_2, Q_3,...$ are i.i.d. with mean $\omega$,
then the SLLN implies that

\begin{equation}
L_n = \frac{Q_1+...+Q_n}{n}
\end{equation}

converges with probability 1 to $\omega$, and thus $L_n$ converges in
probability to $\omega$ too.

\subsection{The Theorem}

\begin{theorem}

{\bf Slutsky's Theorem} (abridged version):  Consider random variables
$X_n, Y_n, \textrm{ and } X$, such that $X_n$ converges in distribution
to $X$ and $Y_n$ converges in probability to a constant $c$ with
probability 1,   

Then:

\begin{itemize}

\item [(a)] $X_n + Y_n$ converges in distribution to $X+c$.

\item [(b)] $X_n / Y_n$ converges in distribution to $X/c$.

\end{itemize}

\end{theorem}

\subsection{Why It's Valid to Substitute $s$ for $\sigma$}

We now return to the question raised above.  In our context here, that
we take

\begin{equation}
\label{xn}
X_n = \frac{\overline{W}-\mu}
{\sigma/\sqrt{n}}
\end{equation}

\begin{equation}
\label{yn}
Y_n = \frac{s}{\sigma}
\end{equation}

We know that (\ref{xn}) converges in distribution to N(0,1) while
(\ref{yn}) converges in to 1.  Thus for large n, we have that 

\begin{equation}
\frac{\overline{W}-\mu} {s/\sqrt{n}}
\end{equation}

has an approximate N(0,1) distribution, so that (\ref{theci}) is valid.

\subsection{Example:  Confidence Interval for a Ratio Estimator}

Again consider the example in Section \ref{davisweights} of weights of
men and women in Davis, but this time suppose we wish to form a
confidence interval for the {\it ratio} of the means,

\begin{equation}
\gamma = \frac{\mu_1}{\mu_2}
\end{equation}

Again, the natural estimator is

\begin{equation}
\widehat{\gamma} = \frac{\overline{X}}{\overline{Y}}
\end{equation}

How can we construct a confidence interval from this estimator?  If it
were a linear combination of $\overline{X}$ and $\overline{Y}$, we'd
have no problem, since a linear combination of multivariate normal
random variables is again normal.

That is not exactly the case here, but it's close.  Since $\overline{Y}$
converges in probability to $\mu_2$, Slutsky's Theorem (Section
\ref{slutsky}) tells us that the problem here really is one of such a
linear combination.  We can form a confidence interval for $\mu_1$, then
divide both endpoints of the interval by $\overline{Y}$, yielding a
confidence interval for $\gamma$.

\section{The Delta Method:  Confidence Intervals for General
Functions of Means or Proportions}
\label{delta}

The {\bf delta method} is a great way to derive asymptotic distributions
of quantities that are functions of random variables whose asymptotic
distributions are already known.

\subsection{The Theorem}

\begin{theorem}

Suppose $R_1,...,R_k$ are estimators of $\eta_1,...,\eta_k$ based on a
random sample of size n.  Let R denote the vector whose components are
the $R_i$, and let $\eta$ denote the corresponding vector for the
$\eta_i$.  
Suppose the random vector 

\begin{equation}
\sqrt{n} (R - \eta) =
\sqrt{n} 
\left (
\begin{array}{l}
   R_1 - \eta_1 \\
   R_2 - \eta_2 \\
   ... \\
   R_k- \eta_k  
\end{array}
\right )
\end{equation}

is known to have an asymptotically multivariate normal distribution with
mean 0 and nonsingular covariance matrix $\Sigma = (\sigma_{ij})$.  

Let h be a smooth scalar function\footnote{The word ``smooth'' here
refers to mathematical conditions such as existence of derivatives,
which we will not worry about here.  

Similarly, the reason that we multiply by $\sqrt{n}$ is also due to
theoretical considerations we will not go into here, other than to note
that it is related to the formal statement of the Central Limit Theorem
in Section \ref{formalclt}.  If we replace $X_1+...,+X_n$ in
(\ref{cltquant}), by $n \overline{X}$, we get

\begin{equation}
Z = \sqrt{n} \cdot \frac{\overline{X}-m}{v}
\end{equation}

} of k variables, with $h_i$ denoting its $i^{th}$ partial derivative.
Consider the random variable 

\begin{equation}
Y = h(R_1,...,R_k)
\end{equation}

Then $\sqrt{n}[Y-h(\eta_1,...,\eta_k)]$ converges in distribution to a
normal distribution with mean 0 and variance

\begin{equation}
\label{avar}
[\nu_1,...,\nu_k]' \Sigma [\nu_1,...,\nu_k]
\end{equation}

provided not all of 

\begin{equation}
\nu_i = h_i(\eta_1,...,\eta_k), ~ i=1,...,k
\end{equation}

are 0.

\end{theorem}

Informally, the theorem says, with R, $\eta$, $\Sigma$, h() and Y
defined above:

\begin{quote}
Suppose R is asymptotically multivariate normally distributed
with mean $\eta$ and covariance matrix $\Sigma/n$.  $Y$ will be 
approximately normal with mean $h(\eta_1,...,\eta_k)$ and covariance 
matrix 1/n times (\ref{avar}).  
\end{quote} 

Note carefully that the theorem is not saying, for example, that $E[h(R)
= h(\eta)$ for fixed, finite n, which is not true.  Nor is it saying
that h(R) is normally distributed, which is definitely not true; recall
for instance that if $X$ has a N(0,1) distribution, then $X^2$ has a
chi-square distribution with one degree of freedom, hardly the same as
N(0,1).  But the theorem says that for the purpose of asymptotic
distributions, we can operate as if these things were true.

The theorem can be used to form confidence intervals for
$h(\eta_1,...,\eta_k)$, because it provides us with a standard error
(Section \ref{stderrest}): 

\begin{equation}
\textrm{std. err. of } h(R) = \sqrt{\frac{1}{n} ~ 
[\nu_1,...,\nu_k]' \Sigma [\nu_1,...,\nu_k]}
\end{equation}

Of course, these quantities are typically estimated from the sample,
e.g.

\begin{equation}
\widehat{\nu}_i = h_i(R_1,...,R_k)
\end{equation}

So, our approximate 95\% confidence interval for $h(\eta_1,...,\eta_k)$
is 

\begin{equation}
\label{deltaci}
h(R_1,...,R_k) \pm 1.96 
\sqrt{
\frac{1}{n}
[\widehat{\nu}_1,...,\widehat{\nu}_k]' \widehat{\Sigma} [\widehat{\nu}_1,...,\widehat{\nu}_k]
}
\end{equation}

% \checkpoint

Note that here we are considering scalar functions h(), but the theorem
can easily be extended to vector-valued h().

Now, how is theorem derived?

\begin{proof}

We'll cover the case k = 1 (dropping the subscript 1 for convenience).

The intuitive version of the proof cites the fact from
calculus\footnote{This is where the ``delta'' in the name of the method
comes from, an allusion to the fact that derivatives are limits of
difference quotients.} that a curve is close to its tangent line if we
are close to the point of tangency.  Here that means

\begin{equation}
\label{intuitiveproof}
h(R) \approx h(\eta) + h'(\eta) (R - \eta) 
\end{equation}

if R is near $\eta$, which will be the case for large n.  Note that in
the right-hand side of (\ref{intuitiveproof}), the only random quantity
is R; the rest are constants.  In other words, the right-hand side
has the form c+dQ, where Q is approximately normal.  Since a linear
function of a normally distributed random variable itself has a normal
distribution, (\ref{intuitiveproof}) implies that h(R) is approximately
normal with mean $h(\eta)$ and variance $[h'(\eta)]^2 Var(R)$.

Reasoning more carefully, recall the Mean Value Theorem from calculus:

\begin{equation}
h(R) = h(\eta) + h'(W) (R - \eta) 
\end{equation}

for some $W$ between $\eta$ and $R$.  Rewriting this, we have

\begin{equation}
\label{mvt}
\sqrt{n}[h(R) - h(\eta)] = \sqrt{n} ~ h'(W) (R - \eta)
\end{equation}

It can be shown---and should be intuitively plausible to you---that if a
sequence of random variables converges in distribution to a constant,
the convergence is in probability too.  So, $R - \eta$ converges in
probability to 0, forcing $W$ to converge in probability to $h(\eta)$.
Then from Slutsky's Theorem, the asymptotic distribution of (\ref{mvt})
is the same as that of $\sqrt{n}~  h'(\eta) (R - \eta)$.  The result
follows.

\end{proof}

\subsection{Example:  Square Root Transformation}
\label{varstabxform}

Here is an example of the delta method with k = 1.  It will be a rather
odd example, in that our goal is actually not to form a confidence
interval for anything, but it will illustrate how the delta method is
used.

It is used to be common, and to some degree is still common today, for
statistical analysts to apply a square-root transformation to Poisson
data.  The delta method sheds light on the motivation for this, as
follows.

First, note that we cannot even apply the delta method unless we have
approximately normally distributed inputs, i.e. the $R_i$ in the
theorem.  But actually, any Poisson-distributed random variable $T$ is
approximately normally distributed if its mean, $\lambda$, is large.  To
see this,  recall from Section \ref{sumpoi} that sums of independent
Poisson random variables are themselves Poisson distributed.  So, if for
instance, ET is an integer k, then $T$ has the same distribution as

\begin{equation}
U_1+...+U_m
\end{equation}

where the $U_i$ are i.i.d. Poisson random variables each having mean 1.
By the Central Limit Theorem, T then has an approximate normal
distribution, with mean and variance $\lambda$.  (This is not quite a
rigorous argument, so our treatment here is informal.) 

Now that we know that T is approximately normal, we can apply the delta
method.  So, what h() should we use?  The pioneers of statistics chose 
$h(t) = \sqrt{t}$.  Let's see why.

Set $Y = h(T) = \sqrt{T}$ (so that T is playing the role of R in the
theorem).  Here $\eta$ is $ET = \lambda$.

We have $h'(t) = 1 / (2\sqrt{t})$.  Then the delta method says that
since T is approximately normally distributed with mean $\lambda$ and
variance $\lambda$, $Y$ too has an approximate normal distribution, with
mean 

\begin{equation}
h(\eta) = \sqrt{\lambda}
\end{equation}

What about the variance?  Well, in one dimension, (\ref{avar}) reduces
to

\begin{equation}
\nu^2 Var(R)
\end{equation}

so we have

\begin{equation}
[h'(\eta)]^2 Var(R) =
\left ( \frac{1}{2\sqrt{t}} \Big |_{t=\lambda} \right )^2 \cdot \lambda =
\frac{1}{4 \lambda} \lambda = \frac{1}{4}
\end{equation}

So, the (asymptotic) variance of $\sqrt{T}$ is a constant, independent
of $\lambda$, and we say that the square root function is a {\bf
variance stabilizing transformation.} This becomes relevant in
regression analysis, where, as we will discuss in Chapter
\ref{chap:linreg}, a
classical assumption is that a certain collection of random variables
all have the same variance.  If those random variables are
Poisson-distributed, then their square roots will all have approximately
the same variance.

\subsection{Example:  Confidence Interval for $\sigma^2$}
\label{cisigma2}

Recall that in Section \ref{whynot} we noted that (\ref{meanci}) is only
an approximate confidence interval for the mean.  An exact interval is
available using the Student t-distribution, \underline{if} the
population is normally distributed.  We pointed out that (\ref{meanci})
is very close to the exact interval for even moderately large n anyway,
and since no population is exactly normal, (\ref{meanci}) is good
enough.  Note that one of the implications of this and the fact that
(\ref{meanci}) did not assume any particular population distribution 
is that a Student-t based confidence interval works well even for
non-normal populations.  We say that the Student-t interval is {\bf
robust} to the normality assumption.

But what about a confidence interval for a variance?  It can be shown
that one can form an exact interval based on the chi-square
distribution, \underline{if} the population is normal.  In this case,
though, the interval does NOT work well for non-normal populations; it
is NOT robust to the normality assumption.  So, let's derive an interval
that doesn't assume normality; we'll use the delta method.  (Warning:
This will be a lengthy derivation, but it will cause you to review 
many concepts, which is good.)

As before, say we have $W_1,...,W_n$, a random sample from our
population, and with W representing a random variable having the
population distribution.) Write

\begin{equation}
\sigma^2 = E(W^2) -(EW)^2
\end{equation}  

and from (\ref{alts2}) write our estimator of $\sigma^2$ as

\begin{equation}
s^2 = \frac{1}{n} \sum_{i=1}^{n} W_i^2 - \overline{W}^2 
\end{equation}

This suggests how we can use the delta method.  We define

\begin{equation}
R_1 = \overline{W}
\end{equation}

\begin{equation}
R_2 = \frac{1}{n} \sum_{i=1}^{n} W_i^2 
\end{equation}

$R_1$ is an estimator of EW, and $R_2$ estimates $E(W^2)$.  Furthermore,
we'll see below that $R_1$ and $R_2$ are approximately bivariate normal,
by the multivariate Central Limit Theorem, so we can use the delta
method.  

And most importantly, our estimator of interest, $s^2$, is a
function of $R_1$ and $R_2$:

\begin{equation}
s^2 = R_2 - R_1^2
\end{equation}

So, we take our function h to be

\begin{equation}
h(u,v) = -u^2 + v
\end{equation}

Now we must find $\Sigma$ in the theorem.  That means we'll need we'll
need the covariance matrix of $R_1$ and $R_2$.  But since

\begin{equation}
\left (
   \begin{array}{l}
   R_1 \\
   R_2 
   \end{array} 
\right ) =
\frac{1}{n}
\sum_{i=1}^n  
\left (
   \begin{array}{l}
   W_i \\
   W_i^2 
   \end{array}
\right ) 
\end{equation}

we can derive the covariance matrix of $R_1$ and $R_2$, as follows.

Remember, the covariance matrix is the multidimensional analog of
variance.  So, after reviewing the reasoning in (\ref{oneovernpopvar}), 
we have in the vector-valued version of that derivation that

\begin{eqnarray}
Cov \left [
\left (
   \begin{array}{l}
   R_1 \\
   R_2 
   \end{array}
\right ) 
\right ]
&=& \frac{1}{n^2} 
Cov \left [
   \sum_{i=1}^n  
   \left (
      \begin{array}{l}
      W_i \\
      W_i^2 
      \end{array}
   \right ) 
\right ] \\
&=& \frac{1}{n^2} 
\sum_{i=1}^n  
   Cov \left [
   \left (
      \begin{array}{l}
      W_i \\
      W_i^2 
      \end{array}
   \right ) 
   \right ] \\
&=& \frac{1}{n^2} 
\sum_{i=1}^n  
   Cov \left [
   \left (
      \begin{array}{l}
      W \\
      W^2 
      \end{array}
   \right ) 
   \right ] \\
&=& \frac{1}{n} 
   Cov \left [
   \left (
      \begin{array}{l}
      W \\
      W^2 
      \end{array}
   \right ) 
   \right ] 
\end{eqnarray}

So

\begin{equation}
\Sigma = 
Cov \left [
\left (
   \begin{array}{l}
   W \\
   W^2 
   \end{array}
\right )
\right ]
\end{equation}

Now we must estimate $\Sigma$.  Taking sample analogs of
(\ref{quickcovarmat}), we set

\begin{equation}
\widehat{\Sigma} = 
\frac{1}{n}
\sum_{i=1}^n  
\left (
   \begin{array}{l}
   W_i \\
   W_i^2 
   \end{array}
\right ) 
(W_i,W_i^2) 
- R ~ R'
=
\frac{1}{n}
\sum_{i=1}^n  
\left (
   \begin{array}{ll}
   W_i^2 & W_i^3 \\
   W_i^3 & W_i^4 
   \end{array}
\right ) 
- R ~ R'
\end{equation}

where $R = (R_1,R_2)'$.

Also, $h'(u,v) = (-2u,1)'$, so

\begin{equation}
h'(R_1,R_2) = (-2R_1,1)' 
\end{equation}

Whew!  We're done.  We can now plug everything into (\ref{deltaci}).

Note that all these quantities are expressions in $E(W^k)$ for various
k.  It should be noted that estimating means of higher powers of a
random variable requires larger samples in order to achieve comparable
accuracy.  Our confidence interval here may need a rather large sample
to be accurate, as opposed to the situation with (\ref{meanci}), in
which even n = 20 should work well.

\subsection{Example:  Confidence Interval for a Measurement of
Prediction Ability}

Suppose we have a random sample $X_1,...,X_n$ from some population. In
other words, the $X_i$ are independent and each is distributed as in the
population. Let X represent a generic random variable having that
distribution. Here we are allowing the $X_i$ and X to be random vectors,
though they won't play much explicit role anyway.

Let A and B be events associated with X. If for example X is a random
vector (U,V), we might have A and B being the events U $>$ 12 and U-V
$<$ 5. The question of interest here will be to what extent we can
predict A from B.

One measure of that might be the quantity $\nu = P(A|B) - P(A)$. The
larger $\nu$ is (in absolute value), the stronger the ability of B to
predict A. (We could look at variations of this, such as the quotient of
those two probabilities, but will not do so here.)

Let's use the delta method to derive an approximate 95\% confidence
interval for $\nu$.  To that end, think of four categories---A and B; A
and not B; not A and B; and not A and not B. Each $X_i$ falls into one of
those categories, so the four-component vector Y consisting of counts of
the numbers of $X_i$ falling into the four categories has a multinomial
distribution with r = 4.

To use the theorem, set R = Y/n, so that R is the vector of the sample
proportions.  For instance, $R_1$ will be the number of $X_i$ satisfying
both events A and B, divided by n.  The vector $\eta$ will then be the
corresponding population proportion, so that for instance 

\begin{equation}
\eta_2 = P(A\textrm{ and not }B)
\end{equation}

We are interested in

\begin{eqnarray}
\label{nueqn}
\nu &=& P(A|B) - P(A) \\
&=& \frac{P(A \textrm{ and } B)}
{P(A \textrm{ and }B) + P(\textrm{not }A \textrm{ and }B)} 
- \left [ P(A \textrm{ and } B) + P(A \textrm{ and not } B) \right ] \\
&=& \frac{\eta_1}{\eta_1+\eta_3} - (\eta_1+\eta_2)
\end{eqnarray}

By the way, since $\eta_4$ is not involved, let's shorten R to
$(R_1,R_2,R_3)'$.

What about $\Sigma$?  Since Y is multinomial, Equation (\ref{propcov})
provides us $\Sigma$:

\begin{equation}
\Sigma = 
\frac{1}{n}
\left (
\begin{array}{rrr}
\eta_1 (1-\eta_1) & -\eta_1 \eta_2 & -\eta_1 \eta_3 \\
-\eta_2 \eta_1 & \eta_2 (1-\eta_2) & -\eta_2 \eta_3 \\
-\eta_3 \eta_1 & -\eta_3 \eta_2 & \eta_3 (1-\eta_3) 
\end{array}
\right )
\end{equation}

We then get $\widehat{\Sigma}$ by substituting $R_i$ for $\eta_i$.
After deriving the $\widehat{\nu}_i$ from (\ref{nueqn}), we make the
same substitution there, and then compute (\ref{deltaci}).

\section{Simultaneous Confidence Intervals} 
\label{simultancis}

Suppose in our study of heights, weights and so on of people in Davis,
we are interested in estimating a number of different quantities, with
our forming a confidence interval for each one.  Though our confidence
level for each one of them will be 95\%, our {\it overall} confidence
level will be less than that.  In other words, we cannot say we are 95\%
confident that all the intervals contain their respective population
values.

In some cases we may wish to construct confidence intervals in such a
way that we can say we are 95\% confident that all the intervals are
correct.  This branch of statistics is known as {\bf simultaneous
inference} or {\bf multiple inference}.  

Usually this kind of methodology is used in the comparison of several
{\bf treatments}.  This term originated in the life sciences, e.g.
comparing the effectiveness of several different medications for
controlling hypertension, it can be applied in any context.  For
instance, we might be interested in comparing how well programmers do in
several different programming languages, say Python, Ruby and Perl.
We'd form three groups of programmers, one for each language, with say
20 programmers per group.  Then we would have them write code for a
given application.  Our measurement could be the length of time T that
it takes for them to develop the program to the point at which it runs
correctly on a suite of test cases.  

Let $T_{ij}$ be the value of T for the j$^{th}$ programmer in the
i$^{th}$ group, i = 1,2,3, j = 1,2,...,20.  We would then wish to
compare the three ``treatments,'' i.e. programming languages, by
estimating $\mu_i = ET_{i1}$, i = 1,2,3.  Our estimators would be $U_i =
\sum_{j=1}^{20} T_{ij}/20$, i = 1,2,3.  Since we are comparing the three
population means, we may not be satisfied with simply forming ordinary
95\% confidence intervals for each mean.  We may wish to form confidence
intervals which {\it jointly} have confidence level 95\%.\footnote{The
word {\it may} is important here.  It really is a matter of philosophy
as to whether one uses simultaneous inference procedures.}

Note very, very carefully what this means.  As usual, think of our
notebook idea.  Each line of the notebook would contain the 60
observations; different lines would involve different sets of 60 people.
So, there would be 60 columns for the raw data, three columns for the
$U_i$.  We would also have six more columns for the confidence intervals
(lower and upper bounds) for the $\mu_i$.  Finally, imagine three more
columns, one for each confidence interval, with the entry for each being
either Right or Wrong.  A confidence interval is labeled Right if it
really does contain its target population value, and otherwise is
labeled Wrong.

Now, if we construct individual 95\% confidence intervals, that means
that in a given Right/Wrong column, in the long run 95\% of the entries
will say Right.  But for simultaneous intervals, we hope that within a
line we see \underline{three} Rights, and 95\% of all lines will have
that property.

In our context here, if we set up our three intervals to have individual
confidence levels of 95\%, their simultaneous level will be $0.95^3 =
0.86$, since the three confidence intervals are independent.
Conversely, if we want a simultaneous level of 0.95, we could take each
one at a 98.3\% level, since $0.95^{\frac{1}{3}} \approx 0.983$.

However, in general the intervals we wish to form will not be
independent, so the above ``cube root method'' would not work.  Here we
will give a short introduction to more general procedures.

Note that ``nothing in life is free.''  If we want simultaneous
confidence intervals, they will be wider.

Another reason to form simultaneous confidence intervals is that it
gives you  ``license to browse,'' i.e. to rummage through the data
looking for interesting nuggets.

\subsection{The Bonferonni Method}

One simple approach is {\bf Bonferonni's Inequality}:

\begin{lemma}
Suppose $A_1,...,A_g$ are events.  Then

\begin{equation}
\label{bonf}
P(A_1 {\rm ~or~...~or} A_g) \leq \sum_{i=1}^{g} P(A_i)
\end{equation}

\end{lemma}

You can easily see this for g = 2: 

\begin{equation}
P(A_1 \textrm{ or } A_2) = P(A_1) + P(A_2) - P(A_1 \textrm{ and } A_2)
\leq  P(A_1) + P(A_2)
\end{equation}

One can then prove the general case by mathematical induction.

Now to apply this to forming simultaneous confidence intervals, take
$A_i$ to be the event that the $i^{th}$ confidence interval is
incorrect, i.e. fails to include the population quantity being
estimated.  Then (\ref{bonf}) says that if, say, we form two confidence
intervals, each having individual confidence level (100-5/2)\%, i.e.
97.5\%, then the overall collective confidence level for those two
intervals is at least 95\%.  Here's why:
Let $A_1$ be the event that the first interval is wrong, and $A_2$ is
the corresponding event for the second interval.  Then

% \checkpoint
\begin{eqnarray}
\textrm{overall conf. level} &=& P(\textrm{not } A_1 \textrm{ and not } A_2) \\ 
&=& 1 - P(A_1 \textrm{ or } A_2) \\
&\geq&  1 - P(A_1) - P(A_2) \\
&=& 1 - 0.025 - 0.025 \\
&=& 0.95
\end{eqnarray}

\subsection{Scheffe's Method}
\label{scheffe}

The Bonferonni method is unsuitable for more than a few intervals; each
one would have to have such a high individual confidence level that the
intervals would be very wide.  Many alternatives exist, a famous one
being {\bf Scheffe's method}.\footnote{The name is pronounced
``sheh-FAY.''}

\begin{theorem}

Suppose $R_1,...,R_k$ have an approximately multivariate normal
distribution, with mean vector $\mu = (\mu_i)$ and covariance matrix
$\Sigma = (\sigma_{ij})$.  Let $\widehat{\Sigma}$ be a {\bf consistent}
estimator of $\Sigma$, meaning that it converges in probability to
$\Sigma$ as the sample size goes to infinity.

For any constants $c_1,...,c_k$, consider linear combinations of the
$R_i$, 

\begin{equation}
\label{lincomb}
\sum_{i=1}^{k} c_i R_i
\end{equation}

which estimate

\begin{equation}
\sum_{i=1}^{k} c_i \mu_i
\end{equation}

Form the confidence intervals 

\begin{equation}
\label{radius}
\sum_{i=1}^{k} c_i R_i \pm 
\sqrt{k \chi_{\alpha; k}^2} s(c_1,...,c_k)
\end{equation}

where

\begin{equation}
[s(c_1,...,c_k)]^2 = (c_1,...,c_k)^T \widehat{\Sigma} (c_1,...,c_k)
\end{equation}

and where $\chi_{\alpha; k}^2$ is the upper-$\alpha$ percentile of a
chi-square distribution with k degrees of freedom.\footnote{Recall
that the distribution of the sum of squares of g independent N(0,1)
random variables is called {\bf chi-square with g degrees of freedom}.
It is tabulated in the R statistical package's function {\bf qchisq()}.}

Then all of these intervals (for infinitely many values of the $c_i$!)
have simultaneous confidence level $1-\alpha$.

\end{theorem}

By the way, if we are interested in only constructing confidence
intervals for {\bf contrasts}, i.e. $c_i$ having the property that
$\Sigma_i c_i = 0$, we the number of degrees of freedom reduces to k-1,
thus producing narrower intervals.

Just as in Section \ref{whynot} we avoided the t-distribution, here we
have avoided the F distribution, which is used instead of ch-square in
the ``exact'' form of Scheffe's method.

\subsection{Example}

For example, again consider the Davis heights example in Section
\ref{diffs}.  Suppose we want to find approximate 95\% confidence
intervals for two population quantities, $\mu_1$ and $\mu_2$.  These
correspond to values of $c_1,c_2$ of (1,0) and (0,1).  Since the two
samples are independent, $\sigma_{12} = 0$.  The chi-square value is
5.99,\footnote{Obtained from R via {\bf qchisq(0.95,2)}.} so the square
root in (\ref{radius}) is 3.46.  So, we would compute (\ref{meanci}) for
$\overline{X}$ and then for $\overline{Y}$, but would use 3.46 instead of 1.96.

This actually is not as good as Bonferonni in this case.  For
Bonferonni, we would find two 97.5\% confidence intervals, which would
use 2.24 instead of 1.96.

% \checkpoint

Scheffe's method is too conservative if we just are forming a small
number of intervals, but it is great if we form a lot of them.
Moreover, it is very general, usable whenever we have a set of
approximately normal estimators.

\subsection{Other Methods for Simultaneous Inference}

There are many other methods for simultaneous inference.  It should be
noted, though, that many of them are limited in scope, in contrast to
Scheffe's method, which is usable whenever one has multivariate normal
estimators, and Bonferonni's method, which is universally usable.

% \section{Conditional Confidence Intervals}
% 
% Often in simulations we will be interested in conditional quantities.
% In a queuing model, for instance, we might wish to estimate the mean
% wait time of jobs which arrive when the server is busy.  We still use
% the usual formulas, e.g. (\ref{meanci}) and (\ref{propci}), but keep in
% mind that the value of n in those formulas might be small, even though
% we might have run the simulation for a long time and the n for
% unconditional quantities would be large.
% 
% By the way, if a conditional confidence interval has confidence level, 
% say, 95\%, then that is also its unconditional confidence level.  To see
% this, let K = 1 or 0, depending on whether the interval contains the
% desired quantity, and say we are conditioning on C.  Then
% 
% \begin{equation}
% P(K = 1) = EK = E \left [ E(K|C) \right ] =
% E \left [ P(K = 1 | C) \right ] = E(0.95) = 0.95
% \end{equation}

\section{The Bootstrap Method for Forming Confidence Intervals}

Many statistical applications can be quite complex, which makes them very
difficult to analyze mathematically.  Fortunately, there is a fairly
general method for finding confidence intervals called the {\bf
bootstrap}.  Here is a brief overview of the type of bootstrap
confidence interval construction called {\bf Efron's percentile method}.

\subsection{Basic Methodology}
\label{basicboot}

Say we are estimating some population value $\theta$ based on i.i.d.
random variables $Q_i$, i = 1,...,n.  Note that $\theta$ and the $Q_i$
could be vector-valued.

Our estimator of $\theta$ is of course some function of the $Q_i$, 
$h(Q_1,...,Q_n)$.  For example, if we are estimating a population mean
by a sample mean, then the function h() is defined by

\begin{equation}
h(u_1,...,u_n) = \frac{u_1+...,+u_n}{n}
\end{equation}

Our procedure is as follows:

\begin{itemize}

\item Estimate $\theta$ based on the original sample, i.e. set

\begin{equation}
\widehat{\theta} = h(Q_1,...,Q_n)
\end{equation}

\item For j = 1,2,...,k:

   \begin{itemize}

   \item Resample, i.e. create a new ``sample,'' $\widetilde{Q}_1,..,
   \widetilde{Q}_n$, by drawing n times with replacement from
   $Q_1,..,Q_n$.

   \item Calculate the value of $\widehat{\theta}$ based on the
   $\widetilde{Q}_i$ instead of the $Q_i$, i.e. set

   \begin{equation}
   \widetilde{\theta}_j = h(\widetilde{Q}_1,...,\widetilde{Q}_n)
   \end{equation}

   \end{itemize}

\item Sort the values $\widetilde{\theta}_j$, j = 1,...,k, and let
$\widetilde{\theta}_{(k)}$ be the k$^{th}$-smallest value.

\item Let A and B denote the 0.025 and 0.975 quantiles of the
$\widetilde{\theta}_j - \widehat{\theta}$, i.e. 

\begin{equation}
\label{abci}
A = \widehat{\theta}_{(0.025n)} - \widehat{\theta} \textrm{ and }
B = \widehat{\theta}_{(0.975n)} - \widehat{\theta}
\end{equation}

(The quantities 0.025n and 0.975n must be rounded, say to the nearest
integer in the range 1,...,n.) 

\item Then your approximate 95\% confidence interval for $\theta$ is 

\begin{equation}
\label{bootci}
(\widehat{\theta} - B, \widehat{\theta} - A)
\end{equation}

\end{itemize}

\subsection{Example:  Confidence Intervals for a Population Variance}

As noted in Section \ref{cisigma2}, the classical chi-square method for
finding a confidence interval for a population variance $\sigma^2$ is
not robust to the assumption of a normally distributed parent
population.  In that section, we showed how to find the desired
confidence interval using the delta method.

That was a solution, but the derivation was complex.  An alternative
would be to use the bootstrap.  We resample many times, calculate the
sample variance on each of the new samples, and then form a confidence
interval for $\sigma^2$ as in (\ref{abci}).  We show the details using R
in Section \ref{compinr}

\subsection{Computation in R}
\label{compinr}

R includes the {\bf boot()} function to do the mechanics of this for us.
To illustrate its usage, let's consider finding a confidence interval
for the population variance $\sigma^2$, based on the sample variance,
$s^2$.  Here is the code:

\begin{Verbatim}[fontsize=\relsize{-2}]
# R base doesn't include the boot package, so must load it
library(boot)  

# finds the sample variance on x[c(inds)]
s2 <- function(x,inds) {
   return(var(x[inds]))
}

bt <- boot(x,s2,R=200)
cilow[rep] <- quantile(bt$t,alp)
cihi[rep] <- quantile(bt$t,1-alp)

print(mean(cilow <= 1.0 & 1.0 <= cihi))
\end{Verbatim}

How does this work?  The line

\begin{Verbatim}[fontsize=\relsize{-2}]
bt <- boot(x,s2,R=200)
\end{Verbatim}

instructs R to apply the bootstrap to the data set {\bf x}, with the
statistic of interest being specified by the user in the function {\bf
s2()}.  The argument {\bf R} here is what we called k in Section
\ref{basicboot} above, i.e. the number of times we resample n items from
{\bf x}.

Our argument {\bf inds} in {\bf s2()} is less obvious.  Here's what
happens:  As noted, the {\bf boot()} function merely shortens our work.
Without it, we could simply call {\bf sample()} to do our resampling.
Say for simplicity that n is 4.  We might make the call

\begin{Verbatim}[fontsize=\relsize{-2}]
j <- sample(1:4,replace=T)
\end{Verbatim}

and {\bf j} might turn out to be, say, c(4,1,3,3).  We would then apply
the statistic to be bootstrapped, in our case here the sample variance,
to the data $x[4], x[1], x[3], x[3]$---more compactly and efficiently
expressed as $x[c(4,1,3,3)]$.  That's what {\bf boot()} does for us.
So, in our example above, the argument {\bf inds} would be c(4,1,3,3)
here.

In the example here, our statistic to be bootstrapped was a very common
one, and thus there was already an R function for it, {\bf var()}.  In
more complex settings, we'd write our own function.  

\subsection{General Applicability}

Much theoretical work has been done on the bootstrap, and it is
amazingly general.  It has become the statistician's ``Swiss army
knife.''  However, there are certain types of estimators on which the
bootstrap fails.  How can one tell in general?

One approach would be to consult the excellent book {\it Bootstrap
Methods and Their Application}, by A. C. Davison and D. V. Hinkley, 
Cambridge University Press, 1997.

But a simpler method would be to test the bootstrap in the proposed
setting by simulation:  Write R code to generate many samples; get a
bootstrap confidence interval on each one; and then see whether the
number of intervals containing the true population value is
approximately 95\%.

In the sample variance example above, the code could be:

\begin{Verbatim}[fontsize=\relsize{-2}]
sim <- function(n,nreps,alp) {
   cilow <- vector(length=nreps)
   cihi <- vector(length=nreps)
   for (rep in 1:nreps) {
      x <- rnorm(n)
      bt <- boot(x,s2,R=200)
      cilow[rep] <- quantile(bt$t,alp)
      cihi[rep] <- quantile(bt$t,1-alp)
   }
   print(mean(cilow <= 1.0 & 1.0 <= cihi))
}
\end{Verbatim}

\subsection{Why It Works}

The mathematical theory of the bootstrap can get extremely involved, but
we can at least get a glimpse of why it works here.

First review notation:

\begin{itemize}

\item Our random sample data is $Q_1,...,Q_n$.

\item Our estimator of $\theta$ is $\widehat{\theta} = h(Q_1,...,Q_n)$.

\item Our resampled estimators of $\theta$ are $\widehat{\theta}_1,...,
\widehat{\theta}_k$.

\end{itemize}

Remember, to get any confidence interval from an estimator, we need the
distribution of that estimator.  Here in our bootstrap context, our goal
is to find the approximate distribution of $\widehat{\theta}$.  The
bootstrap achieves that goal very simply.

In essence, we are performing a simulation, drawing samples from the
empirical distribution function for our $Q_i$ data.  Since the empirical
cdf is an estimate of the population cdf $F_Q$, then the
$\widehat{\theta}_j$ act like a random sample from the resulting
distribution of $\widehat{\theta}$.

Indeed, if we calculate the sample standard deviation (``s'') of the
$\widetilde{\theta}_j$, that is an estimate of the standard error of
$\widehat{\theta}$.  If due to the delta method or other
considerations, we know that the asymptotic distribution of
$\widehat{\theta}$ is normal, then an approximate 95\% confidence
interval for $\theta$ would be

\begin{equation}
\widehat{\theta} \pm 1.96 \times 
\textrm{ standard deviation of the } \widehat{\theta}_j
\end{equation}

Efron's percentile method is more general, and works better for small
samples.  The idea is that the above discussion implies that the values

\begin{equation}
\widetilde{\theta}_j - \widehat{\theta}
\end{equation}

have approximately the same distribution as the values

\begin{equation}
\label{thetadiff}
\widetilde{\theta} - \theta
\end{equation}

Accordingly, the probability that (\ref{thetadiff}) is between A and B
is approximately 0.95, thus giving us (\ref{bootci}).

\section{Bayesian Methods}
\label{bayesian}

{\it Everyone is entitled to his own opinion, but not his own
facts}---Daniel Patrick Moynihan, senator from New York, 1976-2000

{\em Whiskey's for drinkin' and water's for fightin' over}---Mark Twain,
on California water jurisdiction battles 

{\it Black cat, white cat, it doesn't matter as long as it catches
mice}---Deng Xiaoping, when asked about his plans to give private
industry a greater role in China's economy

\bigskip

The most controversial topic in statistics by far is that of {\bf
Bayesian} methods.  In fact, it is so controversial that a strident
Bayesian colleague of mine even took issue with my calling it
``controversial''! 

The name stems from Bayes' Rule (Section \ref{bayesthm}),

\begin{equation}
\label{bayescopy}
P(A|B) = \frac{P(A) P(B|A)}{P(A) P(B|A) + P(\textrm{not }A)
P(B|\textrm{not } A)}
\end{equation}

No one questions the validity of Bayes' Rule, and thus there is no
controversy regarding statistical procedures that make use of
probability calculations based on that rule.  But the key word is {\it
probability}.  As long as the various terms in (\ref{bayescopy}) are
real probabilities, there is no controversy.  But instead, the debate
stems from the cases in which Bayesians replace some of the
probabilities in the theorem with ``feelings,'' i.e. NON-probabilities,
arising from what they call {\bf subjective prior distributions}.  The
key word is then {\it subjective}.  Our section here will concern the
controversy over the use of subjective priors.

Say we wish to estimate a population mean.  Here the Bayesian analyst,
before even collecting data, says, ``Well, I think the population mean
could be 1.2, with probability, oh, let's say 0.28, but on the other
hand, it might also be 0.88, with probability, well, I'll put it at
0.49...'' etc.  This is the analyst's subjective prior distribution for
the population mean.  The analyst does this before even collecting any
data.  Note carefully that he is NOT claiming these are real
probabilities; he's just trying to quantify his hunches.  The analyst
then collects the data, and uses some mathematical procedure that
combines these ``feelings'' with the actual data, and which then outputs
an estimate of the population mean or other quantity of interest. 

The Bayesians justify this by saying one should use all available
information, even if it is just a hunch.  ``The analyst is typically an
expert in the field under study.  You wouldn't want to throw away
his/her expertise, would you?'' 

The non-Bayesians, known as {\bf frequentists}, on the other hand
dismiss this as unscientific and lacking in impartiality.  ``In research
on a controversial health issue, say, you wouldn't want the researcher
to incorporate his/her personal political biases into the number
crunching, would you?''

\subsection{How It Works}

To introduce the idea, consider again the example of estimating p, the
probability of heads for a certain penny.  Suppose we were to
say---before tossing the penny even once---``I think p could be any
number, but more likely near 0.5, something like a normal distribution
with mean 0.5 and standard deviation, oh, let's say 0.1.''\footnote{Of
course, the true value of p is between 0 and 1, while the normal
distribution extends from $-\infty$ to $\infty$.  This, as noted in
Section \ref{normalimp}, the use of normal distributions is common for
modeling many bounded quantities.

Nevertheless, many Bayesians prefer to use a beta distribution for the
prior in this kind of setting.} The prior distribution is then
$N(0.5,0.1^2)$.  But again, note that the Bayesians do not consider it
to be a distribution in the sense of probability.  We are just using our
``gut feeling'' here, our ``hunch.''  This is an absolutely central
point.

So, Bayesians would not regard p as random here.  They would simply be
using the normal ``distribution'' for p to describe a degree of belief,
rather than a probability distribution.  (I will continue to use
quotation marks below for this reason.)

Nevertheless, in terms of the mathematics involved, it's as if the
Bayesians are treating p as random, with p's distribution being whatever
the analyst specifies as the prior.  Under this ``random p'' assumption,
the Maximum Likelihood Estimate (MLE), for instance, would change.  Our
data here is X, the number of heads we get from n tosses of the penny.
In contrast to the frequentist approach, in which the likelihood would
be

\begin{equation}
L = \left (
\begin{array}{c}
n \\
X 
\end{array}
\right )
p^X (1-p)^{n-X}
\end{equation}

it now becomes

\begin{equation}
\label{bayesmle}
L = \frac{1}{\sqrt{2\pi} ~ 0.1} ~ \exp{-0.5 [(p-0.5)/0.1]^2}
\left (
\begin{array}{c}
n \\
X 
\end{array}
\right )
p^X (1-p)^{n-X}
\end{equation}

This is basically P(A and B) = P(A) P(B$|$A), though using a density
rather than probability mass functions.  We would then find the value of
p which maximizes L, and take that as our estimate.  

Note how this procedure achieves a kind of balance between what our
hunch says and what our data say.  In (\ref{bayesmle}), suppose the mean
of p is 0.5 but n = 20 and X = 12 Then the frequentist estimator would
be X/n = 0.6, while the Bayes estimator would be about 0.56.
(Computation not shown here.) So our Bayesian approach ``pulled'' our
estimate away from the frequentist estimate, toward our hunch that p is
at or very near 0.5.  This pulling effect would be stronger for smaller n
or for a smaller standard deviation of the prior ``distribution.''

A Bayesian would use Bayes' Rule to compute the ``distribution'' of p
given X, called the {\bf posterior distribution}.  The analog of
(\ref{bayescopy}) would be (\ref{bayesmle}) divided by the integral of
(\ref{bayesmle}) as p ranges from 0 to 1, with the resulting quotient
then being treated as a density.  The MLE would then be the {\bf mode},
i.e. the point of maximal density of the posterior distribution.  

But we could use any measure of central tendency, and in fact typically
the mean is used, rather than the mode.  In other words:

\begin{quote}
To estimate a population value $\theta$, the Bayesian constructs a prior
``distribution'' for $\theta$ (again, the quotation marks indicate that
it is just a quantified gut feeling, rather than a real probability
distribution).  Then she uses the prior together with the actual
observed data to construct the posterior distribution.  Finally, she
takes her estimate $\hat{\theta}$ to be the mean of the posterior
distribution.
\end{quote}

% \subsection{Noninformative Priors}
% 
% In choosing a subjective prior distribution in the penny problem,
% another approach is that of a {\bf noninformative prior} distribution.
% In our penny example here, we might simply throw up our hands, and say
% ``We have no idea what the value of p is for this coin, so let's just
% set the prior to be U(0,1).''  Needless to say, this is even more
% controversial.  For those who use it, though, it is viewed as achieving
% other goals, such as avoiding model overfit, a problem discussed (in a
% non-Bayesian context) in Section \ref{overfit}.
% 
% \subsubsection{Empirical Bayes Methods}
% 
% Note carefully that if the prior distribution in our model is not
% subjective, but is a real distribution verifiable from data, the above
% analysis on p would not be controversial at all.  Say p does vary a
% substantial amount from one penny to another, so that there is a
% physical distribution involved.  Suppose we have a sample of many pennies,
% tossing each one n times.   If n is very large, we'll get a pretty
% accurate estimate of the value of p for each coin, and we can then plot
% these values in a histogram and compare it to the $N(0.5,0.1^2)$
% density, to check whether our prior is reasonable.  This is called an
% {\bf empirical Bayes} model, because we can empirically estimate our
% prior distribution, and check its validity.  In spite of the name,
% frequentists would not consider this to be ``Bayesian'' analysis.
% 
% Note that we could also assume that p has a general $N(\mu, \sigma^2)$
% distribution, and estimate $\mu$ and $\sigma$ from the data.  We could
% deal with the situation in which n is only moderately large, as long as
% it's large enough for the Central Limit Theorem to work well, but we
% will not pursue that point here.

\subsection{Extent of Usage of Subjective Priors}

Though some academics are staunch, often militantly proselytizing
Bayesians, only a small minority of statisticians in practice use the
Bayesian approach.   It is not mainstream. 

One way to see that Bayesian methodology is not mainstream is through
the R programming language.  For example, as of December 2010, only
about 65 of the more than 3000 packages on CRAN, the R repository,
involve Bayesian techniques.  (See
\url{http://cran.r-project.org/web/packages/tgp/index.html}.) There is
actually a book on the topic, {\it Bayesian Computation with R}, by Jim
Albert, Springer, 2007, and among those who use Bayesian techniques,
many use R for that purpose.  However, almost all general-purpose books
on R do not cover Bayesian methodology at all.

Significantly, even among Bayesian academics, many use frequentist
methods when they work on real, practical problems.  Choose an academic
statistician at random, and you'll likely find on the Web that he/she
does not use Bayesian methods when working on real applications.  

\subsection{Arguments Against Use of Subjective Priors}

As noted, most professional statisticians, including me, are
frequentists.  What are the arguments made in this regard?

First, the following must be noted carefully:

\begin{quote}

Ultimately, the use of any statistical analysis is to make a decision about
something.  This could be a very formal decision, such as occurs when
the Food and Drug Administration (FDA) decides whether to approve a new
drug, or it could be informal, for instance when an ordinary citizen
reads a newspaper article reporting on a study analyzing data on traffic
accidents, and she decides what to conclude from the study.

Frequentists believe that there is nothing wrong using one's gut
feelings to make a final decision, but it should not be part of the
mathematical analysis of the data.  One's hunches can play a role in
deciding the ``preponderance of evidence,'' as discussed in Section
\ref{preponderance}, but that should be kept separate from our data
analysis.  

If for example the FDA's data shows the new drug to be effective, but at
the same time the FDA scientists still have their doubts, they may
decide to delay approval of the drug pending further study.  So they can
certainly act on their hunch, or on non-data information they have
concerning the drug.  But the FDA, as a public agency, has a
responsibility to the citizenry to state what the data say, i.e. to
report the frequentist estimate, rather than merely reporting a
number---the Bayesian estimate---that mixes fact and hunch.

\end{quote}

Thus, the Bayesian rallying cry, ``It would be wrong to ignore any
information we possess to supplement our data, even if that information
is just a hunch,'' is presenting us with a false choice.  The
frequentists have never advocated ignoring hunches.  However, they don't
plug their hunches into a mathematical formula, just as jurors in a
trial don't plug their hunches into a mathematical formula either. 

The Bayesians say that in some cases, a Bayesian estimator may, for
instance, produce smaller mean squared estimation error (recall Section
\ref{msesection}) than its frequentist counterpart, even if the prior
distribution was just in our imaginations.  But again, this argument is
incorrectly implicitly presuming that frequentists ignore their hunches,
which is not the case.

Moreover, in most applications of statistics, there is a need for
impartial estimates.  As noted above, even if the FDA acts on a hunch to
delay approval of a drug in spite of favorable data, the FDA owes the
public (and the pharmaceutal firm) an impartial report of what the data
say.  Bayesian estimation is by definition not impartial.  One Bayesian
statistician friend put it very well, saying ``I believe my own
subjective priors, but I don't believe those of other people.''  His
statement should be considered by any potential user or consumer of
Bayesian statistics.

Furthermore, in practice we are typically interested in inference, i.e.
confidence intervals and significance tests, rather than just point
estimation.  We are sampling from populations, and want to be able to
legitimately make inferences about those populations.  For instance,
though one can derive a Bayesian 95\% confidence interval for p for our
coin, it really has very little meaning, and again is certainly not
impartial.

Consider the following scenario.  Steven is running for president.  Leo,
his campaign manager, has commissioned Lynn to conduct a poll to assess
Steven's current support among the voters.  Lynn takes her poll, and
finds that 57\% of those polled support Steven.  But her own gut
feeling as an expert in politics, is that Steven's support is only 48\%.
She then combines these two numbers in some Bayesian fashion, and comes
up with 50.2\% as her estimate of Steven's support.

She then reports to Steven that she estimates Steven's support to be
50.2\%.  Leo asks Lynn how she arrived at that number, and she explains
that she combined her prior distribution with the data.  {\bf But Leo
then says, ``Lynn, I really respect your political expertise, but I'd
like you to tell me separately---what did the data say, and what is your
own gut feeling?  Lynn then tells Leo the two numbers, 57\% and 48\%,
separately, and Leo finds both of them useful, in different senses.}

\subsubsection{What Would You Do?}

In evaluating the frequentist/Bayesian debate, you might wish to ask
yourself what you would do in the following situations:

\begin{itemize}

\item As a personal investor, you've developed a statistical model for
the day-to-day price variation of Google stock prices, and will use it
to decide whether to buy the stock today.  You wish to predict the price
of the stock tomorrow, based on its price the last few days.  Here are
your choices:

   \begin{itemize}

   \item As a frequentist, you could use a classical mathematical model,
   say regression analysis (Chapter \ref{chap:linreg}), say fitting a linear or
   polynomial model.  You could use the data to estimate the parameters
   of the model.  This would give you a predicted price for tomorrow.
   Note that you can still choose to ignore that predicted price in the
   end, based on a hunch, but you've kept that hunch separate from your
   data analysis.

   \item As a Bayesian, you might use the say linear or polynomial
   regression model, but you would  specify a subjective prior
   distribution for the parameters.   Your predicted price would then be
   affected by that subjective prior.

   \end{itemize}

So, what would you deem wise here---a frequentist or Bayesian approach?

\item We are in a presidential election, complete with opinion polls as
to who is currently winning.  As an involved citizen, would you rather
that the pollsters simply report the data as is, with their reported
margin of error being computed from the traditional frequentist methods
we've seen so far, or would you prefer that they factor in their own
feelings via subjective priors?

So, what would you deem wise here---a frequentist or Bayesian approach?

\item You are a physician reading a medical journal article about the
effectiveness of a certain drug for alleviating high blood pressure.
Would you rather that the authors of the article simply report a
straightforward analysis of the data, or would you prefer that the
author incorporate a subjective prior distribution into his/her
mathematical model?

So, what would you deem wise here---a frequentist or Bayesian approach?

\end{itemize}

\startproblemset

{\bf Note to instructor:}  See the Preface for a list of sources of
real data on which exercises can be assigned to complement
the theoretical exercises below. 

\oneproblem
Consider raffle ticket example in Section \ref{raffle}. Suppose 500
tickets are sold, and you have data on 8 of them. Continue to assume
sampling with replacement.  Consider the Maximum Likelihood and Methods
of Moments estimators.

\begin{itemize}

\item [(a)] Find the probability that the MLE is exactly equal to the
true value of c. 

\item [(b)] Find the exact probability that the MLE is within 50 of the
true value. 

\item [(c)] Find the approximate probability that the Method of Moments
estimator is within 50 of the true value.

\end{itemize}

\oneproblem
Suppose I = 1 or 0, with probability p and 1-p, respectively. Given I,
X has a Poisson distribution with mean $\lambda_I$. Suppose we have        
$X_1,...,X_n$, a random sample of size n from the (unconditional)
distribution of X. (We do not know the associated values of I, i.e.
$I_1,...,I_n$.) This kind of situation occurs in various applications.
The key point is the effect of the unseen variable. In terms of
estimation, note that there are three parameters to be estimated.    

\begin{itemize}

\item [(a)] Set up the likelihood function, which if maximized with respect to
the three parameters would yield the MLEs for them.  

\item [(b)] The words {\it if} and {\it would} in that last sentence
allude to the fact    that MLEs cannot be derived in closed form.
However, R's {\bf mle()}  function can be used to find their values
numerically.  Write R code to do this.  In other words, write a function
with a single argument {\bf x}, representing the $X_i$, and returning
the MLEs for the three parameters.

\end{itemize}

\oneproblem
Find the Method of Moments and Maximum Likelihood estimators of the
following parameters in famous distribution families:

\begin{itemize}

\item p in the binomial family (n known)

\item p in the geometric family

\item $\mu$ in the normal family ($\sigma$ known)

\item $\lambda$ in the Poisson family

\end{itemize}

\oneproblem
For each of the following quantities, state whether the given
estimator is unbiased in the given context:

\begin{itemize}

\item [(a)] (4.15), p. 97, as an estimator of $\sigma^2$

\item [(b)]  $\hat{p}$, as an estimator of p, p.105

\item [(c)] $\hat{p} (1-\hat{p})$, as an estimator of p(1-p), p.105

\item [(d)] $\bar{X} - \bar{Y}$, as an estimator of $\mu_1 - \mu_2$,
p.107

\item [(e)] $\frac{1}{n} \sum_{i=1}^n (X_i-\mu_1)^2$ (assuming $\mu_1$ is
known), as an estimator of $\sigma_1^2$, p.107

\item [(f)] $\bar{X}$, as an estimator of $\mu_1$, p.107, {\it but
sampling (from the population of Davis) without replacement}

\end{itemize}

\oneproblem
Consider the Method of Moments Estimator $\hat{c}$ in the raffle
example, Section \ref{raffle}.  Find the exact value of $Var(\hat{c})$.
Use the facts that $1+2+...+r = r(r+1)/2$ and $1^2+2^+...+r^2 =
r(r+1)(2r+1)/6$.

\oneproblem
Suppose $W$ has a uniform distribution on (-c,c), and we draw a random
sample of size n, $W_1,...,W_n$.  Find the Method of Moments and Maximum
Likelihood estimators.  (Note that in the Method of Moments case, the
first moment won't work.)

\oneproblem
An urn contains $\omega$ marbles, one of which is black and the rest
being white.  We draw marbles from the urn one at a time, without
replacement, until we draw the black one; let $N$ denote the number of
draws needed.  Find the Method of Moments estimator of $\omega$ based on
X.

\oneproblem
Suppose $X_1,...,X_n$ are uniformly distributed on (0,c).  Find the
Method of Moments and Maximum Likelihood estimators of c, and compare
their mean squared error.

Hint:  You will need the density of $M = \max_i X_i$.  Derive this by
noting that $M \le t$ if and only if $X_i \leq t$ for all i = 1,2,...,n.

\oneproblem
Add a single line to the code on page \pageref{bussim} that will print out the
estimated value of Var(W).

\oneproblem
In the raffle example, Section \ref{raffle}, find a $(1-\alpha)$\%
confidence interval for c based on $\check{c}$, the Maximum Likelihood
Estimate of c.

\oneproblem
In many applications, observations come in correlated clusters.  For
instance, we may sample r trees at random, then s leaves within each
tree.  Clearly, leaves from the same tree will be more similar to each
other than leaves on different trees.

In this context, suppose we have a random sample $X_1,...,X_n$, n even,
such that there is correlation within pairs.  Specifically, suppose
the pair $(X_{2i+1},X_{2i+2})$ has a bivariate normal distribution with
mean $(\mu,\mu)$ and covariance matrix

\begin{equation}
\left (
   \begin{array}{rr}
   1 & \rho \\
   \rho & 1
   \end{array}
\right )
\end{equation}

i = 0,...,n/2-1, with the n/2 pairs being independent.  Find the Method
of Moments estimators of $\mu$ and $\rho$.

\oneproblem
Suppose we have a random sample $X_1,...,X_n$ from some
population in which $EX = \mu$ and $Var(X) = \sigma^2$.  Let
$\overline{X} = (X_1+...+X_n)/n$ be the sample mean.  Suppose the data
points $X_i$ are collected by a machine, and that due to a defect, the
machine always records the last number as 0, i.e. $X_n = 0$.  Each of
the other $X_i$ is distributed as the population, i.e. each has mean
$\mu$ and variance $\sigma^2$.  Find the mean squared error of
$\overline{X}$ as an estimator of $\mu$, separating the MSE into
variance and squared bias components as in Section \ref{bv}.

\oneproblem
Suppose we have a random sample $X_1,...,X_n$ from a population in which
X is uniformly distributed on the region $(0,1) \cup (2,c)$ for some
unknown c $>$ 2. Find closed-form expressions for the Method of Moments
and Maximum Likelihood Estimators, to be denoted by $T_1$ and $T_2$,
respectively. 

