\documentstyle{article}

\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\topmargin}{-0.3in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{9.8in}
\setlength{\parindent}{0in}
\setlength{\parskip}{0.2in}

\begin{document}

\title{\bf On Decentralized Task Allocation and Scalable Cache Coherency\rm}

\author{Norman Matloff \\
Division of Computer Science \\
University of California at Davis \\
Davis, CA  95616 \\
matloff@heather.cs.ucdavis.edu \\
(916) 752-1953, (916) 752-7004}
\maketitle

\begin{abstract}
Recently there has been considerable interest in
cache coherency protocols in shared-memory multiprocessor systems,
particularly in protocols which are scalable, i.e. suitable for very 
large systems.  However, cache coherency scalability (CCS) entails heavy
performance overhead and system cost, so a critical examination of
the assumptions underlying the quest for CCS is undertaken here.
The focus is on monitor variables, which comprise the majority
of cache invalidations.  Evidence is presented here that in most
applications, these invalidations can be greatly reduced, using a 
very simple technique.  The findings provide theoretical and
empirical justification for some types of CCS hardware, while
indicating that other types of such hardware should be eliminated.
An architecture which provides only ``locally, but not globally,
coherent" hardware support is also proposed.
\end{abstract}

\maketitle

\section{Introduction}

     In the last few years, scalability of cache coherency protocols
has been one of the most active topics of research in shared-memory
multiprocessor systems.  A nice survey of this area, with a good
list of references, was presented in [Stenstrom], and new papers have 
continued to appear since then.  However, cache coherency scalability 
(CCS) requires complex hardware, which carries significant overhead,
inhibiting system performance, both in absolute terms and also with
respect to performance/price ratios.  It is thus worthwhile to examine 
the fundamental question of whether CCS is a sufficiently desirable 
property to warrant the heroic efforts which are being made to 
achieve it.  This question is addressed here, and evidence toward an
at least partially negative answer to it is presented.  

     The key point involves the type and extent of hardware which
should be devoted to support for coherent access to globally shared
data, i.e. data which are at least potentially shared by all processors
in the system.  It is found here that a large portion of accesses to such 
data can be replaced, in a very simple manner, by data shared only by a
fixed number of processors, with that fixed number being independent of overall 
system size.  This
obviates much of the need for scalability.  The findings here provide 
theoretical and empirical justification for some types of CCS hardware, 
while indicating that other types of such hardware should be eliminated.
An architecture which provides only ``locally, but not globally,
coherent" hardware support is also proposed.

     The concept of CCS may be summarized as follows:  A central 
problem in the design of shared-memory processors is contention for 
both memory and the processor-memory interconnect.  Installing cache 
memories at the processors reduces this contention, but if shared 
variables may be cached, the problem arises of consistency among 
several copies of such variables.  Clever protocols for insuring 
consistency have been developed, but these protocols add significant 
overhead, negatively impacting system performance.  This overhead is 
particularly problematic in systems with large numbers of processors, 
i.e. there is a {\it scalability} problem. 

     At the heart of the matter is atomic access to monitor variables 
(MVs), i.e. temporarily exclusive access of one processor to a lock
variable or equivalent.  Empirical work has shown that the 
majority of coherency transactions involve such variables [Agarwal89].
For this reason, the major focus of this paper is on MVs.

     For reasons which will emerge below, let us roughly divide the 
class of MVs into those which are frequently accessed (FAMVs), and the 
rarely-accessed ones (RAMVs).  A typical example of an FAMV is a lock 
variable that is guarding access to a task-allocation data structure.
If task time and the access time of the data structure itself are small, 
the lock variable will be an FAMV. 

     RAMVs often occur as {\it barrier} variables [Boyle].  For example,
such a variable may be used at the end of a parallel program, to determine
when all processors have finally finished their last tasks.  In such a
setting, this variable is accessed only at the very end of execution,
and is thus an RAMV.  (On the other hand, some barrier variables are
FAMVs, as an example in Section 5 shows.)

     Another example of RAMVs is the ``column-ready" variable type in the 
LU matrix decomposition application used in [Gupta].  Let us denote such a
variable for column i by $C_i$.  For each column, sequentially from 
left to right, a ``phase-A" process is done, followed by an associated 
``phase-B" process of all columns to the right of the current one.  
When column i's phase A is finished, $C_i$ will change value so as 
to give a Proceed signal to the other processors, so that they may perform 
phase B for columns j, j $>$ i.  The scalability context here would 
concern the effect on $C_i$ as matrix size increases, which will be 
that $C_i$ will become more and more an RAMV as the matrix size 
increases.

     Atomic access to an MV can be implemented either in hardware, for
example in cache coherency protocols and related support mechanisms
such as test-and-set instructions, or indirectly in software.  Clearly, 
hardware support is justified only for FAMVs, not for RAMVs.  Accordingly, 
FAMVs form a major underlying motivation in the quest for CCS. 

     As mentioned earlier, CCS schemes carry very substantial overhead, 
which slows system speed.  The slowdown stems from the fact that when a 
cached MV changes value, that event must be made known to other caches 
which contain copies of the MV.  These invalidation messages take time, 
especially in larger systems.

     For example, consider the Scalable Coherent Interface (SCI) standard 
[James], a directory-based cache-coherency protocol.  SCI features the 
ability to create very long linked lists of processors which share an MV.  
SCI manages these lists very efficiently.  However, even with the highest 
efficiency, the propagation delay for a cache update message along such 
a list is quite substantial, with an attendant slowdown in system speed.
Improvements have been suggested, e.g. adding redundant links so that a
message can be sent to several caches simultaneously, but the point still
is that as the system grows, the mean time to propagate an invalidation 
message will grow as well.

     The same difficulty occurs in all cache-coherency schemes for large 
systems:  When a processor writes to a cache line, the cache must send
cache-invalidation messages to all other caches currently sharing that
block.  This is the root of the scalability problem, because the larger
the system, the larger the potential number of caches which may share a
block, and thus the more problematic the sending of invalidation messages
becomes. 

     Again, FAMVs are common in current implementations of parallel 
processing algorithms, usually in support of mechanisms to allocate tasks 
to the various processors.  When a processor finishes a task, it will then 
seek a new task to perform.  However, since processors must contend with 
each other for access to the task-allocation data structure,
a large number of cache invalidations may be generated.

     One possible solution to this problem would be to replace this
centralized task allocation by decentralized task allocation:  Processors
would be divided into groups, and instead of having one task-allocation
data structure, there would be several, one for each processor group.
A processor would then contend only with other processors in its group.
The significance of this is that the number of invalidation messages
generated for any given write is limited to the group size.  {\it In other
words, the number of invalidation messages for a given write does
\underline{not}
grow as the system size grows, and thus there is no scalability problem
in task-allocation actions.}  This will form the central topic of
this paper.

     In Section 2, the implementation of decentralized task allocation 
will be outlined for several different types of applications.  This is
done to clarify the subsequent discussion, to demonstrate that it entails
virtually no additional effort on the part of the programmer, and to 
emphasize that it is architecture-independent, a general programming
technique for use on {\it any} shared-memory multiprocessor, i.e. any
multiprocessor with a single, completely shared address space.

     In Section 3, a mathematical model for these ideas is presented 
which demonstrates that decentralized task allocation is efficient.
This is then supplemented by simulation results for specific application 
types; it is here that we see that decentralized task allocation is not 
only as good as the centralized kind, but is actually {\it superior}, 
once we account for the overhead involved in accessing a task list.

     Section 4 then discusses the architectural implications of these
findings.  First, it is pointed out that these results provide some
justification for the clustering in DASH [Lenoski], and for the 
limited-hardware features of the LimitLESS protocol [Chaiken91] of
ALEWIFE.  On the other hand, the results here are shown to suggest
that some features of these machines, such as DASH's lock queues,
should be eliminated or at least modified.  It is also indicated
that modifications to SCI, such as the redundant pointers mentioned
earlier, may be of lesser importance than previously thought.
Also, a new class of machines, called LCSMA (locally coherent
shared-memory architectures) is proposed, with the main feature
being that hardware is devoted only to within-group cache coherence,
not between-group coherence; in other words, in this proposed
architecture, CCS has been eliminated entirely as a design criterion.

     Section 5 then presents some discussion.


\section{Implementing Decentralized Task Allocation}

     This section will outline the implementation of decentralized task 
allocation for several different types of applications.  Before beginning, 
it should again be noted that this is an {\it architecture-independent}
programming technique,\footnote{
The term architecture-independent is used here in the sense of [Boyle],
meaning from the point of view of the applications programmer (in the
rest of this paper, this latter term will be abbreviated to ``programmer").
Machine dependencies, such as test-and-set instructions, are made
transparent to the programmer, by making available a set of system
calls, as in [Boyle].  The analogy here is that of Unix, say 4.3 BSD,
which provides an architecture-independent interface to programmers
on many different machines.}
described in [Ni] for {\it general} shared-memory multiprocessors.

     In the decentralized version of any application, at the beginning
of the program there would be a system call:

\begin{verbatim}
   GN = GetGroupNumber();
\end{verbatim}

which will place into GN the group number of this processor, analogous
to the Unix getpid() and getpgrp() calls.  The group number will range
between 0 and $g-1$, where g is the number of groups.

     The first example is a simpler one, called MATMULT, involving the
multiplication of a single column vector by a large, square matrix. 
In the centralized version, the task-allocation 
data structure consists simply of a single global variable, called 
Next\_Row, which contains the number of the next row available to be 
multiplied.  When a processor finishes the multiplication of a row, 
it performs

\begin{verbatim}
   R = Next\_Row++;
\end{verbatim}
 
and then does the multiplication of row number R.  

     In the decentralized version, the direct analogy of Next\_Row is
a g-element array Local\_Next\_Row.\footnote{Actually, there should be 
more than g elements, with much padding of 0s,
to prevent cache-block aliasing (i.e. to prevent having two nonzero
elements of Local\_Next\_Row in the same cache block) when applied to 
Section 4 and the CCS question.  This would again be done via a system 
call, GetPaddingSize(), so that the programmer would not have to worry
about the underlying architecture.  But for convenience of notation,
this will be ignored here.}
There is also a variable Glob\_Next\_Row 
and another g-element array Local\_Last\_Row, which are used indirectly,
as will now be seen.  When a processor finishes the multiplication of a 
row, it performs

\begin{verbatim}
   R = Local\_Next\_Row[GN]++;
   if (Local\_Next\_Row[GN] > Local\_Last\_Row[GN])  {
      Local\_Next\_Row[GN] = Glob\_Next\_Row;
      LR = Glob\_Next\_Row += K;
      Local\_Last\_Row[GN] = LR - 1;
   }
\end{verbatim}

What is happening here is the following:  Initially each group is
allocated a set of K consecutive rows of the matrix, K being a design
parameter.  The processors in that group process these rows one-by-one,
just as in the centralized case, with Local\_Next\_Row[GN] playing the
role corresponding to Next\_Row in the centralized version.  When one
of the processors discovers that these K rows have all been processed,
it goes to Glob\_Next\_Row to get K more rows for its group.  (Compare
to [Kruskal], in which each {\it processor}, rather than each group, 
is assigned K tasks at a time.  By assigning batches of K tasks to 
groups rather than processors, work is distributed more evenly, and 
thus overall completion time is reduced.)  

     In the centralized version, the programmer must signal to the
compiler or OS that NextRow is to accessed atomically, e.g. with a
test-and-set instruction.  In the decentralized version, this must
be done for Glob\_Next\_Row and the two arrays.

     Except for these few lines, the remainder of the code would be
line-for-line identical in the centralized and decentralized versions.  
The point is that the programmer does \underline{not}
assign processes to processors, etc.

     Note that though the entire arrays Local\_Next\_Row and Local\_Last\_Row 
are global variables visible to all processors, the code is written so
that a processor in one group will never access the array element
intended for another group.  In the code above, for instance, the only 
array index used is GN.  

     Another well-known application is that of ``the heat equation" (HEAT).
Here one has, say, a two-dimensional grid of points, and an iterative 
computation.  The iteration n value at any given point is computed as 
the average of the iteration (n-1) values at the four neighboring points.  
In order to minimize processor idle time, an asynchronous setup will be 
assumed here, so that different processors might be working on different 
iterations at the same time.

   In the centralized version, there might be a linked list, ReadyList, 
showing which points are currently eligible for another 
iteration.\footnote{Actually, it is more common to have a processor work on a 
{\it set} of
points in one iteration, say a vertical strip of points, but the
discussion here will be simpler if we think of a processor as being
assigned to one point at a time.}
When a processor finishes a point, it would simply delete the head of
this list, and start work on the point specified by the deleted list item.
The decentralized case would be very similar, with the grid now being
divided into g regions, and the i-th processor group being assigned to 
the i-th region.  ReadyList would now be an array of linked lists, with 
ReadyList[GN] being the list for group GN.  (This arrangement for HEAT,
and that of QUAD below, could be made more dynamic, by setting up a
variable analogous to Glob\_Next\_Row in MATMULT above; however, the setup 
here is quite sufficient in these two applications.)  

     The next application, QUAD, involves quadrature, i.e. numerical
integration [Boyle].  The region of integration is successively bisected
into an increasing number of subintervals.  A task consists of bisecting 
the given subinterval, applying the trapezoidal rule to each of the two 
parts, and comparing the area so obtained with the area determined for 
the original subinterval.  If the difference is within a specified 
tolerance level, no further bisection is done; if not, one of the parts
is then processed again by the current processor, and the other is added 
to the task list.  

     In the centralized version, there is one global task list.  In
the decentralized version, the region of integration is divided
into g subregions, with the i-th processor group working only on 
intervals within the i-th subregion.

     Two applications similar to those above are described here 
briefly.  First, consider the LU matrix decomposition mentioned earlier.  
It is similar
to the MATMULT example described above, with columns being analogous
to rows.  In the straightforward implementation described here (more
aggressive approaches can be used), after the phase-A operation for 
column i, the centralized version would set Next\_Column to i+1, and
the processors would acquire tasks using this variable, as they did
with Next\_Row in MATMULT.   The decentralized version would again be
analogous to that of MATMULT, with for example the variable
Global\_Next\_Column being set to i+1, and incremented by K whenever 
a processor acquires a set of K columns on behalf of its group.

     Another problem to consider is that in which one is given a set 
of letters, say o, n and w, and finds all words in a given dictionary 
that can be formed from these letters (no duplication allowed).  
Suppose a depth-first tree search is used, as shown in Figure 1.  
Calling the root level 0, we see that potential 1-letter words are 
considered in level 1, 2-letter words in level 2 and 3-letter words 
in level 3.  Each potential word is checked against the given 
dictionary.  Potential words found to be actual words are flagged 
with `*', while nodes found to be ``dead ends"---i.e. for which not 
only the current combination of letters is a nonword, but also no 
word in the dictionary begins with that sequence of letters---are marked
with `\&'.  A processor at a given node will first generate all possible
one-letter extensions of the ``word" at that node.  It will add
all but one of them to the task list, and process the remaining
one itself.  Clearly the centralized and decentralized versions
of this algorithm can be handled in analogy to QUAD above.

     The vast majority of applications can be partitioned in
manners similar to those indicated here.


\section{Algorithmic Efficiency of Decentralized Task Allocation}

     As mentioned in Section 1, MVs account for the majority of cache 
invalidation messages in systems with
shared-block caches.  Decentralized task allocation can help solve this
problem, since the number of messages needed for a given invalidation 
is limited to the group size.  If the group size is kept fixed while
the number of processors grows, i.e. in the scalability context, then
CCS becomes less of an issue.

     Decentralized task allocation eliminates most systemwide FAMVs, 
but the question then arises as to whether this can be done efficiently.  
Clearly, there will be {\it some} nonzero loss of efficiency, since a 
processor in one group might lie idle while there is work waiting to be 
assigned in another group.  The question then becomes one concerning the 
{\it degree} of this loss.  This question is investigated in this section, 
in which it is found that the degree of loss is actually quite small.

     The investigation here is two-pronged.  First, in Section 3.1,
a mathematical analysis is described, which shows that decentralized
task allocation can indeed be done efficiently with a fixed group
size, even with arbitrarily large overall system size.  Moreover, we
find that a ``universal" group size exists, information which then
can be incorporated into hardware design.  Then in Section 3.2, this 
efficiency is illustrated concretely by simulating three specific 
applications.  Even more significantly, the simulations can in
addition account for the overhead a processor spends in actually
acquiring a task from a task allocator, and it is found that when
such time is included in the analysis, decentralized allocation is
actually \underline{superior} to centralized allocation.


\subsection{M/M/p Analysis}

     Related questions prompted the investigation in [Ni] (in a
non-CCS context), which modeled a p-processor system as an M/M/p queue.  
In this model, each new task joins a central queue, with tasks from 
the queue being assigned on a first-available-processor basis.  
The main quantity computed, called DLI (``degree of load imbalance"), 
was the probability of having at least one idle processor in some group, 
simultaneously with at least one queued task in some other group.  This 
was appropriate for the fixed number of processors considered in that
paper, but not for our scalability context here, which examines what 
occurs as the number of processors increases.  If the per-processor 
utilization and group size are held constant while the number of 
processors increases (i.e. overall workload increases proportionally 
to the number of processors), DLI will increase to 1.0.  Thus DLI 
is not a useful criterion in our context.

     Instead of DLI, the analysis here uses another criterion, 
mean queuing time (MQT), i.e. the mean time a task waits before it is
started by some processor.\footnote{Actually, this was considered 
somewhat in [Ni] too, but again not from
the point of view taken here.}
Ideally, we would like MQT to be zero.  That is of course unattainable,
but it can be shown that MQT can be kept close to zero without making 
the group size increase with p.  Specifically, let MPT denote the mean
processor time needed for a given task.  Then the analysis yields the 
following:

{\bf Theorem:}

Fix positive values $\epsilon$ and $\delta$.  Then there exists a group 
size $\gamma$ such that if work is allocated on the group level rather
than on the system level, all applications having per-processor utilization 
$\leq  1 - \epsilon$ will have $MQT/MPT < \delta$, no matter how large p is.

     The significance of this is as follows:  We can choose $\epsilon$ small
enough so that most applications are included, and choose $\delta$ small 
enough so that queuing time is an acceptably small proportion of processor 
time per task.  These parameters then give us a ``universal" group size, 
${\gamma}_{univ}$, for which all applications in the given range will run 
efficiently, with decentralization inducing only a negligible increase
in task wait time.
     
     It is a key point that the group size ${\gamma}_{univ}$ is applicable 
no matter how large p is, i.e. no matter how many processors the system 
has.  Partitioning the system into groups of size ${\gamma}_{univ}$ will 
be efficient for any system size.  

     The Theorem is proved in Appendix 1.


\subsection{Simulation Analysis of Some Specific Application Classes}

     The M/M/p queueing model studied in Section 3.1 provides valuable
insight.  However, it does not capture some facets of the behavior 
of some applications, such as the following:
 
\begin{itemize}
\item[(a)]
First, the M/M/p environment is an {\it infinite-horizon} problem, in 
which there is an infinite stream of tasks to process.  This differs 
significantly from {\it finite-horizon} problems, in which a finite 
number of tasks is specified at the outset, because in finite-horizon 
problems, the number of idle processors increases as we near the end of an 
application.  

\item[(b)]
In the M/M/p environment new tasks arrive from outside the system, 
with no relation between tasks.  Yet in many applications there are
dependencies between tasks.  For instance, in the HEAT example in Section 
2, the tasks to be allocated are generated in a feedback-loop manner, 
with the completion of one task typically triggering the readiness of 
another.  

\item[(c)]
The M/M/p analysis does not take into account the overhead involved in
a processor's accessing a task-allocation list.

\item[(d)]
In an M/M/p system, the standard deviation of the task time necessarily
equals the mean.  Thus M/M/p analysis does not allow investigation of
the effect of changing the standard deviation with other factors being
held constant.
\end{itemize}

     For these reasons, three specific application classes, MATMULT,
HEAT and QUAD from Section 2, were simulated, to verify that the 
qualitative implications of Section 3.1---namely, that one can avoid 
having systemwide, i.e. nongrouped, FAMVs without any real performance 
penalty---hold in these settings.  

     Each of the applications was simulated, with task times---the time 
to do one row multiplication, the time to do one quadrature computation 
and comparison, and the time to do one grid iteration---normalized
to have mean 1.  The task time distribution was taken to be Gaussian
(truncated at 0), with standard deviation $\sigma$.  In the first
experiments described here, $\sigma$ was taken to be 0.3, but by varying 
$\sigma$, item (d) listed above could be investigated, as will be seen 
below.  This is also the reason for using the term {\it application class}.  
For example, it allows not only simulation of matrix-multiplication problems,
but also other problems which have similar structure to matrix-multiplication
but which have much different standard deviations.  

     The simulation tool used was {\small CSIM}
a process-oriented simulation environment for use with the C language
[Schwetman].  Each processor was represented by one {\small CSIM} 
process.  For each setting, 1000 replications were run, e.g. 1000
matrix multiplications for each combination of parameter values
considered.

     The results of these simulations are presented in Tables 1-9.
Note that the cases in which $\gamma$ = p represent centralized task
allocation, while the others are decentralized.  The parameter r
represents the problem size:  the number of rows in MATMULT, the
number of grid points in HEAT, and the number of initial subintervals
in QUAD.

     The simulations did confirm the findings of Section 3.2.  For example, 
consider the instance of MATMULT shown in Table 1.  The performance penalty 
was almost zero for $\gamma$ = 16, and was only about 2\% for the very small 
group size $\gamma$ = 8.\footnote{In the MATMULT simulations presented here, 
the parameter K was taken to be equal to $\gamma$.}

     Similarly, the experiments with QUAD yielded the results shown in
Table 2.  Again we see only a minor loss of efficiency due to using 
decentralized task allocation, even with very small group sizes.  The 
results for HEAT, in Table 3, were again similar.

     Table 4 shows the results from running MATMULT again, on a larger
system.  Again, a group size of 16 seems quite sufficient.

     Thus the simulations, dealing with real applications, do confirm
the findings of Section 2.  Much more significantly, the simulations
allow us to introduce another element of realism which was missing in
Section 2, concerning $\omega$, the time cost involved when a processor 
accesses a task list.  This time is manifested not only in the direct 
cost to the processor, but also indirectly in that a processor may need 
to spend time queuing behind other processors which are accessing the 
task list.

     To investigate this, the simulations were rerun, with $\omega$ equal
to 0.02 (recall that the mean task time is 1.00).  Tables 1-4 then become 
Tables 5-8.  Note that the effect of nonzero $\omega$ for large systems 
is especially prominent (Table 8).  Using centralized task allocation 
results in an increase in completion time by a factor of approximately 
5, a severe penalty.  The effect is larger in this case because of the
smaller coefficient of variation (see below), since the processors
will tend to finish tasks at bunched-together times and thus contend
with each other more when accessing task lists.

     In other words, decentralized task allocation is not only {\it as
efficient as} centralized allocation in terms of processor utilization,
but is actually {\it better than} centralized allocation, since it greatly 
reduces the overhead due to processor queuing for task lists.  

     As mentioned in item (d) at the beginning of this section,
the effect of varying $\sigma$ was also investigated.  Recall that the
results presented above had $\sigma$ = 0.3.  This actually was a rather
large value, for typical applications (see Appendix 2).
In other words, the value $\sigma$ = 0.3 used in the simulations
above is actually rather conservative, since typical values of $\sigma$
are smaller than this.  The term ``conservative" here stems from the
fact that the results are even stronger for smaller values of $\sigma$,
which produce smaller oscillations and thus a lower probability that
a processor in one group will lie idle while work is waiting in another
group.  For example, the analog of Table 1 for $\sigma$ = 0.05 is Table
9, which shows there is no penalty at all (to two significant figures) 
for using decentralized task allocation.


\section{Architectural Implications}

     The results of Section 3 indicate that applications of shared-memory
multiprocessors generally do not need systemwide FAMVs, because such 
variables can usually be eliminated through the use of decentralized task
allocation.  This section will discuss implications of this finding for 
CCS (an implication to a non-CCS issue arising in shared-memory
multiprocessor systems is outlined in Appendix 3). 
Some comments on previously-proposed systems will be made, and a new
type of architecture will be proposed as well.


\subsection{Implications for CCS Architectures}

     In this section, some implications of the results in Section 3
will be given for three specific CCS systems, those in DASH [Lenoski]
[Gupta], ALEWIFE [Chaiken91] and SCI [James].  The discussion here of
these systems is intended to only be illustrative, not comprehensive; 
it is believed that the results of Section 3 have relevance to them 
in a variety of other ways in addition to those given as examples here.  

     In general, the theme here is that due to the findings in Section 3 
concerning task-allocation variables---variables which, it should be 
emphasized again, account for the majority of invalidation messages---the 
need for attaining good CCS is much less than had been thought
previously.  This then translates into reduced hardware requirements, 
possible changes in hardware, and in different software approaches.

     First, the findings here, indicating that task allocation can be 
done efficiently on a \underline{group}
rather than a centralized basis, may be viewed as providing theoretical 
and empirical justification for the version of DASH described in 
[Lenoski], which has hardware support for processor clusters.  

     Similarly, these findings provide a new justification for various 
limited-directory schemes such as LimitLESS in ALEWIFE.  These schemes 
have up to now been justified by data which indicate that, during
the course of execution of a given application, the number of
processors contending for a given MV will usually be small.  The
point of view which has been taken is that although there will 
occasionally be exceptional cases, in which rather substantial
time penalties will be incurred, this will hopefully occur rarely
enough so that those penalties do not greatly affect overall
performance.  By contrast, the results of Section 3 show that instead
of just {\it hoping} that such occurrences will be rare, we can actually
{\it guarantee} it in the case of task-allocation MVs.

     For SCI, it was mentioned in Section 1 that the slow propagation
of an invalidation message along a linear linked list---a delay which
tends to become worse as the system size scales upward---has motivated
SCI developers to consider more complex, treelike linking structures, 
employing redundant pointers to send several copies of an invalidation 
message in parallel.  But again, using decentralized task allocation as 
shown here, we can limit the length of the SCI linked lists to a fixed 
constant, independent of the system size p.  For this limited length,
a linear linked list 
may be sufficiently efficient after all, and the addition of redundant 
pointers could be avoided, thus avoiding a substantial increase in 
hardware complexity.

     Similarly, in both DASH and ALEWIFE, one concern is that many cache 
invalidation messages may have to travel to remote portions of the 
network.  In DASH, for example, an invalidation transaction may involve 
three different portions of the network, the original {\it local cluster}, 
the {\it home cluster} and the {\it remote cluster}.  But the results of 
Section 3 show that this is much less of a concern than originally 
thought, since one can guarantee that invalidation messages involving 
task-allocation MVs can be confined to small 
groups of processors, with group size independent of overall system 
size p.  {\it In other words, as the system size grows, the network distance 
traveled by the majority of the invalidation messages can be kept bounded 
by a fixed constant, independent of p, so CCS is much less of an issue.}  
A related implication of this is that network topology in such systems 
might be designed to optimize intragroup communication (possibly as in
[Lenoski]), even at the expense of some deterioration of intergroup 
communication performance.

     Another possibility is the elimination of the lock queues in DASH.
Again, most lock variables can be handled on the group level, thus
confining traffic involving them to a small region of the network---not
only small, but again also of a fixed size, not growing with p.  Thus the
lock queues may be of limited usefulness for CCS, and thus might be
eliminated, with a substantial savings in hardware complexity.


\section{A New Class of Shared-Memory Multiprocessor Architecture}

     One can take this last idea, i.e. the possible elimination of
DASH's lock queues mentioned at the end of Section 4.1, one step further:
One might consider eliminating \underline{all}
hardware devoted to CCS.  In DASH, for example, there is a large
amount of such hardware; the directories, for example, have complexity
O($p ^ 2$).  In this regard, consider a concept of
{\it locally coherent shared-memory architectures} (LCSMA), which we 
define to mean shared-memory multiprocessor systems which provide 
hardware support for cache coherency within groups of processors, but 
provide no hardware support for systemwide cache coherency.\footnote{Note 
that ALEWIFE is not an LCSMA machine, since it too has hardware
which is indirectly used for systemwide cache coherency.  The cache 
directory must include logic to detect overflow and to relinquish 
cache control to the processor; the directory and the processor must 
be designed so that the processor can actively access the directory; 
the processor must be designed so as to have very fast trap-handling 
capability; etc.}  

     Under LCSMA the software must, where possible, avoid creation of 
systemwide MVs.  Section 2 indicated a general approach by which this 
can be done, creating group task-allocation variables.  Atomic access
to these variables, and the problem of contention for them, are handled
via hardware, i.e. cache coherency for the latter and mechanisms such
as test-and-set instructions for the former.  Any MVs which remain, e.g. 
those involved in barrier variables, are handled in software.  Some
details are given in Appendix 4.

     The motivation for LCSMA is that while on the one hand hardware is 
needed for efficient access of FAMVs, on the other hand FAMVs can be
limited to groups of processors, and therefore it makes sense to limit
hardware support for cache coherency to intragroup coherency.  The
question of CCS then becomes moot.  This would result in a machine
which is low-cost in terms of hardware, and yet efficient in terms
of task allocation.

     As shown in Section 2, the burden on the applications programmer 
to set up decentralized task allocation is quite minimal.  Only a 
few lines of code need to be changed, and the programmer would not need
to be aware of the underlying architecure; a single, shared address
space standard to shared-memory machines is all that is assumed, and
the caches are transparent to the programmer.

     Another point concerns non-MV shared variables (NMVSVs), e.g. 
the matrix itself in the LU application in Section 2.  A variety of 
approaches are possible.  The simplest solution is to just mark all 
NMVSVs as noncacheable (i.e., mark only group task-allocation structures
as cacheable).  This would not be the most aggressive approach in terms 
of performance, but it would be easiest, and again due to the fact that 
FAMVs form the primary bottleneck on current systems, good performance 
should be attainable.  Alternatively, caching of NMVSVs in an LCSMA 
system could be done automatically by the compiler [Cheong] [Adve].

     LCSMA denotes a broad class of architectures, with many possible 
implementations.  For example, an interconnect structure such as that 
of Figure 2 could be used.  (For the purposes of readability, this figure 
depicts a very small system, with an unrealistically small value of 4 
for ${\gamma}_{univ}$.)  Here processors are connected to memory modules 
via the familiar delta network [Almasi] [Hwang].  What is different is 
that processors are partitioned into groups, and processors within any 
given group are connected via snooping caches [Goodman] and a local bus.
These provide hardware support for local MVs, i.e. variables which are 
shared only by processors within a given group.  Let us introduce one final 
acronym, using LSSMA (``locally-snooping shared-memory architecture") 
to refer to the approach proposed in Figure 1.\footnote{In this sense 
it has some commonality with [Lenoski], but again it
should be noted that the latter is not LSSMA; it is not even an LCSMA 
machine.  A cache invalidation in the machine in [Lenoski] can, and 
often does, result in invalidation messages being sent to processors 
outside the cluster where the invalidation occurred.  In other words,
the hardware does enforce systemwide cache coherency, which is a
fundamental point of departure for LCSMA.}

     The results of Section 3 indicate that a group size of perhaps 
16 will be effective (see Section 5).  This would be nice, since bus 
snooping has been found to work well in this range.  On the other hand, 
even if a larger value of ${\gamma}_{univ}$ turns out to be necessary, 
many other implementations of LCSMA are possible, including adapting
some recently proposed architectures to the LCSMA philosophy.  For
example, a very good implementation may be an LCSMA version of the 
DASH project, with the interconnection network topology again placing
extra interconnection bandwidth within groups of ${\gamma}_{univ}$ 
nodes, but all hardware support for systemwide coherency eliminated.
Another possible implementation of LCSMA would be to adapt SCI, 
making the size of its linked lists limited to ${\gamma}_{univ}$, 
and either retaining the linear structure (see Section 4.1), or
optimizing a tree-based mechanism for the fixed-length lists.
In any case, the main point is that once ${\gamma}_{univ}$ is 
determined, we avoid having to implement cache coherency on an 
ever-widening scale as p increases.


\section{Discussion}

     Although the work here has presented evidence that a reasonable 
value of ${\gamma}_{univ}$ can be chosen, it has not specifically 
identified such a value.  We could use the results of Section 3.1 
to help choose the value, but since Equation (5) in Appendix 1
was derived from 
inequalities, the value obtained would be too conservative.  The 
results in Section 3.2 are more valuable in this regard, and as 
mentioned earlier, indicate that a group size of perhaps 16 will 
be effective.  This should be verified by investigating other 
applications, but it is anticipated that this value will continue 
to hold up well in them, given the ``universality" nature of the 
findings of Section 3.1, the close similarity of the three sets of 
results in Section 3.2, and the remark at the end of Section 2
concerning the similarity of most applications to one of the ones
considered here.

     There are of course {\it some} applications in which systemwide,
i.e. nongrouped, FAMVs arise, such as synchronous algorithms with 
very short times between successive synchronizations.  For instance, 
considering a root-finding problem.  Suppose it is known that the 
function f has a root in the interval (a,b), where f(a) < 0, f(b) > 0 
and f is monotonic increasing.  A parallel-processing attack on this 
problem might be as follows:  At the i-th iteration, we consider the
interval $( a_i , b_i )$, where $a_0 = a$ and  $b_0 = b$.
Within the interval, p evenly spaced points are defined, and the p processors
evaluate f at these points.  Then $a_{i+1}$ and $b_{i+1}$ are defined
to be the largest of these points at which f is negative and the smallest
at which f is positive, respectively.  

     In this setting the processors must synchronize at the end of each
iteration i, at which time they wait for the announcement of $a_{i+1}$
and $b_{i+1}$.  The latter two values (and the barrier variable used
to set them) are then associated with systemwide MVs.  Furthermore, if 
the evaluation of function f is not time-consuming,\footnote{It 
\underline{would} be time-consuming if, say, f were the value
of an integral which itself must be evaluated numerically.}
these MVs are then FAMVs, and it would be nice to have hardware support 
to aid in their access.  

     However, it is clear that cached-based approaches are not good solutions
to this problem either.  The overhead due to relaying of cache update 
messages throughout an entire large system would be too heavy.  Thus other 
mechanisms would be needed if this type of application were to be targeted, 
say adding a separate broadcast channel to Figure 2 for announcing 
$a_{i+1}$ and $b_{i+1}$.  The point here is that this and similar examples do 
not counter this paper's theme concerning a reduced role for CCS as an 
architectural design criterion shared-memory multiprocessors. 

     In any case, the goal here has not been to propose machines which
are optimal for all applications, which is impossible.  Instead, the 
point has been that {\it most} applications of shared-memory multiprocessors 
can be done efficiently without systemwide FAMVs, and that this has
strong implications, positive and negative, for the various CCS schemes
which have been proposed.  

\bigskip

{\bf Bibliography}

S. Adve, V. Adve, M. Hill, M. Vernon.  ``Comparison of Hardware and
Software Cache Coherency Schemes,"
{\it Proceedings of the 18th Annual International Symposium on Computer
Architecture}, 1991, 298-308.

A. Agarwal and M. Cherian.  ``Adaptive Backoff Synchronization
Techniques,"
{\it Proceedings of the 16th Annual International Symposium on Computer
Architecture}, 1989, 396-406.

G. Almasi and A. Gottlieb.  {\it Highly Parallel Computing}, Benjamin
Cummings, 1989.

J. Boyle, R. Butler, T. Disz, B. Glickfield, E. Lusk, R. Overbeek,
J. Patterson and R. Stevens.  {\it Portable Programs for Parallel
Processors}, 1987, Holt, Rinehart \& Winston.

D. Chaiken, J. Kubiatowicz and A. Agarwal.  ``LimitLESS Directories:
A Scalable Cache Coherence Scheme," {\it Proceedings of the Fourth
International ASPLOS Conference}, 1991, 224-234.

H. Cheong and A. Veidenbaum.  ``Compiler-Directed Cache Management
in Multiprocessors," 
{\it Computer}, June 1990, 39-48.

G. Gonnet.  {\it Handbook of Algorithms and Data Structures}, Addison-Wesley,
1984.

J. Goodman.  ``Using Cache Memory to Reduce Processor-Memory Traffic,"
{\it Proceedings of the 10th Annual Symposium on Computer Architecture},
1983, 124-131.

A. Gupta, J. Hennessy, K. Gharachorloo and T. Mowry.  ``Comparative
Evaluation of Latency Reducing and Tolerating Techniques,"
{\it Proceedings of the 18th Annual International Symposium on Computer
Architecture}, 1991, 254-263.

K. Hwang and F. Briggs.  {\it Computer Architecture and Parallel
Processing}, 1984.

Intel.  {\it Microprocessor and Peripheral Handbook}, 1988.

D. James, A. Laundrie, S. Gjessing and G. Sohi.  ``Scalable
Coherent Interface," {\it Computer}, June 1990, 74-77.

C. Kruskal and A. Weiss.  ``Allocating Independent Subtasks
on Parallel Processors," {\it IEEE Transactions on Software
Engineering}, SE-11 (10), 1001-1016, 1985.

D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta and J. Hennessy.
``The Directory-Based Cache Coherence Protocol for the DASH
Multiprocessor," 
{\it Proceedings of the 17th Annual International Symposium on Computer
Architecture}, 1990, 148-158.

L. Ni and C.-F. Wu.  ``Design Tradeoffs for Process Scheduling
in Shared Memory Multiprocessor Systems," {\it IEEE Transactions
on Software Engineering}, 15, 1989, 327-334.

G. Pfister and V.A. Norton.  ``Hot Spot Contention and Combining
in Multistage Interconnection Networks," {\it IEEE Transactions on
Computers}, C-34, 1985, 943-948.

H. Schwetman.  {\it CSIM Reference Manual}, Microelectronics and 
Computer Technology, Technical Report ACA-ST-257-87 Rev. 14, 1990.

P. Stenstrom.  ``A Survey of Cache Coherency Schemes for Multiprocessors,"
{\it Computer}, June 1990, 12-24.

K. Trivedi.  {\it Probability \& Statistics, With Reliability,
Queuing and Computer Science Applications}, Prentice-Hall, 1982.

P.-C. Yew, N.-F. Tzeng and D. Lawrie.  ``Distributing Hot-Spot
Addressing in Large-Scale Multiprocessors," {\it IEEE Transactions
on Computers}, C-36, 1987, 388-395.


\newpage
\appendix

\section{Derivation of the Theorem}

     To derive the Theorem in Section 3.1, consider multiprocessor 
systems 1 and 2, which use centralized and decentralized task allocation, 
respectively.  Each has Poisson task arrivals with rate $\lambda$, 
whose service times are exponentially distributed with mean 1/$\mu$.  
Each system consists of p = $q \gamma$ processors.  However, the two 
systems differ in the manner in which processors acquire tasks:  System 
1 has one global queue; when a processor finishes a task, it gets its 
new task from the head of this queue.  In system 2, an arriving task 
is randomly assigned to one of q queues, each of which serves a group 
of $\gamma$ processors; processors in a group take tasks only from the 
queue for that group.  

     Consider the classical M/M/k queue [Trivedi], in which there are:  
Poisson arrivals with rate $alpha$; exponential service times with mean 
1/$\beta$; and k servers.  System 1 above is such a queue, with $alpha$ = 
$\lambda$, $\beta$ = $\mu$ and k = $q \gamma$, while each individual 
{\it group}
in system 2 is such a queue, with $alpha$ = $\lambda$/$q$, $\beta$ = $\mu$
and k = $\gamma$.  The mean time spent by tasks waiting in this M/M/k 
queue, not including the service time, is 
\begin{equation}
E(Q) = p_0 \frac{{(uk)}^{k-1} u} {k! \beta {(1-u)} ^ 2} ,
\end{equation}
where the per-server utilization is u = $\alpha$/(k $\beta$) and the 
probability that the system is empty is
\begin{equation}
p_0 = \frac{1}
{\sum_{i=0}^{k-1} \frac{{(uk)}^i}{i!} +
\frac{{(ku)}^k}{k! (1-u)} }
\end{equation}

     Let us consider the scalability question in this context, by 
comparing systems 1 and 2.  First note that the value of u is the 
same for the two systems.  The effect on E(Q) as k increases will
be investigated, while holding u fixed.  In other words, as k increases, 
the overall workload is also increased proportionally, with the result 
being that the demand on each processor remains constant.

     Note that for large k, the value of (2) is approximately 
$e^{-ku}$.  An approximate value is obtained for (1) using
Stirling's approximation for factorials,
\[ n! \approx \sqrt{2 \pi} e^{-n} n ^ {n + \frac{1}{2}} . \]
From all this, (1) has the approximate value
\begin{equation}
e^{-ku}
\frac
{k^{k-1}u ^ k} 
{ \beta \sqrt{2 \pi} e^{-k}k ^ {k + \frac{1}{2}}  {(1-u)} ^ 2} 
=
\frac{1}{\beta }
\frac
{ {( u e^{1-u})} ^ k } 
{ \sqrt {2 \pi} k ^ {\frac{3}{2}} {(1-u)} ^ 2 }
\end{equation}

     Since $0 \leq u \leq 1$, the maximum value of 
${( u e^{1-u})} ^ k$ is 1.  Thus an approximate upper bound 
for E(Q) is
\begin{equation}
\frac{1}{\beta}
\frac{1}{\sqrt{2\pi} k^{\frac{3}{2}}  {(1-u)} ^ 2 } .
\end{equation}
Moreover, since the mean service time is $1 / \beta$, we see that an 
approximate upper bound for E(Q) as a proportion of mean service time is
\begin{equation}
\frac{1}{\sqrt{2\pi} k^{\frac{3}{2}} {(1-u)} ^ 2 } .
\end{equation}
For fixed u, the expression (5) converges to 0 as $k \rightarrow \infty$.

     Thus, in both systems 1 and 2, as the group size---$q \gamma$ for 
system 1, $\gamma$ for system 2---goes to infinity, the mean queue wait 
goes to zero.  In other words, for large group size, both systems 1 
and 2 approach full efficiency, and thus the efficiency lost by using 
system 2 instead of system 1 is small.  As long as the group size $\gamma$
in system 2 is sufficiently large, there is essentially no penalty for 
using system 2 instead of system 1.

     Even more importantly, note that in (5), if u bounded away from 1, 
i.e. for all u in the range $0 \leq u \leq U$ for any fixed $U < 1$, the
convergence to 0 is uniform in u.  In other words, in order to have (5) 
get below a certain desired value, say 0.05, the same value of $\gamma$, 
say $\gamma$, will work for all values of u in the range $0 \leq u \leq U$.  
Since it is rare for an application to achieve a value of u above, 
say 0.9, we can choose the group size $\gamma$ accordingly.  Then system 
2---i.e. decentralized task allocation with fixed group size $\gamma$---will 
work well for the vast majority of applications, even on systems of
arbitrarily large size.


\section{Effect of $\sigma$}

     In Section 3.2 it was mentioned that the value $\sigma$ = 0.3 used
in most of the simulations there was actually rather large, for typical
applications.  This point is discussed in this Appendix.

     For example, consider MATMULT.
The multiplication of one r x 1 row involves r multiply/add operations.
The time taken by each such operation is a random variable, with mean $Nu$ 
and standard deviation $rho$.  Their sum has mean $r Nu$, and, making the 
reasonable assumption that the r variables are statistically independent, 
the standard deviation would be $sqrt r rho$.  The coefficient of variation, 
i.e. the ratio of the standard deviation to the mean, for one row 
multiplication would then be O($\frac{1}{\sqrt{r}}$), and thus would
be very small for large r.

     Taking for instance the 80387 mathematics coprocessor, the number 
of clock cycles for a floating-point multiply is in the range 27-35, 
while for a floating-point add the figure is 24-32 [Intel].  Thus a 
multiply/add will take between 51 and 67 cycles, which would imply
a value for $rho$ of at most 8.  Suppose the mean $Nu$ is the midpoint
of this range, 59, and r is, say, 500.  Then the coefficient of variation
for a row multiplication would be at most 0.006.  Recalling that in the 
simulations the mean was scaled to 1.0, that would correspond to a value 
for $\sigma$ of 0.006, orders of magnitude smaller than the value 
$\sigma$ = 0.3 used earlier.

     A similar analysis can be made for many other applications.  For
example, consider Heapsort, for sorting r items.  Many parallel sort
algorithms are based on parceling out subarrays to the individual
processors, with each processor then performing a sequential sort,
e.g. a Heapsort, on its subarray.  The mean running time of Heapsort
is O(r log r), while the standard deviation is O($\sqrt r$) [Gonnet].
Again, the larger the problem, the smaller the coefficient of variation,
in this case O($\frac{1}{\sqrt{r}log r}$).


\section{Application to the Hot Spot Problem}

     Though not the subject of this paper, it should be mentioned that 
the findings of Section 3 also have implications to other issues in 
shared-memory multiprocessor design, such as the problem of {\it tree 
saturation} [Pfister] (though this is one of the motivations for
the CCS question, it is of broader importance than this).  Here, a 
number of processors might be simultaneously trying to enter a critical 
section guarded by an MV, in which case the MV, and the memory module 
in which it resides, become a {\it hot spot}.  The resulting backup 
can affect the entire network, including paths leading to other modules.  
A hardware-intensive method to handle tree saturation has been proposed,
in the form of combining networks [Almasi], but decentralized task 
allocation can reduce this bottleneck without requiring special hardware.

     Just as LCSMA obviates the need for CCS by placing a system
size-independent bound on the number of messages stemming from a given
block invalidation, decentralized task allocation can be used to solve 
the hot-spot problem in a manner independent of system size.  In the
analysis in [Pfister], the maximum memory bandwidth per processor is
given by
\begin{equation}
\frac{1}{1 + h(p-1)} ,
\end{equation}
for a system with p processors, p memory modules, assuming traffic
which is uniform across modules, except that the module containing
the MV has a proportion h of extra traffic.  This expression goes to
zero as p goes to $inf$, a real scalability problem.

     By contrast, with decentralized task allocation, the MV would be 
replaced by a separate MV for each group, giving a maximum bandwidth of
\begin{equation}
\frac{1}{1 + h( {\gamma}_{univ} - 1 )} ,
\end{equation}
which is independent of p.


\section{Software Alternatives}

     As stated in Section 4, LCSMA defines a class of architectures,
rather than a specific architecture, and thus can potentially be
realized in many different implementations.  This Appendix presents 
a brief outline of two implementations of the software-based atomic 
access to systemwide MVs referred to in Section 4. 

     One possibility is to employ the methods in [Yew], which use
software to combine the atomic-access requests of several processors 
to a given MV, in a tree-like manner.  The root nodes of such trees
consist of the memory modules.  There is atomic-access hardware in
the latter, but in a completely scalable manner, since the amount
of such hardware in any particular module remains constant as the
number of processors (and/or memory modules) grows.

     Another possibility, involving no special hardware at all, 
is the following.  Say we have a critical section C, which would in a 
conventional system be protected via hardware by some lock variable L.  
We could accomplish this in software in LCSMA by replacing this scalar 
L by an array, with L[i] being associated with processor Pi.  One of 
the processors, say P0, would be dedicated to periodically scanning 
the L array.  When Pi reaches C and wishes to enter C, Pi sets L[i] 
to 1.  Suppose at that time processor Pr is occupying C.  When Pr 
exits C, it will set L[r] to 0.  P0, upon seeing that action, will 
scan the L array, looking for a processor whose component of L is 
equal to 1.  P0 will handle all such processors in turn; when Pi's 
turn comes, P0 will then set L[i] to 2, indicating that Pi can now 
enter C.  Meanwhile, Pi has been watching for this, and will then go 
ahead and enter C.

     Both of these implementations are simple ones.  There are a number
of refinements that could be made to improve their efficiency, but since
they would primarily be used on RAMVs, refinements would be of only
secondary importance.

     A number of system calls should be developed to make these
operations easy for the programmer.  The calls GetGroupNumber() and
GetPaddingSize() have already been mentioned.  There should also be
calls like AcquireLock(), in two forms:  The first controls hardware,
e.g. includes something like a test-and-set instruction, while the
second implements a software lock-acquire operation, as outlined
above.

\newpage
\twocolumn

\begin{tabular}{| l | l | l | l |} \hline
r & p & $\gamma$ & mean time \\ \hline
128 & 32 & 32 & 4.81 \\ \hline
128 & 32 & 16 & 4.85 \\ \hline
128 & 32 & 8 & 4.91 \\ \hline
\end{tabular}

\bigskip

\hspace{0.5in} Table 1

\vspace{0.5in}
 

\begin{tabular}{| l | l | l | l |} \hline
r & p & $\gamma$ & mean time \\ \hline
64 & 64 & 64 & 5.09 \\ \hline
64 & 64 & 32 & 5.15 \\ \hline
64 & 64 & 16 & 5.11 \\ \hline
64 & 64 & 8 & 5.13 \\ \hline
\end{tabular}

\bigskip

\hspace{0.5in} Table 2

\vspace{0.5in}

\begin{tabular}{| l | l | l | l |} \hline
r & p & $\gamma$ & mean time \\ \hline
128 & 64 & 64 & 11.19 \\ \hline
128 & 64 & 32 & 11.12 \\ \hline
128 & 64 & 16 & 11.18 \\ \hline
128 & 64 & 8 & 11.26 \\ \hline
\end{tabular}

\bigskip

\hspace{0.5in} Table 3

\vspace{0.5in}

\begin{tabular}{| l | l | l | l |} \hline
r & p & $\gamma$ & mean time \\ \hline
1024 & 512 & 512 & 3.11 \\ \hline
1024 & 512 & 256 & 3.10 \\ \hline
1024 & 512 & 128 & 3.09 \\ \hline
1024 & 512 & 64 & 3.10 \\ \hline
1024 & 512 & 32 & 3.11 \\ \hline
1024 & 512 & 16 & 3.13 \\ \hline
1024 & 512 & 8 & 3.18 \\ \hline
\end{tabular}

\bigskip

\hspace{0.5in} Table 4

\vspace{0.5in}

\begin{tabular}{| l | l | l | l |} \hline
r & p & $\gamma$ & mean time \\ \hline
128 & 32 & 32 & 7.44 \\ \hline
128 & 32 & 16 & 5.48 \\ \hline
128 & 32 & 8 & 5.34 \\ \hline
\end{tabular}

\bigskip

\hspace{0.5in} Table 5

\vspace{0.5in}

\begin{tabular}{| l | l | l | l |} \hline
r & p & $\gamma$ & mean time \\ \hline
64 & 64 & 64 & 5.87 \\ \hline
64 & 64 & 32 & 5.48 \\ \hline
64 & 64 & 16 & 5.32 \\ \hline
64 & 64 & 8 & 5.22 \\ \hline
\end{tabular}

\bigskip

\hspace{0.5in} Table 6

\vspace{0.5in}

\begin{tabular}{| l | l | l | l |} \hline
r & p & $\gamma$ & mean time \\ \hline
128 & 64 & 64 & 14.70 \\ \hline
128 & 64 & 32 & 11.84 \\ \hline
128 & 64 & 16 & 11.61 \\ \hline
128 & 64 & 8 & 11.61 \\ \hline
\end{tabular}

\bigskip

\hspace{0.5in} Table 7

\vspace{0.5in}

\begin{tabular}{| l | l | l | l |} \hline
r & p & $\gamma$ & mean time \\ \hline
1024 & 512 & 512 & 21.91 \\ \hline
1024 & 512 & 256 & 11.33 \\ \hline
1024 & 512 & 128 & 6.19 \\ \hline
1024 & 512 & 64 & 3.78 \\ \hline
1024 & 512 & 32 & 3.48 \\ \hline
1024 & 512 & 16 & 3.70 \\ \hline
1024 & 512 & 8 & 4.43 \\ \hline
\end{tabular}

\bigskip

\hspace{0.5in} Table 8

\vspace{0.5in}

\begin{tabular}{| l | l | l | l |} \hline
r & p & $\gamma$ & mean time \\ \hline
128 & 32 & 32 & 4.22 \\ \hline
128 & 32 & 16 & 4.22 \\ \hline
128 & 32 & 8 & 4.22 \\ \hline
\end{tabular}


\hspace{0.5in} Table 9

\bigskip

\onecolumn


% Figure 1 starts here:
\expandafter\ifx\csname graph\endcsname\relax \csname newbox\endcsname\graph\fi
\expandafter\ifx\csname graphtemp\endcsname\relax \csname newdimen\endcsname\graphtemp\fi
\setbox\graph=\vtop{\vskip 0pt\hbox{%
    \special{pn 8}%
    \special{pa 2550 350}%
    \special{pa 3000 350}%
    \special{pa 3000 0}%
    \special{pa 2550 0}%
    \special{pa 2550 350}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 0.175in
    \rlap{\kern 2.775in\lower\graphtemp\hbox to 0pt{\hss root\hss}}%
    \special{pa 400 1250}%
    \special{pa 850 1250}%
    \special{pa 850 900}%
    \special{pa 400 900}%
    \special{pa 400 1250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 1.075in
    \rlap{\kern 0.625in\lower\graphtemp\hbox to 0pt{\hss o\hss}}%
    \special{pa 2550 1250}%
    \special{pa 3000 1250}%
    \special{pa 3000 900}%
    \special{pa 2550 900}%
    \special{pa 2550 1250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 1.075in
    \rlap{\kern 2.775in\lower\graphtemp\hbox to 0pt{\hss n\hss}}%
    \special{pa 4700 1250}%
    \special{pa 5150 1250}%
    \special{pa 5150 900}%
    \special{pa 4700 900}%
    \special{pa 4700 1250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 1.075in
    \rlap{\kern 4.925in\lower\graphtemp\hbox to 0pt{\hss w\hss}}%
    \special{pa 0 2150}%
    \special{pa 450 2150}%
    \special{pa 450 1800}%
    \special{pa 0 1800}%
    \special{pa 0 2150}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 1.975in
    \rlap{\kern 0.225in\lower\graphtemp\hbox to 0pt{\hss on*\hss}}%
    \special{pa 800 2150}%
    \special{pa 1250 2150}%
    \special{pa 1250 1800}%
    \special{pa 800 1800}%
    \special{pa 800 2150}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 1.975in
    \rlap{\kern 1.025in\lower\graphtemp\hbox to 0pt{\hss ow\hss}}%
    \special{pa 2150 2150}%
    \special{pa 2600 2150}%
    \special{pa 2600 1800}%
    \special{pa 2150 1800}%
    \special{pa 2150 2150}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 1.975in
    \rlap{\kern 2.375in\lower\graphtemp\hbox to 0pt{\hss no*\hss}}%
    \special{pa 2950 2150}%
    \special{pa 3400 2150}%
    \special{pa 3400 1800}%
    \special{pa 2950 1800}%
    \special{pa 2950 2150}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 1.975in
    \rlap{\kern 3.175in\lower\graphtemp\hbox to 0pt{\hss nw\&\hss}}%
    \special{pa 4300 2150}%
    \special{pa 4750 2150}%
    \special{pa 4750 1800}%
    \special{pa 4300 1800}%
    \special{pa 4300 2150}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 1.975in
    \rlap{\kern 4.525in\lower\graphtemp\hbox to 0pt{\hss wo\hss}}%
    \special{pa 5100 2150}%
    \special{pa 5550 2150}%
    \special{pa 5550 1800}%
    \special{pa 5100 1800}%
    \special{pa 5100 2150}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 1.975in
    \rlap{\kern 5.325in\lower\graphtemp\hbox to 0pt{\hss wn\&\hss}}%
    \special{pa 0 3050}%
    \special{pa 450 3050}%
    \special{pa 450 2700}%
    \special{pa 0 2700}%
    \special{pa 0 3050}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.875in
    \rlap{\kern 0.225in\lower\graphtemp\hbox to 0pt{\hss onw\hss}}%
    \special{pa 800 3050}%
    \special{pa 1250 3050}%
    \special{pa 1250 2700}%
    \special{pa 800 2700}%
    \special{pa 800 3050}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.875in
    \rlap{\kern 1.025in\lower\graphtemp\hbox to 0pt{\hss own*\hss}}%
    \special{pa 2150 3050}%
    \special{pa 2600 3050}%
    \special{pa 2600 2700}%
    \special{pa 2150 2700}%
    \special{pa 2150 3050}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.875in
    \rlap{\kern 2.375in\lower\graphtemp\hbox to 0pt{\hss now\hss}}%
    \special{pa 4300 3050}%
    \special{pa 4750 3050}%
    \special{pa 4750 2700}%
    \special{pa 4300 2700}%
    \special{pa 4300 3050}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.875in
    \rlap{\kern 4.525in\lower\graphtemp\hbox to 0pt{\hss won*\hss}}%
    \special{pa 2775 350}%
    \special{pa 625 900}%
    \special{fp}%
    \special{pa 2775 350}%
    \special{pa 2775 900}%
    \special{fp}%
    \special{pa 2775 350}%
    \special{pa 4925 900}%
    \special{fp}%
    \special{pa 625 1250}%
    \special{pa 225 1800}%
    \special{fp}%
    \special{pa 625 1250}%
    \special{pa 1025 1800}%
    \special{fp}%
    \special{pa 2775 1250}%
    \special{pa 2375 1800}%
    \special{fp}%
    \special{pa 2775 1250}%
    \special{pa 3175 1800}%
    \special{fp}%
    \special{pa 4925 1250}%
    \special{pa 4525 1800}%
    \special{fp}%
    \special{pa 4925 1250}%
    \special{pa 5325 1800}%
    \special{fp}%
    \special{pa 225 2150}%
    \special{pa 225 2700}%
    \special{fp}%
    \special{pa 1025 2150}%
    \special{pa 1025 2700}%
    \special{fp}%
    \special{pa 2375 2150}%
    \special{pa 2375 2700}%
    \special{fp}%
    \special{pa 4525 2150}%
    \special{pa 4525 2700}%
    \special{fp}%
    \hbox{\vrule depth3.050in width0pt height 0pt}%
    \kern 5.550in
  }%
}%
\centerline{\box\graph}

\hspace{2.5in} Figure 1


% Figure 2 starts here:
\expandafter\ifx\csname graph\endcsname\relax \csname newbox\endcsname\graph\fi
\expandafter\ifx\csname graphtemp\endcsname\relax \csname newdimen\endcsname\graphtemp\fi
\setbox\graph=\vtop{\vskip 0pt\hbox{%
    \special{pn 8}%
    \special{pa 0 2250}%
    \special{pa 250 2250}%
    \special{pa 250 2000}%
    \special{pa 0 2000}%
    \special{pa 0 2250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.125in
    \rlap{\kern 0.125in\lower\graphtemp\hbox to 0pt{\hss C0\hss}}%
    \special{pa 700 2250}%
    \special{pa 950 2250}%
    \special{pa 950 2000}%
    \special{pa 700 2000}%
    \special{pa 700 2250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.125in
    \rlap{\kern 0.825in\lower\graphtemp\hbox to 0pt{\hss C1\hss}}%
    \special{pa 1400 2250}%
    \special{pa 1650 2250}%
    \special{pa 1650 2000}%
    \special{pa 1400 2000}%
    \special{pa 1400 2250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.125in
    \rlap{\kern 1.525in\lower\graphtemp\hbox to 0pt{\hss C2\hss}}%
    \special{pa 2100 2250}%
    \special{pa 2350 2250}%
    \special{pa 2350 2000}%
    \special{pa 2100 2000}%
    \special{pa 2100 2250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.125in
    \rlap{\kern 2.225in\lower\graphtemp\hbox to 0pt{\hss C3\hss}}%
    \special{pa 2800 2250}%
    \special{pa 3050 2250}%
    \special{pa 3050 2000}%
    \special{pa 2800 2000}%
    \special{pa 2800 2250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.125in
    \rlap{\kern 2.925in\lower\graphtemp\hbox to 0pt{\hss C4\hss}}%
    \special{pa 3500 2250}%
    \special{pa 3750 2250}%
    \special{pa 3750 2000}%
    \special{pa 3500 2000}%
    \special{pa 3500 2250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.125in
    \rlap{\kern 3.625in\lower\graphtemp\hbox to 0pt{\hss C5\hss}}%
    \special{pa 4200 2250}%
    \special{pa 4450 2250}%
    \special{pa 4450 2000}%
    \special{pa 4200 2000}%
    \special{pa 4200 2250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.125in
    \rlap{\kern 4.325in\lower\graphtemp\hbox to 0pt{\hss C6\hss}}%
    \special{pa 4900 2250}%
    \special{pa 5150 2250}%
    \special{pa 5150 2000}%
    \special{pa 4900 2000}%
    \special{pa 4900 2250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.125in
    \rlap{\kern 5.025in\lower\graphtemp\hbox to 0pt{\hss C7\hss}}%
    \special{pa 350 1750}%
    \special{pa 600 1750}%
    \special{pa 600 1500}%
    \special{pa 350 1500}%
    \special{pa 350 1750}%
    \special{fp}%
    \special{pa 1750 1750}%
    \special{pa 2000 1750}%
    \special{pa 2000 1500}%
    \special{pa 1750 1500}%
    \special{pa 1750 1750}%
    \special{fp}%
    \special{pa 3150 1750}%
    \special{pa 3400 1750}%
    \special{pa 3400 1500}%
    \special{pa 3150 1500}%
    \special{pa 3150 1750}%
    \special{fp}%
    \special{pa 4550 1750}%
    \special{pa 4800 1750}%
    \special{pa 4800 1500}%
    \special{pa 4550 1500}%
    \special{pa 4550 1750}%
    \special{fp}%
    \special{pa 350 1250}%
    \special{pa 600 1250}%
    \special{pa 600 1000}%
    \special{pa 350 1000}%
    \special{pa 350 1250}%
    \special{fp}%
    \special{pa 1750 1250}%
    \special{pa 2000 1250}%
    \special{pa 2000 1000}%
    \special{pa 1750 1000}%
    \special{pa 1750 1250}%
    \special{fp}%
    \special{pa 3150 1250}%
    \special{pa 3400 1250}%
    \special{pa 3400 1000}%
    \special{pa 3150 1000}%
    \special{pa 3150 1250}%
    \special{fp}%
    \special{pa 4550 1250}%
    \special{pa 4800 1250}%
    \special{pa 4800 1000}%
    \special{pa 4550 1000}%
    \special{pa 4550 1250}%
    \special{fp}%
    \special{pa 350 750}%
    \special{pa 600 750}%
    \special{pa 600 500}%
    \special{pa 350 500}%
    \special{pa 350 750}%
    \special{fp}%
    \special{pa 1750 750}%
    \special{pa 2000 750}%
    \special{pa 2000 500}%
    \special{pa 1750 500}%
    \special{pa 1750 750}%
    \special{fp}%
    \special{pa 3150 750}%
    \special{pa 3400 750}%
    \special{pa 3400 500}%
    \special{pa 3150 500}%
    \special{pa 3150 750}%
    \special{fp}%
    \special{pa 4550 750}%
    \special{pa 4800 750}%
    \special{pa 4800 500}%
    \special{pa 4550 500}%
    \special{pa 4550 750}%
    \special{fp}%
    \special{pa 0 250}%
    \special{pa 250 250}%
    \special{pa 250 0}%
    \special{pa 0 0}%
    \special{pa 0 250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 0.125in
    \rlap{\kern 0.125in\lower\graphtemp\hbox to 0pt{\hss M0\hss}}%
    \special{pa 700 250}%
    \special{pa 950 250}%
    \special{pa 950 0}%
    \special{pa 700 0}%
    \special{pa 700 250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 0.125in
    \rlap{\kern 0.825in\lower\graphtemp\hbox to 0pt{\hss M1\hss}}%
    \special{pa 1400 250}%
    \special{pa 1650 250}%
    \special{pa 1650 0}%
    \special{pa 1400 0}%
    \special{pa 1400 250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 0.125in
    \rlap{\kern 1.525in\lower\graphtemp\hbox to 0pt{\hss M2\hss}}%
    \special{pa 2100 250}%
    \special{pa 2350 250}%
    \special{pa 2350 0}%
    \special{pa 2100 0}%
    \special{pa 2100 250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 0.125in
    \rlap{\kern 2.225in\lower\graphtemp\hbox to 0pt{\hss M3\hss}}%
    \special{pa 2800 250}%
    \special{pa 3050 250}%
    \special{pa 3050 0}%
    \special{pa 2800 0}%
    \special{pa 2800 250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 0.125in
    \rlap{\kern 2.925in\lower\graphtemp\hbox to 0pt{\hss M4\hss}}%
    \special{pa 3500 250}%
    \special{pa 3750 250}%
    \special{pa 3750 0}%
    \special{pa 3500 0}%
    \special{pa 3500 250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 0.125in
    \rlap{\kern 3.625in\lower\graphtemp\hbox to 0pt{\hss M5\hss}}%
    \special{pa 4200 250}%
    \special{pa 4450 250}%
    \special{pa 4450 0}%
    \special{pa 4200 0}%
    \special{pa 4200 250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 0.125in
    \rlap{\kern 4.325in\lower\graphtemp\hbox to 0pt{\hss M6\hss}}%
    \special{pa 4900 250}%
    \special{pa 5150 250}%
    \special{pa 5150 0}%
    \special{pa 4900 0}%
    \special{pa 4900 250}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 0.125in
    \rlap{\kern 5.025in\lower\graphtemp\hbox to 0pt{\hss M7\hss}}%
    \special{pa 0 2850}%
    \special{pa 250 2850}%
    \special{pa 250 2600}%
    \special{pa 0 2600}%
    \special{pa 0 2850}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.725in
    \rlap{\kern 0.125in\lower\graphtemp\hbox to 0pt{\hss P0\hss}}%
    \special{pa 700 2850}%
    \special{pa 950 2850}%
    \special{pa 950 2600}%
    \special{pa 700 2600}%
    \special{pa 700 2850}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.725in
    \rlap{\kern 0.825in\lower\graphtemp\hbox to 0pt{\hss P1\hss}}%
    \special{pa 1400 2850}%
    \special{pa 1650 2850}%
    \special{pa 1650 2600}%
    \special{pa 1400 2600}%
    \special{pa 1400 2850}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.725in
    \rlap{\kern 1.525in\lower\graphtemp\hbox to 0pt{\hss P2\hss}}%
    \special{pa 2100 2850}%
    \special{pa 2350 2850}%
    \special{pa 2350 2600}%
    \special{pa 2100 2600}%
    \special{pa 2100 2850}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.725in
    \rlap{\kern 2.225in\lower\graphtemp\hbox to 0pt{\hss P3\hss}}%
    \special{pa 2800 2850}%
    \special{pa 3050 2850}%
    \special{pa 3050 2600}%
    \special{pa 2800 2600}%
    \special{pa 2800 2850}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.725in
    \rlap{\kern 2.925in\lower\graphtemp\hbox to 0pt{\hss P4\hss}}%
    \special{pa 3500 2850}%
    \special{pa 3750 2850}%
    \special{pa 3750 2600}%
    \special{pa 3500 2600}%
    \special{pa 3500 2850}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.725in
    \rlap{\kern 3.625in\lower\graphtemp\hbox to 0pt{\hss P5\hss}}%
    \special{pa 4200 2850}%
    \special{pa 4450 2850}%
    \special{pa 4450 2600}%
    \special{pa 4200 2600}%
    \special{pa 4200 2850}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.725in
    \rlap{\kern 4.325in\lower\graphtemp\hbox to 0pt{\hss P6\hss}}%
    \special{pa 4900 2850}%
    \special{pa 5150 2850}%
    \special{pa 5150 2600}%
    \special{pa 4900 2600}%
    \special{pa 4900 2850}%
    \special{fp}%
    \graphtemp=.5ex\advance\graphtemp by 2.725in
    \rlap{\kern 5.025in\lower\graphtemp\hbox to 0pt{\hss P7\hss}}%
    \special{pa 125 2600}%
    \special{pa 125 2250}%
    \special{fp}%
    \special{pa 825 2600}%
    \special{pa 825 2250}%
    \special{fp}%
    \special{pa 1525 2600}%
    \special{pa 1525 2250}%
    \special{fp}%
    \special{pa 2225 2600}%
    \special{pa 2225 2250}%
    \special{fp}%
    \special{pa 2925 2600}%
    \special{pa 2925 2250}%
    \special{fp}%
    \special{pa 3625 2600}%
    \special{pa 3625 2250}%
    \special{fp}%
    \special{pa 4325 2600}%
    \special{pa 4325 2250}%
    \special{fp}%
    \special{pa 5025 2600}%
    \special{pa 5025 2250}%
    \special{fp}%
    \special{pa 0 3250}%
    \special{pa 2350 3250}%
    \special{fp}%
    \special{pa 2800 3250}%
    \special{pa 5150 3250}%
    \special{fp}%
    \special{pa 250 2125}%
    \special{pa 350 2125}%
    \special{fp}%
    \special{pa 350 2125}%
    \special{pa 350 3250}%
    \special{fp}%
    \special{pa 700 2125}%
    \special{pa 600 2125}%
    \special{fp}%
    \special{pa 600 2125}%
    \special{pa 600 3250}%
    \special{fp}%
    \special{pa 1650 2125}%
    \special{pa 1750 2125}%
    \special{fp}%
    \special{pa 1750 2125}%
    \special{pa 1750 3250}%
    \special{fp}%
    \special{pa 2100 2125}%
    \special{pa 2000 2125}%
    \special{fp}%
    \special{pa 2000 2125}%
    \special{pa 2000 3250}%
    \special{fp}%
    \special{pa 3050 2125}%
    \special{pa 3150 2125}%
    \special{fp}%
    \special{pa 3150 2125}%
    \special{pa 3150 3250}%
    \special{fp}%
    \special{pa 3500 2125}%
    \special{pa 3400 2125}%
    \special{fp}%
    \special{pa 3400 2125}%
    \special{pa 3400 3250}%
    \special{fp}%
    \special{pa 4450 2125}%
    \special{pa 4550 2125}%
    \special{fp}%
    \special{pa 4550 2125}%
    \special{pa 4550 3250}%
    \special{fp}%
    \special{pa 4900 2125}%
    \special{pa 4800 2125}%
    \special{fp}%
    \special{pa 4800 2125}%
    \special{pa 4800 3250}%
    \special{fp}%
    \special{pa 125 2000}%
    \special{pa 350 1750}%
    \special{fp}%
    \special{pa 825 2000}%
    \special{pa 600 1750}%
    \special{fp}%
    \special{pa 1525 2000}%
    \special{pa 1750 1750}%
    \special{fp}%
    \special{pa 2225 2000}%
    \special{pa 2000 1750}%
    \special{fp}%
    \special{pa 2925 2000}%
    \special{pa 3150 1750}%
    \special{fp}%
    \special{pa 3625 2000}%
    \special{pa 3400 1750}%
    \special{fp}%
    \special{pa 4325 2000}%
    \special{pa 4550 1750}%
    \special{fp}%
    \special{pa 5025 2000}%
    \special{pa 4800 1750}%
    \special{fp}%
    \special{pa 125 250}%
    \special{pa 350 500}%
    \special{fp}%
    \special{pa 825 250}%
    \special{pa 600 500}%
    \special{fp}%
    \special{pa 1525 250}%
    \special{pa 1750 500}%
    \special{fp}%
    \special{pa 2225 250}%
    \special{pa 2000 500}%
    \special{fp}%
    \special{pa 2925 250}%
    \special{pa 3150 500}%
    \special{fp}%
    \special{pa 3625 250}%
    \special{pa 3400 500}%
    \special{fp}%
    \special{pa 4325 250}%
    \special{pa 4550 500}%
    \special{fp}%
    \special{pa 5025 250}%
    \special{pa 4800 500}%
    \special{fp}%
    \special{pa 350 1500}%
    \special{pa 350 1250}%
    \special{fp}%
    \special{pa 600 1500}%
    \special{pa 1750 1250}%
    \special{fp}%
    \special{pa 1750 1500}%
    \special{pa 3150 1250}%
    \special{fp}%
    \special{pa 2000 1500}%
    \special{pa 4550 1250}%
    \special{fp}%
    \special{pa 3150 1500}%
    \special{pa 600 1250}%
    \special{fp}%
    \special{pa 3400 1500}%
    \special{pa 2000 1250}%
    \special{fp}%
    \special{pa 4550 1500}%
    \special{pa 3400 1250}%
    \special{fp}%
    \special{pa 4800 1500}%
    \special{pa 4800 1250}%
    \special{fp}%
    \special{pa 350 1000}%
    \special{pa 350 750}%
    \special{fp}%
    \special{pa 600 1000}%
    \special{pa 1750 750}%
    \special{fp}%
    \special{pa 1750 1000}%
    \special{pa 3150 750}%
    \special{fp}%
    \special{pa 2000 1000}%
    \special{pa 4550 750}%
    \special{fp}%
    \special{pa 3150 1000}%
    \special{pa 600 750}%
    \special{fp}%
    \special{pa 3400 1000}%
    \special{pa 2000 750}%
    \special{fp}%
    \special{pa 4550 1000}%
    \special{pa 3400 750}%
    \special{fp}%
    \special{pa 4800 1000}%
    \special{pa 4800 750}%
    \special{fp}%
    \hbox{\vrule depth3.250in width0pt height 0pt}%
    \kern 5.150in
  }%
}%
\centerline{\box\graph}

\hspace{2.5in} Figure 2


\end{document}