
\documentclass[[11pt]{article}

\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\topmargin}{-0.3in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{9.5in}
\setlength{\parindent}{0.2in}
\setlength{\parskip}{0.1in}

\usepackage{psfig}
\usepackage{times}  

\renewcommand{\thefootnote}{\arabic{footnote}}

\begin{document}

\title{\bf A Locally Cache-Coherent Multiprocessor Architecture \rm}


\author{Kevin Rich \\
Computing Research Group\\
Lawrence Livermore National Laboratory\\
Livermore, CA  94551\\
\\
Norman Matloff \\
Division of Computer Science \\
University of California at Davis \\
Davis, CA  95616 \\
\\
Correspondence Author:  N. Matloff \\
matloff@heather.cs.ucdavis.edu \\
(916) 752-1953, (916) 752-7004}

\date{}
\maketitle

%\begin{center}
%\vspace*{.1in}
%{\bf Abstract}
%\vspace*{.3in}
%\begin{end}

%\maketitle

\begin{abstract}
Recently there has been considerable interest in cache coherency protocols
in shared-memory multiprocessor systems, particularly in protocols which
are scalable, i.e. suitable for very large systems.  However, cache
coherency scalability (CCS) entails heavy performance overhead and system
cost, so a critical examination of the assumptions underlying the quest
for CCS is undertaken here.  A non-CCS architecture which provides only 
``locally, but not globally, coherent'' hardware support is proposed, and
evidence is presented which shows that this architecture does well in large 
classes of application.  Special emphasis will be placed on loop calculations,
due to their prevalence in scientific applications.

\end{abstract}

\section{Introduction}
\label{sec:intro}
     In the last few years, scalability of cache coherency protocols
has been one of the most active topics of research in shared-memory
multiprocessor systems.  A nice survey of this area, with a good
list of references, was presented in \cite{sten:survey}, and new papers have 
continued to appear since then.  However, cache coherency scalability 
(CCS) requires complex hardware, which carries significant overhead and
inhibits system performance, both in absolute terms and with
respect to performance/price ratios.  It is thus worthwhile to examine 
the fundamental question of whether CCS is a sufficiently desirable 
property to warrant the heroic efforts which are being made to 
achieve it.  This question is addressed here, and evidence toward an
at least partially negative answer is presented.  

     The key point involves the type and extent of hardware which should be
devoted to support for coherent access of globally shared data, i.e., data
which are at least potentially shared by all processors in the system.  It
is found here that in many large application classes, we can replace the
major shared monitor variables (MVs) by data shared only by a fixed number
of processors, with that fixed number being independent of overall system
size.  This obviates much of the need for CCS.  Based on this idea,
an architecture which provides only ``locally, but not globally, coherent"
hardware support is proposed which is useful in many application domains, 
and is particularly suited to loop applications which are common in the
scientific computing field.

     Section 2 discusses distributed task allocation (DTA), and presents
evidence of its efficiency.  Section 3 then introduces the architecture
which is based on DTA.


\section{Decentralized Task Allocation in Loop Contexts}
\label{sec:dta}

     A common problem in the design of efficient parallel processing
software is the assignment of loop tasks to processors.\footnote{In the
context considered here, we focus on process\underline{ors}, rather
than processes\underline{es}.}  We choose to examine loop applications for
two reasons.  The first is that the concepts here will be easier to
explain in the loop setting.  Second, 
loops are a common paradigm in scientific applications.  It should be
noted, though, that DTA can be applied profitably to many other types
of applications.

     To illustrate the concepts involved,
the simpler the example the better, so let us consider the usual matrix
multiplication problem, in which an NxN matrix is to multiply an Nx1
vector.  Taking as our basic task the computation of the inner product
of the Nth row of the multiplier with the multiplicand, a {\bf for}-loop
then consists of N tasks to be allocated to the processors.

\subsection{Some Non-DTA Task-Allocation Methods for Loops}

     Let us consider the various allocation methods compared in
\cite{humm:factor}, and then discuss their relation to DTA.  First,
under self-scheduling (SS), there would be a global variable NextRow,
recording the number of the next row which is available for multiplication.
Whenever a processor finishes the multiplication of a row, it performs

\begin{verbatim}
   R = NextRow++;
\end{verbatim}
and then does the multiplication of row number R.  (Of course, the
incrementing of NextRow must be atomic.)  This approach is very
efficient in terms of load-balancing, but the variable NextRow
can become a hot spot, leading large number of cache invalidations,
for example.  Indeed, \cite{agar:sync} found that most accesses of
atomically-protected variables such as NextRow led to cache
coherency transactions.

     Another approach would be to use the {\it static chunking} (SC)
method proposed by Kruskal and Weiss \cite{krus:alloc}.  Here tasks are
assigned in ``chunks'' of K, instead of singly.  In other words, the
code above would become

\begin{verbatim}
   R = NextRow;
   NextRow += K;
\end{verbatim}
This reduces the number of accesses of NextRow by a factor of K, but
is less efficient in terms of load-balancing. 
 
    In an effort to get the load-balancing efficiency of SS and the
reduction in accesses of NextRow which stems from SC, a number of
hybrid methods have been proposed, based on the idea of guided
self-scheduling (GSS) \cite{humm:factor}.  The latter is
a variable-sized chunking method, with smaller chunks being used in later
iterations.  A variant of GSS proposed in \cite{humm:factor},
which we will call Factoring,
was shown to do quite well in the {\bf for}-loop setting
(it is specifically designed for that setting), with performance superior to
SS and SC.  

     Another queue-access method (not limited to task allocation) that has 
received considerable attention recently is the use of {\it software 
combining} (SWC) to reduce contention at the particular memory module that 
contains the global task queue \cite{good:eff,tang:swc}.  The
task-allocation software uses a binary tree data structure, with the root
node being the only one with direct accesses to the global task queue. 
Processor requests for additional work are combined at the nodes of the
tree whenever possible, distributing the memory accesses across many
nodes, and through careful planning, many memory modules, thus reducing
contention.  In the matrix example here, the code would look like

\begin{verbatim}
   R = Fetch&Add(NextRow,1);
\end{verbatim}

Another method similar to GSS, called trapezoid self-scheduling was presented
in \cite{tzen:trap}, but appeared too late for us to include in our
empirical evaluations.  The aim of this method was the same as the others,
to reduce the number of accesses to the global task allocation queue.
Our analysis showed that in ``typical'' problems it produces many more 
global accesses than DTA.

\subsection{DTA in the Loop Context}

     The method on which we will focus here is distributed task allocation
(DTA).  This method would approach loop problems such as matrix multiplication
in the following way:  Processors are considered to belong to groups, of
size G.\footnote{Taking DTA as only a software technique, as in \cite{ni:trade},
this grouping is just symbolic, not physical.  However, it will take on a
physical embodiment when we propose the LCSMA architecture in the followig
section.}  Usually tasks are allocated locally, i.e. within groups, hence
the term {\it distributed} in ``DTA.''  However, occasionally one member
of a group must go to a global variable to acquire a batch of new tasks
for the group to process.

     Specifically, the direct analogy of NextRow is now an NG-element array
LocalNextRow, where the group size NG is equal to NPROCS (the number of
processors in the
system) divided by G.  There is also a variable GlobNextRow and another
NG-element array LocalLastRow.  The I-th processor is considered to be in
group GN = I mod NG.  When this processor finishes the multiplication of a
row, it performs

\begin{verbatim}
   R = LocalNextRow[GN]++;
   if (LocalNextRow[GN] > LocalLastRow[GN])  {
      LocalNextRow[GN] = GlobNextRow;
      LR = GlobNextRow += K;
      LocalLastRow[GN] = LR - 1;
   }
\end{verbatim}
What is happening here is the following:  Initially each processor group is
allocated a set of K consecutive rows of the matrix.  The processors in
that group process these rows one-by-one, just as in the centralized case,
with LocalNextRow[GN] playing the role corresponding to NextRow in
the centralized versions.  When one of the processors discovers that these
K rows have all assigned to processors, it goes to GlobNextRow to get K
more rows for its group. This is in contrast to the SC method, in which
each {\it processor}, rather than each group, is assigned K tasks at a
time.  Note that though the entire arrays LocalNextRow and 
LocalLastRow are global variables visible to all processors---this
\underline{is} a shared-memory system, after all---the code
is written so that a processor in one group will never access the array
element intended for another group.\footnote{Clearly, this concept can
be extended in a hierarchal manner, for extremely large systems.}  

     K is a design parameter, which in the experiments here is taken
to be equal to G.

\subsection{Experimental Results}

     As mentioned earlier, our focus will be on DTA.  We are interested in
DTA because of its broad range of applicability:

\begin{itemize}
\item
DTA is applicable in open-ended problems in which the total number of tasks
to be done is not known in advance.  This is in contrast to, for example,
the Factoring approach, which is applicable only in {\bf for}-loops. 

\item
DTA also applies to other problems commonly arising in the parallel processing 
area, such as barrier synchronization.\footnote{Again, 
SWC can be used for this.}

\item
Most importantly,
though we are in this section viewing DTA as a software technique, 
DTA forms the justification for the architecture which we will propose
in the next section, and which is the main subject of this paper.

\end{itemize}  

     What we gain with DTA is the low number of accesses of
the central resource NextRow (or in this case, the new central
resource, GlobNextRow) enjoyed by SC, but with higher processor
utilization than SC.  The question, though, is how {\it much} higher
that utilization will be, compared to the optimal method in terms of
utilization, SS.

     This question was studied by Ni and Wu \cite{ni:trade}, but
not in the scalability context of interest here.  Work in the latter
context was done in \cite{matl:dta}, in which a theoretical analysis was
presented which showed that DTA can indeed be done efficiently with a
fixed group size, even with arbitrarily large overall system size.  Moreover,
it was found that in a certain sense
a ``universal'' group size exists, information which then
can be incorporated into hardware design.  

     Now in the current work, this efficiency is demonstrated empirically
by running three specific applications, and comparing them to SS, SC,
Factoring and SWC.  Note, by the way, that the empirical study is also
important in that it accounts for the overhead a processor spends in
actually acquiring a task from a task allocator, e.g. in row-assignment in
the matrix problem, which the previous theoretical analyses \cite{ni:trade,
matl:dta} did not do.

A summary of the results is:

\begin{itemize}
\item
For the smaller system sizes DTA had performance comparable to that of 
Factoring.

\item 
For the larger systems DTA was superior to Factoring, sometimes
substantially so.
  
\item
DTA was superior to both SS and SWC at all levels.
\end{itemize}

Here is some more detail.
The experiments reported here were conducted on a BBN TC2000 (a.k.a. BBN 
Butterfly) shared-memory multiprocessor consisting of 128 processor/memory 
nodes. The shared memory consists of the totality of memory modules at all 
these nodes, with internode access being via a multistage network.
Local operating system structure allowed us to run programs on a 
dedicated-machine basis, i.e., with all other jobs suspended, except for 
certain interactive jobs which run on a reserved set of four nodes.  For 
each combination of parameters, at least 15 (and as many as 65) runs were 
conducted, with the timings graphed here being the average values so obtained.
All of the DTA experiments reported here used a group size of G = 16,
and thus the numbers of processors used were various multiples of 16.  
Since four of the 128 processors are unavailable, the maximum number of 
processors we used was 112.  The graphs presented here plot program
timings t against numbers of processors p.\footnote{Note carefully the
values on the vertical axis.  They generally do not start at 0, and thus
tend to exaggerate the differences.}  

Figure~\ref{fig:matrix300} presents the data for the matrix-multiply
experiments, in which an NxN matrix multiplies an Nx1 vector.

\begin{figure}[htb]
\parbox[b]{6.5in}{
\centerline{
\psfig{figure=matrix300.eps,height=3.2in,width=4.5in}}
}
\caption{Matrix Multiply (N = 300)}
\label{fig:matrix300}
\end{figure}

Figures~\ref{fig:sort25} and ~\ref{fig:sort50} give the results for the
sorting experiments on matrices of size NxRS, in which a basic task is
to sort one matrix row.\footnote{A standard C-library Quicksort routine
was used for the sorting.} 
DTA, SS and Factoring were roughly equal for the smaller numbers of 
processors, but with DTA having a decided advantage for NPROCS $>=$ 64.

\begin{figure}[htb]
\parbox[b]{6.5in}{
\centerline{
\psfig{figure=sort25.eps,height=3.2in,width=4.5in}}
}
\caption{Sort (N= 500; RS = 25)}
\label{fig:sort25}
\end{figure}


\begin{figure}[htb]
\parbox[b]{6.5in}{
\centerline{
\psfig{figure=sort50.eps,height=3.2in,width=4.5in}}
}
\caption{Sort (N= 1000; RS = 50)}
\label{fig:sort50}
\end{figure}

Though not discussed here, we also considered a {\bf while}-loop application. 
Factoring cannot be used in this context, so we compared DTA to SS and SWC,
and found DTA to yield very strong improvements over the other two methods.
\section{A New Class of Shared-Memory Multiprocessor Architecture}
\label{sec:new}

     Recall our previous statement that \cite{agar:sync} found that in the
systems using an invalidation-based cache coherency policy, most accesses to
variables like NextRow\footnote{Or to lock variables guarding variables like
NextRow.} in the last section caused cache invalidations.  As system size
grows, more and more processors are accessing NextRow, and the problem
gets worse and worse.  DTA solves this problem, because each variable
LocalNextRow[GN], is accessed only by at most a \underline{fixed} number
of processors (GN).  The variable GlobalNextRow does get accessed by all
processors, not just those in one group, but the point is that such accesses
are rare.  This has a very profound implication for the CCS question, in that
neither the variables LocalNextRow[GN] nor the variable GlobalNextRow
need CCS hardware:  

\begin{itemize}
\item
The `S' (for {\it scalability}) in ``CCS'' is irrelevant to the variables
LocalNextRow[GN], since each such variable is accessed only by the
processors in group GN.

\item
The variable GlobalNextRow is accessed so rarely that special hardware
to make atomic access efficient is not justified. 
\end{itemize}

     The second point here is central.  As mentioned earlier, CCS hardware
adds greatly to system cost, and inhibits system performance.  These
problems can be avoided in loop applications by the use of DTA, and
the empirical results of the last section and the mathematical analysis
in \cite{matl:dta} indicate that this can be done without inducing
load-balancing problems.

     In this regard, consider a concept of
{\it locally coherent shared-memory architectures} (LCSMA), which we 
define to mean shared-memory multiprocessor systems which provide 
hardware support for cache coherency within groups of processors, but 
provide no hardware support for systemwide cache coherency.  Note 
that machines such as DASH \cite{leno:dash}
are not LCSMAs.  Though their use of processor
clusters seems to have some similarity to our processor-group concept,
the point is that they do have hardware devoted to systemwide cache
coherency.  LCSMA has no such hardware.

     LCSMA denotes a broad class of architectures, with many possible 
implementations.  For example, an interconnect structure such as that 
of Figure ~\ref{fig:inter} could be used.  (For the purposes of 
readability, this figure depicts a very small system, with an
unrealistically small value of 4 for the group size G.)
Here processor elements (PE) contain the processor, cache,
and some of the global memory.  PEs are connected to other PEs via
the familiar $\Omega$-network.  What is different is 
that PEs are partitioned into groups, and processors within any 
given group are connected via snooping caches \cite{good:red} and a local bus.
These provide hardware support for local MVs, i.e., variables which are 
shared only by processors within a given group, such as LocalNextRow[GN]
in our examples in the last section.

\begin{figure}[htb]
\parbox[b]{6.5in}{
\centerline{
\psfig{figure=lcsma.eps,height=3.2in,width=4.5in}}
}
\caption{}
\label{fig:inter}
\end{figure}

     Under LCSMA the software must, where possible, avoid creation of 
systemwide MVs.  The previous section indicated how to do this with DTA
in the case of {\bf for}-loops, and DTA can be extended---still with
efficiency from the load-balancing point of view---to many other
kinds of task-allocation mechanisms, e.g. task queues \cite{matl:dta}. 
Atomic access to group-specific variables like LocalNextRow[GN], and the
problem of contention for them, are handled via hardware, e.g. the 
``locally-snoopy'' buses in Figure ~\ref{fig:inter}. 
Any MVs which remain, e.g. such as GlobalNextRow,
are handled in hardware at the group level, and then in software above 
that level, such as 
with software fetch-and-add as in \cite{good:eff,tang:swc}.
 
     In addition, LCSMA should have program-controlled cacheability of 
individual blocks, to allow the best usage, with the program having
the ability to dynamically set the cacheability mode (cacheable,
read-only cacheable, noncacheable) for the block containing a
given address/variable, especially for variables which are not
MVs.
 
We will not present details here, but as a simple illustration,
consider Gaussian elimination.  Since each row is accessed by a 
processor only once per iteration, caching is not useful, and
the rows should be made noncacheable.  In many other matrix
applications, rows are re-used within an iteration, but only for
reading, so read-only caching would be appropriate.

\section{Discussion}
\label{sec:disc}

     Though for simplicity and conciseness we have limited our
focus here to loops, it is important to note that DTA can be used
in a much wider variety of applications.  For example, problems with
general task queues can be converted to having local task queues,
rather than using one central queue.  The theoretical work in
\cite{matl:dta} lends support to this.  Thus we believe that
LCSMA is a good choice for general parallel processing applications.

     On the other hand, no machine can be optimal for all applications,
and we note that LCSMA does not provide any special help in, for
instance, synchronous algorithms with 
very short times between successive synchronizations, such as
parallel root-finding problems.  However, it is clear that approaches 
based on systemwide cache coherenecy are not good solutions to this
problem either.  The overhead due to relaying of cache update 
messages throughout an entire large system would be too heavy.  Thus other 
mechanisms would be needed if this type of application were to be targeted, 
say adding a separate broadcast channel to Figure ~\ref{fig:inter}. 

\bigskip


\begin{thebibliography}{Stenstrom  1991}

\bibitem{agar:sync}A. Agarwal and M. Cherian.  ``Adaptive Backoff Synchronization
Techniques,"
{\em Proceedings of the 16th Annual International Symposium on Computer
Architecture}, 1989, 396-406.

\bibitem{good:red}J. Goodman.  ``Using Cache Memory to Reduce Processor-Memory Traffic,"
{\em Proceedings of the 10th Annual Symposium on Computer Architecture},
1983, 124-131.

\bibitem{good:eff} J. Goodman, M. Vernon and P. Woest.  ``Efficient Synchronization Primitives
for Large-Scale Cache-Coherent Multiprocessors,'' 
{\em ACM Supercomputing},  1989, 64-75.

\bibitem{humm:factor}S. Hummel, E. Schonberg, L. Flynn.  ``Factoring:  A Method for Scheduling
Parallel Loops,'' {\em Communications of the ACM}, August 1992, 90-101.

\bibitem{krus:alloc}C. Kruskal and A. Weiss.  ``Allocating Independent Subtasks
on Parallel Processors," {\em IEEE Transactions on Software
Engineering}, SE-11 (10), 1001-1016, 1985.

\bibitem{leno:dash}D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta and J. Hennessy.
``The Directory-Based Cache Coherence Protocol for the DASH
Multiprocessor," 
{\em Proceedings of the 17th Annual International Symposium on Computer
Architecture}, 1990, 148-158.

\bibitem{matl:dta}N. Matloff ``On Decentralized Cache Coherency and Scalable Cache
Coherency,'' {\em Technical Report, University of California at Davis}, 1991.

\bibitem{ni:trade}L. Ni and C.-F. Wu.  ``Design Tradeoffs for Process Scheduling
in Shared Memory Multiprocessor Systems," {\em IEEE Transactions
on Software Engineering}, 15, 1989, 327-334.

\bibitem{sten:survey}P. Stenstrom.  ``A Survey of Cache Coherency Schemes for Multiprocessors,"
{\em Computer}, June 1990, 12-24.

\bibitem{tang:swc}P. Tang and P.-C. Yew. ``Software Combining for Distributing
Hot-Spot Addressing,'' {\em Journal of Parallel and Distributed Computing}, 10, 1990, 130-139

\bibitem{tzen:trap}T. Tzen and L. Ni.  ``Trapezoid Self-Scheduling:  A 
Practical Scheduling Scheme for Parallel Compilers" {\em IEEE Transactions on 
Parallel and Distributed Systems}, January 1993

\bibitem{yew:dist}P.-C. Yew, N.-F. Tzeng and D. Lawrie.  ``Distributing Hot-Spot
Addressing in Large-Scale Multiprocessors," {\em IEEE Transactions
on Computers}, C-36, 1987, 388-395.


\end{thebibliography}


\end{document}









