Kevin Rich
Computing Research Group
Lawrence Livermore National Laboratory
Livermore, CA 94551
Norman Matloff
Division of Computer Science
University of California at Davis
Davis, CA 95616
Corresponding Author: N. Matloff
matloff@heather.cs.ucdavis.edu
(916) 752-1953, (916) 752-7004
Recently there has been considerable interest in cache coherency protocols for shared-memory multiprocessor systems, particularly in protocols which are scalable, i.e. suitable for very large systems. However, cache coherency scalability (CCS) entails heavy performance overhead and system cost, so a critical examination of the assumptions underlying the quest for CCS is undertaken here. A non-CCS architecture which provides only ``locally, but not globally, coherent'' hardware support is proposed, and evidence is presented which shows that this architecture does well in large classes of applications. Special emphasis is placed on loop calculations, due to their prevalence in scientific applications.
The key point involves the type and extent of hardware which should be devoted to support for coherent access of globally shared data, i.e., data which are at least potentially shared by all processors in the system. It is found here that in many large application classes, we can replace the major shared monitor variables (MVs) by data shared only by a fixed number of processors, with that fixed number being independent of overall system size. This obviates much of the need for CCS. Based on this idea, an architecture is proposed which provides only ``locally, but not globally, coherent'' hardware support; it is useful in many application domains, and is particularly suited to the loop applications which are common in the scientific computing field.
Section 2 discusses distributed task allocation (DTA), and presents evidence of its efficiency. Section 3 then introduces the architecture which is based on DTA.
To illustrate the concepts involved, the simpler the example the better, so let us consider the usual matrix multiplication problem, in which an NxN matrix is to multiply an Nx1 vector. Taking as our basic task the computation of the inner product of one row of the multiplier with the multiplicand, a for-loop solution then consists of N tasks to be allocated to the processors.
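For concreteness, one such basic task could be coded as the following C routine; the names inner_product, A, x and y are our own, introduced here only so that the later sketches can refer to them.

/* One basic task: the inner product of row `row` of the N x N matrix A
   with the N x 1 vector x, storing the result in y[row].               */
extern int N;
extern double **A, *x, *y;

void inner_product(int row)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        sum += A[row][j] * x[j];
    y[row] = sum;
}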
Let us consider the various allocation methods compared in [4], and then discuss their relation to DTA. First, under self-scheduling (SS), there would be a global variable NextRow, recording the number of the next row which is available for multiplication. Whenever a processor finishes the multiplication of a row, it performs
R = NextRow++;
and then does the multiplication of row number R. (Of course, the incrementing of NextRow must be atomic.) This approach is very efficient in terms of load-balancing, but the variable NextRow can become a hot spot, leading, for example, to a large number of cache invalidations. Indeed, [1] found that most accesses of atomically-protected variables such as NextRow led to cache coherency transactions.
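As a concrete illustration, the sketch below shows a complete self-scheduled worker loop in C; the use of C11 atomics for the increment, and the name ss_worker, are our own assumptions rather than details taken from [4].

#include <stdatomic.h>

/* Self-scheduling (SS) sketch: each processor repeatedly claims the next
   unprocessed row by atomically incrementing the shared counter NextRow. */
extern atomic_int NextRow;             /* shared; initialized to 0         */
extern int N;                          /* number of rows, i.e. of tasks    */
extern void inner_product(int row);    /* one basic task (see above)       */

void ss_worker(void)
{
    for (;;) {
        int R = atomic_fetch_add(&NextRow, 1);   /* R = NextRow++ (atomic) */
        if (R >= N) break;                       /* no rows left           */
        inner_product(R);
    }
}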
Another approach would be to use the static chunking (SC) method proposed by Kruskal and Weiss [5]. Here tasks are assigned in ``chunks'' of K, instead of singly. In other words, the code above would become
R = NextRow; NextRow += K;
This reduces the number of accesses of NextRow by a factor of K, but is less efficient in terms of load-balancing.
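Under the same assumptions as in the SS sketch above, a static-chunking worker might look as follows; this is our own sketch, not code from [5], and the claim of K rows is done here with a single atomic fetch-and-add.

#include <stdatomic.h>

/* Static chunking (SC) sketch: each processor claims K consecutive rows
   at a time and then processes them without further global accesses.    */
extern atomic_int NextRow;
extern int N, K;
extern void inner_product(int row);

void sc_worker(void)
{
    for (;;) {
        int R = atomic_fetch_add(&NextRow, K);   /* claim rows R .. R+K-1 */
        if (R >= N) break;
        for (int r = R; r < R + K && r < N; r++)
            inner_product(r);
    }
}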
In an effort to get the load-balancing efficiency of SS and the reduction in accesses of NextRow which stems from SC, a number of hybrid methods have been proposed, based on the idea of guided self-scheduling (GSS) [4]. The latter is a variable-sized chunking method, with smaller chunks being used in later iterations. A variant of GSS proposed in [4], which we will call Factoring, was shown to do quite well in the for-loop setting (it is specifically designed for that setting), with performance superior to SS and SC.
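As a rough illustration of the variable-sized chunking idea (a sketch of the general scheme, not the exact algorithms of [4]), GSS is commonly described as handing out, on each request, about one P-th of the remaining iterations, where P is the number of processors; the compare-and-swap loop below, and the name gss_next_chunk, are our own way of making the claim atomic.

#include <stdatomic.h>

/* Guided self-scheduling (GSS) sketch: chunk size shrinks as the loop
   nears completion, roughly ceil(remaining / NPROCS) per request.      */
extern atomic_int NextRow;
extern int N, NPROCS;

int gss_next_chunk(int *first)       /* returns chunk size, 0 when done */
{
    int cur = atomic_load(&NextRow);
    for (;;) {
        int remaining = N - cur;
        if (remaining <= 0) return 0;
        int chunk = (remaining + NPROCS - 1) / NPROCS;     /* ceiling   */
        if (atomic_compare_exchange_weak(&NextRow, &cur, cur + chunk)) {
            *first = cur;   /* caller processes rows first .. first+chunk-1 */
            return chunk;
        }
        /* cur has been reloaded by the failed exchange; retry */
    }
}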
Another queue-access method (not limited to task allocation) that has received considerable attention recently is the use of software combining (SWC) to reduce contention at the particular memory module that contains the global task queue [3,10]. The task-allocation software uses a binary tree data structure, with the root node being the only one with direct access to the global task queue. Processor requests for additional work are combined at the nodes of the tree whenever possible, distributing the memory accesses across many nodes, and through careful planning, many memory modules, thus reducing contention. In the matrix example here, the code would look like
R = Fetch&Add(NextRow,1);
Another method similar to GSS, called trapezoid self-scheduling, was presented in [11], but appeared too late for us to include in our empirical evaluations. The aim of this method is the same as that of the others: to reduce the number of accesses to the global task-allocation queue. Our analysis showed that in ``typical'' problems it produces many more global accesses than DTA.
The method on which we will focus here is distributed task allocation (DTA). This method would approach loop problems such as matrix multiplication in the following way: Processors are considered to belong to groups of size G. Usually tasks are allocated locally, i.e. within groups, hence the term distributed in ``DTA.'' However, occasionally one member of a group must go to a global variable to acquire a batch of new tasks for the group to process.
Specifically, the direct analog of NextRow is now an NG-element array LocalNextRow, where the number of groups NG is equal to NPROCS (the number of processors in the system) divided by the group size G. There is also a variable GlobNextRow and another NG-element array LocalLastRow. The I-th processor is considered to be in group GN = I mod NG. When this processor finishes the multiplication of a row, it performs
R = LocalNextRow[GN]++;
if (LocalNextRow[GN] > LocalLastRow[GN]) {
    LocalNextRow[GN] = GlobNextRow;
    LR = GlobNextRow += K;
    LocalLastRow[GN] = LR - 1;
}
What is happening here is the following: Initially each processor group is allocated a set of K consecutive rows of the matrix. The processors in that group process these rows one-by-one, just as in the centralized case, with LocalNextRow[GN] playing the role corresponding to NextRow in the centralized versions. When one of the processors discovers that these K rows have all been assigned to processors, it goes to GlobNextRow to get K more rows for its group. This is in contrast to the SC method, in which each processor, rather than each group, is assigned K tasks at a time. Note that though the entire arrays LocalNextRow and LocalLastRow are global variables visible to all processors--this is a shared-memory system, after all--the code is written so that a processor in one group will never access the array element intended for another group.
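To make the control flow concrete, here is a self-contained C sketch of a DTA worker under simplifying assumptions of our own: the group-local counters are guarded by a per-group spinlock built from C11 atomics (standing in, for illustration, for the locally coherent hardware support discussed later in the paper), each refill takes K rows from GlobNextRow with a single atomic fetch-and-add, and the names dta_init, dta_worker and GroupLock are ours.

#include <stdatomic.h>

#define NPROCS  112             /* assumed number of processors           */
#define G       16              /* group size, as in the experiments      */
#define NGROUPS (NPROCS / G)    /* NG in the text                         */
#define K       G               /* batch size; the experiments take K = G */

extern int N;                          /* number of rows (tasks)          */
extern void inner_product(int row);    /* one basic task                  */

static atomic_int GlobNextRow;           /* global batch counter          */
static atomic_int GroupLock[NGROUPS];    /* 0 = free, 1 = held            */
static int LocalNextRow[NGROUPS];        /* guarded by GroupLock[GN]      */
static int LocalLastRow[NGROUPS];        /* guarded by GroupLock[GN]      */

void dta_init(void)
{
    atomic_store(&GlobNextRow, 0);
    for (int g = 0; g < NGROUPS; g++) {
        atomic_store(&GroupLock[g], 0);
        LocalNextRow[g] = 0;
        LocalLastRow[g] = -1;   /* empty batch: first claim forces a refill */
    }
}

/* Worker for processor I; GN = I mod NGROUPS, as in the text. */
void dta_worker(int I)
{
    int GN = I % NGROUPS;
    for (;;) {
        while (atomic_exchange(&GroupLock[GN], 1))
            ;                                   /* spin for the group lock */
        if (LocalNextRow[GN] > LocalLastRow[GN]) {
            /* group's batch exhausted: one global access fetches K rows   */
            int first = atomic_fetch_add(&GlobNextRow, K);
            LocalNextRow[GN] = first;
            LocalLastRow[GN] = first + K - 1;
        }
        int R = LocalNextRow[GN]++;
        atomic_store(&GroupLock[GN], 0);        /* release the group lock  */
        if (R >= N) break;                      /* no work left anywhere   */
        inner_product(R);
    }
}

In a real LCSMA implementation, the per-group lock and the group-local counters would live in group-local, locally coherent storage, so that only the occasional fetch-and-add on GlobNextRow crosses the global interconnect.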
K is a design parameter, which in the experiments here is taken to be equal to G.
As mentioned earlier, our focus will be on DTA. We are interested in DTA because of its broad range of applicability.
What we gain with DTA is the low number of accesses of the central resource NextRow (or in this case, the new central resource, GlobNextRow) enjoyed by SC, but with higher processor utilization than SC. The question, though, is how much higher that utilization will be, compared to the optimal method in terms of utilization, SS.
This question was studied by Ni and Wu [8], but not in the scalability context of interest here. Work in the latter context was done in [7], in which a theoretical analysis was presented which showed that DTA can indeed be done efficiently with a fixed group size, even with arbitrarily large overall system size. Moreover, it was found that in a certain sense a ``universal'' group size exists, information which then can be incorporated into hardware design.
Now in the current work, this efficiency is demonstrated empirically by running three specific applications, and comparing them to SS, SC, Factoring and SWC. Note, by the way, that the empirical study is also important in that it accounts for the overhead a processor spends in actually acquiring a task from a task allocator, e.g. in row-assignment in the matrix problem, which the previous theoretical analyses [8,7] did not do.
A summary of the results is:
Here is some more detail. The experiments reported here were conducted on a BBN TC2000 (a.k.a. BBN Butterfly) shared-memory multiprocessor consisting of 128 processor/memory nodes. The shared memory consists of the totality of memory modules at all these nodes, with internode access being via a multistage network. Local operating system structure allowed us to run programs on a dedicated-machine basis, i.e., with all other jobs suspended, except for certain interactive jobs which run on a reserved set of four nodes. For each combination of parameters, at least 15 (and as many as 65) runs were conducted, with the timings graphed here being the average values so obtained. All of the DTA experiments reported here used a group size of G = 16, and thus the numbers of processors used were various multiples of 16. Since four of the 128 processors are unavailable, the maximum number of processors we used was 112. The graphs presented here plot program timings t against numbers of processors p.
Figure presents the data for the matrix-multiply experiments, in which an NxN matrix multiplies an Nx1 vector.
Figures and give the results for the sorting experiments on matrices of size NxRS, in which a basic task is to sort one matrix row. DTA, SS and Factoring were roughly equal for the smaller numbers of processors, but with DTA having a decided advantage for NPROCS >= 64.
Though not discussed here, we also considered a while-loop application. Factoring cannot be used in this context, so we compared DTA to SS and SWC, and found DTA to yield very strong improvements over the other two methods.
The second point here is central. As mentioned earlier, CCS hardware adds greatly to system cost, and inhibits system performance. These problems can be avoided in loop applications by the use of DTA, and the empirical results of the last section and the mathematical analysis in [7] indicate that this can be done without inducing load-balancing problems.
In this regard, consider the concept of locally coherent shared-memory architectures (LCSMA), which we define to mean shared-memory multiprocessor systems which provide hardware support for cache coherency within groups of processors, but provide no hardware support for systemwide cache coherency. Note that machines such as DASH [6] are not LCSMAs: though their use of processor clusters bears some similarity to our processor-group concept, they do have hardware devoted to systemwide cache coherency, whereas LCSMA has no such hardware.
LCSMA denotes a broad class of architectures, with many possible implementations. For example, an interconnect structure such as that of Figure could be used. (For the purposes of readability, this figure depicts a very small system, with an unrealistically small value of 4 for the group size G.) Here each processor element (PE) contains the processor, cache, and some of the global memory. PEs are connected to other PEs via a multistage interconnection network. What is different is that the PEs are partitioned into groups, and processors within any given group are connected via snooping caches [2] and a local bus. These provide hardware support for local MVs, i.e., variables which are shared only by processors within a given group, such as LocalNextRow[GN] in our examples in the last section.
Under LCSMA the software must, where possible, avoid creation of systemwide MVs. The previous section indicated how to do this with DTA in the case of for-loops, and DTA can be extended--still with efficiency from the load-balancing point of view--to many other kinds of task-allocation mechanisms, e.g. task queues [7]. Atomic access to group-specific variables like LocalNextRow[GN], and the problem of contention for them, are handled in hardware, e.g. by the ``locally-snoopy'' buses in Figure . Any MVs which remain, such as GlobNextRow, are handled in hardware at the group level, and then in software above that level, for instance with software fetch-and-add as in [3,10].
In addition, LCSMA should provide program-controlled cacheability of individual blocks, to allow the best use of the caches: the program should be able to dynamically set the cacheability mode (cacheable, read-only cacheable, noncacheable) of the block containing a given address/variable, especially for variables which are not MVs.
We will not present details here, but as a simple illustration, consider Gaussian elimination. Since each row is accessed by a processor only once per iteration, caching is not useful, and the rows should be made noncacheable. In many other matrix applications, rows are re-used within an iteration, but only for reading, so read-only caching would be appropriate.
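As a sketch of what program-controlled cacheability might look like at the source level, consider the fragment below; the mode names and the set_block_cacheability routine are purely hypothetical, introduced only to illustrate the intended usage pattern, and do not correspond to any existing API.

/* Hypothetical cacheability-control interface (illustration only).           */
typedef enum { CACHEABLE, READ_ONLY_CACHEABLE, NONCACHEABLE } cache_mode_t;

/* Assumed system/library routine: set the mode of the block containing addr. */
extern void set_block_cacheability(void *addr, cache_mode_t mode);

/* Gaussian elimination example from the text: each row is touched only once
   per iteration, so mark the blocks holding the rows noncacheable.           */
void mark_rows_noncacheable(double **row, int n)
{
    for (int i = 0; i < n; i++)
        set_block_cacheability(row[i], NONCACHEABLE);
}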
On the other hand, no machine can be optimal for all applications, and we note that LCSMA does not provide any special help for, say, synchronous algorithms with very short times between successive synchronizations, such as parallel root-finding problems. However, it is clear that approaches based on systemwide cache coherency are not good solutions to this problem either: the overhead due to relaying cache update messages throughout an entire large system would be too heavy. Thus other mechanisms would be needed if this type of application were to be targeted, say adding a separate broadcast channel to Figure .