Notes on Book by Grama et al

Chapter 6:

p.241:

Some good points here about MPI replacing a lot of incompatible vendor-specific message-passing packages. MPI is a standard. Each hardware vendor can still develop a version of MPI which is fine-tuned to the vendor's platform, for better performance, but the point is that the program still works if then moved to (and of course recompiled for) another platform, since the MPI function calls are the same.

p.247:

The solution taken here to avoid deadlock--have one node to send then receive, with the other one doing receive then send--should sound familiar to you, from our discussion of hyperquicksort. Better yet, use MPI_Sendrecv(); then MPI avoids the deadlock problems for you.

p.249:

Here is a summary of Sec. 9.3.1: The odd-even sort is a relative of bubble sort. In the sequential case, each pair of neighbors, say X[6] and X[7], compare each other's values and if necessary exchange them, so that in the end X[6] is smaller than X[7]. In alternate iterations, an element compares itself either with its left or right neighbor.

In the simpler parallel case, if one has n array elements and n processors, each processor deals with one array element. But we probably don't have that many processors, so in the chunked case, the array is divided into chunks, with each processor dealing with one chunk. In each iteration, pairs of neighboring chunks are compared. Say the chunks are of size 25. Then a pair will have 50 elements. The lower-numbered processor retains the smaller 25 elements, and the higher-numbered processor retains the larger 25 elements. The code on p.249 uses this chunked approach. Again, this should remind you somewhat of the hyperquicksort algorithm.

p.250:

Recall that in MPI the entity MPI_Comm refers to the entire group of processors. In many cases, we want to break these into subgroups, called communicators. For example, the MPI function MPI_Bcast() does a send operation to a specified set of processors. We might have, say, 16 processors in all but need to send something to only 4 of them. So, we need to set up a communicator for these 4 processors, and then specify that communicator when we call MPI_Bcast().

MPI has many functions which are used to set up communicators. Sec. 6 .4 here introduces some of them. They are rather involved to use, so we will skip them, but do keep in mind the need for communicators.

p.255:

In any parallel processing system, the biggest bottleneck is communication between processors. In a network, the network latency is a major factor, and even in a shared-memory system there can be considerable latency. One of the ways MPI provides to deal with that is MPI_Isend() and MPI_Irecv(). These function calls return right away, even though the send or receive operation may be still pending (review pp.245-246 on this point).

Thse functions must be used with care, though. Consider a call

MPI_Isend(x,...);

It may be some time before the data in x is actually sent off to the network, or at least copied to an OS buffer. That means we must make sure not to overwrite x with new data until the old data is copied by the OS, etc. We can use the MPI_Wait() function to do this, as on p.259.

p.260:

I mentioned MPI_Bcast() above. Operationally speaking, it's not doing anything different than simply putting MPI_Send() inside a for loop, so why have a separate function?

The answer is that the implementers of MPI can fine-tune MPI_Bcast() so that it is faster than repeated calls to MPI_Send(). The amount of speedup depends on the platform (e.g. what kind of network we have), and on how much effort the implementers spend on the fine-tuning. (Here is where having a vendor-specific version of MPI, e.g. IBM's, can really help.) Another advantage of having a separate broadcast function is that it simplifies and clarifies our code.

All the functions in Sec. 6.6 are of this nature: They will be faster than doing things with just MPI_Send() and MPI_Recv(), and make for clearer code as well. Here is an overview:

MPI_Reduce(): Say we want to add up all the elements of an array, and that different nodes have different portions of the array. Say we want the grand total at node 0. We can have each node compute the sum of its portion of the array, and then call MPI_Reduce() with node 0 as the target. The function will add up all the partial sums and place the grand total at node 0.
MPI_Scan(): Say we want to add up all the elements of the array, and we want the grand total to be sent to all nodes, not just node 0. MPI_Scan() does this. It is like having a reduce operation followed by a broadcast.
MPI_Gather(): Say each node has a portion of an array, and we want to gather together all the portions and form one huge array at, say, node 0. This function does this for us. MPI_Scatter() does the opposite, taking a grand array at one node and sending portions of it to the other nodes.

p.266:

This code forms the product Ab, placing the result in x, where A is a matrix, and b and x are vectors.

Note the term distributed across the nodes, e.g. "the matrix A is distributed across the nodes." This means that different processors have different chunks of A. Node 0 will have some rows of A, node 1 will have some other rows of A, etc. This is very common. Remember, this matrix multiplication is probably just one part of a very large, complex program. Earlier operations of the program resulted in A being distributed across the nodes. The same is true for b, and the same will be true for x after we do the multiplication. (Recall the assumption in hyperquicksort that the initial data is distributed across the nodes. Again, this may be very natural, depending on what other operations the program does.)

Each node only cares about its own rows of A; it does not need to know the other rows of A in order to do its job. By contrast, each node needs to know all of b. Yet each node has only part of b. So, on line 20 we see the code calling MPI_Allgather(), which causes each node to send its chunk of b to all nodes, thereby resulting in each node having all of b.

p.269:

This is the same algorithm used in our PLN. Here are some correspondences:

The PLN finds all the shortest paths from vertex 0, whereas this program is more general, finding shortest paths from vertex source.
Note that just as we have seen with other MPI code involving matrices, each MPI node possesses only part of the one-hop distance matrix G, in this case some set of nlocal columns of G, and thus handles only a portion of the vertices. This also implies that each vertex has two ID numbers--its original number and also its number relative to the subset of vertices handled by this CPU. For example, in line 32 we see that our vertex source is referred to by its relative vertex number source-firstvx.
The one-hop distance matrix in the PLN was named G; in the PLN it is wgt. The latter is stored as a one-dimensional array, In other words, even though we think of wgt as a matrix, it is stored as one-dimensional. The storage uses row-major ordering, and again, recall that each CPU stores only part of the matrix G. So in line 58, for instance, wgt[u*nlocal+j] refers to the element in the uth row and jth column of the submatrix, which has nlocal columns.
In the PLN the results were stored in Dist, while here the results are stored in lengths.
The analog of the set NonDone in the PLN is the array marker here; marker[i] = 1 means that vertex i has not been checked yet, i.e. vertex i would be in the set NoneDone in the PLN context.

Again recall that each MPI node possesses only a portion of the wgt matrix, and thus handles only a portion of the vertices of the graph. So, even though we found the minimum value for our set of vertices in lines 39-44, we must call MPI_Allreduce() to find the overall minimum. Note the usage of the MPI_MINLOC operation, and the MPI_2INT data type.

p.279:

What is the word platforms in the title of this chapter referring to? The answer is, for the most part, hardware shared-memory machines. A few of the constructs, such as barriers, are applicable to any platform using the shared-memory programming paradigm, both hardware- and software-based, but most of the material is applicable only to the hardware shared memory case.

The first part of the chapter deals with threads. Again, this will assume a shared-memory multiprocessor machine. There was once a software-based threads system which ran on NOWs, named DSM-Threads, but it was never made public. The second half of the chapter deals with OpenMP, a programming view which is higher-level than threads; it creates threads but they are not seen. The developers of the Treadmarks software distributed shared memory system, have developed a version of OpenMP which works in their context, i.e. NOWs.

A review of OS process scheduling is essential for understanding this chapter:

In a modern OS (we will use UNIX as our running example), processes, i.e. invocations of programs, take turns running, say 50 milliseconds per turn. A turn is called a timeslice or quantum, and when the turn is ended we say that it has been pre-empted. There are two mechanisms which can make a turn end:

A turn can end due a hardware interrupt from a timer or from an I/O device,, which forces a jump to the OS.
A turn can end if the program currently running makes a system call, i.e. a call to a function in the OS. In most applications, this is typically a request that the OS perform some I/O operation, e.g. read from the keyboard or disk. In our case here, another typical class of operations will be for threads.

Note in either case, there is a jump to the OS. This is the only way the OS ever gets to run. And when it does run, it looks in its list of processes (the process table), and chooses one to give the next turn to. It then starts that turn by setting the stack properly and then executing an IRET (return from interrupt) instruction, which causes the other process to resume running.

The OS' process table shows the state of each process in memory, mainly Run state, Ready state or Sleep state. A process which is in Run state is literally running right now A process which is in Ready state means that it is ready to run but simply waiting for its next turn. The processes in Sleep state are waiting for something, typically an I/O operation, and are thus currently ineligible for turns. So, each time a turn ends, the OS will browse through its process table, looking for a process in Run state, and then choosing one for its next turn. If a process is in Sleep state, say waiting for I/O, eventually the I/O will complete its operation, causing another hardware interrupt which causes the OS to run, and the OS will change that process' entry in the process table from Sleep state to Run state, resulting in that process getting a turn sometime.

Now, what happens if we have a shared-memory multiprocessor? The OS still runs in the same manner, with the only difference being that there are more interrupts. Each CPU could be the recipient of an interrupt from a timer, causing the OS to run on that CPU. (It could be running on several CPUs simultaneously. In other words, the OS is a parallel processing program too. So, it must be designed with lock variables on its critical sections, e.g. for access to its process table.)

Because of this, we write parallel processing applications using the process model. That would mean that the program begins by calling fork() and exec() to set up several processes, and calls to shmget() and shmat() to set up shared meory. However, there is a lot of overhead in processes, and especially in shared memory.

By contrast, threads use less overhead than processes and give us shared memory for free, automatically without programmer request. Thus threads have become the main method of writing shared-memory programs on shared-memory multiprocessors.

It should be noted that I am referring to system-level threads, which are created and managed by the OS, as opposed to user-level threads. System-level threads are treated as processes, and are often called light-weight processes. Each thread shows up in the OS' process table (and will show up in the output of the UNIX ps command, e.g. ps axm for Linux). Whenever an interrupt or system call at one CPU causes the OS to run, the OS will give a turn to a process showing Run state in the process table, including the threads. In this manner, we will typically have threads running on all or most of the CPUs simultaneously. None of this would be possible with user-level threads, since those are invisible to the OS; the OS would just see one process, the original program. Personally, on a single machine I much prefer use-level threads (e.g. the Pth package), since they are easier to program in (many fewer lock/unlock ops) and easier to debug (deterministic behavior, due to lack of pre-emption), but they can't be used in multiprocessor contexts.

p.280:

The point that each thread has a separate stack is important. First, it's important to note that this means a local variable has different values in different threads, quite in contrast with the nature of global variables, which are shared.

Also remember, access to shared variables is expensive (in terms of time), due to cache coherency operations, crossbar or omega-network interconnect delays, etc. So one tries to minimize such accesses, and rely on local variables as much as possible.

Finally, recall that in a NUMA context, each CPU is paired with a memory module. That memory module is accessible by all CPUs, but its paired CPU has especially fast access, as it doesn't have to traverse an interconnect, etc. If one can arrange for certain variables (whether local or global) to be stored in a given memory module, it is best to have a thread run on that paired CPU which does a lot of access to those variables.

p.282:

Note the point (even in the section title) that Pthreads is simply a standard for the definition of a collection of APIs. It is then up to various entities (vendors, open-source groups, etc.) to implement those APIs. Many different implementations exist, both for UNIX and for Microsoft Windows.

p.283:

The book notes that a newly-created thread can "pre-empt its creator"; here's why: As soon as pthread_create() returns, the new thread is already in the OS' process table, in Ready state, i.e. eligible to run. Say main() called pthread_create(). Note that main() is considered a thread too. It could be that right after pthread_create() returns, the main() thread is pre-empted and the new thread runs. This is a very important point, as it is a frequent source of bugs.

p.284:

In lines 19-20, the program is setting up thread attributes, which concern things like priorities among threads. Most parallel processing applications (as opposed to real-time programming) have equal priorities among threads, but here the program is asking to have all this program's threads of higher priority than other processes. (It may or may not be granted.)

pp.284-285:

In lines 33, 34, 56, 57 and 67, we see that the array hits is being used in two separate ways: The more obvious use is as counts of how many times our random points have fallen in the 1x1 square, but another use is as an inital seed for random number generation. Here is what is going on:

Suppose I keep calling a random number generator, resulting in x[0], x[1], x[2], etc. The relation between two successive numbers is

x[i+1] = (a * x[i] + b) % c

Here a, b and c are fixed constants which have been chosen so that the sequence of random numbers has nice "random" statistical properties. The initial number, x[0], is the seed.

In the parallel processing context, if we have several threads all calling this random number generator, two such calls may be intertwined with each other (through pre-emption of the thread of one), producing different results than if the calls were separate. This can destroy the nice statistical properties of the random number generator. We say that the function is not re-entrant or not thread-safe. Here they are using the thread-safe generator rand_r(), with the initial seed coming from hits.

p.286:

The point about false sharing is very important. If two variables are declared adjacently, e.g. x andy in

int x,y;

then most compilers will give them adjacent memory locations, which means that with high probability they are in the same memory block. Thus if they are both in a cache, then they are almost certainly both in the same cache line. If one CPU changes the value of x and the system uses, say, an invalidate protocol (an update protocol would have similar problems) , then all caches will be notified that if they have a cache line containing x, then that line is now invalid. That means that their copies of y will now be declared invalid too, even though their values of y might well be correct. This causes a lot of extra, unnecessary cache coherency traffic. (Note that this would also be true with a software DSM system.)

p.288:

They say that if an attempt to lock fails, the thread blocks. See some comments on this in the notes for p.298 below.

p.289:

Note that the code shown here is only an outline. Comments are used to show where actual code would be filled in, e.g. in lines 12-13. By the way, note lines 3-4, an illustration of the fact that global variables are crucial in most shared-memory applications (the globals are what's shared).

pp.294-295:

The wait operation, pthread_cond_wait() is useful for two reasons:

It sets up threads which are waiting for some condition to occur in a nice orderly internal queue, with the semantics of the wait call clearly showing this.
It is really efficient in the broadcast case, in which many threads are awakened at once. (See p.298.)

Here is an outline of a typical wait call and surrounding code:

call pthread_mutex_lock()
check whether condition holds
if condition does not hold
   call pthread_cond_wait()
call pthread_mutex_unlock()

When the wait is called, the OS marks the calling thread as being in Sleep state in the process table. It is important to understand how the OS even gets a chance to do this. The answer is that when the thread calls pthread_cond_wait(), that function in turn makes a call to an OS function, which then in turn changes the state of the waiting thread. Later, when another thread causes the condition to occur and calls pthread_cond_signal(), that function internally calls the OS, which removes the first waiting thread from the front of the queue, and changes that thread's state from Sleep to Ready, so it will soon get a turn.

The role of the mutex is crucial. Before calling the wait, the programmer must lock the mutex, and after the wait returns, the programmer must unlock it. Note carefully that there are lock and unlock operations which don't appear in the pseudocode above:

the internals in pthread_cond_wait() unlock the mutex, automatically and atomically, at the time of the call
those internals relock the mutex, automatically and atomically, later when the call returns

Why is all this necessary? The book is not very clear on this, but here is what is happening: As mentioned earlier, the threads system maintains an internal queue of threads waiting for a given condition. Clearly, addition or deletion of items to this queue must be done atomically, since several threads may be attempting this around the same time, and thus addition/deletion must be done in its own (internal) critical section. That means a lock is required, and that is why Pthreads (or any other threads system) requires that a mutex be associated with any wait/signal operation.

The reason why a wait call automatically unlocks a mutex, and a signal relocks it, is as explained in the book: The unlock gives other threads a chance to make the condition occur; without the unlock we'd have deadlock. On the other hand, upon being awakened the waiting thread needs a chance to make use of the condition, typically in a critical section, so the automatic relocking is needed.

Very subtle bugs can arise here. In Butenhof's book, "Programming with POSIX Threads," he emphasizes the need for the awakened thread to recheck the condition, just to make sure it still holds, something like this:

call pthread_mutex_lock()
check whether condition holds
if condition does not hold
   do
      call pthread_cond_wait()
      check that the condition still holds
      if the condition no longer holds 
         call pthread_mutex_unlock()
   until the condition holds 
call pthread_mutex_unlock()

Why is this extra work necessary? Suppose thread A is first to check the condition, finding that it doesn't hold. Thread A then calls the wait. Later, thread B causes the condition to occur, and calls the signal. That in turn causes the state of thread B to change to Ready. BUT--before thread B gets a turn, thread C reaches that same code segment. Thread C checks the condition, finds that it holds, and then acts on it, possibly changing the condition so that it now does NOT hold. Then thread B finally gets a turn, thinks that the condition holds (since it returned from the wait call) even though the condition does NOT hold, and then B will take incorrect actions accordingly.

Moreover, there is concern over possible spurious wakeups. As one description puts it (see http://nf.apac.edu.au/facilities/sc/compaq_mirror/HTML/ARH9RBTE/DOCU0005.HTM ):

It can be extremely expensive, especially on a symmetric multiprocessor (SMP) system, for the implementation to ensure that a condition signal wakes one and only one thread. It is faster and easier to avoid the extra complication.

The "expense" referred to here is slowdown in system speed. This in turn is probably related to the memory consistency issues discussed briefly in our PLN.

That source also points out that if the thread receives a signal from the UNIX system (not a threads signal), this can also cause a spurious wakeup.

p.290:

The term work queue used here is synonymous with the terms used in our PLN, task pool and task farm. Note again that this is only a partial program; e.g. done() would have to be filled in according to the given application.

p.292:

You should spend some extra time thinking about the issue of lock overheads. For example, reread the material in our PLN on cache coherency, in which our specific example concerned locks; the point was that locking operations typically trigger a lot of cache coherency traffic, a cause of slowdown.

p.298:

The book's phrasing

When a thread performs a condition wait, it takes itself off the runnable list--consequently, it does not use any CPU cycles until it is woken up. This is in contrast to a mutex lock, which consumes CPU cycles as it polls for the lock.

is often true but somewhat misleading. The first sentence is definitely correct, as I explained earlier. The problem is with the second sentence.

Suppose thread X calls pthread_mutex_lock() and is waiting for the lock to be unlocked by another thread. The book is correct in saying that (in many implementations) X does stay in Ready state in the process table, and will continue to get turns to run. However, the book's phrasing makes it sound like X will use its turns in full, which is not the case.

Let's look at an example implementation. The source code for pthread_mutex_lock() begins with

acquire(&mutex->m_spinlock);

The code for that function is

static inline void acquire(int * spinlock)
{
  while (testandset(spinlock)) __sched_yield();
}

The term spin lock refers to looping ("spinning") repeatedly, watching a lock, and trying to grab it when it becomes unlocked. In our PLN, we described this with generic assembly language code:

TRY:  TAS R,L  # test-and-set
      JNZ TRY  # if test fails, loop again

So, let's see to what extent that this Pthreads implementation's naming a variable spinlock makes sense. To this end, let's first look at the testandset() function, which for Intel 386-family machines is in the file sysdeps/i386/pt-machine.h:

static inline int testandset(int *spinlock)
{
  int ret;

  __asm__ __volatile__("xchgl %0, %1"
        : "=r"(ret), "=m"(*spinlock)
        : "0"(1), "m"(*spinlock));

  return ret;
}

This is inline assembly language, consisting mainly of an xchgl instruction--which is definitely one of Intel's test-and-set instructions.

However, note what happens with that test-and-set call above:

  while (testandset(spinlock)) __sched_yield();

We try test-and-set once, and if doesn't work, we call __sched_yield() to relinquish our turn.

In other words, yes, the waiting thread does continue to get turns, as opposed to being in Sleep state--BUT each turn is very short, and thus the drain on CPU cycles as not nearly as severe as the book's phrasing implies.

Note by the way that a related systems call, sched_yield() (possibly the same one, due to name mangling) can be called directly by an application programmer. The process is saying to the OS, "Please end my turn." In a threads context, this is often a useful thing to do (though more with non-preemptive user-level threads).

p.303:

It is important to keep in mind that the variable l->readers does NOT include threads which are waiting to read. It simply counts the number of threads currently reading.

For that reason, line 62 in the code appears to be incorrect. Suppose a write had been in progress, and during that time one or more read requests had come in. Each of those requests would now be stuck at line 24, so when the write finishes and unlocks the lock, we would like line 63 to allow those read requests to proceed. However, they would not be able to do so, since line 62 depends on a condition which would not hold. Line 62 should be changed to test for waiting reads.

Note also that the code here gives new writes priority over new reads in a situation in which no writes or reads are currently in progress. That is different from many implementations of read/write locks, such as the ones in the famous Butenhof threads book; click here and here.

p.384:

They define an array of numbers as being bitonic if it is first increasing and then decreasing, or vice versa. Note the trivial special cases. For example, an array which is only increasing is bitonic. Another special case, one which will turn out to be very important, is arrays of length 2; any such array is bitonic.

Next, they introduce the notion of a bitonic split, defined in Equation (9.1). Figure 9.5 illustrates it very well. Note that the vertical bars show the split points. So, in the line labeled "1st Split," subsequence s₁ is (3,5,8,...,14,0) and s₂ is (95,90,60,...,20).

To illustrate b_i, in s₁ in the first split in Fig. 9.5, we b_i is 14. All the elements before it--3,5, ...,12--do indeed come from the increasing part of the previous line, and all the elements after it--just 0--come from the decreasing part.

Note again, though, that there are special cases, such as (10,12,14,9) in the second split. Here b_i^' = 10. All the elements which come after it are from the increasing part of the parent sequence (3,5,8,9,10,12,14,0), but there are NO elements before it, thus vacuously satisfying the requirement "all elements before it come from the decreasing part of the parent sequence."

The authors don't explain why b_i even exists. Here's why: In the definition of s_i in Equation (9.1), you can see that a series of potential swaps is considered. First we ask whether to swap a₀ and a_n/2, carrying out the swap if the latter is smaller than the former. Then we consider whether to swap a₁ and a_n/2+1, and so on. What the i in b_i really is is the first index at which a swap is carried out.

For concreteness, let's say n = 100. Remember that for now the authors are assuming that the increasing part of the sequence goes through index 50, followed by the decreasing part. (Note: The sequences s₁ and s₂ are still defined around n/2 even if that is not the place of switchover from increasing to decreasing.) Suppose the index at which the first swap is done is 28, so a₂₈ >= a₅₈. Now think about the potential swap at index 29. That index is in the increasing part of the sequence, while its partner a₇₉ is in the decreasing part, so we have

a₇₉ <= a₇₈ <= a₅₈ <= a₅₉

So you can see that a₅₉ and a₇₉ should be swapped too. And then by the same reasoning, a₆₀ and a₈₀ should be swapped, and so on. In other words, everything starting with a₂₈ will be swapped. So, everything after a₂₈ in the new s₁ will be from the decreasing part of the parent sequence. And since index 28 was the site of the first swap, everything before 28 will be intact from their sites in the parent sequence, i.e. the increasing part of the parent sequence. So, b₂₈ will indeed have the property claimed for it by the authors.

The authors then show that

the two subsequences s₁ and s₂ produced by a split are themselves bitonic, and
all the elements in s₁ are smaller than all the elements in s₂

That suggestions a classical recursion. To sort a bitonic sequence, we do a split, recursively call the function on s₁ and s₂, then simply concatenate the results of those calls. The original sequence is now sorted.

Now, what about the general case, in which our original sequence is not bitonic? The answer is given in Fig. 9.7. As explained in the text, the mechanism denoted +BM[n] (I'm missing the circle) takes a bitonic sequence of length n as input, and outputs the sorted version of the sequence, in ascending order. The mechanism -BM[n] does the same thing, but outputs in descending order. Now, here's how the network works:

Let the original sequence be denoted (a₀,a₁,..., a_n-1). First divide the sequence into pairs: (a₀, a₁), (a₂,a₃), etc. Each pair is bitonic; after all, it's got to be either increasing or decreasing, right? Then we apply +BM[2] to the first pair, -BM[2] to the second pair, +BM[2] to the third pair, etc. That will give us an increasing pair, followed by a decreasing pair, followed by an increasing pair, etc. Since an increasing pair followed by a decreasing pair forms a bitonic sequence of length 4, we now have n/4 bitonic sequences of length 4.

Now we do the same thing. We apply +BM[4] to the first quartet, -BM[4] to the second quartet, +BM[4] to the third quartet, etc. That gives us an increasing quartet, followed by a decreasing quartet, followed by an increasing quartet, etc. Those first two quartets then form a bitonic sequence of length 8, as do the second two quartets, and so on. So we feed the first two quartets into +BM[8], the second two into -BM[8], etc. Eventually we get two bitonic sequences of length n/2, which we feed into +BM[n/2], to get our full sorted sequence.

The whole thing is shown with "wires" in Fig. 9.8. The authors state that each "wire" represents one PE. (Review the symbols in Fig. 9.3 before continuing.) However, we set up log₂n time stages, each time stage corresponding to one column in the figure, so that we have better load balancing. (The authors' scheme is better if we have many arrays, not just one, because we can do pipelining; after the first array is processed by the first column, it can move to the second column, making room for the second array to begin in the first column, etc.)

p.388:

As pointed out by the authors, the bitonic merge operation is tailor-made for hypercubes. The reason is that the communications shown in Fig. 9.8 go between two array indices which differ by just one bit in their base-2 representations. That's perfect for hypercubes, since any two nodes whose node numbers differ by only one bit are neighbors, and thus have a direct communication link.

p.407:

The data assigned to each process is determined from a load-balancing perspective. For example, after the first step, P₀ and P₁ were assigned to the S array. If P₂ had also been assigned to S, then processes P₀, P₁ and P₂ would have too little to do, and P₂, P₃ and P₄ would have too much.

p.408:

Note the importance of the auxiliary array A'. If we were to both read from and write to the same array A here, we would risk overwriting some information that we need to use.

p.412:

Here and below, for concreteness assume we have 10 processors.

Suppose first that we know that our array to be sorted comes from a uniform distribution on some known interval, say (0,1). Then we would set our "buckets" to (0,0.1), (0.1,0.2), ..., (0.9,1.0). Each process i would then fill the buckets from its initial data chunk. It would then send the data in bucket j to process j, j=0,1,...,9, j != i. Each process would then sort its resulting numbers, and then the whole array would be sorted, since the numbers at process i are all less than those at process j, for all i < j.

We would set up similar buckets if the distribution were known to be uniformly distributed on any other interval (r,s).

However, in the general case, we must use a data-dependent method for setting up the buckets. It will be much easier to understand this material if we define a little terminology.

First, if we have n numbers X₀, X₁,..., X_n-1, the corresponding order statistics Y₀, Y₁,..., Y_n-1 are defined to the sorted version of the X_i. In other words, Y₀ is the smallest of the X_i, Y₁ is the second-smallest of the X_i, and so on.

Second, if we have a probability distribution, its p^th quantile is defined to be the value z such that a proportion p of the numbers in the distribution are smaller than z. That sounds abstract, but all it is saying is that, for instance, the 0.5^th quantile of the distribution is its median.

What is the problem in general? It is not that we don't know r and s; we can always choose r to be so small and s to be so large that all of our data fall in (r,s). Instead, the problem is the lack of uniformity. If we divide (r,s) into 10 equal-sized intervals, then different processors will likely get widely different amounts of work to do. To get good load balancing, we need to know the (0.1i)^th quantiles of the distribution. We don't know those quantiles, but we can estimate them from the data. Here is how we can do it in an efficient manner.

The naive way to estimate those quantiles would be to just estimate the (0.1i)^th quantile by the order statistic Y_0.1in. (Of course, the subscript here should be floor(0.1in), but we will omit the floor() for simplicity of notation.) But in order to get the Y_j, we'd need to sort the X_i, which is our goal in the first place!

What we can do, though, is have each process sort its own initial data chunk, so that the process will have the order statistics for its chunk. Denote the initial data chunk for process i by U_i0,..., U_i,0.1n-1, and let V_i0,...,V_i,0.1n-1 denote the corresponding order statistics. Process i could then use V_i,0.1j as its estimate of the (0.1j)^th quantile of the distribution. These values would be called splitters (They split the big interval into little ones.)

However, there would be two problems with this:

each process would have a different set of buckets, thus making the data exchange impossible
the estimates would not be very accurate, as they would be based on a sample of size 0.1n instead of n

The solution is to pool all the estimates. We could have each process i send its splitters V_i,0.1j to process 0. The latter process would sort the values and get new splitters based on the grand pool, and then broadcast the new splitters to all the processes. These same new splitters would then be used by all processes, solving our earlier problem of having a different set of splitters for each process, and they would be better estimates of the true quantiles, since they would be based on more data.

Program 6.7 implements (a slight variant of) this scheme.

pp.516-517:

If f(x) = min_j r_j and r_j has only one call to f, we say f is monadic. The purpose of these classifications is to aid in describing general solutions to such problems, e.g. a general solution to serial monadic problems.

p.518:

This is a special subcategory of the shortest-path problem. Note that the C's are unknown while the c's are known.

p.519:

To understand (12.4), note that for A nxn, x nx1 and y = Ax, y_i = sum_j A_ij x_j. Replacing x by + and + by min, that equation becomes y_i = min_j (A_ij + x_j), which is exactly the form of (12.4). So, by a clever trick, the dynamic programming problem is reduced to a matrix multiplication problem. This is important because we already know how to parallelize the latter type of problem.

p.521:

The idea of the recursion is this: We already know F[i-1,x], the best profit from using the first i-1 items with a knapsack of capacity x. Now consider the effect of adding in the i^th item into consideration. Our new best configuration either doesn't use the i^th item or does use it. If the new best doesn't use it, then the best for the first i items and capacity x is the same as the best for the first i-1 items and capacity x, i.e. F[i,x] = F[i-1,x]. On the other hand, if the new best does use the i^th item, then the first i-1 items were best for a capacity x-w_i, with profit F[i-1,x-w_i] on those items, and overall profit equal to that value plus p_i. In other words, in that second case, F[i,x] = F[i-1,x-w_i]+p_i.

A PRAM is an idealization of shared memory, in which memory access has no overhead, even when multiple processors simultaneously attempt memory accesses. However, their approach here does apply to real shared-memory systems.

Here the computation of F[i,x] depends on the element which is "north" of it and the elements "northwest" of it in the matrix. The computation must thus wait for the computations of those items. The easiest way to make this happen is to have a barrier after the computation of each row.

In the message-passing case, we need cyclical communication, in a manner very similar to that of Cannon's algorithm for matrix multiplication in Sec. 8.2.2.

p.523:

The idea behind the recursion is this: F[i-1,j-1] gives us the length of the LCS involving x₁,...,x_i-1 and y₁,...,y_j-1. (By the way, note that the authors accidentally changed from a/b to x/y symbols.) Now look at x_i and y_j. If they are the same, then we have just found an LCS which is 1 longer than the old one, i.e. F[i,j] = F[i-1,j-1] + 1. Etc.

p.525:

Again we see that F[i,j] depends on its neighbors to the north and northwest. But it also depends on neighbors to the west. Thus we can't just do things row by row, as in the knapsack case. By using the diagonal approach described here, we can have several simultaneous computations in progress without dependency problems. Computation is done in "waves": Each wave extends from southwest to northeast, and consists of one set of simultaneous computations. For example, in one step we will compute F[3,0], F[2,1], F[1,2] and F[0,3]. The waves themselves move in a northwest to southeast direction.