\chapter{The Parallel Prefix Problem} 
\label{chap:prefix}

An operation that arises in a variety of parallel algorithms is that of
{\it prefix} (or {\it scan}).  In its abstract form, it inputs a
sequence of objects $(x_0,...,x_{n-1})$, and outputs $(s_0,...,s_{n-1})$, where

\begin{equation}
\label{scandef}
\begin{array}{c}
s_0 = x_0, \\
s_1 = x_0 \otimes x_1, \\
..., \\
s_{n-1} = x_0 \otimes x_1 \otimes ... \otimes x_{n-1}
\end{array}
\end{equation}

where $\otimes$ is some associative operator.

That's pretty abstract.  The most concrete example would be that in
which $\otimes$ is + and the objects are numbers.  The scan of
(12,5,13) would then be (12,12+5,12+5+13) = (12,17,30).

This is called an {\bf inclusive} scan, in which $x_i$ is included in
$s_i$.  The {\bf exclusive} version of the above example would be
(0,12,17).

Prefix scan has become a popular tool in the parallel processing
community, applicable in a surprising variety of situations.  Various
examples will arise in succeeding chapters, but we'll present one in the
next section in order to illustrate the versatility of the prefix
approach.

\section{Example:  Permutations}
\label{fibprefix}

Say we have the vector (12,5,13,8,88).  Applying the permutation (2,0)
would say the old element 0 becomes element 2, the old element 2 becomes
element 0, and all the rest stay the same.  The result would be
(13,5,12,8,88).  If we then applied the permutation (1,2,4), it would
mean that element 1 goes to position 2, 2 goes to 4, and 4 goes to 1,
with everything else staying put.  Our new vector would then be
(13,88,5,8,12).

This can be cast in matrix terms, by representing any permutation as
a matrix multiplication.  We just apply the permutation to the identity
matrix I, and then postmultiply the (row) vector by the matrix.  For
instance, the matrix corresponding to the permutation (1,2,4) is

\begin{equation}
\left (
\begin{array}{rrrrr}
0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
\end{array}
\right )
\end{equation}

so applying (1,2,4) to (12,5,13,8,88) above can be done as

\begin{equation}
(12,5,13,8,88)
\left (
\begin{array}{rrrrr}
0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
\end{array}
\right )
= (13,5,12,8,88)
\end{equation}

So in terms of (\ref{scandef}), $x_0$ would be the identity matrix,
$x_i$ for i $>$ 0 would be the i$^{th}$ permutation matrix, and
$\otimes$ would be matrix multiplication.  

Note, however, that although we've couched the problem in terms of
matrix multiplication, these are {\it sparse} matrices, i.e. have many
0s.  Thus a general parallel matrix-multiply routine may not be
efficient, and special parallel methods for sparse matrices should be
used (Section \ref{sparse}).

Note that the above example shows that in finding a scan,

\begin{itemize}

\item the elements might be nonscalars

\item the associative operator need not be commutative

\end{itemize}

\section{General Strategies for Parallel Scan Computation}
\label{genpar}

For the time being, we'll assume we have n threads, i.e. one for each
datum.  Clearly this condition will often not hold, so we'll extend
things later.

We'll describe what is known as a {\it data parallel} solution to the
prefix problem.  

Here's the basic idea, say for n = 8:

{\bf Step 1:}

\begin{eqnarray}
x_1 &\leftarrow& x_0 + x_1 \\
x_2 &\leftarrow& x_1 + x_2 \\
x_3 &\leftarrow& x_2 + x_3 \\
x_4 &\leftarrow& x_3 + x_4 \\
x_5 &\leftarrow& x_4 + x_5 \\
x_6 &\leftarrow& x_5 + x_6 \\
x_7 &\leftarrow& x_6 + x_7
\end{eqnarray}

{\bf Step 2:}

\begin{eqnarray}
x_2 &\leftarrow& x_0 + x_2 \\
x_3 &\leftarrow& x_1 + x_3 \\
x_4 &\leftarrow& x_2 + x_4 \\
x_5 &\leftarrow& x_3 + x_5 \\
x_6 &\leftarrow& x_4 + x_6 \\
x_7 &\leftarrow& x_5 + x_7
\end{eqnarray}

{\bf Step 3:}

\begin{eqnarray}
x_4 &\leftarrow& x_0 + x_4 \\
x_5 &\leftarrow& x_1 + x_5 \\
x_6 &\leftarrow& x_2 + x_6 \\
x_7 &\leftarrow& x_3 + x_7
\end{eqnarray}

In Step 1, we look at elements that are 1 apart, then Step 2 considers
the ones that are 2 apart, then 4 for Step 3.

Why does this work?  Well, consider how the contents of $x_7$ evolve
over time.  Let $a_i$ be the original $x_i$, i = 0,1,...,n-1.  Then here
is $x_7$ after the various steps:

\begin{tabular}{|r|r|}
\hline
step & contents \\ \hline
\hline
1 & $a_6+a_7$ \\ \hline 
2 & $a_4+a_5+a_6+a_7$ \\ \hline 
3 & $a_0+a_1+a_2+a_3 + a_4+a_5+a_6+a_7$ \\ \hline 
\end{tabular}

Similarly, after Step 3, the contents of $x_7$ will be
$a_0+a_1+a_2+a_3 + a_4+a_5+a_6$ (check it!).   So, in the end, the
locations of $x_i$ will indeed contain the prefix sums.

For general n, the routing is as follows.  At Step i, each $x_j$ is
routed both to itself and to $x_{j+2^{i-1}}$, for $j >= 2^{i-1}$.  
(Some threads, more in each successive step, are idle.)

There will be $\log_2 n$ steps, or if n is not a power of 2, case the
number of steps is $\lfloor log_2 n \rfloor$.

Note these important points:

\begin{itemize}

\item The location $x_i$ appears both as an input and an output in the
assignment operations above.  In our implementation, we need to take
care that the location is not written to before its value is read.  One
way to do this is to set up an auxiliary array $y_i$.\footnote{This is
called a {\bf red/black method}, Think of one array as ``red'' and the
other as ``black.''  It arises often in parallel computation.}  
In odd-numbered
steps, the $y_i$ are written to with the $x_i$ as inputs, and vice versa
for the even-numbered steps.

\item As noted above, as time goes on, more and more threads are idle.  
Thus load balancing is poor.

\item Synchronization at each step incurs overhead in a
multicore/multiprocessr setting.  (Worse for GPU if multiple blocks
are used).

\end{itemize}

Now, what if n is greater than p, our number of threads?  Let
Ti denote thread i.  The standard approach is that taken in Section
\ref{cumulsums}:

\begin{lstlisting}[numbers=left]
break the array into p blocks
parallel for i = 0,...,p-1
   Ti does serial scan of block i, resulting in Si
form new array G of rightmost elements of each Si
do parallel scan of G
parallel for i = 1,...,p-1
   Ti adds Gi to each element of block i
\end{lstlisting}

For example, say we have the array

\begin{Verbatim}[fontsize=\relsize{-2}]
2 25 26 8 50 3 1 11 7 9 29 10
\end{Verbatim}

and three threads, and our operation is summing.  We break the data into
three sections,

\begin{tabular}{|c|c|c|}
\hline
2 25 26 8 &  50 3 1 11 & 7 9 29 10 \\ \hline
\end{tabular}

and then apply a scan to each section:

\begin{tabular}{|c|c|c|}
\hline
2 27 53 61 &  50 53 54 65 & 7 16 45 55 \\ \hline
\end{tabular}

But we still don't have the scan of the array overall.  That 50, for
instance, should be 61+50 = 111 and the 53 should be 61+53 = 114.  In
other words, 61 must be added to that second section, (50,53,54,65), and
61+65 = 126 must be added to the third section, (7,16,45,55).  This
then is the last step, yielding

\begin{tabular}{|c|c|c|}
\hline
2 27 53 61 &  111 114 115 126 & 133 142 171 181 \\ \hline
\end{tabular}

Another possible approach would be make n ``fake'' threads FTj.  Each Ti
plays the role of n/p of the FTj.  The FTj then do the parallel scan as
at the beginning of this section.  Key point:  Whenever a Ti becomes
idle, it is assigned to help other Tk.

\section{Implementations}
\label{prefiximps}

The prefix scan operation is common enough for it to be officially
incorporated into various widely-used parallel computation platforms,
such as:

\begin{itemize}

\item The MPI standard actually includes built-in parallel prefix functions,
{\bf MPI\_Scan()}.  A number of choices are offered for $\otimes$, such
as maximum, minimum, sum, product etc.

\item Intel's Threads Building Blocks (TBB), similar in nature to OpenMP
(though more sophisticated), includes a prefix scan operation.

\item The Thrust library for CUDA or OpenMP includes functions {\bf
thrust::inclusive\_scan()} and {\bf thrust::exclusive\_scan()}.

\item The CUDPP (CUDA Data Parallel Primitives Library) package contains CUDA
functions for sorting and other operations, many of which are based on
parallel scan.  See \url{http://gpgpu.org/developer/cudpp} for the
library code, and a detailed analysis of optimizing parallel prefix in a
GPU context in the book {\it GPU Gems 3}, available either in bookstores
or free online at
\url{http://developer.nvidia.com/object/gpu_gems_home.html}.

\end{itemize}

\section{Example:  Parallel Prefix Summing in OpenMP}

Here is an OpenMP implementation of the approach described at the end of
Section \ref{genpar}, for addition:

\begin{lstlisting}[numbers=left]
#include <omp.h>

// calculates prefix sums sequentially on u, in-place, where u is an 
// m-element array
void seqprfsum(int *u,int m)
{  int i,s=u[0];
   for (i = 1; i < m; i++) {
      u[i] += s;
      s = u[i];
   }
}

// OMP example, calculating prefix sums in parallel on the n-element
// array x, in-place; for simplicity, assume that n is divisible by the
// number of threads; z is for intermediate storage, an array with length
// equal to the number of threads; x and z point to global arrays
void parprfsum(int *x, int n, int *z)
{
   #pragma omp parallel
   {  int i,j,me = omp_get_thread_num(),
          nth = omp_get_num_threads(),
          chunksize = n / nth,
          start = me * chunksize;
      seqprfsum(&x[start],chunksize);
      #pragma omp barrier
      #pragma omp single
      {
      for (i = 0; i < nth-1; i++)
         z[i] = x[(i+1)*chunksize - 1];
      seqprfsum(z,nth-1);
      }
      if (me > 0) {
         for (j = start; j < start + chunksize; j++) {
            x[j] += z[me - 1];
         }
      }
   }
}
\end{lstlisting}

\section{Example:  Run-Length Coding Decompression in OpenMP}

Here is an example applying the code in the last section:  A method for
compressing data, called {\bf run-length coding}, is to store only
repeat counts in {\it runs}, where the latter means a set of
consecutive, identical values.  For instance, the sequence
2,2,2,0,0,5,0,0 would be compressed to 3,2,2,0,1,5,2,0, meaning that the
data consist of first three 2s, then two 0s, then one 5, and finally two
0s.  Note that the compressed version consists of alternating {\it run
counts} and {\it run values}, respectively 2 and 0 at the end of the
above example.

To solve this in OpenMP, we'll first call the above functions to decide
where to place the runs in our overall output.

\begin{lstlisting}[numbers=left]
void uncomprle(int *x,int nx,int *tmp,int *y,int *ny)
{
   int i,nx2 = nx/2;
   int z[MAXTHREADS];
   for (i = 0; i < nx2; i++) tmp[i+1] = x[2*i]; 
   parprfsum(tmp+1,nx2+1,z);
   tmp[0] = 0;
   #pragma omp parallel
   {  int j,k;
      int me=omp_get_thread_num();  
      #pragma omp for
      for (j = 0; j < nx2; j++) {
         // where to start the j-th run?
         int start = tmp[j];
         // what value is in the run?
         int val = x[2*j+1];
         // how long is the run?
         int nrun = x[2*j];
         for (k = 0; k < nrun; k++) 
            y[start+k] = val;
      }
   }
   *ny = tmp[nx2];
}
\end{lstlisting}

\section{Example:  Run-Length Coding Decompression in Thrust}
\label{thrustrunlength}

Here's how we could do the first part of the operation above, i.e.\ 
determining where to place the runs in our overall output, in Thrust:

\begin{lstlisting}[numbers=left]
#include <stdio.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/sequence.h>
#include <thrust/remove.h>  

struct iseven {
   bool operator()(const int i)
   {  return (i % 2) == 0;
   }
};

int main()
{  int i;
   int x[12] = {3,2,2,0,1,5,2,0};
   int nx = 12;
   thrust::device_vector<int> out(nx);
   thrust::device_vector<int> seq(nx);
   thrust::sequence(seq.begin(),seq.end(),0);
   thrust::device_vector<int> dx(x,x+nx);
   thrust::device_vector<int>::iterator newend =
      thrust::copy_if(dx.begin(),dx.end(),seq.begin(),out.begin(),iseven());
   thrust::inclusive_scan(out.begin(),out.end(),out.begin());
   // "out" should be 3,3+2 = 5,3+2+2=7,...
   thrust::copy(out.begin(), newend,
      std::ostream_iterator<int>(std::cout, " "));
   std::cout << "\n";
}
\end{lstlisting}

\section{Example:  Moving Average}
\label{ma}

A {\it moving average} is defined as follows.  With input $x_1,..,x_n$
and window width $w$, the output is $a_w,...,a_n$, where

\begin{equation}
\label{defineai}
a_i = \frac{x_{i-w+1}+...+x_i}{w}
\end{equation}

The goal is to address the question, ``What has the recent trend
been?'' at time $i$, $i= w,...,n$.

\subsection{Rth Code}

{\bf Rth} is an R interface to Thrust, for certain algorithms, including
moving average, via the function {\bf rthma()}.  You can download it
from \url{https://github.com/Rth-org/Rth}.

Below is the C++ code, contents of the file {\bf rthma.cpp} in the {\bf
Rth} package.  Before reading it, note that the {\bf Rth} code here
uses {\bf Rcpp}, a very handy interface from R to C++.\footnote{Later
versions of {\bf Rth} do not use {\bf Rcpp}.}

\begin{lstlisting}[numbers=left,language=c]
// Rth interface to Thrust moving-average example,
// simple_moving_average.cu in
// github.com/thrust/thrust/blob/master/examples/

// C++ code adapted from example in Thrust docs

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <Rcpp.h>  // Rcpp includes
#include "backend.h"  // from Rth

// update function, to calculate current moving average value from the
// previous one
struct minus_and_divide : 
   public thrust::binary_function<double,double,double>
{
    double w;
    minus_and_divide(double w) : w(w) {}
    __host__ __device__
    double operator()(const double& a, const double& b) const
    { return (a - b) / w; }
};

// computes moving averages from x of window width w; SEXP ("S
// expression") is the name of the struct type used by R for internal
// storage of objects
RcppExport SEXP rthma(SEXP x, SEXP w, SEXP nthreads)
{
   Rcpp::NumericVector xa(x);  // convert to C++ array
   int wa = INTEGER(w)[0];  // convert this SEXP to int
   
   #if RTH_OMP
   omp_set_num_threads(INT(nthreads));
   #elif RTH_TBB
   tbb::task_scheduler_init init(INT(nthreads));
   #endif
   
   // set up device vector and copy xa to it
   thrust::device_vector<double> dx(xa.begin(), xa.end());

   int xas = xa.size();
   if (xas < wa)
       return 0;
   
   // allocate device storage for cumulative sums, and compute them
   thrust::device_vector<double> csums(xa.size() + 1);
   thrust::exclusive_scan(dx.begin(), dx.end(), csums.begin());
   // need one more sum at (actually past) the end
   csums[xas] = xa[xas-1] + csums[xas-1];

   // compute moving averages from cumulative sums
   Rcpp::NumericVector xb(xas - wa + 1);
   thrust::transform(csums.begin() + wa, csums.end(), 
      csums.begin(), xb.begin(), minus_and_divide(double(wa)));
  
   return xb;
}

\end{lstlisting}

\subsection{Algorithm}

Again with the $x_i$ as inputs, it first computes the
cumulative sums $c_i$, using the Thrust function {\bf
exclusive\_scan()}:

\begin{lstlisting}
thrust::exclusive_scan(dx.begin(), dx.end(), csums.begin());
\end{lstlisting}

Here {\bf dx} contains a copy of the $x_i$ on the device (GPU or
OpenMP/TBB), and {\bf csums} will contain our cumulative sums $c_i$.

Since the numerator of (\ref{defineai}) is

\begin{equation}
\label{ainumer}
x_{i-w+1}+...+x_i = c_i - c_{i-w}
\end{equation}

We then need only compute these differences $c_i - c_{i-w}
$and divide by $w$.  

To do all this, we use Thrust's {\bf transform()} function:

\begin{lstlisting}[language=c]
thrust::transform(csums.begin() + wa, csums.end(), csums.begin(), 
   xb.begin(), minus_and_divide(double(wa)));
\end{lstlisting}

As the name implies, {\bf transform()} takes one or more inputs, applies
a user-specified transformation, and writes to an output vector.  You
can see that the first two arguments are first the $c_i$, shifted left
by $w$, and then the $c_i$ themselves.  The functor computes the values
in (\ref{ainumer}) and divides by $w$:

\begin{lstlisting}
struct minus_and_divide : 
   public thrust::binary_function<double,double,double>
{
    double w;
    minus_and_divide(double w) : w(w) {}
    __host__ __device__
    double operator()(const double& a, const double& b) const
    { return (a - b) / w; }
};
\end{lstlisting}

\subsection{Use of Lambda Functions}
\label{lambda}

If you have a compiler that allows C++11 {\it lambda functions}, these
can make life much simpler for you if you use Thrust, TBB or anything
that uses functors.  Let's see how this would work with our C++ function
{\bf rthma()} above (scroll down to ``changed code'').

\begin{lstlisting}[numbers=left]
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <Rcpp.h>
#include "backend.h"  // from Rth

// struct minus_and_divide now deleted

RcppExport SEXP rthma(SEXP x, SEXP w, SEXP nthreads)
{
   Rcpp::NumericVector xa(x);
   int wa = INTEGER(w)[0];
   #if RTH_OMP
   omp_set_num_threads(INT(nthreads));
   #elif RTH_TBB
   tbb::task_scheduler_init init(INT(nthreads));
   #endif
   thrust::device_vector<double> dx(xa.begin(), xa.end());
   int xas = xa.size();
   if (xas < wa) return 0;
   thrust::device_vector<double> csums(xa.size() + 1);
   thrust::exclusive_scan(dx.begin(), dx.end(), csums.begin());
   csums[xas] = xa[xas-1] + csums[xas-1];
   Rcpp::NumericVector xb(xas - wa + 1);

   // changed code
   thrust::transform(csums.begin() + wa, csums.end(), 
      csums.begin(), xb.begin(), 
      // lambda function
      [=] (double& a, double& b) {  return( (a-b)/wa);  });
  
   return xb;
}
\end{lstlisting}

Just what is going on in the line

\begin{lstlisting}
[=] (double& a, double& b) {  return( (a-b)/wa);  });
\end{lstlisting}

We are creating a function object here, as we did in the earlier version
with an {\bf operator()} notation within a {\bf struct}.  But here we
are doing so right on the spot, in one of the arguments to {\bf
thrust::transform()}.  It's similar to the concept of {\it anonymous}
functions in R.  Here are the details:

\begin{itemize}

\item The brackets in [=] tells the compiler that a function object is
about to begin.  (The = sign will be explained shortly.)

\item The variable {\bf wa} that was local to the function {\bf rthma()}
is considered ``global'' to our lambda function, thus accessible to it
{\it without passing} {\bf wa} {\it as an argument to that function).}
(This too is similar to the R case.) We say that {\bf wa} is {\it
captured} by that function.  

The = indicates that we wish to use {\bf wa} {\it by value}, i.e. just
for its value alone, rather than it being a pointer that we wish to
reference.  If it had been the latter (with the capability of changing
the pointed-to location), we'd use \& instead of =.

\item Here {\bf a} and {\bf b} are ordinary arguments.

\end{itemize}

All this is so much clearer and cleaner than using a functor!

By the way, in using {\bf g++} for the compilation, I did need to add
the {\bf -std=c++11} compiler flag to use lambda functions.