\chapter{Introduction to OpenMP}
\label{chap:omp}

OpenMP has become the {\it de facto} standard for shared-memory
programming.

\section{Overview}

OpenMP has become the environment of choice for many, if not most,
practitioners of shared-memory parallel programming.  It consists of a
set of directives which are added to one's C/C++/FORTRAN code that
manipulate threads, without the programmer him/herself having to deal
with the threads directly.  This way we get ``the best of both
worlds''---the true parallelism of (nonpreemptive) threads and the
pleasure of avoiding the annoyances of threads programming.

Most OpenMP constructs are expressed via {\bf pragmas}, i.e. directives.
The syntax is

\begin{Verbatim}[fontsize=\relsize{-2}]
#pragma omp ......
\end{Verbatim}

The number sign must be the first nonblank character in the line.

\section{Example:  Dijkstra Shortest-Path Algorithm}
\label{running}

The following example, implementing Dijkstra's shortest-path graph
algorithm, will be used throughout this tutorial, with various OpenMP
constructs being illustrated later by modifying this code:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
// Dijkstra.c

// OpenMP example program:  Dijkstra shortest-path finder in a
// bidirectional graph; finds the shortest path from vertex 0 to all
// others

// usage:  dijkstra nv print

// where nv is the size of the graph, and print is 1 if graph and min
// distances are to be printed out, 0 otherwise

#include <omp.h>

// global variables, shared by all threads by default

int nv,  // number of vertices
    *notdone, // vertices not checked yet
    nth,  // number of threads
    chunk,  // number of vertices handled by each thread
    md,  // current min over all threads
    mv,  // vertex which achieves that min
    largeint = -1;  // max possible unsigned int

unsigned *ohd,  // 1-hop distances between vertices; "ohd[i][j]" is
         // ohd[i*nv+j]
         *mind;  // min distances found so far

void init(int ac, char **av)
{  int i,j,tmp;
   nv = atoi(av[1]);
   ohd = malloc(nv*nv*sizeof(int));
   mind = malloc(nv*sizeof(int));
   notdone = malloc(nv*sizeof(int)); 
   // random graph
   for (i = 0; i < nv; i++)  
      for (j = i; j < nv; j++)  {
         if (j == i) ohd[i*nv+i] = 0;
         else  {
            ohd[nv*i+j] = rand() % 20;
            ohd[nv*j+i] = ohd[nv*i+j];
         }
      }
   for (i = 1; i < nv; i++)  {
      notdone[i] = 1;
      mind[i] = ohd[i];
   }
}

// finds closest to 0 among notdone, among s through e
void findmymin(int s, int e, unsigned *d, int *v)
{  int i;
   *d = largeint; 
   for (i = s; i <= e; i++)
      if (notdone[i] && mind[i] < *d)  {
         *d = ohd[i];
         *v = i;
      }
}

// for each i in [s,e], ask whether a shorter path to i exists, through
// mv
void updatemind(int s, int e)
{  int i;
   for (i = s; i <= e; i++)
      if (mind[mv] + ohd[mv*nv+i] < mind[i])  
         mind[i] = mind[mv] + ohd[mv*nv+i];
}

void dowork()
{  
   #pragma omp parallel  
   {  int startv,endv,  // start, end vertices for my thread
          step,  // whole procedure goes nv steps
          mymv,  // vertex which attains the min value in my chunk
          me = omp_get_thread_num();  
          unsigned mymd;  // min value found by this thread
      #pragma omp single   
      {  nth = omp_get_num_threads();  // must call inside parallel block
         if (nv % nth != 0) {
            printf("nv must be divisible by nth\n");
            exit(1);
         }
         chunk = nv/nth;  
         printf("there are %d threads\n",nth);  
      }
      startv = me * chunk; 
      endv = startv + chunk - 1;
      for (step = 0; step < nv; step++)  {
         // find closest vertex to 0 among notdone; each thread finds
         // closest in its group, then we find overall closest
         #pragma omp single 
         {  md = largeint; mv = 0;  }
         findmymin(startv,endv,&mymd,&mymv);
         // update overall min if mine is smaller
         #pragma omp critical  
         {  if (mymd < md)  
               {  md = mymd; mv = mymv;  }
         }
         #pragma omp barrier 
         // mark new vertex as done 
         #pragma omp single 
         {  notdone[mv] = 0;  }
         // now update my section of mind
         updatemind(startv,endv);
         #pragma omp barrier 
      }
   }
}

int main(int argc, char **argv)
{  int i,j,print;
   double startime,endtime;
   init(argc,argv);
   startime = omp_get_wtime();
   // parallel
   dowork();  
   // back to single thread
   endtime = omp_get_wtime();
   printf("elapsed time:  %f\n",endtime-startime);
   print = atoi(argv[2]);
   if (print)  {
      printf("graph weights:\n");
      for (i = 0; i < nv; i++)  {
         for (j = 0; j < nv; j++)  
            printf("%u  ",ohd[nv*i+j]);
         printf("\n");
      }
      printf("minimum distances:\n");
      for (i = 1; i < nv; i++)
         printf("%u\n",mind[i]);
   }
}
\end{Verbatim}

The constructs will be presented in the following sections, but first
the algorithm will be explained.

\subsection{The Algorithm}

The code implements the Dijkstra algorithm for finding the shortest
paths from vertex 0 to the other vertices in an N-vertex undirected
graph.  Pseudocode for the algorithm is shown below, with the array G
assumed to contain the one-hop distances between vertices.

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
Done = {0}  # vertices checked so far
NewDone = None  # currently checked vertex
NonDone = {1,2,...,N-1}  # vertices not checked yet
for J = 0 to N-1 Dist[J] = G(0,J)  # initialize shortest-path lengths

for Step = 1 to N-1
   find J such that Dist[J] is min among all J in NonDone
   transfer J from NonDone to Done
   NewDone = J
   for K = 1 to N-1
      if K is in NonDone
         # check if there is a shorter path from 0 to K through NewDone
         # than our best so far
         Dist[K] = min(Dist[K],Dist[NewDone]+G[NewDone,K])
\end{Verbatim}

At each iteration, the algorithm finds the closest vertex J to 0 among all
those not yet processed, and then updates the list of minimum distances
to each vertex from 0 by considering paths that go through J.  Two
obvious potential candidate part of the algorithm for parallelization
are the ``find J'' and ``for K'' lines, and the above OpenMP code takes
this approach.

\subsection{The OpenMP {\tt parallel} Pragma}

As can be seen in the comments in the lines 

\begin{Verbatim}[fontsize=\relsize{-2}]
   // parallel
   dowork();  
   // back to single thread
\end{Verbatim}

the function {\bf main()} is run by a {\bf master thread}, which will
then branch off into many threads running {\bf dowork()} in parallel.
The latter feat is accomplished by the directive in the lines

\begin{Verbatim}[fontsize=\relsize{-2}]
void dowork()
{  
   #pragma omp parallel  
   {  int startv,endv,  // start, end vertices for this thread
          step,  // whole procedure goes nv steps
          mymv,  // vertex which attains that value
          me = omp_get_thread_num();  
\end{Verbatim}

That directive sets up a team of threads (which includes the master),
all of which execute the block following the directive in
parallel.\footnote{There is an issue here of thread startup time.  The
OMPi compiler sets up threads at the outset, so that that startup time
is incurred only once.  When a {\bf parallel} construct is encountered,
they are awakened.  At the end of the construct, they are suspended
again, until the next {\bf parallel} construct is reached.} Note that,
unlike the {\bf for} directive which will be discussed below, the {\bf
parallel} directive leaves it up to the programmer as to how to
partition the work.  In our case here, we do that by setting the range
of vertices which this thread will process:

\begin{Verbatim}[fontsize=\relsize{-2}]
      startv = me * chunk; 
      endv = startv + chunk - 1;
\end{Verbatim}

Again, keep in mind that {\it all} of the threads execute this code, but
we've set things up with the variable {\bf me} so that different threads
will work on different vertices.  This is due to the OpenMP call

\begin{Verbatim}[fontsize=\relsize{-2}]
          me = omp_get_thread_num();  
\end{Verbatim}

which sets {\bf me} to the thread number for this thread.

\subsection{Scope Issues}

Note carefully that in

\begin{Verbatim}[fontsize=\relsize{-2}]
   #pragma omp parallel  
   {  int startv,endv,  // start, end vertices for this thread
          step,  // whole procedure goes nv steps
          mymv,  // vertex which attains that value
          me = omp_get_thread_num();  
\end{Verbatim} 

the pragma comes {\it before} the declaration of the local variables.
That means that all of them are ``local'' to each thread, i.e. not
shared by them.  But if a work sharing directive comes within a function
but {\it after} declaration of local variables, those variables are
actually ``global'' to the code in the directive, i.e. they {\it are} 
shared in common among the threads.  

This is the default, but you can change these properties, e.g. using the
{\bf private} keyword and its cousins.  For instance, 

\begin{Verbatim}[fontsize=\relsize{-2}]
#pragma omp parallel private(x,y)
\end{Verbatim}

would make {\bf x} and {\bf y} nonshared even if they were declared
above the directive line.  You may wish to modify that a bit, so that
{\bf x} and {\bf y} have initial values that were shared before the
directive; use {\bf firstprivate} for this.

It is crucial to keep in mind that variables which are global to the
program (in the C/C++ sense) are automatically global to all threads.
This is the primary means by which the threads communicate with each
other.

\subsection{The OpenMP {\tt single} Pragma} 

In some cases we want just one thread to execute some code, even though
that code is part of a {\bf parallel} or other {\bf work sharing}
block.\footnote{This is an OpenMP term.  The {\bf for} directive is another
example of it.  More on this below.}  We use the {\bf single} directive
to do this, e.g.:

\begin{Verbatim}[fontsize=\relsize{-2}]
      #pragma omp single   
      {  nth = omp_get_num_threads();  
         if (nv % nth != 0) {
            printf("nv must be divisible by nth\n");
            exit(1);
         }
         chunk = nv/nth;  
         printf("there are %d threads\n",nth);  }
\end{Verbatim}

Since the variables {\bf nth} and {\bf chunk} are global and thus
shared, we need not have all threads set them, hence our use of {\bf
single}.

\subsection{The OpenMP {\tt barrier} Pragma} 

As see in the example above, the {\bf barrier} implements a standard
barrier, applying to all threads.

\subsection{Implicit Barriers}

Note that there is an implicit barrier at the end of each {\bf single}
block, which is also the case for {\bf parallel}, {\bf for}, and {\bf
sections} blocks.  This can be overridden via the {\bf nowait} clause,
e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
#pragma omp for nowait
\end{Verbatim}

Needless to say, the latter should be used with care, and in most cases
will not be usable.  On the other hand, putting in a barrier where it is
not needed would severely reduce performance.

\subsection{The OpenMP {\tt critical} Pragma}

The last construct used in this example is {\bf critical}, for critical
sections.  

\begin{Verbatim}[fontsize=\relsize{-2}]
         #pragma omp critical
         {  if (mymd < md)
              {  md = mymd; mv = mymv;  }
         }
\end{Verbatim}

It means what it says, allowing entry of only one thread at a time while
others wait.  Here we are updating global variables {\bf md} and {\bf
mv}, which has to be done atomically, and {\bf critical} takes care of
that for us.  This is much more convenient than setting up lock
variables, etc., which we would do if we were programming threads code
directly.

\section{The OpenMP {\tt for} Pragma}

This one breaks up a C/C++ {\bf for} loop, assigning various iterations
to various threads.  (The threads, of course, must have already been set
up via the {\bf omp parallel} pragma.) This way the iterations are done
in parallel.  Of course, that means that they need to be independent
iterations, i.e. one iteration cannot depend on the result of another.

\subsection{Example:  Dijkstra with Parallel for Loops}
\label{basic}

Here's how we could use this construct in the Dijkstra program :

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
// Dijkstra.c

// OpenMP example program (OMPi version):  Dijkstra shortest-path finder
// in a bidirectional graph; finds the shortest path from vertex 0 to
// all others

// usage:  dijkstra nv print

// where nv is the size of the graph, and print is 1 if graph and min
// distances are to be printed out, 0 otherwise

#include <omp.h>

// global variables, shared by all threads by default

int nv,  // number of vertices
    *notdone, // vertices not checked yet
    nth,  // number of threads
    chunk,  // number of vertices handled by each thread
    md,  // current min over all threads
    mv,  // vertex which achieves that min
    largeint = -1;  // max possible unsigned int

unsigned *ohd,  // 1-hop distances between vertices; "ohd[i][j]" is
                // ohd[i*nv+j]
         *mind;  // min distances found so far

void init(int ac, char **av)
{  int i,j,tmp;
   nv = atoi(av[1]);
   ohd = malloc(nv*nv*sizeof(int));
   mind = malloc(nv*sizeof(int));
   notdone = malloc(nv*sizeof(int)); 
   // random graph
   for (i = 0; i < nv; i++)  
      for (j = i; j < nv; j++)  {
         if (j == i) ohd[i*nv+i] = 0;
         else  {
            ohd[nv*i+j] = rand() % 20;
            ohd[nv*j+i] = ohd[nv*i+j];
         }
      }
   for (i = 1; i < nv; i++)  {
      notdone[i] = 1;
      mind[i] = ohd[i];
   }
}

void dowork()
{  
   #pragma omp parallel  
   {  int step,  // whole procedure goes nv steps
          mymv,  // vertex which attains that value
          me = omp_get_thread_num(),
          i;  
      unsigned mymd;  // min value found by this thread
      #pragma omp single   
      {  nth = omp_get_num_threads();  
         printf("there are %d threads\n",nth);  }
      for (step = 0; step < nv; step++)  {
         // find closest vertex to 0 among notdone; each thread finds
         // closest in its group, then we find overall closest
         #pragma omp single 
         {  md = largeint; mv = 0;  }
         mymd = largeint; 
         #pragma omp for nowait
         for (i = 1; i < nv; i++)  {
            if (notdone[i] && mind[i] < mymd)  {
               mymd = ohd[i];
               mymv = i;
            }
         }
         // update overall min if mine is smaller
         #pragma omp critical  
         {  if (mymd < md)  
              {  md = mymd; mv = mymv;  }
         }
         // mark new vertex as done 
         #pragma omp single 
         {  notdone[mv] = 0;  }
         // now update ohd
         #pragma omp for
         for (i = 1; i < nv; i++)  
            if (mind[mv] + ohd[mv*nv+i] < mind[i])  
               mind[i] = mind[mv] + ohd[mv*nv+i];
      }
   }
}

int main(int argc, char **argv)
{  int i,j,print;
   init(argc,argv);
   // parallel
   dowork();  
   // back to single thread
   print = atoi(argv[2]);
   if (print)  {
      printf("graph weights:\n");
      for (i = 0; i < nv; i++)  {
         for (j = 0; j < nv; j++)  
            printf("%u  ",ohd[nv*i+j]);
         printf("\n");
      }
      printf("minimum distances:\n");
      for (i = 1; i < nv; i++)
         printf("%u\n",mind[i]);
   }
}

\end{Verbatim}

The work which used to be done in the function {\bf findmymin()} is now
done here:

\begin{Verbatim}[fontsize=\relsize{-2}]
         #pragma omp for
         for (i = 1; i < nv; i++)  {
            if (notdone[i] && mind[i] < mymd)  {
               mymd = ohd[i];
               mymv = i;
            }
         }
\end{Verbatim}

Each thread executes one or more of the iterations, i.e. takes
responsibility for one or more values of {\it i}.  This occurs in
parallel, so as mentioned earlier, the programmer must make sure that
the iterations are independent; there is no predicting which threads
will do which values of {\bf i}, in which order.  By the way, for
obvious reasons OpenMP treats the loop index, {\bf i} here, as private
even if by context it would be shared.

\subsection{Nested Loops}

If we use the {\bf for} pragma to nested loops, by default the pragma
applies only to the outer loop.  We can of course insert another {\bf
for} pragma inside, to parallelize the inner loop.

Or, starting with OpenMP version 3.0, one can use the {\bf collapse}
clause, e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
#pragma omp parallel for collapse(2)
\end{Verbatim}

to specify two levels of nesting in the assignment of threads to tasks.

\subsection{Controlling the Partitioning of Work to Threads: the
schedule Clause}
\label{schedulework}

In this default version of the {\bf for} construct, iterations are
executed by threads {\it in unpredictable order}; the OpenMP standard
does not specify which threads will execute which iterations in which
order.  But this can be controlled by the programmer, using the {\bf
schedule} clause.  OpenMP provides three choices for this:

\begin{itemize}

\item {\bf static:}  The iterations are grouped into chunks, and
assigned to threads in round-robin fashion.  Default chunk size is
approximately the number of iterations divided by the number of threads.

\item {\bf dynamic:}  Again the iterations are grouped into chunks, but
here the assignment of chunks to threads is done dynamically.  When a
thread finishes working on a chunk, it asks the OpenMP runtime system to
assign it the next chunk in the queue.  Default chunk size is 1.

\item {\bf guided:}  Similar to dynamic, but with the chunk size
decreasing as execution proceeds.

\end{itemize}

For instance, our original version of our program in Section
\ref{running} broke the work into chunks, with chunk size being the
number vertices divided by the number of threads.  

For the Dijkstra algorithm, for instance, we could get the same
operation with less code by asking OpenMP to do the chunking for us, say
with a chunk size of 8:

\begin{Verbatim}[fontsize=\relsize{-2}]
...
         #pragma omp for schedule(static)
         for (i = 1; i < nv; i++)  {
            if (notdone[i] && mind[i] < mymd)  {
               mymd = ohd[i];
               mymv = i;
            }
         }
...
         #pragma omp for schedule(static)
         for (i = 1; i < nv; i++)
            if (mind[mv] + ohd[mv*nv+i] < mind[i])
               mind[i] = mind[mv] + ohd[mv*nv+i];
...
\end{Verbatim}

Note again that this would have the same effect as our original code,
which each thread handling one chunk of contiguous iterations within a
loop.  So it's just a programming convenience for us in this case.  (If
the number of threads doesn't evenly divide the number of iterations,
OpenMP will fix that up for us too.)

The more general form is

\begin{Verbatim}[fontsize=\relsize{-2}]
#pragma omp for schedule(static,chunk)
\end{Verbatim}

Here {\bf static} is still a keyword but {\bf chunk} is an actual
argument.  However, setting the chunk size in the {\bf schedule()}
clause is a {\it compile-time} operation.  If you wish to have the chunk
size set at run time, call {\bf omp\_set\_schedule()} in conjunction
with the {\bf runtime} clause.  Example:

\begin{lstlisting}[numbers=left]
int main(int argc, char **argv)
{   
   ...
   n = atoi(argv[1]);
   int chunk = atoi(argv[2]);
   omp_set_schedule(omp_sched_static, chunk);
   #pragma omp parallel  
   {  
      ...
      #pragma omp for schedule(runtime)
      for (i = 1; i < n; i++)  {
         ...
      }
      ...
   }
}
\end{lstlisting}


Or set the {\bf OMP\_SCHEDULE} environment variable.

The syntax is the same for {\bf dynamic} and {\bf guided}.

As discussed in Section \ref{methodabest}, on the one hand, large chunks
are good, due to there being less overhead---every time a thread
finishes a chunk, it must go through the critical section, which
serializes our parallel program and thus slows things down.  On the
other hand, if chunk sizes are large, then toward the end of the work,
some threads may be working on their last chunks while others have
finished and are now idle, thus foregoing potential speed enhancement.
So it would be nice to have large chunks at the beginning of the run, to
reduce the overhead, but smaller chunks at the end.  This can be done
using the {\bf guided} clause.

For the Dijkstra algorithm, for instance, we could have this:

\begin{Verbatim}[fontsize=\relsize{-2}]
...
         #pragma omp for schedule(guided)
         for (i = 1; i < nv; i++)  {
            if (notdone[i] && mind[i] < mymd)  {
               mymd = ohd[i];
               mymv = i;
            }
         }
...
         #pragma omp for schedule(guided)
         for (i = 1; i < nv; i++)
            if (mind[mv] + ohd[mv*nv+i] < mind[i])
               mind[i] = mind[mv] + ohd[mv*nv+i];
...
\end{Verbatim}

There are other variations of this available in OpenMP.  However, in
Section \ref{methodabest}, I showed that these would seldom be
necessary or desirable; having each thread handle a single chunk would
be best.

See Section \ref{methodabest} for a timing example.

\begin{lstlisting}[numbers=left]
setenv OMP_SCHEDULE "static,20"
\end{lstlisting}

\subsection{The OpenMP {\tt reduction} Clause}

The name of this OpenMP clause alludes to the term {\bf reduction} in
functional programming.  Many parallel programming languages include
such operations, to enable the programmer to more conveniently (and
often more efficiently) have threads/processors cooperate in computing
sums, products, etc.  OpenMP does this via the {\bf reduction} clause.  

For example, consider

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
int z;
...
#pragma omp for reduction(+:z)
for (i = 0; i < n; i++)  z += x[i];
\end{Verbatim}

The pragma says that the threads will share the work as in our previous
discussion of the {\bf for} pragma.  In addition, though, there will be
independent copies of {\bf z} maintained for each thread, each
initialized to 0 before the loop begins.  When the loop is entirely
done, the values of {\bf z} from the various threads will be summed, of
course in an atomic manner.  

Note that the {\bf +} operator not only indicates that the values of {\bf z}
are to be summed, but also that their initial values are to be 0.  If
the operator were {\bf *}, say, then the product of the values would be
computed, and their initial values would be 1.  

One can specify several reduction variables to the right of the colon,
separated by commas.  

Our use of the {\bf reduction} clause here makes our programming much
easier.  Indeed, if we had old serial code that we wanted to parallelize,
we would have to make no change to it!  OpenMP is taking care of both
the work splitting across values of {\bf i}, and the atomic operations.
Moreover---note this carefully---it is efficient, because by maintaining
separate copies of {\bf z} until the loop is done, we are reducing the
number of serializing atomic actions, and are avoiding time-costly cache
coherency transactions and the like.

Without this construct, we would have to do

\begin{Verbatim}[fontsize=\relsize{-2}]
int z,myz=0;
...
#pragma omp for private(myz)
for (i = 0; i < n; i++)  myz += x[i];
#pragma omp critical
{ z += myz; }
\end{Verbatim}

Here are the eligible operators and the corresponding initial values:

In C/C++, you can use {\bf reduction} with +, -, *, \&, $|$, \&\& and
$||$ (and the exclusive-or operator).  

\begin{tabular}{|l|l|}
\hline
operator & initial value \\ \hline 
\hline
+ & 0 \\ \hline 
- & 0 \\ \hline 
* & 1 \\ \hline 
\& & bit string of 1s \\ \hline 
$|$ & bit string of 0s \\ \hline 
\verb1^1 & 0 \\ \hline 
\&\& & 1 \\ \hline 
$||$ & 0 \\ \hline 
\end{tabular}

The lack of other operations typically found in other parallel
programming languages, such as min and max, is due to the lack of these
operators in C/C++.  The FORTRAN version of OpenMP does have min and
max.\footnote{Note, though, that plain min and max would not help in our
Dijkstra example above, as we not only need to find the minimum value,
but also need the vertex which attains that value.}

Note that the reduction variables must be shared by the threads, and
apparently the only acceptable way to do so in this case is to declare
them as global variables.

A reduction variable must be scalar, in C/C++.  It can be an array in
FORTRAN.

\section{Example: Mandelbrot Set}
\label{mandelbrotcode}

Here's the code for the timings in Section \ref{mandelbrottiming}:

\begin{lstlisting}[numbers=left]
// compile with -D, e.g.
//
//    gcc -fopenmp -o manddyn Gove.c -DDYNAMIC
//
// to get the version that uses dynamic scheduling

#include <omp.h>
#include <complex.h>

#include <time.h>
float timediff(struct timespec t1, struct timespec t2)
{  if (t1.tv_nsec > t2.tv_nsec) {
         t2.tv_sec -= 1;
         t2.tv_nsec += 1000000000;
   }
   return t2.tv_sec-t1.tv_sec + 0.000000001 * (t2.tv_nsec-t1.tv_nsec);
}

#ifdef RC
// finds chunk among 0,...,n-1 to assign to thread number me among nth
// threads
void findmyrange(int n,int nth,int me,int *myrange)
{  int chunksize = n / nth;
   myrange[0] = me * chunksize;
   if (me < nth-1) myrange[1] = (me+1) * chunksize - 1;
   else myrange[1] = n - 1;
}

#include <stdlib.h>
#include <stdio.h>
// from http://www.cis.temple.edu/~ingargio/cis71/code/randompermute.c
// It returns a random permutation of 0..n-1
int * rpermute(int n) {
  int *a = (int *)(int *)  malloc(n*sizeof(int));
  // int *a = malloc(n*sizeof(int));
  int k;
  for (k = 0; k < n; k++)
    a[k] = k;
  for (k = n-1; k > 0; k--) {
    int j = rand() % (k+1);
    int temp = a[j];
    a[j] = a[k];
    a[k] = temp;
  }
  return a;
}
#endif

#define MAXITERS 1000

// globals
int count = 0;
int nptsside;
float side2;
float side4;

int inset(double complex c) {
   int iters;
   float rl,im;
   double complex z = c;
   for (iters = 0; iters < MAXITERS; iters++) { 
      z = z*z + c;
      rl = creal(z);
      im = cimag(z);
      if (rl*rl + im*im > 4) return 0;
   }
   return 1;
}

int *scram;

void dowork()
{  
   #ifdef RC
   #pragma omp parallel reduction(+:count)
   #else 
   #pragma omp parallel 
   #endif
   {  
      int x,y; float xv,yv;
      double complex z;
      #ifdef STATIC
      #pragma omp for reduction(+:count) schedule(static)
      #elif defined DYNAMIC
      #pragma omp for reduction(+:count) schedule(dynamic)
      #elif defined GUIDED
      #pragma omp for reduction(+:count) schedule(guided)
      #endif
      #ifdef RC
      int myrange[2];
      int me = omp_get_thread_num();
      int nth = omp_get_num_threads();
      int i;
      findmyrange(nptsside,nth,me,myrange);
      for (i = myrange[0]; i <= myrange[1]; i++) {
         x = scram[i];
      #else
      for (x=0; x<nptsside; x++) {  
      #endif
         for ( y=0; y<nptsside; y++)  {
            xv = (x - side2) / side4;
            yv = (y - side2) / side4;
            z = xv + yv*I;
            if (inset(z)) {
               count++;
            }
         }
      }
   }
}

int main(int argc, char **argv)
{
   nptsside = atoi(argv[1]);
   side2 = nptsside / 2.0;
   side4 = nptsside / 4.0;

   struct timespec bgn,nd;
   clock_gettime(CLOCK_REALTIME, &bgn);
   
   #ifdef RC
   scram = rpermute(nptsside);
   #endif

   dowork();
 
   // implied barrier
   printf("%d\n",count);
   clock_gettime(CLOCK_REALTIME, &nd);
   printf("%f\n",timediff(bgn,nd));
}
\end{lstlisting}

The code is similar to that of a number of books and Web sites, such as
the Gove book cited in Section \ref{loadbalance}.  Here RC is the random
chunk method discussed in Section \ref{methodabest}.

\section{The Task Directive}
\label{taskdir}

This is new to OpenMP 3.0.  The basic idea is to set up a task queue:
When a thread encounters a {\bf task} directive, it arranges for some
thread to execute the associated block---at some time.  The first thread
can continue.  Note that the task might not execute right away; it may
have to wait for some thread to become free after finishing another
task.  Also, there may be more tasks than threads, also causing some
threads to wait.

Note that we could arrange for all this ourselves, without {\bf task}.
We'd set up our own work queue, as a shared variable, and write our code
so that whenever a thread finished a unit of work, it would delete the
head of the queue.  Whenever a thread generated a unit of work, it would
add it to the que.  Of course, the deletion and addition would have to
be done atomically.  All this would amount to a lot of coding on our
part, so {\bf task} really simplifies the programming.  

\subsection{Example:  Quicksort}

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
// OpenMP example program:  quicksort; not necessarily efficient

void swap(int *yi, int *yj) 
{  int tmp = *yi;
   *yi = *yj;
   *yj = tmp;
}

int *separate(int *x, int low, int high)
{  int i,pivot,last;
   pivot = x[low];  // would be better to take, e.g., median of 1st 3 elts
   swap(x+low,x+high);
   last = low;
   for (i = low; i < high; i++) {
      if (x[i] <= pivot) {
         swap(x+last,x+i);
         last += 1;
      }
   }
   swap(x+last,x+high);
   return last;
}

// quicksort of the array z, elements zstart through zend; set the
// latter to 0 and m-1 in first call, where m is the length of z;
// firstcall is 1 or 0, according to whether this is the first of the
// recursive calls
void qs(int *z, int zstart, int zend, int firstcall)
{  
   #pragma omp parallel
   {  int part;
      if (firstcall == 1) {
         #pragma omp single nowait
         qs(z,0,zend,0);
      } else {
         if (zstart < zend) {
            part = separate(z,zstart,zend);
            #pragma omp task
            qs(z,zstart,part-1,0);
            #pragma omp task
            qs(z,part+1,zend,0);
         }

      }
   }
}

// test code
main(int argc, char**argv)
{  int i,n,*w;
   n = atoi(argv[1]);
   w = malloc(n*sizeof(int));
   for (i = 0; i < n; i++) w[i] = rand();
   qs(w,0,n-1,1);
   if (n < 25) 
      for (i = 0; i < n; i++) printf("%d\n",w[i]);
}
\end{Verbatim}

The code

\begin{Verbatim}[fontsize=\relsize{-2}]
if (firstcall == 1) {
   #pragma omp single nowait
   qs(z,0,zend,0);
\end{Verbatim}

gets things going.  We want only one thread to execute the root of the
recursion tree, hence the need for the {\bf single} clause.  After that,
the code

\begin{Verbatim}[fontsize=\relsize{-2}]
part = separate(z,zstart,zend);
#pragma omp task
qs(z,zstart,part-1,0);
\end{Verbatim}

sets up a call to a subtree, with the {\bf task} directive stating,
``OMP system, please make sure that this subtree is handled by some
thread.''

There are various refinements, such as the barrier-like {\bf taskwait}
clause.

\section{Other OpenMP Synchronization Issues}

Earlier we saw the {\bf critical} and {\bf barrier} constructs.  There
is more to discuss, which we do here.

\subsection{The OpenMP {\tt atomic} Clause}

The {\bf critical} construct not only serializes your program, but also
it adds a lot of overhead.  If your critical section involves just a
one-statement update to a shared variable, e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
x += y;
\end{Verbatim}

etc., then the OpenMP compiler can take advantage of an atomic hardware
instruction, e.g. the LOCK prefix on Intel, to set up an extremely efficient
critical section, e.g.

\begin{Verbatim}[fontsize=\relsize{-2}]
#pragma omp atomic
x += y;
\end{Verbatim}

Since it is a single statement rather than a block, there are no
braces.

The eligible operators are: 

\begin{Verbatim}[fontsize=\relsize{-2}]
++, --, +=, *=, <<=, &=, |=
\end{Verbatim}

\subsection{Memory Consistency and the {\tt flush} Pragma}

Consider a shared-memory multiprocessor system with coherent caches, and
a shared, i.e. global, variable {\bf x}.  If one thread writes to {\bf
x}, you might think that the cache coherency system will ensure that the
new value is visible to other threads.  But as discussed in Section
\ref{consistency}, it is is not quite so simple as this.  

For example, the compiler may store {\bf x} in a register, and update
{\bf x} itself at certain points.  In between such updates, since the
memory location for {\bf x} is not written to, the cache will be unaware
of the new value, which thus will not be visible to other threads.  If
the processors have write buffers etc., the same problem occurs.  

In other words, we must account for the fact that our program could be
run on different kinds of hardware with different memory consistency
models.  Thus OpenMP must have its own memory consistency model, which
is then translated by the compiler to mesh with the hardware.

OpenMP takes a {\bf relaxed consistency} approach, meaning that it
forces updates to memory (``flushes'') at all synchronization points,
i.e. at:

\begin{itemize}
\item {\bf barrier}
\item entry/exit to/from {\bf critical}                                       
\item entry/exit to/from {\bf ordered}                                       
\item entry/exit to/from {\bf parallel}                                       
\item exit from {\bf parallel for}
\item exit from {\bf parallel sections}
\item exit from {\bf single}
\end{itemize}

In between synchronization points, one can force an update to {\bf x}
via the {\bf flush} pragma:

\begin{Verbatim}[fontsize=\relsize{-2}]
#pragma omp flush (x)    
\end{Verbatim}

The flush operation is obviously architecture-dependent.  OpenMP
compilers will typically have the proper machine instructions available
for some common architectures.  For the rest, it can force a flush at
the hardware level by doing lock/unlock operations, though this may be
costly in terms of time.

\section{Combining Work-Sharing Constructs}

In our examples of the {\bf for} pragma above, that pragma would come
within a block headed by a {\bf parallel} pragma.  The latter specifies
that a team of theads is to be created, with each one executing the
given block, while the former specifies that the various iterations of
the loop are to be distributed among the threads.  As a shortcut, we can
combine the two pragmas:

\begin{Verbatim}[fontsize=\relsize{-2}]
#pragma omp parallel for
\end{Verbatim}

This also works with the {\bf sections} pragma.

\section{The Rest of OpenMP}

There is much, much more to OpenMP than what we have seen here.  To see
the details, there are many Web pages you can check, and there is also
the excellent book, {\it Using OpenMP:  Portable Shared Memory Parallel
Programming}, by Barbara Chapman, Gabriele Jost and Ruud Van Der Pas,
MIT Press, 2008.  The book by Gove cited in Section \ref{loadbalance}
also includes coverage of OpenMP.

\section{Compiling, Running and Debugging OpenMP Code}

\subsection{Compiling} 

There are a number of open source compilers available for OpenMP,
including:

\begin{itemize}

\item Omni:  This is available at (\url{http://phase.hpcc.jp/Omni/}).
To compile an OpenMP program in {\bf x.c} and create an executable file
{\bf x}, run

\begin{Verbatim}[fontsize=\relsize{-2}]
omcc -g -o x x.c
\end{Verbatim}

Note:  Apparently declarations of local variables cannot be made in the
midst of code; they must precede all code within a block.

\item Ompi:  You can download this at
\url{http://www.cs.uoi.gr/~ompi/index.html}.  Compile {\bf x.c} by

\begin{Verbatim}[fontsize=\relsize{-2}]
ompicc -g -o x x.c
\end{Verbatim}

\item GCC, version 4.2 or later:\footnote{You may find certain
subversions of GCC 4.1 can be used too.}  Compile {\bf x.c} via

\begin{Verbatim}[fontsize=\relsize{-2}]
gcc -fopenmp -g -o x x.c
\end{Verbatim}

You can also use {\bf -lgomp} instead of {\bf -fopenmp}.

\end{itemize}

\subsection{Running}

Just run the executable as usual.  

The number of threads will be the number of processors, by default.  To
change that value, set the OMP\_NUM\_THREADS environment variable.  For
example, to get four threads in the C shell, type

\begin{Verbatim}[fontsize=\relsize{-2}]
setenv OMP_NUM_THREADS 4
\end{Verbatim}

\subsection{Debugging}

Since OpenMP is essentially just an interface to threads, your debugging
tool's threads facilities should serve you well.  See Section
\ref{debugthreads} for the GDB case.

A possible problem, though, is that OpenMP's use of pragmas makes it
difficult for the compilers to maintain your original source code line
numbers, and your function and variable names.  But with a little care,
a symbolic debugger such as GDB can still be used.  Here are some tips
for the compilers mentioned above, using GDB as our example debugging
tool:

\begin{itemize}

\item GCC:  GCC maintains line numbers and names well.  In earlier
versions, it had a problem in that it did not not retain names of local
variables within blocks controlled by {\bf omp parallel} at all.  
That problem was fixed in version 4.4 of the GCC suite, but seems to
have slipped back in with some later versions!

\item Omni:  The function {\bf main()} in your executable is actually in
the OpenMP library, and your function {\bf main()} is renamed {\bf
\_ompc\_main()}.  So, when you enter GDB, first set a breakpoint at your
own code:

\begin{Verbatim}[fontsize=\relsize{-2}]
(gdb) b _ompc_main
\end{Verbatim}

Then run your program to this breakpoint, and set whatever other
breakpoints you want.

You should find that your other variable and function names are
unchanged. 

\item Ompi: Older versions also changed your function names, but the
current version (1.2.0) doesn't.  Works fine in GDB.

\end{itemize}

\section{Performance}

As is usually the case with parallel programming, merely parallelizing a
program won't necessarily make it faster, even on shared-memory
hardware.  Operations such as critical sections, barriers and so on
serialize an otherwise-parallel program, sapping much of its speed.
In addition, there are issues of cache coherency transactions, false
sharing etc.

\subsection{The Effect of Problem Size}

To illustrate this, I ran our original Dijkstra example (Section
\ref{running} on various graph sizes, on a quad core machine.  Here are
the timings:

\begin{tabular}{|l|l|l|}
\hline
nv & nth & time \\ \hline 
\hline
1000 & 1 & 0.005472 \\ \hline 
1000 & 2 & 0.011143 \\ \hline 
1000 & 4 & 0.029574 \\ \hline 
\end{tabular}

The more parallelism we had, the {\it slower} the program ran!  The
synchronization overhead was just too much to be compensated by the
parallel computation.

However, parallelization did bring benefits on larger problems:

\begin{tabular}{|l|l|l|}
\hline
nv & nth & time \\ \hline 
\hline
25000 & 1 & 2.861814 \\ \hline 
25000 & 2 & 1.710665 \\ \hline 
25000 & 4 & 1.453052 \\ \hline 
\end{tabular}

\subsection{Some Fine Tuning}

How could we make our Dijkstra code faster?  One idea would be to
eliminate the critical section.  Recall that in each iteration, the
threads compute their local minimum distance values {\bf md} and {\bf
mv}, and then update the global values {\bf md} and {\bf mv}.  Since the
update must be atomic, this causes some serialization of the program.
Instead, we could have the threads store their values {\bf mymd} and
{\bf mymv} in a global array {\bf mymins}, with each thread using a
separate pair of locations within that array, and then at the end of the
iteration we could have just one task scan through {\bf mymins} and
update {\bf md} and {\bf mv}.   

Here is the resulting code:

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
// Dijkstra.c

// OpenMP example program:  Dijkstra shortest-path finder in a
// bidirectional graph; finds the shortest path from vertex 0 to all
// others

// **** in this version, instead of having a critical section in which
// each thread updates md and mv, the threads record their mymd and mymv
// values in a global array mymins, which one thread then later uses to
// update md and mv

// usage:  dijkstra nv print

// where nv is the size of the graph, and print is 1 if graph and min
// distances are to be printed out, 0 otherwise

#include <omp.h>

// global variables, shared by all threads by default

int nv,  // number of vertices
    *notdone, // vertices not checked yet
    nth,  // number of threads
    chunk,  // number of vertices handled by each thread
    md,  // current min over all threads
    mv,  // vertex which achieves that min
    largeint = -1;  // max possible unsigned int

int *mymins;  // (mymd,mymv) for each thread; see dowork()

unsigned *ohd,  // 1-hop distances between vertices; "ohd[i][j]" is
         // ohd[i*nv+j]
         *mind;  // min distances found so far

void init(int ac, char **av)
{  int i,j,tmp;
   nv = atoi(av[1]);
   ohd = malloc(nv*nv*sizeof(int));
   mind = malloc(nv*sizeof(int));
   notdone = malloc(nv*sizeof(int)); 
   // random graph
   for (i = 0; i < nv; i++)  
      for (j = i; j < nv; j++)  {
         if (j == i) ohd[i*nv+i] = 0;
         else  {
            ohd[nv*i+j] = rand() % 20;
            ohd[nv*j+i] = ohd[nv*i+j];
         }
      }
   for (i = 1; i < nv; i++)  {
      notdone[i] = 1;
      mind[i] = ohd[i];
   }
}

// finds closest to 0 among notdone, among s through e
void findmymin(int s, int e, unsigned *d, int *v)
{  int i;
   *d = largeint; 
   for (i = s; i <= e; i++)
      if (notdone[i] && mind[i] < *d)  {
         *d = ohd[i];
         *v = i;
      }
}

// for each i in [s,e], ask whether a shorter path to i exists, through
// mv
void updatemind(int s, int e)
{  int i;
   for (i = s; i <= e; i++)
      if (mind[mv] + ohd[mv*nv+i] < mind[i])  
         mind[i] = mind[mv] + ohd[mv*nv+i];
}

void dowork()
{  
   #pragma omp parallel 
   {  int startv,endv,  // start, end vertices for my thread
          step,  // whole procedure goes nv steps
          me,
          mymv;  // vertex which attains the min value in my chunk
          unsigned mymd;  // min value found by this thread
      int i;
      me = omp_get_thread_num();  
      #pragma omp single   
      {  nth = omp_get_num_threads();  
         if (nv % nth != 0) {
            printf("nv must be divisible by nth\n");
            exit(1);
         }
         chunk = nv/nth;  
         mymins = malloc(2*nth*sizeof(int)); 
      }
      startv = me * chunk; 
      endv = startv + chunk - 1;
      for (step = 0; step < nv; step++)  {
         // find closest vertex to 0 among notdone; each thread finds
         // closest in its group, then we find overall closest
         findmymin(startv,endv,&mymd,&mymv);
         mymins[2*me] = mymd;
         mymins[2*me+1] = mymv;
         #pragma omp barrier 
         // mark new vertex as done 
         #pragma omp single 
         {  md = largeint; mv = 0;  
            for (i = 1; i < nth; i++) 
               if (mymins[2*i] < md) {
                  md = mymins[2*i];
                  mv = mymins[2*i+1];
               }
            notdone[mv] = 0;  
         }
         // now update my section of mind
         updatemind(startv,endv);
         #pragma omp barrier 
      }
   }
}

int main(int argc, char **argv)
{  int i,j,print;
   double startime,endtime;
   init(argc,argv);
   startime = omp_get_wtime();
   // parallel
   dowork();  
   // back to single thread
   endtime = omp_get_wtime();
   printf("elapsed time:  %f\n",endtime-startime);
   print = atoi(argv[2]);
   if (print)  {
      printf("graph weights:\n");
      for (i = 0; i < nv; i++)  {
         for (j = 0; j < nv; j++)  
            printf("%u  ",ohd[nv*i+j]);
         printf("\n");
      }
      printf("minimum distances:\n");
      for (i = 1; i < nv; i++)
         printf("%u\n",mind[i]);
   }
}
\end{Verbatim}

Let's take a look at the latter part of the code for one iteration;

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
         findmymin(startv,endv,&mymd,&mymv);
         mymins[2*me] = mymd;
         mymins[2*me+1] = mymv;
         #pragma omp barrier 
         // mark new vertex as done 
         #pragma omp single 
         {  notdone[mv] = 0;  
            for (i = 1; i < nth; i++) 
               if (mymins[2*i] < md) {
                  md = mymins[2*i];
                  mv = mymins[2*i+1];
               }
         }
         // now update my section of mind
         updatemind(startv,endv);
         #pragma omp barrier 
\end{Verbatim}

The call to {\bf findmymin()} is as before; this thread finds the
closest vertex to 0 among this thread's range of vertices.  But instead
of comparing the result to {\bf md} and possibly updating it and {\bf
mv}, the thread simply stores its {\bf mymd} and {\bf mymv} in the
global array {\bf mymins}.  After all threads have done this and then
waited at the barrier, we have just one thread update {\bf md} and {\bf
mv}.

Let's see how well this tack worked:

\begin{tabular}{|l|l|l|}
\hline
nv & nth & time \\ \hline 
\hline
25000 & 1 & 2.546335 \\ \hline 
25000 & 2 & 1.449387 \\ \hline 
25000 & 4 & 1.411387 \\ \hline 
\end{tabular}

This brought us about a 15\% speedup in the two-thread case, though less
for four threads.

What else could we do?  Here are a few ideas:

\begin{itemize}

\item False sharing could be a problem here.  To address it, 
we could make {\bf mymins} much longer, changing the places at
which the threads write their data, leaving most of the array as
padding.

\item We could try the modification of our program in Section
\ref{basic}, in which we use the OpenMP {\bf for} pragma, as well as the
refinements stated there, such as {\bf schedule}.

\item We could try combining all of the ideas here.

\end{itemize}

\subsection{OpenMP Internals}

We may be able to write faster code if we know a bit about how OpenMP
works inside.

You can get some idea of this from your compiler.  For example, if you
use the {\bf -t} option with the Omni compiler, or {\bf -k} with Ompi,
you can inspect the result of the preprocessing of the OpenMP pragmas.

Here for instance is the code produced by Omni from the call to {\bf
findmymin()} in our Dijkstra program:

\begin{Verbatim}[fontsize=\relsize{-2}]
# 93 "Dijkstra.c"
findmymin(startv,endv,&(mymd),&(mymv));{
_ompc_enter_critical(&__ompc_lock_critical);
# 96 "Dijkstra.c"
if((mymd)<(((unsigned )(md)))){

# 97 "Dijkstra.c"
(md)=(((int )(mymd)));
# 97 "Dijkstra.c"
(mv)=(mymv);
}_ompc_exit_critical(&__ompc_lock_critical);
\end{Verbatim}

Fortunately Omni saves the line numbers from our original source file,
but the pragmas have been replaced by calls to OpenMP library functions.

With Ompi, while preprocessing of your file {\bf x.c}, the compiler
produces an intermediate file {\bf x\_ompi.c}, and the latter is what is
actually compiled.  Your function {\bf main} is renamed to {\bf
\_ompi\_originalMain()}.  Your other functions and variables are
renamed.  For example in our Dijkstra code, the function {\bf dowork()}
is renamed to {\bf dowork\_parallel\_0}.  And by the way, all indenting
is lost!  So it's a bit hard to read, but can be very instructive.

The document, {\it The GNU OpenMP Implementation},
\url{http://pl.postech.ac.kr/~gla/cs700-07f/ref/openMp/libgomp.pdf},
includes good outline of how the pragmas are translated.

\section{Example:  Root Finding}

The application is described in the comments, but here are a couple of
things to look for in particular:

\begin{itemize}

\item The variables {\bf curra} and {\bf currb} are shared by all the
threads, but due to the nature of the application, no critical sections
are needed.

\item On the other hand, the barrier is essential.  The reader should
ponder what calamities would occur without it.

\end{itemize}

\begin{lstlisting}[numbers=left]
#include<omp.h>
#include<math.h>

// OpenMP example:  root finding

// the function f() is known to be negative
// at a, positive at b, and thus has at
// least one root in (a,b); if there are
// multiple roots, only one is found;
// the procedure runs for niters iterations

// strategy: in each iteration, the current
// interval is split into nth equal parts,
// and each thread checks its subinterval
// for a sign change of f(); if one is
// found, this subinterval becomes the
// new current interval; the current guess
// for the root is the left endpoint of the
// current interval

// of course, this approach is useful in
// parallel only if f() is very expensive
// to evaluate

// for simplicity, assumes that no endpoint
// of a subinterval will ever exactly
// coincide with a root

float root(float(*f)(float),
   float inita, float initb, int niters) {
   float curra = inita;
   float currb = initb;
   #pragma omp parallel
   { 
      int nth = omp_get_num_threads();
      int me = omp_get_thread_num();
      int iter;
      for (iter = 0; iter < niters; iter++) {
         #pragma omp barrier
         float subintwidth =
            (currb - curra) / nth;
         float myleft =
            curra + me * subintwidth;
         float myright = myleft + subintwidth;
         if ((*f)(myleft) < 0 &&
             (*f)(myright) > 0) {
            curra = myleft;
            currb = myright;
         }
      }
   }
   return curra;
}

float testf(float x) {
   return pow(x-2.1,3);
}

int main(int argc, char **argv)
{  printf("%f\n",root(testf,-4.1,4.1,1000));  }
\end{lstlisting}

\section{Example:  Mutual Outlinks}
\label{ompmutlinks}

Consider the example of Section \ref{mutlinks}.  We have
a network graph of some kind, such as Web links.  For any two
vertices, say any two Web sites, we might be interested in mutual
outlinks, i.e. outbound links that are common to two Web sites.  

The OpenMP code below finds the mean number of mutual outlinks, among
all pairs of sites in a set of Web sites.  Note that it uses the method
for load balancing presented in Section \ref{mutlinks}.

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
#include <omp.h>
#include <stdio.h>

// OpenMP example:  finds mean number of mutual outlinks, among all
// pairs of Web sites in our set

int n,  // number of sites (will assume n is even)
    nth,  // number of threads (will assume n/2 divisible by nth)
    *m, // link matrix
    tot = 0; // grand total of matches

// processes row pairs (i,i+1), (i,i+2), ...
int procpairs(int i)
{  int j,k,sum=0;
   for (j = i+1; j < n; j++) {
      for (k = 0; k < n; k++) 
         sum += m[n*i+k] * m[n*j+k];
   }
   return sum;
}

float dowork()
{ 
   #pragma omp parallel
   {  int pn1,pn2,i,mysum=0;
      int me = omp_get_thread_num();
      nth = omp_get_num_threads(); 
      // in checking all (i,j) pairs, partition the work according to i;
      // to get good load balance, this thread me will handle all i that equal
      // me mod nth
      for (i = me; i < n; i += nth) {
         mysum += procpairs(i);
      }
      #pragma omp atomic
      tot += mysum;
      #pragma omp barrier
   }
   int divisor = n * (n-1) / 2;
   return ((float) tot)/divisor;
}

int main(int argc, char **argv)
{   int n2 = n/2,i,j;
    n = atoi(argv[1]);  // number of matrix rows/cols
    int msize = n * n * sizeof(int);  
    m = (int *) malloc(msize);  
    // as a test, fill matrix with random 1s and 0s
    for (i = 0; i < n; i++) {
       m[n*i+i] = 0;
       for (j = 0; j < n; j++) {
          if (j != i) m[i*n+j] = rand() % 2;
       }
    }
    if (n < 10) {
       for (i = 0; i < n; i++) {
          for (j = 0; j < n; j++) printf("%d  ",m[n*i+j]);
          printf("\n");
       }
    }
    tot = 0;
    float meanml = dowork();
    printf("mean = %f\n",meanml);
}
\end{Verbatim}

\section{Locks with OpenMP}

Though one of OpenMP's best virtues is that you can avoid working with
those pesky lock variables needed for straight threads programming,
there are still some instances in which lock variables may be useful.
OpenMP does provide for locks:

\begin{itemize}

\item declare your locks to be of type \lstinline{omp_lock_t}

\item call \lstinline{omp_set_lock()} to lock the lock

\item call \lstinline{omp_unset_lock()} to unlock the lock

\end{itemize}

\section{Other Examples of OpenMP Code in This Book}

There are additional OpenMP examples in later sections of this book,
such as:\footnote{If you are reading this presentation on OpenMP
separately from the book, the book is at
\url{http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf}
}

\begin{itemize}

\item sampling bucket sort, Section \ref{ompbsort} 

\item parallel prefix sum, Section \ref{prefiximps}.

\item matrix multiplication, Section \ref{openmpmatmul}.

\item Jacobi algorithm for solving systems of linear equations, with a
good example of the OpenMP {\bf reduction} clause, Section
\ref{ompjacobi}

\item another implementation of Quicksort, Section \ref{sharedqs}

\end{itemize}



