\documentclass{article}

\setlength{\oddsidemargin}{-0.5in}
\setlength{\evensidemargin}{-0.5in}
\setlength{\topmargin}{0.0in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
\setlength{\textwidth}{7.0in}
\setlength{\textheight}{9.5in}
\setlength{\parindent}{0in}
\setlength{\parskip}{0.05in}
\setlength{\columnseprule}{0.3pt}
\usepackage{fancyvrb}
\usepackage{relsize}
\usepackage{hyperref}

\begin{document}

Name: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

Directions: {\bf Work only on this sheet} (on both sides, if needed); do
not turn in any supplementary sheets of paper. There is actually plenty
of room for your answers, as long as you organize yourself BEFORE
starting writing.

{\bf 1.} (10) Fill in the blank with a Unix command name:  The {\bf mpirun}
command we've been using with MPICH invokes 
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ to get the application program
started on other nodes.

{\bf 2.}  Consider the Dijkstra program in 
Section 3.1.1 of the OpenMP PLN. 

\begin{itemize}

\item [(a)] (10) Which OpenMP pragma line should be deleted in order to improve
performance?

\item [(b)] (10) Why was the {\bf for} in line 61 (with index variable {\bf
step}) not implemented as a parallel {\bf for}?

\item [(c)] (10) Which OpenMP pragma line should be modified in order to improve
performance?  How should it be modified?

\end{itemize}

{\bf 3.} (15) Consider the pthreads prime-finding program in our inroductory
PLN, Section 2.2.2.  Supose we wish to improve performance of this
program via ``chunking,'' which we'll do by adding 10 instead of 2 to
{\bf nextbase} in line 49.  Show how to modify the rest of the code
accordingly.  Your answer should use language like ``Change line 8888
to...'' ``Between lines 3000 and 3001 add...'' etc.

{\bf 4.} (10) Suppose we have a NUMA Omega network with 8 PEs.  Say we
use low-order interleaving, and that consecutive words of memory have
consecutive addresses (as opposed to consecutive addresses differing by
4, as in PCs).  It allows fetch-and-add operations but with no
combining.  We have a variable {\bf z} in our program  which is in
location 164.  Suppose PEs 0, 1 and 7 each do a fetch-and-add operation
on {\bf z}, with operand 6, and the original value of {\bf z} is 20.
There are no other memory transactions in progress at this time.  What
values do the three nodes get back?\footnote{There is an old joke about
the newscaster who said, ``And here are today's baseball scores: 6-5,
3-0 and 4-3.''  No team names!  In our case here, my point is, don't
just give the set of values.  State which node receives which value.}

{\bf 5.}  Suppose we are reading someone's pthreads code, and see this
code

\begin{Verbatim}[fontsize=\relsize{-2}]
#define PADDING (some number here)
...
int tmp,nlocks;
pthread_mutex_t *lockarray;
...
tmp = nlocks * (sizeof(pthread_mutex_t)+PADDING);
lockarray = (pthread_mutex_t *) malloc(tmp);
\end{Verbatim}

\begin{itemize}

\item [(a)] (10) Fill in with a term from our course:  The programmer here is
trying to avoid \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_.

\item [(b)] (10) The value of PADDING should be set to 
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_.

\end{itemize}

{\bf 6.} (15) Write OpenMP code in the form of a function {\bf fs(x,n)}
which does a right-shift of the first {\bf n} elements of the {\bf int}
array {\bf x}.  Element i gets shifted to i+1, with element 0's new
value being 0.  You must have an OpenMP {\bf parallel} directive either
just before the function or at the beginning of its code.  Aim for
reasonably good performance, but assume that this function will be used
on a variety of platforms.

{\bf Solutions:}

{\bf 1.}  {\bf ssh}

{\bf 2.a}  Line 87, the barrier, is redundant, since the parallel {\bf
for} has an implied barrier at the end.

{\bf 2.b}  The iterations of the loop are not independent, so they must
be done sequentially.

{\bf 2.c}  On line 67, we can add a {\bf nowait} clause.  This won't
harm the correct operation of the program, and it will help reduce the
penalty we incur in the critical section, by spreading out the times at
which threads enter it.

{\bf 3.}  Line 49 becomes

\begin{Verbatim}[fontsize=\relsize{-2}]
nextbase += 10;
\end{Verbatim}

We need to have each thread check values of {\bf base}, {\bf base+2},
..., {\bf base+8}.  So, place a {\bf for} loop around lines 52-57,

\begin{Verbatim}[fontsize=\relsize{-2}]
for (i = 0; i < 10; i += 2)
\end{Verbatim}

and replace {\bf base} by {\bf base+i}.

{\bf 4.}  Call the messages from PEs 0, 1 and 7 A, B and C.  A and B
route through switching nodes 00, 11 and 22, while C goes through 03, 13
and 22.  

A and B clash at 00, with A getting priority.  A then clashes with C at
22, with A getting priority again.  Then B comes along and beats C too.
So A, B and C get the values 20, 26 ande 32, respectively.

{\bf 5.}  false sharing; a cache line

{\bf 6.}  First of all, to get good performance, e.g. with respect to
cache coherency, break the array into blocks, with each thread working
on its own block.  Great care must be taken, however, at the boundaries
of the blocks.  When copying from the end of the previous block to the
beginning of my block, I must ensure that I am getting the original
value, not the copied one.  For example:

\begin{Verbatim}[fontsize=\relsize{-2}]
void rs(int *x, int n)
{  int chunk,nth;
   #pragma omp parallel
   {  int me = omp_get_thread_num(),
          i,tmp,mystart;
      #pragma omp single
      {  nth = omp_get_num_threads();
         printf("there are %d threads\n",nth);  }
      chunk = n / nth;
      mystart = me * chunk;
      if (me > 0) tmp = x[mystart-1];
      else tmp = 0;
      #pragma omp barrier
      for (i = mystart+chunk-1; i >= mystart; i--)
         x[i] = x[i-1];
      x[mystart] = tmp;
   }
}
\end{Verbatim}

\end{document}