

\documentstyle[twocolumn]{article}

\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\topmargin}{-0.3in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{9.8in}
\setlength{\parindent}{0in}
\setlength{\parskip}{0.1in}
\setlength{\columnseprule}{0.4pt}

\begin{document}

\bf
\hfill Name:  $\underline{ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \
\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ }$
% \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
\rm
\bigskip
                                      
{\bf Directions:} Work on this sheet (both sides, if needed) only;
{\bf do not turn in any supplementary sheets of paper}.  
In order to get full credit, {\bf SHOW YOUR WORK.}

{\bf 1.}  (15) Look at the bottom half of Figure 6.23 in Patterson and
Hennessy.  Suppose instead of an {\bf lw} instruction we have an
{\bf sw}.  Give an example of an object which had been shaded green
but now would not be, or an object which previously had not been
shaded and now will be.  

{\bf 2.}  (5) Cite an application discussed in lecture in which
something like the cache coherence problem occurs at the software
level.

{\bf 3.}  (10) Fill in the blanks with official terms from our course: 
Suppose we have a program which makes relatively few accesses to memory.   
Then among the types of interconnects we discussed in class, a(n)
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ would give the best
performance.  This is because it would have the best
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_.

{\bf 4.}  (10) Consider the ``unsafe'' barrier problem discussed in
class, in which one processor races ahead and reaches the barrier
again before processor 0 has finished its work.  Then which of the
following is true?  (i) The processor which reached the barrier
prematurely will also exit the barrier prematurely.  (ii) When processor
0 next hits the barrier, it will exit the barrier later than it should.
(iii) The system will hang. (iv) None of the previous scenarios is guaranteed
to hold.

{\bf 5.}  Consider the code

\begin{verbatim} 
LOCK(L); 
Composite[0] += MyComposites; 
UNLOCK(L);
\end{verbatim}

in our MulSim example program.

\begin{itemize}

\item [(a)] (10) Suppose we had forgotten to put in the lines with the calls
to LOCK() and UNLOCK().  Let $p_{true}$ be the true number of primes
and $p_{report}$ be the number of primes reported by the program.  Then
which of the following is true? (i) $p_{true} \ge p_{report}$. (ii)
$p_{true} \le p_{report}$. (iii) Neither of the previous two scenarios is
guaranteed to hold.

\item [(b)] (10) Instead of the situation in (a), suppose we do remember
that we need to make the operation atomic, but suppose MulSim did not
have LOCK(), TAS() and UNLOCK().  Assume that MulSim would still have
its AINC() call.  The latter, which invokes the {\bf ainc} instruction,
atomically applies a fetch-and-add operation with incrementation value
of 1 to the function's argument, and returns the old value of the
argument.  Write complete code in which you use AINC() to make the above
code safe (albeit highly inefficiently).

\end{itemize}

{\bf 6.} Suppose we are writing an MPI program to do Hyperquicksort.  In
this problem you will write code to implement some of the pseudocode
in our handout. {\bf Keep in mind that this is a single program,
identical copies of which run on each of the PEs.}  We have these global
variables (all of which are of type {\bf int}, except for X, which is
of type array of {\bf int}):

\begin{verbatim}
D, dimension of full cube
NPEs, number of PEs in full cube
Me, i.d. number of this PE
I, current index in "for i = d downto 1 do"
Partner, i.d. of PE's partner 
Root, i.d. of root of current subcube
Median, value broadcast by root
X, array containing current "chunk" at this PE
NX, he number of elements currently in X
\end{verbatim}

We also define some message types:

\begin{verbatim}
#define MDN 1
#define PRNT 2
\end{verbatim}

Suppose we have a function BinaryIt() which converts a PE's
i.d. number to its base-2 form.  The heading in its declaration is:

\begin{verbatim}
BinaryIt(K,B)  
   int K,B[];
\end{verbatim}

If for example in a 4-dimensional cube we have K = 12, then after
calling BinaryIt(), then we will have B[3] = 1, B[2] = 1, B[1] = 0
and B[0] = 0.  Assume that this function is given to you.

\begin{itemize}

\item [(a)] (10) Write the complete code which determines the value of
Partner from Me, I and D.

\item [(b)] (15) Write the complete code which determines Root and
Median from D, I and a message from Root of type MDN, for the case
in which Me is not equal to Root.

\item [(c)] (15) Recall that when Hyperquicksort finishes, the array
elements are still distributed among the PEs.  We wish to have the
entire array printed out, using printf(), with the PEs taking turns
(first PE 0 prints out its chunk, then PE 1, etc.).  Write the complete
code to do this, using X, NX, NPEs, Me and a message type named PRNT.

\end{itemize}

{\bf Solutions:}  

1.  In the Data memory box, shift the green to the left half, and
in the MEM/WB, the green shifts to the other half too.

2.  For example, software distributed shared memory.

3.  Bus, latency.

4.  The answer is (iii).  After the CPU which races ahead and
increments the barrier but another CPU resets the count back to 0,
then we will be "one short."  The barrier count will never reach
the number of CPUs, so the CPUs will stay in the barrier forever.

5.a.  The answer is (i).  The only error which can occur is that
two CPUs both try to add to the number of composites at the same
same.  The new sum won't be as large as it should be.  Then we
when do a subtraction to get the number of primes, our answer will
be too large.

5.b.  Don't use the solution mentioned in the notes; the notes
point out that it must be modified to be fully safe.  Instead,
set up a {\bf for} loop to add 1 to the number of composites
My Composites number of times, using {\bf ainc} for atomicity.

 
\end{document}



