%% This LaTeX-file was created by <root> Tue Mar 14 13:07:10 2000
%% LyX 1.0 (C) 1995-1999 by Matthias Ettrich and the LyX Team

%% Do not edit this file unless you know what you are doing.
\documentclass[twocolumn]{article}
\usepackage[T1]{fontenc}
\setlength\parskip{\medskipamount}
\setlength\parindent{0pt}

\makeatletter


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
\providecommand{\LyX}{L\kern-.1667em\lower.25em\hbox{Y}\kern-.125emX\@}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
\usepackage[T1]{fontenc}
\setlength\parskip{\medskipamount}
\setlength\parindent{0pt}

\makeatletter


\usepackage[T1]{fontenc}
\setlength\parskip{\medskipamount}
\setlength\parindent{0pt}

\makeatletter


\setlength{\oddsidemargin}{-0.5in}
\setlength{\evensidemargin}{-0.5in}
\setlength{\topmargin}{0.0in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
\setlength{\textwidth}{7.0in}
\setlength{\textheight}{9.5in}
\setlength{\parindent}{0in}
\setlength{\parskip}{0.1in}
\setlength{\columnseprule}{0.4pt}

\makeatother
\makeatother
\makeatother

\begin{document}

Name: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

\textbf{Directions: Work only on this sheet (on both sides, if needed); do not
turn in any supplementary sheets of paper. There is actually plenty of room
for your answers, as long as you organize yourself BEFORE starting writing.
In order to get full credit, SHOW YOUR WORK.}

\textbf{1.} (15) Look at the discussion of \textit{shearsort} in pages 277ff
of Wilkinson and Allen. The text suggests performing a transpose operation,
using a single call to an \textit{all-to-all} routine. Assume that we have n
processors and n data points, as in the discussion in the text. Data point (i,j),
named x, will be at processor k{*}i+j, where \( k \) = \( \sqrt{n} \). Give
the MPI code to do the \textit{all-to-all} operation, followed by a call to
printf() to print out the value received. \textbf{(Don't forget the printf().)}

\textbf{2.} (15) Give the complete C code implementation of the pseudode

\begin{verbatim}
for all possible moves of square I do {
   Q = malloc(sizeof(struct BoardPos));
   fill in *Q according to this move;
\end{verbatim}

in the handout on the 8-Squares Puzzle problem.

\textbf{3.} (15) Look at the cyclic-striped partitioning on p.317 of Wilkinson
and Allen. Suppose our matrix is 16x16 and that we are using 4 processors. Consider
what happens when \( P_{1} \) is performing the last of its four calls to the
broadcast function. Note that it is actually a multicast, going only to a subset
of the four processors. State which processor(s) \( P_{1} \)will be sending
to in this fourth call

\textbf{4.} This problem concerns the pthreads examples, Workpile.c and Quicksort.c.

\begin{description}
\item [a.](10) The code in Workpile.c is meant to be generally applicable to task-farm
problems, not just Quicksort.c. State which argument in which function declared
in Workpile.c ties that general machinery to a specific problem such as \textit{quicksort}.
\item [b.](10) Show which line in Workpile.c checks the termination condition for
the workpile operation.
\end{description}
\textbf{5}. Suppose we wish to do noise reduction on a certain image. (To simplify
things, assume we are going to do this on just one row of pixels.)

\begin{description}
\item [~~~a.](10) Look at the equations for \( X_{k} \) and \( x_{k} \) at the bottom
of p.349 of Wilkinson and Allen. State how one the two equations would be altered.
\item [~~~b.](10) Of the master-slave and pipelined parallelizations discussed on
pp.353-354 (applied in this case to the inverse transform), state why one of
the two methods would be more appropriate in our setting here.
\end{description}
\textbf{6.} (15) Consider the pipelined parallelization of \textit{bubble sort},
described in Figure 9.9 of Wilkinson and Allen. Suppose we are implementing
this in MulSim, sorting an n-item global data array 'a' using n processors.
In the code on p.275, \( P_{i} \) will handle case i in the outer loop. There
will be a global array my\_j, with my\_j{[}i{]} stating which j in the inner
loop \( P_{i} \) is currently working on. Write the MulSim code for this algorithm,
omitting declarations, header-file includes and so on, but showing full detail
of the algorithm itself. (Note: You should not need to use any lock variables.)

{\bf Solutions:}

{\bf 1.}

\begin{verbatim}
MPI_Alltoall(&x,1,MPI_Int,y,1,MPI_Int,MPI_COMM_WORLD);
printf("%d\n",y[k*j+i]);
\end{verbatim}

{\bf 2.}

\begin{verbatim}
Q = (struct BoardPos *) malloc(sizeof(struct BoardPos));
memcpy(Q,P,sizeof(struct BoardPos)); 
RowI = Q->Row[I];  ColI = Q->Col[I];
RowBlank = Q->Row[8];  ColBlank = Q->Col[8];
if (abs(RowI-RowBlank) == 1)
   Swap(&Q->Row[I],&Q->Row[8]); 
else if (abs(ColI-ColBlank) == 1)
   Swap(&Q->Col[I],&Q->Col[8]); 
\end{verbatim}

{\bf 4.a.}  Argument worker\_proc in work\_init().

{\bf 4.b.}

\begin{verbatim}
while(wp->n_pile != 0 || wp->n_working != 0)
\end{verbatim}

{\bf 5.a.}  We want to remove the high-frequency components, i.e. the
``noisy'' ones.  So, in the sum for $x_k$, we sum only from j=0 to j=M for
some M $<$ N-1.

{\bf 5.b.}  In {\bf this setting}, the master-slave approach would be bad,
since it would compute some values of $X_k$ that would never be used.
(Must get part (a) correct to receive credit for part (b).)

{\bf 6.}

\begin{verbatim}
me = CPU_NUM;
for (j = 0; j < me; j++)  {
   // must wait until the "upstream" node has past this j
   while (my_j[me-1] <= me+1) ;
   k = j+1;
   if (a[j] > a[k])  {
      temp = a[j];
      a[j] = a[k];
      a[k] = temp;
   }
   // signal "downstream" node
   my_j[me] = j;
}
\end{verbatim}

\end{document}

