\documentclass{article}

\setlength{\oddsidemargin}{-0.5in}
\setlength{\evensidemargin}{-0.5in}
\setlength{\topmargin}{0.0in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
\setlength{\textwidth}{7.0in}
\setlength{\textheight}{9.5in}
\setlength{\parindent}{0in}
\setlength{\parskip}{0.05in}
\setlength{\columnseprule}{0.3pt}
\usepackage{fancyvrb}
\usepackage{relsize}
\usepackage{hyperref}

\begin{document}

Name: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

Directions: {\bf \Large Work only on this sheet} (on both sides, if
needed); do not turn in any supplementary sheets of paper. There is
actually plenty of room for your answers, as long as you organize
yourself BEFORE starting writing.

{\bf 1.} (50) The OpenMP code below implements what might be considered a
variant of k-means clustering.  It is assumed that once a data point is
placed into a cluster, it stays with that cluster even as new data
points are added.  The number of clusters is fixed, but the centroids
and counts of cluster members are updated each time a new data point is
acquired.  

Assume that new data arrives in clumps.  The code below takes a clump of
new data points and updates the cluster centroids and counts.  (Which
must be updated once for each new data point.)

Globals:

   \begin{itemize}

   \item {\bf k}, the number of clusters

   \item {\bf p}, the dimensionality of the space

   \item {\bf n}, the total number of data points recorded in clusters
   (will grow by the amount of {\bf nnew} below)

   \item {\bf centroids}, a matrix of the current centroids, {\bf k}
   rows, {\bf p} columns

   \item {\bf clstrcounts}, an array, length {\bf k}, recording how many
   data points are in each cluster

   \item {\bf grps}, an array listing group membership, so that for
   example {\bf grps[88] = 3} means that data point number 88 is in
   cluster 3; length is assumed as large as {\bf n} will ever get

   \item {\bf nnew}, the number of new data points

   \item {\bf clump}, matrix of the new data, with {\bf nnew} rows, {\bf
   p} columns

   \end{itemize}

Fill in the blanks, and add any lines necessary.  For the latter action,
write something like, ``Place the following code between lines 8 and
9.'' Do NOT delete or change lines.

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
#pragma omp parallel
{  
   int i,j,grpnum;
   for (i = 0; i < nnew; i++) {
      // function closest() (not shown), finds the index of 
      // the closest centroid to the new data point i
      grpnum = closest(i);
      for (j = 0; j < p; j++) {
         tmp =  centroid[grpnum][j] _______________________________; 
         tmp += clump[i][j];
         tmp /= ________________________________
      }
   }
}
\end{Verbatim}

{\bf 2.} (50) The CUDA code below computes the discrete cosine transform
of an image, p.135.  Assume there is only one block, with that block
consisting of {\bf n} rows and {\bf m} columns of threads.  Each thread
handles a single pixel, making a local copy.  Shared memory is not used.
The arguments to {\bf dct} are:

   \begin{itemize}

   \item {\bf n}, the number of rows in the image and the transform

   \item {\bf m}, the number of columns in the image and the transform

   \item {\bf dx}, the image data on the device, {\bf n} rows, {\bf m}
   columns

   \item {\bf dd}, the transform data on the device, {\bf n} rows, {\bf m}
   columns, initially all 0.0

   \end{itemize}

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
__global__ void dct(float *dx,int n, int m, float *dd)
{  int j,k;
   float pi = 3.14;
   for (u = 0; u < n; u++)
      for (v = 0; v < m; v++) {
      }
}

float y(int q) {
   if (q == 0) return 0.71;
   else return 1.0;
}
\end{Verbatim}

{\bf Solutions:}

{\bf 1.}

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
#pragma omp parallel
{  
   int i,j,grpnum;
   #pragma omp for
   for (i = 0; i < nnew; i++) {
      // function closest() (not shown), finds the index of 
      // the closest centroid to the new data point i
      grpnum = closest(i);
      #pragma omp critical
      {  for (j = 0; j < p; j++) {
            tmp =  clstrcounts[grpnum] * centroid[grpnum][j]; 
            tmp += clump[i][j];
            tmp /= (clstrcounts[grpnum]+1);
            centroids[grpnum][j] = tmp;
         }
         clstrcounts[grpnum]++;
         n++;
         grps[n] = grpnum;
      }
   }
}
\end{Verbatim}

{\bf 2.}

\begin{Verbatim}[fontsize=\relsize{-2},numbers=left]
__global__ void dct(float *dx,int n, int m, float *dd)
{  int u,v;
   int j = threadIdx.x;
   int k = threadIdx.y;
   float pi = 3.14, myx, tmp;
   myx = dx[n*j+k];
   for (u = 0; j < n; u++)
      for (v = 0; k < m; v++) { 
         tmp = myx * cos((2*j+1)*u*pi/(2*n)) + cos((2*k+1)*v*pi/(2*m));
         tmp /= y(u) * y(v) * 2 / sqrt(m*n);
         atomicAdd(&dd[n*u+v],tmp);
      }
}

float y(int q) {
   if (q == 0) return 0.71;
   else return 1.0;
}
\end{Verbatim}

\end{document}