% \documentclass[11pt]{article}

% \setlength{\oddsidemargin}{0in}
% \setlength{\evensidemargin}{0in}
% \setlength{\topmargin}{0.0in}
% \setlength{\headheight}{0in}
% \setlength{\headsep}{0.5in}
% \setlength{\textwidth}{6.5in}
% \setlength{\textheight}{9.0in}
% \setlength{\parindent}{0in}
% \setlength{\parskip}{0.05in}

%% \usepackage{psfig}

\documentclass[times, 10pt,twocolumn]{article}
\usepackage{latex8}
\usepackage[dvips]{graphicx}
\usepackage{times}

\begin{document}

\title{I-Tuples: A Programmer-Controllable Performance Enhancement
for the Linda Environment}
\author{Michael Foster, Norman Matloff, Raju Pandey, David Standring, 
and Robert Sweeney \\ 
University of California, Davis \\
\\
Correspondence Author: \\
\\
Norman Matloff \\
Department of Computer Science \\
University of California at Davis \\
Davis, CA  95616 \\
matloff@cs.ucdavis.edu
}

\date{November 12, 2000}

\maketitle

% \abstract{\noindent The Linda programming paradigm provides an elegant
% approach to parallel processing.  However, its efficiency is often
% limited by the communications overhead involved in transferring tuples
% to and from the tuple manager.  Here we propose an RPC approach to the
% problem, by enabling the tuple manager to process a set of several
% related tuple operations in one batch, without intervening
% communications with the application node.}

\begin{abstract}
A version of the Linda parallel programming model is developed in which
the user may offload batches of tuple operations to be done at the
tuple server, eliminating expensive communications which occur between
successive tuple operations.
\end{abstract}

\section{Introduction}

The motivation behind the Linda model \cite{linda} is to create a
protected shared memory space for parallel processing.  This space
consists of a collection of objects, called {\bf tuples}, which are
themselves ordered collections of typed fields and associated data.
Operators are available to client applications for insertion (out()),
deletion (in()), and inspection (rd()) of the collection of tuples, also
referred to as the {\bf tuple space}.  

The Linda approach is quite elegant, and has interesting advantages over
other parallel-processing environments.  Atomic operations, for example,
are easy to implement by temporarily removing certain tuples from the
tuple space. A particularly useful feature is that processing nodes can
easily join the work group in the midst of a computation, and can easily
leave midstream as well.  The latter property also implies high
potential for writing fault-tolerant applications.  The Linda philosophy
was the inspiration for Sun Microsystems' new Jini/JavaSpaces system
\cite{jini} and IBM's TSpaces \cite{ibm}.

However, one drawback to the Linda model, particularly in a Network of
Workstations (NOW) setting \cite{anderson} \cite{beowulf}, is the
relatively large number of communication transactions required for each
operation.  This is particularly evident in situations where a program
repeatedly retrieves a tuple, performs a small operation on data in that
tuple and writes the data back to the tuple space. In such cases, the
overhead associated with retrieving and returning the tuple dominates
the operation's overall cost.  

Consider the following example which might arise in a
matrix-multiplication computation:\footnote{(Here we are using the
UCTuplets implementation of Linda \cite{uctuplets}, somewhat similar to
that of p4-Linda \cite{p4}, differing slightly from the standard
\cite{linda}.  For example, ``sp'' here means that the second component
in the tuple will be of string type and the third a pointer type, and
``si'' means a string type followed by an {\bf int} type.)}

\begin{verbatim}
  in("sp", "next row", &Row);
  Row++;
  out("si", "next row", Row);
\end{verbatim}

\noindent This code represents a simple fetch-and-add operation \cite{almasi}.
The call to in() removes a tuple from the tuple space whose second
component is the string ``next row'', and places the value of that
tuple's third component into the variable Row.  The incremented version
of that value is then written back to the tuple space by calling out().

In a NOW setting, this will necessitate a total of four network
transactions: a request for retrieving the tuple, the response to that
request, a request to insert the incremented tuple, and the final
acknowledgment.  As such, this simple example incurs a significant
performance penalty due to overhead in packaging, transmitting,
decoding, and returning the data through the network.

\section{Proposed Solution:  the I-Tuples System}

To address this performance issue we propose an alternative
implementation, termed {\bf I-tuples}, which allows the programmer to
reduce the overhead associated with tuple operations such in as the
fetch-and-add example, for improved performance. 

We propose a solution in which the tuple manager (TM) is able to process
tuple operations on behalf of remote clients, using data that is local
to the TM and hence reducing the amount of network
communication required.\footnote{Many implementations of Linda feature
multiple tuple managers, not just one.  Our I-Tuples system would also
benefit from such a scheme, though we have not implemented it.  In the
material below, any reference to the singular term ``tuple manager''
should be considered to apply to the case of multiple tuple managers as
well.}  

For example, the above fetch-and-add operation could be replaced
by a single function call made by a worker node but executed locally by
the TM.  The call to in() would not return a tuple to the
worker node, but instead would be retained by the TM for use
in the rest of the code.  The entire operation of three lines of code
would be executed as a package at the TM, via a new Linda
operation, tmexec().

All of this would require just two network transactions, compared to
four network transactions required in the former example.  The first
transaction instructs the TM to execute the specified
function locally, and the second would consist of return data from the
TM to the requesting client (in this case Row). As a result,
overhead attributed to network latency is reduced, in this example, by a
factor of two.  Potential gains could be much more significant,
depending primarily upon the number of tuple operations offloaded to the
TM.  

Note that our approach differs from {\bf code shipping} \cite{gifford},
remote procedure call, and so on.  This is discussed further in Section
\ref{disc}.

\section{Implementation}

As currently implemented, all the nodes running an I-Tuples application
must share the same architecture and compiler.  In other words, our
model here is a network of identical workstations (extensions will be
discussed later in this paper).  The reason for this restriction is that
the current implementation relies on the address of a function (the
virtual address, i.e. \&f for a function f) having the same value on
all nodes of the system.

Application programs in the current system are in Single Program
Multiple Data (SPMD) form.  The same program is executed by all nodes,
including the node serving as the TM.  The function main()
calls a library function UCTInit().  The latter senses the node number
on which the program is being run, and then either calls a library
function to start the TM (in the case of node 0) or starts a
user-written function Worker() which comprises the application code (in
the case of the nodes numbered greater than 0).  By having node 0, the
TM, run the same program being run by the worker nodes, the
TM has access to the I-Tuples functions written by the user.

The first objective was to add a new function to the operation set
in()/out()/rd().  This function, called tmexec() (Tuple Manager
EXECute), enables a client to specify a function for the TM
to execute locally.  This function may contain multiple in()/out()/rd()
calls.  This new operation has the following specification:

\begin{verbatim}
  tmexec (int (*ftn_ptr)(), 
     char* in_args, int in_length, 
     char* out_args, 
     int* out_length_ptr);
\end{verbatim}

In this prototype, ftn\_ptr is the address of the client-supplied
function,\footnote{Recall that due to the SPMD nature of the I-Tuples
application program, the function itself is already there at the tuple
manager.} consisting of in()/out()/rd() calls that are to be
serviced locally by the TM. The strings ``in\_args'' and
``out\_args'' are used to pass arguments from/to the client application
to/from the specified function. The two integer parameters represent the
lengths of the input and output arguments, respectively. The user is
responsible for encoding/decoding the input and output arguments, and
appropriately specifying their lengths.  The following example is a
modification of the fetch-and-add code presented earlier.  The
FetchAndAdd function and its indirect call via tmexec() are shown:

\begin{verbatim}
   char *Ins,*Outs;  
   int InL,OutL;  int Val;
   ...
   FetchAndAdd(char* InArgs, 
      int InLength, char* OutArgs, 
      int* OutLengthPtr)

   {  int R;

      in("lsi", "hello", &R);
      memcpy(OutArgs, R, sizeof(int)); 
      *OutLengthPtr = sizeof(int);
      out("lsi", "hello", R+1);
   }
   ...
   // we would set Ins and InL here 
   // if we had inputs in this case 
   tmexec(FetchAndAdd, Ins, InL, 
      Outs, &OutL);
   memcpy(Val, Outs, sizeof(int));
   // Val now contains the 
   // pre-incrementing value in 
   // the tuple 
\end{verbatim}

To implement the tmexec() operator, it was necessary to establish a
mechanism to differentiate between in()/out()/rd() calls that are to be
serviced remotely (i.e. requests through the network from worker nodes,
as in standard Linda) and those made locally by the TM.  To
facilitate this distinction, the key string used to specify tuple fields
for in()/out()/rd() access operations has been slightly modified. The
user must now designate the context of the access operation by adding
`r' (for remote access operations) or `l' (for local access operations)
as the first letter of the first tuple component.  Again, the
designation ``remote'' or ``local'' which indicates the source of the
operation request are defined from the point of view of the TM.  

For example, consider the call

\begin{verbatim}
  in("rsipI", "hello", 5, &var, 
     IntArray);
\end{verbatim}

This example illustrates how a remote call might look.  Notice that the
first letter in the key indicates that this is a remote operation,
intended to be called from the user application, i.e. as in standard
Linda; alternatively, an 'l' would indicate that the specified operation
is part of a user-defined function, for local execution by the TM.  The remainder of the key string states that the next field
will be of string type, then one of pointer type ({\bf *int} here), then
one of {\bf int} type, and finally an {\bf int} array.  (After the call,
the next-to-last field will contain the length of the array, filled in
by the TM.)

\section{Experimental Evaluation}

The motivation behind I-Tuples is to reduce the amount of network
traffic required to interact with the tuple space, by batching tuple
space access operations from the client to the tuple manager.
Intuitively, one might expect that application performance will likely
improve as a result.  However, by transferring workload to the TM, we
are effectively sacrificing parallelism, as execution is serialized
within the TM. For small operations, this transferral will likely
improve performance, as the loss of parallelism is insignificant
compared to the performance increase associated with the reduction of
network traffic.  However, as more work is offloaded to the manager, the
performance will likely degrade as parallelism is reduced. We will
illustrate this behavior here.

\subsection{Test Applications}

We are interested in the point at which a given application no longer
benefits from offloading work to the manager.  To explore this point of
interest, we employ the following performance tests.  For each test, we
measure the overall run-time of the manager, from the point at which the
manager is launched to the point at which the manager exits.  The
startup time for each client is not represented.  

It should be emphasized that we are interested here only in the impacts
on performance which result from mild to aggressive use of tmexec().
The programs here were not especially optimized, nor did we make
comparisons to sequential versions which did not use the Linda paradigm.

The first test application, chosen for simplicity, was a matrix
multiplication program, which multiples two NxN matrices.  

We use a task farm approach, where each client retrieves a tuple
specifying the next row and column to be multiplied, retrieves the
associated row and column, multiples them together, and returns the
result to the tuple space.  For this test, we implemented two user
functions to be remotely executed by the manager. The first is a simple
fetch-and-add operation, intended to consume a minimal amount of manager
resources. This operation is limited to two accesses to the tuple space
(one to retrieve a tuple and one to return a tuple to the space) and an
increment operation. This operation is given below:

\begin{verbatim}
  void FAInextIndex(char *inArgs, 
     int inLen, char *outArgs, 
     int* outLen) {
    int i,j;

    in("lspp", "nextIndex", &i,&j);
    *outLen = 2*sizeof(int);
    memcpy(outArgs, &i, sizeof(int));
    memcpy(outArgs+sizeof(int), &j, 
       sizeof(int));
    j++;
    if (j<N){
      out("lsii", "nextIndex", i,j);
    }
    else {
      j=0;
      i++;
      out("lsii", "nextIndex", i,j);
    }
  }
\end{verbatim}

This function retrieves the index of the ``nextIndex'' tuple,
representing the index into the product array, increments the index, and
returns the incremented tuple to the tuple space.  Minimal additional
code surrounding the call to tmexec() is necessary to cycle through the
rows and columns.  This local function will be used to highlight
performance trends in our first execution environment, for which a
majority of the user function is composed of tuple operations.

The second operation we consider (code not shown here) is one in which
more work is offloaded to the TM.  Here the ``dot product'' work for a
row-column multiplication is done at the TM.  This local access function
will be used to highlight pertinent trends for our second execution
environment, in which a majority of the work done by the local access
function is non-tuple in nature.

We executed four variations of this program: the original implementation
without optimizations, with only the fetch-and-increment user function
implemented, with only the row-multiplication user function implemented,
and with both user functions implemented.  We then vary the number of
clients and the size of the matrices to be multiplied, and measure
performance based on the running time of the application.

In our second application, we considered an implementation of Dijkstra's
algorithm to determine the shortest path in a connected graph.  We
perform two optimizations of this algorithm.  The first implements a
user function to update the ``done'' flag for vertices in the graph.
This function simply retrieves a tuple for the specified vertex, updates
its value and returns the new value to the tuple space.  The second
implements a user function to update the distance to every other vertex
after the vertex with the current shortest path has been determined.  As
with the MatMult application, this program is run under four different
conditions: the original implementation without optimizations, with only
the Done-Update optimization, with only the Dist-Update optimization,
and with both user-functions implemented.

\subsection{Experimental Results}

\subsubsection{MatMult Application}

Once again, we consider two execution environments applicable to the
I-Tuples model.  The first environment is characterized by the utilization
of user functions that are composed primarily of tuple operations,
and the second environment by user functions composed primarily of non-tuple
operations.

\begin{figure}[h]
\includegraphics[width=3.2in]{mm1.eps}
\caption{}
\end{figure}

\begin{figure}[h]
\includegraphics[width=3.2in]{mm2.eps}
\caption{}
\end{figure}

Figure 2 presents data corresponding to the first matrix-multiply
execution environment, in which the Fetch-and-Add optimization was
applied to the MatMult application.  This data clearly indicates a
dramatic performance improvement relative to the non-optimized
environment (Figure 1).  These results suggest that user-defined
functions can improve overall application performance considerably.
Figure 2 also indicates that performance scales well with client count,
for this particular optimization, until we approach 6 clients.  At this
point, the performance to be gained by increasing the number of clients
becomes marginal.  This behavior is attributable to a growing amount of
network traffic as the number of clients increases. With more clients
requesting to be serviced by the manager, the manager quickly becomes
saturated, thereby causing clients to spin waiting for data.
Nevertheless, this data suggests that the I-Tuples environment can
provide significant performance gains.

\begin{figure}[h]
\includegraphics[width=3.2in]{mm3.eps}
\caption{}
\end{figure}

Figure 3 presents data corresponding to the second matrix-multiply
execution environment, in which the row-multiply optimization was
applied to the MatMult application.  As was observed with the
fetch-and-add optimization, this data also shows a significant
performance improvement, relative to the non-optimized environment.
Again, these results reaffirm the overall application performance
benefits that may be achieved by the I-Tuples environment.  

Relative to our fetch-and-add optimization results, however, a critical
deviation is observed. As the number of clients is increased, we find
that performance gains are less significant for multiply-row
optimization.  This behavior is largely observed for two reasons.
First, this user-function is performing significantly more work than the
fetch-and-add user-function.  Consequently, the manager workload is
consistently high, necessitating that clients spend more time idling
than performing useful work.  Second, the predominant amount of work
being performed in the row-multiply function is non-tuple in nature.
Accordingly, in comparison with the Fetch-and-Add optimization,
proportionally less work is eligible for the speedup associated with
reducing network activity.

\begin{figure}[h]
\includegraphics[width=3.2in]{mm4.eps}
\caption{}
\end{figure}

Figure 4 presents data corresponding to both execution environments, in
which both the fetch-and-add and row-multiply optimizations were applied
to the MatMult application.  The motivation in collecting this data was
to highlight trends associated with heavily-loaded environments in which
a significant amount of work is offloaded to the manager.  As this data
suggests, environments in which the manager is heavily loaded experience
negligible speedup as compared to the non-optimized MatMult environment
(Figure 1).  This is a result of shifting excessive workload from the
clients to the manager.  This effectively serializes execution of the
program within the manager, since much of the computation that would
have otherwise been performed in parallel by the clients is now executed
by the manager.

\subsubsection{Dijkstra's Shortest Path Application}

We now consider the performance of Dijkstra's shortest path algorithm
for various optimizations.

\begin{figure}[h]
\includegraphics[width=3.2in]{dij.eps}
\caption{}
\end{figure}

Figure 5 presents data corresponding to each variant of the Dijkstra
application.  For the non-optimized environment the data indicates that
performance improves until we reach a client count of 4.  For more than
4 clients, we find a small degradation in performance attributable to an
increase in network overhead. Data for the Done-Update optimization
indicates a performance advantage associated with reducing the network
traffic in updating the ``done'' tuple for each vertex.  Similarly, data
for the dist-update optimization shows a greater performance increase
over the non-optimized version, and additional performance benefits over
the done-update version.  This is attributable to further reduced
network traffic associated with updating the ``Dist'' information for
each vertex.  The final version, shown in Figure 6, employs both
optimizations, and indicates the best execution time thus far.  In
contrast with the MatMult data, in which both optimizations impacted
performance significantly by transferring the majority of the work to
the manager, this data suggests otherwise.  We find here that utilizing
both optimizations improves performance for the Dijkstra example.  This
is because the parallelism is not sacrificed as the clients still have
sufficient work to do.

The I-Tuples environment and all test programs were compiled using CC on
SGI O2 systems connected by an ordinary 10 Mb Ethernet.  

\section{Conclusions and Discussion}
\label{disc}

We have proposed an extension of the Linda programming paradigm which
enables the programmer to enhance performance by requesting the TM to
perform batches of tuple operations (together with associated ``glue''
code), rather than discretely.  Initial experimental results suggest
that significant performance speedups are possible.

A byproduct novelty of I-Tuples is that it enables the application
programmer to interact in a more active manner with the TM.  In standard
Linda, the TM is completely transparent.  In the NOW setting, for
example, the programmer does not know on which network node the TM
resides, and could not directly communicate with that node even if its
identity were known.

In this light, though the actions in I-Tuples may remind the reader
of {\bf code shipping} \cite{gifford}, in which application program can
specify the dynamic migration of code to another designated node, 
I-Tuples is different, in that here nodes are communicating with an
unseen entity at an unknown location.

I-Tuples does have some similarity with the notion of remote procedure
call (RPC), and tmexec() is arguably a form of RPC.  However, it differs
from RPC, again in the sense that the TM is unseen and resides in an
unknown location.

I-Tuples is also completely different from the code migration capability
implicit in Jini.  Most importantly, I-Tuples deals with
\underline{batches} of tuple operations, rather than individually, the
key point in terms of performance.  Thus the performance enhancement of
the type we have been discussing is not attainable in Jini.  And though
Jini does include a form of code shipping,\footnote{ In Jini, tuples
(called {\bf entries}) are Java objects, and thus may have code
associated with them.  That code migrates from one application node to
another as its associated tuple migrates, and is executable at any
application node which posseses the tuple.} the Jini TM is still unseen
and unaccessible.  

However, even though I-Tuples differs conceptually from the notion of
code shipping, it would be interesting to merge the I-Tuples concept
here with the concept of code shipping.  For example, Ariel, a current
research project of one of the authors \cite{raju} involves a highly
flexible framework for code mobility.  The Ariel framework still relies
on exact specification of a destination node in a code-shipping call,
but it could be generalized to include virtual nodes, in the following
sense.  Each virtual node number would have associated code which
processes messages and communicates with unseen entities.  That code
would be programmed as a low-level application, while code such as
I-Tuples would be considered a high-level application.  In this case the
unseen entity would be the TM, but many other applications are possible,
such as redirection of Web access requests.  In other words, this scheme
would extend the notions of code shipping, RPC and so on to include a
form of information hiding, in this case the location of the TM.

\begin{thebibliography}{}

\bibitem{almasi}

George Almasi and Allan Gottlieb.  {\it Highly Parallel Computing},
Benjamin/Cummings, 1989.

\bibitem{anderson}

T. Anderson {\it et al}.  ``A Case for Networks of Workstations,''
{\it IEEE Micro}, 54-64, February 1995.

\bibitem{raju}

Earl Barr and Raju Paney.  Ariel system for distributed programming.
(Paper submitted for publication.)

\bibitem{p4}

Ralph Butler, {\it et al}.  {\it p4-Linda: A Portable Implementation of
Linda}, ftp://info.mcs.anl.gov/pub/p4.   

\bibitem{linda}  

Nicholas Carriero and David Gelernter. {\it How to Write Parallel
Programs: a First Course}, MIT Press, 1990.

% \bibitem{ship}

% Tzi-cker Chiueh and Manish Verma. ``A Compiler-Directed Distributed 
% Shared Memory System,'' {\it Proceedings of the International Conference
% on Supercomputing}, 1995. 

\bibitem{uctuplets}

Michael Foster {\it et al}.  UCTuplets Parallel Processing Package,  
http://heather.cs.ucdavis.edu/uct.html

\bibitem{jini}

Eric Freeman {\it et al}.  {\it JavaSpaces(TM) Principles, Patterns 
and Practice}, Addison-Wesley, 1999.

\bibitem{gifford}

J. Stamos and D. Gifford.  ``Remote Evaluation,'' {\it ACM Transactions
on Programming Languages and Systems}, 12(4):537-565, October 1990.

\bibitem{beowulf}

T. Sterling {\it et al}.  {\it How to Build a Beowulf: a Guide to the
Implementation and Application of PC Clusters}, MIT Press, 1999.


\bibitem{ibm}

P. Wyckoff {\it et al}.  ``TSpaces,'' {\it IBM Systems Journal}, 37, 3,
1998.
 
\end{thebibliography}
 

\end{document}


