% \documentclass[11pt]{article}

% \setlength{\oddsidemargin}{0in}
% \setlength{\evensidemargin}{0in}
% \setlength{\topmargin}{0.0in}
% \setlength{\headheight}{0in}
% \setlength{\headsep}{0.5in}
% \setlength{\textwidth}{6.5in}
% \setlength{\textheight}{9.0in}
% \setlength{\parindent}{0in}
% \setlength{\parskip}{0.05in}

%% \usepackage{psfig}

\documentclass[times, 10pt,twocolumn]{article}
\usepackage{latex8}
\usepackage[dvips]{graphicx}
%% \usepackage{times}

\pagestyle{empty}

\begin{document}

\title{I-Tuples: A Programmer-Controllable Performance Enhancement
for the Linda Environment}
\author{Michael Foster, Norman Matloff, Raju Pandey, David Standring, 
and Robert Sweeney \\ 
University of California, Davis \\
\\
Correspondence Author: \\
Norman Matloff \\
matloff@cs.ucdavis.edu 
}

\date{April 27, 2001}

\maketitle

% \abstract{\noindent The Linda programming paradigm provides an elegant
% approach to parallel processing.  However, its efficiency is often
% limited by the communications overhead involved in transferring tuples
% to and from the tuple manager.  Here we propose an RPC approach to the
% problem, by enabling the tuple manager to process a set of several
% related tuple operations in one batch, without intervening
% communications with the application node.}

\begin{abstract}
A version of the Linda parallel programming model is developed in which
the user may offload batches of tuple operations to be done at the
tuple server, eliminating expensive communications which occur between
successive tuple operations.
\end{abstract}

{\it Keywords and phrases:}  Linda programming paradigm; performance;
communications costs.

\section{Introduction}

The motivation behind the Linda model \cite{linda} is to create a
protected shared memory space for parallel processing.  This space
consists of a collection of objects, called {\bf tuples}, which are
themselves ordered collections of typed fields and associated data.
Operators are available to client applications for insertion (out()),
deletion (in()), and inspection (rd()) of the collection of tuples, also
referred to as the {\bf tuple space}.  Tuple operations are carried out
by a tuple manager (TM) which maintains the tuple space.  (Many
implementations of Linda feature multiple tuple managers, but we assume
a single TM here.)

The Linda approach is quite elegant, and has interesting advantages over
other parallel-processing environments.  Atomic operations, for example,
are easy to implement by temporarily removing certain tuples from the
tuple space. A particularly useful feature is that processing nodes can
easily join the work group in the midst of a computation, and can easily
leave midstream as well.  The latter property also implies high
potential for writing fault-tolerant applications.  The Linda philosophy
was the inspiration for Sun Microsystems' new JavaSpaces system
\cite{javaspaces} and IBM's TSpaces \cite{ibm}.

However, one drawback to the Linda model, particularly in a Network of
Workstations (NOW) setting \cite{anderson} \cite{beowulf}, is the
relatively large communication overhead incurred in each operation.
When for example an application client performs an in() operation, the
client sends a request to the TM via the network, and the TM responds by
sending the tuple back to the client via the network.  (For
simplicity, we will assume here that each tuple operation produces one
network communication.  In some systems the situation may be more
complex, e.g. due to caching or due to making the operations
asynchronous.)  This is particularly evident in situations where a
program repeatedly retrieves a tuple, performs a small operation on data
in that tuple and writes the data back to the tuple space. In such
cases, the overhead associated with retrieving and returning the tuple
dominates the operation's overall cost.  

Consider the following example which might arise in a
matrix-multiplication computation:

\begin{verbatim}
  in("sp", "next row", &Row);
  Row++;
  out("si", "next row", Row);
\end{verbatim}

Here we are using the UCTuplets implementation of Linda
\cite{uctuplets}, somewhat similar to that of p4-Linda \cite{p4},
differing slightly from the standard \cite{linda}.  The first argument
in a call to in()/out()/rd() is not part of the tuple or template, but
instead provides information as to the types of the components of the
tuple.  For example, ``sp'' here means that the second component in the
tuple will be of string type and the third a pointer type, and ``si''
means a string type followed by an {\bf int} type.

This code represents a simple fetch-and-add operation \cite{almasi}.
The call to in() removes a tuple from the tuple space whose second
component is the string ``next row'', and places the value of that
tuple's third component into the variable Row.  The incremented version
of that value is then written back to the tuple space by calling out().

In a NOW setting, this will necessitate a total of four network
transactions: a request for retrieving the tuple, the response to that
request, a request to insert the incremented tuple, and the final
acknowledgment.  As such, this simple example incurs a significant
performance penalty due to overhead in packaging, transmitting,
decoding, and returning the data through the network.

A large portion of the research involving the Linda paradigm has
concerned addressing performance problems of this nature.  For example,
various proposals have been made for replicating or partitioning the
tuple space \cite{wc95}.  

Another direction of research has been to add new Linda primitives
designed to improve efficiency.  Bulk tuple operations have been
proposed, for instance \cite{rowstron}, to copy/move a group of tuples
in one operation.

An example of performance-motivated new Linda operations which is
especially relevant to our work here was included in the Eilean system
\cite{carreria}.  This took the form of an add() operation to the
in()/out()/rd() suite, whose function was essentially that of the
fetch-and-add operation described above.  Here, a single Linda operation
replaces two, with a corresponding halving of network overhead.  Our
work here may be viewed as an extension of that concept.

Our approach here is to allow the application programmer to define
his/her own new operators to add to the in()/out()/rd() suite, as
explained in the following sections.

\section{Proposed Solution:  the I-Tuples System}

To address this performance issue we propose an alternative
implementation, termed {\bf I-tuples}, which allows the programmer to
reduce the overhead associated with tuple operations such as in the
fetch-and-add example, for improved performance. 

We propose a solution in which the tuple manager is able to process
tuple operations on behalf of remote clients, using data that is local
to the TM and hence reducing the amount of network communication
required.

For example, the above fetch-and-add operation could be replaced by a
single function call made by a worker node but executed locally by the
TM.  The call to in() would not return a tuple to the worker node, but
instead the result tuple would be retained by the TM for use in the
succeeding lines of code.  The entire operation of three lines of code
would be executed as a package at the TM, via a new Linda operation,
tmexec().

All of this would require just two network transactions, compared to
four network transactions required in the former example.  The first
transaction instructs the TM to execute the specified function locally,
and the second would consist of return data from the TM to the
requesting client (in this case Row). As a result, overhead attributed
to network latency is reduced, in this example, by a factor of two.
Potential gains could be much more significant, depending primarily upon
the number of tuple operations offloaded to the TM.  

Note that our approach differs from {\bf code shipping} \cite{gifford},
remote procedure call, and so on.  This is discussed further in Section
\ref{disc}.

\section{Implementation}

As currently implemented, all the nodes running an I-Tuples application
must share the same architecture and compiler.  Our model here is a
network of identical workstations.  The reason for this restriction is
that the current implementation is single-image, relying on the address
of a function (the virtual address, i.e. \&f for a function f) having
the same value on all nodes of the system.

Application programs in the current system are in Single Program
Multiple Data (SPMD) form.  The same program is executed by all nodes,
including the node serving as the TM.  In other words, node 0 serves as
the TM while the nodes numbered greater than 0 are workers for the
application, yet all nodes including node 0 run the same program.  The
function main() calls a library function UCTInit().  The latter senses
the node number on which the program is being run, and then either calls
a library function to start the TM (in the case of node 0) or starts the
user-written function Worker() which comprises the application code (in
the case of the nodes numbered greater than 0).  By having the TM run
the same program as the worker nodes, the TM has direct access to the
I-Tuples functions written by the user.

To enable the user to develop his/her own Linda operations, we added a
new function to the operation set in()/out()/rd(), called tmexec()
(Tuple Manager EXECute).  This allows an application client to specify a
function for the TM to execute locally, typically containing multiple
in()/out()/rd() calls.  This new operation has the following
specification:

\begin{verbatim}
  tmexec (int (*ftn_ptr)(), 
     char* in_args, int in_length, 
     char* out_args, 
     int* out_length_ptr);
\end{verbatim}

In this prototype, ftn\_ptr is the address of the client-supplied
function (recall that due to the SPMD nature of the I-Tuples
application program, the function itself is already there at the tuple
manager), consisting of in()/out()/rd() calls that are to be serviced
locally by the TM. The strings ``in\_args'' and ``out\_args'' are used
to pass arguments from/to the client application to/from the specified
function. The two integer parameters represent the lengths of the input
and output arguments, respectively. The user is responsible for
marshalling/unmarshalling the input and output arguments, allocating
space for them, and appropriately specifying their lengths.  As with the
other Linda operations, the call blocks until the TM responds.

The following example is a modification of the fetch-and-add code
presented earlier.  The user-specified function is FetchAndAdd(), and it
and its indirect call via tmexec() are as follows:

\begin{verbatim}
   char *Ins,*Outs;  
   int InL,OutL;  int Row;
   ...
   FetchAndAdd(char* InArgs, 
      int InLength, char* OutArgs, 
      int* OutLengthPtr)

   {  int R;

      in("lsi", InArgs, &R);
      memcpy(OutArgs, R, sizeof(int)); 
      *OutLengthPtr = sizeof(int);
      out("lsi", InArgs, R+1);
   }
   ...
   strcpy(InArgs,"next row");
   InL = strlen(InArgs);
   tmexec(FetchAndAdd, Ins, InL, 
      Outs, &OutL);
   memcpy(Row, Outs, sizeof(int));
\end{verbatim}

To implement the tmexec() operator, it was necessary to establish a
mechanism to differentiate between in()/out()/rd() calls that are to be
serviced remotely (i.e. requests from worker nodes, as in standard
Linda) and those made locally by the TM.  To facilitate this
distinction, the key string used to specify tuple fields for
in()/out()/rd() access operations has been slightly modified. The user
must now designate the context of the access operation by adding `r'
(for remote access operations) or `l' (for local access operations) as
the first letter of the first tuple component.  Again, the designation
``remote'' or ``local'' which indicates the source of the operation
request are defined from the point of view of the TM.  (An alternate
approach would be to have the TM infer remote/local state on its own.)

For example, consider the call

\begin{verbatim}
  in("rsipI", "hello", 5, &var, 
     IntArray);
\end{verbatim}

This example illustrates how a remote call might look.  Notice that the
first letter in the key indicates that this is a remote operation,
intended to be called from the user application, i.e. as in standard
Linda; alternatively, an 'l' would indicate that the specified operation
is part of a user-defined function, for local execution by the TM, as
seen in our FetchAndAdd() example above.  The remainder of the key
string states that the next field will be of string type, then one of
pointer type ({\bf *int} here), then one of {\bf int} type, and finally
an {\bf int} array.  (After the call, the next-to-last field will
contain the length of the array, filled in by the TM.)

\section{Experimental Evaluation}

The motivation behind I-Tuples is to reduce the amount of network
traffic required to interact with the tuple space, by batching tuple
space access operations from the client to the tuple manager.
Intuitively, one might expect that application performance will likely
improve as a result.  However, by transferring workload to the TM, we
are effectively sacrificing parallelism, as execution is serialized
within the TM. For small operations, this transferral will likely
improve performance, as the loss of parallelism is insignificant
compared to the performance increase associated with the reduction of
network traffic.  However, as more work is offloaded to the manager, the
performance will likely degrade as parallelism is reduced. We will
illustrate this behavior here.

\subsection{Test Application}

We are interested in the point at which a given application no longer
benefits from offloading work to the manager.  To explore this point of
interest, we employ the following performance tests.  For each test, we
measure the overall run-time of the manager, from the point at which the
manager is launched to the point at which the manager exits.  The
startup time for each client is not represented.  

It should be emphasized that we are interested here only in the impacts
on performance which result from mild to aggressive use of tmexec().
The programs here were not especially optimized, nor did we make
comparisons to sequential versions which did not use the Linda paradigm.

As our test case, we considered an implementation of Dijkstra's
algorithm to determine the shortest path in a connected graph.  We
perform two optimizations of this algorithm.  The first implemented a
user function to update the ``done'' flag for vertices in the graph.
This function simply retrieves a tuple for the specified vertex, updates
its value and returns the new value to the tuple space.  The second
optimization implements a user function to update the distance to every
other vertex after the vertex with the current shortest path has been
determined.  

The program was run on a 150-node graph, under four different
conditions: the original implementation without optimizations, with only
the Done-Update optimization, with only the Dist-Update optimization,
and with both user-functions implemented.

\subsection{Experimental Results}

Here are the results of our study of the performance of Dijkstra's
shortest path algorithm for various levels of optimization.

\begin{figure}[h]
\includegraphics[width=3.2in]{dij.eps}
\caption{}
\end{figure}

Figure 1 presents data corresponding to each variant of the Dijkstra
application.  The horizontal axis is the number of processors running
the application, i.e. excluding the processor running the TM.  For the
non-optimized environment the data indicates that performance improves
until we reach a client count of 4.  For more than 4 clients, we find a
small degradation in performance attributable to an increase in network
overhead. Data for the Done-Update optimization indicates a performance
advantage associated with reducing the network traffic in updating the
``done'' tuple for each vertex.  Similarly, data for the dist-update
optimization shows a greater performance increase over the non-optimized
version, and additional performance benefits over the done-update
version.  This is attributable to further reduced network traffic
associated with updating the ``Dist'' information for each vertex.  We
find here that utilizing both optimizations improves performance for the
Dijkstra example.  This is because the parallelism is not sacrificed as
the clients still have sufficient work to do.

The I-Tuples environment and all test programs were compiled using CC on
SGI O2 systems connected by an ordinary 10 Mb Ethernet.  

\section{Conclusions and Discussion}
\label{disc}

We have proposed an extension of the Linda programming paradigm which
enables the programmer to enhance performance by requesting the TM to
perform batches of tuple operations (together with associated ``glue''
code), rather than discretely.  Initial experimental results suggest
that significant performance speedups are possible.

A byproduct novelty of I-Tuples is that it enables the application
programmer to interact in a more active manner with the TM.  In standard
Linda, the TM is completely transparent.  In the NOW setting, for
example, the programmer does not know on which network node the TM
resides, and could not directly communicate with that node even if its
identity were known.

Subsequent to our development of I-Tuples, we became aware of a
similar-looking feature in TSpaces.  Unfortunately the TSpaces
documentation provides only an example of this feature, not giving the
general details of the functions used to set up user-defined operations.
Nevertheless, it appears to be a powerful feature, even allowing one to
redefine the basic operations such as wr().  On the other hand, this
comes at a price, with definition of new operations lacking the direct
simplicity and ease of use which we believe our tmexec() approach
provides.

Though I-Tuples may at first look like a form of remote procedure call
(RPC), tmexec() differs from the standard RPC paradigm, in that  the TM
is unseen and resides in an unknown location.  

The actions in I-Tuples may also remind the reader of {\bf code
shipping} \cite{gifford}, in which application program can specify the
dynamic migration of code to another designated node.  Yet I-Tuples is
different, in that here the application code at the worker nodes is
communicating with an entity (the TM) whose node identity is unknown.
One could not, for example, implement tmexec() in JavaSpaces (unless
JavaSpaces itself were modified to expose the location of the TM and
give direct access to it).

However, even though I-Tuples differs conceptually from the notion of
code shipping, it would be interesting to merge the I-Tuples concept
here with the concept of code shipping.  For example, MAGE, a current
research project of one of the authors \cite{raju}, involves a highly
flexible framework for code mobility.  The MAGE framework has recently
been generalized to include virtual nodes, in the following sense.  Each
virtual node number has associated code which processes messages and
communicates with unseen entities.  That code would be programmed as a
low-level application, while code such as I-Tuples would be considered a
high-level application.  In this case the unseen entity would be the TM,
but many other applications are possible, such as redirection of Web
access requests.  In other words, this scheme would extend the notions
of code shipping, RPC and so on to include a form of information hiding,
in this case the location of the TM.

\begin{thebibliography}{}

\bibitem{almasi}

George Almasi and Allan Gottlieb.  {\it Highly Parallel Computing},
Benjamin/Cummings, 1989.

\bibitem{anderson}

T. Anderson {\it et al}.  ``A Case for Networks of Workstations,''
{\it IEEE Micro}, 54-64, February 1995.

\bibitem{raju}

Earl Barr and Raju Pandey.  ``MAGE: A Distributed Programming Model,''
{\it Proceedings of the International Conference on Distributed
Computing Systems}, 2001, Mesa, Arizona.        

\bibitem{p4}

Ralph Butler, {\it et al}.  {\it p4-Linda: A Portable Implementation of
Linda}, ftp://info.mcs.anl.gov/pub/p4.   

\bibitem{carreria}

J. Carreria, L. Silva and J. Silva.  ``On the Design of Eilean"  A
Linda-Like Library for MPI.''

\bibitem{linda}  

Nicholas Carriero and David Gelernter. {\it How to Write Parallel
Programs: a First Course}, MIT Press, 1990.

% \bibitem{ship}

% Tzi-cker Chiueh and Manish Verma. ``A Compiler-Directed Distributed 
% Shared Memory System,'' {\it Proceedings of the International Conference
% on Supercomputing}, 1995. 

\bibitem{uctuplets}

Michael Foster {\it et al}.  UCTuplets Parallel Processing Package,  
http://heather.cs.ucdavis.edu/uct.html

\bibitem{javaspaces}

Eric Freeman {\it et al}.  {\it JavaSpaces(TM) Principles, Patterns 
and Practice}, Addison-Wesley, 1999.

\bibitem{rowstron}

A. Rowstron.  {\it Bulk Primitives in Linda Run-Time Systems}, Ph.D.
thesis, University of York, 1996.

\bibitem{gifford}

J. Stamos and D. Gifford.  ``Remote Evaluation,'' {\it ACM Transactions
on Programming Languages and Systems}, 12(4):537-565, October 1990.

\bibitem{beowulf}

T. Sterling {\it et al}.  {\it How to Build a Beowulf: a Guide to the
Implementation and Application of PC Clusters}, MIT Press, 1999.

\bibitem{wc95}

G. Wells and A. Chalmers.  ``An Extended Linda System Using PVM, {\it
PVM Users' Group Meeting}, Pittsburgh, 1995.

\bibitem{ibm}

P. Wyckoff {\it et al}.  ``TSpaces,'' {\it IBM Systems Journal}, 37, 3,
1998.
 
\end{thebibliography}
 

\end{document}


