(DRAFT) Issues on the implementation of PrOgramming SYstem for
distriButed appLications
A free Linda implementation for Unix Networks
G. Schoinas(sxoinas@csd.uch.gr or sxoinas@csi.forth.gr)
Department of Computer Science
University of Crete
This is a report on an implementation of a Linda system on Unix networks.
It is intended mainly as a documentation guide
for anyone who decides to make changes to the code.
It can also be read as a small introduction to the Linda model
and to some of the issues that arise when one comes to realize the model.
In section (2) an overview of the Linda model as defined in [CarGel89],
[CarGel89c] will be presented.
In section (3) a description of the POSYBL will be given.
In section (4) the implementation details of our system will be presented.
In section (5) a performance analysis for our system will be made.
Finally, in section (6) some benchmark results from our
system will be presented.
Linda is a parallel programming model that was proposed by David
Gelernter to solve the problem of programming parallel machines.
Tuple space is Linda's mechanism for creating and coordinating
multiple execution threads. Tuple space is a bag of tuples,
where a tuple is simply a sequence of typed fields. Linda
provides operators for dropping tuples into the bag,
removing tuples out of the bag and reading them without
removing them. To find a particular tuple, we use
associative look-up; that is, tuples are selected for removal
or reading on the basis of any combination of their field values.
Task creation in Linda is done by live tuples.
Live tuples are tuples whose fields are not evaluated until
after the tuple enters the tuple space. When a live tuple
is dropped in the tuple space it is evaluated independently,
and in parallel, with the task that dropped it in.
When it's done evaluating, it turns into an ordinary data tuple
that can be read or removed like any other tuple.
The Linda model is embedded in a computation language (C, Lisp, etc.)
and the result is a parallel programming language.
The Linda model defines four operations on tuple space.
These are:
- 1.
- out(t);
Causes tuple t to be added to the tuple space.
- 2.
- in(s);
Causes some tuple t that matches the template s to
be withdrawn from the tuple space. The values of the actuals of
t are assigned to the formals of s and the executing process
continues. If no matching t is available when in(s) executes,
the executing process suspends until one is. If many matching t's
are available, one is chosen arbitrarily.
- 3.
- rd(s);
Its operation is the same as in(s), except that the matching tuple
is not withdrawn from the tuple space.
- 4.
- eval(t);
Causes tuple t to be added to the tuple space but t is evaluated
after rather than before it enters the tuple space. A new process
is created to perform the evaluation.
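To make the definitions above concrete, here is a tiny master/worker fragment
written in the C-Linda notation used in [CarGel89], where a leading '?' marks a
formal field. Note that this is the notation of the precompiler-based systems
discussed below, not the POSYBL interface described later, and the names in it
are arbitrary.

  int worker(int id)
  {
      int limit;
      rd("task limit", ?limit);      /* read a tuple without removing it */
      out("result", id, id * id);    /* report back with a data tuple */
      return 0;                      /* the live tuple turns into ("worker", 0) */
  }

  int main(void)
  {
      int sum;
      eval("worker", worker(7));     /* live tuple: worker(7) runs in parallel */
      out("task limit", 100);        /* an ordinary data tuple */
      in("result", 7, ?sum);         /* block until the worker's answer arrives */
      return 0;
  }

After the live tuple finishes evaluating, the tuple ("worker", 0) sits in the
tuple space like any other data tuple.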
The usual way to create a Linda system is to implement a
precompiler that analyzes the source code, decides the
type of each field in the operations and inserts the
necessary code to implement the operations [CarGel89b].
This approach was adopted by the Linda group at Yale (their system is
marketed by SCA, Inc.) and by J. Leighter on VMS/VAXes [Lei89]
(that system is marketed by LWR Systems).
Although the trend today is towards Linda implementations based
on compiler technology, the Linda model was also used at the kernel
level of a distributed Unix-like operating system for a board of
transputers by Wm. Leler [Lel90], and the result was an operating system
that, at least judging from [Lel90], seems to be an item of rare beauty.
Unfortunately, that system does not seem to be available yet, commercially
or in any other way.
The development time for this system had to stay small, since it
is a by-product of what I was doing last semester.
So, no effort was made to build a fancy Linda system that would include
a Linda preprocessor/compiler, a tuple space inspection program
or even specialized debugging facilities. As a result,
some extra effort is needed from the programmer in order to develop
his applications. In fact, the first versions of this
system were really unusable by anyone except myself.
Since then, and with the help of some friends who
offered to experiment with the system, a more or less
usable system has evolved. Of course, much is still needed
to raise its quality to that of the commercial systems that are available.
The POSYBL consists of a daemon, which is supposed to run on every workstation
on which we want to execute processes and which maintains the tuple space,
and a library of calls that implement the Linda operations.
All the user has to do is link his object files with the POSYBL
library and start the daemons on the workstations.
The main changes of the POSYBL from
the original Linda model are:
- 1.
- Since there isn't a preprocessor to analyze the source and
find the type and size of the variables that participate
in a Linda call, the user has to provide this information
by using simple description functions.
For the same reason,
the system cannot determine which tuple fields to use as a hash key, so
the convention that the first field of a tuple must always be an
actual is introduced.
- 2.
- For the same reason as above, the implementation and semantics of
eval have changed considerably. In eval, you don't specify
a function to be called but rather an executable file to be
evaluated.
- 3.
- The predicate function calls rdp and inp were not implemented,
since I don't find them very useful.
With a slow communication medium such as an Ethernet network,
polling techniques are the next closest thing to disaster.
- 4.
- Formal fields are not allowed in the out operation. The reason
is purely a matter of taste.
The run-time system consists of a bunch of daemon processes, called
Tuple Managers, that maintain the tuple space, and another bunch of processes,
called a POSYBL Program, that execute the user code and connect to
the Tuple Managers to issue operations on the tuple space.
The Tuple Managers, which are, as their name suggests, the ones that provide
the illusion of a tuple space, are RPC servers that wait for requests.
A user can concurrently run as many distributed POSYBL Programs as he wishes.
The only limit on the number of processes is that a Tuple Manager
can connect to at most 25 processes, since each connection consumes
a file descriptor and the RPC package can only support servers with up to
32 open file descriptors. Two solutions exist to work around this
problem. The first is to modify the RPC package to work with as many
file descriptors as Unix allows, but that is only 64. The
other is to use datagram connections between the local Tuple Manager
and the user processes, but this means that either we will have to do
without large tuples, or a sequencing scheme will have to be introduced.
The POSYBL supports a small but adequate set of predefined
types. Two kinds of data are handled: scalar types (int, char, etc.) and vector
types (arrays of scalars).
The scalar types are char, short, int, float,
double and the NULL-terminated string.
For all but the NULL-terminated string type there exists a corresponding
vector type.
For more complex, user-defined types, some simple macros are provided that
transparently treat them as a vector of chars.
The user doesn't have to program the actual calls to the
Tuple Managers, since a much more flexible interface is provided.
The user sees five calls that are
considered Linda operations, plus some field description and data copying
functions.
- 1.
- init(RunGroup, Nodefile);
This function must be used when the user wants
to run more than one POSYBL Program concurrently.
Each POSYBL Program consists of a bunch of processes and in order
for them to be able to know their mates, they must all provide
to the Tuple Manager the same id. The RunGroup argument of init can be any
integer. The only requirement is that it is different from any other
integer that is supplied by the user to other POSYBL Programs that
run at the same time. If the init function is not called the
program is said to belong to RunGroup 0. Even in this case
two POSYBL Programs of the same user can run together as long as their
tuples don't interfere.
The Nodefile argument is the name of a file that contains the
nodes on which we want to execute remote processes. If it is not
supplied, the routine first tries the '~/.nodefile' file and then
the './.nodefile' file. If it fails to locate
and load the node information, the program exits with an error message.
- 2.
- eval(); As has already been stated, this function is
quite different from the one in the original Linda model. Basically, it
is an 'rsh'-style invocation of an executable. It comes in four
different forms: eval_l(), eval_v(), eval_nl(), eval_nv(). Their
calling convention is similar to that of the Unix execl() and execv().
The first two start the remote command on the node with the
least load, provided the system is compiled with the DYNAMIC flag
defined and the host supports the getrusage RPC function (SunOS),
while with the last two you must specify the name of the node
(a rough usage sketch of init() and eval() is given after this list of calls).
- 3.
- out(); Writes a tuple to the tuple space.
- 4.
- in(); Deletes a tuple from the tuple space.
- 5.
- rd(); Reads a tuple from the tuple space.
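Before moving on to the field description functions, here is the promised
sketch of how a master program might call init() and eval(). The header name
"linda.h", the file names and the exact eval argument lists are assumptions
drawn from the execl()/execv() analogy above, not verified signatures.

  #include "linda.h"   /* assumed header name for the POSYBL library */

  int main(void)
  {
      /* join RunGroup 7 and load the node list from the file ./nodes */
      init(7, "./nodes");

      /* argument lists below are guessed from the execl()/execv() analogy */
      eval_l("worker", "worker", "1", (char *)0);            /* least loaded node */
      eval_nl("node2", "worker", "worker", "2", (char *)0);  /* explicitly named node */

      return 0;
  }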
Finally, the field description functions are listed below; the l... forms
describe actual fields, the ql... forms describe formal fields, and the
...n... forms describe vector fields:
- 1.
- l[type](value);
- 2.
- ln[type](value, number);
- 3.
- ql[type](&address);
- 4.
- qln[type](&address, &length);
Let's try some examples of the above calls:
out(lstring("demo_tuple"), lchar('a'), lshort(20), lint(1000), lfloat(1.1));
In this out call we export a tuple that has five fields.
The first, as mentioned before, is the key that will be used for hashing
by the Tuple Manager.
Fields 2-5 are (as we hope you can guess) a char, a short, an integer
and a float.
Another example, which uses a vector type for the second field, is:
out(lstring("demo_tuple_1"), lnchar("Hello World", strlen("Hello World")), lfloat(1.1));
We can access the above tuples with the following in() calls:
in(lstring("demo_tuple"), qlchar(&charvar), lshort(20), qlint(&intvar),
qlfloat(&realvar));
in(lstring("demo_tuple_1"), qlnchar(&charptr, &intvar), lfloat(1.1));
The last call changes the contents of charptr to make it point to the array
data and the contents of intvar to make it hold the length of the array.
The data are located in a static area
that is overwritten by the next call to one of the tuple accessing functions.
Some data copying functions, which are merely versions of memcpy, are provided
to copy the data to a user-defined area. These functions
are:
- 1.
- copychars(d,s,l);
- 2.
- copyshorts(d, s, l);
- 3.
- copyints(d, s, l);
- 4.
- copyfloats(d, s, l);
- 5.
- copydoubles(d, s, l);
All of these functions copy l data objects from the area pointed to
by s to the area pointed to by d.
They are not necessary, just convenient.
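Putting these calls together, the following fragment sketches how a
user-defined structure can be passed through the tuple space as a vector of
chars and how the returned data can be copied out of the library's static
area. The header name is an assumption, and the documented char vector
functions are used directly instead of the convenience macros mentioned
earlier; out(), in(), lstring(), lnchar(), qlnchar() and copychars() are the
calls described above.

  #include "linda.h"    /* assumed header name for the POSYBL library */

  struct point { double x, y; };

  void send_point(struct point *p)
  {
      /* ship the structure as an anonymous vector of chars */
      out(lstring("point tuple"), lnchar((char *)p, sizeof(struct point)));
  }

  void receive_point(struct point *p)
  {
      char *data;
      int   len;

      /* the returned vector lives in the library's static area ...       */
      in(lstring("point tuple"), qlnchar(&data, &len));
      /* ... so copy it into user memory before the next tuple space call */
      copychars((char *)p, data, len);
  }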
The requests to the Tuple Managers can be in, rd, out or template
requests. Each 'Tmanager' listens on specific sockets for specific kinds of
requests. There are four service sockets:
- 1.
- One for local calls; it is a Unix domain stream socket.
- 2.
- One for small tuples,
under 4Kb; it is a UDP socket. The same
socket is used for accepting system requests.
- 3.
- One for template requests under 1500 bytes;
it is a UDP socket.
The limit on the template size is due to the fact that templates are
broadcast on the Ethernet.
- 4.
- One for big tuples, larger than 4Kb;
it is a TCP socket.
As has been stated, there are four tuple operation calls.
There are also three 'system' calls. I will try to describe
all the available calls.
- 1.
- Tuple operation calls.
- (a)
- LindaOut.
The incoming tuple is checked to see whether someone, either on our node
or on another node, is waiting for such a tuple. First we check for
a matching template from the local node, and then for one from
remote nodes. A tuple is sent to a remote node by invoking
the remote node's LindaOut call with the tuple we received.
- (b)
- LindaIn.
The incoming template is matched against the tuples we store.
If a match is found, the data are returned to the requester.
If not, we keep a note of the request and broadcast a LindaTemplate
call. The broadcast is repeated periodically until we receive a
matching tuple.
- (c)
- LindaRead.
We do exactly the same as for the previous call. The only difference is
that when a tuple is found to match the template, it is not thrown away
after the data are sent but is kept for future accesses.
- (d)
- LindaTemplate.
When we receive this call, we check for a tuple that
matches the template. If none is found, we store the template
so that we can check it against incoming tuples. We keep
the template until a matching tuple is found or until it expires
after a predefined interval.
If an identical
template from the same process and node is already stored, we
just reset its expiration time. Matching tuples are returned by performing
a LindaOut call on the requesting node.
- 2.
- 'System' calls.
The user can issue all of the calls described below by using
the 'system' program.
- (a)
- SystemStats
Prints statistics for the Linda access calls.
- (b)
- SystemExit
If the uid under which the Tmanager runs is identical to the uid
of the caller, the Tmanager exits.
- (c)
- SystemReset
The hash tables for a specific (uid, rungroup) pair are destroyed,
so that any remaining tuples are removed.
In the previous section the operation of the system was described.
From that it is obvious that a minimal number of messages
traverse the network. Indeed, for one node to in() or rd()
a tuple that exists on another node, only three messages need to
pass through the network, provided that the tuple does exist on some
other node and that no message is lost. The first is the
broadcast of the template, the second is the out of the tuple
and the last is the acknowledgement of the receipt of the tuple.
As for local calls, only two messages are exchanged between
the Tuple Manager and the user process. But this low level
efficiency doesn't guarantee overall efficient operation of the
system, as we shall see here.
In parallel programs there are mainly four classes of data objects,
distinguished by their use. I will try to describe these classes
and what our system's expected performance is for each class.
- 1.
- Migratory objects. This class contains those data objects
that move from processor to processor but are never manipulated by
more than one processor at any time. They can easily be implemented
as Linda tuples that are read from or deleted from the tuple
space. Our system does well with this kind of object, since an
access operation requires only the broadcast of a Template message
from the requesting node and a LindaOut call from the owning node
to the requesting node.
- 2.
- Locks. This class contains those data objects
that are used for synchronization purposes by the processors.
From a performance point of view, they are probably the most dangerous objects
that can be found in a parallel application. But they can be implemented
very effectively with the Linda model, provided we are careful.
What we have to do is simply avoid polling techniques.
This means that we must request a tuple with a template
that only the right tuple can match (a small sketch of this pattern
is given after this list). Of course, frequent
updating of the locks results in competition for the tuples
that contain them, and thus in heavy traffic.
- 3.
- Mostly read objects. This class contains those data objects
that are expected to be read a large number of times by many processors
before they are disposed of or altered. Their implementation is, again,
straightforward. That is, we can implement them as tuples that
will be read from the tuple space most of the time and only rarely
be deleted with an in operation. The problem is that, since we don't
have any kind of caching scheme, we treat these objects
as migratory ones, resulting in unnecessary traffic. In the next
release of our system, a caching scheme to solve
exactly this problem will be incorporated. According to this
scheme, whenever a tuple is found to exhibit a reading
behavior (i.e. it travels from node to node through rd requests only),
we will start sending a duplicate of it to the requesters.
Of course, when an in request arrives for this tuple, it can only
be serviced by the node that decided to send out the duplicates.
This node will have to gather all the duplicates it has sent
before honoring the request. A more radical approach
would be to implement a scheme that would use the broadcast
facilities of the Ethernet for transmitting these tuples.
But the problem is that the Ethernet does not provide
reliable broadcasts and that the maximum size of a broadcast
packet is only 1500 bytes. Thus the resulting protocol
would be really hard to define and implement.
- 4.
- Frequently read and written objects. This class contains those data
objects that are both read and written frequently. In any kind of
shared memory architecture or system, this kind of object is problematic.
Indeed, there is not much we can do to avoid having them travel from
node to node, being deleted from the tuple space and being
reinserted again.
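As promised in the discussion of locks above, here is a minimal sketch of a
lock built directly on in() and out(). The tuple name and the header name are
arbitrary choices for this sketch, but the point is the one made in the text:
the acquiring process blocks inside in() instead of polling.

  #include "linda.h"     /* assumed header name for the POSYBL library */

  /* exactly one ("my lock") tuple ever exists in the tuple space */
  void lock_init(void)
  {
      out(lstring("my lock"), lint(1));
  }

  void lock_acquire(void)
  {
      int token;
      /* in() suspends until the lock tuple is available: no polling */
      in(lstring("my lock"), qlint(&token));
  }

  void lock_release(void)
  {
      /* put the lock tuple back so that the next in() can match it */
      out(lstring("my lock"), lint(1));
  }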
Various low-level benchmarks were executed in order to get an idea
of the expected cost of some primitive operations in our system.
The system has been tested on Sun3 and Sun4 workstations with
SunOS 4.0 and on DECstation 3100 workstations with Ultrix 2.1, and
the results were the following:
- 1.
- in(), out() and rd() of the smallest possible tuple on the same node.
The experiment was to (i) send to the local Tuple Manager a number of identical
tuples, (ii) read them a number of times and (iii) remove them from the tuple
space (a sketch of such a timing loop is given after this list).
The results showed us that the amount of time each operation took was
the same for all operations.
This time was 3 - 3.5 msec/call on Sun4s and DECstations, while it was
6.2 - 7.1 msec/call on Sun3s.
About 70% of that time is spent inside system calls (recv, send, etc.).
This means that there is not much we can do to decrease it,
as long as we don't adopt a different interprocess communication
mechanism. Of the remaining 30%, most of the time is spent in handling
the event queue of the RPC package. This queue is a simple priority
queue and operations on it become really inefficient when its
size increases. Despite our efforts to optimize it with the use of a
last-reference pointer (which made it 30% faster) and of a free list for
the allocation of event structures, when things get hard its performance
becomes unacceptable.
The proper solution would be to implement a radically different data structure
for keeping the events with less overhead (maybe indexing mechanisms).
- 2.
- in() and out() of the smallest possible tuple between two nodes.
The algorithm used to request a tuple from another node involves three RPC
calls instead of the one that is needed when the tuple is found in the local
Tuple Manager. Thus we expect this operation to be at least three times
slower, which would be 9 - 10 msec/call on Sun4s and DECstations.
This is verified by our results: the actual value was 12 - 14 msec/call,
where the extra time should be attributed to the overhead of the Ethernet.
But when we issue successive requests another
problem is encountered.
While the first calls complete fast, the following ones seem to take more time,
until at some later point the system stabilizes. This is due to the fact that
the responses to the LindaOut RPC calls from one node to the other are cached
for a specific interval (60 sec).
Thus the time to complete an operation (which depends on the cache size)
can become 24 msec/call on Sun4s and SPARCstations.
- 3.
- in() of a large tuple with the smallest possible
key, located on a different node than the one from which the request is made.
This experiment was meant to show the effect of using TCP sockets for sending
big tuples from node to node.
We expected the time to complete an operation to rise if the tuples could
not be sent through UDP sockets.
Indeed, the results showed a time of 43 msec/call on Sun4s and DECstations for
a tuple of size 4000 bytes (UDP protocol) and 46 msec/call for a tuple of size
4096 bytes (TCP protocol).
Thus the overhead doesn't seem to be too high.
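For reference, the first experiment above amounts to timing loops of roughly
the following form. The loop count, tuple name and header name are arbitrary
choices for this sketch.

  #include <stdio.h>
  #include <sys/time.h>
  #include "linda.h"                     /* assumed header name */

  #define N 1000

  int main(void)
  {
      struct timeval t0, t1;
      int i, v;

      gettimeofday(&t0, NULL);
      for (i = 0; i < N; i++)            /* (i) out N identical small tuples */
          out(lstring("bench"), lint(1));
      for (i = 0; i < N; i++)            /* (ii) rd them back N times */
          rd(lstring("bench"), qlint(&v));
      for (i = 0; i < N; i++)            /* (iii) remove them again */
          in(lstring("bench"), qlint(&v));
      gettimeofday(&t1, NULL);

      printf("%.2f msec/call\n",
             ((t1.tv_sec - t0.tv_sec) * 1e3 +
              (t1.tv_usec - t0.tv_usec) / 1e3) / (3.0 * N));
      return 0;
  }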
I would like to thank everybody who helped me build this system and
especially I. Kavaklis (kavaklis@csi.forth.gr, kavaklis@csd.uch.gr)
for his contributions in ideas and pieces of code, and for using the system.
Footnotes
- ...
- Linda is a trademark of Scientific Computing Associates, Inc.