(DRAFT) Issues on the implementation of PrOgramming SYstem for distriButed appLications
A free Linda implementation for Unix Networks
G. Schoinas(sxoinas@csd.uch.gr or sxoinas@csi.forth.gr)
Department of Computer Science
University of Crete

Introduction

This is a report on an implementation of a Linda system on a Unix network. It is intended mainly as a documentation guide for someone who decides to make changes to the code. You can also consider it a small introduction to the Linda model and some of the issues that arise when you come to implement the model.

In section (2) an overview of the Linda model as defined in [CarGel89], [CarGel89c] will be presented. In section (3) a description of the POSYBL will be given. In section (4) the implementation details of our system will be presented. In section (5) a performance analysis for our system will be made. Finally, in section (6) some benchmark results from our system will be presented.

Description of the Linda Model

Linda is a parallel programming model proposed by David Gelernter to address the problem of programming parallel machines.

Tuple space is Linda's mechanism for creating and coordinating multiple execution threads. Tuple space is a bag of tuples, where a tuple is simply a sequence of typed fields. Linda provides operators for dropping tuples into the bag, removing tuples from the bag and reading them without removing them. To find a particular tuple, we use associative look-up; that is, tuples are selected for removal or reading on the basis of any combination of their field values. Task creation in Linda is done by live tuples. Live tuples are tuples whose fields are not evaluated until after the tuple enters the tuple space. When a live tuple is dropped into the tuple space it is evaluated independently, and in parallel, with the task that dropped it in. When it is done evaluating, it turns into an ordinary data tuple that can be read or removed like any other tuple.

The Linda model is embedded in a computation language (C, Lisp, etc.) and the result is a parallel programming language. The Linda model defines four operations on tuple space (a small illustrative fragment follows the list). These are:

1.
out(t); Causes tuple t to be added to the tuple space.
2.
in(s); Causes some tuple t that matches the template s to be withdrawn from the tuple space. The values of the actuals of t are assigned to the formals of s and the executing process continues. If no matching t is available when in(s) executes, the executing process suspends until one is. If many matching t's are available, one is chosen arbitrarily.
3.
rd(s); Its operation is the same as in(s), except that the matching tuple is not withdrawn from the tuple space.
4.
eval(t); Causes tuple t to be added to the tuple space but t is evaluated after rather than before it enters the tuple space. A new process is created to perform the evaluation.
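To make the model concrete, here is a small illustrative fragment in the classic C-Linda notation of [CarGel89]. Note that it needs a Linda precompiler, it is not the POSYBL syntax described later, and the tuple names and the division of work are chosen only for this example:

  real_main()
  {
      int i, val, sum = 0;

      for (i = 0; i < 10; i++)
          eval("worker", worker(i));     /* drop ten live tuples into the tuple space */
      for (i = 0; i < 10; i++) {
          in("result", ? val);           /* withdraw one result tuple; blocks if none exists yet */
          sum += val;
      }
  }

  int worker(i)
  int i;
  {
      out("result", i * i);              /* deposit an ordinary data tuple with the result */
      return 0;                          /* the live tuple now turns into ("worker", 0) */
  }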

The usual way to create a Linda system is to implement a precompiler that analyzes the source code, determines the type of each field in the operations and inserts the necessary code to implement the operations [CarGel89b]. This approach was adopted by the Linda group at Yale (this system is marketed by SCA, Inc.) and by J. Leighter on VMS/VAXes [Lei89] (this system is marketed by LWR Systems). Although the trend today is towards Linda implementations based on compiler technology, the Linda model was also used at the kernel level of a distributed Unix-like operating system for a board of transputers by Wm. Leler [Lel90], and the result was an operating system that, at least judging from [Lel90], seems to be an item of rare beauty. Unfortunately, their system does not yet seem to be available, commercially or otherwise.

Description of the POSYBL

Language Changes in our system

The development time for this system had to stay small, since it is a by-product of what I was doing last semester. So, no effort was made to build a fancy Linda system that would include a Linda preprocessor/compiler, a tuple space inspection program or even specialized debugging facilities. This results in some extra effort needed by the programmer in order to develop his applications. In fact, the first versions of this system were really unusable by anyone except myself. Since then, with the help of some friends who offered to experiment with the system, a more or less usable system has evolved. Of course, much is still needed to raise its quality to that of the commercial systems that are available.

The POSYBL consists of a daemon, which is supposed to run on every workstation where we want to execute processes and which maintains the tuple space, and a library of calls to the Linda operations. All the user has to do is link his objects with the POSYBL library and start the daemons on the workstations.

The main changes of the POSYBL from the original Linda model are:

1.
Since there is no preprocessor to analyze the source and find the type and size of the variables that participate in a Linda call, the user has to provide this information by using simple description functions. For the same reason, the system cannot determine which tuple fields to use as a hash key, so the convention that the first field of a tuple must always be an actual is introduced.
2.
For the same reason as above, the implementation and semantics of eval have changed considerably. In eval, you don't specify a function to be called but rather an executable file to be evaluated.
3.
The predicate function calls rdp and inp were not implemented, since I don't find them very useful. With a slow communication medium such as an Ethernet network, polling techniques are the next closest thing to disaster.
4.
Formal fields are not allowed in the out operation. The reason is purely a matter of taste.

Overview of the system

The run-time system consists of a bunch of daemon processes, called Tuple Managers, that maintain the tuple space, and another bunch of processes, constituting the POSYBL program, that execute the user code and connect to the Tuple Managers to issue operations on the tuple space. The Tuple Managers, which as their name suggests are the ones that provide the illusion of a tuple space, are RPC servers that wait for requests.

A user can run concurrently as many distributed POSYBL programs as he wishes. The only limit on the number of processes is that a Tuple Manager can connect to at most 25 processes, since each connection consumes a file descriptor and the RPC package can only support servers with up to 32 open file descriptors. To work around this problem two solutions exist. The first is to modify the RPC package to work with as many file descriptors as Unix allows, but that is only 64. The other is to use datagram connections between the local Tuple Manager and the user processes, but this means that either we will have to do without large tuples or a sequencing scheme will have to be introduced.

The POSYBL supports a small but adequate set of predefined types. Two kinds of data are handled: scalar types (int, char, etc.) and vector types (arrays of scalars). The scalar types are char, short, int, float, double and NULL-terminated string. For all but the NULL-terminated string type there exists a corresponding vector type. For more complex, user-defined types some simple macros are provided that transparently treat them as a vector of chars.

The user doesn't have to program the actual calls to the Tuple Managers, since a much more flexible interface is used. The user knows about five calls that are considered Linda operations and about some field description and data copying functions.

1.
init(RunGroup, Nodefile); This function must be used when the user wants to run more than one POSYBL program concurrently. Each POSYBL program consists of a bunch of processes and, in order for them to be able to know their mates, they must all provide the same id to the Tuple Manager. The RunGroup argument of init can be any integer. The only requirement is that it is different from any other integer supplied by the user to other POSYBL programs that run at the same time. If the init function is not called, the program is said to belong to RunGroup 0. Even in this case two POSYBL programs of the same user can run together as long as their tuples don't interfere. The Nodefile argument is the name of a file that contains the nodes on which we want to execute remote processes. If it is not supplied, the routine first tries the '~/.nodefile' and then the './.nodefile' files. If it fails to locate and load the node information, the program exits with an error message.
2.
eval(); As has already been stated, this function is quite different from the original Linda model. Basically, it is an 'rsh' invocation of an executable. It comes in four different forms: eval_l(), eval_v(), eval_nl(), eval_nv(). Their calling conventions are similar to those of the Unix execl() and execv(). The first two start the remote command at the node with the least load, if the system is compiled with the DYNAMIC flag defined and the host supports the getrusage RPC function (SunOS), while with the last two you must specify the name of the node (a usage sketch follows this list).
3.
out(); Writes a tuple to the tuple space.
4.
in(); Deletes a tuple from the tuple space.
5.
rd(); Reads a tuple from the tuple space.
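For instance, a master process might start its workers as sketched below. This is only a sketch: the RunGroup value, the file and node names, and the exact argument conventions of eval_l()/eval_nl() are assumptions based on the execl()/execv() analogy above, so consult the library header for the precise signatures.

  init(7, ".nodefile");                                /* join RunGroup 7, load the node list       */

  /* start a worker on the least loaded node, execl()-style argument list */
  eval_l("/home/user/bin/worker", "worker", "1", (char *)0);

  /* start a worker on an explicitly named node */
  eval_nl("node2", "/home/user/bin/worker", "worker", "2", (char *)0);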

Finally, the field description functions are:

1.
l[type](value);
2.
ln[type](value, number);
3.
ql[type](&address);
4.
qln[type](&address, &length);

Let's try some examples of the above calls:

out(lstring("demo_tuple"), lchar('a'), lshort(20), lint(1000), lfloat(1.1));

With this out call we export a tuple that has five fields. The first, as mentioned before, is the key that will be used for hashing by the tuple manager. Fields 2-5 are (as we hope you can guess) a char, a short, an integer and a float. Another example, which uses a vector type for the second field, is:

out(lstring("demo_tuple_1"), lnchar("Hello World", strlen("Hello World")), lfloat(1.1));

We can access the above tuples with the following in() calls:

in(lstring("demo_tuple"), qlchar(&charvar), lshort(20), qlint(&intvar), 
qlfloat(&realvar));

in(lstring("demo_tuple_1"), qlnchar(&charptr, &intvar), lfloat(1.1));

The last call changes the contents of charptr to make it point to the array data and the contents of intvar to make it hold the length of the array. The data are located in a static area that is overwritten by the next call to one of the tuple accessing functions. Some data copying functions are provided to copy them to a user-defined area; they are merely versions of memcpy. These functions are:

1.
copychars(d,s,l);
2.
copyshorts(d, s, l);
3.
copyints(d, s, l);
4.
copyfloats(d, s, l);
5.
copydoubles(d, s, l);
All of these functions copy l data objects from the area pointed to by s to the area pointed to by d. They are not necessary, merely convenient; the sketch below shows a typical use.
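As an example of their use, and of the vector-of-chars treatment of user-defined types mentioned earlier, the following sketch passes a small structure through the tuple space (the structure, the tuple name and the variable names are invented for the illustration):

  struct point { double x, y; } p = { 1.0, 2.0 }, q;
  char *charptr;
  int  len;

  /* export the structure as a vector of chars */
  out(lstring("point_tuple"), lnchar((char *)&p, sizeof(p)));

  /* retrieve it; charptr points into the library's static area */
  in(lstring("point_tuple"), qlnchar(&charptr, &len));

  /* copy it out of the static area before the next tuple access overwrites it */
  copychars((char *)&q, charptr, len);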

Implementation of the System

The requests to the Tuple Managers can be in, rd, out or template requests. Each 'Tmanager' listens on specific sockets for specific requests. There are four service sockets (a conceptual dispatch sketch follows the list):

1.
One for local calls; it is a Unix domain stream socket.
2.
One for small tuples under 4Kb; it is a UDP socket. The same socket is used for accepting system requests.
3.
One for template requests under 1500 bytes; it is a UDP socket. The limit on the template size is due to the fact that templates are broadcast on the Ethernet.
4.
One for big tuples larger than 4Kb; it is a TCP socket.
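Conceptually, a sender picks which of the four Tuple Manager sockets to address roughly as sketched below; the function names and the exact decision logic are hypothetical and are shown only to summarize the socket layout above:

  if (same_node)
      send_unix_stream(req, size);        /* local calls                      */
  else if (is_template)                   /* templates are at most 1500 bytes */
      broadcast_udp(req, size);           /* broadcast on the Ethernet        */
  else if (size <= 4096)
      send_udp(req, size);                /* small tuples and system requests */
  else
      send_tcp(req, size);                /* big tuples                       */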

As has been stated, there are four tuple operation calls. There are also three 'system' calls. I will try to describe all the available calls; a conceptual sketch of the template matching step follows the list.

1.
Tuple operation calls.
(a)
LindaOut. The incoming tuple is checked to see whether someone, either at our node or at another node, waits for such a tuple. First, we check for a matching template from the local node and then from remote nodes. A tuple is sent to a remote node by calling the remote node's LindaOut call with the tuple we received.
(b)
LindaIn. The incoming template is matched against the tuples we store. If a match is found, the data are returned to the requester. If not, we keep a note of the request and we broadcast a LindaTemplate call. The broadcast is repeated periodically until we receive a matching tuple.
(c)
LindaRead. We do exactly the same as in the previous call. The only difference is that when a tuple is found to match the template, it is not thrown away after the data are sent but rather stored for future accesses.
(d)
LindaTemplate. When we receive this call we check for a tuple that matches the template. If none is found, we store the template so that we can check it against incoming tuples. We keep the template until a matching tuple is found or until it expires after a predefined interval. If an identical template from the same process and node is already stored, we just reset its expiration time. The tuples are returned by performing a LindaOut call on the requesting node.
2.
'System' calls. The user can issue all the calls described below through the 'system' program.
(a)
SystemStats. Prints statistics for the Linda access calls.
(b)
SystemExit. If the uid under which the tmanager runs is identical to the uid of the caller, the daemon exits.
(c)
SystemReset. The hash tables for a specific (uid, rungroup) pair are destroyed, so that any remaining tuples are discarded.
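To give an idea of what 'matching' means in the calls above, the following is a conceptual sketch of how a stored tuple could be checked against a template. It is not the actual POSYBL code, and the type and structure names are invented for the illustration; the caller is assumed to have already verified that the tuple and the template have the same number of fields.

  #include <string.h>

  typedef struct {
      int   type;       /* CHAR, SHORT, INT, FLOAT, DOUBLE, STRING or a vector type */
      int   formal;     /* 1 if the field is a formal (wildcard), 0 if an actual    */
      int   len;        /* length in bytes of the data, for actual fields           */
      char *data;       /* the value bytes, for actual fields                       */
  } Field;

  /* field 0 is always an actual and serves as the hash key */
  int matches(Field *tuple, Field *templ, int nfields)
  {
      int i;

      for (i = 0; i < nfields; i++) {
          if (tuple[i].type != templ[i].type)
              return 0;                             /* types must be identical      */
          if (templ[i].formal)
              continue;                             /* a formal matches any value   */
          if (tuple[i].len != templ[i].len ||
              memcmp(tuple[i].data, templ[i].data, tuple[i].len) != 0)
              return 0;                             /* actuals must be equal        */
      }
      return 1;
  }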

Performance Analysis

In the previous section the operation of the system was described. From that it is obvious that a minimum number of messages traverses the network. Indeed, for one node to in() or rd() a tuple that exists on another node, only three messages need to pass through the network, provided that the tuple exists on some other node and that no message is lost. The first is the broadcast of the template, the second is the out of the tuple and the last is the acknowledgement of the receipt of the tuple. As for local calls, only two messages are exchanged between the Tuple Manager and the user process. But this low-level efficiency doesn't guarantee overall efficient operation of the system, as we shall see here.

In parallel programs there are mainly four classes of data objects, distinguished by their use. I will try to describe these classes and what our system's expected performance is for each class.

1.
Migratory objects. This class contains those data objects that move from processor to processor but are never manipulated by more than one processor at any time. They can be easily implemented by Linda tuples that will be read or deleted from the tuple space. Our system does well with this kind of object, since an access operation requires just the broadcast of a Template message from the requester node and a LindaOut call from the owner node to the requester node.
2.
Locks. This class contains those data objects that are used for synchronization purposes by the processors. They are probably the most dangerous objects, from a performance point of view, that can be found in a parallel application. But they can be implemented very effectively with the Linda model, provided we are careful (see the sketch after this list). What we have to do is simply avoid polling techniques. This means that we must request a tuple with such a template that only the right tuple can match it. Of course, frequent updates of the locks result in competition for the tuple that contains them and thus in heavy traffic.

3.
Mostly-read objects. This class contains those data objects that are supposed to be read a large number of times by many processors before they are disposed of or altered. Their implementation is, again, straightforward. That is, we can implement them by tuples that will be read from the tuple space most of the time and only a few times will be deleted with an in operation. The problem is that, since we don't have any kind of caching scheme, we treat these objects as migratory ones, resulting in unnecessary traffic. In the next release of our system, a caching scheme to solve exactly this problem will be incorporated. According to this scheme, whenever a tuple is found to exhibit a reading behavior (i.e. it travels from node to node only through rd requests), we will start sending a duplicate of it to the requesters. Of course, when an in request comes for this tuple, it can only be serviced by the node that decided to send out the duplicates. This node will have to gather all the duplicates it has sent out before honoring the request. A more radical approach would be to implement a scheme that would use the broadcast facilities of the Ethernet for transmitting these tuples. But the problem is that the Ethernet does not provide reliable broadcasts and that the maximum size of a broadcast packet is only 1500 bytes. Thus the resulting protocol would be really hard to define and implement.

4.
Frequently read and written objects. This class contains those data objects that are frequently read and written. In any kind of shared memory architecture or system, this kind of object is problematic. Indeed, we cannot do much to avoid having them travel from node to node, being deleted from the tuple space and being reinserted again.
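As an illustration of the lock discipline suggested for class 2 above, a critical section can be protected by a lock tuple as sketched below. The tuple name is chosen only for the example, and some process must out() the lock tuple once at start-up; because the in() blocks until the lock tuple is present, no polling is involved.

  in(lstring("lock:table"));              /* take the lock; blocks until the lock tuple is present */

  /* ... critical section: read and update the shared tuples the lock protects ... */

  out(lstring("lock:table"));             /* put the lock tuple back, i.e. release the lock        */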

Benchmarks

Various low-level benchmarks were executed in order to get an idea of the expected cost of some primitive operations in our system. The system has been tested on Sun3 and Sun4 workstations with SunOS 4.0 and on DECstation 3100 workstations with Ultrix 2.1, and the results were the following:

1.
in(), out() and rd() of the smallest possible tuple from the same node. The experiment was to (i) send to the local Tuple Manager a number of identical tuples, (ii) read them a number of times and (iii) remove them from the tuple space. The results showed that the amount of time each operation took was the same for all operations. This time was 3 - 3.5 msec/call on Sun4s and DECstations, while it was 6.2 - 7.1 msec/call on Sun3s. About 70% of that time is spent inside system calls (recv, send, etc.). This means that there is not much we can do to decrease it, as long as we don't adopt a different interprocess communication mechanism. Of the remaining 30%, most of the time is spent handling the event queue of the RPC package. This queue is a simple priority queue and the operations on it become really inefficient as its size increases. Despite our effort to optimize it with the use of a last-reference pointer (which made it 30% faster) and of a free list for the allocation of event structures, when things get hard its performance becomes unacceptable. The proper solution would be to implement a radically different data structure for keeping the events with less overhead (maybe with indexing mechanisms).
2.
in() and out() of the smallest possible tuple from two nodes. The algorithm used to request a tuple from another node involves three RPC calls instead of the one that is needed when a tuple is found in the local Tuple Manager. Thus we expected this operation to be at least three times slower, that is 9 - 10 msec/call on Sun4s and DECstations, and this is roughly verified by our results: the actual value was 12 - 14 msec/call, with the difference accounted for by the overhead of the Ethernet. But when we issue successive requests another problem is encountered. While the first calls complete quickly, the following ones seem to take more time, until at some later point the system stabilizes. This is due to the fact that the responses to the LindaOut RPC calls from one node to the other are cached for a specific interval (60 sec). Thus the time to complete an operation (which depends on the cache size) can reach 24 msec/call on Sun4s and SPARCstations.
3.
in() of a large tuple with the smallest possible key, located in a different node than the one from which the request is made. This experiment was meant to show the effect of using TCP sockets for sending big tuples from node to node. We expected the time to complete an operation to rise when the tuples could not be sent through UDP sockets. Indeed, the results showed a time of 43 msec/call on Sun4s and DECstations for a tuple of size 4000 bytes (UDP protocol) and 46 msec/call for a tuple of size 4096 bytes (TCP protocol). Thus the overhead doesn't seem to be too large.

Acknowledgements

I would like to thank everybody who helped me build this system and especially I. Kavaklis (kavaklis@csi.forth.gr, kavaklis@csd.uch.gr) for his contribution of ideas and pieces of code and for using the system.


Footnotes

Linda is a trademark of Scientific Computing Associates, Inc.


