Hints for Debugging Parallel Programs

Using ddd/gdb:

A debugging tool such as gdb can save you a great deal of time and frustration in any debugging project, and it is especially useful when debugging parallel programs. Whenever possible, avoid doing your debugging by simply adding calls to printf()! (Ordinarily I suggest using the ddd GUI to gdb. However, when debugging a parallel program this may be difficult, since you need one debugger session per process and GUIs take up a lot of space on one's monitor screen.) I have a writeup on the art of debugging, and an introduction to gdb, at my debugging-tutorial Web page.

Debugging of parallel programs is particularly difficult, both because there is "too much happening at once," and because debugging tools like gdb were not designed for parallel use. However, here is how you can use gdb with MPI, PVM, the various DSM packages, and so on (see important note on page-based DSMs later on):

First, when you compile your application source code, make sure to use the -g option, to retain the symbol table for gdb.

Now, get the program running, say on the partition

fajita.engr.ucdavis.edu
chimi.engr.ucdavis.edu
Say the name of the program is Prime. A copy of Prime will now be running on each machine. You will need to log in to each machine and attach gdb to the copy of the program running there. To do this, type
ps ax | grep Prime
(or ps -e or ps -ux, depending on the system), and find the process number for Prime at each machine. You might find several lines of output from this, such as "tcsh Prime ..." or "rsh chimi Prime...". Ignore these; you want the line which is for the execution of Prime itself.

(Note: One way around this would be to actually initiate the execution of the program at each node via gdb itself. However, this might be difficult to do with some parallel processing library packages.)

Then type

gdb Prime process_number
and then use gdb as usual from that point on.

Note that Prime was ALREADY running at fajita and chimi! What we have done is attach gdb to two already-running processes. To keep those processes from running away from you before you have attached, make them wait for you, using the following method:

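In your source code define a global integer variable named something like "DebugWait", initialize it to 1, and insert code like the line below. (If you compile with optimization turned on, declare the variable volatile; otherwise the compiler is free to read it only once, and the loop will never notice the new value you set from gdb.)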

while (DebugWait) ;
at the very beginning of main(). When you attach gdb to the two Prime processes, both will be stuck at that "while" loop line -- which is exactly what you want. Then for both of them, give the gdb command
(gdb) set DebugWait = 0
to "liberate" them. Then use gdb as usual, setting break points, single-stepping through the code and so on.

If you are using a page-based DSM, you need to tell gdb to ignore seg faults, which comprise the central mechanism for page-based DSM. To do this, issue the command

handle 11 nostop noprint
to gdb. (Seg faults are signal number 11 in UNIX.) Or better yet, place such a line in your .gdbinit startup file during the times when you are debugging your DSM programs.

Other Debugging Hints:

Make sure that you do not have any leftover ("zombie") processes still hanging around from previous debugging runs. In our examples above, for instance, our program was named Prime; use the ps command shown earlier on each machine to make sure there aren't any old Prime processes still running, and kill any that you find, since they may interfere with new Prime processes.

Use malloc() instead of declaring static arrays. Some message-passing packages, for instance, will just quit without an error message if you have declared large (or in some cases even medium-sized) arrays.

If you find that your program still does not accept large arrays, use your shell's limit command to increase your maximum stack size (limit stacksize in csh and tcsh; the counterpart in sh and bash is ulimit -s).