Debugging MPI applications with pdbx and poe


The IBM pdbx parallel debugger is an extension of the dbx debugger that adds support for MPI applications. A good tutorial and guide to parallel debugging with pdbx is available at:

IBM parallel debugging tips

Parallel applications can be run with up to 4 processes on the login host. The example session below uses pdbx to locate a segmentation fault. Commands are the text entered after each pdbx prompt.
pdbx example session [segmentation fault]

Cu12:~/mpi 101% pdbx -procs 4 -labelio yes hello_world_tv
pdbx Version 3, Release 2 -- Jan 10 2005 12:24:15

0:reading symbolic information ...
1:reading symbolic information ...
2:reading symbolic information ...
3:reading symbolic information ...
1:[1] stopped in main at line 21 ($t1)
1: 21 MPI_Init(&argc, &argv);
0:[1] stopped in main at line 21 ($t1)
0: 21 MPI_Init(&argc, &argv);
3:[1] stopped in main at line 21 ($t1)
3: 21 MPI_Init(&argc, &argv);
2:[1] stopped in main at line 21 ($t1)
2: 21 MPI_Init(&argc, &argv);
0031-504 Partition loaded ...

pdbx(all) cont
0:Hello world! I'm 0 of 4 on Cu12
2:Hello world! I'm 2 of 4 on Cu12
1:Hello world! I'm 1 of 4 on Cu12
3:Hello world! I'm 3 of 4 on Cu12
1:
1:Segmentation fault in unnamed block $b1 at line 33 ($t1)
1: 33 *f=3.5;
^C
pdbx-subset(all) halt
0029-2108 The following RUNNING task(s): "0 2 3" have been interrupted.
0029-2109 No action taken on task(s): "1", because they have either been stopped by the debugger, finished executing, or have been unhooked.
0029-2105 The current context contains at least one RUNNING task. When these
RUNNING task(s) reach a breakpoint or complete execution, a pdbx prompt is displayed.
1:
1:
1:
0:
0:Interrupt in _event_sleep at 0xd0057bec ($t2)
0:0xd0057bec (_event_sleep+0x90) 80410014 lwz r2,0x14(r1)
3:
3:Interrupt in _event_sleep at 0xd0057bec ($t2)
3:0xd0057bec (_event_sleep+0x90) 80410014 lwz r2,0x14(r1)
2:
2:Interrupt in _event_sleep at 0xd0057bec ($t2)
2:0xd0057bec (_event_sleep+0x90) 80410014 lwz r2,0x14(r1)

pdbx(all) where
0:_event_sleep(??, ??, ??, ??, ??) at 0xd0057bec
0:sigwait(??, ??) at 0xd005d2d4
0:pm_async_thread(??) at 0xd4dfcbc0
0:_pthread_body(??) at 0xd004d40c
1:unnamed block $b1, line 33 in "hello_world_tv.c"
1:main(argc = 1, argv = 0x2ff2245c, ... = 0x2ff22464, 0x28, 0x2ff22ff8, 0x0, 0x109fa097, 0x5), line 33 in "hello_world_tv.c"
2:_event_sleep(??, ??, ??, ??, ??) at 0xd0057bec
2:sigwait(??, ??) at 0xd005d2d4
2:pm_async_thread(??) at 0xd4dfcbc0
2:_pthread_body(??) at 0xd004d40c
3:_event_sleep(??, ??, ??, ??, ??) at 0xd0057bec
3:sigwait(??, ??) at 0xd005d2d4
3:pm_async_thread(??) at 0xd4dfcbc0
3:_pthread_body(??) at 0xd004d40c

pdbx(all) on 1

pdbx(1) where
1:unnamed block $b1, line 33 in "hello_world_tv.c"
1:main(argc = 1, argv = 0x2ff2245c, ... = 0x2ff22464, 0x28, 0x2ff22ff8, 0x0, 0x109fa097, 0x5), line 33 in "hello_world_tv.c"

pdbx(1) list 25,40
1: 25 MPI_Get_processor_name(name, &len);
1: 26
1: 27 printf ("Hello world! I'm %d of %d on %s\n", rank, size, name);
1: 28
1: 29 if ( (rank ==1) || (rank == 31) )
1: 30 {
1: 31 double *f;
1: 32 f=0;
1: 33 *f=3.5;
1: 34 }
1: 35 MPI_Finalize();
1: 36 exit(0);
1: 37 }

pdbx(1) print f
1:(nil)

pdbx(1) print *f
1:reference through nil pointer

pdbx(1) quit
Cu12:~/mpi 102%
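
The session above shows f set to 0 at line 32 and then written through at line 33. The following is a minimal sketch of one way to guard against that pattern, assuming the intent was simply to store 3.5 in a double. The helper name store_value is hypothetical and does not come from hello_world_tv.c.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of hello_world_tv.c): writes value
 * through dest only when dest points at real storage.
 * Returns 0 on success, -1 if dest is NULL. */
int store_value(double *dest, double value)
{
    if (dest == NULL) {
        fprintf(stderr, "store_value: refusing to write through a NULL pointer\n");
        return -1;
    }
    *dest = value;
    return 0;
}
```

Called with the address of an ordinary double (for example, double storage; store_value(&storage, 3.5);) the write succeeds. Called with the f = 0 value from the session, it returns -1 instead of raising a segmentation fault.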



MPI programs occasionally hang due to logic errors within a program. Consider the consequences of the following code example:
        29  if ( (rank == 1) || (rank == 31) )
        30  {
        31      char buf[255];
        32      MPI_Request request;
        33      MPI_Status status;
        34      MPI_Recv( (void *)buf, 1, MPI_CHAR, 1, 55, MPI_COMM_WORLD, &status);
        35      MPI_Send( (void *)buf, 1, MPI_CHAR, 1, 55, MPI_COMM_WORLD);
        36  }
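That code has a couple of errors. Rank 1 posts a blocking MPI_Recv from rank 1 (itself) at line 34, before the matching MPI_Send at line 35, so the receive can never complete; rank 31 would wait on a message that rank 1 never gets to send. The program therefore hangs on rank 1 or 31, which is the observed runtime behavior. The poe option -corefile_format STDERR can be used with signaling to get to the source of this sort of problem.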
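
For comparison, here is a sketch of an exchange that does not hang. It rests on an assumption the original snippet does not state: that the goal was for rank 1 to trade a single character with a partner, taken here to be rank 0. MPI_Sendrecv is one of several ways to avoid the ordering problem. The tag 55 is reused from the snippet, and the variable names are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

#define TAG 55  /* same tag as the snippet above */

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Assumption: ranks 0 and 1 are the intended partners. */
    if (size >= 2 && rank < 2) {
        int partner = 1 - rank;          /* 0 <-> 1, never the rank itself */
        char out = (char)('A' + rank);   /* illustrative payload */
        char in = '?';
        MPI_Status status;

        /* MPI_Sendrecv combines the send and the receive in one call,
         * so neither rank blocks waiting for a message that its
         * partner has not yet been able to send. */
        MPI_Sendrecv(&out, 1, MPI_CHAR, partner, TAG,
                     &in,  1, MPI_CHAR, partner, TAG,
                     MPI_COMM_WORLD, &status);

        printf("rank %d received '%c' from rank %d\n", rank, in, partner);
    }

    MPI_Finalize();
    return 0;
}
```

Run with two or more tasks, each of ranks 0 and 1 sends to the other and receives from the other, so no rank ever waits on itself.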

poe example with -corefile_format STDERR [program hangs]
Cu12:~/mpi128% poe hello_hang -procs 2 -labelio yes -corefile_format STDERR
0:Hello world! I'm 0 of 2 on Cu12
1:Hello world! I'm 1 of 2 on Cu12

In another terminal window, while the process is hanging, find one of the process ids and send it the SEGV signal so that poe's -corefile_format option can show where the program was stuck.
Cu12:~/mpi102% ps auxw | grep hello_hang
arnoldg  3186922  6.2  0.0 2804 2820      - A    11:32:04  0:13 ./hello_hang
arnoldg  1859702  5.8  0.0 2796 2812      - A    11:32:04  0:12 ./hello_hang
arnoldg  1564804  0.0  0.0  220  232 pts/90 A    11:32:17  0:00 grep hello_hang
arnoldg  3432602  0.0  0.0  620  656 pts/19 A    16:24:41  0:00 vi hello_hang.c
Cu12:~/mpi103% kill -SEGV 3186922
Cu12:~/mpi104%
Upon receipt of the SEGV signal, poe writes a lightweight core file for each task to standard error because of the -corefile_format STDERR option. The stack trace for task 1 shows it stuck in MPI_Recv at line 34 of main:

0:+++PARALLEL TOOLS CONSORTIUM LIGHTWEIGHT COREFILE FORMAT version 1.0
0:+++LCB 1.0 Fri Oct 21 11:32:30 2005 Generated by IBM AIX 5.1
0:#
0:+++ID Node 0 Process 3186922 Thread 4
0:***FAULT "SIGSEGV - Segmentation violation"
0:+++STACK
0:_event_sleep : 0x00000090
0:sigwait : 0x000002e4
0:sigioThreadRoutine : 0x000000f8
0:_pthread_body : 0x000000e8
0:---STACK
0:---ID Node 0 Process 3186922 Thread 4
0:#
0:+++ID Node 0 Process 3186922 Thread 1
0:+++STACK
0:kickpipes : 0x0000007c
0:mpci_recv : 0x00000d10
0:barrier_shft_b : 0x000004fc
0:_mpi_barrier : 0x000005d4
0:MPI__Finalize : 0x000003d8
0:main : 37 # in file <hello_hang.c>
0:---STACK
0:---ID Node 0 Process 3186922 Thread 1
0:#
0:+++ID Node 0 Process 3186922 Thread 2
0:+++STACK
0:sigwait : 0x000002e4
0:pm_async_thread : 0x0000072c
0:_pthread_body : 0x000000e8
0:---STACK
0:---ID Node 0 Process 3186922 Thread 2
0:#
0:+++ID Node 0 Process 3186922 Thread 3
0:+++STACK
0:nsleep : 0x000000b0
0:usleep : 0x00000048
0:timerThreadRoutine : 0x00000100
0:_pthread_body : 0x000000e8
0:---STACK
0:---ID Node 0 Process 3186922 Thread 3
0:---LCB
ERROR: 0031-250 task 0: Segmentation fault
1:+++PARALLEL TOOLS CONSORTIUM LIGHTWEIGHT COREFILE FORMAT version 1.0
1:+++LCB 1.0 Fri Oct 21 11:32:30 2005 Generated by IBM AIX 5.1
1:#
1:+++ID Node 1 Process 1859702 Thread 2
1:+++STACK
1:pm_atexit : 0x0000036c
1:exit : 0x00000084
1:pm_child_sig_handler : 0x0000046c
1:pm_async_thread : 0x000009b4
1:_pthread_body : 0x000000e8
1:---STACK
1:---ID Node 1 Process 1859702 Thread 2
1:#
1:+++ID Node 1 Process 1859702 Thread 1
1:+++STACK
1:kickpipes : 0x00000c10
1:_mpi_recv : 0x0000015c
1:MPI__Recv : 0x00000630
1:main : 34 # in file <hello_hang.c>
1:---STACK
1:---ID Node 1 Process 1859702 Thread 1
1:#
1:+++ID Node 1 Process 1859702 Thread 3
1:+++STACK
1:nsleep : 0x000000b0
1:usleep : 0x00000048
1:timerThreadRoutine : 0x00000100
1:_pthread_body : 0x000000e8
1:---STACK
1:---ID Node 1 Process 1859702 Thread 3
1:#
1:+++ID Node 1 Process 1859702 Thread 4
1:+++STACK
1:sigwait : 0x000002e4
1:sigioThreadRoutine : 0x000000f8
1:_pthread_body : 0x000000e8
1:---STACK
1:---ID Node 1 Process 1859702 Thread 4
1:#There are more threads but pthread_getthrds_np returns <38>.
ERROR: 0031-250 task 1: Terminated
Cu12:~/mpi129%



References:

IBM Parallel Environment for AIX
Hitchhiker's Guide