batch in-job core file debugging
If a program fails in batch mode and leaves a core file, it can help to
debug it inside the batch job itself. Some bugs are intermittent and
difficult to reproduce outside the batch environment. The examples below
are commented batch scripts that run the debuggers on any core files
generated during the job.
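Whether a core file appears at all depends on the core file size limit of the shell that launches the program. The csh scripts on this page raise it with `limit coredumpsize unlimited`, and the generated wrapper uses `ulimit -c unlimited`. A minimal Bourne shell sketch of checking that limit before a run follows; `core_dumps_enabled` is a hypothetical helper name, not part of PBS, LSF, or MPI.

```shell
#!/bin/sh
# core_dumps_enabled is a hypothetical helper, not part of any batch system.
# It succeeds when the current soft core file size limit is non-zero.
core_dumps_enabled() {
    limit=$(ulimit -c)
    [ "$limit" != "0" ]
}

if core_dumps_enabled; then
    echo "core files enabled (limit: $(ulimit -c))"
else
    echo "core files disabled; try 'ulimit -c unlimited'"
fi
```

The check only reports the limit of the shell that runs it. Each MPI rank may start with its own limit, which is why the PBS example wraps the program in a script that sets the limit itself.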
PBS batch job LSF batch job
Mercury PBS batch example
#!/bin/csh
#PBS -lnodes=1:ppn=2,walltime=00:15:00 -qdebug
#PBS -N bugc
cd /home/ncsa/arnoldg/debug
# calculate NPROCS from PBS_NODEFILE for use with mpirun
setenv NPROCS `wc -l < $PBS_NODEFILE`
setenv PROGRAM_NAME bugc
# wrapping the program this way ensures each MPI rank runs with no core file size limit
# Note: at large scale this can consume a lot of filesystem space if every
# MPI rank hits the bug and writes a core file
(echo "ulimit -c unlimited"; echo ./${PROGRAM_NAME} ) > ${PROGRAM_NAME}_wrapper.sh
chmod +x ${PROGRAM_NAME}_wrapper.sh
# run the program in a sub-shell in the background so that the sleep and core
# file check below can proceed without waiting for the program to finish
(mpirun -np ${NPROCS} -machinefile ${PBS_NODEFILE} ${PROGRAM_NAME}_wrapper.sh ) &
# set the sleep here to be long enough to encounter the bug [300 seconds in this example]
sleep 300
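# csh aborts with "No match" if the pattern matches nothing, and -f accepts
# only one file name, so expand the pattern into a list and test its first entry
set nonomatch
set corefiles = ( core* )
if ( -f "$corefiles[1]" ) then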
echo "=== gdb ==="
gdb ${PROGRAM_NAME} core* << EOF
where
quit
EOF
#^^ the EOF marker needs to be at the beginning of a line
echo " "
echo "=== idb ==="
idb -gdb ${PROGRAM_NAME} core* << EOF
where
quit
EOF
echo "=== totalviewcli ==="
soft add +totalview
totalviewcli ${PROGRAM_NAME} core* << EOF
dwhere
quit
yes
EOF
endif
# cleanup stray mpi processes
killall mpirun
killall perl
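The fixed `sleep 300` in the script above has to be longer than the time the bug takes to appear. An alternative sketch, written in Bourne shell rather than csh, polls for a core file and returns as soon as one exists. `wait_for_core` and its arguments are hypothetical names chosen for illustration, not part of PBS or MPI.

```shell
#!/bin/sh
# wait_for_core DIR TIMEOUT [INTERVAL]
# Hypothetical helper: checks DIR for files matching core* every INTERVAL
# seconds (default 5) for up to TIMEOUT seconds. Prints the first core file
# found and returns 0, or returns 1 if none appears in time.
wait_for_core() {
    dir=$1
    timeout=$2
    interval=${3:-5}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        for f in "$dir"/core*; do
            # an unmatched pattern stays literal and fails the -f test
            if [ -f "$f" ]; then
                echo "$f"
                return 0
            fi
        done
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}
```

A job could call `wait_for_core . 300` in place of the sleep. A core file can become visible before it has been completely written, so a short additional sleep before starting the debugger may still be prudent.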
Sample output from the PBS batch script. The backtraces show the segmentation fault at line 56 of bugc.c, in abug(), which was called from line 37 in main():
----------------------------------------
Begin PBS Prologue Wed Mar 29 08:24:39 CST 2006
Job ID: 586778.tg-master.ncsa.teragrid.org
Username: arnoldg
Group: afw
Nodes: tg-c280
End PBS Prologue Wed Mar 29 08:24:43 CST 2006
----------------------------------------
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
[1] 21932
Hello world! I'm 1 of 2 on tg-c280
Hello world! I'm 0 of 2 on tg-c280
=== gdb ===
Core was generated by `./bugc'.
Program terminated with signal 11, Segmentation fault.
#0 0x4000000000004bf1 in abug (rank=1) at bugc.c:56
56 a[1000]=4.0;
#0 0x4000000000004bf1 in abug (rank=1) at bugc.c:56
#1 0x4000000000004910 in main (argc=1, argv=0x60000fffffffabc8) at bugc.c:37
=== idb ===
Linux Application Debugger for Itanium(R)-based applications, Version 7.3.2, Build 20031209
------------------
object file name: bugc
core file name: core
Reading symbols from bugc...done.
Core file produced from executable bugc
Initial part of arglist: ./bugc
Thread terminated at PC 0x4000000000004bf1 by signal SEGV
#0 0x4000000000004bf1 in abug (rank=1) at bugc.c:56
#1 0x4000000000004910 in main (argc=1, argv=(char * *) 0x60000fffffffabc8) at bugc.c:37
#2 0x2000000000318060 in __libc_start_main () in /lib/libc.so.6.1
#3 0x4000000000004040 in _start () in bugc
=== totalviewcli ===
Thread 1.1 has appeared
Thread 1.1 received a signal (Segmentation violation)
d1.<> > 0 abug PC=0x4000000000004bf1, FP=0x60000fffffffa8c0 [bugc.c#56]
1 main PC=0x4000000000004902, FP=0x60000fffffffa920 [bugc.c#37]
2 __libc_start_main PC=0x2000000000318052, FP=0x60000fffffffabb0 [/lib/libc.so.6.1]
3 _start PC=0x4000000000004032, FP=0x60000fffffffabb0 [bugc]
d1.<> Do you really wish to exit TotalView?
Detached from Process 1
----------------------------------------
Begin PBS Epilogue Wed Mar 29 08:26:57 CST 2006
Job ID: 586778.tg-master.ncsa.teragrid.org
Username: arnoldg
Group: afw
Job Name: bugc
Session: 21330
Limits: ncpus=1,nodes=1:ppn=2,walltime=00:15:00
Resources: cput=00:00:02,mem=23808kb,vmem=40000kb,walltime=00:02:11
Queue: debug
Account: afw
Nodes: tg-c280
Killing leftovers...
End PBS Epilogue Wed Mar 29 08:27:02 CST 2006
----------------------------------------
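When a job produces a lot of output, it can be convenient to pull just the source locations out of the backtraces. The following Bourne shell sketch does this for gdb-style frame lines of the form `#N ... at file:line`, as printed by gdb and idb in the output above. `backtrace_locations` is a hypothetical helper name, and the pattern is an assumption based only on the frame format shown here.

```shell
#!/bin/sh
# backtrace_locations is a hypothetical helper. It reads debugger output on
# stdin and prints the "file:line" part of every gdb-style frame line
# ("#N ... at file:line"). Frames without source information, such as
# those inside libc, are skipped.
backtrace_locations() {
    sed -n 's/^#[0-9][0-9]* .* at \([^ ]*:[0-9][0-9]*\)$/\1/p'
}
```

It would be used as `backtrace_locations < job.out`, where `job.out` is a placeholder for the job's output file. The same frame is listed once per occurrence, so a frame that several debuggers report appears several times.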
Tungsten LSF batch example with gdb
#!/bin/csh
limit coredumpsize unlimited
# work in the $SCR directory and copy the executable there
cd $SCR
setenv PROGRAM_NAME bugc
cp ~/debug/${PROGRAM_NAME} .
cmpirun -enable_corefiles -lsf ${PROGRAM_NAME}
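# csh aborts with "No match" if the pattern matches nothing, and -f accepts
# only one file name, so expand the pattern into a list and test its first entry
set nonomatch
set corefiles = ( core.* )
if ( -f "$corefiles[1]" ) then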
( echo where ; echo quit ) > gdbcommands
gdb --command gdbcommands --batch ${PROGRAM_NAME} core.*
endif
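If several ranks fail, the job directory can hold more than one `core.*` file, and gdb examines one core file per invocation. A Bourne shell sketch of running the debugger once per core file follows. `debug_all_cores` and the `DEBUGGER` override are hypothetical names introduced for illustration. The sketch uses gdb's `--batch` and `-ex` options; a gdb without `-ex` would need a command file, as in the script above.

```shell
#!/bin/sh
# debug_all_cores PROGRAM [DIR]
# Hypothetical helper: runs the debugger once for each core.* file in DIR
# (default: current directory) and prints a backtrace for each one.
# DEBUGGER is an assumed override hook so the loop can be tried without gdb.
# Returns 0 if at least one core file was processed, 1 otherwise.
debug_all_cores() {
    program=$1
    dir=${2:-.}
    debugger=${DEBUGGER:-gdb}
    status=1
    for core in "$dir"/core.*; do
        # an unmatched pattern stays literal and fails the -f test
        [ -f "$core" ] || continue
        echo "=== $core ==="
        "$debugger" --batch -ex where "$program" "$core"
        status=0
    done
    return "$status"
}
```

In a job script this would be called after the MPI run finishes, for example `debug_all_cores ./bugc .`.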