
batch in-job core file debugging

If a program fails in batch mode and leaves a core file, it can help to debug it inside the same batch job. Some bugs are intermittent and may be difficult to reproduce outside the batch environment. The examples below show batch script syntax, with comments describing how a batch job can run a debugger on any core files that are generated.
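
Whether a core file appears at all depends on the core file size limit in the shell that launches the program. As a minimal sketch, assuming a shell whose ulimit builtin supports the -c option (bash, for example), a job could check the limit before it runs the program. The function name core_dumps_enabled is hypothetical and is not part of the original examples.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the soft core file size limit is non-zero.
core_dumps_enabled() {
    limit=$(ulimit -c)      # prints "unlimited" or a number
    [ "$limit" != "0" ]
}

if core_dumps_enabled; then
    echo "core files enabled (limit: $(ulimit -c))"
else
    echo "core files disabled; run 'ulimit -c unlimited' before the program"
fi
```

In csh, the equivalent setting is "limit coredumpsize unlimited", as used in the Tungsten example.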

PBS batch job
LSF batch job

Mercury PBS batch example
#!/bin/csh
#PBS -lnodes=1:ppn=2,walltime=00:15:00 -qdebug
#PBS -N bugc

cd /home/ncsa/arnoldg/debug
# calculate NPROCS from PBS_NODEFILE for use with mpirun
setenv NPROCS `wc -l < $PBS_NODEFILE`
setenv PROGRAM_NAME bugc

# wrapping the program in this manner ensures each mpi rank is run without core file limits
# Note, this could consume much filesystem space for a program running at a large scale if each
# mpi rank hit a bug and created a core file
(echo "ulimit -c unlimited"; echo ./${PROGRAM_NAME} ) > ${PROGRAM_NAME}_wrapper.sh
chmod +x  ${PROGRAM_NAME}_wrapper.sh

# run the program in a sub-shell in the background so that the sleep and core
# file check below can proceed without waiting for the program to finish
(mpirun -np ${NPROCS} -machinefile ${PBS_NODEFILE} ${PROGRAM_NAME}_wrapper.sh ) &

# set the sleep here to be long enough to encounter the bug [300 seconds in this example]
sleep 300

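# csh aborts with "No match" if the pattern matches nothing, and -f accepts only
# one file name, so collect the matches first and test the first one
set nonomatch
set corefiles = ( core* )
if ( -f "$corefiles[1]" ) then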

  echo "=== gdb ==="
  gdb ${PROGRAM_NAME} core* << EOF
     where
     quit
EOF
#^^ the EOF marker needs to be at the beginning of a line
echo " "

  echo "=== idb ==="
  idb -gdb ${PROGRAM_NAME} core* << EOF
     where
     quit
EOF

  echo "=== totalviewcli ==="
  soft add +totalview
  totalviewcli ${PROGRAM_NAME} core* << EOF
     dwhere
     quit
     yes
EOF

endif

# cleanup stray mpi processes
killall mpirun
killall perl

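The debugger calls in the script above work best when there is exactly one core file. As a rough sketch of an alternative, assuming a separate sh or bash helper script and a gdb that supports the --batch and -ex options, the loop below prints a stack trace for every core file in the current directory. The function name backtrace_cores and the DEBUGGER variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print a stack trace for every core file in the
# current directory. Usage: backtrace_cores PROGRAM
DEBUGGER=${DEBUGGER:-gdb}   # assumption: gdb unless the caller overrides it

backtrace_cores() {
    program=$1
    found=0
    for corefile in core*; do
        # an unmatched pattern stays literal in sh, so skip anything
        # that is not a regular file
        [ -f "$corefile" ] || continue
        found=$((found + 1))
        echo "=== $DEBUGGER: $corefile ==="
        # --batch exits after the commands run; -ex runs one gdb command
        "$DEBUGGER" --batch -ex where "$program" "$corefile"
    done
    echo "$found core file(s) examined"
}
```

A batch job could call backtrace_cores ./bugc after the sleep in place of the single gdb invocation. Because each rank may leave its own core file, the output grows with the number of failing ranks.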

Sample output from the batch script. All three debuggers place the segmentation fault at line 56 of bugc.c, in abug(), which is called from line 37 of main():

----------------------------------------
Begin PBS Prologue Wed Mar 29 08:24:39 CST 2006
Job ID: 586778.tg-master.ncsa.teragrid.org
Username: arnoldg
Group: afw
Nodes: tg-c280
End PBS Prologue Wed Mar 29 08:24:43 CST 2006
----------------------------------------
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
[1] 21932
Hello world! I'm 1 of 2 on tg-c280
Hello world! I'm 0 of 2 on tg-c280
=== gdb ===
Core was generated by `./bugc'.
Program terminated with signal 11, Segmentation fault.
#0 0x4000000000004bf1 in abug (rank=1) at bugc.c:56
56 a[1000]=4.0;
#0 0x4000000000004bf1 in abug (rank=1) at bugc.c:56
#1 0x4000000000004910 in main (argc=1, argv=0x60000fffffffabc8) at bugc.c:37
=== idb ===
Linux Application Debugger for Itanium(R)-based applications, Version 7.3.2, Build 20031209
------------------
object file name: bugc
core file name: core
Reading symbols from bugc...done.
Core file produced from executable bugc
Initial part of arglist: ./bugc
Thread terminated at PC 0x4000000000004bf1 by signal SEGV
#0 0x4000000000004bf1 in abug (rank=1) at bugc.c:56
#1 0x4000000000004910 in main (argc=1, argv=(char * *) 0x60000fffffffabc8) at bugc.c:37
#2 0x2000000000318060 in __libc_start_main () in /lib/libc.so.6.1
#3 0x4000000000004040 in _start () in bugc
=== totalviewcli ===
Thread 1.1 has appeared
Thread 1.1 received a signal (Segmentation violation)
d1.<> > 0 abug PC=0x4000000000004bf1, FP=0x60000fffffffa8c0 [bugc.c#56]
1 main PC=0x4000000000004902, FP=0x60000fffffffa920 [bugc.c#37]
2 __libc_start_main PC=0x2000000000318052, FP=0x60000fffffffabb0 [/lib/libc.so.6.1]
3 _start PC=0x4000000000004032, FP=0x60000fffffffabb0 [bugc]
d1.<> Do you really wish to exit TotalView? Detached from Process 1
----------------------------------------
Begin PBS Epilogue Wed Mar 29 08:26:57 CST 2006
Job ID: 586778.tg-master.ncsa.teragrid.org
Username: arnoldg
Group: afw
Job Name: bugc
Session: 21330
Limits: ncpus=1,nodes=1:ppn=2,walltime=00:15:00
Resources: cput=00:00:02,mem=23808kb,vmem=40000kb,walltime=00:02:11
Queue: debug
Account: afw
Nodes: tg-c280

Killing leftovers...

End PBS Epilogue Wed Mar 29 08:27:02 CST 2006
----------------------------------------



Tungsten LSF batch example with gdb
#!/bin/csh

limit coredumpsize unlimited

# work in the $SCR directory
cd $SCR
setenv PROGRAM_NAME bugc

cp ~/debug/${PROGRAM_NAME} .

cmpirun -enable_corefiles -lsf ${PROGRAM_NAME}

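# csh aborts with "No match" if the pattern matches nothing, and -f accepts only
# one file name, so collect the matches first and run gdb on each core file
set nonomatch
set corefiles = ( core.* )
if ( -f "$corefiles[1]" ) then
  ( echo where ;  echo quit ) > gdbcommands
  foreach corefile ( $corefiles )
    gdb --command gdbcommands --batch ${PROGRAM_NAME} $corefile
  end
endif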