Electric Fence Malloc Debugger

Debugging C memory and pointer bugs with Electric Fence.

Introduction.
Tungsten MPI example.
Cobalt serial example.
Mercury serial example.
What about the IBM p690 [copper]?

The Electric Fence malloc debugger is installed on the login hosts of the NCSA tungsten, mercury, and cobalt clusters.  It can be used with serial or MPI programs that use pointers and the C memory allocation routines [malloc(), free(), ...].  Because C memory allocation is a frequent source of program bugs, it's a good idea to try Electric Fence when a program exhibits sporadic errors.  To get started with Electric Fence, see the efence man page:

[tunb ~/c]$ man efence
efence(3)                                                            efence(3)
 
NAME
       efence - Electric Fence Malloc Debugger
 
SYNOPSIS
       #include <stdlib.h>
...
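
As background, the efence man page explains why the library catches these bugs: it places an inaccessible memory page immediately after [or, optionally, before] each allocation, so a stray access faults at the offending instruction.  The following is a minimal sketch of that idea for POSIX systems, not Electric Fence's actual implementation; guarded_alloc() and overrun_signal() are hypothetical names invented for this illustration.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of Electric Fence): return `size` bytes
 * whose last byte sits immediately before an inaccessible guard page. */
void *guarded_alloc(size_t size)
{
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t data_pages = (size + page - 1) / page;
        size_t total = (data_pages + 1) * page;   /* one extra guard page */
        char *base;

        assert(size > 0);
        base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
                return NULL;

        /* Make the final page inaccessible: any access to it faults. */
        if (mprotect(base + data_pages * page, page, PROT_NONE) != 0) {
                munmap(base, total);
                return NULL;
        }

        /* Position the buffer so that it ends exactly at the guard page. */
        return base + data_pages * page - size;
}

/* Write one byte past the end of a guarded buffer in a child process and
 * return the signal that killed the child (0 if it exited normally,
 * -1 if fork() or waitpid() failed). */
int overrun_signal(size_t size)
{
        int status = 0;
        pid_t pid = fork();

        if (pid < 0)
                return -1;
        if (pid == 0) {
                volatile char *p = guarded_alloc(size);
                if (p == NULL)
                        _exit(2);
                p[size] = 'x';          /* first byte of the guard page */
                _exit(0);
        }
        if (waitpid(pid, &status, 0) < 0)
                return -1;
        return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling overrun_signal(10) is expected to report SIGSEGV on Linux [some systems may report SIGBUS instead].  The sketch never unmaps its buffers and makes no attempt to align them, both of which a real allocator has to handle.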

The MPI example below was run on the tungsten cluster.  Electric Fence doesn't work with the myrinet/gm malloc(), so the code was compiled with a version of mpich-tcp available via softenv:

[tunb ~/c]$ cat $HOME/.soft
@default
+mpich-tcp-1.2.5.2-intel8
[tunb ~/c]$ cat $HOME/.cshrc
limit coredumpsize 100000
[tunb ~/c]$ cat hello_world.c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
        int             rank, size, len;
        char            name[MPI_MAX_PROCESSOR_NAME];
        double          *badptr;        /* deliberately left uninitialized */

        MPI_Init(&argc, &argv);
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Get_processor_name(name, &len);
        MPI_Barrier(MPI_COMM_WORLD);

        system("/bin/uname -a");

        printf("Hello world! I'm %d of %d on %s\n", rank, size, name);

        if (rank == 1)
        {
                /* BUG: badptr was never assigned a value from malloc() */
                free(badptr);
        }

        MPI_Finalize();
        exit(0);
}

Note that the program will typically compile and run without showing an error:

[tunb ~/c]$ mpicc -g -o hello_world hello_world.c
[tunb ~/c]$ cat hosts
tunb
tunb
[tunb ~/c]$ mpirun -np 2 -machinefile hosts hello_world
Linux tunb 2.4.20-31.9smp_perfctr_lustre #2 SMP Thu Jun 24 21:02:14 CDT 2004 i686 i686 i386 GNU/Linux
Linux tunb 2.4.20-31.9smp_perfctr_lustre #2 SMP Thu Jun 24 21:02:14 CDT 2004 i686 i686 i386 GNU/Linux
Hello world! I'm 0 of 2 on tunb.ncsa.uiuc.edu
Hello world! I'm 1 of 2 on tunb.ncsa.uiuc.edu

That is misleading, though: calling free() with an uninitialized pointer is an error.  Electric Fence can usually catch such mistakes.  Re-link with -lefence and see what happens [note that linking with -lefence slows down the resulting program, so drop -lefence from the final link after debugging]:

[tunb ~/c]$ mpicc -g -o hello_world hello_world.c -lefence
[tunb ~/c]$ mpirun -np 2 -machinefile hosts hello_world
 
  Electric Fence 2.2.0 Copyright (C) 1987-1999 Bruce Perens <bruce@perens.com>
 
  Electric Fence 2.2.0 Copyright (C) 1987-1999 Bruce Perens <bruce@perens.com>
Linux tunb 2.4.20-31.9smp_perfctr_lustre #2 SMP Thu Jun 24 21:02:14 CDT 2004 i686 i686 i386 GNU/Linux
Hello world! I'm 0 of 2 on tunb.ncsa.uiuc.edu
Linux tunb 2.4.20-31.9smp_perfctr_lustre #2 SMP Thu Jun 24 21:02:14 CDT 2004 i686 i686 i386 GNU/Linux
                                                                               
ElectricFence Aborting: free(40016148): address not from malloc().
Illegal instruction (core dumped)
Killed by signal 2.

The gdb debugger can shed a little more light on that error message from Electric Fence:

[tunb ~/c]$ gdb hello_world core.9073
GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...
Core was generated by `/u/ncsa/arnoldg/c/hello_world tunb 39600   4amslave -p4yourname tunb -p4rmrank'.
Program terminated with signal 4, Illegal instruction.
Reading symbols from /usr/lib/libefence.so.0...done.
Loaded symbols for /usr/lib/libefence.so.0
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /usr/local/pgi/linux86/5.2/lib/libpgc.so...done.
Loaded symbols for /usr/local/pgi/linux86/5.2/lib/libpgc.so
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
#0  0xffffe002 in ?? ()
(gdb) where
#0  0xffffe002 in ?? ()
#1  0x4002d346 in EF_Abort () from /usr/lib/libefence.so.0
#2  0x4002cc1e in free () from /usr/lib/libefence.so.0
#3  0x08049ffa in main (argc=1, argv=0x40112f84) at hello_world.c:40
#4  0x420156a4 in __libc_start_main () from /lib/tls/libc.so.6
(gdb) l
40                 free(badptr);
41              }
42
43
44              MPI_Finalize();
45              exit(0);
46      }
(gdb) q
[tunb ~/c]$
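
One way to avoid the bug that Electric Fence reported above is to make sure every pointer holds either NULL or a value returned by an allocation routine before it reaches free().  The sketch below shows that habit; make_buffer() and release_buffer() are hypothetical helper names used only for illustration, not part of hello_world.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate `count` zero-filled doubles.
 * Returns NULL if the allocation fails. */
double *make_buffer(size_t count)
{
        return calloc(count, sizeof(double));
}

/* Hypothetical helper: free the buffer that *slot points to and reset
 * *slot to NULL so that a repeated release is harmless. */
void release_buffer(double **slot)
{
        assert(slot != NULL);
        free(*slot);            /* free(NULL) is defined to do nothing */
        *slot = NULL;
}
```

A caller would declare the pointer as "double *buf = NULL;" so that release_buffer(&buf) is safe even on code paths where make_buffer() was never called.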


The next example is a simple serial program with a bug, diagnosed using Electric Fence on cobalt.

[co-login1 ~/c]$ cat efence_test.c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
 
int main(void)
{
        char *myptr;
 
        printf("begin\n");

        myptr= malloc(1);
 
        strcpy(myptr, "end                                                done");
        printf("%s\n", myptr);
}

Once again, the program appears to compile and run without incident.

[co-login1 ~/c]$ icc -g -o efence_test efence_test.c
[co-login1 ~/c]$ ./efence_test
begin
end                                                done

Running the program through the ef utility preloads the Electric Fence library [via LD_PRELOAD] so that it is consulted first for the C allocation routines.  This has the same effect as linking with -lefence, without requiring the program to be relinked.  As before, gdb can be used to get more information about the bug.

[co-login1 ~/c]$ limit coredumpsize 100000
[co-login1 ~/c]$ rm -f core*
[co-login1 ~/c]$ ef ./efence_test
 
  Electric Fence 2.2.0 Copyright (C) 1987-1999 Bruce Perens <bruce@perens.com>
begin
/usr/bin/ef: line 20: 32183 Segmentation fault      (core dumped) ( export LD_PRELOAD=libefence.so.0.0; exec $* )
[co-login1 ~/c]$ gdb efence_test core*
GNU gdb Red Hat Linux (5.2-2)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "ia64-redhat-linux"...
Core was generated by `'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libefence.so.0.0...done.
Loaded symbols for /usr/lib/libefence.so.0.0
Reading symbols from /lib/libm.so.6.1...done.
Loaded symbols for /lib/libm.so.6.1
Reading symbols from /usr/local/intel/8.0.069/lib/libcprts.so.6...done.
Loaded symbols for /usr/local/intel/8.0.069/lib/libcprts.so.6
Reading symbols from /usr/local/intel/8.0.069/lib/libcxa.so.6...done.
Loaded symbols for /usr/local/intel/8.0.069/lib/libcxa.so.6
Reading symbols from /usr/local/intel/8.0.069/lib/libunwind.so.6...done.
Loaded symbols for /usr/local/intel/8.0.069/lib/libunwind.so.6
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libc.so.6.1...done.
Loaded symbols for /lib/libc.so.6.1
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/ld-linux-ia64.so.2...done.
Loaded symbols for /lib/ld-linux-ia64.so.2
#0  0x20000000004056e0 in strcpy () at soinit.c:56
56      soinit.c: No such file or directory.
        in soinit.c
(gdb) where
#0  0x20000000004056e0 in strcpy () at soinit.c:56
#1  0x4000000000001270 in main () at efence_test.c:16
(gdb) list
51      in soinit.c
(gdb) list efence_test.c:16
11       */
12
13              myptr= malloc(80 * sizeof(char) );
14              myptr= malloc(1);
15
16              strcpy(myptr, "end                                                done");
17              printf("%s\n", myptr);
18      }
(gdb) q
[co-login1 ~/c]$
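
The overrun that Electric Fence exposed above comes from copying a long string into a one-byte allocation.  A minimal sketch of the usual fix is to size the allocation from the string itself, including room for the terminating '\0'; copy_string() is a hypothetical name chosen for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a freshly allocated copy of `src`, or NULL
 * if the allocation fails.  The caller releases the copy with free(). */
char *copy_string(const char *src)
{
        size_t size;
        char *dst;

        assert(src != NULL);
        size = strlen(src) + 1;         /* +1 for the terminating '\0' */
        dst = malloc(size);
        if (dst != NULL)
                memcpy(dst, src, size); /* copies the terminator as well */
        return dst;
}
```

In efence_test.c this would replace the malloc(1) and strcpy() pair with a single call such as "myptr = copy_string("end ... done");", followed by a check for NULL.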


The same serial program is shown again below, this time on mercury, with Electric Fence linked into the program and the program run under gdb.

arnoldg/c> cat efence_test.c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        char *myptr;

        printf("begin\n");

        myptr= malloc(1);

        strcpy(myptr, "end                                                done");
        printf("%s\n", myptr);
}
arnoldg/c> icc -g -o efence_test efence_test.c -lefence 
arnoldg/c> gdb ./efence_test
GNU gdb 5.3
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "ia64-suse-linux"...
(gdb) run
Starting program: /home/ncsa/arnoldg/c/efence_test 

  Electric Fence 2.2.0 Copyright (C) 1987-1999 Bruce Perens <bruce@perens.com>
begin

Program received signal SIGSEGV, Segmentation fault.
0x2000000000380200 in strcpy () from /lib/libc.so.6.1
(gdb) where
#0  0x2000000000380200 in strcpy () from /lib/libc.so.6.1
#1  0x4000000000001230 in main () at efence_test.c:13
(gdb) list
13              strcpy(myptr, "end                                                done");
14              printf("%s\n", myptr);
15      }
(gdb)  

For the IBM p690 cluster [copper], there's a handy compile-time option [-qheapdebug] that can help with memory allocation bugs.  It's similar in purpose to Electric Fence.  Here's a variant of the first example showing the use of -qheapdebug:

Cu12:~/c115% cat ef_mpi.c
 
#include <stdio.h>
#include <mpi.h>
 
main(argc, argv)
 
int                     argc;
char                    *argv[];
 
{
        int             rank, size, len;
        char            name[MPI_MAX_PROCESSOR_NAME];
char *badptr;
        MPI_Init(&argc, &argv);
MPI_Barrier(MPI_COMM_WORLD);
 
 
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
         
        MPI_Get_processor_name(name, &len);
MPI_Barrier(MPI_COMM_WORLD);
 
system("/bin/uname -a");
 
        printf ("Hello world! I'm %d of %d on %s\n", rank, size, name);
 
if (rank == 1)
{
        badptr=(char *) malloc(1 * sizeof(char));
        strcat(badptr,"this is a test string                            .");
}
 
 
        MPI_Finalize();
        exit(0);
}
Cu12:~/c116% mpcc_r -qheapdebug -g -o ef_mpi ef_mpi.c

Note that you can run MPI POE applications with or without the poe command, because the POE libraries are compiled into the program.

Cu12:~/c117% ./ef_mpi -procs 2
AIX Cu12 1 5 0024BB0A4C00
Hello world! I'm 1 of 2 on Cu12
1546-504 Internal storage object was overwritten at 0x2458E5AB.
AIX Cu12 1 5 0024BB0A4C00
1546-522 Traceback:
           20112070 = _debug_strcat + 0x94
           201115f4 = strcat + 0x20
           100004a4 = main + 0xE0
Hello world! I'm 0 of 2 on Cu12
ERROR: 0031-250  task 1: IOT/Abort trap
ERROR: 0031-250  task 0: Terminated
 
The runtime error already provides a good clue, with strcat appearing in the traceback.  The pdbx debugger can further illuminate the problem:

Cu12:~/c123% pdbx ef_mpi -procs 2
pdbx Version 3, Release 2 -- May  5 2004 14:06:57
 
reading symbolic information ...
reading symbolic information ...
[1] stopped in main at line 24 ($t1)
   24           MPI_Init(&argc, &argv);
[1] stopped in main at line 24 ($t1)
   24           MPI_Init(&argc, &argv);
0031-504  Partition loaded ...
 
pdbx(all) cont
AIX Cu12 1 5 0024BB0A4C00
AIX Cu12 1 5 0024BB0A4C00
Hello world! I'm 1 of 2 on Cu12
Hello world! I'm 0 of 2 on Cu12
1546-504 Internal storage object was overwritten at 0x2458E5AB.
1546-522 Traceback:
           20112070 = _debug_strcat + 0x94
           201115f4 = strcat + 0x20
           100004a4 = main + 0xE0
 
IOT/Abort trap in pthread_kill at 0xd005cb14 ($t1)
0xd005cb14 (pthread_kill+0xa8) 80410014        lwz   r2,0x14(r1)
^C
pdbx-subset(all) tasks
  0:R    1:D
 
 
pdbx-subset(all) on 1
 
pdbx(1) where
pthread_kill(??, ??) at 0xd005cb14
_p_raise(??) at 0xd005c120
_tm_msg_print(??, ??, ??, ??, ??) at 0x20193380
_mem_error(??, ??, ??, ??, ??) at 0x2019ac98
_test_dbg_allocated(??) at 0x20193494
_int_uheap_verify(??, ??, ??, ??) at 0x20193b98
_chk_if_heap(??, ??, ??, ??) at 0x20198440
_debug_strcat(??, ??, ??, ??) at 0x20112070
@cbase.strcat(??, ??) at 0x201115f4
main(argc = 1, argv = 0x2ff224e4, ... = 0x2ff224ec, 0x28, 0x2ff22ff8, 0x0, 0x10348007, 0x5), line 41 in "ef_mpi.c"
 
pdbx(1) l
   25   MPI_Barrier(MPI_COMM_WORLD);
   26
   27
   28           MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   29           MPI_Comm_size(MPI_COMM_WORLD, &size);
   30
   31           MPI_Get_processor_name(name, &len);
   32   MPI_Barrier(MPI_COMM_WORLD);
   33
   34   system("/bin/uname -a");
 
pdbx(1) l
   35
   36           printf ("Hello world! I'm %d of %d on %s\n", rank, size, name);
   37
   38   if (rank == 1)
   39   {
   40           badptr=(char *) malloc(1 * sizeof(char));
   41           strcat(badptr,"this is a test string                            .");
   42   }
   43
   44
 
pdbx(1) q
Cu12:~/c108%
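
The strcat() call flagged in ef_mpi.c has two problems: the one-byte buffer is far too small, and memory returned by malloc() is uninitialized, so strcat() has no terminating '\0' to append after.  The following is a minimal sketch of a corrected version under those assumptions; new_string_with() is a hypothetical helper name, not something taken from the original program.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate a buffer large enough for `text`, turn it
 * into an empty string, then append `text`.  Returns NULL if the
 * allocation fails.  The caller releases the result with free(). */
char *new_string_with(const char *text)
{
        size_t size;
        char *buf;

        assert(text != NULL);
        size = strlen(text) + 1;        /* +1 for the terminating '\0' */
        buf = malloc(size);
        if (buf == NULL)
                return NULL;

        buf[0] = '\0';                  /* malloc() does not clear memory */
        strcat(buf, text);              /* destination is now a valid, large enough string */
        return buf;
}
```

With a helper like this, the two lines inside the "if (rank == 1)" block would become a single assignment to badptr followed by a NULL check.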