
IBM pSeries 690 Running Jobs FAQ

 

Interactive use (cu12):

Batch system:

Generic execution questions/problems:


Interactive use:

My program was killed on cu12 and this message was displayed: Cputime limit exceeded.

    The current CPU time limit on cu12 is 30 minutes, and your program will be killed if it uses more than that. Threaded programs consume CPU time at an accelerated rate: a 4-processor OpenMP program may be killed in less than 8 minutes of elapsed time. Cu12 is for compiling and short tests of your programs. For production runs, use the batch environment [see Running Jobs].
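
    As a rough illustration of that arithmetic, the sketch below estimates how long a threaded program can run before it reaches the limit. It assumes every thread keeps its processor fully busy, so CPU time accumulates at `threads` times the wall-clock rate. The function name is hypothetical; only the 30-minute figure comes from this FAQ.

```python
CPU_LIMIT_MINUTES = 30  # interactive CPU time limit quoted in this FAQ


def elapsed_minutes_before_kill(threads, cpu_limit_minutes=CPU_LIMIT_MINUTES):
    """Estimate wall-clock minutes until the CPU time limit is reached.

    Assumption: every thread is 100% busy, so total CPU time grows
    `threads` times faster than elapsed time.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return cpu_limit_minutes / threads


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "thread(s):", elapsed_minutes_before_kill(n), "minutes")
```

    A program whose threads are partly idle would last somewhat longer than this estimate.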

I see Segmentation fault (core dumped) on a program I know to be correct and debugged.

    It's possible that the stacksize may not be large enough. If your program is 32-bit (the default), try relinking with -bmaxstack:0x80000000 to increase the maximum stacksize to 2 Gbytes. If that does not work, try compiling with -q64 to enable 64-bit addressing.

    If your program is 64-bit and you are trying to run interactively, the program may need more memory than is available: try running in a batch job. If you are already running within a batch job, try setting a higher value for ConsumableMemory.

Program exits with: Error encountered while attempting to allocate a data object.

    Datasize may not be large enough. If your program is 32-bit (the default), try relinking with -bmaxdata:0x80000000 to increase the maximum datasize to 2 Gbytes. If that does not work, try compiling with -q64 to enable 64-bit addressing.

    If your program is 64-bit and you are trying to run interactively, the program may need more memory than is available: try running in a batch job. If you are already running within a batch job, try setting a higher value for ConsumableMemory.

My MPI program will not run on cu12 and this message is printed: Fewer nodes (4) specified in /etc/host.list than tasks ...

    The current interactive limit is 4 MPI processes. Your program will not start if you request more than that. For production runs, use the batch environment [see Running Jobs].

My program fails on cu12 with a message like: Not enough space/memory

    The current interactive memory limit is 1 gigabyte. Your program will fail if it tries to use more. For production runs, use the batch environment [see Running Jobs].

Batch system:

I have a job in the queue and it's not running yet--why?

    At NCSA, we run the same job scheduling package on all of our batch systems. Scheduling is based on first-in, first-out [FIFO] order, with some important modifications to make sure that all jobs and users get a fair chance to run. Some of the scenarios you may observe are explained here.

    Q: Why are many machines or resources currently idle?

    A: The scheduler has to let resources sit idle until enough are free to start a large cpu-count job, so the system always looks this way shortly before one starts. A large cpu-count job will probably start very soon, and the idle resources will be in use once again.

    Q: User X just submitted a job like mine after my job and their job has started, why?

    A: Our scheduler "remembers" for a while how many jobs each user has run. If User X has run few jobs recently and you have run more, User X gets higher priority so that they can catch up and everyone gets a fair share of the system.

    The scheduler will also backfill smaller and shorter jobs to utilize any idle resources.

    Q: How can I get a test, debug, or any of my jobs to start sooner--I really need to try something out now!

    A: The debug queue is available for jobs that use a small number of cpus. In addition, because the scheduler has to let resources sit idle before starting any large [many cpu] job, you can try to take advantage of that situation by submitting a job that requests a short walltime.

    Q: I don't understand why my job hasn't started. How can anyone get work done when the queues are so busy?

    A: See the output of the qs command. It shows the queued jobs along with the time they have been waiting in the queue. Comparing your jobs to others that have waited a similar time will show whether other users are seeing the same delays you are.

How can I save output files from batch jobs on the p690?

    The saveafterjob utility is available for automated, guaranteed saving of output files from batch jobs to the mass storage system. See the documentation and the sample batch scripts for information on its use.

Can I override LoadLeveler cmdfile options on the command line?

    No. You'll need to create a separate LoadLeveler script with the options you prefer.

I can't use ${jobid} in non-#@ lines in my batch script (Illegal variable name.)
OR
Are there special runtime LoadLeveler environment variables I can use inside the batch script?

    For a full list of LoadLeveler runtime environment variables, see the LoadLeveler documentation.
     
    Here's a couple of the more common ones:
    Variable             Example value
    LOADL_STEP_ID        cu12.ncsa.uiuc.edu.2960.0
    LOADL_JOB_NAME       my_mpi_job
    LOADL_STEP_OUT       my_mpi_job.2960.out
    LOADL_STEP_ERR       my_mpi_job.2960.err
    LOADL_STEP_INITDIR   /u/ac/johndoe/submitdir

    To get just the job id (i.e. 2960), put the following line in your batch script:
     setenv jobid  `echo $LOADL_STEP_ID | cut -d'.' -f5`
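
    For scripts that post-process job output outside of csh, the same extraction can be sketched in Python. This is only an illustration: the function name is hypothetical, and it assumes the step id has the form shown in the example value above (a four-part host name, then the job id, then the step number).

```python
import os


def job_id_from_step_id(step_id):
    """Return the fifth dot-separated field, like `cut -d'.' -f5`."""
    fields = step_id.split(".")
    if len(fields) < 5:
        raise ValueError("unexpected step id format: " + repr(step_id))
    return fields[4]


if __name__ == "__main__":
    # Falls back to the example value from this FAQ when run outside a job.
    step_id = os.environ.get("LOADL_STEP_ID", "cu12.ncsa.uiuc.edu.2960.0")
    print(job_id_from_step_id(step_id))
```

    Like the cut command, this depends on the host name having exactly four dot-separated parts.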
    

I can't use $LOGIN or any other environment variable in LoadLeveler directives in my script (e.g. in the naming of the output and error files).

    LoadLeveler does not allow environment variables in directives (lines starting with #@). The variables you can use in directives are listed in the LoadLeveler documentation.

My job output/error file is too large for my home directory. What can I do?

    Your code may be writing a lot of information to standard output or standard error. Try redirecting the output of your executable to another file.
     
    Examples:
    • Redirect only the standard output to a file:
        a.out > stdout
      
    • Redirect both the standard output and standard error to a single file:
        a.out >& stdout
      
    • Redirect both the standard output and standard error but to separate files:
        (a.out > stdout) >& stderr
      
    You can also use job variables in the filenames.
     
    In all cases, do not forget to save the output files before the job finishes.
      msscmd "cd test2, mput std*"
    
    When the job is done, the standard output/error from your executable will be on UniTree. You'll need to retrieve them first in order to look at them. Standard output and standard error from the rest of the script will still be in job output files in your home space like before.
     
    When reporting problems with a job that redirects output, please include any appropriate pieces of the redirected output file(s).

Why can't I use the -procs option on poe to run on different numbers of processors in a LoadLeveler job?

    When poe is used in a batch job submitted through LoadLeveler, the following POE environment variables, and their associated command line options, are validated but not used:
    • MP_PROCS
    • MP_RMPOOL
    • MP_EUIDEVICE
    • MP_EUILIB
    • MP_MSG_API (except for programs that use LAPI and also use the LoadLeveler requirements keyword to specify Adapter="hps_user")
    • MP_HOSTFILE
    • MP_SAVEHOSTFILE
    • MP_RESD
    • MP_RETRY
    • MP_RETRYCOUNT
    • MP_ADAPTER_USE
    • MP_CPU_USE
    • MP_NODES
    • MP_TASKS_PER_NODE
    poe uses the LoadLeveler directive "tasks_per_node" to determine how many MPI processes to run. To run the same program on different numbers of processors, you need to submit multiple jobs or use job steps.

My program fails in the batch environment with a message like: System error: There is not enough memory available now. or 1525-108 Error encountered while attempting to allocate a data object.

    If you are compiling in 32-bit mode (the default; the same as using -q32 or OBJECT_MODE=32), try relinking with the following switches:
    -bmaxstack:0x80000000 -bmaxdata:0x80000000
    
    If this doesn't help, try compiling with -q64. When compiling with -q64, do not use -bmaxdata or -bmaxstack.

    NOTE: If -bmaxstack and -bmaxdata are used when building the program, the requested limits are built into the executable. If the runtime limits [as set with ConsumableMemory() in the LoadLeveler script] are not greater than those built-in values, the program will continue to produce the "not enough memory" error.
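
    The comparison in the note can be sketched as a small check. This is a minimal illustration, not a description of how LoadLeveler enforces limits: the function names are hypothetical, and the runtime limit is taken as a plain number of megabytes supplied by the caller.

```python
def link_limit_bytes(hex_value):
    """Convert a -bmaxdata/-bmaxstack argument such as '0x80000000' to bytes."""
    return int(hex_value, 16)


def runtime_limit_is_sufficient(link_hex, runtime_megabytes):
    """True if the runtime limit is strictly greater than the built-in limit.

    Assumption: the runtime limit is given in megabytes (1024 * 1024 bytes).
    """
    runtime_bytes = runtime_megabytes * 1024 * 1024
    return runtime_bytes > link_limit_bytes(link_hex)


if __name__ == "__main__":
    print(link_limit_bytes("0x80000000") // (1024 * 1024), "megabytes")
    print(runtime_limit_is_sufficient("0x80000000", 2048))
    print(runtime_limit_is_sufficient("0x80000000", 2560))
```

    The value 0x80000000 is 2 Gbytes, so a runtime limit of exactly 2048 megabytes is not greater than it and fails the check.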

Why doesn't the llsummary command give information about my batch job?

    NCSA moves the LoadLeveler accounting files from the default location to a new location every night, so llsummary requires an argument specifying the location of the accounting file. For example:
       llsummary -l -j cu12.45986 /u/loadl/spool/history /u/loadl/spool/globalhist.*
    
    Alternatively, you can use the llhist command to get information about your batch job. For example:
       llhist 45986
    
    See the llhist man page for details.

I get an AFS error in the email from my batch job. What does it mean?

    The AFS errors in your email message:
     LoadL_starter: 2512-906 Cannot set user credentials.
     LoadL_starter: 2539-762 Failed to set AFS credentials. errno=2 [A file or directory in the path name does not exist.]
    
    mean that the AFS tokens you had when you submitted your job expired before your job started. These errors can be safely ignored because AFS filesystems are not used for running jobs at NCSA.


Generic execution questions/problems:

My program takes a long time to start and then this message is printed: Paging space low

    Your program is trying to use more memory than is available on the system, or the system temporarily ran out of free memory. Verify your memory use is what you expect with "llq -w" in the batch environment or "ps u" on cu12.

I see Segmentation fault (core dumped) on a program I know to be correct and debugged.

    It's possible that the stacksize may not be large enough. Try 'limit stacksize kkkkk' where kkkkk is the number of kilobytes that you want to set the stacksize to. Try replacing kkkkk with unlimited if all else fails. Also try to relink with -bmaxstack:0x80000000.

Program exits with: Error encountered while attempting to allocate a data object.

    Datasize may not be large enough. Try 'limit datasize kkkkk' where kkkkk is the number of kilobytes that you want to set the datasize to. Try replacing kkkkk with unlimited if all else fails. Also try to relink with -bmaxdata:0x80000000.

How do I find out how much memory my program used?

    You can use the llhist command.
     
    Or you can use llsummary to see the maximum resident set size [maxrss] used by your program and all of its processes and threads. The maxrss field is shown in units of 1 kilobyte, so commands like these will yield the maximum megabytes or gigabytes used:
    Cu12:~/loadleveler315% expr `llsummary -l -j cu12.4123 | grep 'Step maxrss'|cut -d':' -f2` / 1024
    229051  # megabytes
    Cu12:~/loadleveler316% expr `llsummary -l -j cu12.4123 | grep 'Step maxrss'|cut -d':' -f2` / 1048576
    223     # gigabytes
    
    For programs running on the interactive machine, try "ps u" and look at the RSS field [as with llsummary, the units are 1 kilobyte]:
    Cu12:~/loadleveler329% ps u
    USER         PID %CPU %MEM   SZ  RSS    TTY STAT    STIME  TIME COMMAND
    arnoldg  3686468  4.2  7.0 887704 887732 pts/17 A    11:45:17  0:02 ./malloctest
    Cu12:~/loadleveler330% expr 887732 / 1024
    866  # megabytes
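
    The same lookup and conversion can be sketched in Python for use in a wrapper script. The function name and the embedded sample text are illustrative only; the sample is the "ps u" output shown above.

```python
def rss_kilobytes(ps_output, command):
    """Return the RSS value (1 kilobyte units) for the first line running `command`."""
    lines = ps_output.strip().splitlines()
    # Locate the RSS column from the header rather than hard-coding its position.
    rss_index = lines[0].split().index("RSS")
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[-1] == command:
            return int(fields[rss_index])
    raise LookupError("no process named " + repr(command))


SAMPLE = """\
USER         PID %CPU %MEM   SZ  RSS    TTY STAT    STIME  TIME COMMAND
arnoldg  3686468  4.2  7.0 887704 887732 pts/17 A    11:45:17  0:02 ./malloctest
"""

if __name__ == "__main__":
    kb = rss_kilobytes(SAMPLE, "./malloctest")
    print(kb // 1024, "megabytes")  # integer division, like expr
```

    Matching on the last field only works for commands run without arguments; a command line with arguments would need a different match.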
    