Interactive use:
The current CPU time limit is 30 minutes. Your program will be killed if it uses more time than that. Threaded programs consume CPU time at an accelerated rate: a 4-processor OpenMP program may be killed in less than 8 minutes of elapsed time. Cu12 is for compiling and for short tests of your programs. For production runs, use the batch environment [see Running Jobs].
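The arithmetic behind the 8-minute figure can be sketched in sh as follows. This is only an illustration: the function name is hypothetical, and it assumes every thread is busy for the whole run, which is the worst case.

```sh
# Estimate the elapsed (wall-clock) seconds before a CPU-time limit is hit,
# assuming all threads are busy the whole time (an idealized assumption).
# Usage: elapsed_until_cpu_limit LIMIT_MINUTES THREADS
elapsed_until_cpu_limit() {
    limit_minutes=$1
    threads=$2
    echo $(( limit_minutes * 60 / threads ))
}

elapsed_until_cpu_limit 30 4   # prints 450 (seconds, i.e. 7.5 minutes)
```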
It's possible that the stacksize may not be large enough.
If your program is 32-bit (the default),
try relinking with
-bmaxstack:0x80000000 to increase the maximum stacksize to
2 Gbytes. If that does not work, try compiling with -q64
to enable 64-bit addressing.
If your program is 64-bit and you are trying to run interactively,
the program may need more memory than is available: try running
in a batch job. If you are already running within a batch job,
try setting a higher value for
ConsumableMemory.
Datasize may not be large enough.
If your program is 32-bit (the default),
try relinking with
-bmaxdata:0x80000000 to increase the maximum datasize to
2 Gbytes. If that does not work, try compiling with -q64
to enable 64-bit addressing.
If your program is 64-bit and you are trying to run interactively,
the program may need more memory than is available: try running
in a batch job. If you are already running within a batch job,
try setting a higher value for
ConsumableMemory.
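The hexadecimal values given to -bmaxstack and -bmaxdata are byte counts. As a quick check of what a value means, a small sh helper (the name is hypothetical, and it assumes a shell with 64-bit arithmetic) can convert it to megabytes:

```sh
# Convert a hexadecimal byte count, as passed to -bmaxdata or -bmaxstack,
# into whole megabytes (1 megabyte = 1048576 bytes here).
# Usage: hex_bytes_to_mb 0xNNNNNNNN
hex_bytes_to_mb() {
    echo $(( $1 / 1048576 ))
}

hex_bytes_to_mb 0x80000000   # prints 2048 (megabytes, i.e. 2 Gbytes)
```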
The current interactive MPI process limit is 4. Your program will not start if you request more processors. For production runs, use the batch environment [see Running Jobs].
The current interactive memory limit is 1 gigabyte. Your program will fail if it tries to use more. For production runs, use the batch environment [see Running Jobs].
Batch system:
At NCSA, we run the same job scheduling package on all of our batch systems.
The scheduling of jobs is based on first-in, first-out [FIFO] but with some
important modifications to make sure that all jobs and users get a fair chance
to run. Some of the various scenarios you may observe are explained here.
Q: Why are many machines or resources currently idle?
A: In the time period before a large cpu-count job starts, it always looks
this way. A large cpu-count job will probably be starting very soon and
the idle resources will be used once again.
Q: User X submitted a job like mine after I submitted my job, but their job
has already started. Why?
A: Our scheduler "remembers" how many jobs each user has run recently.
If User X has run few jobs recently and you have run more, then
User X will get more priority so that they can catch up and everyone gets
a fair share of the system.
The scheduler will also backfill smaller and shorter jobs to utilize any idle resources.
Q: How can I get a test, debug, or any of my jobs to start sooner--I really
need to try something out now!
A: The debug queue is available for jobs that use few CPUs. In addition, because the scheduler has to let resources sit idle before starting any
large [many cpu] job, you can try to take advantage of that situation by
submitting a job that requests a short walltime.
Q: I don't understand why my job hasn't started. How can anyone get work
done when the queues are so busy?
A: See the output of the qs command. It shows the queued jobs along with
the time they've been waiting in the queue. You can compare your jobs to
jobs that have waited a similar time and see that other users'
experience is similar to yours.
The saveafterjob utility is available for automated, guaranteed
saving of output files from batch jobs to the mass storage system. See
the documentation and the sample batch scripts for information
on its use.
No. You'll need to create a separate LoadLeveler script with the options you prefer.
For a full list of LoadLeveler runtime environment variables, please see
the LoadLeveler documentation.
Here are some of the more common ones:
| Variable | Example value |
| --- | --- |
| LOADL_STEP_ID | cu12.ncsa.uiuc.edu.2960.0 |
| LOADL_JOB_NAME | my_mpi_job |
| LOADL_STEP_OUT | my_mpi_job.2960.out |
| LOADL_STEP_ERR | my_mpi_job.2960.err |
| LOADL_STEP_INITDIR | /u/ac/johndoe/submitdir |
To get just the job id (e.g. 2960), put the following line in your
batch script:
setenv jobid `echo $LOADL_STEP_ID | cut -d'.' -f5`
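The line above is csh syntax and relies on the job number being the fifth dot-separated field. For batch scripts written in sh or ksh, a minimal sketch that instead takes the second-to-last field might look like this. The function name is hypothetical, and it assumes the step ID always ends in jobnumber.stepnumber, as in the example value in the table.

```sh
# Extract the job number from a LoadLeveler step ID such as
# cu12.ncsa.uiuc.edu.2960.0 by printing the second-to-last
# dot-separated field (assumes the ID ends in jobnumber.stepnumber).
job_number() {
    echo "$1" | awk -F. '{ print $(NF-1) }'
}

job_number cu12.ncsa.uiuc.edu.2960.0   # prints 2960
# In a batch script: jobid=$(job_number "$LOADL_STEP_ID")
```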
LoadLeveler does not allow environment variables in directives (lines starting with #@). Variables you can use are listed in the LoadLeveler documentation.
Your code may be writing a lot of information to standard output or
standard error. Try redirecting the output of your executable to another
file.
Examples:
You can also use job variables in the filenames.
In all cases, do not forget to save the output files before the job finishes.
msscmd "cd test2, mput std*"
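As a rough illustration of the redirection described above, the following sh sketch sends the standard output and standard error of a command to files whose names include a job id. The function name, the file names stdout.JOBID and stderr.JOBID, and the executable ./a.out are assumptions made for this example; they were chosen so that the files match the std* pattern in the msscmd line.

```sh
# Run a command with its standard output and standard error written to
# separate files whose names include a job identifier.
# Usage: run_redirected JOBID COMMAND [ARGS...]
run_redirected() {
    jobid=$1
    shift
    "$@" > "stdout.$jobid" 2> "stderr.$jobid"
}

# Inside a batch script this might be used as:
#   run_redirected "$jobid" ./a.out
#   msscmd "cd test2, mput std*"
```

Redirecting only the executable leaves the output of the rest of the script in the usual job output files.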
When the job is done, the standard output/error from your executable will be
on UniTree. You'll need to retrieve them first in order to look at them.
Standard output and standard error from the rest of the script will still
be in job output files in your home space like before.
When reporting problems with a job that redirects output, please include
any appropriate pieces of the redirected output file(s).
When poe is used in a batch job submitted using LoadLeveler, the following POE environment variables and their associated command line options are validated but not used:
- MP_PROCS
- MP_RMPOOL
- MP_EUIDEVICE
- MP_EUILIB
- MP_MSG_API (except for programs that use LAPI and also use the LoadLeveler requirements keyword to specify Adapter="hps_user")
- MP_HOSTFILE
- MP_SAVEHOSTFILE
- MP_RESD
- MP_RETRY
- MP_RETRYCOUNT
- MP_ADAPTER_USE
- MP_CPU_USE
- MP_NODES
- MP_TASKS_PER_NODE
poe uses the LoadLeveler directive "tasks_per_node" to determine how many MPI processes to run. To run the same program on different numbers of processors, you need to submit multiple jobs or use job steps.
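A minimal fragment of a job script showing where the process count comes from might look like the following. It is a sketch, not a complete script: a real job would need further directives that are omitted here, the node and task counts are example values, and ./a.out is a placeholder executable.

```sh
# Fragment of a LoadLeveler job script (other required directives omitted).
# @ job_type       = parallel
# @ node           = 1
# @ tasks_per_node = 4
# @ queue

# poe takes the number of MPI processes from the directives above
# (here 1 node x 4 tasks per node), not from MP_PROCS.
poe ./a.out
```

To run the same program with 8 processes, a second script with a different tasks_per_node or node value would be submitted.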
If compiling with -q32 (default) or with OBJECT_MODE=32 (default) then try
to relink with the following switches:
-bmaxstack:0x80000000 -bmaxdata:0x80000000
If this doesn't help, try compiling with -q64.
If compiling with -q64, you should not use -bmaxdata and -bmaxstack.
NOTE: If -bmaxstack and -bmaxdata are used when the program is linked, the requested limits are
built in to the program. If the runtime limits [as set with ConsumableMemory() in the LoadLeveler script] are not greater than
the values built in to the program, then the program will continue to produce the "not enough memory" error.
NCSA moves the LoadLeveler accounting files from the default location to
a new location every night, so llsummary requires an argument specifying
the location of the accounting file. For example:
llsummary -l -j cu12.45986 /u/loadl/spool/history /u/loadl/spool/globalhist.*
Alternatively, you can use the llhist command to get information about
your batch job. For example:
llhist 45986
See the llhist man page for details.
The AFS errors in your email message:
LoadL_starter: 2512-906 Cannot set user credentials.
LoadL_starter: 2539-762 Failed to set AFS credentials. errno=2 [A file or directory in the path name does not exist.]
mean that the AFS tokens you had when you submitted your job expired before
your job started. This error can be safely ignored because
AFS filesystems are not used for running jobs at NCSA.
I have a job in the queue and it's not running yet--why?
Generic execution questions/problems:
Your program is trying to use more memory than is available on the system, or the system temporarily ran out of free memory. Verify your memory use is what you expect with "llq -w" in the batch environment or "ps u" on cu12.
It's possible that the stacksize may not be large enough. Try 'limit
stacksize kkkkk' where kkkkk is the number of kilobytes that you want to
set the stacksize to. Try replacing kkkkk with unlimited if all else
fails. Also try to relink with -bmaxstack:0x80000000.
Datasize may not be large enough. Try 'limit datasize kkkkk' where
kkkkk is the number of kilobytes that you want to set the datasize to. Try
replacing kkkkk with unlimited if all else fails. Also try to relink with
-bmaxdata:0x80000000.
You can use the llhist command.
Or you can use llsummary to see the maximum resident set size [maxrss] used by your program and all of its processes and threads. The maxrss field is shown in 1k units, so a command like this will yield the maximum megabytes or gigabytes used:
Cu12:~/loadleveler315% expr `llsummary -l -j cu12.4123 | grep 'Step maxrss'|cut -d':' -f2` / 1024
229051 # megabytes
Cu12:~/loadleveler316% expr `llsummary -l -j cu12.4123 | grep 'Step maxrss'|cut -d':' -f2` / 1048576
223 # gigabytes
For programs running on the interactive machine, try "ps u" and look at the RSS field [like llsummary, the values are in 1k units]:
Cu12:~/loadleveler329% ps u
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
arnoldg 3686468 4.2 7.0 887704 887732 pts/17 A 11:45:17 0:02 ./malloctest
Cu12:~/loadleveler330% expr 887732 / 1024
866 # megabytes
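Because expr does integer division, the results above are truncated. If a fractional value is wanted, a small sh helper using awk can do the same conversion. The function name is hypothetical; the input is assumed to be in 1k units, as in the RSS and maxrss fields.

```sh
# Convert a size in 1k units (as in the RSS and maxrss fields) to
# megabytes, keeping one decimal place instead of truncating like expr.
# Usage: kb_to_mb SIZE_IN_KB
kb_to_mb() {
    awk -v kb="$1" 'BEGIN { printf "%.1f\n", kb / 1024 }'
}

kb_to_mb 887732   # prints 866.9
```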