lsbatch is a load sharing batch system built on top of the LSF
(Load Sharing Facility) system. See the online man page ("man lsfbatch")
and the file /usr/news/lsf for details.
Either the file /usr/news/lsf or the lsbatch command bqueues
lists the queues currently available.
Sample scripts are available in the directory
/usr/local/doc/lsf/samples.
You can submit a batch job either using
(1) bsub < scriptname
or
(2) bsub scriptname
With (1):
- the batch script is spooled, which means that the batch system makes
a copy of the script at the time of submission to be used when the job
runs.
- embedded bsub options are recognized
With (2):
- the job executes from the actual location at the time the job starts,
which means that any changes made to the script while the job is pending
will take effect (which may or may not be desirable).
- embedded bsub options are ignored
- the script needs to be executable (use chmod) to work
correctly.
In general, we recommend option (1).
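As an illustration of option (1), here is a minimal sketch of a batch script with embedded bsub options. The job name, output file name, run limit, and PSN xyz are placeholders chosen for this example, not site defaults. The sketch only writes the script to a file; the submission command is shown as a comment.

```shell
#!/bin/sh
# Write a small batch script that carries its bsub options with it.
# Every option value below is a placeholder chosen for illustration.
cat > myjob.sh <<'EOF'
#!/bin/sh
# Job name (hypothetical)
#BSUB -J myjob
# File that receives the job's standard output (hypothetical)
#BSUB -o myjob.out
# PSN to charge
#BSUB -P xyz
# Run time limit in minutes (placeholder)
#BSUB -W 60
echo "job started on $(hostname)"
EOF

# Submit with option (1) so the script is spooled and the #BSUB lines are read:
#   bsub < myjob.sh
```

With option (2), the same #BSUB lines would be ignored, so any options would have to be given on the bsub command line instead.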
Yes. See /usr/news/Interactive_Limits for the current limits.
Each user can have a total of 5 queued and running jobs, with a limit of
3 running jobs.
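To see how close you are to these limits, you can count your jobs by status. The sketch below assumes the default bjobs output layout (one header line, with the job status in the third column); count_jobs is a hypothetical helper name, and which job states the site counts toward the total of 5 is an assumption here.

```shell
# Count jobs from bjobs-style text read on standard input.
# Assumes one header line and the job status (RUN, PEND, ...) in column 3.
count_jobs() {
    awk 'NR > 1 { total++; if ($3 == "RUN") running++ }
         END { printf "total=%d running=%d\n", total, running }'
}

# Usage on the system (not run here):
#   bjobs | count_jobs
```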
Yes, you can do this (for your own jobs only) with the lsbatch
command bpeek. See the man page for details.
Include the embedded bsub option
#BSUB -P xyz
in your batch script to charge the job to PSN xyz.
Alternatively, setting a default PSN (see /usr/news/Charge_Projects for
information) will cause all jobs to be charged to this PSN.
Please refer to the file /usr/news/Refund_Policy for NCSA's
refund policy and the procedure you should follow to obtain a refund.
There can be several reasons for this:
- There are limits on the number of jobs (both running and queued) per
user, so if a user already has the maximum number of jobs running, the
batch system will not start another job until at least one of those jobs has
completed. There is also a limit on the maximum number of jobs per queue.
The command bqueues gives the job slots for each queue.
Currently there is also a limit of 3 running jobs per user on the Origin.
- Another reason is that the resources requested by a job need to
be available before the job can start up.
NCSA uses a resource-based system in lsbatch on the Origin,
where a job does not start until the resources it requires are
available. This ensures that the machines don't get overloaded.
Jobs in the same queue could have different resource requirements.
The job monitoring page "LSF job and queue status", available
from the Origin2000 Hardware page, includes the memory and thread
requests of each job.
In the short term, single-threaded and small parallel jobs get started
much more quickly because small numbers of CPUs become available more
frequently. So that single-threaded jobs do not lock out highly
parallel jobs (>=32 threads) indefinitely, NCSA also has a reservation
setup: if a highly parallel job has been waiting for a period of time,
CPUs will start being held by the system for that job.
- There are also different priorities for the different queues, with
queues for smaller jobs having higher priority than queues for larger jobs.
- When jobs are lost due to a system crash, users can also request
that their resubmitted jobs be moved to the top of the queue.
Status SSUSP means that the batch system has suspended the job.
lsbatch will suspend jobs when load conditions on the execution
host have reached our site-defined thresholds. The command
bhist -l will give the history of the job: the time(s)
at which it was suspended and the reason(s) why. This is normal, and is
meant to prevent overloading the system. lsbatch will
automatically start up the job again when resources become available.
If you find that your job repeatedly gets suspended for extended
periods of time, please let us know.
lsbatch reports regular Unix signal numbers and error codes. The exit code comes
from whichever process in your batch job was running at the time, i.e., the last
process. So in general it can originate anywhere (anything from your
code to lsbatch system problems). According to Platform Computing (the folks
who developed lsbatch), for exit codes > 128, subtract 128 to get the signal or
error number.
Check man 5 signal for signal numbers and man 2 intro for
system error numbers.
Since lsbatch gives only an exit code (and doesn't specify whether it comes
from a signal or an error), you'll need to look at both
to check which one fits.
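The "subtract 128" rule can be sketched as a small shell function. This is only an illustration; decode_exit is a hypothetical helper name, and the result still has to be looked up by hand in man 5 signal and man 2 intro.

```shell
# Apply the "subtract 128" rule to an lsbatch exit code.
# Codes of 128 or less are printed unchanged.
decode_exit() {
    code=$1
    if [ "$code" -gt 128 ]; then
        echo "$((code - 128))"
    else
        echo "$code"
    fi
}

decode_exit 130   # prints 2
```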
Exit code 130 translates to (130-128) = 2.
The sequence of signals used to terminate jobs that exceed the limits on
run time or number of processes/threads, or that are killed with the
lsbatch command bkill (issued either by the user or by system personnel), is:
Interrupt signal SIGINT(2)
wait 10 minutes
Kill signal SIGKILL(9)
All jobs that terminate for the above reasons will exit with the interrupt
signal (unless the job traps the interrupt signal, in which case it will
exit with the kill signal).
So the most common reason for exit code 130 is that the job has exceeded
one of the limits stated above.
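Because the interrupt signal arrives 10 minutes before the kill signal, a Bourne-shell batch script can trap it and save partial results. The following is a minimal sketch, not an NCSA-provided script; the function name save_state and the file name partial_results.txt are hypothetical.

```shell
#!/bin/sh
# Hypothetical cleanup routine: record whatever should survive the job.
save_state() {
    echo "interrupted at $(date)" > partial_results.txt
}

# On SIGINT, save state and then exit with code 130 (128 + 2).
trap 'save_state; exit 130' INT

# ... the real work of the job would go here ...
```

If the handler does not finish and exit within the 10-minute window, the job will still receive the kill signal.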
sbatchd is the slave batch daemon that runs on all the Origin2000
Array machines. A common reason for sbatchd on a machine to
go down is that the machine has crashed (which also kills all running
jobs on that machine). You will need to resubmit your job in this case.
Note that any charges incurred for jobs that died due to a machine crash
are refunded.
[Exit code 255 translates to (255-128) = 127. From man 2 intro,
this error is "Network is down".]
The utility busage gives job resource statistics and can be used
for both running and completed jobs. The statistics include cpu time,
run time, peak physical memory, and number of threads/processes used
by the job.
For per-process information, the C-shell set time command can be
used for jobs that are not (SGI native) MPI/PVM jobs.
The following example gives peak physical memory, CPU time,
and wall clock time for all processes over 5 CPU seconds.
set time=(5 "Phys Mem peak(Kb): %M, CPU(secs): %U+%Ss, Elaps(secs): %E")
This can be used in C-shell batch scripts. See "man csh" for
other options to set time.
CPU time for all jobs can be obtained using the SGI perfex
utility. CPU time for non-MPI/PVM jobs can also be obtained using
/bin/time or timex.
lsbatch also reports statistics at the end of a batch job.
At this time, we recommend ignoring these numbers since we are not
sure of their accuracy.
You can add the usage report to the end of your job output file
simply by adding the following as the last executable line of your batch
script:
busage $LSB_JOBID
The environment variable $LSB_JOBID is the ID of the currently running
job. It is defined in the man page for bsub.
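If the same script is sometimes run outside the batch system (for example, while testing), the busage call can be guarded so that it runs only when $LSB_JOBID is set. This is a sketch under that assumption; report_usage is a hypothetical helper name.

```shell
# Print the usage report only when running as an lsbatch job.
report_usage() {
    if [ -n "${LSB_JOBID:-}" ]; then
        busage "$LSB_JOBID"
    else
        echo "LSB_JOBID is not set; skipping busage"
    fi
}

# Call this as the last executable line of the batch script.
report_usage
```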
There is no way to disable the e-mail message, except by using
a filtering mechanism which may be a part of your e-mail program.