Running Jobs


What is the batch system?

    lsbatch is a load-sharing batch system built on top of the LSF (Load Sharing Facility) system. See the online man page "man lsfbatch" and the file /usr/news/lsf for details.

What queues are available?

    The file /usr/news/lsf and the lsbatch command bqueues both give the current table of available queues.

Is there a sample batch script online?

    Sample scripts are available in the directory /usr/local/doc/lsf/samples.

How do I submit a batch job?

    You can submit a batch job either using
    (1)        bsub < scriptname
    
    or
    (2)        bsub scriptname 
    
    With (1):
    • the batch script is spooled, which means that the batch system makes a copy of the script at the time of submission to be used when the job runs.
    • embedded bsub options are recognized
    With (2):
    • the script is run from its actual location when the job starts, so any changes made to the script while the job is pending will take effect (which may or may not be desirable).
    • embedded bsub options are ignored
    • the script needs to be executable (use chmod) to work correctly.

    In general, we recommend option (1).
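
    As an illustration of option (1), the sketch below writes a small script with embedded bsub options and submits it by redirection. The file name, job name, output file, and program name are placeholders rather than site conventions; -P xyz is the charging option described later in this FAQ, and -J and -o are assumed to behave as described in the bsub man page. The bsub call is guarded so the snippet does nothing harmful on a machine without the batch system.

```shell
# Write a minimal batch script. All names below are hypothetical placeholders.
cat > myjob.sh <<'EOF'
#!/bin/sh
# Job name (placeholder)
#BSUB -J myjob
# File that receives the job's standard output (placeholder)
#BSUB -o myjob.out
# PSN to charge, as in the -P example in this FAQ
#BSUB -P xyz

./my_program
EOF

# Submit with input redirection so the script is spooled and the #BSUB lines are read.
if command -v bsub >/dev/null 2>&1; then
    bsub < myjob.sh
else
    echo "bsub not found; run this on a host with the batch system installed"
fi
```

    Because the script is spooled at submission, later edits to myjob.sh do not affect the job that is already pending.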

Are there limits on interactive jobs?

    Yes. See /usr/news/Interactive_Limits for the current limits.

Is there a limit on the number of jobs I can have in the batch system at a given time?

    Each user can have at most 5 jobs in the batch system at a time (queued plus running), of which at most 3 can be running.
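
    To see how close you are to these limits, you could count your jobs from a job listing. The helper below is a sketch only: count_jobs is a hypothetical name, and it assumes the listing has one header line and the job state in the third column (as in the usual bjobs output); check the bjobs man page on the system before relying on it.

```shell
# Summarize job counts from a bjobs-style listing read on standard input.
# Assumption: first line is a header, third column is the job state (e.g. RUN, PEND).
count_jobs() {
    awk 'NR > 1 { total++; if ($3 == "RUN") running++ }
         END { printf "total=%d running=%d\n", total, running }'
}

# Only call bjobs where the batch system is installed.
if command -v bjobs >/dev/null 2>&1; then
    bjobs | count_jobs
fi
```

    Compare the reported totals against the limits of 5 jobs in the system and 3 running.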

Is there any way to view lsbatch stdout/stderr while the job is running?

    Yes, you can do this (for your own jobs only) with the lsbatch command bpeek. See the man page for details.

How can I charge my batch jobs to a specific PSN?

    Use the bsub option
    #BSUB -P xyz
    
    to charge a batch job to PSN xyz.

    Alternatively, setting a default PSN (see /usr/news/Charge_Projects for information) will cause all jobs to be charged to this PSN.

How can I get a refund for lost jobs?

    Please refer to the file /usr/news/Refund_Policy for NCSA's refund policy and the procedure you should follow to obtain a refund.

Why is it that jobs submitted after mine to the batch system have started running while my job is still pending?

    There can be several reasons for this:
    1. There are limits on the number of jobs (both running and queued) per user, so if a user already has the maximum number of jobs running, the batch system will not start another job until at least one of those jobs has completed. There is also a limit on the maximum number of jobs per queue. The command bqueues gives the job slots for each queue. Currently there is also a limit of 3 running jobs per user on the Origin.
    2. The resources a job requests must be available before the job can start. NCSA uses a resource-based system in lsbatch on the Origin, where jobs do not start unless the resources they require are available. This ensures that the machines do not get overloaded. Jobs in the same queue can have different resource requirements. The job monitoring page "LSF job and queue status", available from the Origin2000 Hardware page, includes the memory and thread requests of each job. In the short term, single-threaded and small parallel jobs start much more quickly because small numbers of CPUs become available more frequently. So that single-threaded jobs do not lock out highly parallel jobs (>=32 threads) indefinitely, NCSA also has a reservation setup: once a highly parallel job has been waiting for a period of time, the system starts holding CPUs for it.
    3. The queues have different priorities, with queues for smaller jobs having higher priority than queues for larger jobs.
    4. Users whose jobs were lost in a system crash can request that their resubmitted jobs be moved to the top of the queue, so such jobs may start ahead of yours.

My previously running batch job now shows a status of SSUSP. How can I get it running again?

    Status SSUSP means that the batch system has suspended the job. lsbatch suspends jobs when load conditions on the execution host reach our site-defined thresholds. This is normal and is meant to prevent overloading the system; lsbatch will automatically resume the job when resources become available. The command bhist -l gives the history of the job: the time(s) at which it was suspended and the reason(s) why.

    If you find that your job repeatedly gets suspended for extended periods of time, please let us know.

Is there an explanation for exit codes from lsbatch?

    lsbatch uses the regular Unix signal and error codes. The exit code comes from a signal or error raised by whatever process was running at the time in your batch job, i.e., the last process. In general it can therefore come from any process (anything from your code to lsbatch system problems). According to Platform Computing (the developers of lsbatch), for exit codes > 128, subtract 128 to get the signal or error number.

    Check man 5 signal for signal numbers and man 2 intro for system error numbers. Since lsbatch just gives an exit code (and doesn't specify whether it's coming from a signal or an error), you'll need to look at both to check which one fits.
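
    The subtraction rule can be sketched as a small shell function. The name decode_exit is made up for this illustration; the function only does the arithmetic, so you still need to look the resulting number up in man 5 signal and man 2 intro to decide whether it is a signal or an error.

```shell
# Convert a batch exit code to the underlying signal or error number.
# Codes greater than 128 have 128 subtracted; other codes are returned unchanged.
decode_exit() {
    code=$1
    if [ "$code" -gt 128 ]; then
        echo $((code - 128))
    else
        echo "$code"
    fi
}

decode_exit 130   # prints 2
decode_exit 255   # prints 127
```

    The two calls correspond to the exit codes 130 and 255 discussed in the questions that follow.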

What does exit code 130 mean?

    Exit code 130 translates to (130-128) = 2.

    Jobs that exceed the limits on run time or number of processes/threads, and jobs terminated with the lsbatch command bkill (issued either by the user or by system personnel), are sent the following sequence of signals:

    Interrupt signal SIGINT(2) 
    wait 10 minutes
    Kill signal SIGKILL(9)
    
    All jobs that terminate for the above reasons will exit with the interrupt signal (unless the job traps the interrupt signal, in which case it will exit with the kill signal).

    So the most common reason for exit code 130 is that the job has exceeded one of the limits stated above.

What does the message "Exited with exit code 255 while sbatchd was down. CPU time used is unknown." from lsbatch mean?

    sbatchd is the slave batch daemon that runs on all the Origin2000 Array machines. A common reason for sbatchd on a machine to go down is that the machine has crashed (which also kills all running jobs on that machine). You will need to resubmit your job in this case. Note that any charges incurred for jobs that died due to a machine crash are refunded.

    [Exit code 255 translates to (255-128) = 127. From man 2 intro, this error is "Network is down".]

How can I get statistics like memory and CPU time used by a batch job?

    The utility busage gives job resource statistics and can be used for both running and completed jobs. The statistics include cpu time, run time, peak physical memory, and number of threads/processes used by the job.

    For per-process information, the C-shell set time command can be used for jobs that do not use (SGI native) MPI/PVM. The following example reports peak physical memory, CPU time, and wall clock time for every process that uses more than 5 CPU seconds.

    set time=(5 "Phys Mem peak(Kb): %M, CPU(secs): %U+%Ss, Elaps(secs): %E")
    
    This can be used in C-shell batch scripts. See "man csh" for other options to set time.

    CPU time for all jobs can be obtained using the SGI perfex utility. CPU time for non-MPI/PVM jobs can also be obtained using /bin/time or timex.

    lsbatch also reports statistics at the end of a batch job. At this time, we recommend ignoring these numbers since we are not sure of their accuracy.

How can I get the job usage report at the end of my job standard output file rather than in email?

    You can add the usage report to the end of your job output file simply by adding the following as the last executable line of your batch script:
       busage $LSB_JOBID
    
    The environment variable $LSB_JOBID is the ID of the currently running job. It is defined in the man page for bsub.

    There is no way to disable the e-mail message, except by using a filtering mechanism which may be a part of your e-mail program.
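
    For illustration, a batch script with the usage report as its last executable line might look like the sketch below. The file name and program name are placeholders. The heredoc is quoted so that $LSB_JOBID is written literally into the script and expanded only when the job runs.

```shell
# Write a batch script whose last executable line appends the usage report.
# myjob_usage.sh and ./my_program are hypothetical placeholders.
cat > myjob_usage.sh <<'EOF'
#!/bin/sh
./my_program
busage $LSB_JOBID
EOF

# Check the script for syntax errors without running it.
sh -n myjob_usage.sh && echo "script syntax OK"
```

    Submit the script as described under "How do I submit a batch job?"; the report then follows the program's own output in the job's standard output file.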