At NCSA, we run the same job scheduling package on all of our batch systems.
Jobs are scheduled first-in, first-out (FIFO), but with some
important modifications to make sure that all jobs and users get a fair chance
to run. Some of the scenarios you may observe are explained here.
Q: Why are many machines or resources currently idle?
A: The scheduler has to let resources sit idle before it can start a large
cpu-count job, so the system always looks this way in the period just before
such a job starts. A large cpu-count job will probably be starting very soon
and the idle resources will be used once again.
Q: User X submitted a job like mine after I submitted my job, and their job
has already started. Why?
A: Our scheduler keeps track of how many jobs each user has run
recently. If User X has run few jobs recently, and you've run more, then
User X will get more priority so that they can catch up and everyone gets
a fair share of the system.
The scheduler will also backfill smaller and shorter jobs to utilize any idle resources.
Q: How can I get a test, debug, or any of my jobs to start sooner? I really
need to try something out now!
A: The debug queue is available for small cpu-count jobs. In addition, because
the scheduler has to let resources sit idle before starting any
large (many-cpu) job, you can try to take advantage of that situation by
submitting a job that requests a short walltime.
Q: I don't understand why my job hasn't started. How can anyone get work
done when the queues are so busy?
A: See the output of the qs command. It shows the queued jobs along with
the time they've been waiting in the queue. You can compare your jobs to
jobs that have waited a similar time and see that other users'
experience is similar to yours.
This is probably because your account is expired or overused. You can use
the batch_accts command to check your accounts.
You can also check details of your allocation and usage on the TeraGrid Portal or using the TeraGrid tgusage command.
This error message occurs when the jobname specified in the -N
option is longer than 15 characters or when the first character in the name
is not alphabetic. From the qsub man page:
-N name Declares a name for the job. The name specified may be up to
and including 15 characters in length. It must consist of
printable, non white space characters with the first character
alphabetic.
If your batch script does not specify the -N option, the
name of the batch script is used as the jobname, so the above limitations
apply to the batch script name in this case.
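The jobname rules quoted above can be checked before submitting. The following is a minimal sketch in Python, not part of qsub or PBS; the function name is_valid_jobname is hypothetical, and the sketch assumes ASCII names and treats an empty name as invalid.

```python
import string

# Limit quoted from the qsub man page excerpt above.
MAX_JOBNAME_LENGTH = 15


def is_valid_jobname(name: str) -> bool:
    """Return True if name satisfies the -N rules quoted above.

    Hypothetical helper: assumes ASCII names and rejects empty names.
    """
    # The name may be up to and including 15 characters long.
    if not name or len(name) > MAX_JOBNAME_LENGTH:
        return False
    # The first character must be alphabetic.
    if name[0] not in string.ascii_letters:
        return False
    # All characters must be printable and must not be white space.
    return all(
        ch in string.printable and ch not in string.whitespace
        for ch in name
    )


if __name__ == "__main__":
    for candidate in ["myjob01", "1stjob", "a_very_long_job_name", "my job"]:
        print(candidate, is_valid_jobname(candidate))
```

When the script does not use -N, the same check can be applied to the file name of the batch script, since that name becomes the jobname.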
This message is harmless and can safely be ignored. It just means that there
is no interactive shell access for this job.
Why does the qsub command give an "Unknown resource type" error?
Error message when submitting with qsub:
qsub: Unknown resource type Resource_List..2gb
That's because the memory limit needs to be specified as an integer. For example, if you want 1.2gb, then request 1200mb instead.
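The conversion in the example above can be sketched as a small Python helper. The name gb_to_mb_request is hypothetical, and the factor of 1000 is an assumption taken from the 1.2gb to 1200mb example; fractional megabytes are rounded up so the request is never smaller than intended.

```python
from decimal import Decimal, ROUND_CEILING

# Assumption: 1gb is treated as 1000mb, following the example above
# (1.2gb is requested as 1200mb).
MB_PER_GB = 1000


def gb_to_mb_request(gigabytes: str) -> str:
    """Convert a possibly fractional gb amount into an integer mb string.

    Hypothetical helper. The amount is passed as a string and parsed
    with Decimal to avoid floating-point rounding surprises.
    """
    megabytes = Decimal(gigabytes) * MB_PER_GB
    # Round up so the integer request is not smaller than what was asked for.
    whole_mb = int(megabytes.to_integral_value(rounding=ROUND_CEILING))
    return f"{whole_mb}mb"


if __name__ == "__main__":
    print(gb_to_mb_request("1.2"))
```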
Why do I get an ssh-related error message from my batch job?
If you use the ssh command in your batch script, you might get an error message like this: /usr/local/openssh-mechglue/bin/real/ssh: error while loading shared libraries: libkrb4.so.2: cannot open shared object file: No such file or directory. This is because ssh is not supported on compute nodes.
My batch job fails to run with the error message: /var/spool/pbs/mom_priv/jobs/jobid.co-master.SC: Command not found. Why?
Your batch script was probably copied from a Windows or Mac system, and it may contain hidden formatting characters. Please run the dos2unix command to convert your file from DOS/Mac format to UNIX format.
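The hidden characters are typically carriage returns at the ends of lines. As a rough illustration only, and not a replacement for dos2unix, the following Python sketch detects them and shows the kind of conversion involved; both function names are hypothetical.

```python
def has_dos_or_mac_line_endings(data: bytes) -> bool:
    """Return True if the data contains carriage returns.

    DOS/Windows files end lines with CR LF and classic Mac files with CR
    alone, so any CR suggests the script was not saved in UNIX format.
    """
    return b"\r" in data


def to_unix_line_endings(data: bytes) -> bytes:
    """Rewrite CR LF and lone CR line endings as LF."""
    # Replace CR LF first so that a pair does not turn into two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    # Hypothetical script content with DOS line endings.
    script = b"#!/bin/csh\r\necho hello\r\n"
    print(has_dos_or_mac_line_endings(script))
    print(to_unix_line_endings(script))
```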
What error messages would I get if jobs fail because of a system problem?
The following are some common error messages you might get when the system has a problem or has crashed. Please send email to consult@ncsa.uiuc.edu with the standard error and output files from the batch job that failed.
(1) PBS Job Id: jobid.co-master
Job Name: jobname
Aborted by PBS Server
Job cannot be executed
See job standard error file
(2) PBS Job Id: jobid.co-master
Job Name: jobname
Aborted by PBS Server
Job cannot be executed
See Administrator for help
PBS Job Id: jobid.co-master
Job Name: jobname
Post job file processing error; job jobid.co-master on host
co-compute1:ssinodes=16:mem=63979520k$
Unlink of stage out file /var/spool/pbs/spool/jobid.co-master.OU failed
(3) PBS Job Id: jobid.co-master
Job Name: jobname
Post job file processing error; job jobid.co-master on host
co-compute1:ssinodes=16:mem=63979520kb:ncpus=32
Unable to copy file jobid.co-master.OU to co-login1.ncsa.uiuc.edu
>>> error from copy
/bin/cp: cannot create regular file `standard error output file name`:
No such file or directory
>>> end error output
Output retained on that host in:
/var/spool/pbs/undelivered/jobid.co-master.OU
(4) Post job file processing error; job jobid.co-master on host co-compute1:ssinodes=1:ncpus=2:mem=3998720kb
Unable to copy file jobid.co-master.OU to co-login1:/dev/null
>>> error from copy
co-login1.ncsa.uiuc.edu: Connection refused
>>> end error output
Output retained on that host in:
/var/spool/pbs/undelivered/jobid.co-master.OU