At NCSA, we run the same job scheduling package on all of our batch systems.
Jobs are scheduled first-in, first-out [FIFO], but with some important
modifications to make sure that all jobs and users get a fair chance to run.
Some of the scenarios you may observe are explained here.
Q: Why are many machines or resources currently idle?
A: This is normal in the period before a large cpu-count job starts. The
scheduler has to let resources sit idle until enough are free for that job.
The large job will probably start very soon, and the idle resources will
then be in use once again.
Q: User X submitted a job like mine after I submitted my job, and their job
has already started. Why?
A: Our scheduler "remembers" how many jobs each user has run recently. If
User X has run few jobs recently and you have run more, User X gets higher
priority so that they can catch up and everyone gets a fair share of the
system. The scheduler will also backfill smaller and shorter jobs to
utilize any idle resources.
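
The fair-share idea described above can be illustrated with a small sketch.
This is not the actual scheduler code: the Job fields, the fair_share_order
function, and the usage counts are hypothetical names and values chosen
only to show how recent usage can reorder a FIFO queue.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str        # hypothetical job identifier
    user: str          # owner of the job
    submit_order: int  # lower value means the job was submitted earlier


def fair_share_order(queue, recent_usage):
    """Order jobs so users with less recent usage run first.

    Ties between users with equal usage fall back to FIFO order.
    This is an illustrative assumption, not the real priority formula.
    """
    return sorted(
        queue,
        key=lambda job: (recent_usage.get(job.user, 0), job.submit_order),
    )


# You submitted first, but you have run more jobs recently than User X.
queue = [Job("101", "you", 1), Job("102", "userX", 2)]
recent_usage = {"you": 12, "userX": 1}

ordered = fair_share_order(queue, recent_usage)
print([job.job_id for job in ordered])  # ['102', '101']
```

In this sketch User X's job moves ahead of yours even though it was
submitted later, which matches the behavior described in the answer.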
Q: How can I get a test, debug, or any other job to start sooner? I really
need to try something out now!
A: The debug queue is available for jobs that use a small number of cpus.
In addition, because the scheduler has to let resources sit idle before
starting any large [many cpu] job, you can try to take advantage of that
situation by submitting a job that requests a short walltime.
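
Why a short walltime helps can be sketched as a simple fit check. This is
only an illustration under simplifying assumptions: the function
can_backfill and its parameters are hypothetical, and the real scheduler
may weigh other factors.

```python
def can_backfill(job_cpus, job_walltime_min, idle_cpus, minutes_until_large_job):
    """Return True if a job fits into the idle gap before a large job starts.

    Assumption: a job can be backfilled when it needs no more cpus than
    are currently idle and its requested walltime ends before the large
    job is expected to start.
    """
    fits_cpus = job_cpus <= idle_cpus
    fits_time = job_walltime_min <= minutes_until_large_job
    return fits_cpus and fits_time


# 64 cpus are idle and the large job is expected to start in 45 minutes.
print(can_backfill(8, 30, 64, 45))   # True: small and short enough
print(can_backfill(8, 120, 64, 45))  # False: requested walltime is too long
```

A job that requests less walltime than it could otherwise get is more
likely to pass a check like this one and start in the idle gap.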
Q: I don't understand why my job hasn't started. How can anyone get work
done when the queues are so busy?
A: See the output of the qs command. It shows the queued jobs along with
the time they have been waiting in the queue. You can compare your jobs
with other jobs that have waited a similar time and see that other users'
experience is similar to yours.
That's because the memory limit needs to be specified as an integer.