- Overview
- Interactive Use
- Queues
- Scheduling Policies
- Batch Commands
- qsub
- qsub -I
- qstat
- qhist
- qs
- qdel
- qps
- qhosts
- Sample Batch Scripts
- Disk Space for Batch Jobs
- Automated Saving of Files from Batch Jobs
1. Overview
The NCSA SGI Altix UV uses the Altair Portable Batch System (PBS) Pro together with the Moab Workload Manager to run jobs. To keep every job within its system's memory (for best performance) and to improve system uptime, a memory specification is required for all batch jobs and is enforced.
2. Interactive Use
The access node ember.ncsa.illinois.edu is available for interactive use.
User limits (for all active login sessions) are as follows:
- a maximum of 4 processes per job
- 4 GB of memory per process
- 30 minutes of CPU time per process
Jobs exceeding these limits will be terminated.
In general, interactive use should be limited to compiling and other development tasks (such as editing source code and debugging) and to limited staging of files. Use the batch system for all other jobs.
See the section on
qsub -I
for instructions on how to run an interactive job on the compute nodes.
3. Queues
The following queues are currently available for users:
| Queue | Wall Clock Limit | Max #Cores | Max Memory |
|---|---|---|---|
| debug | 30 minutes | 24 | 127 GB |
| normal | 100 hours | 192 | 1000 GB |
| long (3/16/11) | 200 hours | 192 | 1000 GB |
| dedicated (1) (2) | 50 hours | 378 | 2000 GB |
(1) This queue is not scheduled automatically. To run jobs in it, send email to consult@ncsa.illinois.edu.
(2) Jobs in the dedicated queue are charged for all cores on the host, regardless of how many cores the job uses.
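The queue limits above can be checked mechanically. The sketch below returns the first queue, in table order, whose wall clock, core, and memory limits all cover a request. The helper names (to_seconds, pick_queue) are hypothetical and not site-provided tools, and the sketch considers only the three limits in the table.

```shell
#!/bin/bash
# Sketch: pick the first Ember queue whose table limits cover a request.
# Limits are copied from the queue table; helper names are hypothetical.

# Convert hh:mm:ss to seconds (10# avoids octal parsing of "08").
to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Usage: pick_queue WALLTIME NCPUS MEM_GB
pick_queue() {
  local wall ncpus=$2 mem=$3
  wall=$(to_seconds "$1")
  if (( wall <= 30*60 && ncpus <= 24 && mem <= 127 )); then
    echo debug
  elif (( wall <= 100*3600 && ncpus <= 192 && mem <= 1000 )); then
    echo normal
  elif (( wall <= 200*3600 && ncpus <= 192 && mem <= 1000 )); then
    echo long
  elif (( wall <= 50*3600 && ncpus <= 378 && mem <= 2000 )); then
    echo "dedicated (by request only)"
  else
    echo "no queue fits" >&2
    return 1
  fi
}

pick_queue 00:30:00 6 20      # prints: debug
pick_queue 150:00:00 96 500   # prints: long
```

Note that a request fitting the debug limits is not automatically a debugging job; debug is intended for testing and fast turnaround.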
4. Scheduling Policies
The scheduling policy on Ember is set to favor large-memory or large-core-count jobs. Jobs on Ember may not span more than one SMP compute host, and timeshared jobs may not be larger than 192 cores (or the equivalent amount of memory). Larger jobs are serviced by the dedicated queue, which runs jobs only by special request.
For timeshared jobs, the scheduler allocates cpusets (cores and their associated memory) to jobs so that different jobs do not share resources.
The smallest cpuset on Ember is 6 cores with around 30 GB of usable memory.
Each job is allocated enough cpusets to cover both its core request and its memory request, so allocations come in multiples of 6 cores.
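The allocation rule above can be sketched as arithmetic: round the core request and the memory request each up to whole cpusets, take the larger count, and multiply by 6. This assumes exactly 30 GB per cpuset (the guide says "around 30GB"), so treat the results as estimates; cpuset_cores is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: estimate the cores a timeshared job is allocated on Ember.
# Assumption: each cpuset is exactly 6 cores + 30 GB of memory.

# Usage: cpuset_cores NCPUS MEM_GB
cpuset_cores() {
  local ncpus=$1 mem_gb=$2
  local by_cores=$(( (ncpus + 5) / 6 ))    # cpusets needed for the cores
  local by_mem=$(( (mem_gb + 29) / 30 ))   # cpusets needed for the memory
  local sets=$(( by_cores > by_mem ? by_cores : by_mem ))
  echo $(( sets * 6 ))
}

cpuset_cores 6 20    # prints: 6  (one cpuset covers both)
cpuset_cores 1 100   # prints: 24 (the memory needs 4 cpusets)
cpuset_cores 12 20   # prints: 12
```

The second example shows why an inflated memory request is costly: a 1-core job asking for 100 GB ties up (and, per the qsub notes, is charged for) 24 cores.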
As with other HPC systems at NCSA, the scheduling policy includes fair-share.
Under this policy, a job's priority may be increased or decreased depending on other jobs that the user's project is running or has recently run. In short, to give everyone a fair opportunity to run jobs, a user's job gets a higher priority if users in the same project have not run jobs in the recent past.
Fair-share also factors in the ratio of the service units allocated to the user's project to the time remaining before the allocation expires.
To maximize utilization, the scheduler also back-fills jobs. When it gathers large blocks of cores for a large job, there are often "holes": cores that sit idle, waiting to be added to the pool needed to start the large waiting job.
The scheduler tries to increase system utilization and job throughput by opportunistically scheduling smaller, shorter jobs into these holes, which would otherwise leave resources idle.
It is in every user's best interest to make job wall time requests as accurate as possible. If everyone defaults to the maximum time allowed in the queue, there will be no opportunistic back-fill at all. Even if only a significant fraction of the submitted jobs default to the maximum wall time, back-fill tends to fragment the available resources because the predicted holes are inaccurate; as a result, most jobs end up waiting longer to start.
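The effect of wall time requests on back-fill can be illustrated with a simplified model: a waiting job fits in a hole only if it needs no more than the idle cores and its requested wall time ends before the large job's reserved start. This is an illustration of the idea, not Moab's actual algorithm; can_backfill and to_seconds are hypothetical helpers.

```shell
#!/bin/bash
# Simplified back-fill test (illustration only, not Moab's algorithm).

# Convert hh:mm:ss to seconds.
to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Usage: can_backfill REQ_WALLTIME REQ_CORES IDLE_CORES SECONDS_UNTIL_RESERVATION
# Succeeds (exit status 0) if the job fits in the hole.
can_backfill() {
  local wall
  wall=$(to_seconds "$1")
  (( $2 <= $3 && wall <= $4 ))
}

# A hole of 48 idle cores for the next 2 hours (7200 s):
can_backfill 01:30:00 24 48 7200 && echo "1.5 h request: back-filled"
can_backfill 100:00:00 24 48 7200 || echo "100 h request: must wait"
```

The same job, if it really needs only 1.5 hours but requests the 100-hour queue maximum, cannot use the hole.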
When computing a job's priority relative to other jobs, the scheduler takes several factors into account. Some of these factors include:
- job size (how many cores)
- job expansion factor (the ratio of the time the job has spent eligible to run to the wall time the job has requested)
- the raw amount of time the job has spent eligible to run
- fair-share factors
A relative weighting of these factors contributes to a job's priority.
A debug queue is available to facilitate fast turnaround on
debugging/testing jobs. Jobs in this queue have an intrinsically
higher priority; additionally, they accrue priority
at a much higher rate because the expansion factor (and its associated
priority factor) increases very quickly.
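To see why debug jobs gain priority quickly, the sketch below computes the expansion factor as this guide defines it (time spent eligible divided by requested wall time) for two jobs that have each been eligible for one hour. It illustrates the definition only; the scheduler's exact formula and weighting may differ, and expansion_factor is a hypothetical helper.

```shell
#!/bin/bash
# Expansion factor as defined in the priority list above:
# time spent eligible to run / requested wall time.

# Usage: expansion_factor ELIGIBLE_SECONDS REQUESTED_SECONDS
expansion_factor() {
  awk -v e="$1" -v r="$2" 'BEGIN { printf "%.2f\n", e / r }'
}

# Both jobs have been eligible for 1 hour (3600 s):
expansion_factor 3600 1800     # 30-minute debug job: prints 2.00
expansion_factor 3600 360000   # 100-hour normal job: prints 0.01
```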
To keep jobs from the long queue from dominating the system and forcing shorter jobs to wait behind them, there is a limit on the number of cores that long-queue jobs may occupy at any one time.
Because the job load fluctuates, this limit is adjusted periodically, but in general it is kept between 1/4 and 1/3 of the available cores.
When that limit is reached, subsequent jobs in the long queue may go into a blocked state until running jobs finish and free up resources.
The blocked jobs are then released automatically and scheduled to run.
5. Batch Commands
Below are brief descriptions of the most useful batch commands.
For more detailed information, refer to the individual man pages or the PBS Users' Guide.
5.1. qsub
The qsub command is used to submit a batch job to a queue.
All options to qsub can be specified either on the command line or as #PBS lines in the script (known as embedded options). Command-line options take precedence over embedded options.
Scripts are submitted with:
qsub [list of qsub options] script_name
The main qsub options are listed below. The sample batch scripts illustrate qsub usage and options; also see the qsub man page for other options.
- -l resource_list: specifies resource limits.
The resource_list argument is of the form:
resource_name[=[value]][,resource_name[=[value]],...]
The required resource_names are:
walltime: maximum wall clock time (hh:mm:ss) [default: 10 mins]
ncpus: the number of processors (cores) to use.
mem: the total memory required for the job (all processors).
Example:
#PBS -l walltime=00:30:00 -l ncpus=6 -l mem=20gb
It is important to provide an accurate estimate of the memory requirement because of the way the batch system allocates memory and processors (see the notes below).
Notes:
- For timeshared jobs, the scheduler allocates cpusets (cores and
associated memory) to jobs so different jobs don't share resources. The
Altix UV systems have 6 cores per socket, with
two sockets per node board. The minimum number of cores allocated to a
job is 6. Charging will be based on the resources required to
accommodate the core and memory specifications of the job.
- The memory specification is enforced: your job must run within the memory it requests, and jobs that exceed their memory request will be terminated.
- -q queue_name: specifies the queue name [default: normal].
- -N jobname: specifies the job name.
- -o out_file: stores the standard output of the job in the file out_file [default: <jobname>.o<PBS_JOBID>].
- -j oe: merges standard output and standard error into the standard output file.
- -k oe: places the standard output and standard error files in your $HOME directory. The filenames will be of the form <jobname>.o<PBS_JOBID> and <jobname>.e<PBS_JOBID>, respectively. If this option is used in conjunction with -j oe, standard output and standard error are combined into the standard output file. The -k option overrides the -o option.
- -V: exports all your environment variables to the batch job.
- -m be: sends mail at the beginning and end of the job.
- -A psn: charges your job to a specific project (PSN); for users on more than one PSN.
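The options above can be combined into a job script as embedded #PBS lines. The following is a minimal sketch, not one of the site's sample scripts: the job name, resource values, and the commented-out program name are hypothetical placeholders, and OMP_NUM_THREADS assumes an OpenMP program. PBS sets PBS_O_WORKDIR (the directory qsub was run from) and PBS_JOBID when the job runs; the fallbacks only let the script be tried by hand.

```shell
#!/bin/bash
#PBS -N testjob
#PBS -q normal
#PBS -l walltime=01:00:00,ncpus=6,mem=20gb
#PBS -j oe
#PBS -m be

# Start in the directory the job was submitted from (set by PBS);
# fall back to the current directory when run outside PBS.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# Assumption: an OpenMP program using one thread per requested core.
export OMP_NUM_THREADS=6

echo "Job ${PBS_JOBID:-not-under-PBS} running in $(pwd) with $OMP_NUM_THREADS threads"
# ./my_program   # hypothetical executable: replace with your own
```

Submit it with qsub script_name; an option given on the qsub command line overrides the matching #PBS line.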
5.1.1 qsub -I
The -I option tells qsub that you want to run an interactive job. You can also use other qsub options, such as those documented in the sample batch scripts (/usr/local/doc/pbs/samples/).
For example, the following command:
qsub -I -V -l walltime=00:30:00,ncpus=6,mem=20gb -q debug
will run an interactive job in the debug queue with a wall clock limit of 30 minutes, using 6 cores and 20 gigabytes of memory.
After you enter the command, you will have to wait for PBS to start the job. As with any job, your interactive job will wait in the queue until the resources are available. For jobs needing 30 minutes of wall time or less, specify the debug queue for higher priority. Once the job starts, you will see something like this:
This job will be charged to account: ags (TG-STA060003)
qsub: waiting for job 38989.ember to start
qsub: job 38989.ember ready
----------------------------------------
!Begin PBS Prologue Wed Mar 2 10:24:32 2011
Job ID: 38989
Username: sjohn
Group: aaa
Creating Batch Directory 38989 in /scratch/batch/
----------------------------------------
set_SCR: using existing PBS job directory /scratch/batch/38989
[sjohn@ember-cmp2 ~]$
When you are done with your interactive commands, you can use the exit command to end
the job:
[sjohn@ember-cmp2 ~]$ exit
logout
qsub: job 38989.ember completed
You will be charged for the wall time used until you end the job.
Note: For running applications that require an X display on a
compute host in an interactive batch job,
please see Using VNC on Ember.
5.2. qstat
The
qstat command displays the status of PBS batch jobs.
- qstat -a gives the status of all jobs on the system.
- qstat -n lists nodes allocated to a running job in
addition to basic information.
- qstat -f PBS_JOBID gives detailed information
on a particular job.
- qstat -q provides summary information on all the queues.
See the man page for other options available.
5.3. qhist
The qhist command summarizes the raw accounting record(s) for one or more jobs. See the output of "qhist --help" for details.
NOTE: SU charges for a job are available the day after the job completes.
To display information about a specific job, the syntax is qhist PBS_JOBID.
$ qhist 238
Scanning PBS raw accounting records: 07/14/2010 - 09/27/2010
Compute Host: ember-cmp3
JobId: 238
JobName: testjob
User: arnoldg
TG acct (project): -local proj- (aaa)
Queue: normal
Job limits:
wall clock: 10:00:00
Requested CPUs: 120
Requested Memory: 500mb
Queued: 09/24/10 13:58
Started: 09/24/10 13:59
Ended: 09/25/10 00:01
Usage:
wall clock: 10:02:03
cputime: 00:01:26
SUs: 60.20
memory: 19.32M
qhist can also produce tables of information from the PBS raw accounting records. For example, to create a table of your jobs that started between September 24, 2010 and September 25, 2010, run the following command:
$ qhist -S 9/24/10,9/25/10
Scanning PBS raw accounting records: 09/22/2010 - 09/27/2010
JobId JobName NCPU Stat StartDate EndDate SUs
---------------------------------------------------------------------------
220 D 1 E 09/24/10 12:28 09/24/10 12:29 0.12
221 D 1 E 09/24/10 12:30 09/24/10 13:19 4.88
222 D 12 E 09/24/10 13:20 09/24/10 13:39 1.83
225 testjob 64 E 09/24/10 13:41 09/24/10 13:41 0.01
226 testjob 64 E 09/24/10 13:43 09/24/10 13:43 0.01
227 testjob 64 E 09/24/10 13:45 09/24/10 13:45 0.01
229 asIF! 32 E 09/24/10 13:47 09/24/10 13:50 0.28
230 asIF! 32 E 09/24/10 13:51 09/24/10 13:51 0.02
238 testjob 120 E 09/24/10 13:59 09/25/10 00:01 60.20
239 testjob 120 E 09/24/10 14:01 09/25/10 00:03 60.20
240 testjob 120 E 09/24/10 14:01 09/25/10 00:03 60.20
241 testjob 64 E 09/25/10 00:11 09/25/10 00:11 0.01
---------------------------------------------------------------------------
Total # jobs = 12
Total # SUs = 187.77
Notes:
- Depending on the search criteria, qhist may search through records for a couple of days on either side of the date range you specify in order to collect more information about the job(s).
- The Stat column displays the last known status of the job:
- Q - queued
- H - queued and in hold state
- D - deleted (delete record found, but end record not found)
- S - started running (start record found)
- E - ended, can be either normally or via deletion (end record found)
- A - aborted by Torque server, for example due to user being over quota or failed job dependencies (-W depend).
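Because the qhist -S table has one job per row with the SUs in the last column, it is easy to post-process. The sketch below totals SUs per job name. It assumes the exact 9-column row layout shown above (job names without spaces), and su_by_jobname is a hypothetical helper, not a qhist option. With the three sample rows it prints D 5.00 and testjob 60.20.

```shell
#!/bin/bash
# Sum SUs per job name from "qhist -S" table output.
# Assumes rows like: JobId JobName NCPU Stat StartDate StartTime EndDate EndTime SUs
su_by_jobname() {
  # Keep only job rows (numeric JobId, 9 fields); header, rules, totals are skipped.
  awk '$1 ~ /^[0-9]+$/ && NF == 9 { su[$2] += $9 }
       END { for (name in su) printf "%s %.2f\n", name, su[name] }' |
    LC_ALL=C sort
}

# On Ember: qhist -S 9/24/10,9/25/10 | su_by_jobname
su_by_jobname <<'EOF'
220 D 1 E 09/24/10 12:28 09/24/10 12:29 0.12
221 D 1 E 09/24/10 12:30 09/24/10 13:19 4.88
238 testjob 120 E 09/24/10 13:59 09/25/10 00:01 60.20
EOF
```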
5.4 qs
The qs command displays a detailed table of the status of PBS batch jobs and can be used as an alternative to the qstat command.
Information such as job dependencies, queue wait time, elapsed time, and execution host for running jobs can be viewed at a glance.
qs
will display all queued and running jobs.
See qs -h for options.
5.5 qdel
The qdel command deletes a queued job or kills a running job.
The syntax is qdel PBS_JOBID.
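The PBS_JOBID arguments used by qstat, qhist, and qdel can be captured at submission time, since qsub prints the new job's identifier (for example 38989.ember in the interactive transcript in 5.1.1). In the sketch below, jobid_number is a hypothetical helper that strips the server suffix for commands given the bare number (as in qhist 238), and myjob.pbs is a placeholder script name; the commented commands are meant for Ember and are not run here.

```shell
#!/bin/bash
# Strip the ".server" suffix from a PBS job identifier.
jobid_number() {
  echo "${1%%.*}"
}

# On Ember (not executed here):
#   jobid=$(qsub myjob.pbs)          # e.g. 38989.ember
#   qstat -f "$jobid"                # detailed status
#   qdel "$jobid"                    # delete or kill the job
#   qhist "$(jobid_number "$jobid")" # accounting summary, after the job ends

jobid_number 38989.ember   # prints: 38989
```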
5.6 qps
The qps command prints ps-style information for processes* running on the Ember compute hosts.
[* For threaded processes, only the first (OpenMP or other) thread is shown.]
qps
will display the process information for all your processes on the Ember compute hosts.
qps -j JOBID
will display the process information for a particular job.
See qps -h for other options.
5.7 qhosts
The qhosts command summarizes PBS information for hosts and provides counts
of claimed CPU and MEM for each host on Ember. The load average, memory used, and uptime data for each host are also provided.
6. Sample Batch Scripts
Sample batch scripts are available in the directory
/usr/local/doc/batch_scripts for use as a template.
7. Disk Space for Batch Jobs
Scratch space for batch jobs is provided via a per-job scratch directory that is created at the beginning of the job. This directory is created under /scratch/batch and is named after the job ID. If the batch script uses one of the sample scripts as a template, the path of this scratch directory is available to the job script in the $SCR environment variable.
Your job scratch directory may be deleted soon (possibly immediately) after your job completes, so take care to transfer results to the mass storage system (see the section Automated Saving of Files from Batch Jobs).
The cdjob command
can be used to change the working directory to the scratch directory of a
running batch job.
The syntax is
cdjob PBS_JOBID
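Because $SCR may be removed as soon as the job ends, a job script typically works in the scratch directory and copies results out before it exits. A minimal sketch follows. The output file name and the results directory (a results subdirectory of the submission directory) are hypothetical choices, and the mktemp fallback exists only so the fragment can be tried outside a batch job; saving to the mass storage system is handled by the saveafterjob utility described in section 8.

```shell
#!/bin/bash
# Per-job scratch directory: $SCR is set when the job script follows the
# sample templates. The mktemp fallback is only for trying this outside PBS.
SCR="${SCR:-$(mktemp -d)}"
# Where to copy results (hypothetical choice: <submit dir>/results).
RESULTS_DIR="${PBS_O_WORKDIR:-$PWD}/results"

cd "$SCR" || exit 1
# Stand-in for the real computation; writes a hypothetical output file.
echo "result data" > output.dat

# Copy results out before the job ends: $SCR may be deleted right after.
mkdir -p "$RESULTS_DIR"
cp output.dat "$RESULTS_DIR/"
```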
8. Automated Saving of Files from Batch Jobs
The saveafterjob utility is available for
automated, guaranteed saving of output files from batch jobs to the mass
storage system.
For details on its use, see the saveafterjob
page and the sample PBS batch scripts.