Running Jobs on NCSA's SGI Altix

NCSA's Help Desk is available 24 hours a day, seven days a week, 365 days a year:
help.ncsa.illinois.edu
217-244-0710
help@ncsa.illinois.edu

  1. Overview
  2. Interactive Use
  3. Queues
  4. Batch Commands
    1. qsub
      1. qsub -I
    2. qstat
    3. qhist
    4. qhosts
    5. qpeek
    6. qps
    7. qdel
  5. Sample Batch Script
  6. Managing Batch Scripts
  7. Disk Space for Batch Jobs
  8. Automated Saving of Files from Batch Jobs

1. Overview

The NCSA SGI Altix uses the Altair Portable Batch System (PBS) Pro with the Moab Workload Manager for running jobs. To keep all jobs running within each system's memory (for best performance) and to improve system uptime, a memory specification is required for batch jobs and is enforced. See PBS memory enforcement and monitoring for details.

2. Interactive Use

The access node cobalt.ncsa.uiuc.edu is available for interactive use. User limits (for all active login sessions) are as follows:

  • a maximum of 4 processes per job
  • 4 Gbyte memory per process
  • CPU time of 30 mins per process

Jobs exceeding these limits will be terminated. In general, interactive use should be limited to compiling, other development tasks such as editing source and debugging, and limited staging of files. The batch system is available for all other jobs. See the section on qsub -I for instructions on how to run an interactive job on the compute nodes.

3. Queues

The following queues are currently available for users:

Queue          Wall Clock Limit   Max # Processors   Max Memory
----------------------------------------------------------------
debug(1)       30 mins            24                 256 Gbytes
standard       50 hours           256                1.5 Tbytes
long           200 hours          128                1.5 Tbytes
dedicated(2)   200 hours          504                3 Tbytes

(1) Jobs in the debug queue may run either on the login node or on the compute nodes. Jobs that run on the Cobalt login node do so in a way that does not conflict with the interactive login resources.
(2) Queue available by special request. Please send email to consult@ncsa.uiuc.edu to request access.

Note: For best performance, the machine is configured to reserve 8 processors for the operating system and other system processes. We therefore strongly recommend requesting no more than 504 processors for jobs in the dedicated queue.

Jobs in the dedicated queue will be charged for all CPUs on the host regardless of how many processors the job uses.
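The wall clock and processor limits listed above can be turned into a quick pre-submission check. The following is a minimal sketch, not an NCSA tool: choose_queue is a hypothetical helper name, it takes whole hours and a processor count, it ignores the memory limits, and it leaves out the debug queue.

```shell
#!/bin/sh
# choose_queue HOURS NCPUS: print the first queue whose wall clock and
# processor limits fit the request. Limits are copied from the queue list
# above; memory limits and the debug queue are not considered.
choose_queue() {
    if [ "$1" -le 50 ] && [ "$2" -le 256 ]; then
        echo standard
    elif [ "$1" -le 200 ] && [ "$2" -le 128 ]; then
        echo long
    elif [ "$1" -le 200 ] && [ "$2" -le 504 ]; then
        echo dedicated   # available by special request only
    else
        echo "no queue fits ${1}h on ${2} CPUs" >&2
        return 1
    fi
}

choose_queue 12 64     # prints standard
choose_queue 120 96    # prints long
choose_queue 120 400   # prints dedicated
```

A request that exceeds every limit, such as choose_queue 300 8, prints a message on standard error and returns a non-zero status.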

4. Batch Commands

Below are brief descriptions of the useful batch commands. For more detailed information, refer to the individual man pages or the PBS Users' Guide.

4.1. qsub

The qsub command is used to submit a batch job to a queue. All options to qsub can be specified either on the command line or as a line in a script (known as an embedded option). Command line options have precedence over embedded options. Scripts can be submitted using

qsub [list of qsub options] script_name

The main qsub options are listed below. The sample batch script illustrates qsub usage and options. Also see the qsub man page for other options.

  • -l resource_list: specifies resource limits. The resource_list argument is of the form:
    resource_name[=[value]][:resource_name[=[value]]:...]:resource
    

    The resource_names required are:

    walltime: maximum wall clock time (hh:mm:ss) [default: 10 mins]
    ncpus: the number of processors to use.
    mem: the total memory required for the job (all processors).

    It is important to provide an accurate estimate of the memory requirement because of the way the batch system allocates memory and processors.

    Note: The memory specification for your job will be enforced so your job must run within the requested memory. Jobs will be terminated if they exceed their memory request. See also: Checking memory use

    Example:
    #PBS -l walltime=00:30:00 -l ncpus=8 -l mem=16gb
    

    A resource named altix4700 is available to specify that jobs run on the SGI Altix 4700 (co-compute3) when desired. Add the following qsub option to your batch script:
    #PBS -l altix4700=true
    

  • -q queue_name: specifies the queue name. [default: standard]

  • -N jobname: specifies the job name.

  • -o out_file: store the standard output of the job to file out_file. [default: <jobname>.o<PBS_JOBID>]

  • -j oe: merge standard output and standard error into standard output file.

  • -k oe: place standard output and standard error files in your $HOME directory. The filenames will be of the form <jobname>.o<PBS_JOBID> and <jobname>.e<PBS_JOBID> respectively. If this option is used in conjunction with -j oe, standard output and standard error are combined into standard output file. The -k option overrides the -o option.

  • -V: export all your environment variables to the batch job.

  • -m be: send mail at the beginning and end of a job.

  • -A psn: charge your job to a specific project (PSN); intended for users on more than one PSN.
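As an illustration of embedded options, the sketch below writes a small batch script that uses only options described in this section. The file name myjob.pbs, the job name, and the echo line are placeholders chosen for the example; the sample scripts in /usr/local/doc/pbs/samples remain the recommended starting point.

```shell
#!/bin/sh
# Write a small PBS batch script to myjob.pbs (hypothetical file name).
# The quoted 'EOF' keeps $PBS_JOBID and $(hostname) unexpanded until
# the job itself runs.
cat > myjob.pbs <<'EOF'
#!/bin/sh
#PBS -l walltime=00:30:00 -l ncpus=8 -l mem=16gb
#PBS -q standard
#PBS -N myjob
#PBS -j oe
#PBS -V
#PBS -m be

echo "Job $PBS_JOBID started on $(hostname)"
EOF

# Submit using the embedded options:      qsub myjob.pbs
# Override the job name at submit time:   qsub -N testrun myjob.pbs
```

Because command line options take precedence, the second form runs the same script under the name testrun without editing the file.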

4.1.1 qsub -I

The -I option tells qsub you want to run an interactive job. You can also use other qsub options such as those documented in the batch sample scripts (/usr/local/doc/pbs/samples/). For example, the following command:

   qsub -I -V -l walltime=00:30:00,ncpus=8,mem=8gb

will run an interactive job with a wall clock limit of 30 minutes, using 8 processors and 8 gigabytes of memory.

After you enter the command, you will have to wait for PBS to start the job. As with any job, your interactive job will wait in the queue until the specified number of nodes is available. If you specify a small number of nodes, the wait will be shorter. Once the job starts, you will see something like this:

	qsub: waiting for job 1298.co-login1 to start
	qsub: job 1298.co-login1 ready
	 
	----------------------------------------
	!Begin PBS Prologue Thu Aug  5 12:45:53 CDT 2004
	Job ID:           1298
	Username:               arnoldg
	Group:            aau
	Creating Batch Directory 1298 in /scratch/batch
	----------------------------------------
	$

When you are done with your interactive commands, you can use the exit command to end the job:

	$ exit
	logout
 
	qsub: job 1298.co-login1 completed

You will be charged for the CPU time used by all requested nodes until you end the job.

4.2. qstat

The qstat command displays the status of PBS batch jobs.
  • qstat -a gives the status of all jobs on the system.
  • qstat -n lists nodes allocated to a running job in addition to basic information.
  • qstat -f PBS_JOBID gives detailed information on a particular job.
  • qstat -q provides summary information on all the queues.

See the man page for other options available.
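qstat can also be used from a script to wait for a job to leave the system. The following is a minimal sketch under one assumption that this page does not state: that qstat exits with a non-zero status once the job is no longer known to the server. wait_for_job is a hypothetical helper name.

```shell
#!/bin/sh
# wait_for_job JOBID [INTERVAL]: poll qstat every INTERVAL seconds
# (default 60) until it no longer reports the job.
# Assumption: qstat returns non-zero for a job it does not know about.
wait_for_job() {
    while qstat "$1" >/dev/null 2>&1; do
        sleep "${2:-60}"
    done
    echo "job $1 is no longer known to qstat"
}
```

A call such as wait_for_job 1298.co-login1 120 checks every two minutes. Keep the interval generous so that polling does not load the batch server.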

4.3. qhist

The qhist command summarizes the raw accounting record(s) for one or more jobs. See the output of "qhist --help" for details.
NOTE: SU charges for a job are available the day after the job completes.

To display information about a specific job, the syntax is qhist PBS_JOBID.

$ qhist 428975

Compute Host:       co-compute2
JobId:              428975
JobName:            STDIN
User:               jsiwek
TG acct (project):  -local proj- (aau)
Queue:              standard

Job limits:
  wall clock:       01:00:00    
  Requested CPUs:   16       
  Requested Memory: 88gb     

Queued:             11/06/08 13:32
Started:            11/06/08 13:32
Ended:              11/06/08 14:14

Usage:
  wall clock:       00:41:59    
     cputime:       00:52:37    
         SUs:       11.20       
      memory:       255.59M     
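
In the sample output above, the SUs value is consistent with the requested CPUs multiplied by the wall clock time in hours (16 CPUs for 00:41:59). This page does not state the charging formula, so the sketch below should be read as an observation about this one job, not as the accounting rule. cpu_hours is a hypothetical helper name.

```shell
#!/bin/sh
# cpu_hours NCPUS HH:MM:SS: CPUs multiplied by wall clock time in hours,
# rounded to two decimals.
cpu_hours() {
    awk -v n="$1" -v t="$2" 'BEGIN {
        split(t, p, ":")
        printf "%.2f\n", n * (p[1] + p[2] / 60 + p[3] / 3600)
    }'
}

cpu_hours 16 00:41:59   # prints 11.20
```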


qhist can also produce tables of information from the PBS raw accounting records. For example, to create a table for your jobs that started between November 5, 2008 and November 10, 2008, run the following command:
$ qhist -S 11/5/08,11/10/08

Scanning PBS raw accounting records: 11/03/2008 - 11/12/2008


  JobId  JobName       NCPU  Stat  StartDate       EndDate              SUs  
---------------------------------------------------------------------------
 428032  STDIN            1     E  11/05/08 09:51  11/05/08 09:52      0.04
 428365  STDIN            4     E  11/05/08 15:50  11/05/08 15:50      0.02
 428975  STDIN           16     E  11/06/08 13:32  11/06/08 14:14     11.20
---------------------------------------------------------------------------
Total # jobs = 3
Total # SUs  = 11.26

Notes:

  • Depending upon the search criteria, qhist may search through records for a couple days on both ends of the date range you specify in order to collect more information about the job(s).
  • The Stat column displays the last known status of the job:
    • Q - queued
    • H - queued and in hold state
    • D - deleted (delete record found, but end record not found)
    • S - started running (start record found)
    • E - ended, can be either normally or via deletion (end record found)
    • A - aborted by Torque server, for example due to user being over quota or failed job dependencies (-W depend).

4.4 qhosts

The qhosts command summarizes PBS information for hosts and provides counts of claimed CPU and MEM for each host in the Cobalt array. Load Average, memory used, and uptime data are also provided. Example:


[consult@co-login1 ~/bin]$ qhosts
 
---------PBS job data-----------|-----Performance Co-Pilot measurments-----
                                | [..load average....]
       HOST  JOBS  CPUS MEM[gb] | 1 min  5 min  15 min   MEM   UPTIME
----------------------------------------------------------------------------
 co-compute1    3   148     790 |    133    131    143   733   10 days 01:07
 co-compute2    3   132     612 |    106    112    117   991   04 days 03:15
   co-login1                    |      0      1      1   178   10 days 01:16
     co-viz1                    |      0      0      0     2   14 days 16:34
     co-viz2                    |      0      0      0     2   10 days 01:24
     co-viz3                    |      0      0      0     0   00 days 01:14
     co-viz4                    |      0      0      0     2   04 days 04:12
     co-viz5                    |      0      0      0     2   10 days 01:24
     co-viz6                    |      0      0      0     2   10 days 01:21
     co-viz7                    |      0      0      0     2   08 days 03:19
     co-viz8                    |      0      0      0     2   08 days 05:53

4.5 qpeek

The qpeek command (currently in beta) displays the standard output and standard error of an unfinished job.

     qpeek JobID
will display the stdout/stderr for a specific job.

See qpeek -h for other options.

4.6 qps

The qps command (currently in beta) prints ps-style information for processes running on the Cobalt array. For threaded processes, only the first OpenMP thread or other thread is shown.

     qps
will display the process information for all your processes on the Cobalt array.
     qps -j JOBID
will display the process information for a particular job.

See qps -h for other options.

4.7 qdel

The qdel command deletes a queued job or kills a running job. The syntax is qdel PBS_JOBID.

5. Sample Batch Script

Sample batch scripts are available in the directory /usr/local/doc/pbs/samples for use as a template.

The sample batch scripts use UniTree for permanent storage of files. They assume that the executable and any input files are already on UniTree. If that's not true in your case or if you have problems with UniTree within batch jobs, see this FAQ.

6. Managing Batch Scripts

A program named find_batch_scripts will help you locate batch scripts on the system, should you forget their location.

7. Disk Space for Batch Jobs

Scratch space for batch jobs is provided via a per-job scratch directory that is created at the beginning of the job. This directory is created under /scratch/batch, and is based on the JobID. If the batch script uses one of the sample scripts as a template, the name of this scratch directory is available to job scripts with the $SCR environment variable.

Your job scratch directory may be deleted soon after your job completes, possibly immediately, so you should take care to transfer results to the mass storage system (see the section Automated Saving of Files from Batch Jobs).

The cdjob command can be used to change the working directory to the scratch directory of a running batch job. The syntax is

cdjob PBS_JOBID
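
The sample prologue output in the qsub -I section ("Creating Batch Directory 1298 in /scratch/batch" for job 1298.co-login1) suggests that the scratch directory is named after the numeric part of the job ID. The sketch below builds the path on that assumption, which this page does not state outright. job_scratch_dir is a hypothetical helper name; where $SCR or cdjob is available, prefer those.

```shell
#!/bin/sh
# job_scratch_dir JOBID: print the presumed per-job scratch directory.
# Assumption: the directory name is the job ID with the ".host" suffix
# removed, as in the sample prologue output.
job_scratch_dir() {
    echo "/scratch/batch/${1%%.*}"
}

job_scratch_dir 1298.co-login1   # prints /scratch/batch/1298
```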

8. Automated Saving of Files from Batch Jobs

The saveafterjob utility is available for automated, guaranteed saving of output files from batch jobs to the mass storage system. For details on its use, see the saveafterjob page and the sample PBS batch scripts.
