- Overview
- Running Programs
- serial
- OpenMP
- mpi
- Interactive Access
- Classes [Queues] and the max. number of running jobs
- Disk space for batch jobs
- Workload Management commands
- llsubmit
- llq
- llcancel
- llsummary
- llhist
- llhosts
- Sample LoadLeveler Scripts
- Managing Batch Scripts
- Automated Saving of Files from Batch Jobs
- Notes
- References
1. Overview
The NCSA IBM pSeries 690 uses the IBM LoadLeveler workload management system with the Moab scheduler.
2. Running Programs
2.1 serial
To run a serial program, just enter the executable name and any necessary arguments at the shell prompt:
% $HOME/c/hello_world
hi from c
hi from fortran
2.2 OpenMP
To run an OpenMP program, set the environment variable OMP_NUM_THREADS
to the desired number of threads, then enter the executable name and any
necessary arguments at the shell prompt as with a serial program.
% cc_r -qsmp=omp -o test_openmp test.c
% setenv OMP_NUM_THREADS 2
% ./test_openmp
omp_get_dynamic=1
omp_get_num_procs=16
omp_get_max_threads=2
Threads allocated : 2
%
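The transcript above uses csh syntax (setenv). As a minimal sketch for Bourne-style shells (sh, ksh, bash), which a batch script may use instead, the equivalent setting looks like this. The program name test_openmp is taken from the example above; the commented-out run line only works where that executable exists.

```shell
# Bourne-shell equivalent of "setenv OMP_NUM_THREADS 2"
OMP_NUM_THREADS=2
export OMP_NUM_THREADS

# Confirm the value the OpenMP runtime will see
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"

# Then run the program as before:
# ./test_openmp
```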
2.3 mpi
The Parallel Operating Environment is used for running MPI programs.
Instead of using mpirun to run MPI programs, use the poe
command. List the poe options after the program options and any
file redirections. For example, to run a program on 4 processors:
poe a.out < myin > myout -procs 4
See "poe -h" for the short help. There are many environment variables and options that can be used with poe.
When debugging MPI programs, consider using the following poe options:
- -infolevel 1
- poe displays errors and warnings
- -labelio yes
- output from the parallel tasks is labeled by task ID
- -euidevelop yes
- the message passing interface performs more detailed
checking during execution. This additional checking is intended for
developing applications and can significantly slow performance.
Here is an example of the poe command with various debug options:
% poe allall 100 100 2000 -procs 2 -labelio yes -infolevel 1 -euidevelop yes
0:Node 0 Complete...
0:Host Cu12 to Host Cu12(1): 1642.407474MB/sec
1:Node 1 Complete...
1:Host Cu12 to Host Cu12(0): 1623.271113MB/sec
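If you use these debug options often, a small shell function can assemble the command line for you. The following is only a sketch: poe_debug_cmd is a hypothetical helper name, not part of POE, and it prints the command rather than running it so you can check it first.

```shell
# Hypothetical helper (not part of POE): print a poe command line with the
# debug options described above.
# Usage: poe_debug_cmd NPROCS program [program arguments...]
poe_debug_cmd() {
    nprocs=$1
    shift
    # poe options go after the program and its arguments
    echo "poe $* -procs $nprocs -labelio yes -infolevel 1 -euidevelop yes"
}

poe_debug_cmd 2 allall 100 100 2000
# prints: poe allall 100 100 2000 -procs 2 -labelio yes -infolevel 1 -euidevelop yes
```

Remember to drop -euidevelop yes for production runs, since the extra checking can slow performance significantly.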
3. Interactive Access
The machine cu.ncsa.uiuc.edu is available for interactive
access. It has 32 processors and 64 gigabytes of memory available to
support interactive users.
User limits (for all active login sessions) are as follows:
- a total of 4 processes
- 1 Gbyte memory per process
- CPU time of 30 mins per process
Jobs exceeding the above policy are terminated.
In general, interactive use should be limited to compiling and other
development tasks, such as editing source and debugging.
For users with multiple projects, use the newgrp command to change
projects for charging purposes when working interactively. See the
newgrp manual page for more information [man newgrp].
4. Classes [Queues] and the max. number of running jobs.
The following classes are currently available for users:
class       wall_clock_limit   max processors per job   max memory per job
---------------------------------------------------------------------------
debug       00:30:00           4                        128GB
batch       100:00:00          16                       128GB
cap (*)     600:00:00          16                       128GB
dedicated   100:00:00          32                       256GB
(*) The cap queue is open to all users as of February 2007.
As of March 2006, there are no formal limits on the number of running jobs you may have. A "fair share" policy and queue wait times are evaluated to determine the mix of running jobs.
Note: The NCSA IBM p690 does not support jobs across multiple hosts.
5. Disk space for batch jobs
The IBM p690 system creates a scratch directory for each running batch
job. The job directory is created for you when LoadLeveler starts your job
and is accessible within the batch script using the $SCR
environment variable. See the sample scripts in /usr/local/doc/ll/
for examples that use $SCR.
You can use the interactive system (cu12) to view your data
while your job is running. The cdjob command can be used to change the
working directory to the scratch directory of a running batch job. The syntax is:
cdjob jobid
Your job scratch directory may be deleted soon
[possibly immediately] after your job completes, so
you should take care to transfer results to the mass storage system (see
the section Automated Saving of Files from Batch Jobs).
Please contact the NCSA Consulting Office (consult@ncsa.uiuc.edu) if you plan to use more than 100 gigabytes of disk space in a single job.
6. Workload Management commands [LoadLeveler / batch]
You could read the entire LoadLeveler book, but that is not as exciting as it sounds to you right now, and you may be eager to get some science done today. Here you'll find the basic commands to help you work with the batch environment.
6.1 llsubmit
This command submits a batch script to the LoadLeveler environment. With LoadLeveler, a batch script may contain multiple job steps, and each step may run independently on different batch machines. The LoadLeveler scripts can be as complex as many programs. A couple of sample batch scripts are provided.
Example usage:
Cu12:~101% llsubmit ll.job
llsubmit: The job "cu12.ncsa.uiuc.edu.1211" has been submitted.
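Commands such as llcancel, llhist, and cdjob take the numeric job ID, so it can be handy to pull it out of the llsubmit message. The following is a sketch that assumes the message format is exactly as shown above; the message is stored in a variable here so the snippet can be tried without LoadLeveler.

```shell
# Sample message copied from the llsubmit example above.
# On the p690 you might capture it with: msg=$(llsubmit ll.job)
msg='llsubmit: The job "cu12.ncsa.uiuc.edu.1211" has been submitted.'

# Extract the quoted job name, then keep only the part after the last dot
jobname=$(printf '%s\n' "$msg" | sed -n 's/.*"\(.*\)".*/\1/p')
jobnum=${jobname##*.}

echo "$jobname $jobnum"
# prints: cu12.ncsa.uiuc.edu.1211 1211
```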
Below are some common LoadLeveler directives you can use in your LoadLeveler
scripts. See the
IBM LoadLeveler directives documentation for
details and other directives available. Note that LoadLeveler does not
support specification of directives on the command line.
- shell
- Specifies the name of the shell to use. If not specified, the shell listed
in the owner's password file entry is used. The syntax is:
#@ shell = name
- job_type
- Specifies whether the job is serial or parallel. The default is
serial.
The syntax is:
#@ job_type = string
For example, to specify an MPI or OpenMP job:
#@ job_type = parallel
- environment
- Specifies your initial environment variables in your job. Separate environment specifications with semicolons. An environment specification may be one of the following:
COPY_ALL specifies that all the environment variables from your shell be copied.
$var specifies that the environment variable var be copied into the environment of your job when LoadLeveler starts it.
!var specifies that the environment variable var not be copied into the environment of your job when LoadLeveler starts it. This is most useful in conjunction with COPY_ALL.
var=value specifies that the environment variable var be set to the value value and copied into the environment of your job when LoadLeveler starts it.
The syntax is:
#@ environment = env1 ; env2 ; ...
For example:
#@ environment = COPY_ALL; !env2;
- notification
- Specifies when the LoadLeveler system sends mail to you. The syntax is:
#@ notification = always|error|start|never|complete
where:
always
notify the user when the job begins, ends, or if it incurs error conditions.
error
notify the user only if the job fails.
start
notify the user only when the job begins.
never
never notify the user.
complete
notify the user only when the job ends. This is the default.
For example, if you want to be notified with mail only when your job step completes, your notification keyword would be:
#@ notification = complete
- class
- Specifies the name of a job class (default: batch)
The syntax is:
#@ class = name
For example, to submit jobs to a class called batch, your class keyword
would look like the following:
#@ class = batch
A LoadLeveler class is similar to a queue in other batch systems.
- account_no
- Specifies the account name string for the job [for charging to projects].
The syntax is:
#@ account_no = abc
- tasks_per_node
- Specifies the number of tasks of an MPI parallel program you want to run on each node.
For OpenMP, threaded, or serial programs, you can take the default (1) and omit this directive.
The value of the tasks_per_node keyword applies only to the job step in
which you specify the keyword. (That is, this keyword is not inherited by other
job steps.) The syntax is:
#@ tasks_per_node = number
Where number is the number of tasks or processes you want to run per node. The default is
one task per node.
- resources
- Specifies quantities of the consumable resources "consumed" by each task or process in the job step.
For OpenMP or threaded programs, set ConsumableCpus(N), where N is the number of threads you plan to use (note: you will still need to set OMP_NUM_THREADS for OpenMP programs). For OpenMP and serial programs, ConsumableMemory is the total memory your program can use. For MPI programs, ConsumableMemory is the total memory each MPI task or process can use.
The syntax is:
#@ resources = name(count) name(count) ... name(count)
Here is an example for an OpenMP program using 4 threads and a total of 1 gigabyte of memory:
#@ resources = ConsumableCpus(4) ConsumableMemory(1 gb)
This is an example for an MPI program that will require 500 megabytes per process or task:
#@ resources = ConsumableCpus(1) ConsumableMemory(500 mb)
- wall_clock_limit
- Sets the limit for the elapsed time for which a job can run. In computing the elapsed time for a job, LoadLeveler considers the start time to be the time the job is dispatched. The default value is 30 minutes (00:30:00).
The syntax is:
#@ wall_clock_limit = limit
An example is:
#@ wall_clock_limit = 5:00
- job_name
- Specifies the name of the job.
The syntax is:
#@ job_name = job_name
- output
- Specifies the name of the file to use as standard output (stdout) when your job step runs. If not specified, the file /dev/null is used [the output will be discarded]. The syntax is:
#@ output = filename
For example:
#@ output = out.$(jobid)
- error
- Specifies the name of the file to use as standard error (stderr) when your job step runs. If you do not specify this keyword, the file /dev/null is used [the standard error stream will be discarded]. The syntax is:
#@ error = filename
For example:
#@ error = $(jobid).$(stepid).err
- queue
- Places one copy of the job step in the queue. This statement is required. The queue statement marks the end of a job step. Note that you can specify statements between queue statements. The syntax is:
#@ queue
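To show how the directives above fit together, here is a sketch that writes a small job command file for a 4-task MPI run in the debug class. Several details are assumptions for illustration and not site-confirmed settings: the /bin/ksh shell path, the job name mpi_test, the account string abc, the program and input paths, and the idea that poe takes its task count from tasks_per_node under LoadLeveler (so -procs is omitted). The sample scripts in /usr/local/doc/ll/ are the authoritative starting point.

```shell
# Write a sample job command file. The quoted 'EOF' keeps $(jobid), $SCR
# and $HOME literal so they are expanded by LoadLeveler / the job, not now.
cat > ll.job <<'EOF'
#!/bin/ksh
#@ shell = /bin/ksh
#@ job_type = parallel
#@ class = debug
#@ account_no = abc
#@ job_name = mpi_test
#@ tasks_per_node = 4
#@ resources = ConsumableCpus(1) ConsumableMemory(500 mb)
#@ wall_clock_limit = 00:30:00
#@ notification = error
#@ output = $(jobid).$(stepid).out
#@ error = $(jobid).$(stepid).err
#@ queue

# Work in the per-job scratch directory created by LoadLeveler.
cd $SCR

# Assumption: the task count comes from tasks_per_node, so no -procs here.
poe $HOME/a.out < $HOME/myin > myout

# Copy myout somewhere permanent before the job ends; the scratch
# directory may be deleted as soon as the job completes.
EOF

# Count the directive lines as a quick sanity check
grep -c '^#@' ll.job

# Submit on the p690 with:
# llsubmit ll.job
```

The request stays within the debug class limits listed earlier (4 processors, 30 minutes), and the queue statement comes last among the directives so that all preceding keywords apply to the job step.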
6.2 llq
To view the current queue of job steps, run llq:
% llq
Id Owner Submitted ST PRI Class Running On
------------------------ ---------- ----------- -- --- ------------ -----------
cu12.6.0 arnoldg 12/23 14:52 R 50 batch cu10
cu12.7.0 arnoldg 12/23 14:52 ST 50 batch cu08
2 job step(s) in queue, 0 waiting, 1 pending, 1 running, 0 held, 0 preempted
The values of the ST (state) field can be:
C Completed
CA Canceled
CK Checkpointing
CP Complete Pending
D Deferred
E Preempted
EP Preempt Pending
H User Hold
HS User Hold and System Hold
I Idle
MP Resume Pending
NR Not Run
NQ Not Queued
P Pending
R Running
RM Removed
RP Remove Pending
S System Hold
ST Starting
SX Submission Error
TX Terminated
V Vacated
VP Vacate Pending
X Rejected
XP Reject Pending
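When many job steps are queued, a per-state tally can be easier to read than the full listing. The sketch below counts job steps by ST code using awk. It assumes the default llq column layout shown above (state in the fifth whitespace-separated field); llq_sample and count_states are hypothetical helper names, and the sample text stands in for real llq output so the snippet runs anywhere.

```shell
# Sample llq output copied from the example above
llq_sample() {
cat <<'EOF'
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
cu12.6.0                 arnoldg    12/23 14:52 R  50  batch        cu10
cu12.7.0                 arnoldg    12/23 14:52 ST 50  batch        cu08

2 job step(s) in queue, 0 waiting, 1 pending, 1 running, 0 held, 0 preempted
EOF
}

# Count job steps per state. Only lines whose first field looks like a
# job step id (ending in .N.N) are counted; the header, separator and
# summary lines are skipped.
count_states() {
    awk 'NR > 2 && NF >= 7 && $1 ~ /\.[0-9]+\.[0-9]+$/ { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

llq_sample | count_states
# prints:
# R 1
# ST 1
```

On the p690 itself, the same function could be fed from the real command, for example: llq | count_states.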
Here's a list of some helpful flags you can use with llq:
- -x
- Provides extended information about a selected job.
- -s
- Provides information on why a selected list of jobs remain in
the Hold, NotQueued, Idle or Deferred state.
Example:
% llq -s 1169
...
==================== EVALUATIONS FOR JOB STEP cu12.ncsa.uiuc.edu.1169.0 ====
The class of this job step is "batch".
Total number of available initiators of this class on all machines in
the cluster: 0
Minimum number of initiators of this class required by job step: 16
The number of available initiators of this class is not sufficient for
this job step.
- -l
- Specifies that a long listing be generated for each job for
which status is requested.
- -w
- Provides AIX WLM CPU and real memory statistics
for running jobs only. This option only accepts a single hostname
and a single step id when used in conjunction with the -h flag.
The following statistics are displayed for every node the job is
running on:
- Current CPU resource consumption as a percentage of the total
resources available
- Total CPU time consumed in milliseconds
- Current real memory consumption as a percentage of the total
resources available
- The highest number of resident memory pages used
Example:
Cu12:% llq -w
=============== Job Step cu12.ncsa.uiuc.edu.1691.0 ===============
cu06.ncsa.uiuc.edu:
Resource: CPU
snapshot: 100
total: 152681218
Resource: Real Memory
snapshot: 10
high water: 5990157
- -u userlist
- Is a blank-delimited list of users. When used with -h option,
only the user's jobs monitored on the machines in the hostlist
are queried. When used alone, only the user's jobs monitored
on the schedd host are queried.
- -h hostlist
- Is a blank-delimited list of machines. If the -s flag is not
specified, all jobs monitored on the machines in the hostlist
are queried. If the -s flag is specified, the list of machines
is considered when determining why a job remains in Idle state.
When used with -u option, the userlist is used to further select
jobs for querying.
- -c classlist
- Is a blank-delimited list of classes. If -s option is specified,
this option is ignored. When used with -h option, only the classes
specified on the machines in the hostlist are queried. When
used alone, only the classes specified on the schedd host
are queried.
6.3 llcancel
Should you want to cancel a job, there's llcancel to the rescue:
The syntax is
llcancel JobID
% llcancel 1713
llcancel: Cancel command has been sent to the central manager.
You can cancel all of your jobs quickly with the -u flag:
-u userlist
Is a blank-delimited list of users. When used with
the -h option, only the user's jobs monitored on the
machines in the hostlist are canceled. When used alone, only
the user's jobs monitored by the machine issuing the command
are canceled.
6.4 llsummary
llsummary will show information about jobs that have completed.
Cu12:~207% llsummary -l -j cu12.2411 | head -19
================== Job cu12.ncsa.uiuc.edu 2411 ==================
Job Id: cu12.ncsa.uiuc.edu 2411
Job Name: cu12.ncsa.uiuc.edu.2411
Structure Version: 210
Owner: arnoldg
Unix Group: aau
Submitting Host: cu12.ncsa.uiuc.edu
Submitting Userid: 25114
Submitting Groupid: 1023
Number of Steps: 3
------------------ Step cu12.ncsa.uiuc.edu.2411.0 ------------------
Job Step Id: cu12.ncsa.uiuc.edu.2411.0
Step Name: 0
Queue Date: Thu Jan 30 11:03:44 CST 2003
Dependency:
Status: Removed
Dispatch Time: Thu Jan 30 11:03:55 CST 2003
Start Time: Thu Jan 30 11:03:55 CST 2003
Completion Date: Thu Jan 30 11:04:40 CST 2003
6.5 llhist
The llhist command provides resource usage information for currently running
and completed LoadLeveler batch jobs. The syntax is llhist JOBID.
Example:
Cu12:~102% llhist 17420
--------------------------------------------------------
IBM pSeries 690 Batch Job Summary
--------------------------------------------------------
Job Id : 17420
Job Name : mumps1M_symm_32
User : skoric
--------------------------------------------------------
STEP 0
--------------------------------------------------------
Job Status : Completed ...
Submitted : Tue Apr 1 21:27:52 CST 2003
Started : Tue Apr 1 21:49:38 CST 2003
Finished : Tue Apr 1 22:05:17 CST 2003
Host : cu01.ncsa.uiuc.edu
Project : acr
Class : dedicated
Usage:
Cpu Time : 07:38:24 [hh:mm:ss]
Run Time : 00:15:39
Peak Task Memory : 3.73 GB
Service Units : 8.35
Limits:
Wall Clock Limit : 00:45:00 [hh:mm:ss]
Number of CPU's : 32
Memory : 125.00 GB
See the man page for details of the output.
6.6 llhosts
The llhosts command provides machine utilization information.
The syntax is llhosts.
Example:
Cu12:% llhosts
host jobs Gb_free Startd 00 25 50 75 100 % load
_________________________________________________________________________
cu01 0 226 Idle |
cu02 3 217 Running |------------------------------
cu03 8 182 Running |-------------------------------*
cu04 3 47 Running |-------------------------------***
cu05 13 49 Running |-----------------------
cu06 0 219 Idle |
cu07 3 46 Running |----------------
cu08 0 6 Idle |
cu09 7 34 Running |-----------------------
cu10 5 53 Running |--------------------------
cu11 1 35 Running |
cu12 0 18 Idle |----------
See the man page for details of the output.
7. Sample LoadLeveler Scripts
See the samples in /usr/local/doc/ll/. You can copy one that's similar to what you want to do and customize it for your requirements.
Some advanced LoadLeveler scripts are also being developed.
The sample batch scripts use UniTree
for permanent storage of files. They assume that the executable and any
input files are already on UniTree. If that's not true in your case
or if you have problems with UniTree within batch jobs, see this FAQ.
8. Managing Batch Scripts
There is a program named find_batch_scripts that will help you locate batch scripts on the system [should you forget their location].
9. Automated Saving of Files from Batch Jobs
The saveafterjob utility on the NCSA IBM p690 is available for
automated, guaranteed saving of output files from batch jobs to the mass
storage system.
For details on its use, see the saveafterjob
page and the sample LoadLeveler scripts.
10. Notes
The standard output and standard error [from "output =" and "error =" in your
LoadLeveler scripts] will be placed in the directory you were in at the time
you submitted the script with llsubmit. If you're working in a shared filesystem [an NFS mount or a GPFS filesystem], you can watch the output spool in real time by using the tail command:
% tail -f cu12.33.0.out
0:
0:Running on 16 PEs
0:
0:sampling from 2^0 to 2^24 bytes
0:
0:Effective Bandwidth: 3760.54 [MB/sec]
0:
0:
0:***********************************************************
0:
If you submit a job from a local filesystem that is not shared [local scratch, /tmp, or /var/tmp], your standard output and error files will be in the same directory on the execution machine for that job step. If that's the case, take care to copy them to mass storage or a shared filesystem at the end of the job step. Otherwise they'll be stranded on the execution machine and you will not be able to see them.
11. References
- Submitting and managing LoadLeveler jobs
- "Diagnosis and Messages Guide"
- IBM Parallel Environment for AIX (PE)
- Parallel Environment diagnostic and error messages
- POE Hitchhiker's guide