Running Multiple Jobs in the Dedicated Queues on the NCSA Origin2000
- Introduction
- Placement Files
- Batch script
Users whose codes do not scale to the number of processors available
in the dedicated queues (128 or 256) can still use the dedicated queues
to run multiple jobs at once. For best performance, place each job with the
dplace tool.
dplace executes a specified program,
initializing the processes and memory of that program on the nodes that
you specify.
This eliminates the potential for poor performance resulting from
multiple threads executing on the same processor.
As with a single dedicated job, do not use all the processors on the system
if you want the best performance. Leave one or two processors free
for use by the operating system and other system processes.
Note that for benchmarking runs, SGI recommends running jobs one at
a time rather than using dplace to run multiple jobs at a time.
The following two man pages give information that you will need to know
about dplace:
- man 1 dplace documents the command-line arguments
- man 5 dplace documents the syntax of the placement file you
use to specify program placement.
The section "Non-MP Library Programs and Dplace" in Chapter 8 of the SGI manual
"Origin2000 and Onyx2 Performance Tuning and Optimization Guide" also has
detailed information on dplace.
To place data according to the file placement_file for the executable
a.out that would normally be run by:
a.out
you would use:
dplace -place placement_file a.out
For MPI jobs that would normally be run by:
mpirun -np n a.out
you would use:
setenv MPI_DSM_OFF
mpirun -np n dplace -place placement_file a.out
Each Origin node contains two CPUs, and each module contains four nodes.
The 128-processor systems have modules numbered
1 2 3 4 5 6 7 8 11 12 13 14 15 16 17 18
while the 256-processor systems (as of Aug 20, 1999) have
modules numbered
1 2 3 4 5 6 7 8 11 12 13 14 15 16 17 18
21 22 23 24 25 26 27 28 31 32 33 34 35 36 37 38
Note: Prior to Aug 20, 1999, the modules on the 256-processor
systems were numbered:
1 2 3 4 5 6 7 8 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 33 34 35 36 37 38 39 40
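As a quick check of the numbering scheme, the current module lists and the processor counts they imply can be reproduced with a short Python sketch. The function names `module_numbers` and `cpu_count` are hypothetical helpers introduced here for illustration, and the sketch covers only the numbering in effect as of Aug 20, 1999.

```python
NODES_PER_MODULE = 4  # each module contains four nodes
CPUS_PER_NODE = 2     # each node contains two CPUs


def module_numbers(total_cpus):
    """Return the module numbers of a 128- or 256-processor system.

    Uses the numbering in effect as of Aug 20, 1999: modules 1-8 in
    each group of ten (1-8, 11-18, 21-28, 31-38).
    """
    if total_cpus == 128:
        bases = (0, 10)
    elif total_cpus == 256:
        bases = (0, 10, 20, 30)
    else:
        raise ValueError("only the 128- and 256-processor systems are described")
    return [base + i for base in bases for i in range(1, 9)]


def cpu_count(modules):
    """Number of CPUs provided by a list of modules."""
    return len(modules) * NODES_PER_MODULE * CPUS_PER_NODE


if __name__ == "__main__":
    for size in (128, 256):
        mods = module_numbers(size)
        print(size, mods, cpu_count(mods))
```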
The dplace command can be used
to specify physical nodes for each job that will run simultaneously.
For example, to run four 31-processor jobs in the 128-processor dedicated
queues, each job can be given four modules (32 CPUs) as follows:
JOB    MODULES
1      1  2  3  4
2      5  6  7  8
3      11 12 13 14
4      15 16 17 18
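The assignment in the table can be derived mechanically: split the module list into consecutive groups of four, then count the CPUs in each group. The Python sketch below illustrates this. The name `assign_modules` is hypothetical, and leaving exactly one CPU free per group is one reading of the 31-processor example, not a fixed rule.

```python
CPUS_PER_MODULE = 4 * 2  # four nodes per module, two CPUs per node


def assign_modules(modules, modules_per_job):
    """Split a list of module numbers into consecutive groups, one per job."""
    if len(modules) % modules_per_job != 0:
        raise ValueError("modules do not divide evenly among the jobs")
    return [modules[i:i + modules_per_job]
            for i in range(0, len(modules), modules_per_job)]


# Module numbers of the 128-processor systems.
modules_128 = [1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 18]
jobs = assign_modules(modules_128, 4)

if __name__ == "__main__":
    for job, mods in enumerate(jobs, start=1):
        cpus = len(mods) * CPUS_PER_MODULE
        # 32 CPUs per group; running 31 threads leaves one CPU free.
        print(f"job {job}: modules {mods}, {cpus} CPUs, {cpus - 1} threads")
```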
The placement file for the first job would be:
- Native shared memory parallel jobs:
memories 16 in topology physical near \
/hw/module/1/slot/n1/node \
/hw/module/1/slot/n2/node \
/hw/module/1/slot/n3/node \
/hw/module/1/slot/n4/node \
/hw/module/2/slot/n1/node \
/hw/module/2/slot/n2/node \
/hw/module/2/slot/n3/node \
/hw/module/2/slot/n4/node \
/hw/module/3/slot/n1/node \
/hw/module/3/slot/n2/node \
/hw/module/3/slot/n3/node \
/hw/module/3/slot/n4/node \
/hw/module/4/slot/n1/node \
/hw/module/4/slot/n2/node \
/hw/module/4/slot/n3/node \
/hw/module/4/slot/n4/node
threads 31
distribute threads 0:30 across memories
- MPI jobs:
memories 16 in topology physical near \
/hw/module/1/slot/n1/node \
/hw/module/1/slot/n2/node \
/hw/module/1/slot/n3/node \
/hw/module/1/slot/n4/node \
/hw/module/2/slot/n1/node \
/hw/module/2/slot/n2/node \
/hw/module/2/slot/n3/node \
/hw/module/2/slot/n4/node \
/hw/module/3/slot/n1/node \
/hw/module/3/slot/n2/node \
/hw/module/3/slot/n3/node \
/hw/module/3/slot/n4/node \
/hw/module/4/slot/n1/node \
/hw/module/4/slot/n2/node \
/hw/module/4/slot/n3/node \
/hw/module/4/slot/n4/node
threads 32
distribute threads 0:30 across memories
Similar placement files would be used for the other 3 jobs by specifying
the other modules. Note that MPI jobs have an extra mpirun process, so
the number of threads specified is 32 (31 + 1).
Note 1: Use of the above dplace construct in the timeshared queues is not
recommended because interactions with other user jobs running on the system
can cause poor performance.
Note 2: Use of the above dplace construct is not appropriate for codes
that use POSIX threads.
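Because the four placement files differ only in their module numbers, they can be generated rather than written by hand. The following Python sketch reproduces the two layouts shown above. The function name `placement_file` and its parameters are hypothetical, and the output should be checked against man 5 dplace before use.

```python
def placement_file(modules, nprocs, mpi=False):
    """Build placement-file text in the layout used in the examples above.

    modules -- module numbers assigned to this job
    nprocs  -- number of compute threads or MPI processes (31 above)
    mpi     -- if True, add one thread for the extra mpirun process
    """
    # Four node slots (n1..n4) per module.
    nodes = [f"/hw/module/{m}/slot/n{s}/node"
             for m in modules for s in range(1, 5)]
    lines = [f"memories {len(nodes)} in topology physical near \\"]
    lines += [f"{node} \\" for node in nodes[:-1]]
    lines.append(nodes[-1])  # last node path has no continuation backslash
    threads = nprocs + 1 if mpi else nprocs
    lines.append(f"threads {threads}")
    lines.append(f"distribute threads 0:{nprocs - 1} across memories")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placement file for the second job (modules 5-8), shared memory case.
    print(placement_file([5, 6, 7, 8], 31), end="")
```

Writing the returned text to placement_file1 through placement_file4, one per module group, gives files for the batch commands that follow.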
In a batch job, the commands would be:
setenv MP_SET_NUMTHREADS 31
dplace -place placement_file1 a.out &
dplace -place placement_file2 a.out &
dplace -place placement_file3 a.out &
dplace -place placement_file4 a.out &
wait
For MPI jobs:
setenv MPI_DSM_OFF
mpirun -np 31 dplace -place placement_file1 a.out &
mpirun -np 31 dplace -place placement_file2 a.out &
mpirun -np 31 dplace -place placement_file3 a.out &
mpirun -np 31 dplace -place placement_file4 a.out &
wait
Note that the wait command will wait for all the background subjobs to
complete before going on to the next command in the batch script
(for example, msscmd commands to save output files).
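When the number of subjobs varies, the command lines can be generated instead of typed. The Python sketch below builds the same csh command sequences as the two examples above. The function name `batch_commands` is hypothetical, and the placement file names follow the placement_file1, placement_file2, ... pattern used in those examples.

```python
def batch_commands(n_jobs, nprocs, exe="a.out", mpi=False):
    """Return the csh command lines that launch n_jobs placed subjobs."""
    if mpi:
        lines = ["setenv MPI_DSM_OFF"]
    else:
        lines = [f"setenv MP_SET_NUMTHREADS {nprocs}"]
    for job in range(1, n_jobs + 1):
        cmd = f"dplace -place placement_file{job} {exe} &"
        if mpi:
            cmd = f"mpirun -np {nprocs} {cmd}"
        lines.append(cmd)
    # Block until every background subjob has finished.
    lines.append("wait")
    return lines


if __name__ == "__main__":
    print("\n".join(batch_commands(4, 31)))
    print()
    print("\n".join(batch_commands(4, 31, mpi=True)))
```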
Ideally, all the subjobs will do approximately the same amount of
computation, and will execute in approximately the same amount of time. This
will minimize the wall-clock time (and thus the charges for the job)
spent waiting for one or more of the subjobs to complete.
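The cost of unbalanced subjobs is easy to estimate: the batch job lasts as long as its slowest subjob, and every faster subjob leaves its processors idle for the difference. The Python sketch below works through this arithmetic. The function names and the run times are hypothetical illustrations, not measurements.

```python
def wall_clock_hours(subjob_hours):
    """The batch job runs until its slowest subjob finishes."""
    return max(subjob_hours)


def idle_processor_hours(subjob_hours, procs_per_subjob):
    """Processor-hours spent idle while waiting for the slowest subjob."""
    wall = wall_clock_hours(subjob_hours)
    return sum((wall - hours) * procs_per_subjob for hours in subjob_hours)


if __name__ == "__main__":
    # Hypothetical run times: three subjobs take 2 hours, one takes 3.
    times = [2.0, 2.0, 2.0, 3.0]
    print(wall_clock_hours(times))          # 3.0
    print(idle_processor_hours(times, 31))  # 93.0
```

In this made-up case, three 31-processor subjobs each wait one hour, so 93 processor-hours go unused while the job is still being charged.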