Grid Universal Remote (GUR) Co-scheduler

Co-scheduling is the process of making and synchronizing reservations on computational resources at multiple sites. The GUR tool is a Python script that uses the ssh and scp commands to help users make reservations, compile programs, and co-schedule jobs. GUR is installed on the IA-64 clusters at NCSA (Mercury) and at SDSC, and can be invoked from the command line on either machine. Co-scheduling is expected to become available on resources at other sites in the near future, and a Web interface is in development. To access the software, add the +gur softenv key to your environment.

Note: MPICH-G2 is required to run a co-scheduled MPI job. Code should be compiled under the same MPICH-G2 environment at both sites specified in the co-schedule jobfile. To make the Intel-built MPICH-G2 available in your environment, use the softenv key "+mpich-g2-intel".

   Ex.    soft add +mpich-g2-intel

Paths and Policies

Policies at NCSA for co-scheduling are the same as for other reservations; see the reservation policies listed for each site in the table below.

Site   Path                          Policies                                     SoftEnv Command
NCSA   /usr/local/GUR/gur.py         NCSA reservation policies                    soft add +gur
SDSC   /usr/local/apps/gur/gur.py    Reservations policies in SDSC User Portal    -

GUR Workflow

  1. User runs grid-proxy-init to establish a grid credential
    grid-proxy-init
  2. User constructs an appropriate jobfile (See example jobfiles)
    vi jobfile
    or
    gur.py --dumpjobfile --output=metajob.script
  3. User runs gur, with jobfile as the input
    gur.py --reserve --jobfile=jobfile
    GUR returns the path to a file containing the reservation information
    GUR: metajob submitted:
    /<working directory path>/<username>/info/gur/test/gurdata/metajob.1190763126.7100041
  4. GUR makes reservations at remote clusters. GUR uses gsissh to invoke commands on remote machines.
  5. User runs jobs on remote clusters (See example rsl files)
    mpirun -globusrsl job.rsl
  6. User cancels the reservation, with the metajob file returned in step 3 as the input
    gur.py --cancel --metajobfile=/rmount/users01/sdsc/<username>/info/gur/test/gurdata/metajob.1190763126.7100041
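
The workflow above can be sketched as a small Python helper that only assembles the command lines shown in the steps; it does not run them. The function names are hypothetical, the path "/path/to/metajob" is a placeholder for whatever GUR reports in step 3, and only flags that appear in this document are used.

```python
import shlex


def reserve_command(jobfile):
    """Command line for step 3: request the reservations described by a jobfile."""
    return ["gur.py", "--reserve", f"--jobfile={jobfile}"]


def run_command(rsl_file):
    """Command line for step 5: launch the co-scheduled MPI job from an RSL file."""
    return ["mpirun", "-globusrsl", rsl_file]


def cancel_command(metajobfile):
    """Command line for step 6: cancel using the metajob file GUR reported."""
    return ["gur.py", "--cancel", f"--metajobfile={metajobfile}"]


def workflow(jobfile, rsl_file, metajobfile):
    """Return the commands in the order the workflow lists them."""
    return [
        ["grid-proxy-init"],           # step 1: establish the grid credential
        reserve_command(jobfile),      # step 3: make the reservations
        run_command(rsl_file),         # step 5: run the job
        cancel_command(metajobfile),   # step 6: release the reservations
    ]


if __name__ == "__main__":
    # "/path/to/metajob" is a placeholder, not a real GUR path.
    for argv in workflow("jobfile", "job.rsl", "/path/to/metajob"):
        print(shlex.join(argv))
```

Keeping each command as an argument list rather than a single string means it could be handed to subprocess.run without shell quoting concerns, should you choose to execute it.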

Example Jobfiles, by scenario

Lines ending with " \" are continued on the following line and should be typed as a single line.
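
As an illustration of this convention, the following Python sketch rejoins wrapped lines from the examples on this page. It assumes the pieces are concatenated with no space between them, which matches how the wrapped contact strings read; join_continuations is a hypothetical helper, not part of GUR.

```python
def join_continuations(text):
    """Merge each line that ends in a backslash with the line that follows it."""
    merged = []
    pending = None  # text accumulated from lines that ended with a backslash
    for raw in text.splitlines():
        line = raw.rstrip()
        if pending is not None:
            # Continued line: drop its layout indentation before appending.
            line = pending + line.lstrip()
            pending = None
        if line.endswith("\\"):
            # Drop the backslash and the space before it, then keep accumulating.
            pending = line[:-1].rstrip()
        else:
            merged.append(line)
    if pending is not None:
        merged.append(pending)  # the last line ended with a backslash
    return "\n".join(merged)


if __name__ == "__main__":
    sample = "\n".join([
        "machine_preference = tg-login1.sdsc.teragrid.org:2119 \\",
        "#slash#jobmanager-pbs_gcc_resid \\",
        "#ia64-compute,grid-hg.ncsa.teragrid.org \\",
        "#slash#jobmanager-pbs#fastcpu",
    ])
    print(join_continuations(sample))
```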

Scenario 1: 128 nodes over two systems, without regard to how they are distributed between the systems
[metajob]

# total nodes
total_nodes = 128

machine_preference = tg-login1.sdsc.teragrid.org:2119 \
#slash#jobmanager-pbs_gcc_resid \
#ia64-compute,grid-hg.ncsa.teragrid.org \
#slash#jobmanager-pbs#fastcpu

# allow re-ordering of the machine_preference list?
machine_preference_reorder = yes

# duration
duration = 3600

earliest_start = 11:30_06/07/2007

latest_end = 17:00_06/15/2007

# use single or multiple clusters ('single' or 'multiple')
usage_pattern = multiple

# Machine-specific info
machines_dict_string = {
  'tg-login1.sdsc.teragrid.org:2119#slash \
   #jobmanager-pbs_gcc_resid#ia64-compute'     : {
                     'username_string' : '',
                     'account_string' : 'TG-XYZ999999X',
                     'email_notify' : 'johndoe@sdsc.edu',
                     'min_int' : 1,
                     'max_int' : 128
                   },
  'grid-hg.ncsa.teragrid.org#slash \
     #jobmanager-pbs#fastcpu'    : {
                     'username_string' : '',
                     'account_string' : 'TG-XYZ999999X',
                     'email_notify' : 'johndoe@sdsc.edu',
                     'min_int' : 1,
                     'max_int' : 128
                   },
  } 
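
Before submitting a jobfile, it can be useful to confirm that the requested duration fits between earliest_start and latest_end. The sketch below makes two assumptions that this page does not state explicitly: that the timestamps use the form HH:MM_MM/DD/YYYY (inferred from values such as 17:00_06/15/2007) and that duration is given in seconds. The function name window_fits is hypothetical.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, inferred from values such as 11:30_06/07/2007.
TIME_FORMAT = "%H:%M_%m/%d/%Y"


def window_fits(earliest_start, latest_end, duration_seconds):
    """Return True if a run of duration_seconds can fit inside the window."""
    start = datetime.strptime(earliest_start, TIME_FORMAT)
    end = datetime.strptime(latest_end, TIME_FORMAT)
    # Assumption: duration is expressed in seconds.
    return start + timedelta(seconds=duration_seconds) <= end


if __name__ == "__main__":
    # Values taken from the Scenario 1 jobfile.
    print(window_fits("11:30_06/07/2007", "17:00_06/15/2007", 3600))
```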
Scenario 2: 256 nodes over two systems, 128 nodes each
[metajob]

# total nodes
total_nodes = 256

machine_preference = tg-login1.sdsc.teragrid.org:2119 \
#slash#jobmanager-pbs_gcc_resid \
#ia64-compute,grid-hg.ncsa.teragrid.org \
#slash#jobmanager-pbs#fastcpu

# allow re-ordering of the machine_preference list?
machine_preference_reorder = yes

# duration
duration = 3600

earliest_start = 11:30_06/07/2007

latest_end = 17:00_06/15/2007

# use single or multiple clusters ('single' or 'multiple')
usage_pattern = multiple

# Machine-specific info
machines_dict_string = {
  'tg-login1.sdsc.teragrid.org:2119 \
  #slash#jobmanager-pbs_gcc_resid#ia64-compute'   : {
                     'username_string' : '',
                     'account_string' : 'TG-XYZ999999X',
                     'email_notify' : 'johndoe@sdsc.edu',
                     'min_int' : 128,
                     'max_int' : 128
                   },
  'grid-hg.ncsa.teragrid.org#slash \
     #jobmanager-pbs#fastcpu'    : {
                     'username_string' : '',
                     'account_string' : 'TG-XYZ999999X',
                     'email_notify' : 'johndoe@sdsc.edu',
                     'min_int' : 128,
                     'max_int' : 128
                   },
  }
      
Scenario 3: 64 nodes over two systems; placing all nodes on one cluster is acceptable
[metajob]

# total nodes
total_nodes = 64

machine_preference = tg-login1.sdsc.teragrid.org:2119 \
#slash#jobmanager-pbs_gcc_resid \
#ia64-compute,grid-hg.ncsa.teragrid.org \
#slash#jobmanager-pbs#fastcpu

# allow re-ordering of the machine_preference list?
machine_preference_reorder = yes

# duration
duration = 3600

earliest_start = 11:30_06/07/2007

latest_end = 17:00_06/15/2007

# use single or multiple clusters ('single' or 'multiple')
usage_pattern = multiple

# Machine-specific info
machines_dict_string = {
  'tg-login1.sdsc.teragrid.org:2119 \
  #slash#jobmanager-pbs_gcc_resid#ia64-compute'   : {
                     'username_string' : '',
                     'account_string' : 'TG-XYZ999999X',
                     'email_notify' : 'johndoe@sdsc.edu',
                     'min_int' : 0,
                     'max_int' : 64
                   },
  'grid-hg.ncsa.teragrid.org#slash \
     #jobmanager-pbs#fastcpu'  : {
                     'username_string' : '',
                     'account_string' : 'TG-XYZ999999X',
                     'email_notify' : 'johndoe@sdsc.edu',
                     'min_int' : 0,
                     'max_int' : 64
                   },
  }
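
The three scenarios differ mainly in how min_int and max_int bound the number of nodes on each machine. A quick consistency check, assuming these values are per-machine lower and upper node counts (as the scenario titles suggest), is that total_nodes lies between the sum of the minimums and the sum of the maximums. The helper bounds_allow and the short keys "sdsc" and "ncsa" are illustrative stand-ins for the full machine keys.

```python
def bounds_allow(total_nodes, machines):
    """Return True if total_nodes can be split within each machine's bounds."""
    lowest = sum(m["min_int"] for m in machines.values())
    highest = sum(m["max_int"] for m in machines.values())
    return lowest <= total_nodes <= highest


if __name__ == "__main__":
    # Per-machine bounds copied from the three scenarios above.
    scenarios = {
        "Scenario 1": (128, {"sdsc": {"min_int": 1, "max_int": 128},
                             "ncsa": {"min_int": 1, "max_int": 128}}),
        "Scenario 2": (256, {"sdsc": {"min_int": 128, "max_int": 128},
                             "ncsa": {"min_int": 128, "max_int": 128}}),
        "Scenario 3": (64, {"sdsc": {"min_int": 0, "max_int": 64},
                            "ncsa": {"min_int": 0, "max_int": 64}}),
    }
    for name, (total, machines) in scenarios.items():
        print(name, bounds_allow(total, machines))
```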
      

Example job.rsl file

As in the jobfiles, lines ending with " \" should be typed as a single line.

+
(&
   (resourceManagerContact="grid-hg.ncsa.teragrid.org \
     /jobmanager-pbs")
   (count=2)
   (hostcount=1)
   (maxTime=10)
   (jobtype=mpi)
   (label="subjob 0")
   (environment=(TESTENV1 1)
                (GLOBUS_DUROC_SUBJOB_INDEX 0)
                (TESTENV2 2))
   (arguments= -t 10 -n 2 -l 10 -i 0.03125)
   (directory=/home/ncsa/kenneth/testprog)
   (executable=/home/ncsa/kenneth/testprog/ring26g2)
   (reservation_id=johndoe.1289)
)
(&
   (resourceManagerContact="tg-login1.sdsc.teragrid.org:2119 \
     /jobmanager-pbs_gcc_resid")
   (count=2)
   (hostcount=1)
   (maxTime=10)
   (jobtype=mpi)
   (label="subjob 1")
   (environment=(TESTENV1 1)
                (GLOBUS_DUROC_SUBJOB_INDEX 1)
                (TESTENV2 2))
   (arguments= -t 10 -n 2 -l 10 -i 0.03125)
   (directory=/users/kenneth/testprog)
   (executable=/users/kenneth/testprog/ring26g2)
   (reservation_id=1191274794)
)
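
Comparing the jobfile machine keys with the resourceManagerContact values above suggests that "#slash#" stands for "/" and that the final "#" field (for example "ia64-compute" or "fastcpu") is a separate site-specific qualifier. This page does not document that encoding, so the Python sketch below is an inference from the examples only; contact_from_key is a hypothetical helper.

```python
def contact_from_key(machine_key):
    """Split a jobfile machine key into (contact string, trailing qualifier)."""
    # Assumption: "#slash#" encodes the "/" of the Globus contact string.
    decoded = machine_key.replace("#slash#", "/")
    # Assumption: whatever follows the next "#" is a site-specific qualifier.
    contact, _, qualifier = decoded.partition("#")
    return contact, qualifier


if __name__ == "__main__":
    keys = [
        "tg-login1.sdsc.teragrid.org:2119#slash#jobmanager-pbs_gcc_resid#ia64-compute",
        "grid-hg.ncsa.teragrid.org#slash#jobmanager-pbs#fastcpu",
    ]
    for key in keys:
        print(contact_from_key(key))
```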