SGI Altix Running Jobs FAQ

NCSA's Help Desk is available 24 hours a day, seven days a week, 365 days a year:
help.ncsa.illinois.edu
217-244-0710
help@ncsa.illinois.edu

 

I have a job in the queue and it's not running yet--why?

    At NCSA, we run the same job scheduling package on all of our batch systems. Jobs are scheduled first-in, first-out (FIFO), with some important modifications to ensure that all jobs and users get a fair chance to run. Some of the scenarios you may observe are explained here.

    Q: Why are many machines or resources currently idle?

    A: The system always looks this way shortly before a large CPU-count job starts, because the scheduler has to let resources sit idle until enough are free for that job. The large job will probably start very soon, and the idle resources will then be in use again.

    Q: User X submitted a job like mine after I did, yet their job has already started. Why?

    A: Our scheduler "remembers" for a while how many jobs each user has run. If User X has run few jobs recently and you have run more, User X gets higher priority so that they can catch up and everyone gets a fair share of the system.

    The scheduler will also backfill smaller and shorter jobs to utilize any idle resources.

    Q: How can I get a test, debug, or any of my jobs to start sooner--I really need to try something out now!

    A: The debug queue is available for jobs that use few CPUs. In addition, because the scheduler has to let resources sit idle before starting any large (many-CPU) job, you can take advantage of that window by submitting a job that requests a short walltime.
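
    As a minimal sketch of such a request, the batch script below asks for the debug queue and a short walltime. The queue name debug, the 10-minute limit, the ncpus value, and the job name quicktest are illustrative assumptions, not documented settings; check the local queue limits before using them.

```shell
#!/bin/sh
# Illustrative values only: the queue name, walltime, ncpus count,
# and job name below are assumptions, not documented NCSA settings.
#PBS -q debug
#PBS -l walltime=00:10:00
#PBS -l ncpus=2
#PBS -N quicktest

# PBS sets PBS_O_WORKDIR to the directory qsub was run from;
# fall back to the current directory when run outside PBS.
cd "${PBS_O_WORKDIR:-.}" || exit 1
echo "quick test started"
```

    The same requests can be given on the qsub command line with the -q and -l options instead of #PBS lines.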

    Q: I don't understand why my job hasn't started. How can anyone get work done when the queues are so busy?

    A: See the output of the qs command. It shows the queued jobs along with how long each has been waiting. Comparing your jobs with others that have waited a similar time will show whether your experience is typical.

I have an account and can log in, but when I try to submit jobs I get a message saying I have no accounts that can be used for batch jobs on this system. Why?

    This is probably because your account is expired or overused. You can use the batch_accts command to check your accounts.

    You can also check details of your allocation and usage on the TeraGrid Portal or using the TeraGrid tgusage command.

I get the error message "qsub: illegal -N value" when I submit a PBS job. What does this mean?

    This error message occurs when the job name specified with the -N option is longer than 15 characters or when its first character is not alphabetic. From the qsub man page:

    -N name  Declares  a name for the job.  The name specified may be up to
              and including 15 characters in length.   It  must  consist  of
              printable, non white space characters with the first character
              alphabetic.
    
    If your batch script does not specify the -N option, the name of the batch script is used as the jobname, so the above limitations will apply for the batch script name in this case.
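
    The rule quoted from the man page can be checked before submitting. The function below is a minimal sketch; valid_pbs_jobname is a hypothetical helper name, not a PBS or NCSA command.

```shell
#!/bin/sh
# valid_pbs_jobname NAME
# Hypothetical helper: returns 0 if NAME satisfies the rule quoted from
# the qsub man page (1 to 15 characters, all printable and non-whitespace,
# first character alphabetic), and 1 otherwise.
valid_pbs_jobname() {
    # Length check: at least 1 and at most 15 characters.
    [ "${#1}" -ge 1 ] || return 1
    [ "${#1}" -le 15 ] || return 1
    case $1 in
        # First character is not alphabetic.
        [![:alpha:]]*) return 1 ;;
        # Some character is whitespace or not printable.
        *[![:graph:]]*) return 1 ;;
    esac
    return 0
}
```

    For example, valid_pbs_jobname run_2024_case_study_A fails because the name has more than 15 characters, and valid_pbs_jobname 2nd_run fails because it starts with a digit.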

I get a warning message "Warning: no access to tty (Bad file descriptor). Thus no job control in this shell." in my stdout file when I run a batch job.

    This message is harmless and can safely be ignored. It just means that there is no interactive shell access for this job.

How do I find out how much memory my program used?

Why do I get emailed about memory overuse?

What is NCSA's refund policy for lost jobs?

Why does the qsub command give an "Unknown resource type" error?

    You may get the following error message when you submit a batch job:

      error when submitting qsub:
      qsub: Unknown resource type Resource_List..2gb
    
    This happens because the memory limit must be specified as an integer. For example, if you want 1.2gb, request 1200mb instead.
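
    The conversion can be scripted. The sketch below uses a hypothetical helper, gb_to_mb, and assumes 1 gb = 1000 mb so that it matches the 1.2gb to 1200mb example above; adjust the factor if your site counts memory differently.

```shell
#!/bin/sh
# gb_to_mb SIZE
# Hypothetical helper: converts a size such as "1.2gb" into a whole
# number of mb. Assumes 1 gb = 1000 mb, following the example above.
gb_to_mb() {
    # Strip a trailing "gb", scale, and round to the nearest integer.
    awk -v size="${1%gb}" 'BEGIN { printf "%dmb\n", size * 1000 + 0.5 }'
}
```

    The result can then be passed to qsub, for example as -l mem=$(gb_to_mb 1.2gb).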

Why do I get an ssh-related error message from my batch job?

    If you use the ssh command in your batch script, you might get an error message like this:

      /usr/local/openssh-mechglue/bin/real/ssh: error while loading shared libraries: libkrb4.so.2: cannot open shared object file: No such file or directory

    This is because ssh is not supported on compute nodes.

My batch job fails with the error message "/var/spool/pbs/mom_priv/jobs/jobid.co-master.SC: Command not found." Why?

    Your batch script was probably copied from a Windows or Mac system and may contain hidden formatting characters. Run the dos2unix command to convert the file from DOS/Mac format to UNIX format.
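
    To confirm that hidden carriage returns are the cause, you can look for them before converting. The function below is a minimal sketch; has_dos_line_endings is a hypothetical helper name, and myjob.pbs in the note that follows is a placeholder file name.

```shell
#!/bin/sh
# has_dos_line_endings FILE
# Hypothetical helper: returns 0 if FILE contains at least one
# carriage return character, as DOS-format text files do.
has_dos_line_endings() {
    # printf '\r' produces a literal carriage return for grep to match.
    grep -q "$(printf '\r')" "$1"
}
```

    If has_dos_line_endings myjob.pbs succeeds, running dos2unix myjob.pbs converts the file in place.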

What error messages might I get if a job fails because of a system problem?

    These are some common error messages you might get when the system has a problem or has crashed. If you see one, please send email to consult@ncsa.uiuc.edu with the standard error and output files from the failed batch job.

    (1)  PBS Job Id: jobid.co-master
         Job Name:   jobname
         Aborted by PBS Server
         Job cannot be executed
         See job standard error file
    (2) PBS Job Id: jobid.co-master
         Job Name:   jobname
         Aborted by PBS Server 
         Job cannot be executed
         See Administrator for help
    
         PBS Job Id: jobid.co-master
         Job Name:   jobname
         Post job file processing error; job jobid.co-master on host 
         co-compute1:ssinodes=16:mem=63979520k$
    
         Unlink of stage out file /var/spool/pbs/spool/jobid.co-master.OU failed
    (3) PBS Job Id: jobid.co-master
         Job Name:   jobname
         Post job file processing error; job jobid.co-master on host
         co-compute1:ssinodes=16:mem=63979520kb:ncpus=32
         Unable to copy file jobid.co-master.OU to co-login1.ncsa.uiuc.edu
         >>> error from copy
         /bin/cp: cannot create regular file `standard error output file name`: 
         No such file or directory
         >>> end error output
         Output retained on that host in: 
         /var/spool/pbs/undelivered/jobid.co-master.OU
    
    (4)  Post job file processing error; job jobid.co-master on host co-compute1:ssinodes=1:ncpus=2:mem=3998720kb
         Unable to copy file jobid.co-master.OU to co-login1:/dev/null
         >>> error from copy
         co-login1.ncsa.uiuc.edu: Connection refused
         >>> end error output
         Output retained on that host in: 
         /var/spool/pbs/undelivered/jobid.co-master.OU
    
