SGI Altix UV Running Jobs FAQ

NCSA's Help Desk is available 24 hours a day, seven days a week, 365 days a year:
help.ncsa.illinois.edu
217-244-0710
help@ncsa.illinois.edu

 

I have a job in the queue and it's not running yet--why?

    At NCSA, we run the same job scheduling package on all of our batch systems. Jobs are scheduled first-in, first-out (FIFO), with some important modifications to make sure that all jobs and users get a fair chance to run. Some of the scenarios you may observe are explained here.

    Q: Why are many machines or resources currently idle?

    A: This is how the system looks shortly before a large cpu-count job starts, because the scheduler has to let resources sit idle until enough are free for that job. The large job will probably start very soon, and the idle resources will then be used again.

    Q: User X submitted a job like mine after I submitted my job, yet their job has already started. Why?

    A: Our scheduler "remembers" for a while how many jobs each user has run. If User X has run few jobs recently and you have run more, User X gets higher priority so that they can catch up and everyone gets a fair share of the system.

    The scheduler will also backfill smaller and shorter jobs to utilize any idle resources.

    Q: How can I get a test, debug, or any of my jobs to start sooner--I really need to try something out now!

    A: The debug queue is available for small cpu-count jobs. In addition, because the scheduler has to let resources sit idle before starting any large (many-cpu) job, you can take advantage of that situation by submitting a job that requests a short walltime.

    Q: I don't understand why my job hasn't started. How can anyone get work done when the queues are so busy?

    A: See the output of the qs command. It shows the queued jobs along with the time they have been waiting in the queue. By comparing your jobs with jobs that have waited a similar time, you can see that other users' experience is similar to yours.

I have an account and can log in, so why do I get a message saying I have no accounts that can be used for batch jobs on this system when I try to submit jobs?

    This is probably because your account has expired or has used up its allocation. You can use the batch_accts command to check your accounts.

    You can also check details of your allocation and usage on the TeraGrid Portal or using the TeraGrid tgusage command.

I get the error message "qsub: illegal -N value" when I submit a PBS job. What does this mean?

    This error message occurs when the jobname specified in the -N option is longer than 15 characters. If your batch script does not specify the -N option, the name of the batch script is used as the jobname, so the same limit applies to the batch script name in that case.
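
    As a minimal sketch of the limit described above, the following shell function checks a candidate jobname (or batch script name) against the 15-character limit before you submit. The function name check_jobname and the sample names are hypothetical and are not part of PBS.

```shell
# Hypothetical helper (not a PBS command): checks a name against the
# 15-character jobname limit described above.
check_jobname() {
    if [ "${#1}" -gt 15 ]; then
        echo "too long (${#1} characters): $1"
        return 1
    fi
    echo "ok: $1"
}

# Example calls (sample names are made up):
#   check_jobname sim042.pbs                  prints "ok: sim042.pbs"
#   check_jobname my_simulation_run_042.pbs   reports the name as too long
```

    If the check fails, either rename the batch script or give the job a shorter name with the -N option.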

I get a warning message "Warning: no access to tty (Bad file descriptor). Thus no job control in this shell." in my stdout file when I run a batch job.

    This message is harmless and can safely be ignored. It just means that there is no interactive shell access for this job.

What is NCSA's refund policy for lost jobs?

Why does the qsub command give an illegal resource value error?

    You may get the following error message when you submit a batch job:

        qsub: Illegal attribute or resource value Resource_List.mem

    This happens because the memory limit must be specified as an integer.
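
    As a rough illustration, the sketch below checks whether the numeric part of a memory request is a whole number. It assumes the request is written as a number followed by an optional unit suffix such as mb or gb (an assumption about this PBS installation), and the function name is_integer_mem is hypothetical.

```shell
# Hypothetical helper (not a PBS command): succeeds only if the numeric
# part of a memory request is a whole number.
is_integer_mem() {
    # Strip a trailing run of letters (the unit, if any).
    number=$(printf '%s' "$1" | sed 's/[A-Za-z]*$//')
    # Require at least one digit and nothing but digits.
    case $number in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example calls (values are made up):
#   is_integer_mem 1536mb   succeeds
#   is_integer_mem 1.5gb    fails, because 1.5 is not an integer
```

    Under these assumptions, a fractional request such as 1.5gb would need to be rewritten as a whole number in a smaller unit, for example 1536mb.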

My OpenMP program runs fine with a small number of threads but fails when I increase the number of threads.

    Increasing the number of threads in your OpenMP program may increase the stack space needed when you run your executable. The default stack size for OpenMP applications, set by the environment variable KMP_STACKSIZE, is 4M. Try setting KMP_STACKSIZE to 8M or 16M and see if that solves your problem.
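
    A minimal sketch of that setting in a bash batch script follows. The thread count and the executable name are hypothetical; in csh/tcsh scripts the equivalent form is setenv KMP_STACKSIZE 16M.

```shell
# Raise the per-thread OpenMP stack size from the 4M default.
export KMP_STACKSIZE=16M

# Hypothetical thread count for this run.
export OMP_NUM_THREADS=32

echo "KMP_STACKSIZE=$KMP_STACKSIZE OMP_NUM_THREADS=$OMP_NUM_THREADS"

# Launch the program after the variables are set (executable name is made up):
# ./my_openmp_app
```

    Set the variable before the line that launches the executable so that the program inherits it.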

I get the error MPI: mmap failed (memmap_base) for 170761512744 pages (699439156199424 bytes) on 324 ranks when running my MPI program at very large scale.

    Please set the following environment variable in your batch script before the mpirun command:

    setenv MPI_MEMMAP_OFF 1    # for csh/tcsh scripts
    
    export MPI_MEMMAP_OFF=1    # for bash scripts
    

Since ssh access to a compute host is not available, how can I run an application that requires an X display in an interactive batch job?

    You can use VNC (Virtual Network Computing) to accomplish this; see Using VNC on Ember for details.

I need to submit an MPI-OpenMP hybrid job on Ember - is there a sample batch script I can use?

    You can start from the online sample MPI batch script (/usr/local/doc/batch_scripts/mpi.pbs) and make the following changes:
    • list the total cpus (MPI_processes * OpenMP_threads) in the #PBS -l ncpus directive
    • set the environment variable OMP_NUM_THREADS to the number of OpenMP threads
    • explicitly set the number of MPI processes in the -np argument to mpirun
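
    The three changes above can be sketched as follows. This is not the contents of the sample script; the process and thread counts and the executable name hybrid_app are hypothetical, and the launch line is only printed so the fragment can be tried outside a batch job.

```shell
#!/bin/bash
# Total cpus = MPI processes * OpenMP threads (8 * 6 = 48 in this example).
#PBS -l ncpus=48

# Hypothetical number of MPI processes.
MPI_PROCS=8

# OpenMP threads per MPI process.
export OMP_NUM_THREADS=6

# Cross-check the value given in the ncpus directive.
TOTAL_CPUS=$((MPI_PROCS * OMP_NUM_THREADS))
echo "ncpus should be $TOTAL_CPUS"

# Print the launch line; remove 'echo' to run it inside a batch job.
echo mpirun -np "$MPI_PROCS" ./hybrid_app
```

    Keeping the two counts in variables makes it easy to confirm that their product matches the ncpus request whenever either one changes.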