NCSA TeraGrid IA-64 Linux Cluster: Frequently Asked Questions

 

Environment

Files/File Systems

Mass Storage System (UniTree)

Batch System

Error Messages


Environment

    When I log in, I see error messages similar to the following and it can't find any commands:

         SoftEnv 1.6.2: updating your software environment, one moment...
         SoftEnv 1.6.2 system error: could not create the file:
         /u/ac/jdoe/.soft.cache.csh
      
      This is usually a sign that your home directory quota has been exceeded. If you are seeing "Command not found", you will need to use full paths to commands temporarily; rm and mv are both in /bin.

      To check your disk usage:
      quota -s
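
Once the quota is confirmed as the cause, it helps to see what is using the space. The following is a minimal sketch, assuming an sh-compatible shell and the standard du, sort, and head utilities; the function name largest_entries is hypothetical and not an NCSA-provided command. If commands are still not found, call the utilities by their full paths.

```shell
# List the N largest entries directly under a directory (sizes in KB).
# Usage: largest_entries [directory] [count]
largest_entries() {
    dir=${1:-$HOME}
    count=${2:-10}
    # -s: one total per argument, -k: report kilobytes.
    # The second glob picks up dotfiles; errors for unmatched globs are hidden.
    du -sk "$dir"/* "$dir"/.[!.]* 2>/dev/null | sort -rn | head -n "$count"
}
```

Running largest_entries with no arguments lists the ten largest files and directories in your home directory, which are the first candidates to remove or move elsewhere.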

    My environment seems to be munged ($PATH, etc.)

      As a first step, try logging off and logging back on. If this doesn't help, remove the SoftEnv cache files .soft.cache.csh and .soft.cache.sh from your home directory, then log off and log back on. Removing the .soft file in your home directory causes a fresh default .soft file to be created the next time you log in, which may help as well.
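
The cache cleanup described above can be sketched as follows. This assumes an sh-compatible shell; the function name clear_softenv_cache is hypothetical, and the file names are the ones given in the answer.

```shell
# Remove the SoftEnv cache files from a home directory (default: $HOME).
# Usage: clear_softenv_cache [home_directory]
clear_softenv_cache() {
    home_dir=${1:-$HOME}
    # Use the full path to rm in case $PATH itself is what is broken.
    /bin/rm -f "$home_dir/.soft.cache.csh" "$home_dir/.soft.cache.sh"
}
```

After calling clear_softenv_cache, log off and log back on so that the cache files are regenerated. The sketch deliberately leaves the .soft file alone, since removing it discards any customizations.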

Files/File Systems

  • I accidentally deleted files in my home directory. Is there any way to get them back?

    As noted on the File Systems web page, NCSA makes daily backups of home directories, so the files can be restored from backups. Send email to help@teragrid.org with the full pathnames of the files and the dates when the files were intact. Be sure to mention that this is for the NCSA TeraGrid IA-64 Linux cluster.

Mass Storage System (UniTree)

  • Are the mssftp and msscmd utilities available on the NCSA TeraGrid IA-64 Linux Cluster as on other NCSA HPC systems?

    Yes, mssftp and msscmd are available as of Feb 14, 2005. Also note that the uberftp utility offers the same functionality with the "-a MSS" authentication. See the Permanent Storage section of the documentation for details.

Batch System

  • I have a job in the queue and it's not running yet--why?

      At NCSA, we run the same job scheduling package on all of our batch systems. Job scheduling is based on first-in, first-out (FIFO) ordering, with some important modifications to make sure that all jobs and users get a fair chance to run. Some of the scenarios you may observe are explained here.

      Q: Why are many machines or resources currently idle?

      A: The system always looks this way in the period before a large CPU-count job starts, because the scheduler lets resources sit idle until enough are free for that job. The large job will probably start very soon, and the idle resources will be used once again.

      Q: User X submitted a job like mine after I submitted my job, yet their job has already started. Why?

      A: Our scheduler "remembers" for a while how many jobs each user has run. If User X has run few jobs recently and you have run more, User X gets higher priority so that they can catch up and everyone gets a fair share of the system.

      The scheduler will also backfill smaller and shorter jobs to utilize any idle resources.

      Q: How can I get a test, debug, or any of my jobs to start sooner--I really need to try something out now!

      A: The debug queue is available for jobs that use a small number of CPUs. In addition, because the scheduler has to let resources sit idle before starting any large (many-CPU) job, you can try to take advantage of that situation by submitting a job that requests a short walltime.

      Q: I don't understand why my job hasn't started. How can anyone get work done when the queues are so busy?

      A: See the output of the qs command. It shows the queued jobs along with the time they have been waiting in the queue. Comparing your jobs with other jobs that have waited a similar time will show whether other users are having the same experience as you.

    I have an account and can log in, but when I try to submit jobs I get a message saying I have no accounts that can be used for batch jobs on this system. Why?

      This is probably because your account has expired or its allocation has been used up. Alternatively, you may have only a 10-SU courtesy account, which cannot be used to submit batch jobs. Use the 'tgusage' command to check your account's expiration date and the SUs that were allocated and used.

    I get the error message "qsub: illegal -N value " when I submit a PBS job. What does this mean?

      This error message occurs when the job name specified with the -N option is longer than 15 characters or when the first character of the name is not alphabetic. From the qsub man page:

      -N name  Declares  a name for the job.  The name specified may be up to
                and including 15 characters in length.   It  must  consist  of
                printable, non white space characters with the first character
                alphabetic.
      
      If your batch script does not specify the -N option, the name of the batch script is used as the jobname, so the above limitations will apply for the batch script name in this case.
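
The rule quoted from the man page can be checked before submitting. The following is a minimal sketch, assuming an sh-compatible shell; valid_pbs_jobname is a hypothetical helper, not part of PBS or Torque.

```shell
# Return 0 if the argument satisfies the qsub -N rules quoted above:
# at most 15 characters, printable non-whitespace characters only,
# and an alphabetic first character.
valid_pbs_jobname() {
    name=$1
    # Length must be between 1 and 15 characters.
    [ "${#name}" -ge 1 ] && [ "${#name}" -le 15 ] || return 1
    case $name in
        [!A-Za-z]*) return 1 ;;       # first character is not alphabetic
        *[![:graph:]]*) return 1 ;;   # contains whitespace or a non-printable character
    esac
    return 0
}
```

For example, `valid_pbs_jobname "$(basename myscript.pbs)" || echo "rename the script or add -N"` catches a script whose file name would be rejected when no -N option is given.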

    I get a warning message "Warning: no access to tty (Bad file descriptor). Thus no job control in this shell." in my stdout file when I run a batch job.

      This message is harmless and can safely be ignored. It just means that there is no interactive shell access for this job.

    How do I enable X11 forwarding for an interactive batch session?

      Use the "-X" option/flag on your qsub line.

      Ex.
           qsub -I -V -X -l walltime=00:05:00,nodes=1:ppn=2

  • My job is done. Where is my output?

    The standard output and error file(s) for the job should be in the directory from which the qsub command was issued. If they are not there, check the directory ~/.pbs_spool, which is the temporary location for job output files. For Torque to move those temporary files to their final destination, your home directory and your $HOME/.pbs_spool directory must both have execute permission for other users:

    chmod o+x $HOME $HOME/.pbs_spool
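
To see whether the permission is already set before changing anything, a check along these lines may help. It is a minimal sketch, assuming an sh-compatible shell and the usual ls -ld mode string; has_other_exec is a hypothetical helper name.

```shell
# Return 0 if the given directory has execute permission for "other" users.
has_other_exec() {
    # Character 10 of the ls -ld mode string is the "other" execute bit:
    # "x" (or "t" when the sticky bit is also set) means it is enabled.
    bit=$(ls -ld "$1" | cut -c10)
    [ "$bit" = "x" ] || [ "$bit" = "t" ]
}

# Report which of the two directories still needs chmod o+x.
for dir in "$HOME" "$HOME/.pbs_spool"; do
    if [ -d "$dir" ] && ! has_other_exec "$dir"; then
        echo "$dir lacks execute permission for other users; run: chmod o+x $dir"
    fi
done
```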

  • I get error messages about problems with copying my stdout and stderr files.

    For error messages similar to:

    Unable to copy file /home/ac/jdoe/.pbs_spool/123456.tg-m.OU to
    jdoe@tg-login3.ncsa.teragrid.org:/home/ac/jdoe/testjob.o123456
    >>> error from copy
    /home/ac/jdoe/.pbs_spool/123456.tg-m.OU: No such file or directory
    >>> end error output
    
    Your home directory and your $HOME/.pbs_spool directory need world execute permission so that the output files can be copied to the files named in your PBS -o and -e directives, in the directory from which you submitted the job.
    chmod o+x $HOME $HOME/.pbs_spool
    However, the files are still transferred under their spool names, so you should find them in your home directory with names ending in .tg-m.ER and .tg-m.OU.

Error Messages

    I have specified all the needed libraries with the -L and -l flags on the command line when linking my code. Why do I still get "unresolved reference" errors for some subroutines from those libraries?

      We have observed that with static libraries (.a), this can be caused by where the -L and -l flags are placed in the link command and, when there are multiple libraries, by the order of the libraries. Put the -L and -l flags at the end of the link command, and if you are linking to multiple libraries, put the most dependent library first, followed by the libraries it depends on, then try again.
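
The ordering rule can be tried out on a toy example. The sketch below assumes an sh-compatible shell with a C compiler available as cc and the ar archiver; every file, function, and library name in it (base, upper, demo, demo_link_order) is hypothetical and exists only for illustration.

```shell
# Build two tiny static libraries in a scratch directory and link them
# in dependency order: libupper.a calls into libbase.a.
demo_link_order() {
    work=$(mktemp -d) || return 1
    (
        cd "$work" || exit 1
        printf 'int base_fn(void) { return 42; }\n' > base.c
        printf 'int base_fn(void);\nint upper_fn(void) { return base_fn(); }\n' > upper.c
        printf 'int upper_fn(void);\nint main(void) { return upper_fn() == 42 ? 0 : 1; }\n' > main.c
        cc -c base.c upper.c main.c || exit 1
        ar rcs libbase.a base.o || exit 1
        ar rcs libupper.a upper.o || exit 1
        # Object files first, -L and -l flags at the end. libupper depends
        # on libbase, so -lupper is listed before -lbase.
        cc -o demo main.o -L. -lupper -lbase || exit 1
        ./demo
    )
    status=$?
    rm -rf "$work"
    return $status
}
```

With a typical single-pass linker, swapping the last two flags to `-lbase -lupper` is the kind of ordering that can leave base_fn unresolved, because libbase.a is scanned before anything has referenced it.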

  • The output for the mpirun command from inside a batch job just has a "Permission Denied" line for each process.
    OR
    The output for the mpirun command from inside a batch job contains:
       ssh: connect to host tg-master port 22: Connection refused
    

    Make sure that you are specifying the machine file:

       mpirun -machinefile $PBS_NODEFILE  ...
    
    See the sample batch script for usage information.
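
For orientation, a batch script that passes the machine file might look roughly like the following. This is a minimal sketch rather than the NCSA sample script: the job name, resource request, helper function count_slots, and executable name ./my_mpi_prog are assumptions, and it assumes that $PBS_NODEFILE lists one line per process slot and that mpirun accepts -np.

```shell
#!/bin/sh
#PBS -N mpitest
#PBS -l walltime=00:05:00,nodes=2:ppn=2

# Count the process slots listed in a PBS node file (one line per slot).
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Only launch when running under PBS, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    # Start from the directory where qsub was issued.
    cd "$PBS_O_WORKDIR" || exit 1
    # ./my_mpi_prog is a hypothetical executable name.
    mpirun -np "$(count_slots "$PBS_NODEFILE")" -machinefile "$PBS_NODEFILE" ./my_mpi_prog
fi
```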

  • My mpirun output contains the following X11 error messages. What do these mean?
       Warning: No xauth data; using fake authentication data for X11 forwarding.
       /usr/X11R6/bin/xauth:  error in locking authority file /home/ac/jdoe/.Xauthority
    

    These error messages are common and do not normally affect users' jobs (unless the program uses the X Window System for a graphical user interface on the compute nodes). By default, the ssh client on the NCSA TeraGrid is configured to do X11 forwarding (see the ssh_config man page for details). mpirun uses ssh to start the MPI program. When an MPI program starts, all the nodes try to read the .Xauthority file at the same time and some nodes are "locked out". This results in the above errors.

    If you do not need X11 forwarding (i.e., you never use ssh from the login node to a compute node or any other system to run a program that uses X11 for its graphical user interface), create a file called ~/.ssh/config and enter the line:

       ForwardX11 no
    

  • My mpirun output contains an error message similar to the following:
       FATAL ERROR 18 on MPI node 103 (tg-c083): MPI node 99 ((null)) is 
       unreachable via Myrinet: check the host, cables or mapping
       Small/Ctrl message completion error!
    

    There is a system problem with a particular node. Please report it to help@teragrid.org. Include the following:

    • On which TeraGrid system this problem occurred
    • Batch jobid
    • the job's standard output and standard error file(s) as plain text in the email (instead of attachments)

  • My mpirun output contains the following error message. Is this a system problem?
       Error: Unable to open a GM port !
    

    Yes, this is a system problem. Please report it to help@teragrid.org. Include the following:

    • On which TeraGrid system this problem occurred
    • Batch jobid
    • the job's standard output and standard error file(s) as plain text in the email (instead of attachments)

  • I get the error message "tgusage: Command not found." when I try to use the tgusage command.

      This is probably because you do not have the current system environment.

      If you haven't customized your $HOME/.soft file, delete this file, log off and log back on. A new .soft file with the current system environment will be created for you.

      If you have customized your .soft file, replace the softenv macros in the file (such as @default, @teragrid) with:

         @teragrid-basic
         @globus-4.0
         @teragrid-dev
      
      Then issue the command resoft or log off and log back on for the changes to take effect.

    What does it mean when the command uberftp or mssftp fails with an "Invalid CRL: The available CRL has expired" error message?

      This error is usually caused by the presence of old (expired) *.r0 files (dot R zero) in the "$HOME/.globus/certificates/" directory. The full error message looks something like this:
         Failed to init security context
         GSS Major Status: Authentication Failed
      
         GSS Minor Status Error Chain:
         globus_gsi_gssapi: SSLv3 handshake problems
         globus_gsi_callback_module: Could not verify credential
         globus_gsi_callback_module: Could not verify credential
         globus_gsi_callback_module: Invalid CRL: The available CRL has expired
      

      To fix the issue, move or delete your "certificates" directory in "$HOME/.globus/" and then try your mssftp or uberftp command again. If you are running these commands on NCSA production resources and if the error message remains after you have moved/deleted your "certificates" directory, please notify the Consulting Group.
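
The "move" option can be sketched as follows, which keeps the old directory around in case it is needed later. This assumes an sh-compatible shell; set_aside_certificates and the certificates.old.<timestamp> naming are hypothetical choices, not an NCSA convention.

```shell
# Move $HOME/.globus/certificates aside under a timestamped name.
# Usage: set_aside_certificates [globus_directory]
set_aside_certificates() {
    globus_dir=${1:-$HOME/.globus}
    # Nothing to do if there is no certificates directory.
    [ -d "$globus_dir/certificates" ] || return 0
    # Show the CRL (*.r0) files that are about to be moved out of the way.
    ls -l "$globus_dir"/certificates/*.r0 2>/dev/null
    mv "$globus_dir/certificates" "$globus_dir/certificates.old.$(date +%Y%m%d%H%M%S)"
}
```

After running set_aside_certificates, retry the mssftp or uberftp command.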