NCSA TeraGrid IA-64 Linux Cluster: Frequently Asked Questions
Environment
Files/File Systems
Mass Storage System (UniTree)
Batch System
Error Messages
When I log in, I see error messages similar to the following and
the shell can't find any commands:
SoftEnv 1.6.2: updating your software environment, one moment...
SoftEnv 1.6.2 system error: could not create the file:
/u/ac/jdoe/.soft.cache.csh
Usually this is a sign that the home quota is exceeded. If you
are seeing "Command not found", you'll need to temporarily use
full paths to commands.
rm and mv are both in /bin.
To check your disk usage:
quota -s
As a first step, try logging off and logging back on. If this doesn't help, remove the SoftEnv cache files .soft.cache.csh and
.soft.cache.sh from your home directory, then log off and log back on. Removing the .soft file in your home directory
causes a fresh default .soft file to be created the next time you log in, which may help as well.
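The cleanup steps above can be sketched as a small shell function. This is only an illustration: the function name clear_softenv_cache is hypothetical, and it uses the full path /bin/rm because, as noted above, commands may not be found on the search path while the environment is broken.

```shell
# Remove the SoftEnv cache files from a home directory.
# $1 is the home directory to clean; it defaults to $HOME.
# The function name is hypothetical, chosen for this sketch.
clear_softenv_cache() {
    home_dir="${1:-$HOME}"
    # Full path to rm, since PATH may be unusable at this point.
    /bin/rm -f "$home_dir/.soft.cache.csh" "$home_dir/.soft.cache.sh"
}
```

Call clear_softenv_cache with no arguments, then log off and log back on. The function deliberately leaves the .soft file alone.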
- I accidentally deleted files in my home directory. Is there any way to get them back?
As noted on the File Systems web page, NCSA does daily backups on home
directories, so the files can be restored from backups. Send email to
help@teragrid.org with the full pathname of the files and the dates
when the files were intact. Be sure to mention that this is for the
NCSA TeraGrid IA-64 Linux cluster.
- Are the mssftp and msscmd utilities available on
the NCSA TeraGrid IA-64 Linux Cluster as on other NCSA HPC systems?
Yes, mssftp and msscmd are available as of Feb 14, 2005. Also note that
the uberftp utility offers the same functionality with the
"-a MSS" authentication.
See the Permanent Storage section of the documentation for details.
This message is harmless and can safely be ignored. It just means that there
is no interactive shell access for this job.
Use the "-X" option/flag on your qsub line.
Example:
qsub -I -V -X -l walltime=00:05:00,nodes=1:ppn=2
My job is done. Where is my output?
The standard output and error file(s) for the job should be in the
directory from which the qsub command was issued. If they are not
there, check ~/.pbs_spool, the temporary location for job output
files. For Torque to move those temporary files to their final
destination, your home directory and the $HOME/.pbs_spool directory
must both grant execute permission to other users:
chmod o+x $HOME $HOME/.pbs_spool
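To see whether the permission is actually missing before changing it, a check along the following lines may help. It is a sketch rather than an NCSA-provided tool: the function name ensure_other_exec is hypothetical, and it reads the mode string printed by ls -ld, whose tenth character is the execute bit for other users.

```shell
# Report whether each directory grants execute permission to other users,
# and add it where it is missing. The function name is hypothetical.
ensure_other_exec() {
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "skipping $dir: not a directory" >&2
            continue
        fi
        # In a mode string such as drwx--x--x, character 10 is "other execute".
        # "t" means the sticky bit is set together with execute.
        bit=$(ls -ld "$dir" | cut -c10)
        case "$bit" in
            x|t) echo "$dir: other-execute already set" ;;
            *)   chmod o+x "$dir" && echo "$dir: added other-execute" ;;
        esac
    done
}
```

A typical call would be: ensure_other_exec "$HOME" "$HOME/.pbs_spool"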
I get error messages about problems with
copying my stdout and stderr files.
For error messages similar to:
Unable to copy file /home/ac/jdoe/.pbs_spool/123456.tg-m.OU to
jdoe@tg-login3.ncsa.teragrid.org:/home/ac/jdoe/testjob.o123456
>>> error from copy
/home/ac/jdoe/.pbs_spool/123456.tg-m.OU: No such file or directory
>>> end error output
Your home directory and the $HOME/.pbs_spool directory need world execute
permission so that the output files can be copied to the locations given by your
PBS -o and -e directives, in the directory from which you submitted the job:
chmod o+x $HOME $HOME/.pbs_spool
However, the files are still transferred under their original names, so you
should find them in your home directory as files ending in .tg-m.ER and .tg-m.OU.
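One way to locate such leftover files is sketched below. The function name list_spooled_output is hypothetical. It assumes the files keep the .OU and .ER suffixes shown in the error message above, and that your find supports -maxdepth, which is a common extension rather than part of POSIX.

```shell
# List job output files left behind under their spool names.
# $1 is the home directory to search; it defaults to $HOME.
# The function name is hypothetical.
list_spooled_output() {
    home_dir="${1:-$HOME}"
    # Look only at the top level of the home directory and of .pbs_spool.
    find "$home_dir" "$home_dir/.pbs_spool" -maxdepth 1 -type f \
        \( -name '*.OU' -o -name '*.ER' \) 2>/dev/null | sort -u
}
```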
I have a job in the queue and it's not running yet--why?
There can be a variety of issues involved. Here's an FAQ list for some of them.
We have observed that with static libraries (.a), this may be due to where the -L and -l flags are placed in the link command
and, when there are multiple libraries, to the order of the libraries.
Please put the -L and -l flags at the end of the link command. If you are linking to multiple libraries, list the most dependent library first (each library before the libraries it depends on), then try again.
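As an illustration of that ordering, the sketch below only builds and prints a link command. Every name in it is hypothetical: the compiler driver cc, the object file main.o, the library directory, and two static libraries where libsolver.a calls routines in libmathcore.a.

```shell
# All names here are hypothetical and only illustrate flag placement.
OBJECTS="main.o"
LIBDIR="$HOME/lib"
# Object files come first; -L and -l flags go at the end.
# libsolver.a depends on libmathcore.a, so -lsolver is listed first.
LINK_CMD="cc -o myapp $OBJECTS -L$LIBDIR -lsolver -lmathcore"
echo "$LINK_CMD"
```

Substitute your own compiler, objects, and libraries, keeping the same relative order.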
- The output for the mpirun command from inside a batch job just
has a "Permission Denied" line for each process.
OR
The output for the mpirun command from inside a batch job
contains:
ssh: connect to host tg-master port 22: Connection refused
Make sure that you are specifying the machine file:
mpirun -machinefile $PBS_NODEFILE ...
See the sample batch script for usage information.
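The fragment below sketches how the machine file might be passed from inside a batch script. It is not the official sample script. The function name launch_mpi is hypothetical, the -np option is assumed to be accepted by the installed mpirun, and the rank count assumes the node file has one line per allocated processor.

```shell
#!/bin/sh
#PBS -l walltime=00:05:00,nodes=1:ppn=2

# Start an MPI program on the nodes PBS assigned to this job.
# $1 is the executable; any further arguments are passed through to it.
# The function name is hypothetical.
launch_mpi() {
    if [ -z "$PBS_NODEFILE" ] || [ ! -r "$PBS_NODEFILE" ]; then
        echo "PBS_NODEFILE is not set or not readable; run this inside a batch job" >&2
        return 1
    fi
    # Assumption: one line per processor slot, so the line count is the rank count.
    np=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
    mpirun -np "$np" -machinefile "$PBS_NODEFILE" "$@"
}
```

In the job script, change to the submission directory with cd "$PBS_O_WORKDIR" and then call launch_mpi ./a.out, where a.out stands for your own executable.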
- My mpirun output contains the following X11 error messages.
What do these mean?
Warning: No xauth data; using fake authentication data for X11 forwarding.
/usr/X11R6/bin/xauth: error in locking authority file /home/ac/jdoe/.Xauthority
These error messages are common and do not normally affect
users' jobs (unless the program uses the X Window System
for a graphical user interface on the compute nodes). By default,
the ssh client on the NCSA TeraGrid is configured to do X11
forwarding (see ssh_config man page for details). mpirun uses
ssh to start the MPI program. When an MPI program starts, all
the nodes try to read the .Xauthority file at the same time
and some nodes are "locked out". This results in the above
errors.
If you do not need X11 forwarding (i.e., you never use ssh from the
login node to a compute node or any other system to run a program
that uses X11 for its graphical user interface), create a file
called ~/.ssh/config and enter the line:
ForwardX11 no
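The same change can be made from the command line, roughly as follows. The function name disable_x11_forwarding is hypothetical. The sketch assumes the config file is new or has no Host sections, since a line appended after a Host section would apply only to that section. If a ForwardX11 setting is already present, the file is left untouched.

```shell
# Add "ForwardX11 no" to an ssh client config file if no ForwardX11
# setting is present. $1 is the ssh directory; it defaults to $HOME/.ssh.
# The function name is hypothetical.
disable_x11_forwarding() {
    cfg_dir="${1:-$HOME/.ssh}"
    cfg="$cfg_dir/config"
    mkdir -p "$cfg_dir"
    if [ -f "$cfg" ] && grep -qi '^[[:space:]]*ForwardX11[[:space:]=]' "$cfg"; then
        echo "ForwardX11 is already set in $cfg; edit it by hand" >&2
        return 0
    fi
    echo "ForwardX11 no" >> "$cfg"
    # Keep the config file private to its owner.
    chmod 600 "$cfg"
}
```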
- My mpirun output contains an error message similar to the following:
FATAL ERROR 18 on MPI node 103 (tg-c083): MPI node 99 ((null)) is
unreachable via Myrinet: check the host, cables or mapping
Small/Ctrl message completion error!
There is a system problem with a particular node. Please report it to help@teragrid.org. Include the following:
- On which TeraGrid system this problem occurred
- Batch jobid
- the job's standard output and standard error file(s)
as plain text in the email (instead of attachments)
- My mpirun output contains the following error message. Is this a system problem?
Error: Unable to open a GM port !
Yes, this is a system problem. Please report it to help@teragrid.org. Include the following:
- On which TeraGrid system this problem occurred
- Batch jobid
- the job's standard output and standard error file(s)
as plain text in the email (instead of attachments)
This is probably because you do not have the current system environment.
If you haven't customized your $HOME/.soft file, delete this file,
log off and log back on. A new .soft file with the current
system environment will be created for you.
If you have customized your
.soft file, replace the softenv macros in the file (such as
@default, @teragrid)
with:
@teragrid-basic
@globus-4.0
@teragrid-dev
Then issue the command resoft or log off and log back on for the changes
to take effect.
This error is usually caused by the presence of old (expired) *.r0 files (dot R zero)
in the "$HOME/.globus/certificates/" directory. The full error message looks
something like:
Failed to init security context
GSS Major Status: Authentication Failed
GSS Minor Status Error Chain:
globus_gsi_gssapi: SSLv3 handshake problems
globus_gsi_callback_module: Could not verify credential
globus_gsi_callback_module: Could not verify credential
globus_gsi_callback_module: Invalid CRL: The available CRL has expired
To fix the issue, move or delete your "certificates" directory in
"$HOME/.globus/" and then try your mssftp or uberftp command again.
If you are running these commands on NCSA production resources and
the error message remains after you have moved or deleted your "certificates"
directory, please notify the Consulting Group.
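Moving the directory aside, rather than deleting it, keeps a copy in case it is needed later. The sketch below does that. The function name retire_certificates and the timestamped backup name are hypothetical choices, not part of any Globus tool.

```shell
# Move the per-user Globus certificates directory out of the way.
# $1 is the Globus directory; it defaults to $HOME/.globus.
# The function name and the backup naming scheme are hypothetical.
retire_certificates() {
    globus_dir="${1:-$HOME/.globus}"
    cert_dir="$globus_dir/certificates"
    if [ ! -d "$cert_dir" ]; then
        echo "no $cert_dir directory; nothing to do"
        return 0
    fi
    # Show any CRL files (*.r0) that are about to be moved, for reference.
    ls "$cert_dir"/*.r0 2>/dev/null || true
    backup="$globus_dir/certificates.$(date +%Y%m%d%H%M%S)"
    mv "$cert_dir" "$backup" && echo "moved $cert_dir to $backup"
}
```

After running retire_certificates, try the mssftp or uberftp command again.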