At NCSA, we run the same job scheduling package on all of our batch systems.
The scheduling of jobs is based on first-in, first-out [FIFO] but with some
important modifications to make sure that all jobs and users get a fair chance
to run. Some of the various scenarios you may observe are explained here.
Q: Why are many machines or resources currently idle?
A: This is normal in the period before a large cpu-count job starts: the
scheduler has to let resources sit idle until enough are free for that job.
The large job will probably start very soon, and the idle resources will be
used once again.
Q: User X just submitted a job like mine after my job and their job has
started, why?
A: Our scheduler "remembers" how many jobs each user has run recently. If
User X has run few jobs recently and you've run more, then User X gets
higher priority so that they can catch up and everyone gets a fair share
of the system.
The scheduler will also backfill smaller and shorter jobs to utilize any idle resources.
Q: How can I get a test, debug, or any of my jobs to start sooner--I really
need to try something out now!
A: The debug queue is available for small cpu-count jobs. In addition,
because the scheduler has to let resources sit idle before starting any
large [many cpu] job, you can try to take advantage of that situation by
submitting a job that requests a short walltime.
Q: I don't understand why my job hasn't started. How can anyone get work
done when the queues are so busy?
A: See the output of the qs command. It shows the queued jobs along with
the time they've been waiting in the queue. Compare your jobs to jobs that
have waited a similar time, and you will likely see that other users'
experience is similar to yours.
That's probably because your account has expired or is overused. Or you may only have a 10 SU courtesy account, which cannot be used to submit batch jobs. You can use the 'tgusage' command to
check your account's expiration date and the SUs that were allocated and used.
This error message occurs when the job name specified with the -N
option is longer than 15 characters or when the first character in the name
is not alphabetic. From the qsub man page:
-N name Declares a name for the job. The name specified may be up to
and including 15 characters in length. It must consist of
printable, non white space characters with the first character
alphabetic.
If your batch script does not specify the -N option, the name of the batch
script is used as the job name, so the above limitations apply to the batch
script name in this case.
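The naming rules quoted from the man page can be checked before submitting. The following is a minimal sketch, not part of qsub itself; the function name valid_jobname and the sample names are made up for illustration.

```shell
#!/bin/sh
# valid_jobname is a hypothetical helper. It returns 0 when the name follows
# the qsub -N rules quoted above: at most 15 characters, first character
# alphabetic, and only printable, non-whitespace characters.
valid_jobname() {
    name=$1
    [ -n "$name" ] || return 1              # empty names are rejected
    [ "${#name}" -le 15 ] || return 1       # longer than 15 characters
    case $name in
        [A-Za-z]*) ;;                       # first character is alphabetic
        *) return 1 ;;
    esac
    case $name in
        *[![:graph:]]*) return 1 ;;         # whitespace or non-printable character
    esac
    return 0
}

# Try a few illustrative names.
for candidate in testjob 2fast averyveryverylongjobname; do
    if valid_jobname "$candidate"; then
        echo "$candidate: ok"
    else
        echo "$candidate: rejected"
    fi
done
```

Because the batch script name is used when -N is absent, the same check can be applied to the script's file name.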
This message is harmless and can safely be ignored. It just means that there
is no interactive shell access for this job.
Use the "-X" option/flag on your qsub line.
Example:
qsub -I -V -X -l walltime=00:05:00,nodes=1:ppn=2
I got an error message: SEEK_SET is #defined but must not be for the C++ binding of MPI. What does that mean?
Users may get such error messages when using +mvapich2-intel and +openmpi-1.2-intel. The problem is that both stdio.h and the MPI C++ interface use SEEK_SET, SEEK_CUR, and SEEK_END. You can try adding
#undef SEEK_SET
#undef SEEK_END
#undef SEEK_CUR
before mpi.h is included, or add the definition
-DMPICH_IGNORE_CXX_SEEK for +mvapich2-intel, or
-DOMPI_IGNORE_CXX_SEEK for +openmpi-1.2-intel
to the command line (this will cause the MPI versions of SEEK_SET etc. to be skipped). (Please also refer to the
MPICH2 FAQ.)
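As an illustration of the command-line approach, the sketch below picks the define that matches the MPI stack and prints a compile command. The helper name seek_flag, the MPI_STACK variable, the file name mycode.cpp, and the use of the mpicxx wrapper are assumptions for the example; the command is only printed, not executed.

```shell
#!/bin/sh
# seek_flag is a hypothetical helper that maps the softenv key to the
# define named in the answer above.
seek_flag() {
    case $1 in
        +mvapich2-intel)    echo "-DMPICH_IGNORE_CXX_SEEK" ;;
        +openmpi-1.2-intel) echo "-DOMPI_IGNORE_CXX_SEEK" ;;
        *)                  echo "" ;;      # unknown stack: add nothing
    esac
}

MPI_STACK="+mvapich2-intel"     # assumption: whichever stack you have loaded
CXXFLAGS="$CXXFLAGS $(seek_flag "$MPI_STACK")"

# Print the compile line instead of running it; mycode.cpp is a placeholder.
echo "mpicxx $CXXFLAGS -c mycode.cpp"
```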
Errors in Nullcomm::Clone with C++.
+mvapich2-intel users, particularly those with older C++ compilers, may see error messages of the form
"error C2555: 'MPI::Nullcomm::Clone' : overriding virtual function differs from
'MPI::Comm::Clone' only by return type or calling convention".
This is caused by the compiler not implementing part of the C++ standard. To work around this problem, add the definition
-DHAVE_NO_VARIABLE_RETURN_TYPE_SUPPORT
to the CXXFLAGS variable or add a
#define HAVE_NO_VARIABLE_RETURN_TYPE_SUPPORT 1
before including mpi.h. (Please also refer to the
MPICH2 FAQ)
My job runs OK at small scale, but fails with a segmentation fault at larger scale. What may be wrong?
As you scale out, MPI needs more memory for buffers [depending on your communication pattern]. The application
may also need more memory at scale. Try some tests with ppn=6 or ppn=4 instead of ppn=8, or specify the
himem resource so that each process can access more memory.
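To see why a lower ppn helps, the rough arithmetic can be sketched as follows. The node memory of 8192 MB is a hypothetical figure chosen for illustration (check your system's documentation for the real value), and the estimate ignores memory used by the operating system and by MPI buffers.

```shell
#!/bin/sh
# Rough upper bound on memory per process on one node, in MB.
# mem_per_proc is a hypothetical helper; integer division rounds down.
mem_per_proc() {
    node_mem_mb=$1
    ppn=$2
    echo $((node_mem_mb / ppn))
}

NODE_MEM_MB=8192    # assumption: not a documented value for any NCSA system
for ppn in 8 6 4; do
    echo "ppn=$ppn: about $(mem_per_proc "$NODE_MEM_MB" "$ppn") MB per process"
done
```

With fewer processes per node, each process gets a larger share of the node's memory, at the cost of leaving some cores idle.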
My MPICH-VMI job fails, but does not provide any error messages. How can I diagnose the problem?
Run mpirun with one of the following verbosity flags to get more diagnostic output:
-v Verbose level 1: MPIRUN verbose & VMI startup
-vv Verbose level 2: Warning messages
-vvv Verbose level 3: Error messages
-vvvv Verbose level 10: Excess Debug (Everything)
I am using mvapich2-intel; why did I get this error message: mpiexec_abe1134: cannot connect to local mpd (/tmp/mpd2.console_userid)?
To run jobs with mvapich2-intel, please make sure you are using the sample batch script from /usr/local/doc/batch_scripts.
The mpd must be set up; otherwise you may get this kind of error message.
If you also get the error message failed to ping mpd on abe0159 (or other node); recvd output={} together with the above error message, please check
the .mpd.conf file in your home directory: MPD_SECRETWORD must be set to a value of at least one character.
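A quick way to check the file is sketched below. It assumes the secret word is stored on a line of the form MPD_SECRETWORD=value; the helper name check_mpd_conf is made up for illustration.

```shell
#!/bin/sh
# check_mpd_conf is a hypothetical helper. It succeeds when the given file
# (default: ~/.mpd.conf) has a MPD_SECRETWORD=... line with a non-empty value.
check_mpd_conf() {
    conf=${1:-$HOME/.mpd.conf}
    if [ ! -f "$conf" ]; then
        echo "missing: $conf"
        return 1
    fi
    # Require at least one non-whitespace character after the equals sign.
    if grep -q '^[[:space:]]*MPD_SECRETWORD=[^[:space:]]' "$conf"; then
        echo "ok: MPD_SECRETWORD is set in $conf"
    else
        echo "MPD_SECRETWORD is empty or missing in $conf"
        return 1
    fi
}

check_mpd_conf || echo "fix ~/.mpd.conf before submitting the job"
```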
I got error messages like: Unable to allocate QP. Unable to get response data from sock_cm! OPENFABRIC Device(fatal):VMI_Buffer_Allocate(): Error registering memory region. when I scale out with VMI. What do those error messages mean?
These error messages are likely symptoms of the nodes running out of memory. Please keep in mind that MPI itself needs some memory and might need more for larger scale runs. Please try with ppn=6 or ppn=4, or specify the
himem resource.
One possible reason for the error is that MPI treats some data as first-class MPI-2 derived datatypes, whereas ROMIO assumes
that they are built on top of MPI-1 derived datatypes. You may want to try VMI2, which is built on top of MPI-1.
My job got an error message while writing output
Error: No space left on device. What does it mean?
The Lustre filesystem is served by several file servers, and if one of them
is full, you may see that message even though the filesystem shows available space. The message comes from the errno.h system header file:
[username@honest2 ~]$ grep ENOSPC /usr/include/asm-x86_64/errno.h
#define ENOSPC 28 /* No space left on device */
You may observe a file server [OST in the output below] close to capacity after
you've seen that error:
[username@honest2 ~]$ lfs df /cfs/scratch
UUID 1K-blocks Used Available Use% Mounted on
scr1_UUID 186017976 11347984 174669992 6% /cfs/scratch[MDT:0]
ost1_UUID 5676562992 2151041964 3525521028 37% /cfs/scratch[OST:0]
ost2_UUID 5676562992 2202281616 3474281376 38% /cfs/scratch[OST:1]
ost3_UUID 5676562992 2895861636 2780701356 51% /cfs/scratch[OST:2]
ost4_UUID 5676562992 2117350856 3559212136 37% /cfs/scratch[OST:3]
ost5_UUID 5676562992 2117267876 3559295116 37% /cfs/scratch[OST:4]
ost6_UUID 5676562992 2131391836 3545171156 37% /cfs/scratch[OST:5]
ost7_UUID 5676562992 2124743176 3551819816 37% /cfs/scratch[OST:6]
ost8_UUID 5676562992 2148473868 3528089124 37% /cfs/scratch[OST:7]
ost9_UUID 5676562992 2232595836 3443967156 39% /cfs/scratch[OST:8]
ost10_UUID 5676562992 2326244020 3350318972 40% /cfs/scratch[OST:9]
ost11_UUID 5676562992 2334170504 3342392488 41% /cfs/scratch[OST:10]
ost12_UUID 5676562992 2229796680 3446766312 39% /cfs/scratch[OST:11]
ost13_UUID 5676562992 2162607760 3513955232 38% /cfs/scratch[OST:12]
ost14_UUID 5676562992 2299783112 3376779880 40% /cfs/scratch[OST:13]
ost15_UUID 5676562992 2163394280 3513168712 38% /cfs/scratch[OST:14]
ost16_UUID 5676562992 2376409332 3300153660 41% /cfs/scratch[OST:15]
filesystem summary: 90825007872 36013414352 54811593520 39% /cfs/scratch
In any event, if you see that error, please report it to consult@ncsa.uiuc.edu.
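To spot a nearly full file server quickly, the Use% column of the lfs df output can be filtered. The sketch below assumes the column layout shown above; the helper name full_osts and the threshold of 50 are chosen only for illustration, and the sample rows are copied from the output above.

```shell
#!/bin/sh
# full_osts is a hypothetical helper. It reads `lfs df` output on stdin and
# prints the UUID and Use% of every OST at or above the given percentage.
full_osts() {
    threshold=$1
    awk -v t="$threshold" '
        /\[OST:[0-9]+\]/ {
            use = $5                 # fifth column, for example "51%"
            sub(/%/, "", use)        # drop the percent sign
            if (use + 0 >= t) print $1, $5
        }'
}

# On a real system you would run: lfs df /cfs/scratch | full_osts 90
full_osts 50 <<'EOF'
scr1_UUID 186017976 11347984 174669992 6% /cfs/scratch[MDT:0]
ost2_UUID 5676562992 2202281616 3474281376 38% /cfs/scratch[OST:1]
ost3_UUID 5676562992 2895861636 2780701356 51% /cfs/scratch[OST:2]
EOF
```

The MDT row is skipped on purpose, since only the OST rows describe the file servers that hold file data.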
Why did my MPI code run np (specified in mpirun) instances of my code on process 0 instead of running on np cores?
We have found that mixing MPI implementations (building your code with one, and running in an
environment of another implementation) can cause this to happen. Please verify
that your environment, build, and batch script commands are consistent.
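One heuristic check is to confirm that the compiler wrapper and the launcher found in your PATH come from the same directory. This is only a sketch: it assumes both are installed in the same bin directory for a given MPI implementation, and the helper name same_prefix is made up for illustration.

```shell
#!/bin/sh
# same_prefix is a hypothetical helper: true when both paths share a directory.
same_prefix() {
    [ "$(dirname "$1")" = "$(dirname "$2")" ]
}

# Look up the wrapper and launcher that the current environment would use.
compiler=$(command -v mpicc 2>/dev/null || true)
launcher=$(command -v mpirun 2>/dev/null || true)

if [ -z "$compiler" ] || [ -z "$launcher" ]; then
    echo "mpicc or mpirun not found in PATH"
elif same_prefix "$compiler" "$launcher"; then
    echo "mpicc and mpirun both come from $(dirname "$compiler")"
else
    echo "warning: mpicc is $compiler but mpirun is $launcher"
fi
```

Running the same check inside the batch script shows whether the job environment matches the one used for the build.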
I have a serial program, and I want to run multiple simulations with it on a set of machines as one batch job. How can I do that?
This job script is an example of how you can run a serial program or command concurrently on a set of machines using ssh. Note, in order to make efficient use of the machines, it's important that the instances of your program on each machine complete in about the same time. Otherwise, machines that finish early will be idle and wasting resources.
#!/bin/sh
#PBS -l nodes=4:ppn=7 # 4 machines , 7 cpus per machine [28 cpus ]
#PBS -l walltime=01:00:00 # Specify job run time limit of 1 hour
#PBS -A abc # Charge job to project abc (recommended for users
# with multiple projects)
#PBS -o testjob.out # Store the standard output and standard error of the
# job in file testjob.out
#PBS -m e # Send mail when job terminates (optional)
#PBS -N testjob # Specify job name (optional)
# This loop runs a command or set of commands for you on each machine in
# your job. $SCR is assumed to already hold the path of your job scratch
# directory.
for host in `cat $PBS_NODEFILE | uniq`
do
    ( ssh -a -q -x $host "$HOME/bin/a.out.sh $SCR" ) &
    #                    ^^^^^^^^^^^^^^^^^^^^^^^^^ your commands in quotes
done
wait # waits for all the outstanding ssh subshells above to complete
The a.out.sh script could resemble the one below if you wanted to run multiple sets of the same serial program on each machine to use the available cpus.
#!/bin/sh
N=7 # run this many copies per host
SCR=$1
PROGRAM="${HOME}/a.out.serial"
# change to job scratch directory $SCR
cd $SCR
# make a directory for this machine/node and move into it
HOST=`hostname`
mkdir -p $HOST
cd $HOST
for ITERATION in `seq 1 $N`
do
    # open a subshell and set up the serial run there, backgrounding the subshell
    (
        mkdir $ITERATION
        cd $ITERATION
        # copy any needed input files to here, untar a bundle here ...
        cp $HOME/input.dat .
        $PROGRAM > output
    ) &
    # ^ important, do not omit the ampersand
done
# wait until each of the subshells from the for loop above completes
wait
Why does my program work with MVAPICH2, but not OpenMPI?
You may see errors like the following:
[abe0872:6882] *** An error occurred in MPI_Sendrecv
[abe0872:6882] *** on communicator MPI_COMM_WORLD
[abe0872:6882] *** MPI_ERR_RANK: invalid rank
[abe0872:6882] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
These errors appear with OpenMPI, but not with MVAPICH2, when the code
explicitly uses rank -1 in send/recv calls. In MVAPICH2, MPI_PROC_NULL is
defined as -1, but in OpenMPI it is defined as -2, hence the difference in
runtime behavior. The code should use MPI_PROC_NULL (instead of -1) when
that is the intention.