Purpose
Checkpoints a single job step.
Note: Before you consider using the Checkpoint/Restart function, refer to the LoadL.README file in /usr/lpp/LoadL/READMES for information on availability and support of this function.
Syntax
llckpt { -? | -H | -v | [-k | -u] [-r] [-q] <jobstep>}
Flags
Description
Use the llckpt command to save the state of a job step in case the step does not complete. A checkpointed job step can later be restarted from the checkpoint file rather than from the beginning of the job. To restart from a checkpoint file, use the original job command file with the restart_from_ckpt keyword set to yes. Specify the location and name of the checkpoint file with the ckpt_dir and ckpt_file keywords.
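The restart keywords named above can be sketched as job command file lines. The following minimal Python helper only formats those lines; the "# @ keyword = value" form is the usual job command file convention, and the directory and file name in the usage line are hypothetical placeholders, not values taken from this page.

```python
def restart_directives(ckpt_dir, ckpt_file):
    """Return job command file lines that request a restart from a checkpoint.

    ckpt_dir and ckpt_file are supplied by the caller; this helper does not
    check that the checkpoint file exists.
    """
    return [
        "# @ restart_from_ckpt = yes",
        f"# @ ckpt_dir = {ckpt_dir}",
        f"# @ ckpt_file = {ckpt_file}",
    ]


if __name__ == "__main__":
    # Hypothetical location and name, for illustration only.
    for line in restart_directives("/scratch/ckpt", "myjob.ckpt"):
        print(line)
```

The printed lines would be placed in the original job command file before it is submitted again.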
Examples
This example checkpoints job step 1 of job 12, which was scheduled by the machine named iron. Upon successful completion of the checkpoint, the job step returns to the RUNNING state:
llckpt iron.12.1
This example checkpoints job step 3 of job 14, which was scheduled by the machine named bronze. Upon successful completion of the checkpoint, the job step is put on user hold:
llckpt -u bronze.14.3
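The job step names in these examples combine the scheduling machine, the job number, and the step number. As a rough sketch, assuming the name always ends in ".job.step" with numeric job and step parts, a script could split it as follows. The JobStep type and parse_job_step function are hypothetical helpers, not part of LoadLeveler.

```python
from typing import NamedTuple


class JobStep(NamedTuple):
    host: str  # machine that scheduled the job
    job: int   # job number
    step: int  # step number within the job


def parse_job_step(name: str) -> JobStep:
    """Split a name such as 'iron.12.1' into host, job, and step."""
    # Split from the right so that a host name containing dots stays intact.
    parts = name.rsplit(".", 2)
    if len(parts) != 3:
        raise ValueError(f"expected host.job.step, got {name!r}")
    host, job, step = parts
    return JobStep(host, int(job), int(step))


if __name__ == "__main__":
    print(parse_job_step("iron.12.1"))
    print(parse_job_step("bronze.14.3"))
```

Splitting from the right is a design choice that tolerates fully qualified host names; a name with non-numeric job or step parts raises ValueError.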
Results
When the -r option is not used, the llckpt command waits for the checkpoint to complete. Immediately after the command llckpt iron.12.1 is issued, the following message is displayed:
llckpt: The llckpt command will wait for the results of the checkpoint on job step iron.12.1 before returning
Once the checkpoint has successfully completed, the following message is displayed:
llckpt: Checkpoint of job step iron.12.1 completed successfully
If there is a problem taking the checkpoint, the second message has this form:
llckpt: Checkpoint FAILED for job step iron.12.1 with the following error: primary error code = <numeric error number>, secondary error code = <secondary numeric error/extended numeric error>, error msg len = <length of message>, error msg = <text describing the error>
Where: the primary error code is defined in /usr/include/sys/errno.h and the secondary error code is defined in /usr/include/sys/chkerror.h.
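A script that wraps llckpt may want the fields of the failure message separately. The sketch below is one way to extract them in Python, assuming the message appears on a single line in exactly the form shown above. The sample line in the usage section is invented for illustration, and the secondary code is kept as text because its exact layout is not specified here.

```python
import re

# Pattern built from the failure message form shown above.
FAILED_RE = re.compile(
    r"llckpt: Checkpoint FAILED for job step (?P<step>\S+) "
    r"with the following error: "
    r"primary error code = (?P<primary>-?\d+), "
    r"secondary error code = (?P<secondary>[^,]+), "
    r"error msg len = (?P<length>\d+), "
    r"error msg = (?P<message>.*)"
)


def parse_failure(line):
    """Return the failure fields as a dict, or None if the line does not match."""
    match = FAILED_RE.search(line)
    if match is None:
        return None
    return {
        "step": match.group("step"),
        "primary": int(match.group("primary")),
        "secondary": match.group("secondary").strip(),
        "length": int(match.group("length")),
        "message": match.group("message"),
    }


if __name__ == "__main__":
    # Invented sample values; real codes come from errno.h and chkerror.h.
    sample = (
        "llckpt: Checkpoint FAILED for job step iron.12.1 with the following "
        "error: primary error code = 22, secondary error code = 5, "
        "error msg len = 12, error msg = sample error"
    )
    print(parse_failure(sample))
```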
Use the -r option to return without waiting for the result of the checkpoint. The following output is displayed for the command llckpt -r bronze.14.3:
llckpt: The llckpt command will not wait for the checkpoint of job step bronze.14.3 to complete before returning.
Because of delays in communication between LoadLeveler daemons, status information may not arrive at the same time as the notification that the checkpoint has ended. In that case the checkpoint has completed, but whether it succeeded or failed is not known, and the following message is displayed:
llckpt: Checkpoint of job step iron.12.1 completed. No status information is available.
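The messages in this section map onto a small set of outcomes. As a hedged sketch, a wrapper script could classify captured llckpt output by looking for the distinguishing phrases quoted above. The Outcome names and the classify function are hypothetical, and the matching assumes the message wording is exactly as shown on this page.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "checkpoint completed successfully"
    FAILED = "checkpoint failed"
    UNKNOWN = "checkpoint completed, status not available"
    NOT_WAITED = "llckpt returned without waiting (-r)"
    PENDING = "no result message seen yet"


def classify(output: str) -> Outcome:
    """Classify llckpt output by the phrases used in its messages."""
    if "Checkpoint FAILED" in output:
        return Outcome.FAILED
    if "completed successfully" in output:
        return Outcome.SUCCESS
    if "No status information is available" in output:
        return Outcome.UNKNOWN
    if "will not wait" in output:
        return Outcome.NOT_WAITED
    # Only the initial "will wait" message, or nothing recognizable, was seen.
    return Outcome.PENDING


if __name__ == "__main__":
    text = "llckpt: Checkpoint of job step iron.12.1 completed successfully"
    print(classify(text).name)
```

Result messages are checked before the "will wait" and "will not wait" notices, so output that contains both the initial notice and a final result is classified by the result.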