IBM LoadLeveler for AIX 5L: Using and Administering

Step 14: Enable checkpointing

This section describes how to set up checkpointing for jobs. Checkpointing periodically saves the state of a job step so that, if the step does not complete, it can be restarted from the saved state. When checkpointing is enabled, a checkpoint can be initiated from within the application at major milestones, or from outside the application by the user, the administrator, or LoadLeveler. Both serial and parallel job steps can be checkpointed.

Once a job step has been successfully checkpointed, if that step terminates before completion, the checkpoint file can be used to resume the job step from its saved state rather than from the beginning. When a job step terminates and is removed from the LoadLeveler job queue, it can be restarted from the checkpoint file by submitting a new job with the job command file keyword restart_from_ckpt = yes. When a job step terminates but remains on the LoadLeveler job queue, as happens when a job step is vacated, the job step is automatically restarted from the latest valid checkpoint file. A job can be vacated as a result of flushing a node, issuing checkpoint and hold, stopping or recycling LoadLeveler, or a node crash.
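As an illustration of the resubmission case, the following Python sketch renders the text of a job command file that asks LoadLeveler to restart a step from an existing checkpoint. The keywords checkpoint, ckpt_dir, ckpt_file and restart_from_ckpt are the ones summarized later in this section. The "# @ keyword = value" directive layout, the executable keyword, the closing queue statement, the helper name and the example paths are assumptions made for this sketch and should be checked against Job command file keywords.

```python
def render_restart_job(executable, ckpt_dir, ckpt_file):
    """Return job command file text that restarts a step from a checkpoint.

    Assumes the '# @ keyword = value' directive form; all argument values
    are hypothetical and supplied by the caller.
    """
    directives = [
        ("executable", executable),
        ("checkpoint", "yes"),         # keep the restarted step enabled for checkpoint
        ("ckpt_dir", ckpt_dir),        # directory that holds the checkpoint files
        ("ckpt_file", ckpt_file),      # base name of the checkpoint file
        ("restart_from_ckpt", "yes"),  # resume from the saved state
    ]
    lines = [f"# @ {key} = {value}" for key, value in directives]
    lines.append("# @ queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical program and checkpoint location.
    print(render_restart_job("/u/user/bin/myprog", "/u/user/ckpt", "myprog.ckpt"), end="")
```

The ckpt_dir and ckpt_file values would need to match those used when the checkpoint was taken, since they identify the file to restart from.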

Checkpoint keywords

The following is a summary of keywords associated with the checkpoint and restart function.

Administration file keyword summary


Table 21. Administration file keyword summary

Keyword         | Stanza | Default value             | Description
----------------|--------|---------------------------|--------------------------------------------------
ckpt_dir        | Class  | Initial working directory | The location to be used for checkpoint files
ckpt_time_limit | Class  | Unlimited                 | Amount of time a job step can take to checkpoint

Note: For more information on these keywords, see Administration file keywords.

Configuration file keyword summary


Table 22. Configuration file keyword summary

Keyword               | Default value    | Description
----------------------|------------------|--------------------------------------------------
CKPT_CLEANUP_INTERVAL | -1               | How frequently, in seconds, the CKPT_CLEANUP_PROGRAM should be run
CKPT_CLEANUP_PROGRAM  | No default       | Identifies an administrator-provided program to be run at the interval specified by CKPT_CLEANUP_INTERVAL
MAX_CKPT_INTERVAL     | 7200 (2 hours)   | Maximum interval, in seconds, LoadLeveler will use for checkpointing running job steps
MIN_CKPT_INTERVAL     | 900 (15 minutes) | Initial (and minimum) interval, in seconds, LoadLeveler will use for checkpointing running job steps

Note: For more information on these keywords, see Configuration file keywords and LoadLeveler variables.

Job command file keyword summary

Table 23. Job command file keyword summary

Keyword           | Default value                                                                            | Description
------------------|------------------------------------------------------------------------------------------|--------------------------------------------------
checkpoint        | No                                                                                       | Indicates if a job step should be enabled for checkpoint
ckpt_dir          | The value of the ckpt_dir keyword in the class stanza of the administration file         | The location to be used for checkpoint files
ckpt_file         | [jobname.]job_step_id.ckpt                                                               | The base name to be used for the checkpoint file
ckpt_time_limit   | The value of the ckpt_time_limit keyword in the class stanza of the administration file  | Amount of time a job step can take to checkpoint
restart_from_ckpt | No                                                                                       | Indicates if a job step is to be restarted from an existing checkpoint file

Note: For more information on these keywords, see Job command file keywords.

Naming checkpoint files and directories

At checkpoint time, a checkpoint file and possibly an error file are created. For jobs that are enabled for checkpoint, a control file may also be generated when the job is submitted. The directory that will contain these files must already exist and must have sufficient space and the permissions needed for the files to be written. The name and location of these files are controlled through keywords in the job command file or the LoadLeveler configuration.

The file name you specify is used as a base name from which the actual checkpoint file name is constructed. To prevent another job step from overwriting your checkpoint file, make certain that your checkpoint file name is unique. For serial jobs and for the master task (POE) of parallel jobs, the checkpoint file name is <basename>.Tag. For a parallel job, a checkpoint file is created for each task, with the name <basename>.Taskid.Tag.

The tag differentiates the current checkpoint file from the previous one. A control file may be created in the checkpoint directory; it contains information that LoadLeveler uses to restart certain jobs. An error file may also be created in the checkpoint directory. The data in the error file is in a machine-readable format; the same information is available in mail, in the LoadLeveler logs, or in the output of the checkpoint command. The control file and the error file have the same base name as the checkpoint file, with the extensions .cntl and .err, respectively.
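The naming rules above can be sketched in a few lines of Python. This is an illustration only: the function name and the returned dictionary layout are hypothetical, and the tag is treated as an opaque string because its format is not specified in this section.

```python
def checkpoint_file_names(base_name, tag, task_ids=()):
    """Build the file names derived from a checkpoint base name.

    'tag' is an opaque string (its format is not specified here);
    'task_ids' is empty for a serial job.
    """
    return {
        # Serial job, or the master task (POE) of a parallel job.
        "master": f"{base_name}.{tag}",
        # One checkpoint file per task of a parallel job.
        "tasks": [f"{base_name}.{task_id}.{tag}" for task_id in task_ids],
        # Control and error files share the base name.
        "control": f"{base_name}.cntl",
        "error": f"{base_name}.err",
    }


if __name__ == "__main__":
    # Hypothetical base name, tag value and task ids.
    for kind, value in checkpoint_file_names("myjob.ckpt", "Tag", [0, 1]).items():
        print(kind, value)
```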

See How to checkpoint a job for more information.

Naming checkpoint files for serial and batch parallel jobs

The following describes the order in which keywords are checked to construct the full path name for a serial or batch checkpoint file:

Note that two or more job steps running at the same time must not write to the same checkpoint file, because the file would be corrupted.
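One way to guard against this before submitting work is to compare the checkpoint paths that concurrently running job steps would use. The Python sketch below is a hypothetical pre-submission check, not a LoadLeveler facility; the function name, the step identifiers and the input layout are assumptions made for the example.

```python
import posixpath
from collections import defaultdict


def find_checkpoint_collisions(steps):
    """Return {path: [step ids]} for checkpoint paths claimed by more than one step.

    'steps' maps a job step id to a (ckpt_dir, ckpt_file) pair.
    """
    owners = defaultdict(list)
    for step_id, (ckpt_dir, ckpt_file) in steps.items():
        # Normalize so that '/ckpt' and '/ckpt/' compare as the same directory.
        path = posixpath.normpath(posixpath.join(ckpt_dir, ckpt_file))
        owners[path].append(step_id)
    return {path: ids for path, ids in owners.items() if len(ids) > 1}


if __name__ == "__main__":
    # Hypothetical job steps; the first two would share one checkpoint file.
    steps = {
        "a.0": ("/ckpt", "run.ckpt"),
        "b.0": ("/ckpt/", "run.ckpt"),
        "c.0": ("/ckpt", "other.ckpt"),
    }
    print(find_checkpoint_collisions(steps))
```

An empty result means that no two of the listed steps share a checkpoint path. The check compares path strings only, so it does not detect two different paths that reach the same file through a symbolic link.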

Naming checkpoint files for interactive parallel jobs

The following describes the order in which keywords and variables are checked to construct the full path name for the checkpoint file for an interactive parallel job.

Note: The keywords ckpt_dir and ckpt_file are not allowed in the command file for an interactive session. If they are present, they are ignored and the job is still submitted.

