NAME
checkpoint - Grid Engine checkpointing environment
configuration file format
DESCRIPTION
Checkpointing is a facility to save the complete state of an
executing program or job and to restore and restart it from
this so-called checkpoint at a later time if the original
program or job was halted, e.g. through a system crash.
Grid Engine provides various levels of checkpointing support
(see sge_ckpt(1)). The checkpointing environment described
here is a means to configure the different types of
checkpointing in use for your Grid Engine cluster or parts
thereof. For that purpose you can define the operations
which have to be executed to initiate the generation of a
checkpoint, the migration of a checkpoint to another host or
the restart of a checkpointed application, as well as the
list of queues which are eligible for a checkpointing method.
Supporting different operating systems may force Grid Engine
to introduce operating system dependencies into the
checkpointing configuration, and updates to the supported
operating system versions may lead to frequently changing
implementation details. Please refer to the file
<sge_root>/doc/checkpointing.asc for more information.
Please use the -ackpt, -dckpt, -mckpt or -sckpt options to
the qconf(1) command to manipulate checkpointing
environments from the command line, or use the corresponding
qmon(1) dialogue for interactive configuration under the X
Window System.
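The qconf(1) options listed above can be wrapped in a small helper. The following Python is a minimal sketch, assuming that the four options stand for add, delete, modify and show and that each takes the name of the checkpointing environment as its argument; the action labels and the function names are hypothetical and chosen only for illustration.

```python
import subprocess

# Options named in the text above; the action labels on the left are
# hypothetical names chosen for this sketch.
QCONF_CKPT_OPTIONS = {
    "add": "-ackpt",
    "delete": "-dckpt",
    "modify": "-mckpt",
    "show": "-sckpt",
}

def qconf_ckpt_argv(action, ckpt_name):
    """Build the argument vector for a qconf checkpointing operation."""
    if action not in QCONF_CKPT_OPTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return ["qconf", QCONF_CKPT_OPTIONS[action], ckpt_name]

def show_ckpt(ckpt_name):
    """Run qconf -sckpt and return its standard output (requires Grid Engine)."""
    result = subprocess.run(
        qconf_ckpt_argv("show", ckpt_name),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Only qconf_ckpt_argv can be exercised without a Grid Engine installation; show_ckpt needs a working qconf(1) in the search path.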
FORMAT
The format of a checkpoint file is defined as follows:
ckpt_name
The name of the checkpointing environment. To be used in the
qsub(1) -ckpt switch or for the qconf(1) options mentioned
above.
interface
The type of checkpointing to be used. Currently, the
following types are valid:
hibernator
The Hibernator kernel level checkpointing is used.
cpr The SGI kernel level checkpointing is used.
cray-ckpt
The Cray kernel level checkpointing is assumed.
transparent
Grid Engine assumes that jobs submitted with reference
to this checkpointing interface use a checkpointing
library such as the one provided by the public domain
package Condor.
userdefined
Grid Engine assumes that jobs submitted with reference
to this checkpointing interface implement their own
checkpointing method.
application-level
Uses all of the interface commands configured in the
checkpointing object, as with the kernel level
checkpointing interfaces (cpr, cray-ckpt, etc.), except
for the restart_command (see below). The restart_command
is not used, even if it is configured; the job script
is invoked instead in case of a restart.
queue_list
A comma separated list of queues to which checkpointing jobs
belonging to this checkpointing environment have access.
Alternatively, the keyword "all" can be used to give access
to all queues across the cluster.
ckpt_command
A command-line type command string to be executed by Grid
Engine in order to initiate a checkpoint.
migr_command
A command-line type command string to be executed by Grid
Engine during a migration of a checkpointing job from one
host to another.
restart_command
A command-line type command string to be executed by Grid
Engine when restarting a previously checkpointed
application.
clean_command
A command-line type command string to be executed by Grid
Engine in order to clean up after a checkpointed application
has finished.
ckpt_dir
A file system location in which checkpoints of potentially
considerable size should be stored.
queue_list
Contains a comma or blank separated list of queue names
which are eligible for a job if the checkpointing
environment was specified when the job was submitted.
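A queue_list value as described above can be split with a few lines of Python. This is a minimal sketch, not Grid Engine's own parsing: the function name and the queue names in the usage lines are hypothetical, and the keyword "all" is assumed to be matched exactly as written.

```python
import re

def parse_queue_list(value):
    """Split a queue_list value on commas and/or blanks.

    Returns the string "all" if the keyword "all" is the only entry,
    otherwise a list of queue names in their original order.
    """
    names = [n for n in re.split(r"[,\s]+", value.strip()) if n]
    if names == ["all"]:
        return "all"
    return names

# Hypothetical queue names, mixing both separators.
queues = parse_queue_list("short.q, long.q big.q")
everything = parse_queue_list("all")
```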
ckpt_signal
A Unix signal to be sent to a job by Grid Engine to initiate
a checkpoint generation. The value for this field can either
be a symbolic name from the list produced by the -l option
of the kill(1) command or an integer number which must be a
valid signal on the systems used for checkpointing.
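The two accepted forms of this value, a symbolic name or an integer, can be normalized to a signal number as sketched below. The sketch assumes a Python 3.8 or later interpreter on the host in question and accepts names with or without the SIG prefix, since the output of kill -l differs between systems; the function name is hypothetical.

```python
import signal

def resolve_ckpt_signal(value):
    """Return the signal number for a symbolic name or an integer string."""
    text = value.strip()
    if text.isdigit():
        number = int(text)
        # valid_signals() lists the signal numbers usable on this host.
        if number not in signal.valid_signals():
            raise ValueError(f"{number} is not a valid signal on this host")
        return number
    name = text.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name
    try:
        return int(signal.Signals[name])
    except KeyError:
        raise ValueError(f"unknown signal name: {value!r}") from None
```

The check only covers the host on which it runs; the text above requires the signal to be valid on every system used for checkpointing.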
when
The points in time at which checkpoints are expected to be
generated. Valid values for this parameter are composed of
the letters s, m, x and r and any combination thereof,
without any separating character in between. The same
letters are allowed for the -c option of the qsub(1) command,
which overrides the definition in the checkpointing
environment being used. The meaning of the letters is
defined as follows:
s A job is checkpointed, aborted and if possible migrated
if the corresponding sge_execd(8) is shut down on the
job's machine.
m Checkpoints are generated periodically at the
min_cpu_interval interval defined by the queue (see
queue_conf(5)) in which a job executes.
x A job is checkpointed, aborted and if possible migrated
as soon as the job gets suspended (manually as well as
automatically).
r A job will be rescheduled (not checkpointed) when the
host on which the job currently runs enters the unknown
state and the time interval reschedule_unknown (see
sge_conf(5)) defined in the global/local cluster
configuration is exceeded.
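To show how the parameters above fit together, the following Python is a minimal sketch that parses a checkpointing environment definition and validates its when value. It assumes one attribute per line, with the parameter name followed by whitespace and the value. All values in SAMPLE (name, paths, signal) are hypothetical, and treating an empty when value as invalid and repeated letters as valid are assumptions of this sketch, not statements about Grid Engine.

```python
VALID_WHEN_LETTERS = set("smxr")

# Hypothetical definition; every value here is made up for illustration.
SAMPLE = """\
ckpt_name        my_ckpt
interface        userdefined
ckpt_command     /path/to/ckpt.sh
migr_command     /path/to/migrate.sh
restart_command  /path/to/restart.sh
clean_command    /path/to/clean.sh
ckpt_dir         /var/tmp/ckpt
queue_list       all
ckpt_signal      USR2
when             sx
"""

def parse_ckpt_definition(text):
    """Parse 'name value' lines into a dict (assumed layout, see above)."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue  # skip blank lines
        attrs[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return attrs

def validate_when(value):
    """Return True if value is a non-empty combination of s, m, x and r."""
    return bool(value) and set(value) <= VALID_WHEN_LETTERS

attrs = parse_ckpt_definition(SAMPLE)
when_ok = validate_when(attrs["when"])
```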
RESTRICTIONS
Note that the checkpointing, migration and restart
procedures provided by default with the Grid Engine
distribution, as well as the way they are invoked in the
ckpt_command, migr_command or restart_command parameters of
any default checkpointing environment, should not be
changed. If they are changed, their functionality becomes
the full responsibility of the administrator configuring the
checkpointing environment. Grid Engine only invokes these
procedures and evaluates their exit status. If the procedures
do not perform their tasks properly or are not invoked
properly, the checkpointing mechanism may behave
unexpectedly, and Grid Engine has no means to detect this.
SEE ALSO
sge_intro(1), sge_ckpt(1), qconf(1), qmod(1), qsub(1),
sge_execd(8).
COPYRIGHT
See sge_intro(1) for a full statement of rights and
permissions.