NAME
     checkpoint - Grid Engine checkpointing environment
     configuration file format

DESCRIPTION
     Checkpointing is a facility to save the complete state of an
     executing program or job and to restore and restart it from
     this so-called checkpoint at a later point in time if the
     original program or job was halted, e.g. through a system
     crash.

     Grid Engine provides various levels of checkpointing support
     (see sge_ckpt(1)). The checkpointing environment described
     here is a means to configure the different types of
     checkpointing in use for your Grid Engine cluster or parts
     thereof. For that purpose you can define the operations to
     be executed when initiating a checkpoint generation,
     migrating a checkpoint to another host, or restarting a
     checkpointed application, as well as the list of queues
     which are eligible for a checkpointing method.

     Supporting different operating systems may force Grid Engine
     to introduce operating system dependencies into the
     checkpointing configuration file, and updates of the
     supported operating system versions may lead to frequently
     changing implementation details. Please refer to the
     directory <sge_root>/ckpt for more information.

     Use the -ackpt, -dckpt, -mckpt or -sckpt options of the
     qconf(1) command to manipulate checkpointing environments
     from the command line, or use the corresponding qmon(1)
     dialogue for interactive configuration under the X Window
     System.
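
     As an illustration of scripted use of the options above, the
     following Python sketch builds a qconf command line for one
     checkpointing environment. The helper names and the
     environment name are hypothetical. The sketch assumes that
     the environment name follows the option as a single argument
     and that -sckpt only prints the environment; see qconf(1)
     for the authoritative syntax.

```python
import subprocess

# Options named in this manual page; qconf(1) describes what each one does.
CKPT_OPTIONS = ("-ackpt", "-dckpt", "-mckpt", "-sckpt")


def qconf_ckpt_argv(option, ckpt_name):
    """Build the argument list for a qconf call on one checkpointing environment."""
    if option not in CKPT_OPTIONS:
        raise ValueError(f"unsupported option: {option!r}")
    if not ckpt_name or ckpt_name.startswith("-"):
        raise ValueError(f"invalid environment name: {ckpt_name!r}")
    # Assumption: the environment name follows the option as one argument.
    return ["qconf", option, ckpt_name]


def show_ckpt(ckpt_name):
    """Run 'qconf -sckpt <name>' and return its output.

    Requires a working Grid Engine installation with qconf on the PATH.
    """
    result = subprocess.run(
        qconf_ckpt_argv("-sckpt", ckpt_name),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

     Building the argument list separately from running it keeps
     the validation usable on hosts without Grid Engine installed.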

FORMAT
     The format of a checkpoint file is defined as follows:

  ckpt_name
     The name of the checkpointing environment. It is used with
     the -ckpt switch of qsub(1) and with the qconf(1) options
     mentioned above.

  interface
     The type of checkpointing to be used. Currently, the
     following types are valid:

     hibernator
          Grid Engine interfaces with the Hibernator kernel level
          checkpointing.

     cpr  The SGI kernel level checkpointing is used.

     cray-ckpt
          The Cray kernel level checkpointing is assumed.

     transparent
          Grid Engine assumes that the jobs submitted with
          reference to this checkpointing interface use a
          checkpointing library such as the one provided by the
          public domain package Condor.

     userdefined
          Grid Engine assumes that the jobs submitted with
          reference to this checkpointing interface perform
          their own private checkpointing method.

     application-level
          Uses all of the interface commands configured in the
          checkpointing object, as with the kernel level
          checkpointing interfaces (cpr, cray-ckpt, etc.), except
          for the restart_command (see below). The
          restart_command is not used even if it is configured;
          the job script is invoked instead in case of a restart.
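
     The interface types above can be summarized in a small
     lookup. The following Python sketch is illustrative only:
     the function name and the return labels are hypothetical,
     and the restart behaviour of the transparent and userdefined
     interfaces is left open because this page does not state it.

```python
# Interface names as listed in this manual page.
KERNEL_LEVEL = {"hibernator", "cpr", "cray-ckpt"}
VALID_INTERFACES = KERNEL_LEVEL | {
    "transparent",
    "userdefined",
    "application-level",
}


def restart_action(interface):
    """Return what is invoked on restart for the given interface type.

    "job_script"      -> the job script is invoked (application-level)
    "restart_command" -> the configured restart_command is used
    None              -> not stated in this manual page
    """
    if interface not in VALID_INTERFACES:
        raise ValueError(f"unknown checkpointing interface: {interface!r}")
    if interface == "application-level":
        # restart_command is ignored even if it is configured.
        return "job_script"
    if interface in KERNEL_LEVEL:
        return "restart_command"
    return None
```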

  ckpt_command
     A command-line type command string to be  executed  by  Grid
     Engine in order to initiate a checkpoint.

  migr_command
     A command-line type command string to be  executed  by  Grid
     Engine  during  a  migration of a checkpointing job from one
     host to another.

  restart_command
     A command-line type command string to be executed by Grid
     Engine when restarting a previously checkpointed
     application.

  clean_command
     A command-line type command string to be executed by Grid
     Engine in order to clean up after a checkpointed application
     has finished.

  ckpt_dir
     A file system location in which checkpoints of potentially
     considerable size are to be stored.

  ckpt_signal
     A Unix signal to be sent to a job by Grid Engine to initiate
     a checkpoint generation. The value for this field can either
     be a symbolic name from the list produced by the  -l  option
     of  the kill(1) command or an integer number which must be a
     valid signal on the systems used for checkpointing.
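
     A value for ckpt_signal could be checked along the following
     lines. This Python sketch is an assumption-laden
     illustration, not the validation Grid Engine itself
     performs: the function name is hypothetical, names are
     accepted with or without a leading "SIG", and the check
     reflects only the signals valid on the host where the script
     runs.

```python
import signal


def parse_ckpt_signal(value):
    """Return the signal number for a symbolic name or an integer string."""
    text = value.strip()
    if text.isdigit():
        number = int(text)
        # valid_signals() reports the signals valid on *this* host only.
        if number not in {int(s) for s in signal.valid_signals()}:
            raise ValueError(f"not a valid signal number here: {number}")
        return number
    # Assumption: accept both "USR1" and "SIGUSR1" spellings.
    name = text.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name
    try:
        return int(signal.Signals[name])
    except KeyError:
        raise ValueError(f"unknown signal name: {value!r}") from None
```

     Because signal numbers differ between operating systems, a
     symbolic name is usually the more portable choice when the
     same environment is used on several architectures.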


  when
     The points in time at which checkpoints are expected to be
     generated. Valid values for this parameter are composed of
     the letters s, m, x and r and any combination thereof
     without any separating character in between. The same
     letters are allowed for the -c option of the qsub(1)
     command, which overrides the definitions in the
     checkpointing environment used. The meaning of the letters
     is defined as follows:

     s    A job is checkpointed, aborted and if possible migrated
          if  the  corresponding sge_execd(8) is shut down on the
          job's machine.

     m    Checkpoints are generated periodically at the interval
          min_cpu_interval defined by the queue (see
          queue_conf(5)) in which a job executes.

     x    A job is checkpointed, aborted and if possible migrated
          as  soon as the job gets suspended (manually as well as
          automatically).

     r    A job will be rescheduled (not checkpointed) when the
          host on which the job currently runs has gone into
          unknown state and the time interval reschedule_unknown
          (see sge_conf(5)) defined in the global/local cluster
          configuration has been exceeded.
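
     To tie the attributes together, the following Python sketch
     parses a checkpointing environment definition and validates
     the when value. It rests on several assumptions that this
     page does not confirm: one attribute per line, the attribute
     name separated from its value by whitespace, lines starting
     with "#" ignored, and an empty when value rejected. All
     values in SAMPLE, including the paths, are hypothetical.

```python
# Letters accepted for "when", with a short reminder of their meaning.
WHEN_LETTERS = {
    "s": "checkpoint when sge_execd is shut down",
    "m": "checkpoint periodically at min_cpu_interval",
    "x": "checkpoint when the job is suspended",
    "r": "reschedule when the host is in unknown state",
}

# Attribute names as documented in the FORMAT section.
ATTRIBUTES = (
    "ckpt_name", "interface", "ckpt_command", "migr_command",
    "restart_command", "clean_command", "ckpt_dir", "ckpt_signal", "when",
)


def parse_when(value):
    """Return the set of letters in a 'when' value, rejecting other characters."""
    letters = value.strip()
    if not letters:
        # Assumption: an empty value is treated as an error.
        raise ValueError("empty 'when' value")
    unknown = sorted(set(letters) - set(WHEN_LETTERS))
    if unknown:
        raise ValueError(f"invalid 'when' letters: {''.join(unknown)}")
    return set(letters)


def parse_checkpoint_file(text):
    """Parse 'name value' lines into a dict; 'when' becomes a set of letters."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' comments are skipped
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if name not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {name!r}")
        config[name] = value
    if "when" in config:
        config["when"] = parse_when(config["when"])
    return config


# Hypothetical example definition; every value here is illustrative.
SAMPLE = """\
ckpt_name        my_ckpt
interface        userdefined
ckpt_command     /usr/local/bin/ckpt.sh --save
migr_command     /usr/local/bin/ckpt.sh --migrate
restart_command  /usr/local/bin/ckpt.sh --restart
clean_command    /usr/local/bin/ckpt.sh --clean
ckpt_dir         /var/tmp/ckpt
ckpt_signal      USR1
when             sx
"""

if __name__ == "__main__":
    for key, value in parse_checkpoint_file(SAMPLE).items():
        print(key, value)
```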


RESTRICTIONS
     Note that the functionality of any checkpointing, migration
     or restart procedures provided by default with the Grid
     Engine distribution, as well as the way they are invoked in
     the ckpt_command, migr_command or restart_command parameters
     of any default checkpointing environments, should not be
     changed. Otherwise the functionality remains the full
     responsibility of the administrator configuring the
     checkpointing environment. Grid Engine will just invoke
     these procedures and evaluate their exit status. If the
     procedures do not perform their tasks properly or are not
     invoked in a proper fashion, the checkpointing mechanism may
     behave unexpectedly, and Grid Engine has no means to detect
     this.

SEE ALSO
     sge_intro(1),  sge_ckpt(1),  qconf(1),   qmod(1),   qsub(1),
     sge_execd(8).

COPYRIGHT
     See sge_intro(1) for a full statement of rights and
     permissions.



