Halt a job¶
There are several mechanisms to instruct a running SCR application to halt. It is often necessary to interact with the resource manager to halt a job.
scr_halt and the halt file¶
The recommended method to stop an SCR application is to use the scr_halt
command.
The command must be run from within the prefix directory,
or otherwise, the prefix directory of the target job must be specified as an argument.
A number of different halt conditions can be specified.
In most cases, the scr_halt
command communicates these conditions to the running
application via the halt.scr
file,
which is stored in the hidden .scr
directory within the prefix directory.
The SCR library reads the halt file when the application calls SCR_Init
and each time the application completes a checkpoint.
If a halt condition is satisfied, all tasks in the application call exit
.
One can disable this behavior by setting the SCR_HALT_ENABLED
parameter to 0.
In this case, the application can determine when to exit by calling SCR_Should_exit
.
Halt after next checkpoint¶
You can instruct an SCR job to halt after completing its next successful checkpoint:
scr_halt
To run scr_halt
from outside of a prefix directory,
specify the target prefix directory like so:
scr_halt /p/lscratcha/user1/simulation123
You can instruct an SCR job to halt after completing some number of checkpoints
via the --checkpoints
option.
For example, to instruct a job to halt after 10 more checkpoints, use the following:
scr_halt --checkpoints 10
If the last of the checkpoints is unsuccessful, the job continues until it completes a successful checkpoint. This ensures that SCR has a successful checkpoint to flush before it halts the job.
Halt before or after a specified time¶
It is possible to instruct an SCR job to halt after a specified time using
the --after
option.
The job will halt on its first successful checkpoint after the specified time.
For example, you can instruct a job to halt after “12:00pm today” via:
scr_halt --after '12:00pm today'
It is also possible to instruct a job to halt before* a specified time
using the --before
option.
For example, you can instruct a job to halt before “8:30am tomorrow” via:
scr_halt --before '8:30am tomorrow'
For the “halt before” condition to be effective,
one must also set the SCR_HALT_SECONDS
parameter.
When SCR_HALT_SECONDS
is set to a positive number,
SCR checks how much time is left before the specified time limit.
If the remaining time in seconds is less than or equal to SCR_HALT_SECONDS
, SCR halts the job.
The value of SCR_HALT_SECONDS
does not affect the “halt after” condition.
It is highly recommended that SCR_HALT_SECONDS
be set
so that the SCR library can impose a default “halt before” condition using the end time
of the job allocation.
This ensures the latest checkpoint can be flushed before the allocation is lost.
It is important to set SCR_HALT_SECONDS
to a value large enough
that SCR has time to completely flush (and rebuild) files before the allocation expires.
Consider that a checkpoint may be taken just before the
remaining time is less than SCR_HALT_SECONDS
.
If a code checkpoints every X seconds and it takes Y seconds
to flush files from the cache and rebuild, set SCR_HALT_SECONDS
= X + Y + Delta,
where Delta is some positive value to provide additional slack time.
One may also set the halt seconds via the --seconds
option to scr_halt
.
Using the scr_halt
command, one can set, change, and unset the halt seconds on a running job.
NOTE: If any scr_halt
commands are specified as part of the batch script before
the first run starts,
one must then use scr_halt
to set the halt seconds for the job rather than
the SCR_HALT_SECONDS
parameter.
The scr_halt
command creates the halt file,
and if a halt file exists before a job starts to run,
SCR ignores any value specified in the SCR_HALT_SECONDS
parameter.
Halt immediately¶
Sometimes, you need to halt an SCR job immediately, and there are two options for this.
You may use the --immediate
option:
scr_halt --immediate
This command first updates the halt file, so that the job will not be restarted once stopped. Then, it kills the current run.
If for some reason the --immediate
option fails to work,
you may manually halt the job. [1]
First, issue a simple scr_halt
so the job will not restart,
and then manually kill the current run using mechanisms provided by the resource manager,
e.g., scancel
for SLURM and apkill
for ALPS.
When using mechanisms provided by the resource manager to kill the
current run, be careful to cancel the job step and not the job allocation.
Canceling the job allocation destroys the cache.
For SLURM, to get the job step id, type: squeue -s
.
Then be sure to include the job id and step id in the scancel
argument.
For example, if the job id is 1234 and the step id is 5, then use the following commands:
scr_halt
scancel 1234.5
Do not just type “scancel 1234
” – be sure to include the job step id.
For ALPS, use apstat
to get the apid of the job step to kill.
Then, follow the steps as described above: execute scr_halt
followed by the kill command apkill <apid>
.
[1] | On Cray/ALPS, scr_halt --immediate is not yet supported. The alternate method described in the text must be used instead. |
Catch a hanging job¶
If an application hangs, SCR may not be given the chance to copy files from cache to the parallel file system before the allocation expires. To avoid losing significant work due to a hang, SCR attempts to detect if a job is hanging, and if so, SCR attempts to kill the job step so that it can be restarted in the allocation.
On some systems, SCR employs the io-watchdog
library for this purpose.
For more information on this tool, see http://code.google.com/p/io-watchdog.
On systems where io-watchdog
is not available,
SCR uses a generic mechanism based on the expected
time between checkpoints as specified by the user. If the time between checkpoints
is longer than expected, SCR assumes the job is hanging.
Two SCR parameters determine how many seconds should pass
between I/O phases in an application, i.e. seconds between
consecutive calls to SCR_Start_checkpoint
.
These are SCR_WATCHDOG_TIMEOUT
and SCR_WATCHDOG_TIMEOUT_PFS
. The first parameter
specifies the time to wait when SCR writes checkpoints to
in-system storage, e.g. SSD or RAM disk, and the second
parameter specifies the time to wait when SCR writes
checkpoints to the parallel file system.
The reason for the two timeouts is that writing to the parallel
file system generally takes much longer than writing to in-system
storage, and so a longer timeout period is useful in that case.
When using this feature, be careful to check that the job does not hang near the end of its allocation time limit, since in this case, SCR may not kill the run with enough time before the allocation ends. If you suspect the job to be hanging and you deem that SCR will not kill the run in sufficient time, manually cancel the run as described above.
Combine, list, change, and unset halt conditions¶
It is possible to specify multiple halt conditions.
To do so, simply list each condition in the same scr_halt
command or issue several commands.
For example, to instruct a job to halt after 10 checkpoints or before “8:30am tomorrow”,
which ever comes earlier, you could issue the following command:
scr_halt --checkpoints 10 --before '8:30am tomorrow'
The following sequence also works:
scr_halt --checkpoints 10
scr_halt --before '8:30am tomorrow'
You may list the current settings in the halt file with the --list
option, e.g.,:
scr_halt --list
You may change a setting by issuing a new command to overwrite the current value.
Finally, you can unset some halt conditions by prepending unset-
to the option names.
See the scr_halt
man page for a full listing of unset options.
For example, to unset the “halt before” condition on a job, type the following:
scr_halt --unset-before
Remove the halt file¶
Sometimes, especially during testing, you may want to run in an existing
allocation after halting a previous run.
When SCR detects a halt file with a satisfied halt condition, it immediately exits.
This is the desired effect when trying to halt a job,
however this mechanism also prevents one from intentionally running in an allocation
after halting a previous run.
Along these lines, know that SCR registers a halt condition whenever
the application calls SCR_Finalize
.
When there is a halt file with a satisfied halt condition,
a message is printed to stdout
to indicate why SCR is halting.
To run in such a case, first remove the satisfied halt conditions.
You can unset the conditions or reset them to appropriate values.
Another approach is to remove the halt file via the --remove
option.
This deletes the halt file, which effectively removes all halt conditions.
For example, to remove the halt file from a job, type:
scr_halt --remove