Run a job

In addition to the SCR library, one may optionally include SCR commands in a job script. These commands are most useful on systems where failures are common. The SCR commands prepare the cache, scavenge files from cache to the parallel file system, and check that the scavenged dataset is complete among other things. The commands also automate the process of relaunching a job after failure, including logic to detect to exclude failed nodes.

There are several SCR commands, most of which are located in the /bin directory where SCR is installed.

For best performance, one should: 1) inform the batch system that the allocation should remain available even after a failure and 2) replace the command to execute the application with an SCR wrapper script. The precise set of options and commands to use depends on the system resource manager.

Supported platforms

At the time of this writing, SCR supports specific combinations of resource managers and job launchers. The descriptions for using SCR in this section apply to these specific configurations, however the following description is helpful to understand how to run SCR on any system. Please contact us for help in porting SCR to other platforms. (See Section Support and Additional Information for contact information).

Jobs and job steps

First, we differentiate between a job allocation and a job step. Our terminology originates from the SLURM resource manager, but the concepts apply generally across other resource managers.

When a job is scheduled resources on a system, the batch script executes inside of a job allocation. The job allocation consists of a set of nodes, a time limit, and a job id. The job id can be obtained by executing the squeue command on SLURM, the apstat command on ALPS, and the bjobs command on LSF.

Within a job allocation, a user may run one or more job steps. One runs a job step using srun on SLURM, aprun on ALPS, or jsrun on LSF. Each job step is assigned its own step id.

On SLURM, job step ids start at 0 and increment with each issued job step. Job step ids can be obtained by passing the -s option to squeue. A fully qualified name of a SLURM job step consists of: jobid.stepid. For instance, the name 1234.5 refers to step id 5 of job id 1234.

On ALPS, each job step has a unique id that can be obtained through apstat.

Tolerating node failures

Before running an SCR job, it is recommended to configure the job allocation to withstand node failures. By default, most resource managers terminate the job allocation if a node fails. SCR requires the job allocation to remain active in order to restart the job or to scavenge files. To enable the job allocation to continue past node failures, one must specify the appropriate flags from the table below.

SCR job allocation flags

LSF batch script	`#BSUB -env "all, LSB_DJOB_COMMFAIL_ACTION=KILL_TASKS"`
LSF interactive	`bsub -env "all, LSB_DJOB_COMMFAIL_ACTION=KILL_TASKS" ...`
MOAB batch script	`#MSUB -l resfailpolicy=ignore`
MOAB interactive	`qsub -I ... -l resfailpolicy=ignore`
SLURM batch script	`#SBATCH --no-kill`
SLURM interactive	`salloc --no-kill ...`

The SCR wrapper scripts

The easiest way to integrate SCR into a batch script is to set some environment variables and to replace the job run command with an SCR wrapper script.

Example bash job scripts are located in the share/examples directory of an SCR installation:

SLURM:  scr_srun.sh [srun_options]  <prog> [prog_args ...]
SLURM:  scr_srun_loop.sh [srun_options]  <prog> [prog_args ...]
LSF:    scr_jsrun.sh [jsrun_options] <prog> [prog_args ...]
LSF:    scr_jsrun_loop.sh [jsrun_options] <prog> [prog_args ...]

These scripts are customized for different resource managers. They include the resource manager flag that one needs to set to tolerate node failures. They call scr_prerun to prepare an allocation before starting a run, they launch a run with srun or jsrun, and they call scr_postrun to scavange any cached datasets after the run completes.

The scripts whose name ends with _loop additionally include logic to relaunch a failed job up to some fixed number of times within the allocation. Between launches, the scripts call scr_list_down_nodes to detect and avoid failed nodes for the relaunch. One may allocate spare nodes when using these scripts.

These job scripts serve as templates that one can modify as needed. However, they can often be used as drop in replacements for the launch command in an existing job script.

In addition to the example bash job scripts, an scr_run command is in the /bin directory of an SCR installation. This is a python script that, like the bash scripts, executes scr_prerun, scr_postrun, and it optionally relaunches a run after detecting and excluding any failed nodes. When using scr_run, one must specify the job launcher as the first argument:

SLURM: scr_run srun [srun_options]  <prog> [prog_args ...]
LSF:   scr_run jsrun [jsrun_options]  <prog> [prog_args ...]

Using the SCR wrapper scripts

All wrapper scripts must run from within a job allocation. The commands must know the SCR prefix directory. By default, this is assumed to be the current working directory. For the scr_run script, one may specify a different prefix directory by setting the SCR_PREFIX parameter.

When using any of these scripts, it is recommended to set the SCR_HALT_SECONDS parameter so that the job allocation does not expire before datasets can be flushed (Section Halt a job).

By default, the scr_run script does not restart an application after the first job step exits. To automatically restart a job step within the current allocation, set the SCR_RUNS environment variable to the maximum number of runs to attempt. For an unlimited number of attempts, set this variable to -1.

After a job step exits, the wrapper script checks whether it should restart the job. If so, the script sleeps for some time to give nodes in the allocation a chance to clean up. Then it checks that there are sufficient healthy nodes remaining in the allocation. By default, the script assumes the next run requires the same number of nodes as the previous run, which is recorded in a file written by the SCR library. If this file cannot be read, the command assumes the application requires all nodes in the allocation. Alternatively, one may override these heuristics and precisely specify the number of nodes needed by setting the SCR_MIN_NODES environment variable to the number of required nodes.

See Section Configure a job for additional common SCR configuration settings.

For applications that cannot invoke the SCR wrapper scripts as described here, one should examine the logic contained within the script and duplicate the necessary parts in the job batch script. In particular, one should invoke scr_postrun for scavenge support.

Example batch script for using scavenge, but no restart

An example SLURM batch script with scr_srun.sh is shown below

#!/bin/bash
#SBATCH --no-kill

# halt if there is less than an hour remaining (3600 seconds)
export SCR_HALT_SECONDS=3600

# run the job with scr_srun
scr_run.sh -n512 -N64 ./my_job

Example batch script for using scavenge and restart

An example SLURM batch script with scr_srun_loop.sh is shown below

#!/bin/bash
#SBATCH --no-kill

# halt if there is less than an hour remaining (3600 seconds)
export SCR_HALT_SECONDS=3600

# run the job with scr_srun, will run up to 5 times
scr_run_loop.sh -n512 -N64 ./my_job

Example SLURM batch script with `scr_run` using scavenge and restart

#!/bin/bash
#SBATCH --no-kill
#SBATCH --nodes 66

# above, tell SLURM to not kill the job allocation upon a node failure
# also note that the job requested 2 spares -- it uses 64 nodes but allocated 66

# specify where datasets should be written
export SCR_PREFIX=/parallel/file/system/username/run1

# instruct SCR to flush to the file system every 20 checkpoints
export SCR_FLUSH=20

# halt if there is less than an hour remaining (3600 seconds)
export SCR_HALT_SECONDS=3600

# attempt to run the job up to 3 times
export SCR_RUNS=3

# run the job with scr_srun
scr_run srun -n512 -N64 ./my_job

Managing SCR jobs with Python scripts

These scripts define and use an scrjob package to manage an SCR job. Most files in this directory are installed to /libexec/python of an SCR installation.

NOTE: Though a setup.py exists, it is not currently functional.

/commands - scripts that use scrjob, installed to /bin of an SCR installation
/scrjob - scrjob package files
/tests - tests for the scrjob package

The following scripts are not typically invoked by a user, and they are not considered to be part of the SCR user interface. However, these scripts are helpful for debugging and testing.

scr_check_node.py - Execute on each compute node to check access to cache and control directories
scr_ckpt_interval.py - Given an SCR log file, compute estimate for optimal interval between checkpoints
scr_env.py - Print various values from an allocation environment
scr_hostlist.py - Manipulate a hostlist string, to expand, compress, and subtract nodes
scr_inspect.py - Execute on each compute node to determine whether a dataset can be scavenged (not used)
scr_kill_jobstep.py - Given the specified launcher and jobstepid, call JobLauncher.kill_jobstep(jobstepid)
scr_list_dir.py - Print the SCR control or cache directories
scr_poststage.py - Initiate a poststage operation where supported (non-functional)
scr_scavenge.py - Execute a scavenge operation to copy files from cache to the prefix directory

ClusterShell (optional)

The scrjob package can use the ClusterShell module if available. This can be enabled by setting USE_CLUSTERSHELL = True in scrjob/config.py. ClusterShell is not used if it is not found or if it is disabled.

ClusterShell is useful as:

an scr_hostlist replacement for manipulating host lists
a pdsh replacement for running commands on a set of compute nodes

Installation

pip install ClusterShell

To configure ClusterShell, see:

ClusterShell config docs
man clush.conf

Node groups can be bound together by differing lists. Library defaults may need to be overridden for identifying nodes (see bottom of the config.html link) The ClusterShell NodeSet class supports more operations than scr_hostlist. See the NodeSet class: ClusterShell.NodeSet.NodeSet (ClusterShell Usage ).

SCR User Commands

Along with other binaries, these scripts are copied to the /bin directory of an SCR installation. The scripts are installed with their executable bit set, and the .py suffix is dropped. These are user commands that are typically invoked interactively or from a batch job script. Detailed usage for the scripts in this directory is provided in the SCR user documentation.

SCR commands

scr_prerun - Execute before the first SCR job in an allocation
scr_postrun - Execute after the final SCR job in an allocation; scavenges any cached datasets
scr_list_down_nodes - Reports list of currently failed nodes in an allocation, if any
scr_should_exit - Indicates whether one should stop launching SCR runs within an allocation; checks for active halt condition, insufficient nodes, or in sufficient time
scr_halt - View/edit/remove conditions in the halt file

scr_run

The scr_run script provides a high-level wrapper around the above scipts. It can automatically relaunch a job and avoid down nodes after detecting a failure, and it scavenges any cached datasets before exiting the allocation.

Usage:

Typical:

scr_run <launcher> <launcher args> <program> <program args>

Extended options:

scr_run <launcher> [-rc|--run-cmd]=<run command> [-rs|--restart-cmd]=<restart command> <launcher args>

Using extended options the commands will be modified as follows:

If not restarting or scr_have_restart returns None: <launcher> <launcher args> <run command>

If restarting and the most recent checkpoint name is identified, instances of SCR_CKPT_NAME will be replaced with the checkpoint name in the restart command: <launcher> <launcher args> <restart command>