Configure a job
The default SCR configuration suffices for many Linux clusters. However, significant performance improvement or additional functionality may be gained via custom configuration.
Setting parameters
SCR searches the following locations in the following order for a parameter value, taking the first value it finds:

1. Environment variables
2. User configuration file
3. Values set with SCR_Config
4. System configuration file
5. Compile-time constants
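For example, if both the environment and the user configuration file define SCR_CACHE_SIZE, SCR takes the value from the environment, since environment variables are searched first (an illustrative scenario):
# in ~/myscr.conf (user configuration file)
SCR_CACHE_SIZE=1
# in the job script; this value takes precedence
export SCR_CACHE_SIZE=2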
A convenient method to set an SCR parameter is through an environment variable, e.g.,:
export SCR_CACHE_SIZE=2
In cases where SCR parameters need to be set based on the run time configuration of the application, the application can call SCR_Config, e.g.,:
SCR_Config("SCR_CACHE_SIZE=2");
Section Configure SCR for application settings lists common use cases for SCR_Config.
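For instance, an application might derive a setting from its own run time configuration and apply it before calling SCR_Init. The following is a minimal sketch; cache_size is a hypothetical application variable:
#include <stdio.h>  /* snprintf */
#include "scr.h"

int cache_size = 2;  /* hypothetical value computed by the application */
char setting[64];
snprintf(setting, sizeof(setting), "SCR_CACHE_SIZE=%d", cache_size);
SCR_Config(setting);  /* apply the setting before initializing SCR */
SCR_Init();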
SCR also offers two configuration files: a user configuration file and a system configuration file. The user configuration file is useful for parameters that may need to vary by job, while the system configuration file is useful for parameters that apply to all jobs.
To find a user configuration file, SCR looks for a file named .scrconf in the prefix directory. Alternatively, one may specify the name and location of the user configuration file by setting the SCR_CONF_FILE environment variable at run time, e.g.,:
export SCR_CONF_FILE=~/myscr.conf
The location of the system configuration file is hard-coded into SCR at build time. This defaults to <install>/etc/scr.conf. One may choose a different path using the SCR_CONFIG_FILE CMake option, e.g.,:
cmake -DSCR_CONFIG_FILE=/path/to/scr.conf ...
To set an SCR parameter in a configuration file, list the parameter name followed by its value separated by an ‘=’ sign. Blank lines are ignored, and any characters following the ‘#’ comment character are ignored. For example, a configuration file may contain something like the following:
>>: cat ~/myscr.conf
# set the halt seconds to one hour
SCR_HALT_SECONDS=3600
# set SCR to flush every 20 checkpoints
SCR_FLUSH=20
One can include environment variable expressions in the value of SCR configuration parameters. SCR interpolates the value of the environment variable at run time before setting the parameter. This is useful for some parameters like storage paths, which may only be defined within the allocation environment, e.g.,:
# SLURM system that creates a /dev/shm directory for each job
SCR_CNTL_BASE=/dev/shm/$SLURM_JOBID
SCR_CACHE_BASE=/dev/shm/$SLURM_JOBID
Common configurations
This section describes some common configuration values. These parameters can be set using any of the methods described above.
Enable debug messages
SCR can print informational messages about its operations, timing, and bandwidth:
SCR_DEBUG=1
This setting is recommended during development and debugging.
Specify the job output directory
By default, SCR uses the current working directory as its prefix directory.
If one needs to specify a different path, set SCR_PREFIX:
SCR_PREFIX=/job/output/dir
It is common to set SCR_PREFIX to be the top-level output directory of the application.
Specify which checkpoint to load
By default, SCR attempts to load the most recent checkpoint. If one wants to load a particular checkpoint instead, one can name which checkpoint to load by setting SCR_CURRENT:
SCR_CURRENT=ckptname
The value for the name must match the string that was given as the dataset name during the call to SCR_Start_output in which the checkpoint was created.
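For example, if the application created a checkpoint with a call like the following (a minimal sketch; the name timestep.500 is illustrative):
SCR_Start_output("timestep.500", SCR_FLAG_CHECKPOINT);
then a later run can load that checkpoint by setting:
SCR_CURRENT=timestep.500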
Write file-per-process, read file-per-process
In this mode, an application uses file-per-process mode both while writing its dataset during checkpoint/output and while reading its dataset during restart. So long as there is sufficient cache capacity, SCR can use cache including node-local storage for both operations. To configure SCR for this mode:
SCR_CACHE_BYPASS=0
One must set SCR_CACHE_BYPASS=0 to instruct SCR to use cache.
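The following is a minimal sketch of this pattern using the SCR API; the rank and timestep variables and the file names are illustrative, and error checking is omitted:
#include <stdio.h>
#include "scr.h"

/* checkpoint: each process writes its own file, which SCR places in cache */
void write_checkpoint(int rank, int timestep) {
  char name[SCR_MAX_FILENAME];
  snprintf(name, sizeof(name), "ckpt.%d", timestep);
  SCR_Start_output(name, SCR_FLAG_CHECKPOINT);

  char file[SCR_MAX_FILENAME];
  char path[SCR_MAX_FILENAME];
  snprintf(file, sizeof(file), "%s/rank_%d.ckpt", name, rank);
  SCR_Route_file(file, path);  /* SCR returns the cache path to write */

  FILE* fp = fopen(path, "w");
  /* ... write this process's state to fp ... */
  fclose(fp);

  SCR_Complete_output(1);  /* 1 indicates this process's file is valid */
}

/* restart: each process reads its own file back, served from cache */
void read_checkpoint(int rank) {
  int have = 0;
  char name[SCR_MAX_FILENAME];
  SCR_Have_restart(&have, name);  /* name of most recent checkpoint */
  if (have) {
    SCR_Start_restart(name);

    char file[SCR_MAX_FILENAME];
    char path[SCR_MAX_FILENAME];
    snprintf(file, sizeof(file), "%s/rank_%d.ckpt", name, rank);
    SCR_Route_file(file, path);  /* SCR returns the cache path to read */

    FILE* fp = fopen(path, "r");
    /* ... read this process's state from fp ... */
    fclose(fp);

    SCR_Complete_restart(1);
  }
}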
Change checkpoint flush frequency
By default, SCR flushes any dataset marked as SCR_FLAG_OUTPUT, and it flushes every 10th checkpoint. To flush non-output checkpoint datasets at a different rate, one can set SCR_FLUSH.
For example, to flush every checkpoint:
SCR_FLUSH=1
Change cache location
By default, SCR uses /dev/shm as its cache base. One can use a different cache location by setting SCR_CACHE_BASE. For example, one might target a path that points to a node-local SSD:
SCR_CACHE_BASE=/ssd
This parameter is useful in runs that use a single cache location. When using multiple cache directories within a single run, one can define store and checkpoint descriptors as described later.
Change control and cache location
At times, one may need to set both the control and cache directories. For example, some sites configure SLURM to create a path to temporary storage for each allocation:
SCR_CNTL_BASE=/tmp/$SLURM_JOBID
SCR_CACHE_BASE=/tmp/$SLURM_JOBID
Another use case is when one needs to run multiple, independent SCR jobs within a single allocation. This is somewhat common in automated testing frameworks that run many different test cases in parallel within a single resource allocation. To support this, one can configure each run to use its own control and cache directories:
# for test case 1
SCR_CNTL_BASE=/dev/shm/test1
SCR_CACHE_BASE=/dev/shm/test1
# for test case 2
SCR_CNTL_BASE=/dev/shm/test2
SCR_CACHE_BASE=/dev/shm/test2
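As a sketch, assuming a SLURM allocation, one might launch the two test cases in parallel from a job script (test1 and test2 are hypothetical executables, and the node and task counts are illustrative):
# run two independent SCR jobs in one allocation,
# each with its own control and cache directories
SCR_CNTL_BASE=/dev/shm/test1 SCR_CACHE_BASE=/dev/shm/test1 srun -N 2 -n 8 ./test1 &
SCR_CNTL_BASE=/dev/shm/test2 SCR_CACHE_BASE=/dev/shm/test2 srun -N 2 -n 8 ./test2 &
wait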
Increase cache size
When using cache, SCR stores at most one dataset by default.
One can increase this limit with SCR_CACHE_SIZE, e.g., to cache up to two datasets:
SCR_CACHE_SIZE=2
It is recommended to use a cache size of at least 2 when possible.
Change redundancy schemes
By default, SCR uses the XOR redundancy scheme to withstand node failures. One can change the scheme using the SCR_COPY_TYPE parameter.
For example, to use Reed-Solomon to withstand up to two failures per set:
SCR_COPY_TYPE=RS
In particular, on stable systems where one is using SCR primarily for its asynchronous flush capability rather than for its fault tolerance, it may be best to use SINGLE:
SCR_COPY_TYPE=SINGLE
It is possible to use multiple redundancy schemes in a single job. For this, one must specify checkpoint descriptors as described in Group, store, and checkpoint descriptors.
Enable asynchronous flush
By default, SCR flushes datasets synchronously. In this mode, the SCR API call that initiates the flush does not return until the flush completes. One can configure SCR to use asynchronous flushes instead, in which case the flush is started during one SCR API call and may be finalized in a later SCR API call. To enable asynchronous flushes, one should both set SCR_FLUSH_ASYNC=1 and specify a flush type like PTHREAD:
SCR_FLUSH_ASYNC=1
SCR_FLUSH_TYPE=PTHREAD
Restart with a different number of processes
To restart an application with a different number of processes than used to save the checkpoint, one must follow the steps listed in Restart without SCR. Additionally, one should set the following:
SCR_FLUSH_ON_RESTART=1
SCR_FETCH=0
Group, store, and checkpoint descriptors
SCR must have information about process groups, storage devices, and redundancy schemes. SCR defines defaults that are sufficient in most cases.
By default, SCR creates a group of all processes in the job called WORLD and another group of all processes on the same compute node called NODE. For storage, SCR requires that all processes be able to access the prefix directory, and it assumes that /dev/shm is storage local to each compute node. SCR defines a default checkpoint descriptor that caches datasets in /dev/shm and protects against compute node failure using the XOR redundancy scheme.
The above defaults provide reasonable settings for Linux clusters. If necessary, one can define custom settings via group, store, and checkpoint descriptors in configuration files.
If more groups are needed, they can be defined in configuration files with entries like the following:
GROUPS=host1 POWER=psu1 SWITCH=0
GROUPS=host2 POWER=psu1 SWITCH=1
GROUPS=host3 POWER=psu2 SWITCH=0
GROUPS=host4 POWER=psu2 SWITCH=1
Group descriptor entries are identified by a leading GROUPS key. Each line corresponds to a single compute node, where the hostname of the compute node is the value of the GROUPS key. There must be one line for every compute node in the allocation. It is recommended to specify groups in the system configuration file, since these group definitions often apply to all jobs on the system.
The remaining values on the line specify a set of group name / value pairs. The group name is the string to be referenced by store and checkpoint descriptors. The value can be an arbitrary character string. All nodes that specify the same value are placed in the same group. Each unique value defines a distinct group.
In the above example, there are four compute nodes: host1, host2, host3, and host4. There are two groups defined: POWER and SWITCH. Nodes host1 and host2 belong to one POWER group (psu1), and nodes host3 and host4 belong to another (psu2). For the SWITCH group, nodes host1 and host3 belong to one group (0), and nodes host2 and host4 belong to another (1).
Additional storage can be described in configuration files with entries like the following:
STORE=/dev/shm GROUP=NODE COUNT=1
STORE=/ssd GROUP=NODE COUNT=3 FLUSH=PTHREAD
STORE=/dev/persist GROUP=NODE COUNT=1 ENABLED=1 MKDIR=0
STORE=/p/lscratcha GROUP=WORLD
Store descriptor entries are identified by a leading STORE key. Each line corresponds to a class of storage devices. The value associated with the STORE key is the directory prefix of the storage device. This directory prefix also serves as the name of the store descriptor. All compute nodes must be able to access their respective storage device via the specified directory prefix.

The remaining values on the line specify properties of the storage class. The GROUP key specifies the group of processes that share a device. Its value must specify a group name. The GROUP key is optional, and it defaults to NODE if not specified. The COUNT key specifies the maximum number of datasets that can be kept in the associated storage. The user should be careful to set this appropriately depending on the storage capacity and the application dataset size. The COUNT key is optional, and it defaults to the value of the SCR_CACHE_SIZE parameter if not specified. The ENABLED key enables (1) or disables (0) the store descriptor. This key is optional, and it defaults to 1 if not specified. The MKDIR key specifies whether the device supports the creation of directories (1) or not (0). This key is optional, and it defaults to 1 if not specified. The FLUSH key specifies the transfer type to use when flushing datasets from that storage location. This key is optional, and it defaults to the value of the SCR_FLUSH_TYPE parameter if not specified.
In the above example, there are four storage devices specified: /dev/shm, /ssd, /dev/persist, and /p/lscratcha. The storage at /dev/shm, /ssd, and /dev/persist specify the NODE group, which means that they are node-local storage: processes on the same compute node access the same device. The storage at /p/lscratcha specifies the WORLD group, which means that all processes in the job can access the device. In other words, it is a globally accessible file system.
One can define checkpoint descriptors in a configuration file. This is especially useful when more than one checkpoint descriptor is needed in a single job. Example checkpoint descriptor entries look like the following:
# instruct SCR to use the CKPT descriptors from the config file
SCR_COPY_TYPE=FILE
# enable datasets to be stored in cache
SCR_CACHE_BYPASS=0
# the following instructs SCR to run with three checkpoint configurations:
# - save every 8th checkpoint to /ssd using the PARTNER scheme
# - save every 4th checkpoint (not divisible by 8) and any output dataset
# to /ssd using RS with a set size of 8
# - save all other checkpoints (not divisible by 4 or 8) to /dev/shm using XOR with
# a set size of 16
CKPT=0 INTERVAL=1 GROUP=NODE STORE=/dev/shm TYPE=XOR SET_SIZE=16
CKPT=1 INTERVAL=4 GROUP=NODE STORE=/ssd TYPE=RS SET_SIZE=8 SET_FAILURES=3 OUTPUT=1
CKPT=2 INTERVAL=8 GROUP=SWITCH STORE=/ssd TYPE=PARTNER BYPASS=1
First, one must set the SCR_COPY_TYPE parameter to FILE. Otherwise, SCR uses an implied checkpoint descriptor that is defined using various SCR parameters including SCR_GROUP, SCR_CACHE_BASE, SCR_COPY_TYPE, and SCR_SET_SIZE. To store datasets in cache, one must set SCR_CACHE_BYPASS=0 to disable bypass mode. When bypass is enabled, all datasets are written directly to the parallel file system.
Checkpoint descriptor entries are identified by a leading CKPT key. The values of the CKPT keys must be numbered sequentially starting from 0.

The INTERVAL key specifies how often a descriptor is to be applied. For each checkpoint, SCR selects the descriptor having the largest interval value that evenly divides the internal SCR checkpoint iteration number. It is necessary that one descriptor have an interval of 1. This key is optional, and it defaults to 1 if not specified.
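For example, with the three descriptors defined above, checkpoint 5 is divisible by neither 4 nor 8, so SCR applies CKPT=0; checkpoint 12 is divisible by 4 but not by 8, so SCR applies CKPT=1; and checkpoint 16 is divisible by both 4 and 8, so SCR applies CKPT=2, the descriptor with the largest qualifying interval.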
The GROUP key lists the failure group, i.e., the name of the group of processes that are likely to fail at the same time. This key is optional, and it defaults to the value of the SCR_GROUP parameter if not specified. The STORE key specifies the directory in which to cache the checkpoint. This key is optional, and it defaults to the value of the SCR_CACHE_BASE parameter if not specified. The TYPE key identifies the redundancy scheme to be applied. This key is optional, and it defaults to the value of the SCR_COPY_TYPE parameter if not specified. The BYPASS key indicates whether to bypass cache and access data files directly on the parallel file system (1) or whether to store them in cache (0). In either case, redundancy is applied to internal SCR metadata using the specified descriptor settings. This key is optional, and it defaults to the value of the SCR_CACHE_BYPASS parameter if not specified.
Other keys may exist depending on the selected redundancy scheme. For the XOR and RS schemes, the SET_SIZE key specifies the minimum number of processes to include in each redundancy set. This defaults to the value of SCR_SET_SIZE if not specified. For RS, the SET_FAILURES key specifies the maximum number of failures to tolerate within each redundancy set. If not specified, this defaults to the value of SCR_SET_FAILURES.
One checkpoint descriptor can be marked with the OUTPUT key. This indicates that the descriptor should be selected to store datasets that the application flags with SCR_FLAG_OUTPUT. The OUTPUT key is optional, and it defaults to 0. If there is no descriptor with the OUTPUT key defined and the dataset is also a checkpoint, SCR chooses the checkpoint descriptor according to the normal policy. Otherwise, if there is no descriptor with the OUTPUT key defined and the dataset is not a checkpoint, SCR uses the checkpoint descriptor having an interval of 1.
If one does not explicitly define a checkpoint descriptor, the default SCR descriptor can be defined in pseudocode as:
CKPT=0 INTERVAL=1 GROUP=$SCR_GROUP STORE=$SCR_CACHE_BASE TYPE=$SCR_COPY_TYPE SET_SIZE=$SCR_SET_SIZE BYPASS=$SCR_CACHE_BYPASS
If those parameters are not set otherwise, this defaults to the following:
CKPT=0 INTERVAL=1 GROUP=NODE STORE=/dev/shm TYPE=XOR SET_SIZE=8 BYPASS=1
Example using SINGLE and XOR
On many systems, application failures are more common than node failures. The SINGLE redundancy scheme is sufficient to recover from application failures, and it is much faster than other redundancy schemes like XOR. If there is room to store multiple checkpoints in cache, one can configure SCR to use SINGLE and XOR in the same run. After an application failure, SCR can restart the job from the most recent checkpoint, but if a node fails, SCR can fall back to the most recent XOR checkpoint. The following entries configure SCR to encode every 10th checkpoint with XOR but use SINGLE for all others:
# instruct SCR to use the CKPT descriptors from the config file
SCR_COPY_TYPE=FILE
# enable datasets to be stored in cache
SCR_CACHE_BYPASS=0
# define distinct paths for SINGLE and XOR
STORE=/dev/shm/single COUNT=1
STORE=/dev/shm/xor COUNT=1
# save every 10th checkpoint using XOR
# save all other checkpoints using SINGLE
CKPT=0 INTERVAL=1 STORE=/dev/shm/single TYPE=SINGLE
CKPT=1 INTERVAL=10 STORE=/dev/shm/xor TYPE=XOR
This configures SCR to write all checkpoints within /dev/shm, but separate directories are used for SINGLE and XOR. By defining distinct STORE locations for each redundancy type, SCR always deletes an older checkpoint of the same type before writing a new checkpoint.
SCR parameters
The table in this section specifies the full set of SCR configuration parameters.
Name | Default | Description
---|---|---
SCR_DEBUG | 0 | Set to 1 or 2 for increasing verbosity levels of debug messages.
SCR_CHECKPOINT_INTERVAL | 0 | Set to positive number of times SCR_Need_checkpoint should be called before returning 1.
SCR_CHECKPOINT_SECONDS | 0 | Set to positive number of seconds to specify minimum time between consecutive checkpoints as guided by SCR_Need_checkpoint.
SCR_CHECKPOINT_OVERHEAD | 0.0 | Set to positive floating-point value to specify maximum percent overhead allowed for checkpointing operations as guided by SCR_Need_checkpoint.
SCR_CNTL_BASE | /dev/shm | Specify the default base directory SCR should use to store its runtime control metadata. The control directory should be in fast, node-local storage like RAM disk.
SCR_HALT_EXIT | 0 | Whether SCR should call exit() when it detects an active halt condition.
SCR_HALT_SECONDS | 0 | Set to a positive integer to instruct SCR to halt the job if the remaining time in the current job allocation is less than the specified number of seconds.
SCR_GROUP | NODE | Specify name of default failure group.
SCR_COPY_TYPE | XOR | Set to one of: SINGLE, PARTNER, XOR, RS, or FILE.
SCR_CACHE_BASE | /dev/shm | Specify the default base directory SCR should use to cache datasets.
SCR_CACHE_SIZE | 1 | Set to a non-negative integer to specify the maximum number of checkpoints SCR should keep in cache. SCR will delete the oldest checkpoint from cache before saving another in order to keep the total count below this limit.
SCR_CACHE_BYPASS | 1 | Specify bypass mode. When enabled, data files are directly read from and written to the parallel file system, bypassing the cache. Even in bypass mode, internal SCR metadata corresponding to the dataset is stored in cache. Set to 0 to direct SCR to store datasets in cache.
SCR_CACHE_PURGE | 0 | Whether to delete all datasets from cache during SCR_Init.
SCR_SET_SIZE | 8 | Specify the minimum number of processes to include in a redundancy set. So long as there are sufficient failure groups, each redundancy set will be at least the minimum size. If not, redundancy sets will be as large as possible, but they may be smaller than the minimum size. Increasing this value can decrease the amount of storage required to cache the dataset. However, a higher value can require more time to rebuild lost files, and it increases the likelihood of encountering a catastrophic failure.
SCR_SET_FAILURES | 2 | Specify the number of failures to tolerate in each set while using the RS scheme. Increasing this value enables one to tolerate more failures per set, but it increases redundancy storage and encoding costs.
SCR_PREFIX | $PWD | Specify the prefix directory on the parallel file system where datasets should be read from and written to.
SCR_PREFIX_SIZE | 0 | Specify number of checkpoints to keep in the prefix directory. SCR deletes older checkpoints as new checkpoints are flushed to maintain a sliding window of the specified size. Set to 0 to keep all checkpoints. Checkpoints marked with SCR_FLAG_OUTPUT are not deleted.
SCR_PREFIX_PURGE | 0 | Set to 1 to delete all datasets from the prefix directory (both checkpoint and output) during SCR_Init.
SCR_CURRENT | N/A | Name of checkpoint to mark as current and attempt to load during SCR_Init in a new run.
SCR_DISTRIBUTE | 1 | Set to 0 to disable cache rebuild during SCR_Init.
SCR_FETCH | 1 | Set to 0 to disable SCR from fetching files from the parallel file system during SCR_Init.
SCR_FETCH_BYPASS | 0 | Set to 1 to read files directly from the parallel file system during fetch.
SCR_FETCH_WIDTH | 256 | Specify the number of processes that may read simultaneously from the parallel file system.
SCR_FLUSH | 10 | Specify the number of checkpoints between periodic flushes to the parallel file system. Set to 0 to disable periodic flushes.
SCR_FLUSH_ASYNC | 0 | Set to 1 to enable asynchronous flush methods (if supported).
SCR_FLUSH_POSTSTAGE | 0 | Set to 1 to finalize asynchronous flushes using the scr_poststage script, rather than in SCR_Finalize(). This can be used to start a checkpoint flush near the end of a job and have it run "in the background" after the job finishes. This is currently only supported by the IBM Burst Buffer API (BBAPI). To use this, one must specify scr_poststage as the 2nd-half post-stage script in bsub to finalize the transfers. See examples/test_scr_poststage for a detailed example.
SCR_FLUSH_TYPE | SYNC | Specify the flush transfer method. Set to one of: SYNC, PTHREAD, or BBAPI.
SCR_FLUSH_WIDTH | 256 | Specify the number of processes that may write simultaneously to the parallel file system.
SCR_FLUSH_ON_RESTART | 0 | Set to 1 to force SCR to flush datasets during restart. This is useful for applications that restart without using the SCR Restart API. Typically, one should also set SCR_FETCH=0 when enabling this option.
SCR_GLOBAL_RESTART | 0 | Set to 1 to flush checkpoints to and restart from the prefix directory during SCR_Init.
SCR_RUNS | 1 | Specify the maximum number of times the scr_srun command should attempt to run a job within an allocation.
SCR_MIN_NODES | N/A | Specify the minimum number of nodes required to run a job.
SCR_EXCLUDE_NODES | N/A | Specify a set of nodes, using SLURM node range syntax, which should be excluded from runs. This is useful to avoid particular problematic nodes. Nodes named in this list that are not part of the current job allocation are silently ignored.
SCR_LOG_ENABLE | 0 | Whether to enable any form of logging of SCR events.
SCR_LOG_TXT_ENABLE | 1 | Whether to log SCR events to a text file in the prefix directory at SCR_PREFIX/.scr/log. SCR_LOG_ENABLE must also be set to 1 for this logging to be active.
SCR_LOG_SYSLOG_ENABLE | 1 | Whether to log SCR events to syslog. SCR_LOG_ENABLE must also be set to 1 for this logging to be active.
SCR_LOG_DB_ENABLE | 0 | Whether to log SCR events to MySQL database. SCR_LOG_ENABLE must also be set to 1 for this logging to be active.
SCR_LOG_DB_DEBUG | 0 | Whether to print MySQL statements as they are executed.
SCR_LOG_DB_HOST | N/A | Hostname of MySQL server.
SCR_LOG_DB_NAME | N/A | Name of SCR MySQL database.
SCR_LOG_DB_USER | N/A | Username of SCR MySQL user.
SCR_LOG_DB_PASS | N/A | Password for SCR MySQL user.
SCR_MPI_BUF_SIZE | 131072 | Specify the number of bytes to use for internal MPI send and receive buffers when computing redundancy data or rebuilding lost files.
SCR_FILE_BUF_SIZE | 1048576 | Specify the number of bytes to use for internal buffers when copying files between the parallel file system and the cache.
SCR_WATCHDOG_TIMEOUT | N/A | Set to the expected time (seconds) for checkpoint writes to in-system storage (see Catch a hanging job).
SCR_WATCHDOG_TIMEOUT_PFS | N/A | Set to the expected time (seconds) for checkpoint writes to the parallel file system (see Catch a hanging job).