Configure a job
The default SCR configuration suffices for many Linux clusters. However, significant performance improvement or additional functionality may be gained via custom configuration.
Setting parameters
SCR searches the following locations in the following order for a parameter value, taking the first value it finds:

1. Environment variables
2. User configuration file
3. Values set with SCR_Config
4. System configuration file
5. Compile-time constants
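For example, if both the environment and the user configuration file define SCR_CACHE_SIZE, SCR takes the value from the environment, since environment variables are searched first (an illustrative scenario):
# in ~/myscr.conf (user configuration file)
SCR_CACHE_SIZE=1
# in the job script; this value takes precedence
export SCR_CACHE_SIZE=2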
A convenient method to set an SCR parameter is through an environment variable, e.g.,:
export SCR_CACHE_SIZE=2
In cases where SCR parameters need to be set based on the run time configuration of the application, the application can call SCR_Config, e.g.,:
SCR_Config("SCR_CACHE_SIZE=2");
Section Configure SCR for application settings lists common use cases for SCR_Config.
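For instance, an application might derive a setting from its own run time configuration and apply it before calling SCR_Init. The following is a minimal sketch; cache_size is a hypothetical application variable:
#include <stdio.h>  /* snprintf */
#include "scr.h"

int cache_size = 2;  /* hypothetical value computed by the application */
char setting[64];
snprintf(setting, sizeof(setting), "SCR_CACHE_SIZE=%d", cache_size);
SCR_Config(setting);  /* apply the setting before initializing SCR */
SCR_Init();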
SCR also offers two configuration files: a user configuration file and a system configuration file. The user configuration file is useful for parameters that may need to vary by job, while the system configuration file is useful for parameters that apply to all jobs.
To find a user configuration file, SCR looks for a file named .scrconf in the prefix directory. Alternatively, one may specify the name and location of the user configuration file by setting the SCR_CONF_FILE environment variable at run time, e.g.,:
export SCR_CONF_FILE=~/myscr.conf
The location of the system configuration file is hard-coded into SCR at build time. This defaults to <install>/etc/scr.conf. One may choose a different path using the SCR_CONFIG_FILE CMake option, e.g.,:
cmake -DSCR_CONFIG_FILE=/path/to/scr.conf ...
To set an SCR parameter in a configuration file, list the parameter name followed by its value separated by an ‘=’ sign. Blank lines are ignored, and any characters following the ‘#’ comment character are ignored. For example, a configuration file may contain something like the following:
>>: cat ~/myscr.conf
# set the halt seconds to one hour
SCR_HALT_SECONDS=3600
# set SCR to flush every 20 checkpoints
SCR_FLUSH=20
One can include environment variable expressions in the value of SCR configuration parameters. SCR interpolates the value of the environment variable at run time before setting the parameter. This is useful for some parameters like storage paths, which may only be defined within the allocation environment, e.g.,:
# SLURM system that creates a /dev/shm directory for each job
SCR_CNTL_BASE=/dev/shm/$SLURM_JOBID
SCR_CACHE_BASE=/dev/shm/$SLURM_JOBID
Common configurations
This section describes some common configuration values. These parameters can be set using any of the methods described above.
Enable debug messages
SCR can print informational messages about its operations, timing, and bandwidth:
SCR_DEBUG=1
This setting is recommended during development and debugging.
Specify the job output directory
By default, SCR uses the current working directory as its prefix directory.
If one needs to specify a different path, set SCR_PREFIX:
SCR_PREFIX=/job/output/dir
It is common to set SCR_PREFIX to be the top-level output directory of the application.
Specify which checkpoint to load
By default, SCR attempts to load the most recent checkpoint. If one wants to load a particular checkpoint instead, one can name which checkpoint to load by setting SCR_CURRENT:
SCR_CURRENT=ckptname
The value for the name must match the string that was given as the dataset name during the call to SCR_Start_output in which the checkpoint was created.
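For example, if the application created a checkpoint with a call like the following (a minimal sketch; the name timestep.500 is illustrative):
SCR_Start_output("timestep.500", SCR_FLAG_CHECKPOINT);
then a later run can load that checkpoint by setting:
SCR_CURRENT=timestep.500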
Write file-per-process, read file-per-process
In this mode, an application uses file-per-process mode both while writing its dataset during checkpoint/output and while reading its dataset during restart. So long as there is sufficient cache capacity, SCR can use cache including node-local storage for both operations. To configure SCR for this mode:
SCR_CACHE_BYPASS=0
One must set SCR_CACHE_BYPASS=0 to instruct SCR to use cache.
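The following is a minimal sketch of this pattern using the SCR API; the rank and timestep variables and the file names are illustrative, and error checking is omitted:
#include <stdio.h>
#include "scr.h"

/* checkpoint: each process writes its own file, which SCR places in cache */
void write_checkpoint(int rank, int timestep) {
  char name[SCR_MAX_FILENAME];
  snprintf(name, sizeof(name), "ckpt.%d", timestep);
  SCR_Start_output(name, SCR_FLAG_CHECKPOINT);

  char file[SCR_MAX_FILENAME];
  char path[SCR_MAX_FILENAME];
  snprintf(file, sizeof(file), "%s/rank_%d.ckpt", name, rank);
  SCR_Route_file(file, path);  /* SCR returns the cache path to write */

  FILE* fp = fopen(path, "w");
  /* ... write this process's state to fp ... */
  fclose(fp);

  SCR_Complete_output(1);  /* 1 indicates this process's file is valid */
}

/* restart: each process reads its own file back, served from cache */
void read_checkpoint(int rank) {
  int have = 0;
  char name[SCR_MAX_FILENAME];
  SCR_Have_restart(&have, name);  /* name of most recent checkpoint */
  if (have) {
    SCR_Start_restart(name);

    char file[SCR_MAX_FILENAME];
    char path[SCR_MAX_FILENAME];
    snprintf(file, sizeof(file), "%s/rank_%d.ckpt", name, rank);
    SCR_Route_file(file, path);  /* SCR returns the cache path to read */

    FILE* fp = fopen(path, "r");
    /* ... read this process's state from fp ... */
    fclose(fp);

    SCR_Complete_restart(1);
  }
}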
Change checkpoint flush frequency
By default, SCR flushes any dataset marked as SCR_FLAG_OUTPUT, and it flushes every 10th checkpoint. To flush non-output checkpoint datasets at a different rate, one can set SCR_FLUSH.
For example, to flush every checkpoint:
SCR_FLUSH=1
Change cache location
By default, SCR uses /dev/shm as its cache base. One can use a different cache location by setting SCR_CACHE_BASE. For example, one might target a path that points to a node-local SSD:
SCR_CACHE_BASE=/ssd
This parameter is useful in runs that use a single cache location. When using multiple cache directories within a single run, one can define store and checkpoint descriptors as described later.
Change control and cache location
At times, one may need to set both the control and cache directories. For example, some sites configure SLURM to create a path to temporary storage for each allocation:
SCR_CNTL_BASE=/tmp/$SLURM_JOBID
SCR_CACHE_BASE=/tmp/$SLURM_JOBID
Another use case is when one needs to run multiple, independent SCR jobs within a single allocation. This is somewhat common in automated testing frameworks that run many different test cases in parallel within a single resource allocation. To support this, one can configure each run to use its own control and cache directories:
# for test case 1
SCR_CNTL_BASE=/dev/shm/test1
SCR_CACHE_BASE=/dev/shm/test1
# for test case 2
SCR_CNTL_BASE=/dev/shm/test2
SCR_CACHE_BASE=/dev/shm/test2
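As a sketch, assuming a SLURM allocation, one might launch the two test cases in parallel from a job script (test1 and test2 are hypothetical executables, and the node and task counts are illustrative):
# run two independent SCR jobs in one allocation,
# each with its own control and cache directories
SCR_CNTL_BASE=/dev/shm/test1 SCR_CACHE_BASE=/dev/shm/test1 srun -N 2 -n 8 ./test1 &
SCR_CNTL_BASE=/dev/shm/test2 SCR_CACHE_BASE=/dev/shm/test2 srun -N 2 -n 8 ./test2 &
wait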
Increase cache size
When using cache, SCR stores at most one dataset by default.
One can increase this limit with SCR_CACHE_SIZE, e.g., to cache up to two datasets:
SCR_CACHE_SIZE=2
It is recommended to use a cache size of at least 2 when possible.
Change redundancy schemes
By default, SCR uses the XOR redundancy scheme to withstand node failures. One can change the scheme using the SCR_COPY_TYPE parameter.
For example, to use Reed-Solomon to withstand up to two failures per set:
SCR_COPY_TYPE=RS
In particular, on stable systems where one is using SCR primarily for its asynchronous flush capability rather than for its fault tolerance, it may be best to use SINGLE:
SCR_COPY_TYPE=SINGLE
It is possible to use multiple redundancy schemes in a single job. For this, one must specify checkpoint descriptors as described in Group, store, and checkpoint descriptors.
Enable asynchronous flush
By default, SCR flushes datasets synchronously. In this mode, the SCR API call that initiates the flush does not return until the flush completes. One can configure SCR to use asynchronous flushes instead, in which case the flush is started during one SCR API call and may be finalized in a later SCR API call. To enable asynchronous flushes, one should both set SCR_FLUSH_ASYNC=1 and specify a flush type like PTHREAD:
SCR_FLUSH_ASYNC=1
SCR_FLUSH_TYPE=PTHREAD
Restart with a different number of processes
To restart an application with a different number of processes than used to save the checkpoint, one must follow the steps listed in Restart without SCR. Additionally, one should set the following:
SCR_FLUSH_ON_RESTART=1
SCR_FETCH=0
Group, store, and checkpoint descriptors
SCR must have information about process groups, storage devices, and redundancy schemes. SCR defines defaults that are sufficient in most cases.
By default, SCR creates a group of all processes in the job called WORLD and another group of all processes on the same compute node called NODE. For storage, SCR requires that all processes be able to access the prefix directory, and it assumes that /dev/shm is storage local to each compute node. SCR defines a default checkpoint descriptor that caches datasets in /dev/shm and protects against compute node failure using the XOR redundancy scheme.
The above defaults provide reasonable settings for Linux clusters. If necessary, one can define custom settings via group, store, and checkpoint descriptors in configuration files.
If more groups are needed, they can be defined in configuration files with entries like the following:
GROUPS=host1 POWER=psu1 SWITCH=0
GROUPS=host2 POWER=psu1 SWITCH=1
GROUPS=host3 POWER=psu2 SWITCH=0
GROUPS=host4 POWER=psu2 SWITCH=1
Group descriptor entries are identified by a leading GROUPS key. Each line corresponds to a single compute node, where the hostname of the compute node is the value of the GROUPS key. There must be one line for every compute node in the allocation. It is recommended to specify groups in the system configuration file, since these group definitions often apply to all jobs on the system.
The remaining values on the line specify a set of group name / value pairs. The group name is the string to be referenced by store and checkpoint descriptors. The value can be an arbitrary character string. All nodes that specify the same value are placed in the same group. Each unique value defines a distinct group.
In the above example, there are four compute nodes: host1, host2, host3, and host4. There are two groups defined: POWER and SWITCH. Nodes host1 and host2 belong to one POWER group (psu1), and nodes host3 and host4 belong to another (psu2). For the SWITCH group, nodes host1 and host3 belong to one group (0), and nodes host2 and host4 belong to another (1).
Additional storage can be described in configuration files with entries like the following:
STORE=/dev/shm GROUP=NODE COUNT=1
STORE=/ssd GROUP=NODE COUNT=3 FLUSH=PTHREAD
STORE=/dev/persist GROUP=NODE COUNT=1 ENABLED=1 MKDIR=0
STORE=/p/lscratcha GROUP=WORLD
Store descriptor entries are identified by a leading STORE key. Each line corresponds to a class of storage devices. The value associated with the STORE key is the directory prefix of the storage device. This directory prefix also serves as the name of the store descriptor. All compute nodes must be able to access their respective storage device via the specified directory prefix.

The remaining values on the line specify properties of the storage class. The GROUP key specifies the group of processes that share a device. Its value must specify a group name. The GROUP key is optional, and it defaults to NODE if not specified. The COUNT key specifies the maximum number of datasets that can be kept in the associated storage. The user should be careful to set this appropriately depending on the storage capacity and the application dataset size. The COUNT key is optional, and it defaults to the value of the SCR_CACHE_SIZE parameter if not specified. The ENABLED key enables (1) or disables (0) the store descriptor. This key is optional, and it defaults to 1 if not specified. The MKDIR key specifies whether the device supports the creation of directories (1) or not (0). This key is optional, and it defaults to 1 if not specified. The FLUSH key specifies the transfer type to use when flushing datasets from that storage location. This key is optional, and it defaults to the value of the SCR_FLUSH_TYPE parameter if not specified.
In the above example, there are four storage devices specified: /dev/shm, /ssd, /dev/persist, and /p/lscratcha. The storage at /dev/shm, /ssd, and /dev/persist specify the NODE group, which means that they are node-local storage: processes on the same compute node access the same device. The storage at /p/lscratcha specifies the WORLD group, which means that all processes in the job can access the device. In other words, it is a globally accessible file system.
One can define checkpoint descriptors in a configuration file. This is especially useful when more than one checkpoint descriptor is needed in a single job. Example checkpoint descriptor entries look like the following:
# instruct SCR to use the CKPT descriptors from the config file
SCR_COPY_TYPE=FILE
# enable datasets to be stored in cache
SCR_CACHE_BYPASS=0
# the following instructs SCR to run with three checkpoint configurations:
# - save every 8th checkpoint to /ssd using the PARTNER scheme
# - save every 4th checkpoint (not divisible by 8) and any output dataset
# to /ssd using RS with a set size of 8
# - save all other checkpoints (not divisible by 4 or 8) to /dev/shm using XOR with
# a set size of 16
CKPT=0 INTERVAL=1 GROUP=NODE STORE=/dev/shm TYPE=XOR SET_SIZE=16
CKPT=1 INTERVAL=4 GROUP=NODE STORE=/ssd TYPE=RS SET_SIZE=8 SET_FAILURES=3 OUTPUT=1
CKPT=2 INTERVAL=8 GROUP=SWITCH STORE=/ssd TYPE=PARTNER BYPASS=1
First, one must set the SCR_COPY_TYPE parameter to FILE. Otherwise, SCR uses an implied checkpoint descriptor that is defined using various SCR parameters including SCR_GROUP, SCR_CACHE_BASE, SCR_COPY_TYPE, and SCR_SET_SIZE. To store datasets in cache, one must set SCR_CACHE_BYPASS=0 to disable bypass mode. When bypass is enabled, all datasets are written directly to the parallel file system.
Checkpoint descriptor entries are identified by a leading CKPT key. The values of the CKPT keys must be numbered sequentially starting from 0.

The INTERVAL key specifies how often a descriptor is to be applied. For each checkpoint, SCR selects the descriptor having the largest interval value that evenly divides the internal SCR checkpoint iteration number. It is necessary that one descriptor have an interval of 1. This key is optional, and it defaults to 1 if not specified.
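For example, with the three descriptors defined above, checkpoint 5 is divisible by neither 4 nor 8, so SCR applies CKPT=0; checkpoint 12 is divisible by 4 but not by 8, so SCR applies CKPT=1; and checkpoint 16 is divisible by both 4 and 8, so SCR applies CKPT=2, the descriptor with the largest qualifying interval.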
The GROUP key lists the failure group, i.e., the name of the group of processes that are likely to fail at the same time. This key is optional, and it defaults to the value of the SCR_GROUP parameter if not specified. The STORE key specifies the directory in which to cache the checkpoint. This key is optional, and it defaults to the value of the SCR_CACHE_BASE parameter if not specified. The TYPE key identifies the redundancy scheme to be applied. This key is optional, and it defaults to the value of the SCR_COPY_TYPE parameter if not specified. The BYPASS key indicates whether to bypass cache and access data files directly on the parallel file system (1) or whether to store them in cache (0). In either case, redundancy is applied to internal SCR metadata using the specified descriptor settings. This key is optional, and it defaults to the value of the SCR_CACHE_BYPASS parameter if not specified.
Other keys may exist depending on the selected redundancy scheme. For the XOR and RS schemes, the SET_SIZE key specifies the minimum number of processes to include in each redundancy set. This defaults to the value of SCR_SET_SIZE if not specified. For RS, the SET_FAILURES key specifies the maximum number of failures to tolerate within each redundancy set. If not specified, this defaults to the value of SCR_SET_FAILURES.
One checkpoint descriptor can be marked with the OUTPUT key. This indicates that the descriptor should be selected to store datasets that the application flags with SCR_FLAG_OUTPUT. The OUTPUT key is optional, and it defaults to 0. If there is no descriptor with the OUTPUT key defined and the dataset is also a checkpoint, SCR chooses the checkpoint descriptor according to the normal policy. Otherwise, if there is no descriptor with the OUTPUT key defined and the dataset is not a checkpoint, SCR uses the checkpoint descriptor having an interval of 1.
If one does not explicitly define a checkpoint descriptor, the default SCR descriptor can be defined in pseudocode as:
CKPT=0 INTERVAL=1 GROUP=$SCR_GROUP STORE=$SCR_CACHE_BASE TYPE=$SCR_COPY_TYPE SET_SIZE=$SCR_SET_SIZE BYPASS=$SCR_CACHE_BYPASS
If those parameters are not set otherwise, this defaults to the following:
CKPT=0 INTERVAL=1 GROUP=NODE STORE=/dev/shm TYPE=XOR SET_SIZE=8 BYPASS=1
Example using SINGLE and XOR
On many systems, application failures are more common than node failures. The SINGLE redundancy scheme is sufficient to recover from application failures, and it is much faster than other redundancy schemes like XOR. If there is room to store multiple checkpoints in cache, one can configure SCR to use SINGLE and XOR in the same run. After an application failure, SCR can restart the job from the most recent checkpoint, but if a node fails, SCR can fall back to the most recent XOR checkpoint. The following entries configure SCR to encode every 10th checkpoint with XOR but use SINGLE for all others:
# instruct SCR to use the CKPT descriptors from the config file
SCR_COPY_TYPE=FILE
# enable datasets to be stored in cache
SCR_CACHE_BYPASS=0
# define distinct paths for SINGLE and XOR
STORE=/dev/shm/single COUNT=1
STORE=/dev/shm/xor COUNT=1
# save every 10th checkpoint using XOR
# save all other checkpoints using SINGLE
CKPT=0 INTERVAL=1 STORE=/dev/shm/single TYPE=SINGLE
CKPT=1 INTERVAL=10 STORE=/dev/shm/xor TYPE=XOR
This configures SCR to write all checkpoints within /dev/shm, but separate directories are used for SINGLE and XOR. By defining distinct STORE locations for each redundancy type, SCR always deletes an older checkpoint of the same type before writing a new checkpoint.
SCR parameters
The table in this section specifies the full set of SCR configuration parameters.
Name | Default | Description
---|---|---
SCR_DEBUG | 0 | Set to 1 or 2 for increasing verbosity levels of debug messages.
SCR_CHECKPOINT_INTERVAL | 0 | Set to positive number of times SCR_Need_checkpoint should be called before returning 1.
SCR_CHECKPOINT_SECONDS | 0 | Set to positive number of seconds to specify minimum time between consecutive checkpoints as guided by SCR_Need_checkpoint.
SCR_CHECKPOINT_OVERHEAD | 0.0 | Set to positive floating-point value to specify maximum percent overhead allowed for checkpointing operations as guided by SCR_Need_checkpoint.
SCR_CNTL_BASE | /dev/shm | Specify the default base directory SCR should use to store its runtime control metadata. The control directory should be in fast, node-local storage like RAM disk.
SCR_HALT_EXIT | 0 | Whether SCR should call exit() when it detects an active halt condition.
SCR_HALT_SECONDS | 0 | Set to a positive integer to instruct SCR to halt the job if the remaining time in the current job allocation is less than the specified number of seconds.
SCR_GROUP | NODE | Specify name of default failure group.
SCR_COPY_TYPE | XOR | Set to one of: SINGLE, PARTNER, XOR, RS, or FILE.
SCR_CACHE_BASE | /dev/shm | Specify the default base directory SCR should use to cache datasets.
SCR_CACHE_SIZE | 1 | Set to a non-negative integer to specify the maximum number of checkpoints SCR should keep in cache. SCR will delete the oldest checkpoint from cache before saving another in order to keep the total count below this limit.
SCR_CACHE_BYPASS | 1 | Specify bypass mode. When enabled, data files are directly read from and written to the parallel file system, bypassing the cache. Even in bypass mode, internal SCR metadata corresponding to the dataset is stored in cache. Set to 0 to direct SCR to store datasets in cache.
SCR_CACHE_PURGE | 0 | Whether to delete all datasets from cache during SCR_Init.
SCR_SET_SIZE | 8 | Specify the minimum number of processes to include in a redundancy set. So long as there are sufficient failure groups, each redundancy set will be at least the minimum size. If not, redundancy sets will be as large as possible, but they may be smaller than the minimum size. Increasing this value can decrease the amount of storage required to cache the dataset. However, a higher value can require more time to rebuild lost files, and it increases the likelihood of encountering a catastrophic failure.
SCR_SET_FAILURES | 2 | Specify the number of failures to tolerate in each set while using the RS scheme. Increasing this value enables one to tolerate more failures per set, but it increases redundancy storage and encoding costs.
SCR_PREFIX | $PWD | Specify the prefix directory on the parallel file system where datasets should be read from and written to.
SCR_PREFIX_SIZE | 0 | Specify number of checkpoints to keep in the prefix directory. SCR deletes older checkpoints as new checkpoints are flushed to maintain a sliding window of the specified size. Set to 0 to keep all checkpoints. Checkpoints marked with SCR_FLAG_OUTPUT are not deleted.
SCR_PREFIX_PURGE | 0 | Set to 1 to delete all datasets from the prefix directory (both checkpoint and output) during SCR_Init.
SCR_CURRENT | N/A | Name of checkpoint to mark as current and attempt to load during SCR_Init in a new run.
SCR_DISTRIBUTE | 1 | Set to 0 to disable cache rebuild during SCR_Init.
SCR_FETCH | 1 | Set to 0 to disable SCR from fetching files from the parallel file system during SCR_Init.
SCR_FETCH_BYPASS | 0 | Set to 1 to read files directly from the parallel file system during fetch.
SCR_FETCH_WIDTH | 256 | Specify the number of processes that may read simultaneously from the parallel file system.
SCR_FLUSH | 10 | Specify the number of checkpoints between periodic flushes to the parallel file system. Set to 0 to disable periodic flushes.
SCR_FLUSH_ASYNC | 0 | Set to 1 to enable asynchronous flush methods (if supported).
SCR_FLUSH_POSTSTAGE | 0 | Set to 1 to finalize asynchronous flushes using the scr_poststage script, rather than in SCR_Finalize(). This can be used to start a checkpoint flush near the end of a job and have it run "in the background" after the job finishes. This is currently only supported by the IBM Burst Buffer API (BBAPI). To use this, one must specify scr_poststage as the 2nd-half post-stage script in bsub to finalize the transfers. See examples/test_scr_poststage for a detailed example.
SCR_FLUSH_TYPE | SYNC | Specify the flush transfer method. Set to one of: SYNC, PTHREAD, or BBAPI.
SCR_FLUSH_WIDTH | 256 | Specify the number of processes that may write simultaneously to the parallel file system.
SCR_FLUSH_ON_RESTART | 0 | Set to 1 to force SCR to flush datasets during restart. This is useful for applications that restart without using the SCR Restart API. Typically, one should also set SCR_FETCH=0 when enabling this option.
SCR_GLOBAL_RESTART | 0 | Set to 1 to flush checkpoints to and restart from the prefix directory during SCR_Init.
SCR_RUNS | 1 | Specify the maximum number of times the scr_srun command should attempt to run a job within an allocation.
SCR_MIN_NODES | N/A | Specify the minimum number of nodes required to run a job.
SCR_EXCLUDE_NODES | N/A | Specify a set of nodes, using SLURM node range syntax, which should be excluded from runs. This is useful to avoid particular problematic nodes. Nodes named in this list that are not part of the current job allocation are silently ignored.
SCR_LOG_ENABLE | 0 | Whether to enable any form of logging of SCR events.
SCR_LOG_TXT_ENABLE | 1 | Whether to log SCR events to a text file in the prefix directory at SCR_PREFIX/.scr/log. SCR_LOG_ENABLE must also be set to 1 for this logging to be active.
SCR_LOG_SYSLOG_ENABLE | 1 | Whether to log SCR events to syslog. SCR_LOG_ENABLE must also be set to 1 for this logging to be active.
SCR_LOG_DB_ENABLE | 0 | Whether to log SCR events to MySQL database. SCR_LOG_ENABLE must also be set to 1 for this logging to be active.
SCR_LOG_DB_DEBUG | 0 | Whether to print MySQL statements as they are executed.
SCR_LOG_DB_HOST | N/A | Hostname of MySQL server.
SCR_LOG_DB_NAME | N/A | Name of SCR MySQL database.
SCR_LOG_DB_USER | N/A | Username of SCR MySQL user.
SCR_LOG_DB_PASS | N/A | Password for SCR MySQL user.
SCR_MPI_BUF_SIZE | 131072 | Specify the number of bytes to use for internal MPI send and receive buffers when computing redundancy data or rebuilding lost files.
SCR_FILE_BUF_SIZE | 1048576 | Specify the number of bytes to use for internal buffers when copying files between the parallel file system and the cache.
SCR_WATCHDOG_TIMEOUT | N/A | Set to the expected time (seconds) for checkpoint writes to in-system storage (see Catch a hanging job).
SCR_WATCHDOG_TIMEOUT_PFS | N/A | Set to the expected time (seconds) for checkpoint writes to the parallel file system (see Catch a hanging job).