Scalable Checkpoint / Restart (SCR) User Guide¶
The Scalable Checkpoint / Restart (SCR) library enables MPI applications to utilize distributed storage on Linux clusters to attain high file I/O bandwidth for checkpointing, restarting, and writing large datasets. With SCR, jobs run more efficiently, recompute less work upon a failure, and reduce load on shared resources like the parallel file system. It provides the most benefit to large-scale jobs that write large datasets. SCR utilizes tiered storage in a cluster to provide applications with the following capabilities:
- guidance for the optimal checkpoint frequency,
- scalable checkpoint bandwidth,
- scalable restart bandwidth,
- scalable output bandwidth,
- asynchronous data transfers to the parallel file system.
SCR originated as a production-level implementation of a multi-level checkpoint system of the type analyzed by [Vaidya]. SCR caches checkpoints in scalable storage, which is faster but less reliable than the parallel file system. It applies a redundancy scheme to the cache such that checkpoints can be recovered after common system failures. It also copies a subset of checkpoints to the parallel file system to recover from less common but more severe failures. In many failure cases, a job can be restarted from a checkpoint in cache, and writing and reading datasets in cache can be orders of magnitude faster than the parallel file system.

Aggregate write bandwidth on Coastal
When writing a cached dataset to the parallel file system, SCR can transfer data asynchronously. The application may continue once the data has been written to the cache while SCR copies the data to the parallel file system in the background. SCR supports general output datasets in addition to checkpoint datasets.
SCR consists of two components: a library and a set of commands. The application registers its dataset files with the SCR API, and the library maintains the dataset cache. The SCR commands are typically invoked from the job batch script. They are used to prepare the cache before a job starts, automate the process of restarting a job, and copy datasets from cache to the parallel file system upon a failure.
[Vaidya] “A Case for Two-Level Recovery Schemes”, Nitin H. Vaidya, IEEE Transactions on Computers, 1998, http://doi.ieeecomputersociety.org/10.1109/12.689645.
Support¶
The main repository for SCR is located at:
https://github.com/LLNL/scr
From this site, you can download the source code and manuals for the current release of SCR.
For information about the project including active research efforts, please visit:
https://computation.llnl.gov/project/scr
To contact the developers of SCR for help with using or porting SCR, please visit:
https://computation.llnl.gov/project/scr/contact.php
There you will find links to join our discussion mailing list for help topics, and our announcement list for getting notifications of new SCR releases.
Quick Start¶
In this quick start guide, we assume that you already have a basic understanding of SCR and how it works on HPC systems. We will walk through a bare bones example to get you started quickly. For more in-depth information, please see subsequent sections in this user’s guide.
Obtaining the SCR Source¶
The latest version of the SCR source code is kept on GitHub: https://github.com/LLNL/scr. You can clone the repository or download a release tarball.
Building SCR¶
SCR has several dependencies. A C compiler, MPI, CMake, and pdsh are required; the others are optional, and when they are not available, some features of SCR may not be available. SCR uses the standard mpicc compiler wrapper in its build, so you will need to have it in your PATH. We assume the minimum set of dependencies in this quick start guide, which can be obtained automatically with either Spack or CMake. For more help on installing dependencies of SCR or building SCR, please see Section Build SCR.
Spack¶
The most automated way to build SCR is to use the Spack package manager (https://github.com/spack/spack). SCR and all of its dependencies exist in a Spack package. After downloading Spack, simply type:
spack install scr
This will download and install SCR and its dependencies automatically.
CMake¶
To get started with CMake (version 2.8 or higher), the quick version of building SCR is:
git clone git@github.com:llnl/scr.git
mkdir build
mkdir install
cd build
cmake -DCMAKE_INSTALL_PREFIX=../install ../scr
make
make install
make test
Since pdsh is required, the WITH_PDSH_PREFIX option should be passed to CMake if pdsh is installed in a non-standard location. On most systems, MPI should be detected automatically.
Building the SCR test_api Example¶
After installing SCR, go to the installation directory, <install dir> above. In the <install dir>/share/scr/examples directory you will find the example programs supplied with SCR. For this quick start guide, we will use the test_api program. Build it by executing:
make test_api
Upon successful build, you will have an executable in your directory called test_api. You can use this test program to get a feel for how SCR works and to ensure that your build of SCR is working.
Running the SCR test_api Example¶
A quick test of your SCR installation can be done by setting a few environment variables in an interactive job allocation. The following assumes you are running on a SLURM-based system. If you are not using SLURM, then you will need to modify the allocation and run commands according to the resource manager you are using.
First, obtain a few compute nodes for testing. Here we will allocate 4 compute nodes on a system with a debug queue called pdebug:
salloc -N 4 -p pdebug
Once you have the four compute nodes, you can experiment with SCR using the test_api program. First, set a few environment variables. We're using csh in this example; you'll need to update the commands if you are using a different shell:
# make sure the SCR library is in your library path
setenv LD_LIBRARY_PATH ${SCR_INSTALL}/lib
# tell SCR to not flush to the parallel file system periodically
setenv SCR_FLUSH 0
Now, we can run a simple test to see if your SCR installation is working. Here we’ll run a 4-process run on 4 nodes:
srun -n4 -N4 ./test_api
Assuming all goes well, you should see output similar to the following:
>>: srun -N 4 -n 4 ./test_api
Init: Min 0.033856 s Max 0.033857 s Avg 0.033856 s
No checkpoint to restart from
At least one rank (perhaps all) did not find its checkpoint
Completed checkpoint 1.
Completed checkpoint 2.
Completed checkpoint 3.
Completed checkpoint 4.
Completed checkpoint 5.
Completed checkpoint 6.
FileIO: Min 52.38 MB/s Max 52.39 MB/s Avg 52.39 MB/s Agg 209.55 MB/s
If you did not see output similar to this, there is likely a problem with your environment setup or your build of SCR. Please see the detailed sections of this user guide for more help, or email us (see the Support and Contacts section of this user guide).
If you want to get into more depth, in the SCR source directory you will find a directory called testing. In this directory, there are various scripts we use for testing our code. Perhaps the most useful for getting started are the TESTING.csh or TESTING.sh files, depending on your shell preference.
Getting SCR into Your Application¶
Here we give a simple example of integrating SCR into an application to write checkpoints. Further sections in the user guide give more details and demonstrate how to perform restart with SCR. You can also look at the source of the test_api program and other programs in the examples directory.
#include "mpi.h"
#include "scr.h"

void checkpoint(void);

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);

  /* Call SCR_Init after MPI_Init */
  SCR_Init();

  for (int t = 0; t < TIMESTEPS; t++) {
    /* ... Do work ... */

    /* Ask SCR if we should take a checkpoint now */
    int flag;
    SCR_Need_checkpoint(&flag);
    if (flag)
      checkpoint();
  }

  /* Call SCR_Finalize before MPI_Finalize */
  SCR_Finalize();

  MPI_Finalize();
  return 0;
}
void checkpoint() {
  /* Tell SCR that you are getting ready to start a checkpoint phase */
  SCR_Start_checkpoint();

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* create your checkpoint file name */
  char file[256];
  sprintf(file, "rank_%d.ckpt", rank);

  /* Call SCR_Route_file to request a new file name (scr_file) that will cause
   * your application to write the file to a fast tier of storage, e.g.,
   * a burst buffer */
  char scr_file[SCR_MAX_FILENAME];
  SCR_Route_file(file, scr_file);

  /* Use the new file name to perform your checkpoint I/O */
  FILE* fs = fopen(scr_file, "w");
  if (fs != NULL) {
    fwrite(state, ..., fs);
    fclose(fs);
  }

  /* Tell SCR that you are done with your checkpoint phase */
  SCR_Complete_checkpoint(1);
  return;
}
Final Thoughts¶
This was a really quick introduction to building and running with SCR. For more information, please look at the more detailed sections in the rest of this user guide or contact us with questions.
Assumptions¶
A number of assumptions are made in the SCR implementation. If any of these assumptions do not hold for a particular application, that application cannot use SCR. If this is the case, or if you have any questions, please notify the SCR developers. The goal is to expand the implementation to support a large number of applications.
- The code must be an MPI application.
- The code must read and write datasets as a file per process in a globally-coordinated fashion.
- A process having a particular MPI rank is only guaranteed access to its own dataset files, i.e., a process of a given MPI rank may not access dataset files written by a process having a different MPI rank within the same run or across different runs.
- To use the scalable restart capability, a job must be restarted with the same number of processes as used to write the checkpoint, and each process must only access the files it wrote during the checkpoint. Note that this may limit the effectiveness of the library for codes that are capable of restarting from a checkpoint with a different number of processes than were used to write the checkpoint. Such codes can often still benefit from the scalable checkpoint capability, but not the scalable restart – they must fall back to restarting from the parallel file system.
- It must be possible to store the dataset files from all processes in the same directory. In particular, all files belonging to a given dataset must have distinct names.
- SCR maintains a set of meta data files, which it stores in a subdirectory of the directory that contains the application dataset files. The application must allow for these SCR meta data files to coexist with its own files.
- Files cannot contain data that span multiple datasets. In particular, there is no support for appending data of the current dataset to a file containing data from a previous dataset. Each dataset must be self-contained.
- On some systems, datasets are cached in RAM disk. This restricts usage of SCR on those machines to applications whose memory footprint leaves sufficient room to store the dataset files in memory simultaneously with the running application. The amount of storage needed depends on the number of cached datasets and the redundancy scheme used. See Section Scalable checkpoint for details.
- SCR occasionally flushes files from cache to the parallel file system. All files must reside under a top-level directory on the parallel file system called the “prefix” directory that is specified by the application. Under that prefix directory, the application may use file and subdirectory trees. One constraint is that no two datasets can write to the same file. See Section Control, cache, and prefix directories for details.
- Time limits should be imposed so that the SCR library has sufficient time to flush files from cache to the parallel file system before the resource allocation expires. Additionally, care should be taken so that the run does not stop in the middle of a checkpoint. See Section Halt a job for details.
Concepts¶
This section discusses concepts one should understand about the SCR library implementation including how it interacts with file systems.
Jobs, allocations, and runs¶
A large-scale simulation often must be restarted multiple times in order to run to completion.
It may be interrupted due to a failure,
or it may be interrupted due to time limits imposed by the resource scheduler.
We use the term allocation to refer to an assigned set of compute resources
that are available to the user for a period of time.
A resource manager typically assigns an identifier to each resource allocation,
which we refer to as the allocation id.
SCR uses the allocation id in some directory and file names.
Within an allocation, a user may execute a simulation one or more times.
We call each execution a run.
For MPI applications, each run corresponds to a single invocation of mpirun or its equivalent.
Finally, multiple allocations may be required to complete a given simulation.
We refer to this series of one or more allocations as a job.
To summarize,
one or more runs occur within an allocation,
and one or more allocations occur within a job.
Group, store, and redundancy descriptors¶
The SCR library must group processes of the parallel job in various ways. For example, if power supply failures are common, it is necessary to identify the set of processes that share a power supply. Similarly, it is necessary to identify all processes that can access a given storage device, such as an SSD mounted on a compute node. To represent these groups, the SCR library uses a group descriptor. Details of group descriptors are given in Section Group, store, and checkpoint descriptors.
Each group is given a unique name.
The library creates two groups by default: NODE and WORLD. The NODE group consists of all processes on the same compute node, and WORLD consists of all processes in the run. The user or system administrator can create additional groups via configuration files (Section Configure a job).
The SCR library must also track details about each class of storage it can access. For each available storage class, SCR needs to know the associated directory prefix, the group of processes that share a device, the capacity of the device, and other details like whether the associated file system can support directories. SCR tracks this information in a store descriptor. Each store descriptor refers to a group descriptor, which specifies how processes are grouped with respect to that class of storage. For a given storage class, it is assumed that all compute nodes refer to the class using the same directory prefix. Each store descriptor is referenced by its directory prefix.
The library creates one store descriptor by default: /tmp. The assumption is made that /tmp is mounted as a local file system on each compute node. On Linux clusters, /tmp is often RAM disk or a local hard drive. Additional store descriptors can be defined by the user or system administrator in configuration files (Section Configure a job).
Finally, SCR defines redundancy descriptors to associate a redundancy scheme with a class of storage devices and a group of processes that are likely to fail at the same time. It also tracks details about the particular redundancy scheme used, and the frequency with which it should be applied. Redundancy descriptors reference both store and group descriptors.
The library creates a default redundancy descriptor. It assumes that processes on the same node are likely to fail at the same time. It also assumes that datasets can be cached in /tmp, which is assumed to be storage local to each compute node. It applies an XOR redundancy scheme using a group size of 8. Additional redundancy descriptors may be defined by the user or system administrator in configuration files (Section Configure a job).
Control, cache, and prefix directories¶
SCR manages numerous files and directories to cache datasets and to record its internal state. There are three fundamental types of directories: control, cache, and prefix directories. For a detailed illustration of how these files and directories are arranged, see the example presented in Section Example of SCR files and directories.
The control directory is where SCR writes files to store internal state about the current run. This directory is expected to be stored in node-local storage. SCR writes multiple, small files in the control directory, and it may access these files frequently. It is best to configure this directory to be stored in a node-local RAM disk.
To construct the full path of the control directory, SCR incorporates a control base directory name along with the user name and allocation id associated with the resource allocation. This enables multiple users, or multiple jobs by the same user, to run at the same time without conflicting for the same control directory. The control base directory is hard-coded into the SCR library at configure time, but this value may be overridden via a system configuration file. The user may not change the control base directory.
SCR directs the application to write dataset files to subdirectories within a cache directory. SCR also stores its redundancy data in these subdirectories. The device serving the cache directory must be large enough to hold the data for one or more datasets plus the associated redundancy data. Multiple cache directories may be utilized in the same run, which enables SCR to use more than one class of storage within a run (e.g., RAM disk and SSD). Cache directories should be located on scalable storage.
To construct the full path of a cache directory, SCR incorporates a cache base directory name with the user name and the allocation id associated with the resource allocation. A set of valid cache base directories is hard-coded into the SCR library at configure time, but this set can be overridden in a system configuration file. Out of this set, the user may select a subset of cache base directories to use during a run. A cache directory may be the same as the control directory.
The user must configure the maximum number of datasets that SCR should keep in each cache directory. It is up to the user to ensure that the capacity of the device associated with the cache directory is large enough to hold the specified number of datasets.
SCR refers to each application checkpoint or output set as a dataset. SCR assigns a unique sequence number to each dataset called the dataset id. It assigns dataset ids starting from 1 and counts up with each successive dataset written by the application. Within a cache directory, a dataset is written to its own subdirectory called the dataset directory.
Finally, the prefix directory is a directory on the parallel file system that the user specifies. SCR copies datasets to the prefix directory for permanent storage (Section Fetch, flush, and scavenge). The prefix directory should be accessible from all compute nodes, and the user must ensure that the prefix directory is unique for each job. For each dataset stored in the prefix directory, SCR creates and manages a dataset directory. The dataset directory holds all SCR redundancy files and meta data associated with a particular dataset. SCR maintains an index file within the prefix directory, which records information about each dataset stored there.
Note that the term “dataset directory” is overloaded. In some cases, we use this term to refer to a directory in cache and in other cases we use the term to refer to a directory within the prefix directory on the parallel file system. In any particular case, the meaning should be clear from the context.
Example of SCR files and directories¶
To illustrate how files and directories are arranged in SCR, consider the example shown in Figure Example SCR directories. In this example, a user named “user1” runs a 4-task MPI job with one task per compute node. The base directory for the control directory is /tmp, the base directory for the cache directory is /ssd, and the prefix directory is /p/lscratchb/user1/simulation123. The control and cache directories are storage devices local to the compute node.

Example SCR directories
The full path of the control directory is /tmp/user1/scr.1145655. This is derived from the concatenation of the control base directory (/tmp), the user name (user1), and the allocation id (1145655). SCR keeps files to persist its internal state in the control directory, including filemap files as shown.
Similarly, the cache directory is /ssd/user1/scr.1145655, which is derived from the concatenation of the cache base directory (/ssd), the user name (user1), and the allocation id (1145655). Within the cache directory, SCR creates a subdirectory for each dataset. In this example, there are two datasets with ids 17 and 18. The application dataset files and SCR redundancy files are stored within their corresponding dataset directory. On the node running MPI rank 0, there is one application dataset file (rank_0.ckpt) and one XOR redundancy data file (1_of_4_in_0.xor).
Finally, the full path of the prefix directory is /p/lscratchb/user1/simulation123. This is a path on the parallel file system that is specified by the user. It is unique to the particular simulation the user is running (simulation123). The prefix directory contains dataset directories. It also contains a hidden .scr directory where SCR writes the index file to record info for each of the datasets (Section Manage datasets). The SCR library writes other files to this hidden directory, including the “halt” file (Section Halt a job). While the user provides the prefix directory, SCR defines the name of each dataset directory to be “scr.dataset.<id>”, where <id> is the dataset id. In this example, there are multiple datasets stored on the parallel file system corresponding to dataset ids 10, 12, and 18. Within each dataset directory, SCR stores the files written by the application. SCR also creates a hidden .scr subdirectory, and this hidden directory contains redundancy files and other SCR files that are specific to the dataset.
Scalable checkpoint¶
In practice, it is common for multiple processes to fail at the same time, but most often this happens because those processes depend on a single, failed component. It is not common for multiple, independent components to fail simultaneously. By expressing the groups of processes that are likely to fail at the same time, the SCR library can apply redundancy schemes to withstand common, multi-process failures. We refer to a set of processes likely to fail at the same time as a failure group.
SCR must also know which groups of processes share a given storage device. This is useful so the group can coordinate its actions when accessing the device. For instance, if a common directory must be created before each process writes a file, a single process can create the directory and then notify the others. We refer to a set of processes that share a storage device as a storage group.
Users and system administrators can pass information about failure and storage groups to SCR in descriptors defined in configuration files (See Section Group, store, and checkpoint descriptors). Given this knowledge of failure and storage groups, the SCR library implements three redundancy schemes which trade off performance, storage space, and reliability:
- Single: each checkpoint file is written to storage accessible to the local process.
- Partner: each checkpoint file is written to storage accessible to the local process, and a full copy of each file is written to storage accessible to a partner process from another failure group.
- XOR: each checkpoint file is written to storage accessible to the local process, XOR parity data are computed from checkpoints of a set of processes from different failure groups, and the parity data are stored among the set.
With Single, SCR writes each checkpoint file in storage accessible to the local process. It requires sufficient space to store the maximum checkpoint file size. This scheme is fast, but it cannot withstand failures that disable the storage device. For instance, when using node-local storage, this scheme cannot withstand failures that disable the node, such as when a node loses power or its network connection. However, it can withstand failures that kill the application processes but leave the node intact, such as application bugs and file I/O errors.
With Partner, SCR writes checkpoint files to storage accessible to the local process, and it also copies each checkpoint file to storage accessible to a partner process from another failure group. This scheme is slower than Single, and it requires twice the storage space. However, it is capable of withstanding failures that disable a storage device. In fact, it can withstand failures of multiple devices, so long as a device and the device holding the copy do not fail simultaneously.
With XOR, SCR defines sets of processes where members within a set are selected from different failure groups. The processes within a set collectively compute XOR parity data, which is stored in files alongside the application checkpoint files. This algorithm is based on the work found in [Gropp], which in turn was inspired by RAID5 [Patterson]. This scheme can withstand multiple failures so long as two processes from the same set do not fail simultaneously.
Computationally, XOR is more expensive than Partner, but it requires less storage space. Whereas Partner must store two full checkpoint files, XOR stores one full checkpoint file plus one XOR parity segment, where the segment size is roughly \(1/(N-1)\) times the size of a checkpoint file for a set of size N. Larger sets demand less storage, but they also increase the probability that two processes in the same set will fail simultaneously. Larger sets may also increase the cost of recovering files in the event of a failure.
[Patterson] “A Case for Redundant Arrays of Inexpensive Disks (RAID)”, D. Patterson, G. Gibson, and R. Katz, Proc. of 1988 ACM SIGMOD Conf. on Management of Data, 1988, http://web.mit.edu/6.033/2015/wwwdocs/papers/Patterson88.pdf.
[Gropp] “Providing Efficient I/O Redundancy in MPI Environments”, William Gropp, Robert Ross, and Neill Miller, Lecture Notes in Computer Science, 3241:77-86, September 2004. 11th European PVM/MPI Users Group Meeting, 2004, http://www.mcs.anl.gov/papers/P1178.pdf.
Scalable restart¶
So long as a failure does not violate the redundancy scheme, a job can restart within the same resource allocation using the cached checkpoint files. This saves the cost of writing checkpoint files out to the parallel file system only to read them back during the restart. In addition, SCR provides support for the use of spare nodes. A job can allocate more nodes than it needs and use the extra nodes to fill in for any failed nodes during a restart. SCR includes a set of scripts which encode much of the restart logic (Section Run a job).
Upon encountering a failure, SCR relies on the MPI library, the resource manager, or some other external service to kill the current run. After the run is killed, and if there are sufficient healthy nodes remaining, the same job can be restarted within the same allocation. In practice, such a restart typically amounts to issuing another “mpirun” in the job batch script.
Of the set of nodes used by the previous run, the restarted run should use as many of the same nodes as it can to maximize the number of files available in cache. A given MPI rank in the restarted run does not need to run on the same node that it ran on in the previous run. SCR distributes cached files among processes according to the process mapping of the restarted run.
By default, SCR inspects the cache for existing checkpoints when a job starts. It attempts to rebuild all datasets in cache, and then it attempts to restart the job from the most recent checkpoint. If a checkpoint fails to rebuild, SCR deletes it from cache. To disable restarting from cache, set the SCR_DISTRIBUTE parameter to 0. When disabled, SCR deletes all files from cache and restarts from a checkpoint on the parallel file system.
An example restart scenario is illustrated in Figure Scalable restart, in which a 4-node job using the Partner scheme allocates 5 nodes and successfully restarts within the allocation after a node fails.

Example restart after a failed node with Partner
Catastrophic failures¶
There are some failures from which the SCR library cannot recover. In such cases, the application is forced to fall back to the latest checkpoint successfully written to the parallel file system. Such catastrophic failures include the following:
- Multiple node failure which violates the redundancy scheme. If multiple nodes fail in a pattern which violates the cache redundancy scheme, data are irretrievably lost.
- Failure during a checkpoint. Due to cache size limitations, some applications can only fit one checkpoint in cache at a time. For such cases, a failure may occur after the library has deleted the previous checkpoint but before the next checkpoint has completed. In this case, there is no valid checkpoint in cache to recover.
- Failure of the node running the job batch script. The logic at the end of the allocation to scavenge the latest checkpoint from cache to the parallel file system executes as part of the job batch script. If the node executing this script fails, the scavenge logic will not execute and the allocation will terminate without copying the latest checkpoint to the parallel file system.
- Parallel file system outage. If the application fails when writing output due to an outage of the parallel file system, the scavenge logic may also fail when it attempts to copy files to the parallel file system.
There are other catastrophic failure cases not listed here. Checkpoints must be written to the parallel file system with some moderate frequency so as not to lose too much work in the event of a catastrophic failure. Section Fetch, flush, and scavenge provides details on how to configure SCR to make occasional writes to the parallel file system.
By default, the current implementation stores only the most recent checkpoint in cache. One can change the number of checkpoints stored in cache by setting the SCR_CACHE_SIZE parameter. If space is available, it is recommended to increase this value to at least 2.
Fetch, flush, and scavenge¶
SCR manages the transfer of datasets between the prefix directory on the parallel file system and the cache. We use the term fetch to refer to the action of copying a dataset from the parallel file system to cache. When transferring data in the other direction, there are two terms used: flush and scavenge. Under normal circumstances, the library directly copies files from cache to the parallel file system, and this direct transfer is known as a flush. However, sometimes a run is killed before the library can complete this transfer. In these cases, a set of SCR commands is executed after the final run to ensure that the latest checkpoint is copied to the parallel file system before the allocation expires. We say that these scripts scavenge the latest checkpoint.
Each time an SCR job starts, SCR first inspects the cache and attempts to distribute files for a scalable restart as discussed in Section Scalable restart. If the cache is empty or the distribute operation fails or is disabled, SCR attempts to fetch a checkpoint from the prefix directory to fill the cache. SCR reads the index file and attempts to fetch the most recent checkpoint, or otherwise the checkpoint that is marked as current within the index file. For a given checkpoint, SCR records whether the fetch attempt succeeds or fails in the index file. SCR does not attempt to fetch a checkpoint that is marked as being incomplete nor does it attempt to fetch a checkpoint for which a previous fetch attempt has failed. If SCR attempts but fails to fetch a checkpoint, it prints an error and continues the run.
To disable the fetch operation, set the SCR_FETCH parameter to 0. If an application disables the fetch feature, the application is responsible for reading its checkpoint set directly from the parallel file system upon a restart.
To withstand catastrophic failures, it is necessary to write checkpoint sets out to the parallel file system with some moderate frequency. In the current implementation, the SCR library writes a checkpoint set out to the parallel file system after every 10 checkpoints. This frequency can be configured by setting the SCR_FLUSH parameter. When this parameter is set, SCR decrements a counter with each successful checkpoint. When the counter hits 0, SCR writes the current checkpoint set out to the file system and resets the counter to the value specified in SCR_FLUSH. SCR preserves this counter between scalable restarts, and when used in conjunction with SCR_FETCH, it also preserves this counter between fetch and flush operations such that it is possible to maintain periodic checkpoint writes across runs. Set SCR_FLUSH to 0 to disable periodic writes in SCR. If an application disables the periodic flush feature, the application is responsible for writing occasional checkpoint sets to the parallel file system.
By default, SCR computes and stores a CRC32 checksum value for each checkpoint file during a flush. It then uses the checksum to verify the integrity of each file as it is read back into cache during a fetch. If data corruption is detected, SCR falls back to fetch an earlier checkpoint set. To disable this checksum feature, set the SCR_CRC_ON_FLUSH parameter to 0.
Build SCR¶
Dependencies¶
SCR has several dependencies. A C compiler, MPI, CMake, and pdsh are required dependencies. The others are optional, and when they are not available some features of SCR may not be available.
- CMake, Version 2.8+
- Compiler (C, C++, and Fortran)
- MPI
- pdsh (https://github.com/grondo/pdsh)
- DTCMP (optional, used for user-defined directory structure feature) (https://github.com/llnl/dtcmp)
- LWGRP (optional, used for user-defined directory structure feature) (https://github.com/LLNL/lwgrp)
- libyogrt (optional, used for determining length of time left in the current allocation) (https://github.com/llnl/libyogrt)
- MySQL (optional, used for logging SCR activities)
Spack¶
The most automated way to build SCR is to use the Spack package manager (https://github.com/spack/spack). SCR and all of its dependencies exist in a Spack package. After downloading Spack, simply type:
spack install scr
This will install the DTCMP, LWGRP, and pdsh packages (and possibly an MPI and a C compiler if needed).
CMake¶
To get started with CMake (version 2.8 or higher), the quick version of building SCR is:
git clone git@github.com:llnl/scr.git
mkdir build
mkdir install
cd build
cmake -DCMAKE_INSTALL_PREFIX=../install ../scr
make
make install
make test
Some useful CMake command line options are:
- -DCMAKE_INSTALL_PREFIX=[path] : Place to install the SCR library
- -DCMAKE_BUILD_TYPE=[Debug/Release] : Build with debugging or optimizations
- -DBUILD_PDSH=[OFF/ON] : CMake can automatically download and build the PDSH dependency
- -DWITH_PDSH_PREFIX=[path to PDSH] : Path to an existing PDSH installation (should not be used with BUILD_PDSH)
- -DWITH_DTCMP_PREFIX=[path to DTCMP]
- -DWITH_YOGRT_PREFIX=[path to YOGRT]
- -DSCR_ASYNC_API=[CRAY_DW/INTEL_CPPR/IBM_BBAPI/NONE]
- -DSCR_RESOURCE_MANAGER=[SLURM/APRUN/PMIX/LSF/NONE]
- -DSCR_CNTL_BASE=[path] : Path to SCR control directory, defaults to /tmp
- -DSCR_CACHE_BASE=[path] : Path to SCR cache directory, defaults to /tmp
- -DSCR_CONFIG_FILE=[path] : Path to SCR system configuration file, defaults to /etc/scr/scr.conf
To change the SCR control directory, one must either set -DSCR_CNTL_BASE at build time or specify SCR_CNTL_BASE in the SCR system configuration file. These paths are hardcoded into the SCR library and scripts during the build process. It is not possible to specify the control directory through environment variables or the user configuration file.
Unlike the control directory, the SCR cache directory can be specified at run time through either environment variables or the user configuration file.
SCR API¶
SCR is designed to support MPI applications that write application-level checkpoints and output datasets. Both types of datasets (checkpoints and output) must be stored as a file-per-process, and they must be accessed in a globally-coordinated fashion. In a given dataset, each process may actually write zero or more files, but the current implementation assumes that each process writes roughly the same amount of data.
Parallel file systems allow any process in an MPI job to read/write any byte of a file at any time. However, most applications do not require this full generality. SCR supplies API calls that enable the application to specify limits on its data access in both time and space. Start and complete calls indicate when an application needs to write or read its data. Data cannot be accessed outside of these markers. Additionally, each MPI process may only read files written by itself or another process having the same MPI rank in a previous run. An MPI process cannot read files written by a process having a different MPI rank.
The API is designed to be simple, scalable, and portable. It consists of a small number of function calls to wrap existing application I/O logic. Unless otherwise stated, SCR functions are collective, meaning all processes must call the function synchronously. The underlying implementation may or may not be synchronous, but to be portable, an application must treat a collective call as though it is synchronous. This constraint enables the SCR implementation to utilize the full resources of the job in a collective manner to optimize performance at critical points such as computing redundancy data.
In the sections below, we show the function prototypes for C and Fortran, respectively. Applications written in C should include “scr.h”, and Fortran applications should include “scrf.h”. All functions return SCR_SUCCESS if successful.
General API¶
SCR_Init¶
int SCR_Init();
SCR_INIT(IERROR)
INTEGER IERROR
Initialize the SCR library. This function must be called after MPI_Init, and it is good practice to call it immediately after MPI_Init. A process should only call SCR_Init once during its execution. No other SCR calls are valid until a process has returned from SCR_Init.
SCR_Finalize¶
int SCR_Finalize();
SCR_FINALIZE(IERROR)
INTEGER IERROR
Shut down the SCR library. This function must be called before MPI_Finalize, and it is good practice to call it just before MPI_Finalize. A process should only call SCR_Finalize once during its execution.
If SCR_FLUSH is enabled, SCR_Finalize flushes any datasets to the prefix directory if necessary. It updates the halt file to indicate that SCR_Finalize has been called. This halt condition prevents the job from restarting (Section Halt a job).
SCR_Get_version¶
char* SCR_Get_version(void);
SCR_GET_VERSION(VERSION, IERROR)
CHARACTER*(*) VERSION
INTEGER IERROR
This function returns a string that indicates the version number of SCR that is currently in use.
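For example, a minimal sketch of how an application might log the library version at startup; the printf and rank check are illustrative, not part of SCR:
  /* illustrative only: print the SCR version from rank 0 after SCR_Init */
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    printf("Using SCR version %s\n", SCR_Get_version());
  }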
SCR_Should_exit¶
int SCR_Should_exit(int* flag);
SCR_SHOULD_EXIT(FLAG, IERROR)
INTEGER FLAG, IERROR
SCR_Should_exit provides a portable way for an application to determine whether it should halt its execution. This function is passed a pointer to an integer in flag. Upon returning from SCR_Should_exit, flag is set to the value 1 if the application should stop, and it is set to 0 otherwise. The call returns the same value in flag on all processes. It is recommended to call this function after each checkpoint. Since datasets in cache may be deleted by the system at the end of an allocation, it is critical for a job to stop early enough to leave time to copy datasets from cache to the parallel file system before the allocation expires.
By default, the SCR library automatically calls exit at certain points. This works especially well in conjunction with the SCR_HALT_SECONDS parameter. However, this default behavior does not provide the application a chance to exit cleanly. SCR can be configured to avoid an automatic exit using the SCR_HALT_ENABLED parameter. This call also enables a running application to react to external commands. For instance, if the application has been instructed to halt using the scr_halt command, then SCR_Should_exit relays that information.
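As a sketch (the loop structure and variable names are illustrative), an application might check this flag after writing each checkpoint and leave its main loop when SCR indicates it is time to stop:
  /* illustrative sketch: poll SCR after each checkpoint and exit the loop cleanly */
  for (int t = 0; t < TIMESTEPS; t++) {
    /* ... do work and write a checkpoint ... */

    int should_exit = 0;
    SCR_Should_exit(&should_exit);
    if (should_exit) {
      /* leave the loop in time to flush datasets, then call SCR_Finalize */
      break;
    }
  }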
SCR_Route_file¶
int SCR_Route_file(const char* name, char* file);
SCR_ROUTE_FILE(NAME, FILE, IERROR)
CHARACTER*(*) NAME, FILE
INTEGER IERROR
When files are under control of SCR, they may be written to or exist on different levels of the storage hierarchy at different points in time. For example, a checkpoint might be written first to the RAM disk of a compute node and then later transferred to a burst buffer or the parallel file system by SCR. In order for an application to discover where a file should be written to or read from, we provide the SCR_Route_file routine.
A process calls SCR_Route_file to obtain the full path and file name it must use to access a file under SCR control. The name of the file that the process intends to access must be passed in the name argument. A pointer to a character buffer of at least SCR_MAX_FILENAME bytes must be passed in file. When a call to SCR_Route_file returns, the full path and file name to access the file named in name is written to the buffer pointed to by file. The process must use the character string returned in file to access the file. A process does not need to create any directories listed in the string returned in file; the SCR implementation creates any necessary directories before returning from the call.
A call to SCR_Route_file is local to the calling process; it is not a collective call. As of version 1.2.2, SCR_Route_file can be successfully called at any point during application execution. If it is called outside of a Start/Complete pair, the original file path is simply copied to the return string.
SCR_Route_file has special behavior when called within a Start/Complete pair for restart, checkpoint, or output.
Within a restart operation, the input parameter name only requires a file name; no path component is needed. SCR will return a full path to the file from the most recent checkpoint having the same name. It will return an error if no file by that name exists.
Within checkpoint and output operations, the input parameter name also specifies the final path on the parallel file system. The caller may provide either absolute or relative path components in name. If the path is relative, SCR prepends the current working directory to name at the time that SCR_Route_file is called. With either an absolute or relative path, all paths must resolve to a location within the subtree rooted at the SCR prefix directory. In the current implementation, SCR only changes the directory portion of name. It extracts the base name of the file by removing any directory components in name. Then it prepends a directory to the base file name and returns the full path and file name in file.
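The following sketch contrasts the two call sites; the file names are illustrative, and the relative path in the second call assumes the working directory is the SCR prefix directory:
  char scr_path[SCR_MAX_FILENAME];

  /* illustrative sketch (between SCR_Start_restart and SCR_Complete_restart):
   * pass just the file name; SCR fills in the full path to the cached copy
   * from the checkpoint that was loaded */
  SCR_Route_file("rank_0.ckpt", scr_path);

  /* illustrative sketch (between SCR_Start_checkpoint or SCR_Start_output and
   * the matching Complete call): the name may include a path component that
   * must resolve under the SCR prefix directory, e.g. a relative path such as
   * "ckpt.10/rank_0.ckpt" when run from the prefix directory */
  SCR_Route_file("ckpt.10/rank_0.ckpt", scr_path);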
Checkpoint API¶
Here we describe the SCR API functions that are used for writing checkpoints.
SCR_Need_checkpoint¶
int SCR_Need_checkpoint(int* flag);
SCR_NEED_CHECKPOINT(FLAG, IERROR)
INTEGER FLAG, IERROR
Since the failure frequency and the cost of checkpointing vary across platforms, SCR_Need_checkpoint provides a portable way for an application to determine whether a checkpoint should be taken. This function is passed a pointer to an integer in flag. Upon returning from SCR_Need_checkpoint, flag is set to the value 1 if a checkpoint should be taken, and it is set to 0 otherwise. The call returns the same value in flag on all processes.
SCR_Start_checkpoint¶
int SCR_Start_checkpoint();
SCR_START_CHECKPOINT(IERROR)
INTEGER IERROR
Inform SCR that a new checkpoint is about to start. A process must call this function before it opens any files belonging to the new checkpoint. SCR_Start_checkpoint must be called by all processes, including processes that do not write files as part of the checkpoint. This function should be called as soon as possible when initiating a checkpoint. The SCR implementation uses this call as the starting point to time the cost of the checkpoint in order to optimize the checkpoint frequency via SCR_Need_checkpoint. Each call to SCR_Start_checkpoint must be followed by a corresponding call to SCR_Complete_checkpoint.
In the current implementation, SCR_Start_checkpoint holds all processes at an MPI_Barrier to ensure that all processes are ready to start the checkpoint before it deletes cached files from a previous checkpoint.
SCR_Complete_checkpoint¶
int SCR_Complete_checkpoint(int valid);
SCR_COMPLETE_CHECKPOINT(VALID, IERROR)
INTEGER VALID, IERROR
Inform SCR that all files for the current checkpoint are complete (i.e., done writing and closed) and whether they are valid (i.e., written without error). A process must close all checkpoint files before calling SCR_Complete_checkpoint. SCR_Complete_checkpoint must be called by all processes, including processes that did not write any files as part of the checkpoint.
The parameter valid should be set to 1 if either the calling process wrote all of its files successfully or it wrote no files during the checkpoint. Otherwise, the process should call SCR_Complete_checkpoint with valid set to 0. SCR will determine whether all processes wrote their checkpoint files successfully.
The SCR implementation uses this call as the stopping point to time the cost of the checkpoint that started with the preceding call to SCR_Start_checkpoint. Each call to SCR_Complete_checkpoint must be preceded by a corresponding call to SCR_Start_checkpoint.
In the current implementation, SCR applies the redundancy scheme during SCR_Complete_checkpoint. Before returning from the function, MPI rank 0 determines whether the job should be halted and signals this condition to all other ranks (Section Halt a job). If the job should be halted, rank 0 records a reason in the halt file, and then all tasks call exit, unless the auto exit feature is disabled.
Restart API¶
Here we describe the SCR API functions used for restarting applications.
SCR_Have_restart¶
int SCR_Have_restart(int* flag, char* name);
SCR_HAVE_RESTART(FLAG, NAME, IERROR)
INTEGER FLAG
CHARACTER*(*) NAME
INTEGER IERROR
This function indicates whether SCR has a checkpoint available for the application to read. This function is passed a pointer to an integer in flag. Upon returning from SCR_Have_restart, flag is set to the value 1 if a checkpoint is available, and it is set to 0 otherwise. The call returns the same value in flag on all processes.
A pointer to a character buffer of at least SCR_MAX_FILENAME bytes can be passed in name. If there is a checkpoint, and if that checkpoint was assigned a name when it was created, SCR_Have_restart returns the name of that checkpoint in name. The value returned in name is the same string that was passed to SCR_Start_output when the checkpoint was created. In C, one may optionally pass NULL to this function to avoid returning the name. The same value is returned in name on all processes.
SCR_Start_restart¶
int SCR_Start_restart(char* name);
SCR_START_RESTART(NAME, IERROR)
CHARACTER*(*) NAME
INTEGER IERROR
This function informs SCR that a restart operation is about to start. A process must call this function before it opens any files belonging to the restart. SCR_Start_restart must be called by all processes, including processes that do not read files as part of the restart.
SCR returns the name of the loaded checkpoint in name. A pointer to a character buffer of at least SCR_MAX_FILENAME bytes can be passed in name. The value returned in name is the same string that was passed to SCR_Start_output when the checkpoint was created. In C, one may optionally pass NULL to this function to avoid returning the name. The same value is returned in name on all processes.
One may only call SCR_Start_restart when SCR_Have_restart indicates that there is a checkpoint to read. SCR_Start_restart returns the same value in name as the preceding call to SCR_Have_restart. Each call to SCR_Start_restart must be followed by a corresponding call to SCR_Complete_restart.
SCR_Complete_restart¶
int SCR_Complete_restart(int valid);
SCR_COMPLETE_RESTART(VALID, IERROR)
INTEGER VALID, IERROR
This call informs SCR that the process has finished reading its checkpoint files. A process must close all restart files before calling SCR_Complete_restart. SCR_Complete_restart must be called by all processes, including processes that did not read any files as part of the restart.
The parameter valid should be set to 1 if either the calling process read all of its files successfully or it read no files during the restart. Otherwise, the process should call SCR_Complete_restart with valid set to 0. SCR will determine whether all processes read their checkpoint files successfully based on the values supplied in the valid parameter. If any process failed to read its checkpoint files, then SCR will abort.
Each call to SCR_Complete_restart must be preceded by a corresponding call to SCR_Start_restart.
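To show how these calls fit together, here is a minimal restart sketch, assuming scr.h, mpi.h, and stdio.h are included; the function name, file naming convention, and the elided read call are illustrative, not part of SCR:
void restart_with_scr(void) {
  /* ask SCR whether a checkpoint is available and get its name */
  int have_restart;
  char ckpt_name[SCR_MAX_FILENAME];
  SCR_Have_restart(&have_restart, ckpt_name);
  if (!have_restart) {
    return; /* no checkpoint available; start a new run */
  }

  /* begin the restart phase; all processes participate */
  SCR_Start_restart(ckpt_name);

  /* build the file name this rank wrote during the checkpoint */
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  char file[256];
  sprintf(file, "rank_%d.ckpt", rank);

  /* ask SCR where to read the file from */
  char scr_file[SCR_MAX_FILENAME];
  SCR_Route_file(file, scr_file);

  /* read the checkpoint and report success or failure to SCR */
  int valid = 1;
  FILE* fs = fopen(scr_file, "r");
  if (fs != NULL) {
    /* fread(state, ..., fs); */
    fclose(fs);
  } else {
    valid = 0;
  }
  SCR_Complete_restart(valid);
}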
Output API¶
As of SCR version 1.2.0, SCR has the ability to manage application output datasets in addition to checkpoint datasets. Using a combination of bit flags, a dataset can be designated as a checkpoint, output, or both. The checkpoint property means that the dataset can be used to restart the application. The output property means that the dataset must be written to the prefix directory. This enables an application to utilize asynchronous transfers to the parallel file system for both its checkpoints and large output sets, so that it can return to computation while the dataset migrates to the parallel file system in the background.
If a user specifies that a dataset is a checkpoint only, then the dataset will be managed with the SCR Output API as it would be if the SCR Checkpoint API were used. In particular, SCR may delete the checkpoint when a more recent checkpoint is established.
If a user specifies that a dataset is for output only, the dataset will first be cached on a tier of storage specified in the configuration file for the run and protected with the corresponding redundancy scheme. Then, the dataset will be moved to the prefix directory. When the transfer to the prefix directory is complete, the cached copy of the output dataset will be deleted.
If the user specifies that the dataset is both output and checkpoint, then SCR will use a hybrid approach. Files in the dataset will be cached and redundancy schemes will be used to protect the files. The dataset will be copied to the prefix directory, but it will also be kept in cache according to the policy set in the configuration for checkpoints. For example, if the user has set the configuration to keep three checkpoints in cache, then the dataset will be preserved until it is replaced by a newer checkpoint after three more checkpoint phases.
SCR_Start_output¶
int SCR_Start_output(char* name, int flags);
SCR_START_OUTPUT(NAME, FLAGS, IERROR)
CHARACTER*(*) NAME
INTEGER FLAGS, IERROR
Inform SCR that a new output phase is about to start. A process must call this function before it opens any files belonging to the dataset. SCR_Start_output must be called by all processes, including processes that do not write files as part of the dataset.
The caller can provide a name for the dataset in name. This name is used in two places. First, for checkpoints, it is returned as the name value in the SCR Restart API. Second, it is exposed to the user when listing datasets using the scr_index command, and the user may specify the name as a command line argument at times. For this reason, it is recommended to use short but meaningful names that are easy to type. The name value must be less than SCR_MAX_FILENAME characters. All processes should provide identical values in name. In C, the application may pass NULL for name, in which case SCR generates a default name for the dataset based on its internal dataset id.
The dataset can be output, a checkpoint, or both. The caller specifies these properties using the SCR_FLAG_OUTPUT and SCR_FLAG_CHECKPOINT bit flags. Additionally, a SCR_FLAG_NONE flag is defined for initializing variables. In C, these values can be combined with the | bitwise OR operator. In Fortran, these values can be added together using the + sum operator. Note that with Fortran, the values should be used at most once in the addition. All processes should provide identical values in flags.
This function should be called as soon as possible when initiating a dataset output. It is used internally within SCR for timing the cost of output operations. Each call to SCR_Start_output must be followed by a corresponding call to SCR_Complete_output.
In the current implementation, SCR_Start_output holds all processes at an MPI_Barrier to ensure that all processes are ready to start the output before it deletes cached files from a previous checkpoint.
SCR_Complete_output¶
int SCR_Complete_output(int valid);
SCR_COMPLETE_OUTPUT(VALID, IERROR)
INTEGER VALID, IERROR
Inform SCR that all files for the current dataset output are complete (i.e., done writing and closed) and whether they are valid (i.e., written without error). A process must close all files in the dataset before calling SCR_Complete_output. SCR_Complete_output must be called by all processes, including processes that did not write any files as part of the output.
The parameter valid should be set to 1 if either the calling process wrote all of its files successfully or it wrote no files during the output phase. Otherwise, the process should call SCR_Complete_output with valid set to 0. SCR will determine whether all processes wrote their output files successfully.
Each call to SCR_Complete_output must be preceded by a corresponding call to SCR_Start_output. For the case of checkpoint datasets, SCR_Complete_output behaves similarly to SCR_Complete_checkpoint.
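As a sketch of the Output API, assuming scr.h, mpi.h, and stdio.h are included, a dataset marked as both checkpoint and output could be written as follows; the function name, dataset and file naming, and the elided write call are illustrative, not part of SCR:
void write_output_with_scr(int timestep) {
  /* name the dataset and mark it as both checkpoint and output */
  char name[SCR_MAX_FILENAME];
  sprintf(name, "timestep.%d", timestep);
  SCR_Start_output(name, SCR_FLAG_CHECKPOINT | SCR_FLAG_OUTPUT);

  /* each rank routes and writes its own file */
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  char file[256];
  sprintf(file, "timestep.%d/rank_%d.out", timestep, rank);

  char scr_file[SCR_MAX_FILENAME];
  SCR_Route_file(file, scr_file);

  int valid = 1;
  FILE* fs = fopen(scr_file, "w");
  if (fs != NULL) {
    /* fwrite(output_data, ..., fs); */
    fclose(fs);
  } else {
    valid = 0;
  }

  /* all processes complete the output phase */
  SCR_Complete_output(valid);
}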
Space/time semantics¶
SCR imposes the following semantics:
- A process of a given MPI rank may only access files previously written by itself or by processes having the same MPI rank in prior runs. We say that a rank “owns” the files it writes. A process is never guaranteed access to files written by other MPI ranks.
- During a checkpoint, a process may only access files of the current checkpoint between calls to SCR_Start_checkpoint() and SCR_Complete_checkpoint(). Once a process calls SCR_Complete_checkpoint(), it is no longer guaranteed access to any file that it registered as part of that checkpoint via SCR_Route_file().
- During a restart, a process may only access files from its “most recent” checkpoint, and it must access those files between calls to SCR_Start_restart() and SCR_Complete_restart(). Once a process calls SCR_Complete_restart(), it is no longer guaranteed access to its restart files. SCR selects which checkpoint is considered to be the “most recent”.
These semantics enable SCR to cache files on devices that are not globally visible to all processes, such as node-local storage. Further, these semantics enable SCR to move, reformat, or delete files as needed, such that it can manage this cache.
SCR API state transitions¶

SCR API State Transition Diagram
Figure SCR API State Transition Diagram illustrates the internal states in SCR and which API calls can be used from within each state. The application must call SCR_Init before it may call any other SCR function, and it may not call SCR functions after calling SCR_Finalize. Some calls transition SCR from one state to another, as shown by the edges between states. Other calls are only valid when in certain states, as shown in the boxes. For example, SCR_Route_file is only valid within the Checkpoint, Restart, or Output states. All SCR functions are implicitly collective across MPI_COMM_WORLD, except for SCR_Route_file and SCR_Get_version.
Integrate SCR¶
This section provides details on how to integrate the SCR API into an application.
Using the SCR API¶
Before adding calls to the SCR library, consider an application with existing checkpointing code that looks like the following:
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);

  /* initialize our state from checkpoint file */
  state = restart();

  for (t = 0; t < TIMESTEPS; t++) {
    /* ... do work ... */

    /* every so often, write a checkpoint */
    if (t % CHECKPOINT_FREQUENCY == 0)
      checkpoint();
  }

  MPI_Finalize();
  return 0;
}
void checkpoint() {
  /* rank 0 creates a directory on the file system,
   * and then each process saves its state to a file */

  /* get rank of this process */
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* rank 0 creates directory on parallel file system */
  if (rank == 0) mkdir(checkpoint_dir);

  /* hold all processes until directory is created */
  MPI_Barrier(MPI_COMM_WORLD);

  /* build file name of checkpoint file for this rank */
  char checkpoint_file[256];
  sprintf(checkpoint_file, "%s/rank_%d.ckpt",
    checkpoint_dir, rank
  );

  /* each rank opens, writes, and closes its file */
  FILE* fs = fopen(checkpoint_file, "w");
  if (fs != NULL) {
    fwrite(checkpoint_data, ..., fs);
    fclose(fs);
  }

  /* wait for all files to be closed */
  MPI_Barrier(MPI_COMM_WORLD);

  /* rank 0 updates the pointer to the latest checkpoint */
  if (rank == 0) {
    fs = fopen("latest", "w");
    if (fs != NULL) {
      fwrite(checkpoint_dir, ..., fs);
      fclose(fs);
    }
  }
}
void* restart() {
/* rank 0 broadcasts directory name to read from,
* and then each process reads its state from a file */
/* get rank of this process */
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* rank 0 reads and broadcasts checkpoint directory name */
char checkpoint_dir[256];
if (rank == 0) {
FILE* fs = fopen("latest", "r");
if (fs != NULL) {
fread(checkpoint_dir, ..., fs);
fclose(fs);
}
}
MPI_Bcast(checkpoint_dir, sizeof(checkpoint_dir), MPI_CHAR, ...);
/* build file name of checkpoint file for this rank */
char checkpoint_file[256];
sprintf(checkpoint_file, "%s/rank_%d.ckpt",
checkpoint_dir, rank
);
/* each rank opens, reads, and closes its file */
FILE* fs = fopen(checkpoint_file, "r");
if (fs != NULL) {
fread(state, ..., fs);
fclose(fs);
}
return state;
}
There are three steps to consider when integrating the SCR API into an application: Init/Finalize, Checkpoint, and Restart. One may employ the scalable checkpoint capability of SCR without the scalable restart capability. While it is most valuable to utilize both, some applications cannot use the scalable restart.
The following code exemplifies the changes necessary to integrate SCR. Each change is numbered for further discussion below.
Init/Finalize¶
You must add calls to SCR_Init and SCR_Finalize in order to start up and shut down the library. The SCR library uses MPI internally, and all calls to SCR must be made from within a well-defined MPI environment, i.e., between MPI_Init and MPI_Finalize. It is recommended to call SCR_Init immediately after MPI_Init and to call SCR_Finalize just before MPI_Finalize.
For example, modify the source to look something like this
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);

  /**** change #1 ****/
  SCR_Init();

  /**** change #2 ****/
  int have_restart;
  SCR_Have_restart(&have_restart, NULL);
  if (have_restart)
    state = restart();
  else
    state = new_run_state;

  for (t = 0; t < TIMESTEPS; t++) {
    /* ... do work ... */

    /**** change #3 ****/
    int need_checkpoint;
    SCR_Need_checkpoint(&need_checkpoint);
    if (need_checkpoint)
      checkpoint();
  }

  /**** change #4 ****/
  SCR_Finalize();

  MPI_Finalize();
  return 0;
}
First, as shown in change #1, one must call SCR_Init() to initialize the SCR library before it can be used. SCR uses MPI, so SCR must be initialized after MPI has been initialized. Similarly, as shown in change #4, one should shut down the SCR library by calling SCR_Finalize(). This must be done before calling MPI_Finalize(). Internally, SCR duplicates MPI_COMM_WORLD during SCR_Init, so MPI messages from the SCR library do not mix with messages sent by the application.
Some applications contain multiple calls to MPI_Finalize. In such cases, be sure to account for each call. The same applies to MPI_Init if there are multiple calls to this function.
In change #2, the application can call SCR_Have_restart() to determine whether there is a checkpoint to read in. If so, it calls its restart function; otherwise, it assumes it is starting from scratch. This should only be called if the application is using the scalable restart feature of SCR.
As shown in change #3, the application may rely on SCR to determine when to checkpoint by calling SCR_Need_checkpoint(). SCR can be configured with information on failure rates and checkpoint costs for the particular host platform, so this function provides a portable method to guide an application toward an optimal checkpoint frequency. For this, the application should call SCR_Need_checkpoint at each natural opportunity it has to checkpoint, e.g., at the end of each time step, and then initiate a checkpoint when SCR advises it to do so. An application may ignore the output of SCR_Need_checkpoint, and it does not have to call the function at all. The intent of SCR_Need_checkpoint is to provide a portable way for an application to determine when to checkpoint across platforms with different reliability characteristics and different file system speeds.
Checkpoint¶
To actually write a checkpoint, there are three steps. First, the application must call SCR_Start_checkpoint to define the start boundary of a new checkpoint. It must do this before it opens any file belonging to the new checkpoint. Then, the application must call SCR_Route_file for each file that it will write in order to register the file with SCR and to determine the full path and file name to use when opening each file. Finally, it must call SCR_Complete_checkpoint to define the end boundary of the checkpoint.
If a process does not write any files during a checkpoint, it must still call SCR_Start_checkpoint and SCR_Complete_checkpoint, as these functions are collective. All files registered through a call to SCR_Route_file between a given SCR_Start_checkpoint and SCR_Complete_checkpoint pair are considered to be part of the same checkpoint file set.
Some example SCR checkpoint code looks like the following
void checkpoint() {
  /* each process saves its state to a file */

  /**** change #5 ****/
  SCR_Start_checkpoint();

  /* get rank of this process */
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /**** change #6 ****/
  /*
  if (rank == 0)
    mkdir(checkpoint_dir);
  // hold all processes until directory is created
  MPI_Barrier(MPI_COMM_WORLD);
  */

  /* build file name of checkpoint file for this rank */
  char checkpoint_file[256];
  sprintf(checkpoint_file, "%s/rank_%d.ckpt",
    checkpoint_dir, rank
  );

  /**** change #7 ****/
  char scr_file[SCR_MAX_FILENAME];
  SCR_Route_file(checkpoint_file, scr_file);

  /**** change #8 ****/
  /* each rank opens, writes, and closes its file;
   * valid records whether the write succeeded */
  int valid = 1;
  FILE* fs = fopen(scr_file, "w");
  if (fs != NULL) {
    fwrite(checkpoint_data, ..., fs);
    fclose(fs);
  } else {
    valid = 0;
  }

  /**** change #9 ****/
  /*
  // wait for all files to be closed
  MPI_Barrier(MPI_COMM_WORLD);
  // rank 0 updates the pointer to the latest checkpoint
  FILE* fs = fopen("latest", "w");
  if (fs != NULL) {
    fwrite(checkpoint_dir, ..., fs);
    fclose(fs);
  }
  */

  /**** change #10 ****/
  SCR_Complete_checkpoint(valid);

  /**** change #11 ****/
  /* Check whether we should stop */
  int should_exit;
  SCR_Should_exit(&should_exit);
  if (should_exit) {
    exit(0);
  }
}
As shown in change #5, the application must inform SCR when it is starting a new checkpoint by calling SCR_Start_checkpoint(). Similarly, it must inform SCR when it has completed the checkpoint with a corresponding call to SCR_Complete_checkpoint() as shown in change #10. When calling SCR_Complete_checkpoint(), each process sets the valid flag to indicate whether it wrote all of its checkpoint files successfully.
SCR manages checkpoint directories, so the mkdir operation is removed in change #6. Additionally, the application can rely on SCR to track the latest checkpoint, so the logic to track the latest checkpoint is removed in change #9.
Between the call to SCR_Start_checkpoint() and SCR_Complete_checkpoint(), the application must register each of its checkpoint files by calling SCR_Route_file() as shown in change #7. SCR “routes” the file by replacing any leading directory on the file name with a path that points to another directory in which SCR caches data for the checkpoint. As shown in change #8, the application must use the exact string returned by SCR_Route_file() to open its checkpoint file.
Also note how the application can call SCR_Should_exit after a checkpoint to determine whether it is time to stop, as shown in change #11. This is important so that an application stops with sufficient time remaining to copy datasets from cache to the parallel file system before the allocation expires.
Restart with SCR¶
There are two options to access files during a restart: with and without SCR. If an application is designed to restart such that each MPI task only needs access to the files it wrote during the previous checkpoint, then the application can utilize the scalable restart capability of SCR. This enables the application to restart from a cached checkpoint in the existing resource allocation, which saves the cost of writing to and reading from the parallel file system.
To use SCR for restart, the application can call SCR_Have_restart to determine whether SCR has a previous checkpoint loaded. If there is a checkpoint available, the application can call SCR_Start_restart to tell SCR that a restart operation is beginning. Then, the application must call SCR_Route_file to determine the full path and file name to each of its checkpoint files that it will read for restart. The input file name to SCR_Route_file does not need a path during restart, as SCR will identify the file just based on its file name. After the application reads in its checkpoint files, it must call SCR_Complete_restart to indicate that it has completed reading its checkpoint files.
Some example SCR restart code may look like the following
void* restart() {
  /* each process reads its state from a file */

  /**** change #12 ****/
  SCR_Start_restart(NULL);

  /* get rank of this process */
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /**** change #13 ****/
  /*
  // rank 0 reads and broadcasts checkpoint directory name
  char checkpoint_dir[256];
  if (rank == 0) {
    FILE* fs = fopen("latest", "r");
    if (fs != NULL) {
      fread(checkpoint_dir, ..., fs);
      fclose(fs);
    }
  }
  MPI_Bcast(checkpoint_dir, sizeof(checkpoint_dir), MPI_CHAR, ...);
  */

  /**** change #14 ****/
  /* build file name of checkpoint file for this rank */
  char checkpoint_file[256];
  sprintf(checkpoint_file, "rank_%d.ckpt",
    rank
  );

  /**** change #15 ****/
  char scr_file[SCR_MAX_FILENAME];
  SCR_Route_file(checkpoint_file, scr_file);

  /**** change #16 ****/
  /* each rank opens, reads, and closes its file;
   * valid records whether the read succeeded */
  int valid = 1;
  FILE* fs = fopen(scr_file, "r");
  if (fs != NULL) {
    fread(state, ..., fs);
    fclose(fs);
  } else {
    valid = 0;
  }

  /**** change #17 ****/
  SCR_Complete_restart(valid);

  return state;
}
As shown in change #12, the application calls SCR_Start_restart() to inform SCR that it is beginning its restart. SCR automatically loads the most recent checkpoint, so the application logic to identify the latest checkpoint is removed in change #13. During a restart, the application only needs the file name, so the checkpoint directory can be dropped from the path in change #14. Instead, the application gets the path to use to open the checkpoint file via a call to SCR_Route_file() in change #15. It then uses that path to open the file for reading in change #16. After the process has read each of its checkpoint files, it informs SCR that it has completed reading its data with a call to SCR_Complete_restart() in change #17. When calling SCR_Complete_restart(), each process sets the valid flag to indicate whether it read all of its checkpoint files successfully.
Restart without SCR¶
If the application does not use SCR for restart, it should not make calls to SCR_Have_restart, SCR_Start_restart, SCR_Route_file, or SCR_Complete_restart during the restart. Instead, it should access files directly from the parallel file system.
When restarting without SCR, the value of the SCR_FLUSH counter will not be preserved between restarts. The counter will be reset to its upper limit with each restart. Thus, each restart may introduce some fixed offset in a series of periodic SCR flushes.
When not using SCR for restart, one should set the SCR_FLUSH_ON_RESTART parameter to 1, which will cause SCR to flush any cached checkpoint to the file system during SCR_Init.
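For example, a job that always restarts from the parallel file system might set something like the following in its batch script (a minimal sketch; disabling fetch is optional and simply skips loading checkpoints into cache that the application will not read through SCR):
# flush any cached checkpoint to the file system during SCR_Init
export SCR_FLUSH_ON_RESTART=1
# optionally skip fetching checkpoints into cache during SCR_Init
export SCR_FETCH=0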
Building with the SCR library¶
To compile and link with the SCR library, add the flags shown in the table below to your compile and link lines. The value of the variable SCR_INSTALL_DIR should be the path to the installation directory for SCR.
SCR build flags
Compile Flags | -I$(SCR_INSTALL_DIR)/include |
C Dynamic Link Flags | -L$(SCR_INSTALL_DIR)/lib64 -lscr -Wl,-rpath,$(SCR_INSTALL_DIR)/lib64 |
C Static Link Flags | -L$(SCR_INSTALL_DIR)/lib64 -lscr -lz |
Fortran Dynamic Link Flags | -L$(SCR_INSTALL_DIR)/lib64 -lscrf -Wl,-rpath,$(SCR_INSTALL_DIR)/lib64 |
Fortran Static Link Flags | -L$(SCR_INSTALL_DIR)/lib64 -lscrf -lz |
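For example, a dynamically linked C application might be built along the following lines (a sketch; the compiler wrapper, application name, and installation path are placeholders):
SCR_INSTALL_DIR=/path/to/scr/install
mpicc -I${SCR_INSTALL_DIR}/include -o my_app my_app.c \
  -L${SCR_INSTALL_DIR}/lib64 -lscr -Wl,-rpath,${SCR_INSTALL_DIR}/lib64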
Run a job¶
In addition to the SCR library, one must properly configure the batch system and include a set of SCR commands in the job batch script. In particular, one must: 1) inform the batch system that the allocation should remain available even after a failure and 2) replace the command to execute the application with an SCR wrapper script. The precise set of options and commands to use depends on the system resource manager.
The SCR commands prepare the cache, scavenge files from cache to the parallel file system, and check that the scavenged dataset is complete, among other things. These commands are located in the /bin directory where SCR is installed. There are numerous SCR commands; any command not mentioned in this document is not intended to be executed by users.
Supported platforms¶
At the time of this writing, SCR supports specific combinations of resource managers and job launchers. The descriptions for using SCR in this section apply to these specific configurations; however, the following description is helpful for understanding how to run SCR on any system. Please contact us for help in porting SCR to other platforms. (See Section Support for contact information).
Jobs and job steps¶
First, we differentiate between a job allocation and a job step. Our terminology originates from the SLURM resource manager, but the principles apply generally across SCR-supported resource managers.
When a job is granted resources on a system, the batch script executes inside of a job allocation. The job allocation consists of a set of nodes, a time limit, and a job id. The job id can be obtained by executing the squeue command on SLURM, the apstat command on ALPS, and the bjobs command on LSF.
Within a job allocation, a user may run one or more job steps, each of which is invoked by a call to srun on SLURM, aprun on ALPS, or mpirun on LSF. Each job step is assigned its own step id. On SLURM, within each job allocation, job step ids start at 0 and increment with each issued job step. Job step ids can be obtained by passing the -s option to squeue. A fully qualified name of a SLURM job step consists of: jobid.stepid. For instance, the name 1234.5 refers to step id 5 of job id 1234. On ALPS, each job step within an allocation has a unique id that can be obtained through apstat.
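For example, on SLURM one can list the job steps in the current allocation and cancel a particular step by its fully qualified name:
# list job steps and their ids in the current allocation
squeue -s
# cancel step id 5 of job id 1234
scancel 1234.5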
Ignoring node failures¶
Before running an SCR job, it is necessary to configure the job allocation to withstand node failures. By default, most resource managers terminate the job allocation if a node fails; however, SCR requires the job allocation to remain active in order to restart the job or to scavenge files. To enable the job allocation to continue past node failures, one must specify the appropriate flags from the table below.
SCR job allocation flags
MOAB batch script | #MSUB -l resfailpolicy=ignore |
MOAB interactive | qsub -I ... -l resfailpolicy=ignore |
SLURM batch script | #SBATCH --no-kill |
SLURM interactive | salloc --no-kill ... |
LSF batch script | #BSUB -env "all, LSB_DJOB_COMMFAIL_ACTION=KILL_TASKS" |
LSF interactive | bsub -env "all, LSB_DJOB_COMMFAIL_ACTION=KILL_TASKS" ... |
The SCR wrapper script¶
The easiest way to integrate SCR into a batch script is to set some environment variables and to replace the job run command with an SCR wrapper script. The SCR wrapper script includes logic to restart an application within a job allocation, and it scavenges files from cache to the parallel file system at the end of an allocation:
SLURM: scr_srun [srun_options] <prog> [prog_args ...]
ALPS: scr_aprun [aprun_options] <prog> [prog_args ...]
LSF: scr_mpirun [mpirun_options] <prog> [prog_args ...]
The SCR wrapper script must run from within a job allocation.
Internally, the command must know the prefix directory.
By default, it uses the current working directory.
One may specify a different prefix directory by setting the SCR_PREFIX parameter. It is recommended to set the SCR_HALT_SECONDS parameter so that the job allocation does not expire before datasets can be flushed (Section Halt a job).
By default, the SCR wrapper script does not restart an application after the first job step exits. To automatically restart a job step within the current allocation, set the SCR_RUNS environment variable to the maximum number of runs to attempt. For an unlimited number of attempts, set this variable to -1.
After a job step exits, the wrapper script checks whether it should restart the job.
If so, the script sleeps for some time to give nodes in the allocation a chance to clean up.
Then, it checks that there are sufficient healthy nodes remaining in the allocation.
By default, the wrapper script assumes the next run requires the same number of nodes as the previous run,
which is recorded in a file written by the SCR library.
If this file cannot be read, the command assumes the application requires all nodes in the allocation.
Alternatively, one may override these heuristics and precisely specify the number of nodes needed
by setting the SCR_MIN_NODES
environment variable to the number of required nodes.
Some applications cannot run via wrapper scripts. In that case, one should examine the logic contained in the script and duplicate the necessary parts in the job batch script. In particular, one should invoke scr_postrun for scavenge support, as in the sketch below.
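This fragment assumes the same srun line and prefix directory as the example batch script in the next section; consult the scr_postrun options on your installation:
# launch the application directly instead of through scr_srun
srun -n512 -N64 ./my_job
# scavenge any cached datasets to the prefix directory on the parallel file system
scr_postrun -p $SCR_PREFIX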
Example batch script for using SCR restart capability¶
An example MOAB / SLURM batch script with scr_srun
is shown below
#!/bin/bash
#MSUB -l partition=atlas
#MSUB -l nodes=66
#MSUB -l resfailpolicy=ignore
# above, tell MOAB to not kill the job allocation upon a node failure
# also note that the job requested 2 spares -- it uses 64 nodes but is allocated 66
# specify where datasets should be written
export SCR_PREFIX=/my/parallel/file/system/username/run1/checkpoints
# instruct SCR to flush to the file system every 20 checkpoints
export SCR_FLUSH=20
# halt if there is less than an hour remaining (3600 seconds)
export SCR_HALT_SECONDS=3600
# attempt to run the job up to 3 times
export SCR_RUNS=3
# run the job with scr_srun
scr_srun -n512 -N64 ./my_job
Configure a job¶
The default SCR configuration suffices for many Linux clusters. However, significant performance improvement or additional functionality may be gained via custom configuration.
Setting parameters¶
SCR searches the following locations in the following order for a parameter value, taking the first value it finds.
- Environment variables,
- User configuration file,
- System configuration file,
- Compile-time constants.
Some parameters, such as the location of the control directory, cannot be specified by the user. Such parameters must be either set in the system configuration file or hard-coded into SCR as compile-time constants.
To find a user configuration file,
SCR looks for a file named .scrconf
in the prefix directory (note the leading dot).
Alternatively, one may specify the name and location of the user configuration file
by setting the SCR_CONF_FILE
environment variable at run time, e.g.,:
export SCR_CONF_FILE=~/myscr.conf
The location of the system configuration file is hard-coded into SCR at build time.
This defaults to /etc/scr/scr.conf
.
One may set this using the SCR_CONFIG_FILE
option with cmake, e.g.,:
cmake -DSCR_CONFIG_FILE=/path/to/scr.conf ...
To set an SCR parameter in a configuration file, list the parameter name followed by its value separated by an ‘=’ sign. Blank lines are ignored, and any characters following the ‘#’ comment character are ignored. For example, a configuration file may contain something like the following:
>>: cat ~/myscr.conf
# set the halt seconds to one hour
SCR_HALT_SECONDS=3600
# set SCR to flush every 20 checkpoints
SCR_FLUSH=20
Group, store, and checkpoint descriptors¶
SCR must have information about process groups, storage devices, and redundancy schemes. The defaults provide reasonable settings for Linux clusters, but one can define custom settings via group, store, and checkpoint descriptors in configuration files.
SCR must know which processes are likely to fail at the same time (failure groups) and which processes access a common storage device (storage groups). By default, SCR creates a group of all processes in the job called WORLD and another group of all processes on the same compute node called NODE. If more groups are needed, they can be defined in configuration files with entries like the following:
GROUPS=host1 POWER=psu1 SWITCH=0
GROUPS=host2 POWER=psu1 SWITCH=1
GROUPS=host3 POWER=psu2 SWITCH=0
GROUPS=host4 POWER=psu2 SWITCH=1
Group descriptor entries are identified by a leading GROUPS key. Each line corresponds to a single compute node, where the hostname is the value of the GROUPS key. There must be one line for every compute node in the allocation. It is recommended to specify groups in the system configuration file.
The remaining values on the line specify a set of group name / value pairs. The group name is the string to be referenced by store and checkpoint descriptors. The value can be an arbitrary character string. The only requirement is that for a given group name, nodes that form a group must provide identical strings for the value.
In the above example, there are four compute nodes: host1, host2, host3, and host4. There are two groups defined: POWER and SWITCH. Nodes host1 and host2 belong to the same POWER group, as do nodes host3 and host4. For the SWITCH group, nodes host1 and host3 belong to the same group, as do nodes host2 and host4.
In addition to groups,
SCR must know about the storage devices available on a system.
SCR requires that all processes be able to access the prefix directory,
and it assumes that /tmp
is storage local to each compute node.
Additional storage can be described in configuration files
with entries like the following:
STORE=/tmp GROUP=NODE COUNT=1
STORE=/ssd GROUP=NODE COUNT=3
STORE=/dev/persist GROUP=NODE COUNT=1 ENABLED=1 MKDIR=0
STORE=/p/lscratcha GROUP=WORLD
Store descriptor entries are identified by a leading STORE key. Each line corresponds to a class of storage devices. The value associated with the STORE key is the directory prefix of the storage device. This directory prefix also serves as the name of the store descriptor. All compute nodes must be able to access their respective storage device via the specified directory prefix.
The remaining values on the line specify properties of the storage class. The GROUP key specifies the group of processes that share a device. Its value must specify a group name. The COUNT key specifies the maximum number of checkpoints that can be kept in the associated storage. The user should be careful to set this appropriately depending on the storage capacity and the application checkpoint size. The COUNT key is optional, and it defaults to the value of the SCR_CACHE_SIZE parameter if not specified. The ENABLED key enables (1) or disables (0) the store descriptor. This key is optional, and it defaults to 1 if not specified. The MKDIR key specifies whether the device supports the creation of directories (1) or not (0). This key is optional, and it defaults to 1 if not specified.
In the above example, there are four storage devices specified: /tmp, /ssd, /dev/persist, and /p/lscratcha. The devices at /tmp, /ssd, and /dev/persist specify the NODE group, which means that they are node-local storage. Processes on the same compute node access the same device. The storage at /p/lscratcha specifies the WORLD group, which means that all processes in the job can access the device. In other words, it is a globally accessible file system.
Finally, SCR must be configured with redundancy schemes. By default, SCR protects against single compute node failures using XOR, and it caches one checkpoint in /tmp. To specify something different, edit a configuration file to include checkpoint and output descriptors. These descriptors look like the following:
# instruct SCR to use the CKPT descriptors from the config file
SCR_COPY_TYPE=FILE
# the following instructs SCR to run with three checkpoint configurations:
# - save every 8th checkpoint to /ssd using the PARTNER scheme
# - save every 4th checkpoint (not divisible by 8) to /ssd using XOR with
# a set size of 8
# - save all other checkpoints (not divisible by 4 or 8) to /tmp using XOR with
# a set size of 16
CKPT=0 INTERVAL=1 GROUP=NODE STORE=/tmp TYPE=XOR SET_SIZE=16
CKPT=1 INTERVAL=4 GROUP=NODE STORE=/ssd TYPE=XOR SET_SIZE=8 OUTPUT=1
CKPT=2 INTERVAL=8 GROUP=SWITCH STORE=/ssd TYPE=PARTNER
First, one must set the SCR_COPY_TYPE parameter to “FILE”. Otherwise, an implied checkpoint descriptor is constructed using various SCR parameters including SCR_GROUP, SCR_CACHE_BASE, SCR_COPY_TYPE, and SCR_SET_SIZE.
Checkpoint descriptor entries are identified by a leading CKPT key. The values of the CKPT keys must be numbered sequentially starting from 0.
The INTERVAL key specifies how often a descriptor is to be applied. For each checkpoint, SCR selects the descriptor having the largest interval value that evenly divides the internal SCR checkpoint iteration number. It is necessary that one descriptor has an interval of 1. This key is optional, and it defaults to 1 if not specified.
The GROUP key lists the failure group, i.e., the name of the group of processes likely to fail. This key is optional, and it defaults to the value of the SCR_GROUP parameter if not specified.
The STORE key specifies the directory in which to cache the checkpoint. This key is optional, and it defaults to the value of the SCR_CACHE_BASE parameter if not specified.
The TYPE key identifies the redundancy scheme to be applied. This key is optional, and it defaults to the value of the SCR_COPY_TYPE parameter if not specified. Other keys may exist depending on the selected redundancy scheme. For XOR schemes, the SET_SIZE key specifies the minimum number of processes to include in each XOR set.
One checkpoint descriptor can be marked with the OUTPUT key. This indicates that the descriptor should be selected to store datasets that the application flags with SCR_FLAG_OUTPUT. The OUTPUT key is optional, and it defaults to 0. If there is no descriptor with the OUTPUT key defined and the dataset is also a checkpoint, SCR will choose the checkpoint descriptor according to the normal policy. Otherwise, if there is no descriptor with the OUTPUT key defined and the dataset is not a checkpoint, SCR will use the checkpoint descriptor having an interval of 1.
SCR parameters¶
The table in this section specifies the full set of SCR configuration parameters.
Name | Default | Description |
---|---|---|
SCR_HALT_SECONDS | 0 | Set to a positive integer to instruct SCR to halt the job after completing a successful checkpoint if the remaining time in the current job allocation is less than the specified number of seconds. |
SCR_HALT_ENABLED | 1 | Whether SCR should halt a job by calling exit(). Set to 0 to disable, in which case the application is responsible for stopping. |
SCR_GROUP | NODE | Specify name of failure group. |
SCR_COPY_TYPE | XOR | Set to one of: SINGLE, PARTNER, XOR, or FILE. |
SCR_CACHE_BASE | /tmp | Specify the base directory SCR should use to cache checkpoints. |
SCR_CACHE_SIZE | 1 | Set to a non-negative integer to specify the maximum number of checkpoints SCR should keep in cache. SCR will delete the oldest checkpoint from cache before saving another in order to keep the total count below this limit. |
SCR_SET_SIZE | 8 | Specify the minimum number of processes to include in an XOR set. Increasing this value decreases the amount of storage required to cache the checkpoint data. However, higher values have an increased likelihood of encountering a catastrophic error. Higher values may also require more time to reconstruct lost files from redundancy data. |
SCR_PREFIX | $PWD | Specify the prefix directory on the parallel file system where checkpoints should be read from and written to. |
SCR_CHECKPOINT_SECONDS | 0 | Set to a positive number of seconds to specify the minimum time between consecutive checkpoints as guided by SCR_Need_checkpoint. |
SCR_CHECKPOINT_OVERHEAD | 0.0 | Set to a positive percentage to specify the maximum overhead allowed for checkpointing operations as guided by SCR_Need_checkpoint. |
SCR_DISTRIBUTE | 1 | Set to 0 to disable file distribution during SCR_Init. |
SCR_FETCH | 1 | Set to 0 to disable SCR from fetching files from the parallel file system during SCR_Init. |
SCR_FETCH_WIDTH | 256 | Specify the number of processes that may read simultaneously from the parallel file system. |
SCR_FLUSH | 10 | Specify the number of checkpoints between periodic SCR flushes to the parallel file system. Set to 0 to disable periodic flushes. |
SCR_FLUSH_ASYNC | 0 | Set to 1 to enable asynchronous flush methods (if supported). |
SCR_FLUSH_WIDTH | 256 | Specify the number of processes that may write simultaneously to the parallel file system. |
SCR_FLUSH_ON_RESTART | 0 | Set to 1 to force SCR to flush a checkpoint during restart. This is useful for codes that must restart from the parallel file system. |
SCR_PRESERVE_DIRECTORIES | 1 | Whether SCR should preserve the application directory structure in the prefix directory during flush and scavenge operations. Set to 0 to rely on SCR-defined directory layouts. |
SCR_RUNS | 1 | Specify the maximum number of times the scr_srun command should attempt to run a job within an allocation. Set to -1 to specify an unlimited number of times. |
SCR_MIN_NODES | N/A | Specify the minimum number of nodes required to run a job. |
SCR_EXCLUDE_NODES | N/A | Specify a set of nodes, using SLURM node range syntax, which should be excluded from runs. This is useful to avoid particular nodes while waiting for them to be fixed by system administrators. Nodes in this list which are not in the current allocation are silently ignored. |
SCR_MPI_BUF_SIZE | 131072 | Specify the number of bytes to use for internal MPI send and receive buffers when computing redundancy data or rebuilding lost files. |
SCR_FILE_BUF_SIZE | 1048576 | Specify the number of bytes to use for internal buffers when copying files between the parallel file system and the cache. |
SCR_CRC_ON_COPY | 0 | Set to 1 to enable CRC32 checks when copying files during the redundancy scheme. |
SCR_CRC_ON_DELETE | 0 | Set to 1 to enable CRC32 checks when deleting files from cache. |
SCR_CRC_ON_FLUSH | 1 | Set to 0 to disable CRC32 checks during fetch and flush operations. |
SCR_DEBUG | 0 | Set to 1 or 2 for increasing verbosity levels of debug messages. |
SCR_WATCHDOG_TIMEOUT | N/A | Set to the expected time (seconds) for checkpoint writes to in-system storage (see Section Catch a hanging job). |
SCR_WATCHDOG_TIMEOUT_PFS | N/A | Set to the expected time (seconds) for checkpoint writes to the parallel file system (see Section Catch a hanging job). |
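For instance, a batch script might tune a few of these parameters through environment variables before launching the run (the values shown are illustrative only):
# keep up to two checkpoints in cache and use XOR sets of 16 processes
export SCR_CACHE_SIZE=2
export SCR_SET_SIZE=16
# flush to the parallel file system every 20 checkpoints
export SCR_FLUSH=20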
Halt a job¶
There are several mechanisms to instruct a running SCR application to halt. It is often necessary to interact with the resource manager to halt a job.
scr_halt and the halt file¶
The recommended method to stop an SCR application is to use the scr_halt command. The command must be run from within the prefix directory, or otherwise, the prefix directory of the target job must be specified as an argument.
A number of different halt conditions can be specified. In most cases, the scr_halt command communicates these conditions to the running application via the halt.scr file, which is stored in the hidden .scr directory within the prefix directory.
The SCR library reads the halt file when the application calls SCR_Init and each time the application completes a checkpoint. If a halt condition is satisfied, all tasks in the application call exit. One can disable this behavior by setting the SCR_HALT_ENABLED parameter to 0. In this case, the application can determine when to exit by calling SCR_Should_exit.
Halt after next checkpoint¶
You can instruct an SCR job to halt after completing its next successful checkpoint:
scr_halt
To run scr_halt
from outside of a prefix directory,
specify the target prefix directory like so:
scr_halt /p/lscratcha/user1/simulation123
You can instruct an SCR job to halt after completing some number of checkpoints
via the --checkpoints
option.
For example, to instruct a job to halt after 10 more checkpoints, use the following:
scr_halt --checkpoints 10
If the last of the checkpoints is unsuccessful, the job continues until it completes a successful checkpoint. This ensures that SCR has a successful checkpoint to flush before it halts the job.
Halt before or after a specified time¶
It is possible to instruct an SCR job to halt after a specified time using
the --after
option.
The job will halt on its first successful checkpoint after the specified time.
For example, you can instruct a job to halt after “12:00pm today” via:
scr_halt --after '12:00pm today'
It is also possible to instruct a job to halt before a specified time
using the --before
option.
For example, you can instruct a job to halt before “8:30am tomorrow” via:
scr_halt --before '8:30am tomorrow'
For the “halt before” condition to be effective, one must also set the SCR_HALT_SECONDS parameter. When SCR_HALT_SECONDS is set to a positive number, SCR checks how much time is left before the specified time limit. If the remaining time in seconds is less than or equal to SCR_HALT_SECONDS, SCR halts the job. The value of SCR_HALT_SECONDS does not affect the “halt after” condition.
It is highly recommended that SCR_HALT_SECONDS be set so that the SCR library can impose a default “halt before” condition using the end time of the job allocation. This ensures the latest checkpoint can be flushed before the allocation is lost.
It is important to set SCR_HALT_SECONDS to a value large enough that SCR has time to completely flush (and rebuild) files before the allocation expires. Consider that a checkpoint may be taken just before the remaining time is less than SCR_HALT_SECONDS. If a code checkpoints every X seconds and it takes Y seconds to flush files from the cache and rebuild, set SCR_HALT_SECONDS = X + Y + Delta, where Delta is some positive value to provide additional slack time.
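For example (the numbers are illustrative): if a code checkpoints about every 3600 seconds and flushing plus rebuilding takes roughly 600 seconds, allowing 300 seconds of slack gives:
# X + Y + Delta = 3600 + 600 + 300
export SCR_HALT_SECONDS=4500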
One may also set the halt seconds via the --seconds option to scr_halt. Using the scr_halt command, one can set, change, and unset the halt seconds on a running job.
NOTE: If any scr_halt commands are specified as part of the batch script before the first run starts, one must then use scr_halt to set the halt seconds for the job rather than the SCR_HALT_SECONDS parameter. The scr_halt command creates the halt file, and if a halt file exists before a job starts to run, SCR ignores any value specified in the SCR_HALT_SECONDS parameter.
Halt immediately¶
Sometimes, you need to halt an SCR job immediately, and there are two options for this.
You may use the --immediate
option:
scr_halt --immediate
This command first updates the halt file, so that the job will not be restarted once stopped. Then, it kills the current run.
If for some reason the --immediate
option fails to work,
you may manually halt the job. [1]
First, issue a simple scr_halt so the job will not restart, and then manually kill the current run using mechanisms provided by the resource manager, e.g., scancel for SLURM and apkill for ALPS.
When using mechanisms provided by the resource manager to kill the current run, be careful to cancel the job step and not the job allocation. Canceling the job allocation destroys the cache.
For SLURM, to get the job step id, type: squeue -s. Then be sure to include the job id and step id in the scancel argument.
For example, if the job id is 1234 and the step id is 5, then use the following commands:
scr_halt
scancel 1234.5
Do not just type “scancel 1234
” – be sure to include the job step id.
For ALPS, use apstat
to get the apid of the job step to kill.
Then, follow the steps as described above: execute scr_halt
followed by the kill command apkill <apid>
.
[1] | On Cray/ALPS, scr_halt --immediate is not yet supported. The alternate method described in the text must be used instead. |
Catch a hanging job¶
If an application hangs, SCR may not be given the chance to copy files from cache to the parallel file system before the allocation expires. To avoid losing significant work due to a hang, SCR attempts to detect if a job is hanging, and if so, SCR attempts to kill the job step so that it can be restarted in the allocation.
On some systems, SCR employs the io-watchdog library for this purpose. For more information on this tool, see http://code.google.com/p/io-watchdog.
On systems where io-watchdog is not available, SCR uses a generic mechanism based on the expected time between checkpoints as specified by the user. If the time between checkpoints is longer than expected, SCR assumes the job is hanging.
Two SCR parameters determine how many seconds should pass between I/O phases in an application, i.e. seconds between consecutive calls to SCR_Start_checkpoint. These are SCR_WATCHDOG_TIMEOUT and SCR_WATCHDOG_TIMEOUT_PFS. The first parameter specifies the time to wait when SCR writes checkpoints to in-system storage, e.g. SSD or RAM disk, and the second parameter specifies the time to wait when SCR writes checkpoints to the parallel file system. The reason for the two timeouts is that writing to the parallel file system generally takes much longer than writing to in-system storage, and so a longer timeout period is useful in that case.
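For example, one might set timeouts like the following (the values are illustrative and should reflect checkpoint write times observed on the target system):
# allow up to 5 minutes for checkpoint writes to in-system storage
export SCR_WATCHDOG_TIMEOUT=300
# allow up to 30 minutes for checkpoint writes to the parallel file system
export SCR_WATCHDOG_TIMEOUT_PFS=1800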
When using this feature, be careful to check that the job does not hang near the end of its allocation time limit, since in this case, SCR may not kill the run with enough time before the allocation ends. If you suspect the job to be hanging and you deem that SCR will not kill the run in sufficient time, manually cancel the run as described above.
Combine, list, change, and unset halt conditions¶
It is possible to specify multiple halt conditions.
To do so, simply list each condition in the same scr_halt
command or issue several commands.
For example, to instruct a job to halt after 10 checkpoints or before “8:30am tomorrow”,
whichever comes earlier, you could issue the following command:
scr_halt --checkpoints 10 --before '8:30am tomorrow'
The following sequence also works:
scr_halt --checkpoints 10
scr_halt --before '8:30am tomorrow'
You may list the current settings in the halt file with the --list
option, e.g.,:
scr_halt --list
You may change a setting by issuing a new command to overwrite the current value.
Finally, you can unset some halt conditions by prepending unset-
to the option names.
See the scr_halt
man page for a full listing of unset options.
For example, to unset the “halt before” condition on a job, type the following:
scr_halt --unset-before
Remove the halt file¶
Sometimes, especially during testing, you may want to run in an existing
allocation after halting a previous run.
When SCR detects a halt file with a satisfied halt condition, it immediately exits.
This is the desired effect when trying to halt a job; however, this mechanism also prevents one from intentionally running in an allocation after halting a previous run.
Along these lines, know that SCR registers a halt condition whenever the application calls SCR_Finalize.
When there is a halt file with a satisfied halt condition,
a message is printed to stdout
to indicate why SCR is halting.
To run in such a case, first remove the satisfied halt conditions.
You can unset the conditions or reset them to appropriate values.
Another approach is to remove the halt file via the --remove
option.
This deletes the halt file, which effectively removes all halt conditions.
For example, to remove the halt file from a job, type:
scr_halt --remove
Manage datasets¶
SCR records the status of datasets that are on the parallel file system in the index.scr
file.
This file is written to the hidden .scr
directory within the prefix directory.
The library updates the index file as an application runs and during scavenge operations.
While restarting a job, the SCR library reads the index file during SCR_Init
to determine which checkpoints are available.
The library attempts to restart with the most recent checkpoint and works backwards
until it successfully fetches a valid checkpoint.
SCR does not fetch any checkpoint marked as “incomplete” or “failed”.
A checkpoint is marked as incomplete if it was determined to be invalid during the flush or scavenge.
Additionally, the library marks a checkpoint as failed if it detected a problem
during a previous fetch attempt (e.g., detected data corruption).
In this way, the library avoids invalid or problematic checkpoints.
One may list or modify the contents of the index file via the scr_index command. The scr_index command must run within the prefix directory, or otherwise, one may specify a prefix directory using the “--prefix” option.
The default behavior of scr_index is to list the contents of the index file, e.g.:
>>: scr_index
DSET VALID FLUSHED NAME
* 18 YES 2014-01-14T11:26:06 ckpt.18
12 YES 2014-01-14T10:28:23 ckpt.12
6 YES 2014-01-14T09:27:15 ckpt.6
When listing datasets, the internal SCR dataset id is shown, followed by a field indicating whether the dataset is valid, the time it was flushed to the parallel file system, and finally the dataset name.
One checkpoint may also be marked as “current”.
When restarting a job, the SCR library starts from the current dataset and works backwards.
The current dataset is denoted with a leading *
character.
One can change the current checkpoint using the --current option, providing the dataset name as an argument:
scr_index --current ckpt.12
In most cases, the SCR library or the SCR commands add all necessary entries to the index file.
However, there are cases where they may fail.
In particular, if the scr_postrun
command successfully scavenges a dataset
but the resource allocation ends before the command can rebuild missing files,
an entry may be missing from the index file.
In such cases, one may manually add the corresponding entry using the “--add” option. When adding a new dataset to the index file, the scr_index command checks whether the files in a dataset constitute a complete and valid set. It rebuilds missing files if there is sufficient redundancy data, and it writes the summary.scr file for the dataset if needed.
One must provide the SCR dataset id as an argument. To obtain the SCR dataset id value, look up the trailing integer on the names of scr.dataset subdirectories in the hidden .scr directory within the prefix directory:
scr_index --add 50
One may remove entries from the index file using the “--remove” option. This operation does not delete the corresponding dataset files. It only deletes the entry from the index.scr file:
scr_index --remove ckpt.50
This is useful if one deletes a dataset from the parallel file system and then wishes to update the index.