Scalable Checkpoint / Restart (SCR) User Guide

The Scalable Checkpoint / Restart (SCR) library enables MPI applications to utilize distributed storage on Linux clusters to attain high file I/O bandwidth for checkpointing, restarting, and writing large datasets. With SCR, jobs run more efficiently, recompute less work upon a failure, and reduce load on shared resources like the parallel file system. It provides the most benefit to large-scale jobs that write large datasets. SCR utilizes tiered storage in a cluster to provide applications with the following capabilities:

  • guidance for the optimal checkpoint frequency,
  • scalable checkpoint bandwidth,
  • scalable restart bandwidth,
  • scalable output bandwidth,
  • asynchronous data transfers to the parallel file system.

SCR originated as a production-level implementation of a multi-level checkpoint system of the type analyzed by [Vaidya]. SCR caches checkpoints in scalable storage, which is faster but less reliable than the parallel file system. It applies a redundancy scheme to the cache such that checkpoints can be recovered after common system failures. It also copies a subset of checkpoints to the parallel file system to recover from less common but more severe failures. In many failure cases, a job can be restarted from a checkpoint in cache, and writing and reading datasets in cache can be orders of magnitude faster than the parallel file system.

[Figure: Aggregate write bandwidth on Coastal]

When writing a cached dataset to the parallel file system, SCR can transfer data asynchronously. The application may continue once the data has been written to the cache while SCR copies the data to the parallel file system in the background. SCR supports general output datasets in addition to checkpoint datasets.

SCR consists of two components: a library and a set of commands. The application registers its dataset files with the SCR API, and the library maintains the dataset cache. The SCR commands are typically invoked from the job batch script. They are used to prepare the cache before a job starts, automate the process of restarting a job, and copy datasets from cache to the parallel file system upon a failure.

[Vaidya] “A Case for Two-Level Recovery Schemes”, Nitin H. Vaidya, IEEE Transactions on Computers, 1998, http://doi.ieeecomputersociety.org/10.1109/12.689645.

Support

The main repository for SCR is located at:

https://github.com/LLNL/scr.

From this site, you can download the source code and manuals for the current release of SCR.

For information about the project including active research efforts, please visit:

https://computation.llnl.gov/project/scr

To contact the developers of SCR for help with using or porting SCR, please visit:

https://computation.llnl.gov/project/scr/contact.php

There you will find links to join our discussion mailing list for help topics, and our announcement list for getting notifications of new SCR releases.

Contents

Quick Start

In this quick start guide, we assume that you already have a basic understanding of SCR and how it works on HPC systems. We will walk through a bare bones example to get you started quickly. For more in-depth information, please see subsequent sections in this user’s guide.

Obtaining the SCR Source

The latest version of the SCR source code is hosted on GitHub: https://github.com/LLNL/scr. You can clone the repository or download a release tarball.

Building SCR

SCR has several dependencies. A C compiler, MPI, CMake, and pdsh are required; the remaining dependencies are optional, and some SCR features are unavailable without them. SCR uses the standard mpicc compiler wrapper in its build, so mpicc must be in your PATH. This quick start guide assumes only the minimum set of dependencies, which can be obtained automatically with either Spack or CMake. For more help on installing SCR's dependencies or building SCR, please see Section Build SCR.

Spack

The most automated way to build SCR is to use the Spack package manager (https://github.com/spack/spack). SCR and all of its dependencies are available as Spack packages. After downloading Spack, simply type:

spack install scr

This will download and install SCR and its dependencies automatically.

CMake

To get started with CMake (version 2.8 or higher), the quick version of building SCR is:

git clone git@github.com:llnl/scr.git
mkdir build
mkdir install

cd build
cmake -DCMAKE_INSTALL_PREFIX=../install ../scr
make
make install
make test

Since pdsh is required, pass WITH_PDSH_PREFIX to CMake if pdsh is installed in a non-standard location. On most systems, MPI is detected automatically.

Building the SCR test_api Example

After installing SCR, change to the <install dir>/share/scr/examples directory under the installation path (<install dir> above). This directory contains the example programs supplied with SCR. For this quick start guide, we will use the test_api program. Build it by executing:

make test_api

Upon successful build, you will have an executable in your directory called test_api. You can use this test program to get a feel for how SCR works and to ensure that your build of SCR is working.

Running the SCR test_api Example

A quick test of your SCR installation can be done by setting a few environment variables in an interactive job allocation. The following assumes you are running on a SLURM-based system. If you are not using SLURM, then you will need to modify the allocation and run commands according to the resource manager you are using.

First, obtain a few compute nodes for testing. Here we will allocate 4 compute nodes on a system with a queue for debugging called pdebug:

salloc -N 4 -p pdebug

Once you have the four compute nodes, you can experiment with SCR using the test_api program. First, set a few environment variables. We’re using csh in this example; you’ll need to update the commands if you are using a different shell:

# make sure the SCR library is in your library path
setenv LD_LIBRARY_PATH ${SCR_INSTALL}/lib

# tell SCR to not flush to the parallel file system periodically
setenv SCR_FLUSH 0

Now we can run a simple test to see whether your SCR installation is working. Here we’ll launch a 4-process run across the 4 nodes:

srun -n4 -N4 ./test_api

Assuming all goes well, you should see output similar to the following:

>>: srun -N 4 -n 4 ./test_api
Init: Min 0.033856 s    Max 0.033857 s  Avg 0.033856 s
No checkpoint to restart from
At least one rank (perhaps all) did not find its checkpoint
Completed checkpoint 1.
Completed checkpoint 2.
Completed checkpoint 3.
Completed checkpoint 4.
Completed checkpoint 5.
Completed checkpoint 6.
FileIO: Min   52.38 MB/s        Max   52.39 MB/s        Avg   52.39 MB/s       Agg  209.55 MB/s

If you did not see output similar to this, there is likely a problem with your environment setup or your build of SCR. Please see the detailed sections of this user guide for more help, or email us (see the Support section of this user guide).

If you want to go into more depth, the SCR source directory contains a directory called testing with various scripts we use to test our code. Perhaps the most useful for getting started are the TESTING.csh and TESTING.sh files, depending on your shell preference.

Getting SCR into Your Application

Here we give a simple example of integrating SCR into an application to write checkpoints. Further sections in the user guide give more details and demonstrate how to perform restart with SCR. You can also look at the source of the test_api program and other programs in the examples directory.

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);

  /* Call SCR_Init after MPI_Init */
  SCR_Init();

  for(int t = 0; t < TIMESTEPS; t++)
  {
    /* ... Do work ... */

    int flag;
    /* Ask SCR if we should take a checkpoint now */
    SCR_Need_checkpoint(&flag);
    if (flag)
      checkpoint();
  }

  /* Call SCR_Finalize before MPI_Finalize */
  SCR_Finalize();
  MPI_Finalize();
  return 0;
}

void checkpoint() {
  /* Tell SCR that you are getting ready to start a checkpoint phase */
  SCR_Start_checkpoint();

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  char file[256];
  /* create your checkpoint file name */
  sprintf(file, "rank_%d.ckpt", rank);

  /* Call SCR_Route_file to request a new file name (scr_file) that will cause
     your application to write the file to a fast tier of storage, e.g.,
     a burst buffer */
  char scr_file[SCR_MAX_FILENAME];
  SCR_Route_file(file, scr_file);

  /* Use the new file name to perform your checkpoint I/O */
  FILE* fs = fopen(scr_file, "w");
  if (fs != NULL) {
    fwrite(state, ..., fs);
    fclose(fs);
  }

  /* Tell SCR that you are done with your checkpoint phase */
  SCR_Complete_checkpoint(1);
  return;
}

Final Thoughts

This was a really quick introduction to building and running with SCR. For more information, please look at the more detailed sections in the rest of this user guide or contact us with questions.

Assumptions

A number of assumptions are made in the SCR implementation. If any of these assumptions do not hold for a particular application, that application cannot use SCR. If this is the case, or if you have any questions, please notify the SCR developers. The goal is to expand the implementation to support a large number of applications.

  • The code must be an MPI application.
  • The code must read and write datasets as a file per process in a globally-coordinated fashion.
  • A process having a particular MPI rank is only guaranteed access to its own dataset files, i.e., a process of a given MPI rank may not access dataset files written by a process having a different MPI rank within the same run or across different runs.
  • To use the scalable restart capability, a job must be restarted with the same number of processes as used to write the checkpoint, and each process must only access the files it wrote during the checkpoint. Note that this may limit the effectiveness of the library for codes that are capable of restarting from a checkpoint with a different number of processes than were used to write the checkpoint. Such codes can often still benefit from the scalable checkpoint capability, but not the scalable restart – they must fall back to restarting from the parallel file system.
  • It must be possible to store the dataset files from all processes in the same directory. In particular, all files belonging to a given dataset must have distinct names.
  • SCR maintains a set of meta data files, which it stores in a subdirectory of the directory that contains the application dataset files. The application must allow for these SCR meta data files to coexist with its own files.
  • Files cannot contain data that span multiple datasets. In particular, there is no support for appending data of the current dataset to a file containing data from a previous dataset. Each dataset must be self-contained.
  • On some systems, datasets are cached in RAM disk. This restricts usage of SCR on those machines to applications whose memory footprint leaves sufficient room to store the dataset files in memory simultaneously with the running application. The amount of storage needed depends on the number of cached datasets and the redundancy scheme used. See Section Scalable checkpoint for details.
  • SCR occasionally flushes files from cache to the parallel file system. All files must reside under a top-level directory on the parallel file system called the “prefix” directory that is specified by the application. Under that prefix directory, the application may use file and subdirectory trees. One constraint is that no two datasets can write to the same file. See Section Control, cache, and prefix directories for details.
  • Time limits should be imposed so that the SCR library has sufficient time to flush files from cache to the parallel file system before the resource allocation expires. Additionally, care should be taken so that the run does not stop in the middle of a checkpoint. See Section Halt a job for details.

Concepts

This section discusses concepts one should understand about the SCR library implementation including how it interacts with file systems.

Jobs, allocations, and runs

A large-scale simulation often must be restarted multiple times in order to run to completion. It may be interrupted due to a failure, or it may be interrupted due to time limits imposed by the resource scheduler. We use the term allocation to refer to an assigned set of compute resources that are available to the user for a period of time. A resource manager typically assigns an identifier to each resource allocation, which we refer to as the allocation id. SCR uses the allocation id in some directory and file names. Within an allocation, a user may execute a simulation one or more times. We call each execution a run. For MPI applications, each run corresponds to a single invocation of mpirun or its equivalent. Finally, multiple allocations may be required to complete a given simulation. We refer to this series of one or more allocations as a job. To summarize, one or more runs occur within an allocation, and one or more allocations occur within a job.

Group, store, and redundancy descriptors

The SCR library must group processes of the parallel job in various ways. For example, if power supply failures are common, it is necessary to identify the set of processes that share a power supply. Similarly, it is necessary to identify all processes that can access a given storage device, such as an SSD mounted on a compute node. To represent these groups, the SCR library uses a group descriptor. Details of group descriptors are given in Section Group, store, and checkpoint descriptors.

Each group is given a unique name. The library creates two groups by default: NODE and WORLD. The NODE group consists of all processes on the same compute node, and WORLD consists of all processes in the run. The user or system administrator can create additional groups via configuration files (Section Configure a job).

The SCR library must also track details about each class of storage it can access. For each available storage class, SCR needs to know the associated directory prefix, the group of processes that share a device, the capacity of the device, and other details like whether the associated file system can support directories. SCR tracks this information in a store descriptor. Each store descriptor refers to a group descriptor, which specifies how processes are grouped with respect to that class of storage. For a given storage class, it is assumed that all compute nodes refer to the class using the same directory prefix. Each store descriptor is referenced by its directory prefix.

The library creates one store descriptor by default: /tmp. The assumption is made that /tmp is mounted as a local file system on each compute node. On Linux clusters, /tmp is often RAM disk or a local hard drive. Additional store descriptors can be defined by the user or system administrator in configuration files (Section Configure a job).

Finally, SCR defines redundancy descriptors to associate a redundancy scheme with a class of storage devices and a group of processes that are likely to fail at the same time. It also tracks details about the particular redundancy scheme used, and the frequency with which it should be applied. Redundancy descriptors reference both store and group descriptors.

The library creates a default redundancy descriptor. It assumes that processes on the same node are likely to fail at the same time. It also assumes that datasets can be cached in /tmp, which is assumed to be storage local to each compute node. It applies an XOR redundancy scheme using a group size of 8. Additional redundancy descriptors may be defined by the user or system administrator in configuration files (Section Configure a job).

Control, cache, and prefix directories

SCR manages numerous files and directories to cache datasets and to record its internal state. There are three fundamental types of directories: control, cache, and prefix directories. For a detailed illustration of how these files and directories are arranged, see the example presented in Section Example of SCR files and directories.

The control directory is where SCR writes files to store internal state about the current run. This directory is expected to be stored in node-local storage. SCR writes multiple, small files in the control directory, and it may access these files frequently. It is best to configure this directory to be stored in a node-local RAM disk.

To construct the full path of the control directory, SCR incorporates a control base directory name along with the user name and allocation id associated with the resource allocation. This enables multiple users, or multiple jobs by the same user, to run at the same time without conflicting for the same control directory. The control base directory is hard-coded into the SCR library at configure time, but this value may be overridden via a system configuration file. The user may not change the control base directory.

SCR directs the application to write dataset files to subdirectories within a cache directory. SCR also stores its redundancy data in these subdirectories. The device serving the cache directory must be large enough to hold the data for one or more datasets plus the associated redundancy data. Multiple cache directories may be utilized in the same run, which enables SCR to use more than one class of storage within a run (e.g., RAM disk and SSD). Cache directories should be located on scalable storage.

To construct the full path of a cache directory, SCR incorporates a cache base directory name with the user name and the allocation id associated with the resource allocation. A set of valid cache base directories is hard-coded into the SCR library at configure time, but this set can be overridden in a system configuration file. Out of this set, the user may select a subset of cache base directories to use during a run. A cache directory may be the same as the control directory.

The user must configure the maximum number of datasets that SCR should keep in each cache directory. It is up to the user to ensure that the capacity of the device associated with the cache directory is large enough to hold the specified number of datasets.

SCR refers to each application checkpoint or output set as a dataset. SCR assigns a unique sequence number to each dataset called the dataset id. It assigns dataset ids starting from 1 and counts up with each successive dataset written by the application. Within a cache directory, a dataset is written to its own subdirectory called the dataset directory.

Finally, the prefix directory is a directory on the parallel file system that the user specifies. SCR copies datasets to the prefix directory for permanent storage (Section Fetch, flush, and scavenge). The prefix directory should be accessible from all compute nodes, and the user must ensure that the prefix directory is unique for each job. For each dataset stored in the prefix directory, SCR creates and manages a dataset directory. The dataset directory holds all SCR redundancy files and meta data associated with a particular dataset. SCR maintains an index file within the prefix directory, which records information about each dataset stored there.

Note that the term “dataset directory” is overloaded. In some cases, we use this term to refer to a directory in cache and in other cases we use the term to refer to a directory within the prefix directory on the parallel file system. In any particular case, the meaning should be clear from the context.

Example of SCR files and directories

To illustrate how files and directories are arranged in SCR, consider the example shown in Figure Example SCR directories. In this example, a user named “user1” runs a 4-task MPI job with one task per compute node. The base directory for the control directory is /tmp, the base directory for the cache directory is /ssd, and the prefix directory is /p/lscratchb/user1/simulation123. The control and cache directories are storage devices local to the compute node.

[Figure: Example SCR directories]

The full path of the control directory is /tmp/user1/scr.1145655. This is derived from the concatenation of the control base directory (/tmp), the user name (user1), and the allocation id (1145655). SCR keeps files to persist its internal state in the control directory, including filemap files as shown.

Similarly, the cache directory is /ssd/user1/scr.1145655, which is derived from the concatenation of the cache base directory (/ssd), the user name (user1), and the allocation id (1145655). Within the cache directory, SCR creates a subdirectory for each dataset. In this example, there are two datasets with ids 17 and 18. The application dataset files and SCR redundancy files are stored within their corresponding dataset directory. On the node running MPI rank 0, there is one application dataset file (rank_0.ckpt) and one XOR redundancy data file (1_of_4_in_0.xor).

Finally, the full path of the prefix directory is /p/lscratchb/user1/simulation123. This is a path on the parallel file system that is specified by the user. It is unique to the particular simulation the user is running (simulation123). The prefix directory contains dataset directories. It also contains a hidden .scr directory where SCR writes the index file to record info for each of the datasets (Section Manage datasets). The SCR library writes other files to this hidden directory, including the “halt” file (Section Halt a job).

While the user provides the prefix directory, SCR defines the name of each dataset directory to be “scr.dataset.<id>” where <id> is the dataset id. In this example, there are multiple datasets stored on the parallel file system corresponding to dataset ids 10, 12, and 18. Within each dataset directory, SCR stores the files written by the application. SCR also creates a hidden .scr subdirectory, and this hidden directory contains redundancy files and other SCR files that are specific to the dataset.

Scalable checkpoint

In practice, it is common for multiple processes to fail at the same time, but most often this happens because those processes depend on a single, failed component. It is not common for multiple, independent components to fail simultaneously. By expressing the groups of processes that are likely to fail at the same time, the SCR library can apply redundancy schemes to withstand common, multi-process failures. We refer to a set of processes likely to fail at the same time as a failure group.

SCR must also know which groups of processes share a given storage device. This is useful so the group can coordinate its actions when accessing the device. For instance, if a common directory must be created before each process writes a file, a single process can create the directory and then notify the others. We refer to a set of processes that share a storage device as a storage group.

Users and system administrators can pass information about failure and storage groups to SCR in descriptors defined in configuration files (See Section Group, store, and checkpoint descriptors). Given this knowledge of failure and storage groups, the SCR library implements three redundancy schemes which trade off performance, storage space, and reliability:

  • Single - each checkpoint file is written to storage accessible to the local process
  • Partner - each checkpoint file is written to storage accessible to the local process, and a full copy of each file is written to storage accessible to a partner process from another failure group
  • XOR - each checkpoint file is written to storage accessible to the local process, XOR parity data are computed from checkpoints of a set of processes from different failure groups, and the parity data are stored among the set.

With Single, SCR writes each checkpoint file in storage accessible to the local process. It requires sufficient space to store the maximum checkpoint file size. This scheme is fast, but it cannot withstand failures that disable the storage device. For instance, when using node-local storage, this scheme cannot withstand failures that disable the node, such as when a node loses power or its network connection. However, it can withstand failures that kill the application processes but leave the node intact, such as application bugs and file I/O errors.

With Partner, SCR writes checkpoint files to storage accessible to the local process, and it also copies each checkpoint file to storage accessible to a partner process from another failure group. This scheme is slower than Single, and it requires twice the storage space. However, it is capable of withstanding failures that disable a storage device. In fact, it can withstand failures of multiple devices, so long as a device and the device holding the copy do not fail simultaneously.

With XOR, SCR defines sets of processes where members within a set are selected from different failure groups. The processes within a set collectively compute XOR parity data, which are stored in files alongside the application checkpoint files. This algorithm is based on the work found in [Gropp], which in turn was inspired by RAID5 [Patterson]. This scheme can withstand multiple failures so long as two processes from the same set do not fail simultaneously.

Computationally, XOR is more expensive than Partner, but it requires less storage space. Whereas Partner must store two full checkpoint files, XOR stores one full checkpoint file plus one XOR parity segment, where the segment size is roughly \(1/(N-1)\) times the size of a checkpoint file for a set of size N. Larger sets demand less storage, but they also increase the probability that two processes in the same set will fail simultaneously. Larger sets may also increase the cost of recovering files in the event of a failure.

[Patterson] “A Case for Redundant Arrays of Inexpensive Disks (RAID)”, D. Patterson, G. Gibson, and R. Katz, Proc. of 1988 ACM SIGMOD Conf. on Management of Data, 1988, http://web.mit.edu/6.033/2015/wwwdocs/papers/Patterson88.pdf.
[Gropp] “Providing Efficient I/O Redundancy in MPI Environments”, William Gropp, Robert Ross, and Neill Miller, Lecture Notes in Computer Science, 3241:77–86, September 2004. 11th European PVM/MPI Users Group Meeting, 2004, http://www.mcs.anl.gov/papers/P1178.pdf.

Scalable restart

So long as a failure does not violate the redundancy scheme, a job can restart within the same resource allocation using the cached checkpoint files. This saves the cost of writing checkpoint files out to the parallel file system only to read them back during the restart. In addition, SCR provides support for the use of spare nodes. A job can allocate more nodes than it needs and use the extra nodes to fill in for any failed nodes during a restart. SCR includes a set of scripts which encode much of the restart logic (Section Run a job).

Upon encountering a failure, SCR relies on the MPI library, the resource manager, or some other external service to kill the current run. After the run is killed, and if there are sufficient healthy nodes remaining, the same job can be restarted within the same allocation. In practice, such a restart typically amounts to issuing another “mpirun” in the job batch script.

Of the set of nodes used by the previous run, the restarted run should use as many of the same nodes as it can to maximize the number of files available in cache. A given MPI rank in the restarted run does not need to run on the same node that it ran on in the previous run. SCR distributes cached files among processes according to the process mapping of the restarted run.

By default, SCR inspects the cache for existing checkpoints when a job starts. It attempts to rebuild all datasets in cache, and then it attempts to restart the job from the most recent checkpoint. If a checkpoint fails to rebuild, SCR deletes it from cache. To disable restarting from cache, set the SCR_DISTRIBUTE parameter to 0. When disabled, SCR deletes all files from cache and restarts from a checkpoint on the parallel file system.

An example restart scenario is illustrated in the figure below, in which a 4-node job using the Partner scheme allocates 5 nodes and successfully restarts within the allocation after a node fails.

[Figure: Example restart after a failed node with Partner]

Catastrophic failures

There are some failures from which the SCR library cannot recover. In such cases, the application is forced to fall back to the latest checkpoint successfully written to the parallel file system. Such catastrophic failures include the following:

  • Multiple node failure which violates the redundancy scheme. If multiple nodes fail in a pattern which violates the cache redundancy scheme, data are irretrievably lost.
  • Failure during a checkpoint. Due to cache size limitations, some applications can only fit one checkpoint in cache at a time. For such cases, a failure may occur after the library has deleted the previous checkpoint but before the next checkpoint has completed. In this case, there is no valid checkpoint in cache to recover.
  • Failure of the node running the job batch script. The logic at the end of the allocation to scavenge the latest checkpoint from cache to the parallel file system executes as part of the job batch script. If the node executing this script fails, the scavenge logic will not execute and the allocation will terminate without copying the latest checkpoint to the parallel file system.
  • Parallel file system outage. If the application fails when writing output due to an outage of the parallel file system, the scavenge logic may also fail when it attempts to copy files to the parallel file system.

There are other catastrophic failure cases not listed here. Checkpoints must be written to the parallel file system with some moderate frequency so as not to lose too much work in the event of a catastrophic failure. Section Fetch, flush, and scavenge provides details on how to configure SCR to make occasional writes to the parallel file system.

By default, the current implementation stores only the most recent checkpoint in cache. One can change the number of checkpoints stored in cache by setting the SCR_CACHE_SIZE parameter. If space is available, it is recommended to increase this value to at least 2.

Fetch, flush, and scavenge

SCR manages the transfer of datasets between the prefix directory on the parallel file system and the cache. We use the term fetch to refer to the action of copying a dataset from the parallel file system to cache. When transferring data in the other direction, there are two terms used: flush and scavenge. Under normal circumstances, the library directly copies files from cache to the parallel file system, and this direct transfer is known as a flush. However, sometimes a run is killed before the library can complete this transfer. In these cases, a set of SCR commands is executed after the final run to ensure that the latest checkpoint is copied to the parallel file system before the allocation expires. We say that these scripts scavenge the latest checkpoint.

Each time an SCR job starts, SCR first inspects the cache and attempts to distribute files for a scalable restart as discussed in Section Scalable restart. If the cache is empty or the distribute operation fails or is disabled, SCR attempts to fetch a checkpoint from the prefix directory to fill the cache. SCR reads the index file and attempts to fetch the most recent checkpoint, or otherwise the checkpoint that is marked as current within the index file. For a given checkpoint, SCR records whether the fetch attempt succeeds or fails in the index file. SCR does not attempt to fetch a checkpoint that is marked as being incomplete nor does it attempt to fetch a checkpoint for which a previous fetch attempt has failed. If SCR attempts but fails to fetch a checkpoint, it prints an error and continues the run.

To disable the fetch operation, set the SCR_FETCH parameter to 0. If an application disables the fetch feature, the application is responsible for reading its checkpoint set directly from the parallel file system upon a restart.

To withstand catastrophic failures, it is necessary to write checkpoint sets out to the parallel file system with some moderate frequency. In the current implementation, the SCR library writes a checkpoint set out to the parallel file system after every 10 checkpoints. This frequency can be configured by setting the SCR_FLUSH parameter. When this parameter is set, SCR decrements a counter with each successful checkpoint. When the counter hits 0, SCR writes the current checkpoint set out to the file system and resets the counter to the value specified in SCR_FLUSH. SCR preserves this counter between scalable restarts, and when used in conjunction with SCR_FETCH, it also preserves this counter between fetch and flush operations such that it is possible to maintain periodic checkpoint writes across runs. Set SCR_FLUSH to 0 to disable periodic writes in SCR. If an application disables the periodic flush feature, the application is responsible for writing occasional checkpoint sets to the parallel file system.

By default, SCR computes and stores a CRC32 checksum value for each checkpoint file during a flush. It then uses the checksum to verify the integrity of each file as it is read back into cache during a fetch. If data corruption is detected, SCR falls back to fetch an earlier checkpoint set. To disable this checksum feature, set the SCR_CRC_ON_FLUSH parameter to 0.

Build SCR

Dependencies

SCR has several dependencies. A C compiler, MPI, CMake, and pdsh are required; the remaining dependencies are optional, and some SCR features are unavailable without them.

Spack

The most automated way to build SCR is to use the Spack package manager (https://github.com/spack/spack). SCR and all of its dependencies are available as Spack packages. After downloading Spack, simply type:

spack install scr

This will install SCR along with the DTCMP, LWGRP, and pdsh packages (and possibly an MPI library and a C compiler if needed).

CMake

To get started with CMake (version 2.8 or higher), the quick version of building SCR is:

git clone git@github.com:llnl/scr.git
mkdir build
mkdir install

cd build
cmake -DCMAKE_INSTALL_PREFIX=../install ../scr
make
make install
make test

Some useful CMake command line options are:

  • -DCMAKE_INSTALL_PREFIX=[path]: Place to install the SCR library
  • -DCMAKE_BUILD_TYPE=[Debug/Release]: Build with debugging or optimizations
  • -DBUILD_PDSH=[OFF/ON]: CMake can automatically download and build the PDSH dependency
  • -DWITH_PDSH_PREFIX=[path to PDSH]: Path to an existing PDSH installation (should not be used with BUILD_PDSH)
  • -DWITH_DTCMP_PREFIX=[path to DTCMP]
  • -DWITH_YOGRT_PREFIX=[path to YOGRT]
  • -DSCR_ASYNC_API=[CRAY_DW/INTEL_CPPR/IBM_BBAPI/NONE]
  • -DSCR_RESOURCE_MANAGER=[SLURM/APRUN/PMIX/LSF/NONE]
  • -DSCR_CNTL_BASE=[path]: Path to SCR Control directory, defaults to /tmp
  • -DSCR_CACHE_BASE=[path]: Path to SCR Cache directory, defaults to /tmp
  • -DSCR_CONFIG_FILE=[path]: Path to SCR system configuration file, defaults to /etc/scr/scr.conf

To change the SCR control directory, one must either set -DSCR_CNTL_BASE at build time or one must specify SCR_CNTL_BASE in the SCR system configuration file. These paths are hardcoded into the SCR library and scripts during the build process. It is not possible to specify the control directory through environment variables or the user configuration file.

Unlike the control directory, the SCR cache directory can be specified at run time through either environment variables or the user configuration file.

SCR API

SCR is designed to support MPI applications that write application-level checkpoints and output datasets. Both types of datasets (checkpoints and output) must be stored as a file-per-process, and they must be accessed in a globally-coordinated fashion. In a given dataset, each process may actually write zero or more files, but the current implementation assumes that each process writes roughly the same amount of data.

Parallel file systems allow any process in an MPI job to read/write any byte of a file at any time. However, most applications do not require this full generality. SCR supplies API calls that enable the application to specify limits on its data access in both time and space. Start and complete calls indicate when an application needs to write or read its data. Data cannot be accessed outside of these markers. Additionally, each MPI process may only read files written by itself or another process having the same MPI rank in a previous run. An MPI process cannot read files written by a process having a different MPI rank.

The API is designed to be simple, scalable, and portable. It consists of a small number of function calls to wrap existing application I/O logic. Unless otherwise stated, SCR functions are collective, meaning all processes must call the function synchronously. The underlying implementation may or may not be synchronous, but to be portable, an application must treat a collective call as though it is synchronous. This constraint enables the SCR implementation to utilize the full resources of the job in a collective manner to optimize performance at critical points such as computing redundancy data.

In the sections below, we show the function prototypes for C and Fortran, respectively. Applications written in C should include “scr.h”, and Fortran should include “scrf.h”. All functions return SCR_SUCCESS if successful.
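
For reference, a minimal C skeleton that includes the SCR header and brackets the library calls might look like the following. This is an illustrative sketch only; error checking is omitted.

#include <mpi.h>
#include "scr.h"

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  SCR_Init();

  /* ... checkpoint, restart, and output calls go here ... */

  SCR_Finalize();
  MPI_Finalize();
  return 0;
}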

General API

SCR_Init

int SCR_Init();
SCR_INIT(IERROR)
  INTEGER IERROR

Initialize the SCR library. This function must be called after MPI_Init, and it is good practice to call this function immediately after MPI_Init. A process should only call SCR_Init once during its execution. No other SCR calls are valid until a process has returned from SCR_Init.

SCR_Finalize

int SCR_Finalize();
SCR_FINALIZE(IERROR)
  INTEGER IERROR

Shut down the SCR library. This function must be called before MPI_Finalize, and it is good practice to call this function just before MPI_Finalize. A process should only call SCR_Finalize once during its execution.

If SCR_FLUSH is enabled, SCR_Finalize flushes any datasets to the prefix directory if necessary. It updates the halt file to indicate that SCR_Finalize has been called. This halt condition prevents the job from restarting (Section Halt a job).

SCR_Get_version

char* SCR_Get_version(void);
SCR_GET_VERSION(VERSION, IERROR)
  CHARACTER*(*) VERSION
  INTEGER IERROR

This function returns a string that indicates the version number of SCR that is currently in use.
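
For example, an application might log the library version when it starts up. The following fragment is a sketch; printing from rank 0 only is an arbitrary choice.

/* report the SCR version from rank 0 */
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
  printf("SCR version: %s\n", SCR_Get_version());
}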

SCR_Should_exit

int SCR_Should_exit(int* flag);
SCR_SHOULD_EXIT(FLAG, IERROR)
  INTEGER FLAG, IERROR

SCR_Should_exit provides a portable way for an application to determine whether it should halt its execution. This function is passed a pointer to an integer in flag. Upon returning from SCR_Should_exit, flag is set to the value 1 if the application should stop, and it is set to 0 otherwise. The call returns the same value in flag on all processes. It is recommended to call this function after each checkpoint.

Since datasets in cache may be deleted by the system at the end of an allocation, it is critical for a job to stop early enough to leave time to copy datasets from cache to the parallel file system before the allocation expires. By default, the SCR library automatically calls exit at certain points. This works especially well in conjunction with the SCR_HALT_SECONDS parameter. However, this default behavior does not provide the application a chance to exit cleanly. SCR can be configured to avoid an automatic exit using the SCR_HALT_ENABLED parameter.

This call also enables a running application to react to external commands. For instance, if the application has been instructed to halt using the scr_halt command, then SCR_Should_exit relays that information.
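
For example, a time step loop might poll SCR_Should_exit after each checkpoint and leave the loop cleanly when told to stop. The fragment below is a sketch of that pattern.

/* after writing a checkpoint inside the time step loop */
int do_exit;
SCR_Should_exit(&do_exit);
if (do_exit) {
  /* break out of the loop so SCR_Finalize and MPI_Finalize
   * run before the allocation expires */
  break;
}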

SCR_Route_file

int SCR_Route_file(const char* name, char* file);
SCR_ROUTE_FILE(NAME, FILE, IERROR)
  CHARACTER*(*) NAME, FILE
  INTEGER IERROR

When files are under control of SCR, they may be written to or exist on different levels of the storage hierarchy at different points in time. For example, a checkpoint might be written first to the RAM disk of a compute node and then later transferred to a burst buffer or the parallel file system by SCR. In order for an application to discover where a file should be written to or read from, we provide the SCR_Route_file routine.

A process calls SCR_Route_file to obtain the full path and file name it must use to access a file under SCR control. The name of the file that the process intends to access must be passed in the name argument. A pointer to a character buffer of at least SCR_MAX_FILENAME bytes must be passed in file. When a call to SCR_Route_file returns, the full path and file name to access the file named in name is written to the buffer pointed to by file. The process must use the character string returned in file to access the file. A process does not need to create any directories listed in the string returned in file. The SCR implementation creates any necessary directories before returning from the call. A call to SCR_Route_file is local to the calling process; it is not a collective call.

As of version 1.2.2, SCR_Route_file can be successfully called at any point during application execution. If it is called outside of a Start/Complete pair, the original file path is simply copied to the return string.

SCR_Route_file has special behavior when called within a Start/Complete pair for restart, checkpoint, or output. Within a restart operation, the input parameter name only requires a file name. No path component is needed. SCR will return a full path to the file from the most recent checkpoint having the same name. It will return an error if no file by that name exists. Within checkpoint and output operations, the input parameter name also specifies the final path on the parallel file system. The caller may provide either absolute or relative path components in name. If the path is relative, SCR prepends the current working directory to name at the time that SCR_Route_file is called. With either an absolute or relative path, all paths must resolve to a location within the subtree rooted at the SCR prefix directory.

In the current implementation, SCR only changes the directory portion of name. It extracts the base name of the file by removing any directory components in name. Then it prepends a directory to the base file name and returns the full path and file name in file.
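
As an illustration, the sketch below routes a file whose name includes a hypothetical subdirectory component during a checkpoint or output phase. The timestep and rank variables are assumed to be defined by the application.

/* between the Start and Complete calls of a checkpoint or output phase;
 * a relative path is resolved against the current working directory and
 * must fall under the prefix directory */
char name[SCR_MAX_FILENAME];
char path[SCR_MAX_FILENAME];
snprintf(name, sizeof(name), "ckpt.%d/rank_%d.ckpt", timestep, rank);
SCR_Route_file(name, path);

/* write the file using the routed path; SCR has already created
 * any directories needed for this path */
FILE* fs = fopen(path, "w");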

Checkpoint API

Here we describe the SCR API functions that are used for writing checkpoints.

SCR_Need_checkpoint

int SCR_Need_checkpoint(int* flag);
SCR_NEED_CHECKPOINT(FLAG, IERROR)
  INTEGER FLAG, IERROR

Since the failure frequency and the cost of checkpointing vary across platforms, SCR_Need_checkpoint provides a portable way for an application to determine whether a checkpoint should be taken. This function is passed a pointer to an integer in flag. Upon returning from SCR_Need_checkpoint, flag is set to the value 1 if a checkpoint should be taken, and it is set to 0 otherwise. The call returns the same value in flag on all processes.

SCR_Start_checkpoint

int SCR_Start_checkpoint();
SCR_START_CHECKPOINT(IERROR)
  INTEGER IERROR

Inform SCR that a new checkpoint is about to start. A process must call this function before it opens any files belonging to the new checkpoint. SCR_Start_checkpoint must be called by all processes, including processes that do not write files as part of the checkpoint. This function should be called as soon as possible when initiating a checkpoint. The SCR implementation uses this call as the starting point to time the cost of the checkpoint in order to optimize the checkpoint frequency via SCR_Need_checkpoint. Each call to SCR_Start_checkpoint must be followed by a corresponding call to SCR_Complete_checkpoint.

In the current implementation, SCR_Start_checkpoint holds all processes at an MPI_Barrier to ensure that all processes are ready to start the checkpoint before it deletes cached files from a previous checkpoint.

SCR_Complete_checkpoint

int SCR_Complete_checkpoint(int valid);
SCR_COMPLETE_CHECKPOINT(VALID, IERROR)
  INTEGER VALID, IERROR

Inform SCR that all files for the current checkpoint are complete (i.e., done writing and closed) and whether they are valid (i.e., written without error). A process must close all checkpoint files before calling SCR_Complete_checkpoint. SCR_Complete_checkpoint must be called by all processes, including processes that did not write any files as part of the checkpoint.

The parameter valid should be set to 1 if either the calling process wrote all of its files successfully or it wrote no files during the checkpoint. Otherwise, the process should call SCR_Complete_checkpoint with valid set to 0. SCR will determine whether all processes wrote their checkpoint files successfully.

The SCR implementation uses this call as the stopping point to time the cost of the checkpoint that started with the preceding call to SCR_Start_checkpoint. Each call to SCR_Complete_checkpoint must be preceded by a corresponding call to SCR_Start_checkpoint.

In the current implementation, SCR applies the redundancy scheme during SCR_Complete_checkpoint. Before returning from the function, MPI rank 0 determines whether the job should be halted and signals this condition to all other ranks (Section Halt a job). If the job should be halted, rank 0 records a reason in the halt file, and then all tasks call exit, unless the auto exit feature is disabled.
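
The sketch below shows how a process might derive the valid argument from its own I/O result; the scr_file, buf, and size variables are assumed to come from the application.

/* write this process's checkpoint file and report success or failure */
int valid = 1;
FILE* fs = fopen(scr_file, "w");
if (fs == NULL) {
  valid = 0;
} else {
  if (fwrite(buf, 1, size, fs) != size) {
    valid = 0;
  }
  fclose(fs);
}
SCR_Complete_checkpoint(valid);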

Restart API

Here we describe the SCR API functions used for restarting applications.

SCR_Have_restart

int SCR_Have_restart(int* flag, char* name);
SCR_HAVE_RESTART(FLAG, NAME, IERROR)
  INTEGER FLAG
  CHARACTER*(*) NAME
  INTEGER IERROR

This function indicates whether SCR has a checkpoint available for the application to read. This function is passed a pointer to an integer in flag. Upon returning from SCR_Have_restart, flag is set to the value 1 if a checkpoint is available, and it is set to 0 otherwise. The call returns the same value in flag on all processes.

A pointer to a character buffer of at least SCR_MAX_FILENAME bytes can be passed in name. If there is a checkpoint, and if that checkpoint was assigned a name when it was created, SCR_Have_restart returns the name of that checkpoint in name. The value returned in name is the same string that was passed to SCR_Start_output when the checkpoint was created. In C, one may optionally pass NULL to this function to avoid returning the name. The same value is returned in name on all processes.

SCR_Start_restart

int SCR_Start_restart(char* name);
SCR_START_RESTART(NAME, IERROR)
  CHARACTER*(*) NAME
  INTEGER IERROR

This function informs SCR that a restart operation is about to start. A process must call this function before it opens any files belonging to the restart. SCR_Start_restart must be called by all processes, including processes that do not read files as part of the restart.

SCR returns the name of the loaded checkpoint in name. A pointer to a character buffer of at least SCR_MAX_FILENAME bytes can be passed in name. The value returned in name is the same string that was passed to SCR_Start_output when the checkpoint was created. In C, one may optionally pass NULL to this function to avoid returning the name. The same value is returned in name on all processes.

One may only call SCR_Start_restart when SCR_Have_restart indicates that there is a checkpoint to read. SCR_Start_restart returns the same value in name as the preceding call to SCR_Have_restart.

Each call to SCR_Start_restart must be followed by a corresponding call to SCR_Complete_restart.

SCR_Complete_restart

int SCR_Complete_restart(int valid);
SCR_COMPLETE_RESTART(VALID, IERROR)
  INTEGER VALID, IERROR

This call informs SCR that the process has finished reading its checkpoint files. A process must close all restart files before calling SCR_Complete_restart. SCR_Complete_restart must be called by all processes, including processes that did not read any files as part of the restart.

The parameter valid should be set to 1 if either the calling process read all of its files successfully or it read no files during the restart. Otherwise, the process should call SCR_Complete_restart with valid set to 0. SCR will determine whether all processes read their checkpoint files successfully based on the values supplied in the valid parameter. If any process failed to read its checkpoint files, then SCR will abort.

Each call to SCR_Complete_restart must be preceded by a corresponding call to SCR_Start_restart.
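
Putting the restart calls together, a minimal sketch might look like the following. It assumes checkpoint files named rank_N.ckpt, as in the checkpoint example from the Quick Start section.

int have_restart;
char name[SCR_MAX_FILENAME];
SCR_Have_restart(&have_restart, name);
if (have_restart) {
  SCR_Start_restart(name);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* build the file name and ask SCR where to read it from */
  char file[SCR_MAX_FILENAME];
  char scr_file[SCR_MAX_FILENAME];
  snprintf(file, sizeof(file), "rank_%d.ckpt", rank);
  SCR_Route_file(file, scr_file);

  /* read this process's checkpoint and report success or failure */
  int valid = 1;
  FILE* fs = fopen(scr_file, "r");
  if (fs != NULL) {
    /* ... read application state ... */
    fclose(fs);
  } else {
    valid = 0;
  }

  SCR_Complete_restart(valid);
}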

Output API

As of SCR version 1.2.0, SCR has the ability to manage application output datasets in addition to checkpoint datasets. Using a combination of bit flags, a dataset can be designated as a checkpoint, output, or both. The checkpoint property means that the dataset can be used to restart the application. The output property means that the dataset must be written to the prefix directory. This enables an application to utilize asynchronous transfers to the parallel file system for both its checkpoints and large output sets, so that it can return to computation while the dataset migrates to the parallel file system in the background.

If a user specifies that a dataset is a checkpoint only, then the dataset will be managed with the SCR Output API as it would be if the SCR Checkpoint API were used. In particular, SCR may delete the checkpoint when a more recent checkpoint is established.

If a user specifies that a dataset is for output only, the dataset will first be cached on a tier of storage specified in the configuration file for the run and protected with the corresponding redundancy scheme. Then, the dataset will be moved to the prefix directory. When the transfer to the prefix directory is complete, the cached copy of the output dataset will be deleted.

If the user specifies that the dataset is both output and checkpoint, then SCR will use a hybrid approach. Files in the dataset will be cached and redundancy schemes will be used to protect the files. The dataset will be copied to the prefix directory, but it will also be kept in cache according to the policy set in the configuration for checkpoints. For example, if the user has set the configuration to keep three checkpoints in cache, then the dataset will be preserved until it is replaced by a newer checkpoint after three more checkpoint phases.

SCR_Start_output

int SCR_Start_output(char* name, int flags);
SCR_START_OUTPUT(NAME, FLAGS, IERROR)
  CHARACTER*(*) NAME
  INTEGER FLAGS, IERROR

Inform SCR that a new output phase is about to start. A process must call this function before it opens any files belonging to the dataset. SCR_Start_output must be called by all processes, including processes that do not write files as part of the dataset.

The caller can provide a name for the dataset in name. This name is used in two places. First, for checkpoints, it is returned as the name value in the SCR Restart API. Second, it is exposed to the user when listing datasets using the scr_index command, and the user may specify the name as a command line argument at times. For this reason, it is recommended to use short but meaningful names that are easy to type. The name value must be less than SCR_MAX_FILENAME characters. All processes should provide identical values in name. In C, the application may pass NULL for name in which case SCR generates a default name for the dataset based on its internal dataset id.

The dataset can be output, a checkpoint, or both. The caller specifies these properties using SCR_FLAG_OUTPUT and SCR_FLAG_CHECKPOINT bit flags. Additionally, a SCR_FLAG_NONE flag is defined for initializing variables. In C, these values can be combined with the | bitwise OR operator. In Fortran, these values can be added together using the + sum operator. Note that with Fortran, the values should be used at most once in the addition. All processes should provide identical values in flags.

This function should be called as soon as possible when initiating a dataset output. It is used internally within SCR for timing the cost of output operations. Each call to SCR_Start_output must be followed by a corresponding call to SCR_Complete_output.

In the current implementation, SCR_Start_output holds all processes at an MPI_Barrier to ensure that all processes are ready to start the output before it deletes cached files from a previous checkpoint.

SCR_Complete_output

int SCR_Complete_output(int valid);
SCR_COMPLETE_OUTPUT(VALID, IERROR)
  INTEGER VALID, IERROR

Inform SCR that all files for the current dataset output are complete (i.e., done writing and closed) and whether they are valid (i.e., written without error). A process must close all files in the dataset before calling SCR_Complete_output. SCR_Complete_output must be called by all processes, including processes that did not write any files as part of the output.

The parameter valid should be set to 1 if either the calling process wrote all of its files successfully or it wrote no files during the output phase. Otherwise, the process should call SCR_Complete_output with valid set to 0. SCR will determine whether all processes wrote their output files successfully.

Each call to SCR_Complete_output must be preceded by a corresponding call to SCR_Start_output.

For the case of checkpoint datasets, SCR_Complete_output behaves similarly to SCR_Complete_checkpoint.
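
Putting the two calls together, a sketch of a phase that produces a dataset serving as both checkpoint and output might look like the following; the dataset name is arbitrary and the file I/O is elided.

/* start a dataset that is both a checkpoint and output */
SCR_Start_output("timestep.500", SCR_FLAG_CHECKPOINT | SCR_FLAG_OUTPUT);

/* ... open files via SCR_Route_file, write them, and close them ... */

/* pass 1 if this process wrote its files successfully, 0 otherwise */
SCR_Complete_output(1);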

Space/time semantics

SCR imposes the following semantics:

  • A process of a given MPI rank may only access files previously written by itself or by processes having the same MPI rank in prior runs. We say that a rank “owns” the files it writes. A process is never guaranteed access to files written by other MPI ranks.
  • During a checkpoint, a process may only access files of the current checkpoint between calls to SCR_Start_checkpoint() and SCR_Complete_checkpoint(). Once a process calls SCR_Complete_checkpoint() it is no longer guaranteed access to any file that it registered as part of that checkpoint via SCR_Route_file().
  • During a restart, a process may only access files from its “most recent” checkpoint, and it must access those files between calls to SCR_Start_restart() and SCR_Complete_restart(). Once a process calls SCR_Complete_restart() it is no longer guaranteed access to its restart files. SCR selects which checkpoint is considered to be the “most recent”.

These semantics enable SCR to cache files on devices that are not globally visible to all processes, such as node-local storage. Further, these semantics enable SCR to move, reformat, or delete files as needed, such that it can manage this cache.

SCR API state transitions

[Figure: SCR API State Transition Diagram]

Figure SCR API State Transition Diagram illustrates the internal states in SCR and which API calls can be used from within each state. The application must call SCR_Init before it may call any other SCR function, and it may not call SCR functions after calling SCR_Finalize. Some calls transition SCR from one state to another as shown by the edges between states. Other calls are only valid when in certain states as shown in the boxes. For example, SCR_Route_file is only valid within the Checkpoint, Restart, or Output states. All SCR functions are implicitly collective across MPI_COMM_WORLD, except for SCR_Route_file and SCR_Get_version.

Integrate SCR

This section provides details on how to integrate the SCR API into an application.

Using the SCR API

Before adding calls to the SCR library, consider an application that has existing checkpointing code which looks like the following:

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);

  /* initialize our state from checkpoint file */
  state = restart();

  for (t = 0; t < TIMESTEPS; t++) {
    /* ... do work ... */

    /* every so often, write a checkpoint */
    if (t % CHECKPOINT_FREQUENCY == 0)
      checkpoint();
  }

  MPI_Finalize();
  return 0;
}

void checkpoint() {
  /* rank 0 creates a directory on the file system,
   * and then each process saves its state to a file */

  /* get rank of this process */
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* rank 0 creates directory on parallel file system */
  if (rank == 0) mkdir(checkpoint_dir);

  /* hold all processes until directory is created */
  MPI_Barrier(MPI_COMM_WORLD);

  /* build file name of checkpoint file for this rank */
  char checkpoint_file[256];
  sprintf(checkpoint_file, "%s/rank_%d.ckpt",
    checkpoint_dir, rank
  );

  /* each rank opens, writes, and closes its file */
  FILE* fs = fopen(checkpoint_file, "w");
  if (fs != NULL) {
    fwrite(checkpoint_data, ..., fs);
    fclose(fs);
  }

  /* wait for all files to be closed */
  MPI_Barrier(MPI_COMM_WORLD);

  /* rank 0 updates the pointer to the latest checkpoint */
  FILE* fs = fopen("latest", "w");
  if (fs != NULL) {
    fwrite(checkpoint_dir, ..., fs);
    fclose(fs);
  }
}

void* restart() {
  /* rank 0 broadcasts directory name to read from,
   * and then each process reads its state from a file */

  /* get rank of this process */
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* rank 0 reads and broadcasts checkpoint directory name */
  char checkpoint_dir[256];
  if (rank == 0) {
    FILE* fs = fopen("latest", "r");
    if (fs != NULL) {
      fread(checkpoint_dir, ..., fs);
      fclose(fs);
    }
  }
  MPI_Bcast(checkpoint_dir, sizeof(checkpoint_dir), MPI_CHAR, ...);

  /* build file name of checkpoint file for this rank */
  char checkpoint_file[256];
  sprintf(checkpoint_file, "%s/rank_%d.ckpt",
    checkpoint_dir, rank
  );

  /* each rank opens, reads, and closes its file */
  FILE* fs = fopen(checkpoint_file, "r");
  if (fs != NULL) {
    fread(state, ..., fs);
    fclose(fs);
  }

  return state;
}

There are three steps to consider when integrating the SCR API into an application: Init/Finalize, Checkpoint, and Restart. One may employ the scalable checkpoint capability of SCR without the scalable restart capability. While using both provides the most benefit, some applications cannot use the scalable restart capability.

The following code exemplifies the changes necessary to integrate SCR. Each change is numbered for further discussion below.

Init/Finalize

You must add calls to SCR_Init and SCR_Finalize in order to start up and shut down the library. The SCR library uses MPI internally, and all calls to SCR must be from within a well defined MPI environment, i.e., between MPI_Init and MPI_Finalize. It is recommended to call SCR_Init immediately after MPI_Init and to call SCR_Finalize just before MPI_Finalize. For example, modify the source to look something like this

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);

  /**** change #1 ****/
  SCR_Init();

  /**** change #2 ****/
  int have_restart;
  SCR_Have_restart(&have_restart, NULL);
  if (have_restart)
    state = restart();
  else
    state = new_run_state;

  for (t = 0; t < TIMESTEPS; t++) {
    /* ... do work ... */

    /**** change #3 ****/
    int need_checkpoint;
    SCR_Need_checkpoint(&need_checkpoint);
    if (need_checkpoint)
      checkpoint();
  }

  /**** change #4 ****/
  SCR_Finalize();

  MPI_Finalize();
  return 0;
}

First, as shown in change #1, one must call SCR_Init() to initialize the SCR library before it can be used. SCR uses MPI, so SCR must be initialized after MPI has been initialized. Similarly, as shown in change #4, one should shut down the SCR library by calling SCR_Finalize(). This must be done before calling MPI_Finalize(). Internally, SCR duplicates MPI_COMM_WORLD during SCR_Init, so MPI messages from the SCR library do not mix with messages sent by the application.

Some applications contain multiple calls to MPI_Finalize. In such cases, be sure to account for each call. The same applies to MPI_Init if there are multiple calls to this function.

In change #2, the application can call SCR_Have_restart() to determine whether there is a checkpoint to read in. If so, it calls its restart function, otherwise it assumes it is starting from scratch. This should only be called if the application is using the scalable restart feature of SCR.

As shown in change #3, the application may rely on SCR to determine when to checkpoint by calling SCR_Need_checkpoint(). SCR can be configured with information on failure rates and checkpoint costs for the particular host platform, so this function provides a portable way to guide an application toward an optimal checkpoint frequency across platforms with different reliability characteristics and file system speeds. The application should call SCR_Need_checkpoint at each natural opportunity it has to checkpoint, e.g., at the end of each time step, and initiate a checkpoint when SCR advises it to do so. An application may ignore the result of SCR_Need_checkpoint, and it is not required to call the function at all.
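
For reference, this guidance can be supplied through the SCR_CHECKPOINT_SECONDS and SCR_CHECKPOINT_OVERHEAD parameters described in Section Configure a job. The values below are purely illustrative:

# require at least 10 minutes between consecutive checkpoints
export SCR_CHECKPOINT_SECONDS=600

# keep checkpointing overhead below roughly 5% of run time
export SCR_CHECKPOINT_OVERHEAD=5.0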

Checkpoint

To actually write a checkpoint, there are three steps. First, the application must call SCR_Start_checkpoint to define the start boundary of a new checkpoint. It must do this before it opens any file belonging to the new checkpoint. Then, the application must call SCR_Route_file for each file that it will write in order to register the file with SCR and to determine the full path and file name to open each file. Finally, it must call SCR_Complete_checkpoint to define the end boundary of the checkpoint.

If a process does not write any files during a checkpoint, it must still call SCR_Start_checkpoint and SCR_Complete_checkpoint as these functions are collective. All files registered through a call to SCR_Route_file between a given SCR_Start_checkpoint and SCR_Complete_checkpoint pair are considered to be part of the same checkpoint file set. Some example SCR checkpoint code looks like the following

void checkpoint() {
  /* each process saves its state to a file */

  /**** change #5 ****/
  SCR_Start_checkpoint();

  /* get rank of this process */
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /**** change #6 ****/
  /*
      if (rank == 0)
        mkdir(checkpoint_dir);

      // hold all processes until directory is created
      MPI_Barrier(MPI_COMM_WORLD);
  */

  /* build file name of checkpoint file for this rank */
  char checkpoint_file[256];
  sprintf(checkpoint_file, "%s/rank_%d.ckpt",
    checkpoint_dir, rank
  );

  /**** change #7 ****/
  char scr_file[SCR_MAX_FILENAME];
  SCR_Route_file(checkpoint_file, scr_file);

  /**** change #8 ****/
  /* each rank opens, writes, and closes its file,
   * and records whether it succeeded in the valid flag */
  int valid = 1;
  FILE* fs = fopen(scr_file, "w");
  if (fs != NULL) {
    fwrite(checkpoint_data, ..., fs);
    fclose(fs);
  } else {
    valid = 0;
  }

  /**** change #9 ****/
  /*
      // wait for all files to be closed
      MPI_Barrier(MPI_COMM_WORLD);

      // rank 0 updates the pointer to the latest checkpoint
      if (rank == 0) {
        FILE* fs = fopen("latest", "w");
        if (fs != NULL) {
          fwrite(checkpoint_dir, ..., fs);
          fclose(fs);
        }
      }
  */

  /**** change #10 ****/
  SCR_Complete_checkpoint(valid);

  /**** change #11 ****/
  /* Check whether we should stop */
  int should_exit;
  SCR_Should_exit(&should_exit);
  if (should_exit) {
    exit(0);
  }
}

As shown in change #5, the application must inform SCR when it is starting a new checkpoint by calling SCR_Start_checkpoint(). Similarly, it must inform SCR when it has completed the checkpoint with a corresponding call to SCR_Complete_checkpoint() as shown in change #10. When calling SCR_Complete_checkpoint(), each process sets the valid flag to indicate whether it wrote all of its checkpoint files successfully.

SCR manages checkpoint directories, so the mkdir operation is removed in change #6. Additionally, the application can rely on SCR to track the latest checkpoint, so the logic to track the latest checkpoint is removed in change #9.

Between the call to SCR_Start_checkpoint() and SCR_Complete_checkpoint(), the application must register each of its checkpoint files by calling SCR_Route_file() as shown in change #7. SCR “routes” the file by replacing any leading directory on the file name with a path that points to another directory in which SCR caches data for the checkpoint. As shown in change #8, the application must use the exact string returned by SCR_Route_file() to open its checkpoint file.
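
To make the routing concrete, the following hypothetical fragment illustrates the behavior; the cache path shown in the comment is an example only, as the actual location is chosen by SCR and varies by system and configuration.

/* hypothetical illustration of routing; the cache path is an example only */
char name[] = "/p/lscratcha/user1/run1/ckpt.10/rank_0.ckpt";
char scr_file[SCR_MAX_FILENAME];
SCR_Route_file(name, scr_file);
/* scr_file might now contain something like
 *   /tmp/user1/scr.1234/scr.dataset.10/rank_0.ckpt
 * and the application must pass scr_file, not name, to fopen() */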

Also note how the application can call SCR_Should_exit after a checkpoint to determine whether it is time to stop, as shown in change #11. This is important so that an application stops with sufficient time remaining to copy datasets from cache to the parallel file system before the allocation expires.

Restart with SCR

There are two options to access files during a restart: with and without SCR. If an application is designed to restart such that each MPI task only needs access to the files it wrote during the previous checkpoint, then the application can utilize the scalable restart capability of SCR. This enables the application to restart from a cached checkpoint in the existing resource allocation, which saves the cost of writing to and reading from the parallel file system.

To use SCR for restart, the application can call SCR_Have_restart to determine whether SCR has a previous checkpoint loaded. If there is a checkpoint available, the application can call SCR_Start_restart to tell SCR that a restart operation is beginning. Then, the application must call SCR_Route_file to determine the full path and file name to each of its checkpoint files that it will read for restart. The input file name to SCR_Route_file does not need a path during restart, as SCR will identify the file just based on its file name. After the application reads in its checkpoint files, it must call SCR_Complete_restart to indicate that it has completed reading its checkpoint files. Some example SCR restart code may look like the following

void* restart() {
  /* each process reads its state from a file */

  /**** change #12 ****/
  SCR_Start_restart(NULL);

  /* get rank of this process */
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /**** change #13 ****/
  /*
      // rank 0 reads and broadcasts checkpoint directory name
      char checkpoint_dir[256];
      if (rank == 0) {
        FILE* fs = fopen("latest", "r");
        if (fs != NULL) {
          fread(checkpoint_dir, ..., fs);
          fclose(fs);
        }
      }
      MPI_Bcast(checkpoint_dir, sizeof(checkpoint_dir), MPI_CHAR, ...);
  */

  /**** change #14 ****/
  /* build file name of checkpoint file for this rank */
  char checkpoint_file[256];
  sprintf(checkpoint_file, "rank_%d.ckpt",
    rank
  );

  /**** change #15 ****/
  char scr_file[SCR_MAX_FILENAME];
  SCR_Route_file(checkpoint_file, scr_file);

  /**** change #16 ****/
  /* each rank opens, reads, and closes its file,
   * and records whether it succeeded in the valid flag */
  int valid = 1;
  FILE* fs = fopen(scr_file, "r");
  if (fs != NULL) {
    fread(state, ..., fs);
    fclose(fs);
  } else {
    valid = 0;
  }

  /**** change #17 ****/
  SCR_Complete_restart(valid);

  return state;
}

As shown in change #12, the application calls SCR_Start_restart() to inform SCR that it is beginning its restart. SCR automatically loads the most recent checkpoint, so the application logic to identify the latest checkpoint is removed in change #13. During a restart, the application only needs the file name, so the checkpoint directory can be dropped from the path in change #14. Instead, the application gets the path to use to open the checkpoint file via a call to SCR_Route_file() in change #15. It then uses that path to open the file for reading in change #16. After the process has read each of its checkpoint files, it informs SCR that it has completed reading its data with a call to SCR_Complete_restart() in change #17. When calling SCR_Complete_restart(), each process sets the valid flag to indicate whether it read all of its checkpoint files successfully.

Restart without SCR

If the application does not use SCR for restart, it should not make calls to SCR_Have_restart, SCR_Start_restart, SCR_Route_file, or SCR_Complete_restart during the restart. Instead, it should access files directly from the parallel file system. When restarting without SCR, the value of the SCR_FLUSH counter will not be preserved between restarts. The counter will be reset to its upper limit with each restart. Thus, each restart may introduce some fixed offset in a series of periodic SCR flushes. When not using SCR for restart, one should set the SCR_FLUSH_ON_RESTART parameter to 1, which will cause SCR to flush any cached checkpoint to the file system during SCR_Init.
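
For example, one might set the following in the job environment (see Section Configure a job for other ways to set parameters):

# flush any cached checkpoint to the parallel file system during SCR_Init
export SCR_FLUSH_ON_RESTART=1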

Building with the SCR library

To compile and link with the SCR library, add the flags listed in the table below to your compile and link lines; an example compile and link command follows the table. The value of the variable SCR_INSTALL_DIR should be the path to the installation directory for SCR.

SCR build flags

Compile Flags:               -I$(SCR_INSTALL_DIR)/include
C Dynamic Link Flags:        -L$(SCR_INSTALL_DIR)/lib64 -lscr -Wl,-rpath,$(SCR_INSTALL_DIR)/lib64
C Static Link Flags:         -L$(SCR_INSTALL_DIR)/lib64 -lscr -lz
Fortran Dynamic Link Flags:  -L$(SCR_INSTALL_DIR)/lib64 -lscrf -Wl,-rpath,$(SCR_INSTALL_DIR)/lib64
Fortran Static Link Flags:   -L$(SCR_INSTALL_DIR)/lib64 -lscrf -lz
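
For example, assuming a hypothetical install path of /opt/scr and the dynamic C library, the compile and link commands might look like the following:

SCR_INSTALL_DIR=/opt/scr    # hypothetical install path

mpicc -I$SCR_INSTALL_DIR/include -c my_app.c
mpicc -o my_app my_app.o \
  -L$SCR_INSTALL_DIR/lib64 -lscr -Wl,-rpath,$SCR_INSTALL_DIR/lib64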

Run a job

In addition to the SCR library, one must properly configure the batch system and include a set of SCR commands in the job batch script. In particular, one must: 1) inform the batch system that the allocation should remain available even after a failure and 2) replace the command to execute the application with an SCR wrapper script. The precise set of options and commands to use depends on the system resource manager.

The SCR commands prepare the cache, scavenge files from cache to the parallel file system, and check that the scavenged dataset is complete, among other tasks. These commands are located in the bin directory of the SCR installation. There are numerous SCR commands; any command not mentioned in this document is not intended to be executed by users.

Supported platforms

At the time of this writing, SCR supports specific combinations of resource managers and job launchers. The instructions in this section apply to these specific configurations; however, the general description is also helpful for understanding how to run SCR on other systems. Please contact us for help in porting SCR to other platforms (see Section Support for contact information).

Jobs and job steps

First, we differentiate between a job allocation and a job step. Our terminology originates from the SLURM resource manager, but the principles apply generally across SCR-supported resource managers.

When a job is granted resources on a system, its batch script executes inside a job allocation. The job allocation consists of a set of nodes, a time limit, and a job id. The job id can be obtained by executing the squeue command on SLURM, the apstat command on ALPS, and the bjobs command on LSF.

Within a job allocation, a user may run one or more job steps, each of which is invoked by a call to srun on SLURM, aprun on ALPS, or mpirun on LSF. Each job step is assigned its own step id. On SLURM, within each job allocation, job step ids start at 0 and increment with each issued job step. Job step ids can be obtained by passing the -s option to squeue. A fully qualified name of a SLURM job step consists of: jobid.stepid. For instance, the name 1234.5 refers to step id 5 of job id 1234. On ALPS, each job step within an allocation has a unique id that can be obtained through apstat.

Ignoring node failures

Before running an SCR job, it is necessary to configure the job allocation to withstand node failures. By default, most resource managers terminate the job allocation if a node fails; however, SCR requires the job allocation to remain active in order to restart the job or scavenge files. To enable the job allocation to continue past node failures, one must specify the appropriate flags from the table below.

SCR job allocation flags

MOAB batch script:    #MSUB -l resfailpolicy=ignore
MOAB interactive:     qsub -I ... -l resfailpolicy=ignore
SLURM batch script:   #SBATCH --no-kill
SLURM interactive:    salloc --no-kill ...
LSF batch script:     #BSUB -env "all, LSB_DJOB_COMMFAIL_ACTION=KILL_TASKS"
LSF interactive:      bsub -env "all, LSB_DJOB_COMMFAIL_ACTION=KILL_TASKS" ...

The SCR wrapper script

The easiest way to integrate SCR into a batch script is to set some environment variables and to replace the job run command with an SCR wrapper script. The SCR wrapper script includes logic to restart an application within a job allocation, and it scavenges files from cache to the parallel file system at the end of the allocation:

SLURM:  scr_srun [srun_options]  <prog> [prog_args ...]
ALPS:   scr_aprun [aprun_options] <prog> [prog_args ...]
LSF:    scr_mpirun [mpirun_options] <prog> [prog_args ...]

The SCR wrapper script must run from within a job allocation. Internally, the command must know the prefix directory. By default, it uses the current working directory. One may specify a different prefix directory by setting the SCR_PREFIX parameter.

It is recommended to set the SCR_HALT_SECONDS parameter so that the job allocation does not expire before datasets can be flushed (Section Halt a job).

By default, the SCR wrapper script does not restart an application after the first job step exits. To automatically restart a job step within the current allocation, set the SCR_RUNS environment variable to the maximum number of runs to attempt. For an unlimited number of attempts, set this variable to -1.

After a job step exits, the wrapper script checks whether it should restart the job. If so, the script sleeps for some time to give nodes in the allocation a chance to clean up. Then, it checks that there are sufficient healthy nodes remaining in the allocation. By default, the wrapper script assumes the next run requires the same number of nodes as the previous run, which is recorded in a file written by the SCR library. If this file cannot be read, the command assumes the application requires all nodes in the allocation. Alternatively, one may override these heuristics and precisely specify the number of nodes needed by setting the SCR_MIN_NODES environment variable to the number of required nodes.

Some applications cannot be launched via a wrapper script. In that case, one should examine the logic contained in the wrapper script and duplicate the necessary parts in the job batch script. In particular, one should invoke scr_postrun for scavenge support, as sketched below.
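
A hedged sketch of that approach on SLURM is shown below. The -p option to scr_postrun is assumed here to specify the prefix directory; check the options of the installed scr_postrun command before use.

# launch the job step directly instead of using scr_srun
srun -n512 -N64 ./my_job

# after the run, scavenge any cached datasets to the prefix directory
scr_postrun -p $SCR_PREFIX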

Example batch script for using SCR restart capability

An example MOAB / SLURM batch script with scr_srun is shown below

#!/bin/bash
#MSUB -l partition=atlas
#MSUB -l nodes=66
#MSUB -l resfailpolicy=ignore

# above, tell MOAB to not kill the job allocation upon a node failure
# also note that the job requests 2 spare nodes -- it runs on 64 nodes but allocates 66

# specify where datasets should be written
export SCR_PREFIX=/my/parallel/file/system/username/run1/checkpoints

# instruct SCR to flush to the file system every 20 checkpoints
export SCR_FLUSH=20

# halt if there is less than an hour remaining (3600 seconds)
export SCR_HALT_SECONDS=3600

# attempt to run the job up to 3 times
export SCR_RUNS=3

# run the job with scr_srun
scr_srun -n512 -N64 ./my_job

Configure a job

The default SCR configuration suffices for many Linux clusters. However, significant performance improvement or additional functionality may be gained via custom configuration.

Setting parameters

SCR searches the following locations in the following order for a parameter value, taking the first value it finds.

  • Environment variables,
  • User configuration file,
  • System configuration file,
  • Compile-time constants.

Some parameters, such as the location of the control directory, cannot be specified by the user. Such parameters must be either set in the system configuration file or hard-coded into SCR as compile-time constants.

To find a user configuration file, SCR looks for a file named .scrconf in the prefix directory (note the leading dot). Alternatively, one may specify the name and location of the user configuration file by setting the SCR_CONF_FILE environment variable at run time, e.g.,:

export SCR_CONF_FILE=~/myscr.conf

The location of the system configuration file is hard-coded into SCR at build time. This defaults to /etc/scr/scr.conf. One may set this using the SCR_CONFIG_FILE option with cmake, e.g.,:

cmake -DSCR_CONFIG_FILE=/path/to/scr.conf ...

To set an SCR parameter in a configuration file, list the parameter name followed by its value separated by an ‘=’ sign. Blank lines are ignored, and any characters following the ‘#’ comment character are ignored. For example, a configuration file may contain something like the following:

>>: cat ~/myscr.conf
# set the halt seconds to one hour
SCR_HALT_SECONDS=3600

# set SCR to flush every 20 checkpoints
SCR_FLUSH=20

Group, store, and checkpoint descriptors

SCR must have information about process groups, storage devices, and redundancy schemes. The defaults provide reasonable settings for Linux clusters, but one can define custom settings via group, store, and checkpoint descriptors in configuration files.

SCR must know which processes are likely to fail at the same time (failure groups) and which processes access a common storage device (storage groups). By default, SCR creates a group of all processes in the job called WORLD and another group of all processes on the same compute node called NODE. If more groups are needed, they can be defined in configuration files with entries like the following:

GROUPS=host1  POWER=psu1  SWITCH=0
GROUPS=host2  POWER=psu1  SWITCH=1
GROUPS=host3  POWER=psu2  SWITCH=0
GROUPS=host4  POWER=psu2  SWITCH=1

Group descriptor entries are identified by a leading GROUPS key. Each line corresponds to a single compute node, where the hostname is the value of the GROUPS key. There must be one line for every compute node in the allocation. It is recommended to specify groups in the system configuration file.

The remaining values on the line specify a set of group name / value pairs. The group name is the string to be referenced by store and checkpoint descriptors. The value can be an arbitrary character string. The only requirement is that, for a given group name, nodes that form a group must provide identical strings for the value.

In the above example, there are four compute nodes: host1, host2, host3, and host4. There are two groups defined: POWER and SWITCH. Nodes host1 and host2 belong to the same POWER group, as do nodes host3 and host4. For the SWITCH group, nodes host1 and host3 belong to the same group, as do nodes host2 and host4.

In addition to groups, SCR must know about the storage devices available on a system. SCR requires that all processes be able to access the prefix directory, and it assumes that /tmp is storage local to each compute node. Additional storage can be described in configuration files with entries like the following:

STORE=/tmp          GROUP=NODE   COUNT=1
STORE=/ssd          GROUP=NODE   COUNT=3
STORE=/dev/persist  GROUP=NODE   COUNT=1  ENABLED=1  MKDIR=0
STORE=/p/lscratcha  GROUP=WORLD

Store descriptor entries are identified by a leading STORE key. Each line corresponds to a class of storage devices. The value associated with the STORE key is the directory prefix of the storage device. This directory prefix also serves as the name of the store descriptor. All compute nodes must be able to access their respective storage device via the specified directory prefix.

The remaining values on the line specify properties of the storage class. The GROUP key specifies the group of processes that share a device. Its value must specify a group name. The COUNT key specifies the maximum number of checkpoints that can be kept in the associated storage. The user should be careful to set this appropriately depending on the storage capacity and the application checkpoint size. The COUNT key is optional, and it defaults to the value of the SCR_CACHE_SIZE parameter if not specified. The ENABLED key enables (1) or disables (0) the store descriptor. This key is optional, and it defaults to 1 if not specified. The MKDIR key specifies whether the device supports the creation of directories (1) or not (0). This key is optional, and it defaults to 1 if not specified.

In the above example, there are four storage devices specified: /tmp, /ssd, /dev/persist, and /p/lscratcha. The storage at /tmp, /ssd, and /dev/persist specify the NODE group, which means that they are node-local storage. Processes on the same compute node access the same device. The storage at /p/lscratcha specifies the WORLD group, which means that all processes in the job can access the device. In other words, it is a globally accessible file system.

Finally, SCR must be configured with redundancy schemes. By default, SCR protects against single compute node failures using XOR, and it caches one checkpoint in /tmp. To specify something different, edit a configuration file to include checkpoint and output descriptors. These descriptors look like the following:

# instruct SCR to use the CKPT descriptors from the config file
SCR_COPY_TYPE=FILE

# the following instructs SCR to run with three checkpoint configurations:
# - save every 8th checkpoint to /ssd using the PARTNER scheme
# - save every 4th checkpoint (not divisible by 8) to /ssd using XOR with
#   a set size of 8
# - save all other checkpoints (not divisible by 4 or 8) to /tmp using XOR with
#   a set size of 16
CKPT=0 INTERVAL=1 GROUP=NODE   STORE=/tmp TYPE=XOR     SET_SIZE=16
CKPT=1 INTERVAL=4 GROUP=NODE   STORE=/ssd TYPE=XOR     SET_SIZE=8  OUTPUT=1
CKPT=2 INTERVAL=8 GROUP=SWITCH STORE=/ssd TYPE=PARTNER

First, one must set the SCR_COPY_TYPE parameter to “FILE”. Otherwise, an implied checkpoint descriptor is constructed using various SCR parameters including SCR_GROUP, SCR_CACHE_BASE, SCR_COPY_TYPE, and SCR_SET_SIZE.

Checkpoint descriptor entries are identified by a leading CKPT key. The values of the CKPT keys must be numbered sequentially starting from 0. The INTERVAL key specifies how often a descriptor is to be applied. For each checkpoint, SCR selects the descriptor having the largest interval value that evenly divides the internal SCR checkpoint iteration number. It is necessary that one descriptor has an interval of 1. This key is optional, and it defaults to 1 if not specified. The GROUP key lists the failure group, i.e., the name of the group of processes likely to fail. This key is optional, and it defaults to the value of the SCR_GROUP parameter if not specified. The STORE key specifies the directory in which to cache the checkpoint. This key is optional, and it defaults to the value of the SCR_CACHE_BASE parameter if not specified. The TYPE key identifies the redundancy scheme to be applied. This key is optional, and it defaults to the value of the SCR_COPY_TYPE parameter if not specified.

Other keys may exist depending on the selected redundancy scheme. For XOR schemes, the SET_SIZE key specifies the minimum number of processes to include in each XOR set.

One checkpoint descriptor can be marked with the OUTPUT key. This indicates that the descriptor should be selected to store datasets that the application flags with SCR_FLAG_OUTPUT. The OUTPUT key is optional, and it defaults to 0. If there is no descriptor with the OUTPUT key defined and if the dataset is also a checkpoint, SCR will choose the checkpoint descriptor according to the normal policy. Otherwise, if there is no descriptor with the OUTPUT key defined and if the dataset is not a checkpoint, SCR will use the checkpoint descriptor having interval of 1.

SCR parameters

The table in this section specifies the full set of SCR configuration parameters.

SCR parameters
Name Default Description
SCR_HALT_SECONDS 0 Set to a positive integer to instruct SCR to halt the job after completing a successful checkpoint if the remaining time in the current job allocation is less than the specified number of seconds.
SCR_HALT_ENABLED 1 Whether SCR should halt a job by calling exit(). Set to 0 to disable, in which case the application is responsible for stopping itself.
SCR_GROUP NODE Specify name of failure group.
SCR_COPY_TYPE XOR Set to one of: SINGLE, PARTNER, XOR, or FILE.
SCR_CACHE_BASE /tmp Specify the base directory SCR should use to cache checkpoints.
SCR_CACHE_SIZE 1 Set to a non-negative integer to specify the maximum number of checkpoints SCR should keep in cache. SCR will delete the oldest checkpoint from cache before saving another in order to keep the total count below this limit.
SCR_SET_SIZE 8 Specify the minimum number of processes to include in an XOR set. Increasing this value decreases the amount of storage required to cache the checkpoint data. However, higher values have an increased likelihood of encountering a catastrophic error. Higher values may also require more time to reconstruct lost files from redundancy data.
SCR_PREFIX $PWD Specify the prefix directory on the parallel file system where checkpoints should be read from and written to.
SCR_CHECKPOINT_SECONDS 0 Set to a positive number of seconds to specify the minimum time between consecutive checkpoints as guided by SCR_Need_checkpoint.
SCR_CHECKPOINT_OVERHEAD 0.0 Set to a positive percentage to specify the maximum overhead allowed for checkpointing operations as guided by SCR_Need_checkpoint.
SCR_DISTRIBUTE 1 Set to 0 to disable file distribution during SCR_Init.
SCR_FETCH 1 Set to 0 to disable SCR from fetching files from the parallel file system during SCR_Init.
SCR_FETCH_WIDTH 256 Specify the number of processes that may read simultaneously from the parallel file system.
SCR_FLUSH 10 Specify the number of checkpoints between periodic SCR flushes to the parallel file system. Set to 0 to disable periodic flushes.
SCR_FLUSH_ASYNC 0 Set to 1 to enable asynchronous flush methods (if supported).
SCR_FLUSH_WIDTH 256 Specify the number of processes that may write simultaneously to the parallel file system.
SCR_FLUSH_ON_RESTART 0 Set to 1 to force SCR to flush a checkpoint during restart. This is useful for codes that must restart from the parallel file system.
SCR_PRESERVE_DIRECTORIES 1 Whether SCR should preserve the application directory structure in the prefix directory during flush and scavenge operations. Set to 0 to rely on SCR-defined directory layouts.
SCR_RUNS 1 Specify the maximum number of times the scr_srun command should attempt to run a job within an allocation. Set to -1 to specify an unlimited number of times.
SCR_MIN_NODES N/A Specify the minimum number of nodes required to run a job.
SCR_EXCLUDE_NODES N/A Specify a set of nodes, using SLURM node range syntax, which should be excluded from runs. This is useful to avoid particular nodes while waiting for them to be fixed by system administrators. Nodes in this list which are not in the current allocation are silently ignored.
SCR_MPI_BUF_SIZE 131072 Specify the number of bytes to use for internal MPI send and receive buffers when computing redundancy data or rebuilding lost files.
SCR_FILE_BUF_SIZE 1048576 Specify the number of bytes to use for internal buffers when copying files between the parallel file system and the cache.
SCR_CRC_ON_COPY 0 Set to 1 to enable CRC32 checks when copying files during the redundancy scheme.
SCR_CRC_ON_DELETE 0 Set to 1 to enable CRC32 checks when deleting files from cache.
SCR_CRC_ON_FLUSH 1 Set to 0 to disable CRC32 checks during fetch and flush operations.
SCR_DEBUG 0 Set to 1 or 2 for increasing verbosity levels of debug messages.
SCR_WATCHDOG_TIMEOUT N/A Set to the expected time (seconds) for checkpoint writes to in-system storage (See Section Catch a hanging job).
SCR_WATCHDOG_TIMEOUT_PFS N/A Set to the expected time (seconds) for checkpoint writes to the parallel file system (See Section Catch a hanging job).

Halt a job

There are several mechanisms to instruct a running SCR application to halt. It is often necessary to interact with the resource manager to halt a job.

scr_halt and the halt file

The recommended method to stop an SCR application is to use the scr_halt command. The command must be run from within the prefix directory, or the prefix directory of the target job must be specified as an argument.

A number of different halt conditions can be specified. In most cases, the scr_halt command communicates these conditions to the running application via the halt.scr file, which is stored in the hidden .scr directory within the prefix directory. The SCR library reads the halt file when the application calls SCR_Init and each time the application completes a checkpoint. If a halt condition is satisfied, all tasks in the application call exit. One can disable this behavior by setting the SCR_HALT_ENABLED parameter to 0. In this case, the application can determine when to exit by calling SCR_Should_exit.
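
For example, to prevent SCR from exiting the application itself and instead let the application poll SCR_Should_exit (as in change #11 of Section Integrate SCR), one might set:

export SCR_HALT_ENABLED=0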

Halt after next checkpoint

You can instruct an SCR job to halt after completing its next successful checkpoint:

scr_halt

To run scr_halt from outside of a prefix directory, specify the target prefix directory like so:

scr_halt /p/lscratcha/user1/simulation123

You can instruct an SCR job to halt after completing some number of checkpoints via the --checkpoints option. For example, to instruct a job to halt after 10 more checkpoints, use the following:

scr_halt --checkpoints 10

If the last of the checkpoints is unsuccessful, the job continues until it completes a successful checkpoint. This ensures that SCR has a successful checkpoint to flush before it halts the job.

Halt before or after a specified time

It is possible to instruct an SCR job to halt after a specified time using the --after option. The job will halt on its first successful checkpoint after the specified time. For example, you can instruct a job to halt after “12:00pm today” via:

scr_halt --after '12:00pm today'

It is also possible to instruct a job to halt before a specified time using the --before option. For example, you can instruct a job to halt before “8:30am tomorrow” via:

scr_halt --before '8:30am tomorrow'

For the “halt before” condition to be effective, one must also set the SCR_HALT_SECONDS parameter. When SCR_HALT_SECONDS is set to a positive number, SCR checks how much time is left before the specified time limit. If the remaining time in seconds is less than or equal to SCR_HALT_SECONDS, SCR halts the job. The value of SCR_HALT_SECONDS does not affect the “halt after” condition.

It is highly recommended that SCR_HALT_SECONDS be set so that the SCR library can impose a default “halt before” condition using the end time of the job allocation. This ensures the latest checkpoint can be flushed before the allocation is lost.

It is important to set SCR_HALT_SECONDS to a value large enough that SCR has time to completely flush (and rebuild) files before the allocation expires. Consider that a checkpoint may be taken just before the remaining time is less than SCR_HALT_SECONDS. If a code checkpoints every X seconds and it takes Y seconds to flush files from the cache and rebuild, set SCR_HALT_SECONDS = X + Y + Delta, where Delta is some positive value to provide additional slack time.

One may also set the halt seconds via the --seconds option to scr_halt. Using the scr_halt command, one can set, change, and unset the halt seconds on a running job.
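
For example, to set the halt seconds of a running job to one hour:

scr_halt --seconds 3600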

NOTE: If any scr_halt commands are specified as part of the batch script before the first run starts, one must then use scr_halt to set the halt seconds for the job rather than the SCR_HALT_SECONDS parameter. The scr_halt command creates the halt file, and if a halt file exists before a job starts to run, SCR ignores any value specified in the SCR_HALT_SECONDS parameter.

Halt immediately

Sometimes, you need to halt an SCR job immediately, and there are two options for this. You may use the --immediate option:

scr_halt --immediate

This command first updates the halt file, so that the job will not be restarted once stopped. Then, it kills the current run.

If for some reason the --immediate option fails to work, you may manually halt the job. [1] First, issue a simple scr_halt so the job will not restart, and then manually kill the current run using mechanisms provided by the resource manager, e.g., scancel for SLURM and apkill for ALPS. When using mechanisms provided by the resource manager to kill the current run, be careful to cancel the job step and not the job allocation. Canceling the job allocation destroys the cache.

For SLURM, to get the job step id, type: squeue -s. Then be sure to include the job id and step id in the scancel argument. For example, if the job id is 1234 and the step id is 5, then use the following commands:

scr_halt
scancel 1234.5

Do not just type “scancel 1234” – be sure to include the job step id.

For ALPS, use apstat to get the apid of the job step to kill. Then, follow the steps as described above: execute scr_halt followed by the kill command apkill <apid>.

[1]On Cray/ALPS, scr_halt --immediate is not yet supported. The alternate method described in the text must be used instead.

Catch a hanging job

If an application hangs, SCR may not be given the chance to copy files from cache to the parallel file system before the allocation expires. To avoid losing significant work due to a hang, SCR attempts to detect if a job is hanging, and if so, SCR attempts to kill the job step so that it can be restarted in the allocation.

On some systems, SCR employs the io-watchdog library for this purpose. For more information on this tool, see http://code.google.com/p/io-watchdog.

On systems where io-watchdog is not available, SCR uses a generic mechanism based on the expected time between checkpoints as specified by the user. If the time between checkpoints is longer than expected, SCR assumes the job is hanging. Two SCR parameters specify how many seconds are expected to pass between I/O phases in an application, i.e., between consecutive calls to SCR_Start_checkpoint: SCR_WATCHDOG_TIMEOUT and SCR_WATCHDOG_TIMEOUT_PFS. The first parameter specifies the time to wait when SCR writes checkpoints to in-system storage, e.g., SSD or RAM disk, and the second specifies the time to wait when SCR writes checkpoints to the parallel file system. Two timeouts are used because writing to the parallel file system generally takes much longer than writing to in-system storage, so a longer timeout is appropriate in that case.
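
For example, if checkpoints to node-local storage normally complete within a few minutes while checkpoints to the parallel file system may take up to half an hour, one might set the following (values are illustrative):

# checkpoints to in-system storage are expected to finish within 5 minutes
export SCR_WATCHDOG_TIMEOUT=300

# checkpoints to the parallel file system are expected to finish within 30 minutes
export SCR_WATCHDOG_TIMEOUT_PFS=1800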

When using this feature, be careful to check that the job does not hang near the end of its allocation time limit, since SCR may then not have enough time to kill the run and copy files before the allocation ends. If you suspect the job is hanging and SCR will not kill the run in time, manually cancel the run as described above.

Combine, list, change, and unset halt conditions

It is possible to specify multiple halt conditions. To do so, simply list each condition in the same scr_halt command or issue several commands. For example, to instruct a job to halt after 10 checkpoints or before “8:30am tomorrow”, whichever comes earlier, you could issue the following command:

scr_halt --checkpoints 10 --before '8:30am tomorrow'

The following sequence also works:

scr_halt --checkpoints 10
scr_halt --before '8:30am tomorrow'

You may list the current settings in the halt file with the --list option, e.g.,:

scr_halt --list

You may change a setting by issuing a new command to overwrite the current value.

Finally, you can unset some halt conditions by prepending unset- to the option names. See the scr_halt man page for a full listing of unset options. For example, to unset the “halt before” condition on a job, type the following:

scr_halt --unset-before

Remove the halt file

Sometimes, especially during testing, you may want to run in an existing allocation after halting a previous run. When SCR detects a halt file with a satisfied halt condition, it immediately exits. This is the desired effect when trying to halt a job; however, this mechanism also prevents one from intentionally running in an allocation after halting a previous run. Note also that SCR registers a halt condition whenever the application calls SCR_Finalize.

When there is a halt file with a satisfied halt condition, a message is printed to stdout to indicate why SCR is halting. To run in such a case, first remove the satisfied halt conditions. You can unset the conditions or reset them to appropriate values. Another approach is to remove the halt file via the --remove option. This deletes the halt file, which effectively removes all halt conditions. For example, to remove the halt file from a job, type:

scr_halt --remove

Manage datasets

SCR records the status of datasets that are on the parallel file system in the index.scr file. This file is written to the hidden .scr directory within the prefix directory. The library updates the index file as an application runs and during scavenge operations.

While restarting a job, the SCR library reads the index file during SCR_Init to determine which checkpoints are available. The library attempts to restart with the most recent checkpoint and works backwards until it successfully fetches a valid checkpoint. SCR does not fetch any checkpoint marked as “incomplete” or “failed”. A checkpoint is marked as incomplete if it was determined to be invalid during the flush or scavenge. Additionally, the library marks a checkpoint as failed if it detected a problem during a previous fetch attempt (e.g., detected data corruption). In this way, the library avoids invalid or problematic checkpoints.

One may list or modify the contents of the index file via the scr_index command. The scr_index command must run within the prefix directory, or one may specify a prefix directory using the “--prefix” option. The default behavior of scr_index is to list the contents of the index file, e.g.:

>>: scr_index
   DSET VALID FLUSHED             NAME
*    18 YES   2014-01-14T11:26:06 ckpt.18
     12 YES   2014-01-14T10:28:23 ckpt.12
      6 YES   2014-01-14T09:27:15 ckpt.6

When listing datasets, the internal SCR dataset id is shown, followed by a field indicating whether the dataset is valid, the time it was flushed to the parallel file system, and finally the dataset name.

One checkpoint may also be marked as “current”. When restarting a job, the SCR library starts from the current dataset and works backwards. The current dataset is denoted with a leading * character. One can change the current checkpoint using the --current option, providing the dataset name as an argument:

scr_index --current ckpt.12

In most cases, the SCR library or the SCR commands add all necessary entries to the index file. However, there are cases where they may fail. In particular, if the scr_postrun command successfully scavenges a dataset but the resource allocation ends before the command can rebuild missing files, an entry may be missing from the index file. In such cases, one may manually add the corresponding entry using the “--add” option.

When adding a new dataset to the index file, the scr_index command checks whether the files in the dataset constitute a complete and valid set. It rebuilds missing files if there is sufficient redundancy data, and it writes the summary.scr file for the dataset if needed. One must provide the SCR dataset id as an argument. To obtain the SCR dataset id value, look up the trailing integer on the names of scr.dataset subdirectories in the hidden .scr directory within the prefix directory:

scr_index --add 50

One may remove entries from the index file using the “--remove” option. This operation does not delete the corresponding dataset files; it only deletes the entry from the index.scr file:

scr_index --remove ckpt.50

This is useful if one deletes a dataset from the parallel file system and then wishes to update the index.