SCR API

SCR is designed to support MPI applications that write application-level checkpoints and output datasets. Both types of datasets (checkpoints and output) must be stored as a file-per-process, and they must be accessed in a globally-coordinated fashion. In a given dataset, each process may actually write zero or more files, but the current implementation assumes that each process writes roughly the same amount of data.

Parallel file systems allow any process in an MPI job to read/write any byte of a file at any time. However, most applications do not require this full generality. SCR supplies API calls that enable the application to specify limits on its data access in both time and space. Start and complete calls indicate when an application needs to write or read its data. Data cannot be accessed outside of these markers. Additionally, each MPI process may only read files written by itself or another process having the same MPI rank in a previous run. An MPI process cannot read files written by a process having a different MPI rank.

The API is designed to be simple, scalable, and portable. It consists of a small number of function calls to wrap existing application I/O logic. Unless otherwise stated, SCR functions are collective, meaning all processes must call the function synchronously. The underlying implementation may or may not be synchronous, but to be portable, an application must treat a collective call as though it is synchronous. This constraint enables the SCR implementation to utilize the full resources of the job in a collective manner to optimize performance at critical points such as computing redundancy data.

In the sections below, each function is shown first as a C prototype and then as its Fortran interface. Applications written in C should include “scr.h”, and Fortran applications should include “scrf.h”. Except for the C form of SCR_Get_version, which returns a string, all functions return SCR_SUCCESS if successful.

General API

SCR_Init

int SCR_Init();
SCR_INIT(IERROR)
  INTEGER IERROR

Initialize the SCR library. This function must be called after MPI_Init, and it is good practice to call this function immediately after MPI_Init. A process should only call SCR_Init once during its execution. No other SCR calls are valid until a process has returned from SCR_Init.

SCR_Finalize

int SCR_Finalize();
SCR_FINALIZE(IERROR)
  INTEGER IERROR

Shut down the SCR library. This function must be called before MPI_Finalize, and it is good practice to call this function just before MPI_Finalize. A process should only call SCR_Finalize once during its execution.

If SCR_FLUSH is enabled, SCR_Finalize flushes any cached datasets that have not yet been written to the prefix directory. It also updates the halt file to indicate that SCR_Finalize has been called. This halt condition prevents the job from restarting (Section Halt a job).

SCR_Get_version

char* SCR_Get_version(void);
SCR_GET_VERSION(VERSION, IERROR)
  CHARACTER*(*) VERSION
  INTEGER IERROR

This function returns a string that indicates the version number of SCR that is currently in use.

SCR_Should_exit

int SCR_Should_exit(int* flag);
SCR_SHOULD_EXIT(FLAG, IERROR)
  INTEGER FLAG, IERROR

SCR_Should_exit provides a portable way for an application to determine whether it should halt its execution. This function is passed a pointer to an integer in flag. Upon returning from SCR_Should_exit, flag is set to the value 1 if the application should stop, and it is set to 0 otherwise. The call returns the same value in flag on all processes. It is recommended to call this function after each checkpoint.

Since datasets in cache may be deleted by the system at the end of an allocation, it is critical for a job to stop early enough to leave time to copy datasets from cache to the parallel file system before the allocation expires. By default, the SCR library automatically calls exit at certain points. This works especially well in conjunction with the SCR_HALT_SECONDS parameter. However, this default behavior does not provide the application a chance to exit cleanly. SCR can be configured to avoid an automatic exit using the SCR_HALT_ENABLED parameter.

This call also enables a running application to react to external commands. For instance, if the application has been instructed to halt using the scr_halt command, then SCR_Should_exit relays that information.
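As a rough sketch of how these two queries might sit in a timestep loop, the following C fragment checks SCR_Need_checkpoint each step and, after each checkpoint, asks SCR_Should_exit whether to stop. The `USE_REAL_SCR` switch, the stand-in function bodies, and the `compute_step` and `write_checkpoint` hooks are assumptions made so the fragment runs without the library; only the two SCR signatures come from this document.

```c
#include <stdio.h>

#ifdef USE_REAL_SCR
#include "scr.h"
#else
/* Stand-ins so the sketch runs without SCR; they are NOT the real library. */
#define SCR_SUCCESS 0
static int fake_calls = 0; /* hypothetical counter driving the stand-ins */
static int SCR_Need_checkpoint(int* flag) {
  fake_calls++;
  *flag = (fake_calls % 10 == 0); /* pretend every 10th step needs a checkpoint */
  return SCR_SUCCESS;
}
static int SCR_Should_exit(int* flag) {
  *flag = (fake_calls >= 30);     /* pretend a halt is requested from step 30 on */
  return SCR_SUCCESS;
}
#endif

/* Hypothetical application hooks. */
static int checkpoints_written = 0;
static void compute_step(int step) { (void)step; }
static void write_checkpoint(int step) { (void)step; checkpoints_written++; }

/* Runs up to max_steps timesteps; returns the number of steps completed. */
int run_timesteps(int max_steps) {
  int step;
  for (step = 1; step <= max_steps; step++) {
    compute_step(step);

    int need = 0;
    SCR_Need_checkpoint(&need);
    if (need) {
      write_checkpoint(step); /* the Start/Complete calls would live in here */

      /* Ask after each checkpoint, as recommended above. */
      int should_exit = 0;
      SCR_Should_exit(&should_exit);
      if (should_exit) {
        return step; /* leave the loop so the application can exit cleanly */
      }
    }
  }
  return max_steps;
}
```

Because both calls return the same flag on every process, all ranks take the same branch and leave the loop at the same step.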

SCR_Route_file

int SCR_Route_file(const char* name, char* file);
SCR_ROUTE_FILE(NAME, FILE, IERROR)
  CHARACTER*(*) NAME, FILE
  INTEGER IERROR

When files are under control of SCR, they may be written to or exist on different levels of the storage hierarchy at different points in time. For example, a checkpoint might be written first to the RAM disk of a compute node and then later transferred to a burst buffer or the parallel file system by SCR. In order for an application to discover where a file should be written to or read from, we provide the SCR_Route_file routine.

A process calls SCR_Route_file to obtain the full path and file name it must use to access a file under SCR control. The name of the file that the process intends to access must be passed in the name argument. A pointer to a character buffer of at least SCR_MAX_FILENAME bytes must be passed in file. When a call to SCR_Route_file returns, the full path and file name to access the file named in name is written to the buffer pointed to by file. The process must use the character string returned in file to access the file. A process does not need to create any directories listed in the string returned in file. The SCR implementation creates any necessary directories before returning from the call. A call to SCR_Route_file is local to the calling process; it is not a collective call.

As of version 1.2.2, SCR_Route_file can be successfully called at any point during application execution. If it is called outside of a Start/Complete pair, the original file path is simply copied to the return string.

SCR_Route_file has special behaviour when called within a Start/Complete pair for restart, checkpoint, or output. Within a restart operation, the input parameter name only requires a file name. No path component is needed. SCR will return a full path to the file from the most recent checkpoint having the same name. It will return an error if no file by that name exists. Within checkpoint and output operations, the input parameter name also specifies the final path on the parallel file system. The caller may provide either absolute or relative path components in name. If the path is relative, SCR prepends the current working directory to name at the time that SCR_Route_file is called. With either an absolute or relative path, all paths must resolve to a location within the subtree rooted at the SCR prefix directory.

In the current implementation, SCR only changes the directory portion of name. It extracts the base name of the file by removing any directory components in name. Then it prepends a directory to the base file name and returns the full path and file name in file.
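The path rewriting described above can be modeled in a few lines of C. This is only an illustration of the documented behavior (strip the directory components, then prepend a directory), not SCR's actual code; the function name `route_sketch` and the example cache directories are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Model of the documented rewrite: keep the base name of `name` and
 * prepend `dir`. Returns 0 on success, -1 if the result does not fit
 * in a buffer of `size` bytes. */
int route_sketch(const char* name, const char* dir, char* file, size_t size) {
  const char* slash = strrchr(name, '/');
  const char* base = (slash != NULL) ? slash + 1 : name;
  int n = snprintf(file, size, "%s/%s", dir, base);
  return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

For instance, routing "ckpt_10/rank_3.dat" with a directory of "/dev/shm/scr" yields "/dev/shm/scr/rank_3.dat". Since only the base name survives, two files of one process that differ only in their directory would map to the same routed path in this model.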

Checkpoint API

Here we describe the SCR API functions that are used for writing checkpoints.

SCR_Need_checkpoint

int SCR_Need_checkpoint(int* flag);
SCR_NEED_CHECKPOINT(FLAG, IERROR)
  INTEGER FLAG, IERROR

Since the failure frequency and the cost of checkpointing vary across platforms, SCR_Need_checkpoint provides a portable way for an application to determine whether a checkpoint should be taken. This function is passed a pointer to an integer in flag. Upon returning from SCR_Need_checkpoint, flag is set to the value 1 if a checkpoint should be taken, and it is set to 0 otherwise. The call returns the same value in flag on all processes.

SCR_Start_checkpoint

int SCR_Start_checkpoint();
SCR_START_CHECKPOINT(IERROR)
  INTEGER IERROR

Inform SCR that a new checkpoint is about to start. A process must call this function before it opens any files belonging to the new checkpoint. SCR_Start_checkpoint must be called by all processes, including processes that do not write files as part of the checkpoint. This function should be called as soon as possible when initiating a checkpoint. The SCR implementation uses this call as the starting point to time the cost of the checkpoint in order to optimize the checkpoint frequency via SCR_Need_checkpoint. Each call to SCR_Start_checkpoint must be followed by a corresponding call to SCR_Complete_checkpoint.

In the current implementation, SCR_Start_checkpoint holds all processes at an MPI_Barrier to ensure that all processes are ready to start the checkpoint before it deletes cached files from a previous checkpoint.

SCR_Complete_checkpoint

int SCR_Complete_checkpoint(int valid);
SCR_COMPLETE_CHECKPOINT(VALID, IERROR)
  INTEGER VALID, IERROR

Inform SCR that all files for the current checkpoint are complete (i.e., done writing and closed) and whether they are valid (i.e., written without error). A process must close all checkpoint files before calling SCR_Complete_checkpoint. SCR_Complete_checkpoint must be called by all processes, including processes that did not write any files as part of the checkpoint.

The parameter valid should be set to 1 if either the calling process wrote all of its files successfully or it wrote no files during the checkpoint. Otherwise, the process should call SCR_Complete_checkpoint with valid set to 0. SCR will determine whether all processes wrote their checkpoint files successfully.
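The rule for valid can be stated compactly in code. The sketch below is a model, not SCR's implementation: `local_valid` expresses the per-process rule, and `checkpoint_is_valid` assumes that the global result is a logical AND over all ranks, which is the natural reading of "all processes wrote their checkpoint files successfully". Both function names are hypothetical.

```c
/* Per-process rule: valid is 1 when every file this process attempted
 * was written successfully. Zero attempted files also counts as valid. */
int local_valid(int files_attempted, int files_written_ok) {
  return (files_written_ok == files_attempted) ? 1 : 0;
}

/* Assumed global rule: the checkpoint is good only if every rank
 * reported valid = 1. */
int checkpoint_is_valid(const int* valid_per_rank, int nranks) {
  for (int i = 0; i < nranks; i++) {
    if (!valid_per_rank[i]) {
      return 0;
    }
  }
  return 1;
}
```

A single rank passing 0 is enough to mark the whole checkpoint as invalid under this model, which is why a process that wrote nothing must still pass 1.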

The SCR implementation uses this call as the stopping point to time the cost of the checkpoint that started with the preceding call to SCR_Start_checkpoint. Each call to SCR_Complete_checkpoint must be preceded by a corresponding call to SCR_Start_checkpoint.

In the current implementation, SCR applies the redundancy scheme during SCR_Complete_checkpoint. Before returning from the function, MPI rank 0 determines whether the job should be halted and signals this condition to all other ranks (Section Halt a job). If the job should be halted, rank 0 records a reason in the halt file, and then all tasks call exit, unless the auto exit feature is disabled.
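Putting the checkpoint calls together, a per-process write routine might look roughly like the following. The SCR signatures are taken from this section; everything else is an assumption for illustration: the `USE_REAL_SCR` switch, the stand-in bodies (which simply pass the file name through), the assumed values of SCR_SUCCESS and SCR_MAX_FILENAME, the `rank_%d.ckpt` naming scheme, and the function name `write_rank_checkpoint`.

```c
#include <stdio.h>

#ifdef USE_REAL_SCR
#include "scr.h"
#else
/* Stand-ins so the sketch runs without SCR; values and bodies are assumed. */
#define SCR_SUCCESS 0
#define SCR_MAX_FILENAME 1024
static int SCR_Start_checkpoint(void) { return SCR_SUCCESS; }
static int SCR_Route_file(const char* name, char* file) {
  snprintf(file, SCR_MAX_FILENAME, "%s", name); /* pass the path through */
  return SCR_SUCCESS;
}
static int SCR_Complete_checkpoint(int valid) { (void)valid; return SCR_SUCCESS; }
#endif

/* Writes one checkpoint file for this rank.
 * Returns 1 if the local write succeeded, 0 otherwise. */
int write_rank_checkpoint(int rank, const double* data, size_t count) {
  char name[256];
  char file[SCR_MAX_FILENAME];
  int valid = 1;

  /* Call before opening any checkpoint file; every rank must call it. */
  SCR_Start_checkpoint();

  /* Ask SCR where this rank's file should be written. */
  snprintf(name, sizeof(name), "rank_%d.ckpt", rank);
  if (SCR_Route_file(name, file) != SCR_SUCCESS) {
    valid = 0;
  } else {
    FILE* fp = fopen(file, "wb");
    if (fp == NULL) {
      valid = 0;
    } else {
      if (fwrite(data, sizeof(double), count, fp) != count) valid = 0;
      if (fclose(fp) != 0) valid = 0; /* file must be closed before Complete */
    }
  }

  /* Report the local result; every rank calls this, even after a failure. */
  SCR_Complete_checkpoint(valid);
  return valid;
}
```

Note that the routine calls SCR_Complete_checkpoint on every path, including failures, so that the collective call stays matched across all processes.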

Restart API

Here we describe the SCR API functions used for restarting applications.

SCR_Have_restart

int SCR_Have_restart(int* flag, char* name);
SCR_HAVE_RESTART(FLAG, NAME, IERROR)
  INTEGER FLAG
  CHARACTER*(*) NAME
  INTEGER IERROR

This function indicates whether SCR has a checkpoint available for the application to read. This function is passed a pointer to an integer in flag. Upon returning from SCR_Have_restart, flag is set to the value 1 if a checkpoint is available, and it is set to 0 otherwise. The call returns the same value in flag on all processes.

A pointer to a character buffer of at least SCR_MAX_FILENAME bytes can be passed in name. If there is a checkpoint, and if that checkpoint was assigned a name when it was created, SCR_Have_restart returns the name of that checkpoint in name. The value returned in name is the same string that was passed to SCR_Start_output when the checkpoint was created. In C, one may optionally pass NULL to this function to avoid returning the name. The same value is returned in name on all processes.

SCR_Start_restart

int SCR_Start_restart(char* name);
SCR_START_RESTART(NAME, IERROR)
  CHARACTER*(*) NAME
  INTEGER IERROR

This function informs SCR that a restart operation is about to start. A process must call this function before it opens any files belonging to the restart. SCR_Start_restart must be called by all processes, including processes that do not read files as part of the restart.

SCR returns the name of the loaded checkpoint in name. A pointer to a character buffer of at least SCR_MAX_FILENAME bytes can be passed in name. The value returned in name is the same string that was passed to SCR_Start_output when the checkpoint was created. In C, one may optionally pass NULL to this function to avoid returning the name. The same value is returned in name on all processes.

One may only call SCR_Start_restart when SCR_Have_restart indicates that there is a checkpoint to read. SCR_Start_restart returns the same value in name as the preceding call to SCR_Have_restart.

Each call to SCR_Start_restart must be followed by a corresponding call to SCR_Complete_restart.

SCR_Complete_restart

int SCR_Complete_restart(int valid);
SCR_COMPLETE_RESTART(VALID, IERROR)
  INTEGER VALID, IERROR

This call informs SCR that the process has finished reading its checkpoint files. A process must close all restart files before calling SCR_Complete_restart. SCR_Complete_restart must be called by all processes, including processes that did not read any files as part of the restart.

The parameter valid should be set to 1 if either the calling process read all of its files successfully or it read no files during the restart. Otherwise, the process should call SCR_Complete_restart with valid set to 0. SCR will determine whether all processes read their checkpoint files successfully based on the values supplied in the valid parameter. If any process failed to read its checkpoint files, then SCR will abort.

Each call to SCR_Complete_restart must be preceded by a corresponding call to SCR_Start_restart.
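The restart calls could be combined along the following lines. This is a minimal sketch under stated assumptions: the SCR signatures come from this section, while the `USE_REAL_SCR` switch, the stand-in bodies (which always report a checkpoint named "ckpt_demo" and pass file names through), the assumed constant values, the `rank_%d.ckpt` naming scheme, and the function name `read_rank_checkpoint` are all hypothetical.

```c
#include <stdio.h>

#ifdef USE_REAL_SCR
#include "scr.h"
#else
/* Stand-ins so the sketch runs without SCR; values and bodies are assumed. */
#define SCR_SUCCESS 0
#define SCR_MAX_FILENAME 1024
static int SCR_Have_restart(int* flag, char* name) {
  *flag = 1; /* pretend a checkpoint is always available */
  if (name != NULL) snprintf(name, SCR_MAX_FILENAME, "%s", "ckpt_demo");
  return SCR_SUCCESS;
}
static int SCR_Start_restart(char* name) {
  if (name != NULL) snprintf(name, SCR_MAX_FILENAME, "%s", "ckpt_demo");
  return SCR_SUCCESS;
}
static int SCR_Route_file(const char* name, char* file) {
  snprintf(file, SCR_MAX_FILENAME, "%s", name); /* pass the path through */
  return SCR_SUCCESS;
}
static int SCR_Complete_restart(int valid) { (void)valid; return SCR_SUCCESS; }
#endif

/* Reads this rank's checkpoint file into data.
 * Returns 1 on a successful restart, 0 if none was available or it failed. */
int read_rank_checkpoint(int rank, double* data, size_t count) {
  int have = 0;
  char ckpt_name[SCR_MAX_FILENAME];

  SCR_Have_restart(&have, ckpt_name);
  if (!have) {
    return 0; /* nothing to read; start from initial conditions instead */
  }

  /* Only valid because SCR_Have_restart reported a checkpoint. */
  SCR_Start_restart(ckpt_name);

  char name[256];
  char file[SCR_MAX_FILENAME];
  int valid = 1;
  snprintf(name, sizeof(name), "rank_%d.ckpt", rank); /* file name only */
  if (SCR_Route_file(name, file) != SCR_SUCCESS) {
    valid = 0;
  } else {
    FILE* fp = fopen(file, "rb");
    if (fp == NULL) {
      valid = 0;
    } else {
      if (fread(data, sizeof(double), count, fp) != count) valid = 0;
      fclose(fp); /* close before SCR_Complete_restart */
    }
  }

  /* With the real library, SCR aborts if any rank passes valid = 0. */
  SCR_Complete_restart(valid);
  return valid;
}
```

The stand-in returns normally when valid is 0, which lets the failure path be exercised here; with the real library the text above says that path ends in an abort.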

Output API

As of SCR version 1.2.0, SCR has the ability to manage application output datasets in addition to checkpoint datasets. Using a combination of bit flags, a dataset can be designated as a checkpoint, output, or both. The checkpoint property means that the dataset can be used to restart the application. The output property means that the dataset must be written to the prefix directory. This enables an application to utilize asynchronous transfers to the parallel file system for both its checkpoints and large output sets, so that it can return to computation while the dataset migrates to the parallel file system in the background.

If a user specifies that a dataset is a checkpoint only, then the SCR Output API manages the dataset exactly as the SCR Checkpoint API would. In particular, SCR may delete the checkpoint when a more recent checkpoint is established.

If a user specifies that a dataset is for output only, the dataset will first be cached on a tier of storage specified in the configuration file for the run and protected with the corresponding redundancy scheme. Then, the dataset will be moved to the prefix directory. When the transfer to the prefix directory is complete, the cached copy of the output dataset will be deleted.

If the user specifies that the dataset is both output and checkpoint, then SCR will use a hybrid approach. Files in the dataset will be cached and redundancy schemes will be used to protect the files. The dataset will be copied to the prefix directory, but it will also be kept in cache according to the policy set in the configuration for checkpoints. For example, if the user has set the configuration to keep three checkpoints in cache, then the dataset will be preserved until it is replaced by a newer checkpoint after three more checkpoint phases.
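The retention example above (keep three checkpoints in cache) behaves like a small first-in, first-out buffer. The following C sketch models only that policy; it is not how SCR stores its cache, and the names `CACHE_SIZE`, `ckpt_cache`, and `cache_add` are hypothetical.

```c
#define CACHE_SIZE 3 /* hypothetical: number of checkpoints kept in cache */

struct ckpt_cache {
  int ids[CACHE_SIZE]; /* dataset ids, oldest first */
  int count;           /* number of ids currently cached */
};

/* Adds a new checkpoint id to the cache.
 * Returns the id that was evicted to make room, or -1 if none was. */
int cache_add(struct ckpt_cache* c, int id) {
  int evicted = -1;
  if (c->count == CACHE_SIZE) {
    evicted = c->ids[0]; /* the oldest checkpoint is dropped */
    for (int i = 1; i < CACHE_SIZE; i++) {
      c->ids[i - 1] = c->ids[i];
    }
    c->count--;
  }
  c->ids[c->count++] = id;
  return evicted;
}
```

Starting from an empty cache, adding datasets 1, 2, and 3 evicts nothing, and adding dataset 4 evicts dataset 1. Dataset 1 therefore survives until the third checkpoint after it, matching the description above.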

SCR_Start_output

int SCR_Start_output(char* name, int flags);
SCR_START_OUTPUT(NAME, FLAGS, IERROR)
  CHARACTER*(*) NAME
  INTEGER FLAGS, IERROR

Inform SCR that a new output phase is about to start. A process must call this function before it opens any files belonging to the dataset. SCR_Start_output must be called by all processes, including processes that do not write files as part of the dataset.

The caller can provide a name for the dataset in name. This name is used in two places. First, for checkpoints, it is returned as the name value in the SCR Restart API. Second, it is exposed to the user when listing datasets using the scr_index command, and the user may specify the name as a command line argument at times. For this reason, it is recommended to use short but meaningful names that are easy to type. The name value must be less than SCR_MAX_FILENAME characters. All processes should provide identical values in name. In C, the application may pass NULL for name in which case SCR generates a default name for the dataset based on its internal dataset id.

The dataset can be output, a checkpoint, or both. The caller specifies these properties using the SCR_FLAG_OUTPUT and SCR_FLAG_CHECKPOINT bit flags. Additionally, a SCR_FLAG_NONE flag is defined for initializing variables. In C, these values can be combined with the | bitwise OR operator. In Fortran, these values can be added together using the + sum operator. Note that in Fortran, each flag should appear at most once in the sum, because adding the same flag twice produces a different value rather than leaving the flag set. All processes should provide identical values in flags.
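To see why OR and addition differ, consider the small C model below. The `DEMO_FLAG_*` values are hypothetical stand-ins chosen as distinct single bits; the real values are defined in scr.h and are not stated in this document. The helper names are likewise illustrative.

```c
/* Hypothetical stand-ins for the SCR flags: each is a distinct single bit. */
#define DEMO_FLAG_NONE       0
#define DEMO_FLAG_CHECKPOINT 1
#define DEMO_FLAG_OUTPUT     2

/* Test whether a given property is set in flags. */
int is_checkpoint(int flags) { return (flags & DEMO_FLAG_CHECKPOINT) != 0; }
int is_output(int flags)     { return (flags & DEMO_FLAG_OUTPUT) != 0; }

/* Build the flags for a dataset that is both a checkpoint and output. */
int both_flags(void) {
  int flags = DEMO_FLAG_NONE;
  flags |= DEMO_FLAG_CHECKPOINT;
  flags |= DEMO_FLAG_OUTPUT;
  return flags;
}
```

With these stand-in values, `DEMO_FLAG_CHECKPOINT | DEMO_FLAG_CHECKPOINT` is still the checkpoint flag, whereas `DEMO_FLAG_CHECKPOINT + DEMO_FLAG_CHECKPOINT` equals 2 and would be read as output only. That is the hazard a sum-based combination has to avoid.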

This function should be called as soon as possible when initiating a dataset output. It is used internally within SCR for timing the cost of output operations. Each call to SCR_Start_output must be followed by a corresponding call to SCR_Complete_output.

In the current implementation, SCR_Start_output holds all processes at an MPI_Barrier to ensure that all processes are ready to start the output before it deletes cached files from a previous checkpoint.

SCR_Complete_output

int SCR_Complete_output(int valid);
SCR_COMPLETE_OUTPUT(VALID, IERROR)
  INTEGER VALID, IERROR

Inform SCR that all files for the current dataset output are complete (i.e., done writing and closed) and whether they are valid (i.e., written without error). A process must close all files in the dataset before calling SCR_Complete_output. SCR_Complete_output must be called by all processes, including processes that did not write any files as part of the output.

The parameter valid should be set to 1 if either the calling process wrote all of its files successfully or it wrote no files during the output phase. Otherwise, the process should call SCR_Complete_output with valid set to 0. SCR will determine whether all processes wrote their output files successfully.

Each call to SCR_Complete_output must be preceded by a corresponding call to SCR_Start_output.

For the case of checkpoint datasets, SCR_Complete_output behaves similarly to SCR_Complete_checkpoint.
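An output phase follows the same shape as a checkpoint, with a dataset name and flags supplied at the start. The sketch below is illustrative only: the SCR signatures come from this section, while the `USE_REAL_SCR` switch, the stand-in bodies, the assumed constant and flag values, the file naming scheme, and the function name `write_output_dataset` are assumptions.

```c
#include <stdio.h>

#ifdef USE_REAL_SCR
#include "scr.h"
#else
/* Stand-ins so the sketch runs without SCR; values and bodies are assumed. */
#define SCR_SUCCESS 0
#define SCR_MAX_FILENAME 1024
#define SCR_FLAG_CHECKPOINT 1
#define SCR_FLAG_OUTPUT 2
static int SCR_Start_output(char* name, int flags) {
  (void)name; (void)flags;
  return SCR_SUCCESS;
}
static int SCR_Route_file(const char* name, char* file) {
  snprintf(file, SCR_MAX_FILENAME, "%s", name); /* pass the path through */
  return SCR_SUCCESS;
}
static int SCR_Complete_output(int valid) { (void)valid; return SCR_SUCCESS; }
#endif

/* Writes one text file for this rank as part of the named dataset.
 * Returns 1 if the local write succeeded, 0 otherwise. */
int write_output_dataset(char* dataset, int flags, int rank, const char* text) {
  char name[256];
  char file[SCR_MAX_FILENAME];
  int valid = 1;

  /* All ranks pass the same dataset name and the same flags. */
  SCR_Start_output(dataset, flags);

  snprintf(name, sizeof(name), "%s_rank_%d.txt", dataset, rank);
  if (SCR_Route_file(name, file) != SCR_SUCCESS) {
    valid = 0;
  } else {
    FILE* fp = fopen(file, "w");
    if (fp == NULL) {
      valid = 0;
    } else {
      if (fputs(text, fp) == EOF) valid = 0;
      if (fclose(fp) != 0) valid = 0; /* close before SCR_Complete_output */
    }
  }

  SCR_Complete_output(valid);
  return valid;
}
```

Passing `SCR_FLAG_OUTPUT | SCR_FLAG_CHECKPOINT` as flags would mark the dataset as both output and checkpoint, and the dataset name is what the Restart API would later hand back.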

Space/time semantics

SCR imposes the following semantics:

  • A process of a given MPI rank may only access files previously written by itself or by processes having the same MPI rank in prior runs. We say that a rank “owns” the files it writes. A process is never guaranteed access to files written by other MPI ranks.
  • During a checkpoint, a process may only access files of the current checkpoint between calls to SCR_Start_checkpoint() and SCR_Complete_checkpoint(). Once a process calls SCR_Complete_checkpoint() it is no longer guaranteed access to any file that it registered as part of that checkpoint via SCR_Route_file().
  • During a restart, a process may only access files from its “most recent” checkpoint, and it must access those files between calls to SCR_Start_restart() and SCR_Complete_restart(). Once a process calls SCR_Complete_restart() it is no longer guaranteed access to its restart files. SCR selects which checkpoint is considered to be the “most recent”.

These semantics enable SCR to cache files on devices that are not globally visible to all processes, such as node-local storage. Further, these semantics enable SCR to move, reformat, or delete files as needed, such that it can manage this cache.

SCR API state transitions

Figure: SCR API State Transition Diagram

Figure SCR API State Transition Diagram illustrates the internal states in SCR and which API calls can be used from within each state. The application must call SCR_Init before it may call any other SCR function, and it may not call SCR functions after calling SCR_Finalize. Some calls transition SCR from one state to another as shown by the edges between states. Other calls are only valid when in certain states as shown in the boxes. For example, SCR_Route_file is only valid within the Checkpoint, Restart, or Output states. All SCR functions are implicitly collective across MPI_COMM_WORLD, except for SCR_Route_file and SCR_Get_version.
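The transitions can also be written down as a small state machine. Since the figure itself is not reproduced here, the states and edges below are inferred from the text of this chapter (Start calls enter a phase, the matching Complete call leaves it, and SCR_Init and SCR_Finalize bracket everything), so treat this as an approximation of the diagram. All enum and function names are hypothetical.

```c
enum demo_state {
  ST_UNINIT, ST_IDLE, ST_CHECKPOINT, ST_RESTART, ST_OUTPUT,
  ST_FINALIZED, ST_ERROR
};

enum demo_call {
  CALL_INIT, CALL_FINALIZE,
  CALL_START_CHECKPOINT, CALL_COMPLETE_CHECKPOINT,
  CALL_START_RESTART, CALL_COMPLETE_RESTART,
  CALL_START_OUTPUT, CALL_COMPLETE_OUTPUT
};

/* Returns the state reached by making call c in state s,
 * or ST_ERROR if the call is not allowed in that state. */
enum demo_state next_state(enum demo_state s, enum demo_call c) {
  switch (s) {
    case ST_UNINIT:
      return (c == CALL_INIT) ? ST_IDLE : ST_ERROR;
    case ST_IDLE:
      switch (c) {
        case CALL_START_CHECKPOINT: return ST_CHECKPOINT;
        case CALL_START_RESTART:    return ST_RESTART;
        case CALL_START_OUTPUT:     return ST_OUTPUT;
        case CALL_FINALIZE:         return ST_FINALIZED;
        default:                    return ST_ERROR;
      }
    case ST_CHECKPOINT:
      return (c == CALL_COMPLETE_CHECKPOINT) ? ST_IDLE : ST_ERROR;
    case ST_RESTART:
      return (c == CALL_COMPLETE_RESTART) ? ST_IDLE : ST_ERROR;
    case ST_OUTPUT:
      return (c == CALL_COMPLETE_OUTPUT) ? ST_IDLE : ST_ERROR;
    default:
      return ST_ERROR; /* no calls are allowed after finalize or an error */
  }
}
```

A checker like this could be wrapped around an application's SCR calls in debug builds to catch an unmatched Start or Complete call. Query functions such as SCR_Need_checkpoint and SCR_Should_exit are left out because the text does not tie them to a transition.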