Application checkpointing

Checkpointing is a technique for adding fault tolerance to computing systems. It consists of storing a snapshot of the current application state and later using that snapshot to restart execution after a failure.
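As an illustration, the following is a minimal sketch of checkpoint and restart for a simple loop. The function names (save_checkpoint, load_checkpoint, run), the JSON file format and the fail_at parameter are assumptions made for this example only; they do not come from any particular checkpointing tool.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: write a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or the default when no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, limit, fail_at=None):
    """Sum the integers below `limit`, checkpointing after every step.

    `fail_at` is a hypothetical knob that simulates a crash at a given step.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < limit:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling run again with the same path after a failure resumes from the last saved step instead of starting over. Checkpointing after every step is chosen here for simplicity; a real application would normally checkpoint less often to limit overhead.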

Properties of checkpointing techniques

There are many different approaches to application checkpointing. Depending on its implementation, a checkpointing tool can be classified according to several properties:

  • Amount of state saved: This property refers to the abstraction level at which the technique analyzes an application. It can range from treating the application as a black box, and hence storing all application data, to selecting only the relevant portions of data in order to achieve more efficient and portable operation.
  • Automation level: The amount of effort needed to achieve fault tolerance with a specific checkpointing solution.
  • Portability: Whether or not the saved state can be used on different machines to restart the application.

Each design decision affects the properties and efficiency of the final product. For instance, storing the entire application state allows a more straightforward implementation, since no analysis of the application is needed, but it prevents the generated state files from being portable, because a number of non-portable structures (such as the application stack or heap) are stored along with the application data.
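The selective end of this trade-off can be sketched as follows. The Simulation class, its fields and its method names are hypothetical and exist only for illustration. Only the data needed to resume is written, in a machine-independent text format, while derived data is rebuilt on restart.

```python
import json


class Simulation:
    """Hypothetical application whose checkpoint stores only essential data."""

    def __init__(self, step=0, positions=None):
        self.step = step
        self.positions = positions if positions is not None else [0.0, 1.0, 2.0]
        # Derived data: cheap to recompute, so it is never checkpointed.
        self._cache = self._build_cache()

    def _build_cache(self):
        return {"centroid": sum(self.positions) / len(self.positions)}

    def advance(self):
        self.positions = [p + 0.5 for p in self.positions]
        self.step += 1
        self._cache = self._build_cache()

    def to_checkpoint(self):
        """Serialize only the fields required to resume the computation."""
        return json.dumps({"step": self.step, "positions": self.positions})

    @classmethod
    def from_checkpoint(cls, text):
        """Rebuild the object; the constructor recomputes the derived cache."""
        data = json.loads(text)
        return cls(step=data["step"], positions=data["positions"])
```

The cost of this approach is that the programmer must know which fields are essential, which is the analysis effort that a black-box checkpointer avoids.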

Checkpointing in distributed shared memory systems

In distributed shared memory (DSM) systems, checkpointing is a technique that helps tolerate errors that would otherwise cause the work of long-running applications to be lost. The main property that checkpointing techniques must have in such systems is that they preserve system consistency in case of failure. There are two main approaches to checkpointing in such systems: coordinated checkpointing, in which all cooperating processes work together to establish a coherent checkpoint, and communication-induced (also called dependency-induced) independent checkpointing.

It must be stressed that simply forcing processes to checkpoint their state at fixed times is not enough. Even if a global clock is assumed to exist, the checkpoints made by different processes may not form a consistent state. The need to establish a consistent state may force other processes to roll back to their checkpoints, which in turn may cause further processes to roll back to even earlier checkpoints. In the most extreme case, the only consistent state found is the initial state (the so-called domino effect).
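The domino effect can be reproduced with a small model. The sketch below assumes a simplified message-passing view in which each checkpoint records how many messages the process had sent to and received from every other process; the data layout and the function name find_recovery_line are assumptions for this example. A set of checkpoints is treated as inconsistent when one checkpoint records the receipt of a message that the sender's checkpoint does not record as sent.

```python
def find_recovery_line(history):
    """Return, per process, the index of the latest checkpoint in a consistent set.

    history[p] lists the checkpoints of process p, oldest first. Each
    checkpoint is {"sent": {q: n}, "recv": {q: n}}. Index 0 is assumed to be
    the initial state with no messages, which guarantees termination.
    """
    line = {p: len(cps) - 1 for p, cps in history.items()}
    changed = True
    while changed:
        changed = False
        for receiver in history:
            for sender in history:
                if sender == receiver:
                    continue
                sent = history[sender][line[sender]]["sent"].get(receiver, 0)
                # Roll the receiver back while it remembers unsent messages.
                while history[receiver][line[receiver]]["recv"].get(sender, 0) > sent:
                    line[receiver] -= 1
                    changed = True
    return line


EMPTY = {"sent": {}, "recv": {}}

# Checkpoints taken at unlucky moments: every one records a receive whose
# matching send happened after the sender's previous checkpoint.
history = {
    "P": [EMPTY,
          {"sent": {}, "recv": {"Q": 1}},
          {"sent": {"Q": 1}, "recv": {"Q": 2}}],
    "Q": [EMPTY,
          {"sent": {"P": 1}, "recv": {"P": 1}}],
}

print(find_recovery_line(history))  # {'P': 0, 'Q': 0}
```

In this history each rollback of one process invalidates a checkpoint of the other, so both end up at their initial state even though three non-initial checkpoints were taken.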

In the first approach, the processes must ensure that the checkpoint is consistent. This is usually achieved by some kind of two-phase commit algorithm. In the second, each process checkpoints its own state independently whenever that state is exposed to other processes (for example, whenever a remote process reads a page written by the local process).
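A two-phase scheme for coordinated checkpointing might look like the sketch below. It is a single-machine simulation under simplifying assumptions: the Process class, the can_checkpoint flag and the function coordinated_checkpoint are hypothetical, and the handling of messages in flight between the two phases, which real protocols must address, is omitted.

```python
import copy


class Process:
    """Hypothetical participant in a coordinated checkpoint."""

    def __init__(self, name, state, can_checkpoint=True):
        self.name = name
        self.state = state
        self.can_checkpoint = can_checkpoint  # False simulates a failed participant
        self.tentative = None
        self.committed = None

    def prepare(self):
        """Phase 1: take a tentative checkpoint and vote on the outcome."""
        if not self.can_checkpoint:
            return False
        self.tentative = copy.deepcopy(self.state)
        return True

    def commit(self):
        """Phase 2 (success): the tentative checkpoint becomes permanent."""
        self.committed = self.tentative
        self.tentative = None

    def abort(self):
        """Phase 2 (failure): discard the tentative checkpoint."""
        self.tentative = None


def coordinated_checkpoint(processes):
    """Commit a new checkpoint everywhere, or leave every old one in place."""
    votes = [p.prepare() for p in processes]
    if all(votes):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.abort()
    return False
```

Because a checkpoint is committed only when every participant has voted yes, the committed checkpoints always belong to the same round, which is what lets recovery avoid cascading rollbacks.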

The system state may be saved locally, in stable storage, or in a distant node's memory.

References

  • E.N. Elnozahy, L. Alvisi, Y-M. Wang, and D.B. Johnson, "A survey of rollback-recovery protocols in message-passing systems", ACM Comput. Surv., vol. 34, no. 3, pp. 375-408, 2002.