Fault Tolerance Interface
New release v0.9.5!
FTI stands for Fault Tolerance Interface and is a library that aims to give
computational scientists the means to perform fast and efficient multilevel
checkpointing on large-scale supercomputers. FTI leverages local storage plus
data replication and erasure codes to provide several levels of reliability and
performance. FTI performs application-level checkpointing and lets users select
which datasets need to be protected, in order to improve efficiency and avoid
wasting space, time and energy. In addition, it offers a direct data interface
so that users do not need to deal with files and/or directory names. All
metadata is managed by FTI in a transparent fashion for the user. If desired,
users can dedicate one process per node to overlap fault tolerance workload and
scientific computation, so that post-checkpoint tasks are executed
asynchronously.

The checkpoint levels are:
- Level 1: Checkpoint on local storage. Fast and efficient against soft and transient errors.
- Level 2: Partner copy of local checkpoint. Can tolerate any single node crash in the system.
- Level 3: Reed-Solomon encoded checkpoints. Can tolerate correlated failures affecting multiple nodes.
- Level 4: Parallel File System based checkpoint. Tolerates catastrophic failures such as power failures.
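
To make the ideas behind Levels 2 and 3 concrete, here is a minimal, self-contained C sketch. It does not use FTI and is not FTI's implementation. The ring-style partner mapping is an assumption made for illustration, and single XOR parity is swapped in for Reed-Solomon: XOR parity is the simplest erasure code and can rebuild only one lost checkpoint, whereas Reed-Solomon encoding can survive several simultaneous losses. All names (`partner_of`, `xor_parity`, `xor_recover`, `demo_recover`, `NODES`, `CKPT_BYTES`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NODES 4      /* hypothetical number of nodes in one group */
#define CKPT_BYTES 8 /* hypothetical checkpoint size per node */

/* Level 2 idea: every node stores a copy of its checkpoint on a partner.
 * Assumption: partners form a ring, so node r copies to node (r + 1) % n. */
int partner_of(int rank, int nodes)
{
    return (rank + 1) % nodes;
}

/* Level 3 idea, simplified: XOR all checkpoints into one parity block. */
void xor_parity(unsigned char ckpt[][CKPT_BYTES], int nodes,
                unsigned char parity[CKPT_BYTES])
{
    memset(parity, 0, CKPT_BYTES);
    for (int n = 0; n < nodes; n++)
        for (size_t i = 0; i < CKPT_BYTES; i++)
            parity[i] ^= ckpt[n][i];
}

/* Rebuild the checkpoint of one lost node from the parity block and the
 * checkpoints of all surviving nodes. */
void xor_recover(unsigned char ckpt[][CKPT_BYTES], int nodes, int lost,
                 const unsigned char parity[CKPT_BYTES],
                 unsigned char out[CKPT_BYTES])
{
    memcpy(out, parity, CKPT_BYTES);
    for (int n = 0; n < nodes; n++) {
        if (n == lost)
            continue; /* the lost node's data is unavailable */
        for (size_t i = 0; i < CKPT_BYTES; i++)
            out[i] ^= ckpt[n][i];
    }
}

/* Simulate the crash of node `lost`; return 1 if its checkpoint can be
 * rebuilt byte for byte, 0 otherwise. */
int demo_recover(int lost)
{
    unsigned char ckpt[NODES][CKPT_BYTES];
    unsigned char parity[CKPT_BYTES];
    unsigned char saved[CKPT_BYTES];
    unsigned char rebuilt[CKPT_BYTES];

    assert(lost >= 0 && lost < NODES);

    /* Fill every node's checkpoint with distinct dummy data. */
    for (int n = 0; n < NODES; n++)
        for (size_t i = 0; i < CKPT_BYTES; i++)
            ckpt[n][i] = (unsigned char)(17 * n + 3 * (int)i + 1);

    xor_parity(ckpt, NODES, parity);

    memcpy(saved, ckpt[lost], CKPT_BYTES); /* kept only for comparison */
    memset(ckpt[lost], 0, CKPT_BYTES);     /* simulate the node crash */

    xor_recover(ckpt, NODES, lost, parity, rebuilt);
    return memcmp(saved, rebuilt, CKPT_BYTES) == 0;
}
```

Calling `demo_recover(k)` from a `main` function for any `k` between 0 and `NODES - 1` exercises the recovery path for that node. In this sketch the parity block is assumed to survive the failure; a real scheme must also distribute the encoded data across nodes so that it is not lost together with the checkpoint it protects.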