From: Running resilient MPI applications on a Dynamic Group of Recommended Processes
Related work | Main strategies adopted |
---|---|
FT-MPI (fault-tolerant MPI) [21], Non-Stop and Fault-Resilient MPI (NR-MPI) [23], Run-Through Stabilization (RTS) [25], User Level Failure Mitigation (ULFM) [9], Consensus Protocol [26–28], Adaptive MPI (AMPI) [37] | Primitives for dealing with fault tolerance at the application level |
Checkpoint-restart at the application level | |
Dealing with process faults using ABFT [30] | Algorithm-Based Fault Tolerance (ABFT) |
Ferreira et al. [34], P2P-MPI [35], Fiala et al. [36], Silent error [36] | State-machine replication |
Gioiosa et al. [5], Aguilar et al. [40], TAUoverSupermon [42] | Monitoring system for performance |
DGRP | Monitoring system that recommends a group of processes to run an application |