(. R. Ashraf, Hukerikar (S.) et Engelmann (C.). -Shrink or Substitute : Handling Process Failures in HPC Systems Using In-Situ Recovery. -In 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pp.178-185, 2018.

, Task Programming over Clusters of Machines Enhanced with Accelerators. -report, 2014.

, Bosilca (G.) et Dongarra (J.). -Post-failure recovery of MPI communication capability : Design and rationale, Bouteiller (A.), Herault (T.), vol.27, pp.244-254, 2013.

, Lemarinier (P.) et Cappello (F.). -MPICH-V Project : A Multiprotocol Automatic Fault-Tolerant MPI, Int. J. High Perform. Comput. Appl, vol.20, pp.319-333, 2006.

, Coti (C.). -Fault Tolerance Techniques for Distributed, Parallel Applications. Innovative Research and Applications in Next-Generation High Performance Computing, pp.221-252, 2016.

, ). -A Survey of Rollbackrecovery Protocols in Message-passing Systems, ACM Comput. Surv, vol.34, pp.375-408, 2002.

, Kacsuk (P.) et Podhorszki (N.) (édité par), Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, pp.346-353, 2000.

(. M. Sergent, et Archipoff (S.). -Modulariser les ordonnanceurs de tâches : une approche structurelle, 2014.

. Tessier, -TAPIOCA : An I/O Library for Optimized Topology-Aware Data Aggregation on Large-Scale Supercomputers, CLUSTER 2017 -IEEE International Conference on Cluster Computing, p.2017

, Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes, vol.87, pp.1820-1828, 2004.