Abstract: Workflow systems are considered here as a means of supporting large-scale multiphysics simulations. Because large distributed and parallel multi-core infrastructures are prone to software and hardware failures, the paper addresses the need for error recovery procedures. A new mechanism based on asymmetric checkpointing is presented, and a rule-based implementation for a distributed workflow platform is detailed.
https://hal.inria.fr/inria-00524612
Contributor: Toan Nguyen
Submitted on: Friday, October 8, 2010 - 12:32:15 PM
Last modification on: Saturday, June 25, 2022 - 11:04:54 PM
Long-term archiving on: Monday, January 10, 2011 - 11:43:08 AM
Toan Nguyen, Laurentiu Trifan, Jean-Antoine Désidéri. Resilient Workflows for High-Performance Simulation Platforms. The 2010 International Conference on High Performance Computing & Simulation (HPCS 2010), Jun 2010, Caen, France. ⟨inria-00524612⟩