Contention awareness and fault tolerant scheduling for precedence constrained tasks in heterogeneous systems

Abstract: Heterogeneous distributed systems are widely deployed for executing computationally intensive parallel applications with diverse computing needs. Such environments require effective scheduling strategies that take into account both algorithmic and architectural characteristics. Unfortunately, most of the scheduling algorithms developed for such systems rely on a simple platform model where communication contention is not taken into account. In addition, it is generally assumed that processors are completely safe. To schedule precedence graphs in a more realistic framework, we first introduce an efficient fault-tolerant scheduling algorithm that is both contention-aware and capable of supporting an arbitrary number of fail-silent (fail-stop) processor failures. Next, we derive a more complex heuristic that departs from the main principle of the first algorithm. Instead of considering a single task (the one with the highest priority) and assigning all its replicas to the currently best available resources, we consider a chunk of ready tasks and assign all their replicas in the same decision-making procedure. This leads to a better load balance of processors and communication links. We focus on a bi-criteria approach, where we aim at minimizing the total execution time, or latency, given a fixed number of failures supported in the system. Our algorithms have a low time complexity and drastically reduce the number of additional communications induced by the replication mechanism. Experimental results demonstrate the usefulness of the proposed algorithms, which lead to efficient execution schemes while guaranteeing a prescribed level of fault tolerance.
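The replication principle mentioned in the abstract (one highest-priority task at a time, all of its replicas placed on the currently best resources) can be illustrated with a minimal Python sketch. This is not the authors' algorithm. It rests on several assumptions that do not come from the paper: communication costs and link contention are ignored entirely, the priority rule is "largest average execution cost first", each task gets eps + 1 replicas on distinct processors so that eps fail-silent failures still leave one copy alive, and a replica pessimistically waits for the slowest replica of every predecessor. The names `schedule_with_replication`, `preds`, `cost`, and `eps` are hypothetical.

```python
from typing import Dict, List, Tuple

# A replica is recorded as (processor index, start time, finish time).
Replica = Tuple[int, float, float]


def schedule_with_replication(
    preds: Dict[str, List[str]],
    cost: Dict[str, List[float]],
    eps: int,
) -> Tuple[Dict[str, List[Replica]], float]:
    """Greedy list scheduling with eps + 1 replicas per task.

    preds[t] lists the predecessors of task t (every task must be a key).
    cost[t][p] is the execution time of task t on processor p.
    Communication costs and contention are NOT modelled (assumption).
    """
    num_procs = len(next(iter(cost.values())))
    if eps + 1 > num_procs:
        raise ValueError("need at least eps + 1 processors")

    proc_free = [0.0] * num_procs  # time at which each processor becomes idle
    placement: Dict[str, List[Replica]] = {}
    remaining = set(preds)

    while remaining:
        # Ready tasks: all predecessors have already been placed.
        ready = sorted(
            t for t in remaining if all(p in placement for p in preds[t])
        )
        if not ready:
            raise ValueError("precedence graph contains a cycle")

        # Assumed priority rule: largest average execution cost first.
        task = max(ready, key=lambda t: sum(cost[t]) / num_procs)

        # Worst case: wait for the slowest replica of every predecessor.
        data_ready = max(
            (finish for p in preds[task] for (_, _, finish) in placement[p]),
            default=0.0,
        )

        # Rank processors by the finish time they would give this task.
        candidates = []
        for proc in range(num_procs):
            start = max(proc_free[proc], data_ready)
            candidates.append((start + cost[task][proc], start, proc))
        candidates.sort()

        # Keep the eps + 1 best processors; they are distinct by construction.
        chosen = candidates[: eps + 1]
        placement[task] = [(proc, start, finish) for finish, start, proc in chosen]
        for finish, _, proc in chosen:
            proc_free[proc] = finish
        remaining.remove(task)

    latency = max(f for reps in placement.values() for (_, _, f) in reps)
    return placement, latency
```

Calling `schedule_with_replication(preds, cost, eps=1)` on a small task graph returns the replica placement of every task together with a worst-case latency bound. A contention-aware variant would additionally have to reserve time slots on communication links, which this sketch deliberately leaves out.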
Document type:
Journal article
Parallel Computing, Elsevier, 2009, 35 (2), pp.83-108. 〈10.1016/j.parco.2008.11.001〉
https://hal.inria.fr/hal-00980693
Contributor: Equipe Roma
Submitted on: Friday, April 18, 2014 - 15:53:34
Last modified on: Friday, July 6, 2018 - 15:06:08

Citation

Anne Benoit, Mourad Hakem, Yves Robert. Contention awareness and fault tolerant scheduling for precedence constrained tasks in heterogeneous systems. Parallel Computing, Elsevier, 2009, 35 (2), pp.83-108. 〈10.1016/j.parco.2008.11.001〉. 〈hal-00980693〉
