Contention awareness and fault tolerant scheduling for precedence constrained tasks in heterogeneous systems

Abstract: Heterogeneous distributed systems are widely deployed for executing computationally intensive parallel applications with diverse computing needs. Such environments require effective scheduling strategies that take into account both algorithmic and architectural characteristics. Unfortunately, most of the scheduling algorithms developed for such systems rely on a simple platform model where communication contention is not taken into account. In addition, it is generally assumed that processors are completely safe. To schedule precedence graphs in a more realistic framework, we first introduce an efficient fault-tolerant scheduling algorithm that is both contention-aware and capable of supporting an arbitrary number of fail-silent (fail-stop) processor failures. Next, we derive a more complex heuristic that departs from the main principle of the first algorithm. Instead of considering a single task (the one with the highest priority) and assigning all its replicas to the currently best available resources, we consider a chunk of ready tasks and assign all their replicas in the same decision-making procedure. This leads to a better load balance of processors and communication links. We focus on a bi-criteria approach, where we aim at minimizing the total execution time, or latency, given a fixed number of failures supported in the system. Our algorithms have a low time complexity and drastically reduce the number of additional communications induced by the replication mechanism. Experimental results demonstrate the usefulness of the proposed algorithms, which lead to efficient execution schemes while guaranteeing a prescribed level of fault tolerance.
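The replication principle mentioned in the abstract (one task at a time, all of its replicas placed on the currently best processors) can be illustrated with a minimal sketch. This is not the paper's algorithm: it ignores link contention, uses a plain topological order instead of a priority function, and takes a pessimistic data-ready time. The function name `schedule_with_replication`, the parameter names, and the sample data are all hypothetical.

```python
from graphlib import TopologicalSorter


def schedule_with_replication(preds, exec_time, comm, eps):
    """Greedy list scheduling that places eps + 1 replicas of every task.

    preds:     dict task -> list of predecessor tasks
    exec_time: dict task -> list of run times, one entry per processor
    comm:      dict (pred, task) -> transfer time between distinct processors
    eps:       number of fail-stop processor failures to tolerate
    Returns a dict task -> list of (processor, start, finish) replicas.
    """
    num_procs = len(next(iter(exec_time.values())))
    if eps + 1 > num_procs:
        raise ValueError("need at least eps + 1 processors")

    proc_free = [0.0] * num_procs  # time at which each processor becomes idle
    placement = {}

    # Assumption: topological order stands in for a real priority list.
    for task in TopologicalSorter(preds).static_order():
        candidates = []
        for p in range(num_procs):
            ready = 0.0
            for pred in preds.get(task, []):
                # Pessimistic bound: wait for the latest-arriving replica of
                # each predecessor; data on the same processor costs nothing.
                arrival = max(
                    fin + (0.0 if q == p else comm[(pred, task)])
                    for q, _, fin in placement[pred]
                )
                ready = max(ready, arrival)
            start = max(ready, proc_free[p])
            candidates.append((start + exec_time[task][p], start, p))

        # Keep the eps + 1 processors with the earliest finish times, so the
        # replicas of one task always sit on distinct processors.
        chosen = sorted(candidates)[: eps + 1]
        placement[task] = [(p, s, f) for f, s, p in chosen]
        for f, _, p in chosen:
            proc_free[p] = f
    return placement


# Hypothetical two-task chain A -> B on three processors, tolerating one failure.
demo = schedule_with_replication(
    preds={"A": [], "B": ["A"]},
    exec_time={"A": [2, 3, 4], "B": [1, 1, 5]},
    comm={("A", "B"): 1},
    eps=1,
)
latency = max(f for reps in demo.values() for _, _, f in reps)
```

Because each task has eps + 1 replicas on distinct processors, at least one replica of every task survives any eps fail-stop failures; `latency` is then an upper bound on the completion time under the pessimistic data-ready rule used here.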
https://hal.inria.fr/hal-00980693
Contributor: Equipe Roma
Submitted on: Friday, April 18, 2014 - 3:53:34 PM
Last modification on: Friday, July 6, 2018 - 3:06:08 PM

Citation

Anne Benoit, Mourad Hakem, Yves Robert. Contention awareness and fault tolerant scheduling for precedence constrained tasks in heterogeneous systems. Parallel Computing, Elsevier, 2009, 35 (2), pp.83-108. ⟨10.1016/j.parco.2008.11.001⟩. ⟨hal-00980693⟩