dMPI: Facilitating Debugging of MPI Programs via Deterministic Message Passing

Abstract: This paper presents dMPI, a novel deterministic MPI implementation that facilitates the debugging of MPI programs. Unlike existing approaches, dMPI ensures inherent determinism without any external support (e.g., logs), achieving both convenience and performance. The basic idea of dMPI is to use deterministic logical time to resolve message races and control asynchronous transmissions, thereby eliminating the nondeterministic behaviors of the existing message-passing mechanism. To avoid deadlocks introduced by dMPI, we also integrate it with a lightweight deadlock checker that dynamically detects and resolves these deadlocks. We have implemented dMPI and evaluated it using the NPB benchmarks. The results show that dMPI guarantees determinism while incurring modest overhead (8% on average).
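
To make the idea of resolving a message race with logical time concrete, the following toy model contrasts a wildcard receive that matches in arrival order with one that matches on a deterministic key. It is only a sketch under stated assumptions, not dMPI's actual mechanism: the `Message` type, the per-message `logical_time` field, the tie-break on sender rank, and both matching functions are hypothetical names introduced here, and the model assumes that all competing messages are already known to the receiver.

```python
from itertools import permutations
from typing import NamedTuple, Sequence


class Message(NamedTuple):
    logical_time: int  # assumed: a deterministic logical clock value set by the sender
    source: int        # rank of the sending process
    payload: str


def arrival_order_match(pending: Sequence[Message]) -> Message:
    """Wildcard receive that takes whichever message happened to arrive first."""
    return pending[0]


def deterministic_match(pending: Sequence[Message]) -> Message:
    """Wildcard receive that picks the smallest (logical_time, source) pair.

    The tie-break on sender rank is an assumption made for this sketch.
    """
    return min(pending, key=lambda m: (m.logical_time, m.source))


# Three messages racing for the same wildcard receive.
messages = [
    Message(logical_time=7, source=2, payload="from rank 2"),
    Message(logical_time=5, source=3, payload="from rank 3"),
    Message(logical_time=5, source=1, payload="from rank 1"),
]

# Try every possible arrival order and record which sender wins the race.
arrival_choices = {
    arrival_order_match(list(order)).source for order in permutations(messages)
}
deterministic_choices = {
    deterministic_match(list(order)).source for order in permutations(messages)
}

print("arrival-order winners:", sorted(arrival_choices))
print("logical-time winners:", sorted(deterministic_choices))
```

In this model the arrival-order receive can be won by any of the three senders, while the logical-time receive always selects the same message regardless of arrival order. A real receiver would additionally have to wait until no message with a smaller logical time can still arrive, and that waiting is one plausible way a deterministic scheme could introduce deadlocks of the kind the abstract mentions.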
Document type:
Conference paper
James J. Park; Albert Zomaya; Sang-Soo Yeo; Sartaj Sahni. 9th International Conference on Network and Parallel Computing (NPC), Sep 2012, Gwangju, South Korea. Springer, Lecture Notes in Computer Science, LNCS-7513, pp.172-179, 2012, Network and Parallel Computing. 〈10.1007/978-3-642-35606-3_20〉


https://hal.inria.fr/hal-01551348
Contributor: Hal Ifip
Submitted on: Friday, June 30, 2017 - 10:35:59
Last modified on: Friday, December 1, 2017 - 01:09:58
Document(s) archived on: Monday, January 22, 2018 - 20:42:35

File

978-3-642-35606-3_20_Chapter.p...
Files produced by the author(s)

License

Distributed under a Creative Commons Attribution 4.0 International License

Citation

Xu Zhou, Kai Lu, Xicheng Lu, Xiaoping Wang, Baohua Fan. dMPI: Facilitating Debugging of MPI Programs via Deterministic Message Passing. James J. Park; Albert Zomaya; Sang-Soo Yeo; Sartaj Sahni. 9th International Conference on Network and Parallel Computing (NPC), Sep 2012, Gwangju, South Korea. Springer, Lecture Notes in Computer Science, LNCS-7513, pp.172-179, 2012, Network and Parallel Computing. 〈10.1007/978-3-642-35606-3_20〉. 〈hal-01551348〉
