Conference paper. Year: 2023

Elastic deep learning through resilient collective operations

Jiali Li
George Bosilca
Aurelien Bouteiller
Bogdan Nicolae

Abstract

This paper presents a robust solution that incorporates fault tolerance and elastic scaling capabilities for distributed deep learning. Taking advantage of the resilience capabilities of MPI, namely User-Level Failure Mitigation (ULFM), the approach provides efficient, lightweight failure management and supports smooth scaling in volatile computational settings. The proposed ULFM-based MPI mechanism outperforms the only officially supported elastic learning framework, Elastic Horovod (using Gloo and NCCL), by a significant factor. These results reinforce the ability of the MPI extension to handle resiliency and promote ULFM as an effective technique for fault management that minimizes downtime and thereby improves the overall performance of distributed applications, in particular elastic training in high-performance computing (HPC) environments and machine learning applications.
Main file
ULFM_hvd.pdf (6.53 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-04343677, version 1 (14-12-2023)

License

Attribution

Cite

Jiali Li, George Bosilca, Aurelien Bouteiller, Bogdan Nicolae. Elastic deep learning through resilient collective operations. AI4S'23: 4th Workshop on Artificial Intelligence and Machine Learning for Scientific Applications (with SC'23), Nov 2023, Denver, United States. pp. 44-50, ⟨10.1145/3624062.3626080⟩. ⟨hal-04343677⟩