Audio-Visual Speaker Diarization in the Framework of Multi-User Human-Robot Interaction
Conference paper, 2023


Abstract

The speaker diarization task answers the question "who is speaking at a given time?" and provides valuable information for scene analysis in domains such as robotics. In this paper, we introduce a temporal audio-visual fusion model for multi-user speaker diarization with low computational requirements, good robustness, and no training phase. The proposed method identifies the dominant speakers and tracks them over time by measuring the spatial coincidence between sound locations and visual presence. The model is generative and its parameters are estimated online. Its effectiveness was assessed on two datasets: a public one and one collected in-house with the Pepper humanoid robot.
Main file: Dhaussy-ICASSP2023-authorcopy.pdf (199.32 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04140076, version 1 (30-06-2023)

License

Attribution

Identifiers

Cite

Timothée Dhaussy, Bassam Jabaian, Fabrice Lefèvre, Radu Horaud. Audio-Visual Speaker Diarization in the Framework of Multi-User Human-Robot Interaction. ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE Signal Processing Society, Jun 2023, Ixia-Ialyssos, Greece. pp.1-5, ⟨10.1109/ICASSP49357.2023.10096295⟩. ⟨hal-04140076⟩