Audio-Visual Speaker Diarization in the Framework of Multi-User Human-Robot Interaction
Conference paper, 2023


Abstract

The speaker diarization task answers the question "who is speaking at a given time?" and provides valuable information for scene analysis in domains such as robotics. In this paper, we introduce a temporal audio-visual fusion model for multi-user speaker diarization with low computational requirements, good robustness, and no training phase. The proposed method identifies the dominant speakers and tracks them over time by measuring the spatial coincidence between sound locations and visual presence. The model is generative and its parameters are estimated online. Its effectiveness was assessed on two datasets: a public one and one collected in-house with the Pepper humanoid robot.
Main file: Dhaussy-ICASSP2023-authorcopy.pdf (199.32 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04140076, version 1 (30-06-2023)

License

Attribution

Identifiers

Cite

Timothée Dhaussy, Bassam Jabaian, Fabrice Lefèvre, Radu Horaud. Audio-Visual Speaker Diarization in the Framework of Multi-User Human-Robot Interaction. ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE Signal Processing Society, Jun 2023, Ixia-Ialyssos, Greece. pp.1-5, ⟨10.1109/ICASSP49357.2023.10096295⟩. ⟨hal-04140076⟩