Exploiting the Complementarity of Audio and Visual Data in Multi-Speaker Tracking

Yutong Ban; Laurent Girin; Xavier Alameda-Pineda; Radu Horaud

doi:10.1109/ICCVW.2017.60

Conference paper. Year: 2018


Yutong Ban

Role: Author

Interpretation and Modelling of Images and Videos

Laurent Girin

Role: Author
PersonId : 3682
IdHAL : laurent-girin
ORCID : 0000-0002-9214-8760
IdRef : 088998037

Interpretation and Modelling of Images and Videos

GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing

Xavier Alameda-Pineda

Role: Author
PersonId : 16186
IdHAL : xavier-alameda-pineda
ORCID : 0000-0002-5354-1084
IdRef : 18450919X

Interpretation and Modelling of Images and Videos

Radu Horaud

Role: Author
PersonId : 16183
IdHAL : radu-horaud
ORCID : 0000-0001-5232-024X
IdRef : 032302495

Interpretation and Modelling of Images and Videos

Abstract

Multi-speaker tracking is a central problem in human-robot interaction. In this context, exploiting auditory and visual information is gratifying and challenging at the same time. Gratifying because the complementary nature of auditory and visual information allows us to be more robust against noise and outliers than unimodal approaches. Challenging because how to properly fuse auditory and visual information for multi-speaker tracking is far from being a solved problem. In this paper we propose a probabilistic generative model that tracks multiple speakers by jointly exploiting auditory and visual features in their own representation spaces. Importantly, the method is robust to missing data and is therefore able to track even when observations from one of the modalities are absent. Quantitative and qualitative results on the AVDIAR dataset are reported.
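The abstract's claim of robustness to missing data can be illustrated with a deliberately simplified sketch. This is not the authors' model: it assumes a single speaker, a one-dimensional position, a random-walk motion model, and linear-Gaussian observations, and it swaps in a plain Kalman filter for the paper's probabilistic generative model. All names and numeric values (`predict`, `update`, `track_step`, the noise variances, the audio mapping `2.0 * x + 0.5`) are hypothetical. Each modality is observed in its own space through its own linear map, and a modality that is absent at a given frame is simply skipped.

```python
from typing import Optional, Tuple

State = Tuple[float, float]  # (posterior mean, posterior variance)


def predict(mean: float, var: float, process_var: float) -> State:
    """Random-walk motion model: the mean is kept, uncertainty grows."""
    return mean, var + process_var


def update(mean: float, var: float, obs: Optional[float],
           h: float, offset: float, noise_var: float) -> State:
    """Kalman update for one modality with model obs = h * x + offset + noise.

    If the observation is missing (None), the belief is returned unchanged.
    """
    if obs is None:
        return mean, var
    innovation = obs - (h * mean + offset)
    innovation_var = h * var * h + noise_var
    gain = var * h / innovation_var
    return mean + gain * innovation, (1.0 - gain * h) * var


def track_step(mean: float, var: float,
               visual_obs: Optional[float],
               audio_obs: Optional[float]) -> State:
    """One tracking step that fuses whichever modalities are available."""
    mean, var = predict(mean, var, process_var=0.01)
    # Assumed visual model: observes the position directly, low noise.
    mean, var = update(mean, var, visual_obs, h=1.0, offset=0.0, noise_var=0.05)
    # Assumed audio model: observes a linear function of the position, noisier.
    mean, var = update(mean, var, audio_obs, h=2.0, offset=0.5, noise_var=0.5)
    return mean, var


if __name__ == "__main__":
    print(track_step(0.0, 1.0, None, None))  # no observation: prediction only
    print(track_step(0.0, 1.0, 1.0, None))   # visual observation only
    print(track_step(0.0, 1.0, None, 2.5))   # audio observation only
    print(track_step(0.0, 1.0, 1.0, 2.5))    # both modalities
```

In this toy setting, the posterior variance shrinks with each modality that is present and only grows by the process noise when both are absent, which is the qualitative behaviour the abstract describes.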

Domains

Computer Vision and Pattern Recognition [cs.CV]; Signal and Image Processing [eess.SP]; Machine Learning [cs.LG]; Sound [cs.SD]

Main file

ICCVW_submission.pdf (5.57 MB)

0504_1000.png (467.97 KB)

0504_1000.jpg (38.89 KB)

Origin: Files produced by the author(s)

Format: Figure, Image
Origin: Files produced by the author(s)

Contributor: Perception team

https://inria.hal.science/hal-01577965

Submitted on: Monday, 28 August 2017, 15:07:40

Last modified on: Thursday, 4 April 2024, 21:17:19

Dates and versions

hal-01577965, version 1 (28-08-2017)

Identifiers

HAL Id: hal-01577965, version 1
DOI: 10.1109/ICCVW.2017.60

Cite

Yutong Ban, Laurent Girin, Xavier Alameda-Pineda, Radu Horaud. Exploiting the Complementarity of Audio and Visual Data in Multi-Speaker Tracking. ICCVW 2017 - IEEE International Conference on Computer Vision Workshops, Oct 2017, Venice, Italy. pp.446-454, ⟨10.1109/ICCVW.2017.60⟩. ⟨hal-01577965⟩


Collections

UNIV-RENNES1 UGA CNRS INRIA IRISA GIPSA GIPSA-DPC LJK LJK_GI LJK_GI_PERCEPTION GIPSA-CRISSP INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM

524 views

312 downloads
