Audio-Visual Clustering for Multiple Speaker Localization

Vasil Khalidov; Florence Forbes; Miles Hansard; Élise Arnaud; Radu Horaud

doi:10.1007/978-3-540-85853-9_8

Conference paper, Year: 2008


Vasil Khalidov

Role: Author

Modelling and Inference of Complex and Structured Stochastic Systems

Florence Forbes

Role: Author
PersonId : 16305
IdHAL : florence-forbes
ORCID : 0000-0003-3639-0226
IdRef : 12469781X

Modelling and Inference of Complex and Structured Stochastic Systems

Miles Hansard

Role: Author

Interpretation and Modelling of Images and Videos

Élise Arnaud

Role: Author
PersonId : 752388
IdHAL : elise-arnaud
IdRef : 08300842X

Interpretation and Modelling of Images and Videos

Radu Horaud

Role: Author
PersonId : 16183
IdHAL : radu-horaud
ORCID : 0000-0001-5232-024X
IdRef : 032302495

Interpretation and Modelling of Images and Videos

Abstract

We address the issue of identifying and localizing individuals in a scene that contains several people engaged in conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the identification and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data to a representation of the common 3D scene-space, via a pair of Gaussian mixture models. Inference is performed by a version of the Expectation Maximization algorithm, which provides cooperative estimates of both the activity and the 3D position of each speaker.
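As a rough illustration of the clustering idea in the abstract, the sketch below runs plain Expectation Maximization on a one-dimensional Gaussian mixture. It is not the authors' model: the paper couples an audio mixture and a visual mixture through shared 3D positions, whereas this toy fits a single mixture to synthetic scalar data. The function names (`gaussian_pdf`, `responsibilities`, `em_gmm_1d`), the initial means, and the synthetic data are all assumptions made for the example.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def responsibilities(data, weights, means, variances):
    """E-step: posterior probability of each component for each observation."""
    resp = []
    for x in data:
        p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(p)
        if total == 0.0:
            # All densities underflowed: fall back to a uniform assignment.
            resp.append([1.0 / len(p)] * len(p))
        else:
            resp.append([pj / total for pj in p])
    return resp


def em_gmm_1d(data, init_means, n_iter=50, min_var=1e-6):
    """Fit a 1D Gaussian mixture by EM, starting from the given means (hypothetical helper)."""
    k = len(init_means)
    weights = [1.0 / k] * k
    means = list(init_means)
    variances = [1.0] * k
    for _ in range(n_iter):
        resp = responsibilities(data, weights, means, variances)
        # M-step: re-estimate each component from its soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # floor avoids a collapsed component
    # Hard labels: the most probable component for each observation.
    resp = responsibilities(data, weights, means, variances)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels


# Synthetic data (assumed for illustration): two sources near -2.0 and 3.0.
random.seed(0)
data = [random.gauss(-2.0, 0.5) for _ in range(100)]
data += [random.gauss(3.0, 0.5) for _ in range(100)]
weights, means, variances, labels = em_gmm_1d(data, init_means=[-1.0, 1.0])
```

Here each mixture component plays the role of one speaker, and the final labels group the observations by their most probable source. The initial means are fixed rather than random so that the run is repeatable.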

Domains

Computer Vision and Pattern Recognition [cs.CV]

Main file

mlmi2008.pdf (363.23 KB)

mlmi08.jpg (84.46 KB)

Origin: Files produced by the author(s)

Format: Figure, Image

https://inria.hal.science/inria-00373154

Submitted on: Friday, April 3, 2009, 14:46:46

Last modified on: Thursday, April 4, 2024, 21:06:51

Long-term archiving on: Thursday, June 10, 2010, 19:41:09

Dates and versions

inria-00373154 , version 1 (03-04-2009)

Identifiers

HAL Id : inria-00373154 , version 1
DOI : 10.1007/978-3-540-85853-9_8

Cite

Vasil Khalidov, Florence Forbes, Miles Hansard, Élise Arnaud, Radu Horaud. Audio-Visual Clustering for Multiple Speaker Localization. MLMI 2008 - International Workshop on Machine Learning for Multimodal Interaction, Sep 2008, Utrecht, Netherlands. pp.86-97, ⟨10.1007/978-3-540-85853-9_8⟩. ⟨inria-00373154⟩

Collections

UNIV-RENNES1 UGA CNRS INRIA IRISA LJK LJK_GI LJK_PS LJK_GI_PERCEPTION LJK_PS_MISTIS INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM

296 views

190 downloads
