Audio-Visual Clustering for Multiple Speaker Localization

Vasil Khalidov 1, Florence Forbes 1, Miles Hansard 2, Elise Arnaud 2, Radu Horaud 2
1 MISTIS - Modelling and Inference of Complex and Structured Stochastic Systems
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, INPG - Institut National Polytechnique de Grenoble
2 PERCEPTION - Interpretation and Modelling of Images and Videos
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, INPG - Institut National Polytechnique de Grenoble
Abstract: We address the issue of identifying and localizing individuals in a scene that contains several people engaged in conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the identification and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data to a representation of the common 3D scene-space, via a pair of Gaussian mixture models. Inference is performed by a version of the Expectation Maximization algorithm, which provides cooperative estimates of both the activity and the 3D position of each speaker.
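As a rough illustration of the clustering idea in the abstract, the sketch below runs standard Expectation Maximization on a one-dimensional Gaussian mixture. It is not the paper's model, which couples an audio mixture and a visual mixture through shared 3D speaker positions; the function names, the synthetic data and the initial values are all assumptions made for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, n_iter=50, min_var=1e-6):
    """Fit a 1D Gaussian mixture by EM, starting from the given means.

    Hypothetical helper for illustration only. Returns the mixture weights,
    means, variances and a hard cluster label for each data point.
    """
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint) or 1e-300  # guard against numerical underflow
            resp.append([j / total for j in joint])
        # M-step: re-estimate the parameters from the posteriors.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)
    # Hard assignment: each point goes to its most probable component.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels


# Synthetic observations from two well-separated sources (assumed values).
rng = random.Random(0)
data = ([rng.gauss(-2.0, 0.5) for _ in range(200)]
        + [rng.gauss(3.0, 0.5) for _ in range(200)])
weights, means, variances, labels = em_gmm_1d(data, means=[-1.0, 1.0])
```

Here each mixture component plays the role of one source, and the posterior probabilities computed in the E-step give a soft assignment of observations to sources. In the paper's setting the component parameters would instead be tied to 3D positions shared by the audio and visual mixtures.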
Document type:
Conference paper
Andrei Popescu-Belis and Rainer Stiefelhagen. MLMI 2008 - International Workshop on Machine Learning for Multimodal Interaction, Sep 2008, Utrecht, Netherlands. Springer, 5237, pp.86-97, 2008, Lecture Notes in Computer Science. 〈10.1007/978-3-540-85853-9_8〉

https://hal.inria.fr/inria-00373154
Contributor: Elise Arnaud
Submitted on: Friday, April 3, 2009 - 14:46:46
Last modified on: Wednesday, April 11, 2018 - 01:57:51
Document(s) archived on: Thursday, June 10, 2010 - 19:41:09

Files

mlmi2008.pdf
Files produced by the author(s)

Citation

Vasil Khalidov, Florence Forbes, Miles Hansard, Elise Arnaud, Radu Horaud. Audio-Visual Clustering for Multiple Speaker Localization. Andrei Popescu-Belis and Rainer Stiefelhagen. MLMI 2008 - International Workshop on Machine Learning for Multimodal Interaction, Sep 2008, Utrecht, Netherlands. Springer, 5237, pp.86-97, 2008, Lecture Notes in Computer Science. 〈10.1007/978-3-540-85853-9_8〉. 〈inria-00373154〉
