Inria - Institut national de recherche en sciences et technologies du numérique
Conference paper, 2008

Audio-Visual Clustering for Multiple Speaker Localization

Abstract

We address the issue of identifying and localizing individuals in a scene that contains several people engaged in conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the identification and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data to a representation of the common 3D scene-space, via a pair of Gaussian mixture models. Inference is performed by a version of the Expectation Maximization algorithm, which provides cooperative estimates of both the activity and the 3D position of each speaker.
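The paper's model ties per-speaker audio and visual mixtures together through common 3D speaker positions. As a loose illustration of the underlying EM clustering idea only (not the authors' actual coupled model), the sketch below fits a plain spherical Gaussian mixture by EM over concatenated audio-visual feature vectors; the feature layout, initialization, and all names are assumptions for the toy example.

```python
import numpy as np

def em_gmm(X, K, n_iter=50):
    """Fit a K-component spherical Gaussian mixture to X by EM."""
    n, d = X.shape
    # Simple deterministic init: spread initial means across the dataset.
    mu = X[:: max(1, n // K)][:K].copy()
    var = np.full(K, X.var())       # one spherical variance per component
    pi = np.full(K, 1.0 / K)        # mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] ∝ pi_k * N(x_i | mu_k, var_k I)
        sq = ((X[:, None, :] - mu[None]) ** 2).sum(-1)          # (n, K)
        log_r = np.log(pi) - 0.5 * (d * np.log(2 * np.pi * var) + sq / var)
        log_r -= log_r.max(1, keepdims=True)                    # stability
        r = np.exp(log_r)
        r /= r.sum(1, keepdims=True)
        # M-step: re-estimate weights, means, variances from responsibilities
        nk = r.sum(0)
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        var = (r * sq).sum(0) / (d * nk)
    return pi, mu, var, r

# Two synthetic "speakers": each sample concatenates a 1-D auditory cue
# (e.g. an interaural-difference-like value) and a 2-D visual cue.
rng = np.random.default_rng(1)
s1 = rng.normal([-2.0, -2.0, -2.0], 0.3, (200, 3))
s2 = rng.normal([+2.0, +2.0, +2.0], 0.3, (200, 3))
X = np.vstack([s1, s2])
pi, mu, var, r = em_gmm(X, K=2)
```

The responsibilities `r` play the role of the "activity" estimates (which speaker generated each observation), while the means `mu` stand in for the localization estimates; in the paper these live in a common 3D scene-space rather than a raw feature space.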
Main file: mlmi2008.pdf (363.23 KB)
Image: mlmi08.jpg (84.46 KB)
Origin: files produced by the author(s)

Dates and versions

inria-00373154 , version 1 (03-04-2009)

Identifiers

Cite

Vasil Khalidov, Florence Forbes, Miles Hansard, Élise Arnaud, Radu Horaud. Audio-Visual Clustering for Multiple Speaker Localization. MLMI 2008 - International Workshop on Machine Learning for Multimodal Interaction, Sep 2008, Utrecht, Netherlands. pp.86-97, ⟨10.1007/978-3-540-85853-9_8⟩. ⟨inria-00373154⟩
296 views
190 downloads
