Conference paper, 2008

Audio-Visual Clustering for Multiple Speaker Localization

Vasil Khalidov (1), Florence Forbes (1), Miles Hansard (2), Élise Arnaud (2), Radu Horaud (2)

Abstract

We address the issue of identifying and localizing individuals in a scene that contains several people engaged in conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the identification and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data to a representation of the common 3D scene-space, via a pair of Gaussian mixture models. Inference is performed by a version of the Expectation Maximization algorithm, which provides cooperative estimates of both the activity and the 3D position of each speaker.
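To give a flavour of the EM-based clustering the abstract describes, here is a minimal sketch of Gaussian-mixture EM over 3D observations. This is an illustration only, not the authors' model: the paper ties two mixtures (auditory and visual) to shared 3D speaker positions, whereas the sketch below runs a single standard isotropic GMM on already-fused 3D points, with a simple farthest-point initialization; `em_gmm` and its parameters are hypothetical names.

```python
import numpy as np

def em_gmm(points, n_clusters, n_iter=50):
    """Minimal EM for an isotropic Gaussian mixture over 3D points.

    Illustrative sketch only -- the paper's model couples auditory and
    visual mixtures through shared 3D speaker positions; this is a
    plain single-modality GMM.
    """
    n, d = points.shape
    # Farthest-point initialization of the means (deterministic).
    idx = [0]
    for _ in range(n_clusters - 1):
        dists = np.min(((points[:, None] - points[idx]) ** 2).sum(-1), axis=1)
        idx.append(int(dists.argmax()))
    means = points[idx].copy()
    variances = np.ones(n_clusters)
    weights = np.full(n_clusters, 1.0 / n_clusters)

    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] proportional to w_k N(x_i; mu_k, v_k I).
        sq = ((points[:, None, :] - means[None]) ** 2).sum(-1)
        log_r = (np.log(weights)
                 - 0.5 * d * np.log(2 * np.pi * variances)
                 - 0.5 * sq / variances)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and per-component variances.
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r.T @ points) / nk[:, None]
        sq = ((points[:, None, :] - means[None]) ** 2).sum(-1)
        variances = (r * sq).sum(axis=0) / (d * nk)
    return means, r.argmax(axis=1)
```

On well-separated synthetic "speakers" the E-step responsibilities harden quickly and the recovered means approach the true 3D positions; the paper's contribution is to obtain such estimates cooperatively from two modalities rather than from pre-fused points.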
Main file: mlmi2008.pdf (363.23 KB). Figure: mlmi08.jpg (84.46 KB). Origin: files produced by the author(s).

Dates and versions

inria-00373154, version 1 (03-04-2009)

Cite

Vasil Khalidov, Florence Forbes, Miles Hansard, Élise Arnaud, Radu Horaud. Audio-Visual Clustering for Multiple Speaker Localization. MLMI 2008 - International Workshop on Machine Learning for Multimodal Interaction, Sep 2008, Utrecht, Netherlands. pp.86-97, ⟨10.1007/978-3-540-85853-9_8⟩. ⟨inria-00373154⟩