Audio-Visual Speech-Turn Detection and Tracking

Israel Dejene Gebru 1, Silèye Ba 1, Georgios Evangelidis 1, Radu Horaud 1
1 PERCEPTION - Interpretation and Modelling of Images and Videos
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, INPG - Institut National Polytechnique de Grenoble
Abstract: Speaker diarization is an important component of multi-party dialog systems in order to assign speech-signal segments among participants. Diarization may well be viewed as the problem of detecting and tracking speech turns. It is proposed to address this problem by modeling the spatial coincidence of visual and auditory observations and by combining this coincidence model with a dynamic Bayesian formulation that tracks the identity of the active speaker. Speech-turn tracking is formulated as a latent-variable temporal graphical model and an exact inference algorithm is proposed. We describe in detail an audiovisual discriminative observation model as well as a state-transition model. We also describe an implementation of a full system composed of multi-person visual tracking, sound-source localization and the proposed online diarization technique. Finally, we show that the proposed method yields promising results with two challenging scenarios that were carefully recorded and annotated.
Document type:
Conference paper
Emmanuel Vincent; Arie Yeredor; Zbyněk Koldovský; Petr Tichavský. 12th International Conference on Latent Variable Analysis and Signal Separation, LVA/ICA 2015, Aug 2015, Liberec, Czech Republic. Springer, 9237, pp. 143-151, 2015, Lecture Notes in Computer Science. 〈http://link.springer.com/chapter/10.1007/978-3-319-22482-4_17〉. 〈10.1007/978-3-319-22482-4_17〉
Cited literature [9 references]

https://hal.inria.fr/hal-01163659
Contributor: Team Perception
Submitted on: Monday, June 15, 2015 - 11:39:10
Last modified on: Wednesday, April 11, 2018 - 01:58:54
Document(s) archived on: Tuesday, April 25, 2017 - 07:58:18

File

gebru_lva2015_cameraready.pdf
Files produced by the author(s)

Citation

Israel Dejene Gebru, Silèye Ba, Georgios Evangelidis, Radu Horaud. Audio-Visual Speech-Turn Detection and Tracking. Emmanuel Vincent; Arie Yeredor; Zbyněk Koldovský; Petr Tichavský. 12th International Conference on Latent Variable Analysis and Signal Separation, LVA/ICA 2015, Aug 2015, Liberec, Czech Republic. Springer, 9237, pp. 143-151, 2015, Lecture Notes in Computer Science. 〈http://link.springer.com/chapter/10.1007/978-3-319-22482-4_17〉. 〈10.1007/978-3-319-22482-4_17〉. 〈hal-01163659〉

Metrics

Record views: 471

File downloads: 317