Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion

Israel Gebru (1), Sileye Ba (1), Xiaofei Li (1), Radu Horaud (1)
(1) PERCEPTION - Interpretation and Modelling of Images and Videos, Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, INPG - Institut National Polytechnique de Grenoble
Abstract: Speaker diarization consists of assigning speech signals to the people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios in which several participants engage in multi-party interaction while moving around and turning their heads towards the other participants rather than facing the cameras and the microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within a novel audio-visual fusion method on the following grounds: binaural spectral features are first extracted from a microphone pair, then a supervised audio-visual alignment technique maps these features onto an image, and finally a semi-supervised clustering method assigns binaural spectral features to visible persons. The main advantage of this method over previous work is that it handles, in a principled way, speech signals uttered simultaneously by multiple persons. The diarization itself is cast into a latent-variable temporal graphical model that infers speaker identities and speech turns, based on the output of an audio-visual association process executed at each time slice and on the dynamics of the diarization variable itself. The proposed formulation yields an efficient exact inference procedure. A novel dataset is introduced that contains audio-visual training data as well as a number of scenarios involving several participants engaged in formal and informal dialogue. The proposed method is thoroughly tested and benchmarked against several state-of-the-art diarization algorithms.
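The abstract states that diarization is cast into a latent-variable temporal graphical model with exact inference, driven by a per-frame audio-visual association process. The paper's equations are not reproduced on this page; the following sketch is only an illustration of that general scheme, not the authors' model: the assoc_probs array (hypothetical per-frame association probabilities over the visible persons, standing in for the output of the audio-visual fusion step) and the stay_prob self-transition dynamics are assumptions introduced here for the example.

import numpy as np

def diarize(assoc_probs, stay_prob=0.9):
    """Exact MAP (Viterbi) inference over a discrete diarization chain.

    assoc_probs : (T, N) array; assoc_probs[t, n] is a hypothetical
                  probability that person n is the active speaker at
                  frame t (stand-in for the audio-visual association
                  step described in the abstract).
    stay_prob   : illustrative probability that the current speaker
                  keeps the turn between consecutive frames.
    Returns the most likely speaker index for each of the T frames.
    """
    T, N = assoc_probs.shape
    if N == 1:
        return np.zeros(T, dtype=int)
    # First-order dynamics: keep the turn with probability stay_prob,
    # otherwise switch uniformly to one of the other N-1 persons.
    trans = np.full((N, N), (1.0 - stay_prob) / (N - 1))
    np.fill_diagonal(trans, stay_prob)

    log_emit = np.log(assoc_probs + 1e-12)
    log_trans = np.log(trans)

    # Max-product forward pass with back-pointers.
    delta = np.empty((T, N))
    back = np.zeros((T, N), dtype=int)
    delta[0] = np.log(1.0 / N) + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (prev, current)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]

    # Backtrack the most likely speaker sequence.
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

For instance, diarize(np.array([[0.8, 0.2], [0.7, 0.3], [0.05, 0.95]])) returns [0, 0, 1]: person 0 keeps the turn for two frames before the turn switches to person 1, the dynamics smoothing over the weakly ambiguous second frame.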
Document type: Journal article
IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers, 2017, 39, 14 pages. DOI: 10.1109/TPAMI.2017.2648793

https://hal.inria.fr/hal-01413403
Contributor: Team Perception
Submitted on: Tuesday, January 3, 2017 - 14:40:08
Last modified on: Thursday, January 11, 2018 - 06:22:00
Long-term archiving on: Tuesday, April 4, 2017 - 13:41:16

Files

Gebru-TPAMI2017-final.pdf
Files produced by the author(s)

Identifiers

HAL Id: hal-01413403
DOI: 10.1109/TPAMI.2017.2648793

Citation

Israel Gebru, Sileye Ba, Xiaofei Li, Radu Horaud. Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers, 2017, 39, 14 pages. DOI: 10.1109/TPAMI.2017.2648793. HAL Id: hal-01413403.
