Integer Linear Programming for Speaker Diarization and Cross-Modal Identification in TV Broadcast

Hervé Bredin 1 Johann Poignant 2
2 MRIM - Modélisation et Recherche d’Information Multimédia [Grenoble]
LIG - Laboratoire d'Informatique de Grenoble, Inria - Institut National de Recherche en Informatique et en Automatique
Abstract: Most state-of-the-art approaches address speaker diarization as a hierarchical agglomerative clustering problem in the audio domain. In this paper, we propose to revisit one of them: speech turns clustering based on the Bayesian Information Criterion (a.k.a. BIC clustering). First, we show how to model it as an integer linear programming (ILP) problem. Its resolution leads to the same overall diarization error rate as standard BIC clustering but generates significantly purer speaker clusters. Then, we describe how this approach can easily be extended to the audiovisual domain and TV broadcast in particular. The straightforward integration of detected overlaid names (used to introduce guests or journalists, and obtained via video OCR) into a multimodal ILP problem yields significantly better speaker diarization results. Finally, we explain how this novel paradigm can incidentally be used for unsupervised speaker identification (i.e. not relying on any prior acoustic speaker models). Experiments on the REPERE TV broadcast corpus show that it achieves performance close to that of an oracle capable of identifying any speaker as long as their name appears on screen at least once in the video.
Document type:
Conference paper
14th Annual Conference of the International Speech Communication Association, INTERSPEECH, 2013, Lyon, France. 2013

https://hal.inria.fr/hal-00953095
Contributor: Marie-Christine Fauvet
Submitted on: Monday, March 3, 2014 - 15:50:29
Last modified on: Thursday, January 25, 2018 - 15:12:02
Document(s) archived on: Saturday, May 31, 2014 - 10:46:07

File

BREDIN--INTERSPEECH--2013.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00953095, version 1

Citation

Hervé Bredin, Johann Poignant. Integer Linear Programming for Speaker Diarization and Cross-Modal Identification in TV Broadcast. 14th Annual Conference of the International Speech Communication Association, INTERSPEECH, 2013, Lyon, France. 2013. 〈hal-00953095〉
