Building a bilingual dictionary from movie subtitles based on inter-lingual triggers

Caroline Lavecchia (1), Kamel Smaïli (1), David Langlois (1)
(1) PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: This paper focuses on two aspects of Machine Translation: parallel corpora and translation models. First, we present a method to automatically build parallel corpora from subtitle files gathered from the Internet, which yields useful data for Subtitling Machine Translation. Our method is based on Dynamic Time Warping. We evaluated this alignment method by comparing its output with a hand-aligned sample and obtained an alignment precision of $0.92$. Second, we use the notion of inter-lingual triggers to build multilingual dictionaries and translation tables for machine translation from the subtitle parallel corpora. Inter-lingual triggers make it possible to detect pairs of source and target words in parallel corpora. The Mutual Information measure used to determine inter-lingual triggers allows us to hypothesize that a word in the source language is a translation of a word in the target language. We evaluated the resulting dictionary by comparing it with two existing dictionaries. We then integrated the resulting translation tables into a complete translation decoding process provided by Pharaoh, and compared the translation performance obtained with our translation tables against the performance obtained with the Giza++ tool. The results show that the system tuned for our tables improves the BLEU score by 2.2% compared to the one obtained with Giza++.
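To illustrate the idea of inter-lingual triggers described in the abstract, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formula, so the sketch assumes that a source word and a target word "co-occur" when they appear in the same aligned subtitle pair, and it scores each word pair with the term P(a,b) * log(P(a,b) / (P(a) * P(b))). The function name interlingual_triggers, the parameter top_n, and the toy English-French corpus are all hypothetical.

```python
import math
from collections import Counter
from itertools import product


def interlingual_triggers(pairs, top_n=1):
    """Return, for each source word, its top_n best-scoring target words.

    pairs: list of (source_tokens, target_tokens) aligned sentence pairs.
    Scoring (an assumption, not taken from the paper):
        score(a, b) = P(a, b) * log(P(a, b) / (P(a) * P(b)))
    where probabilities are relative frequencies over aligned pairs.
    """
    n = len(pairs)
    src_count, tgt_count, joint_count = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        # Count each word at most once per aligned pair.
        s, t = set(src), set(tgt)
        src_count.update(s)
        tgt_count.update(t)
        joint_count.update(product(s, t))

    candidates = {}
    for (a, b), c in joint_count.items():
        p_ab = c / n
        p_a = src_count[a] / n
        p_b = tgt_count[b] / n
        score = p_ab * math.log(p_ab / (p_a * p_b))
        candidates.setdefault(a, []).append((score, b))

    # Highest score first; ties broken alphabetically for determinism.
    return {
        a: [b for _, b in sorted(c, key=lambda x: (-x[0], x[1]))[:top_n]]
        for a, c in candidates.items()
    }


# Toy aligned corpus (hypothetical data, for illustration only).
toy_pairs = [
    (["the", "cat"], ["le", "chat"]),
    (["the", "dog"], ["le", "chien"]),
    (["a", "cat"], ["un", "chat"]),
    (["a", "dog"], ["un", "chien"]),
]
triggers = interlingual_triggers(toy_pairs)
print(triggers)
```

On this toy corpus, each source word is paired with the target word it consistently co-occurs with (for example, "cat" with "chat"). A real dictionary would need far more data and some frequency filtering, since rare words can receive misleadingly high scores.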
Document type:
Conference paper
Translating and the Computer, Nov 2007, London, United Kingdom. 2007

https://hal.inria.fr/inria-00184421
Contributor: Caroline Lavecchia
Submitted on: Wednesday, 31 October 2007 - 09:37:19
Last modified on: Thursday, 11 January 2018 - 06:19:56
Document(s) archived on: Monday, 12 April 2010 - 01:03:11

File

aslib07.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00184421, version 1

Citation

Caroline Lavecchia, Kamel Smaïli, David Langlois. Building a bilingual dictionary from movie subtitles based on inter-lingual triggers. Translating and the Computer, Nov 2007, London, United Kingdom. 2007. 〈inria-00184421〉
