Conference paper, Year: 2007

Building Parallel Corpora from Movies

Abstract

This paper proposes using dynamic time warping (DTW) to construct parallel corpora from difficult data. Parallel corpora are the raw material for machine translation (MT), and MT systems frequently rely on European or Canadian parliament corpora. To build a more realistic machine translation system, we decided to use movie subtitles. These data can be considered difficult because they contain unfamiliar expressions, abbreviations, hesitations, words that do not appear in classical dictionaries (such as vulgar words), etc. The resulting parallel corpora can constitute a rich resource for training spontaneous speech translation systems. From 40 movies, we align 43013 English subtitles with 42306 French subtitles. This yields 37625 aligned pairs with a precision of 92.3%.
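As an illustration of the general idea only, the following is a minimal sketch of classic DTW applied to two subtitle sequences. It is not the authors' implementation: the cost function `lexical_cost`, the toy `LEXICON`, and the sample subtitles are all hypothetical and stand in for whatever matching score the paper actually uses.

```python
import math

# Hypothetical toy bilingual lexicon (an assumption, not taken from the paper).
LEXICON = {"hello": "bonjour", "how": "comment", "you": "tu", "goodbye": "revoir"}


def lexical_cost(src, tgt):
    """Cost in [0, 1]: 1 minus the share of source words whose lexicon
    translation occurs in the target subtitle (assumed scoring scheme)."""
    src_words = src.lower().split()
    tgt_words = set(tgt.lower().split())
    hits = sum(1 for w in src_words if LEXICON.get(w) in tgt_words)
    return 1.0 - hits / max(1, len(src_words))


def dtw_path(src, tgt, cost):
    """Return the minimum-cost DTW warping path as 0-based (i, j) index pairs."""
    n, m = len(src), len(tgt)
    if n == 0 or m == 0:
        return []
    # acc[i][j]: cheapest cumulative cost of warping src[:i] onto tgt[:j].
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Diagonal move listed first so that ties prefer a one-to-one step.
            prev = min(
                [(acc[i - 1][j - 1], (i - 1, j - 1)),
                 (acc[i - 1][j], (i - 1, j)),
                 (acc[i][j - 1], (i, j - 1))],
                key=lambda c: c[0],
            )
            acc[i][j] = cost(src[i - 1], tgt[j - 1]) + prev[0]
            back[i][j] = prev[1]
    # Follow the back-pointers from the last cell to the origin.
    path, (i, j) = [], (n, m)
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        i, j = back[i][j]
    path.reverse()
    return path


subs_en = ["hello", "how are you", "goodbye"]
subs_fr = ["bonjour", "comment vas tu", "au revoir"]
print(dtw_path(subs_en, subs_fr, lexical_cost))  # [(0, 0), (1, 1), (2, 2)]
```

In this sketch every subtitle ends up on the warping path, so a practical pipeline would presumably keep only the low-cost pairs; how the paper selects its final aligned pairs is not stated in this abstract.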
Main file
nlpcs07.pdf (295.96 KB)
Origin: Files produced by the author(s)

Dates and versions

inria-00155787 , version 1 (19-06-2007)

Identifiers

  • HAL Id : inria-00155787 , version 1

Cite

Caroline Lavecchia, Kamel Smaïli, David Langlois. Building Parallel Corpora from Movies. The 4th International Workshop on Natural Language Processing and Cognitive Science - NLPCS 2007, Jun 2007, Funchal, Madeira, Portugal. ⟨inria-00155787⟩
