Building Parallel Corpora from Movies

Caroline Lavecchia (1), Kamel Smaïli (1), David Langlois (1)
(1) PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: This paper proposes using dynamic time warping (DTW) to construct parallel corpora from difficult data. Parallel corpora are the raw material of machine translation (MT), and MT systems frequently use European or Canadian parliamentary corpora. To build a more realistic machine translation system, we decided to use movie subtitles. These data can be considered difficult because they contain unfamiliar expressions, abbreviations, hesitations, words that do not appear in standard dictionaries (such as vulgar words), etc. The resulting parallel corpora can constitute a rich resource for training a spontaneous speech translation system. From 40 movies, we align 43013 English subtitles with 42306 French subtitles. This yields 37625 aligned pairs with a precision of 92.3%.
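To illustrate the general idea of DTW-based alignment, here is a minimal Python sketch. The abstract does not specify how subtitle pairs are scored, so everything about the cost is an assumption made for illustration: `token_cost`, the toy `lexicon`, and the `max_cost` threshold are hypothetical and are not the authors' method. The sketch fills the standard DTW cumulative-cost table, backtracks the cheapest monotone path, and keeps only the path cells whose local cost is low enough to count as an aligned pair.

```python
def token_cost(en, fr, lexicon):
    """Hypothetical local cost in [0, 1]: 0 means full lexical overlap."""
    fr_tokens = set(fr.lower().split())
    # Translate the English tokens found in the (assumed) bilingual lexicon.
    translated = {lexicon[t] for t in en.lower().split() if t in lexicon}
    if not translated or not fr_tokens:
        return 1.0
    overlap = len(translated & fr_tokens)
    return 1.0 - overlap / max(len(translated), len(fr_tokens))


def dtw_align(en_subs, fr_subs, lexicon, max_cost=0.5):
    """Return (english_index, french_index) pairs found on the DTW path."""
    n, m = len(en_subs), len(fr_subs)
    if n == 0 or m == 0:
        return []
    inf = float("inf")
    local = [[token_cost(e, f, lexicon) for f in fr_subs] for e in en_subs]

    # Cumulative cost table with an extra border row and column.
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = local[i - 1][j - 1] + min(
                D[i - 1][j - 1], D[i - 1][j], D[i][j - 1]
            )

    # Backtrack from the end of both sequences to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        )
    path.reverse()

    # Keep only path cells cheap enough to be treated as aligned pairs
    # (the threshold is an assumption of this sketch).
    return [(a, b) for a, b in path if local[a][b] <= max_cost]
```

Calling `dtw_align(english_subtitles, french_subtitles, lexicon)` on two subtitle lists returns index pairs. Subtitles with no good counterpart still lie on the warping path but are filtered out by the threshold, which is one simple way to account for the two sides having different numbers of subtitles.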
Document type: Conference paper

Cited literature: 8 references
Contributor: Caroline Lavecchia
Submitted on: Tuesday, June 19, 2007 - 11:24:26 AM
Last modification on: Friday, February 26, 2021 - 3:28:06 PM
Long-term archiving on: Thursday, April 8, 2010 - 8:46:35 PM




  • HAL Id: inria-00155787, version 1



Caroline Lavecchia, Kamel Smaïli, David Langlois. Building Parallel Corpora from Movies. The 4th International Workshop on Natural Language Processing and Cognitive Science - NLPCS 2007, Jun 2007, Funchal, Madeira, Portugal. ⟨inria-00155787⟩


