Building Parallel Corpora from Movies

Caroline Lavecchia (1), Kamel Smaïli (1), David Langlois (1)
(1) PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract : This paper proposes using dynamic time warping (DTW) to construct parallel corpora from difficult data. Parallel corpora are the raw material of machine translation (MT); MT systems frequently use European or Canadian parliament corpora. To build a more realistic machine translation system, we decided to use movie subtitles. These data can be considered difficult because they contain unfamiliar expressions, abbreviations, hesitations, words that do not appear in classical dictionaries (such as vulgar words), etc. The resulting parallel corpora can constitute a rich resource for training spontaneous speech translation systems. From 40 movies, we align 43,013 English subtitles with 42,306 French subtitles. This yields 37,625 aligned pairs with a precision of 92.3%.
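As a rough illustration of the idea in the abstract, the following is a minimal sketch of DTW applied to two subtitle sequences. It is not the authors' implementation: the abstract does not specify the matching score, so the cost function `toy_cost` and the three-entry `TOY_LEXICON` are hypothetical stand-ins, as are the function name `dtw_align` and the sample subtitles.

```python
from typing import Callable, List, Tuple

# Hypothetical toy bilingual lexicon, for illustration only.
TOY_LEXICON = {"hello": "bonjour", "thanks": "merci", "yes": "oui"}


def toy_cost(en: str, fr: str) -> float:
    """Assumed cost: 0 if any English word has its lexicon translation
    in the French subtitle, 1 otherwise."""
    fr_words = set(fr.lower().split())
    hits = sum(1 for w in en.lower().split() if TOY_LEXICON.get(w) in fr_words)
    return 0.0 if hits else 1.0


def dtw_align(
    src: List[str],
    tgt: List[str],
    cost: Callable[[str, str], float],
) -> List[Tuple[int, int]]:
    """Return a minimum-cost monotonic warping path as (src, tgt) index pairs."""
    n, m = len(src), len(tgt)
    if n == 0 or m == 0:
        return []
    inf = float("inf")
    # acc[i][j] holds the minimal cumulative cost of a path ending at
    # src[i - 1], tgt[j - 1]; row 0 and column 0 are padding.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost(src[i - 1], tgt[j - 1]) + best_prev
    # Backtrack from the last cell, preferring the diagonal move on ties.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return path


english = ["hello", "thanks", "yes"]
french = ["bonjour", "euh", "merci", "oui"]
path = dtw_align(english, french, toy_cost)
# Keep only the path cells whose local cost is zero as candidate aligned pairs.
pairs = [(english[i], french[j]) for i, j in path if toy_cost(english[i], french[j]) == 0.0]
```

The warping path may link one subtitle to several subtitles on the other side, which is how unequal subtitle counts (such as the 43,013 English and 42,306 French subtitles above) can be absorbed; filtering the path by local cost is one simple, assumed way of keeping only plausible pairs.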
Document type :
Conference papers

https://hal.inria.fr/inria-00155787
Contributor : Caroline Lavecchia
Submitted on : Tuesday, June 19, 2007 - 11:24:26 AM
Last modification on : Thursday, January 11, 2018 - 6:19:56 AM
Long-term archiving on: Thursday, April 8, 2010 - 8:46:35 PM

File

nlpcs07.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00155787, version 1

Citation

Caroline Lavecchia, Kamel Smaïli, David Langlois. Building Parallel Corpora from Movies. The 4th International Workshop on Natural Language Processing and Cognitive Science - NLPCS 2007, Jun 2007, Funchal, Madeira, Portugal. ⟨inria-00155787⟩
