Building Parallel Corpora from Movies

Caroline Lavecchia (1), Kamel Smaïli (1), David Langlois (1)
(1) PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: This paper proposes using dynamic time warping (DTW) to construct parallel corpora from difficult data. Parallel corpora are the raw material of machine translation (MT), and MT systems frequently rely on European or Canadian parliament corpora. To build a more realistic machine translation system, we decided to use movie subtitles. These data can be considered difficult because they contain unfamiliar expressions, abbreviations, hesitations, and words that do not appear in classical dictionaries (such as vulgar words). The resulting parallel corpora can constitute a rich resource for training spontaneous speech translation systems. From 40 movies, we align 43,013 English subtitles with 42,306 French subtitles. This yields 37,625 aligned pairs with a precision of 92.3%.
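To make the alignment idea in the abstract concrete, here is a minimal sketch of a DTW-style dynamic program that pairs English and French subtitles. The abstract does not specify the scoring function or the allowed moves, so everything here is an assumption: `dtw_align`, `lexicon_score`, and the tiny `LEXICON` are hypothetical names invented for illustration, and the variant shown lets a subtitle stay unmatched (which is one way the number of aligned pairs can be smaller than either subtitle count).

```python
from typing import Callable, List, Tuple

# Hypothetical toy bilingual lexicon, used only for this illustration.
LEXICON = {"hello": "bonjour", "thanks": "merci", "friend": "ami"}


def lexicon_score(en: str, fr: str) -> float:
    """Assumed similarity: count English words whose lexicon translation occurs in the French line."""
    fr_words = set(fr.lower().split())
    return float(sum(1 for w in en.lower().split() if LEXICON.get(w) in fr_words))


def dtw_align(
    src: List[str], tgt: List[str], score: Callable[[str, str], float]
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) on the highest-scoring monotone path through the score grid."""
    n, m = len(src), len(tgt)
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:  # diagonal move: pair src[i-1] with tgt[j-1]
                candidates.append((best[i - 1][j - 1] + score(src[i - 1], tgt[j - 1]), (i - 1, j - 1)))
            if i > 0:  # vertical move: leave src[i-1] unmatched
                candidates.append((best[i - 1][j], (i - 1, j)))
            if j > 0:  # horizontal move: leave tgt[j-1] unmatched
                candidates.append((best[i][j - 1], (i, j - 1)))
            best[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Backtrack from the end, keeping only diagonal moves with a positive score.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1 and score(src[i - 1], tgt[j - 1]) > 0:
            pairs.append((i - 1, j - 1))
        i, j = pi, pj
    pairs.reverse()
    return pairs


english = ["hello friend", "uh", "thanks"]
french = ["bonjour ami", "merci"]
aligned = dtw_align(english, french, lexicon_score)
```

In this toy input the hesitation "uh" has no French counterpart and is skipped, while the other two lines are paired in order. A real system would need a far richer similarity measure than word lookups in a three-entry lexicon.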
Document type:
Conference paper
The 4th International Workshop on Natural Language Processing and Cognitive Science - NLPCS 2007, Jun 2007, Funchal, Madeira, Portugal. 2007

https://hal.inria.fr/inria-00155787
Contributor: Caroline Lavecchia
Submitted on: Tuesday, 19 June 2007 - 11:24:26
Last modified on: Thursday, 11 January 2018 - 06:19:56
Document(s) archived on: Thursday, 8 April 2010 - 20:46:35

File

nlpcs07.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00155787, version 1

Citation

Caroline Lavecchia, Kamel Smaïli, David Langlois. Building Parallel Corpora from Movies. The 4th International Workshop on Natural Language Processing and Cognitive Science - NLPCS 2007, Jun 2007, Funchal, Madeira, Portugal. 2007. 〈inria-00155787〉

Metrics

Record views

311

File downloads

275