Journal article, Language Resources and Evaluation, Year: 2020

DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation

Abstract

We present a new English-French dataset for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue. The test set contains 144 spontaneous dialogues (5,700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality, produced by the dialogue participants themselves, as well as by manually normalised versions and reference translations produced a posteriori. The motivation for the corpus is twofold: to provide (i) a unique resource for evaluating MT models, and (ii) a corpus for the analysis of MT-mediated communication. We provide an initial analysis of the corpus to confirm that the participants' judgments reveal perceptible differences in MT quality between the two MT systems used.
Main file
diabla-lre-personal-formatting.pdf (1.11 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03021633, version 1 (24-11-2020)

Cite

Rachel Bawden, Eric Bilinski, Thomas Lavergne, Sophie Rosset. DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation. Language Resources and Evaluation, 2020, ⟨10.1007/s10579-020-09514-4⟩. ⟨hal-03021633⟩