
DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation

Abstract: We present a new English-French dataset for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue. The test set contains 144 spontaneous dialogues (5,700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality, produced by the dialogue participants themselves, as well as by manually normalised versions and reference translations produced a posteriori. The motivation for the corpus is twofold: to provide (i) a unique resource for evaluating MT models, and (ii) a corpus for the analysis of MT-mediated communication. We provide an initial analysis of the corpus to confirm that the participants' judgments reveal perceptible differences in MT quality between the two MT systems used.
Document type: Journal articles

https://hal.inria.fr/hal-03021633
Contributor: Rachel Bawden
Submitted on: Tuesday, November 24, 2020 - 1:50:26 PM
Last modification on: Thursday, November 26, 2020 - 3:31:36 AM

File

diabla-lre-personal-formatting...
Files produced by the author(s)

Citation

Rachel Bawden, Eric Bilinski, Thomas Lavergne, Sophie Rosset. DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation. Language Resources and Evaluation, Springer Verlag, 2020, ⟨10.1007/s10579-020-09514-4⟩. ⟨hal-03021633⟩
