DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation

Rachel Bawden; Eric Bilinski; Thomas Lavergne; Sophie Rosset

doi:10.1007/s10579-020-09514-4

Journal article, Language Resources and Evaluation, Year: 2020

Rachel Bawden

Role: Author
PersonId : 9441
IdHAL : rachel-bawden
ORCID : 0000-0001-9553-1768
IdRef : 233174591

School of Informatics [Edinburgh]

Eric Bilinski

Role: Author
PersonId : 1034097

Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

Thomas Lavergne

Role: Author
PersonId : 176452
IdHAL : thomas-lavergne
ORCID : 0000-0002-0029-0015
IdRef : 139185127

Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

University of Paris Sud

Sophie Rosset

Role: Author
PersonId : 14913
IdHAL : sophie-rosset
ORCID : 0000-0002-6865-4989
IdRef : 137157835

Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

Abstract

We present a new English-French dataset for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue. The test set contains 144 spontaneous dialogues (5,700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality, produced by the dialogue participants themselves, as well as by manually normalised versions and reference translations produced a posteriori. The motivation for the corpus is twofold: to provide (i) a unique resource for evaluating MT models, and (ii) a corpus for the analysis of MT-mediated communication. We provide an initial analysis of the corpus to confirm that the participants' judgments reveal perceptible differences in MT quality between the two MT systems used.

Keywords

Machine translation; Dialogue; Context; Evaluation; Dataset; Corpus; English; French; Bilingual conversation

Domains

Computation and Language [cs.CL]

Main file

diabla-lre-personal-formatting.pdf (1.11 MB)

Origin: Files produced by the author(s)

Contributor: Rachel Bawden

https://inria.hal.science/hal-03021633

Submitted on: Tuesday, 24 November 2020, 13:50:26

Last modified on: Thursday, 7 March 2024, 12:32:05

Long-term archiving on: Thursday, 25 February 2021, 19:53:15

Dates and versions

hal-03021633, version 1 (24-11-2020)

Identifiers

HAL Id: hal-03021633, version 1
DOI: 10.1007/s10579-020-09514-4

Cite

Rachel Bawden, Eric Bilinski, Thomas Lavergne, Sophie Rosset. DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation. Language Resources and Evaluation, 2020, ⟨10.1007/s10579-020-09514-4⟩. ⟨hal-03021633⟩

Collections

CNRS LIMSI UNIV-PARIS-SACLAY LISN GS-ENGINEERING GS-COMPUTER-SCIENCE LISN-ASARD
