RoCS-MT: Robustness Challenge Set for Machine Translation - Inria - Institut national de recherche en sciences et technologies du numérique
Conference paper. Year: 2023

RoCS-MT: Robustness Challenge Set for Machine Translation

Abstract

RoCS-MT, a Robustness Challenge Set for Machine Translation (MT), is designed to test MT systems' ability to translate user-generated content (UGC) that displays non-standard characteristics, such as spelling errors, devowelling and acronymisation. RoCS-MT is composed of English comments from Reddit, selected for their non-standard nature, which have been manually normalised and professionally translated into five languages: French, German, Czech, Ukrainian and Russian. In the context of the WMT23 test suite shared task, we analyse the models submitted to the general MT task for all from-English language pairs, offering some insights into the types of problems faced by state-of-the-art MT models when dealing with non-standard UGC texts. We compare automatic metrics for MT quality, including quality estimation, to see if the same conclusions can be drawn without references. In terms of robustness, we find that many of the systems struggle with non-standard variants of words (e.g. due to phonetically inspired spellings, contractions, truncations, etc.), but that this depends on the system and the amount of training data, with the best overall systems performing better across all phenomena. GPT4 is the clear frontrunner. However, we caution against drawing conclusions about generalisation capacity, as it and other systems could have been trained on the source side of RoCS and also on similar data.
Main file
Article_RoCS_MT-7.pdf (248.74 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04300824, version 1 (22-11-2023)

Licence

Attribution

Identifiers

  • HAL Id: hal-04300824, version 1

Cite

Rachel Bawden, Benoît Sagot. RoCS-MT: Robustness Challenge Set for Machine Translation. WMT23 - Eighth Conference on Machine Translation, Dec 2023, Singapore, Singapore. pp. 198-216. ⟨hal-04300824⟩
