
Automatic Selection of Parallel Data for Machine Translation

Abstract : Machine translation is now widely used, but the data required for training, tuning and testing a machine translation engine are often insufficient or unsuitable. Automatically selecting data of appropriate quality for building translation models can help improve translation accuracy. In this paper, we use a large parallel corpus of educational video lecture subtitles, together with text posted by students and lecturers on the course fora. The text is challenging to translate because of the scientific domains involved and its informal genre. We apply a random forest classification scheme to the output of three machine translation models (one based on statistical machine translation and two on neural machine translation) in order to automatically identify the best output. The unorthodox language phenomena observed, the terminology-rich scientific domains addressed in the educational video lectures, the language-independent nature of the approach, and the three-class classification problem tackled constitute the novel challenges of the work described here.
Document type :
Conference papers


https://hal.inria.fr/hal-01821299
Contributor : Hal Ifip
Submitted on : Friday, June 22, 2018 - 2:12:51 PM
Last modification on : Friday, June 22, 2018 - 2:24:16 PM
Long-term archiving on : Monday, September 24, 2018 - 11:36:51 AM

File

468652_1_En_14_Chapter.pdf
Files produced by the author(s)

Licence


Distributed under a Creative Commons Attribution 4.0 International License

Citation

Despoina Mouratidis, Katia Kermanidis. Automatic Selection of Parallel Data for Machine Translation. 14th IFIP International Conference on Artificial Intelligence Applications and Innovations (AIAI), May 2018, Rhodes, Greece. pp.146-156, ⟨10.1007/978-3-319-92016-0_14⟩. ⟨hal-01821299⟩
