Conference papers

Data augmentation for pipeline-based speech translation

Abstract : Pipeline-based speech translation methods can suffer from errors in the output of the speech recognition system. It is therefore crucial that machine translation systems are trained to be robust against such noise. In this paper, we propose two methods of parallel data augmentation for developing pipeline-based speech translation systems. The first method uses a speech processing workflow to introduce errors, and the second generates commonly found suffix errors with a rule-based method. We show that, in combination, the two methods significantly improve speech translation quality, by 1.87 BLEU points over a baseline system.
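
The abstract does not spell out the suffix rules, so the following is only a minimal sketch of what rule-based suffix-error augmentation of parallel data could look like, not the authors' implementation. The rule table `SUFFIX_RULES`, the function names `corrupt_word` and `augment`, the `error_rate` parameter, and the English example suffixes are all assumptions made for illustration. The idea shown is that the source side is corrupted while the clean target side is kept, so a translation model sees noisy input paired with a correct reference.

```python
import random

# Hypothetical suffix confusions, for illustration only (not taken from the paper).
SUFFIX_RULES = {
    "ing": ["in", "ed"],
    "ed": ["", "ing"],
    "s": [""],
}


def corrupt_word(word, rng):
    """Replace a known suffix of `word` with a randomly chosen confusable one."""
    # Check longer suffixes first so that "ing" is matched before "s".
    for suffix in sorted(SUFFIX_RULES, key=len, reverse=True):
        # Skip very short words so that a stem of at least 3 characters remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: len(word) - len(suffix)]
            return stem + rng.choice(SUFFIX_RULES[suffix])
    return word


def augment(pairs, error_rate=0.1, seed=0):
    """Return the original (source, target) pairs plus noisy-source copies."""
    rng = random.Random(seed)
    augmented = []
    for source, target in pairs:
        words = source.split()
        noisy = [
            corrupt_word(w, rng) if rng.random() < error_rate else w
            for w in words
        ]
        # Keep only copies where at least one word actually changed.
        if noisy != words:
            augmented.append((" ".join(noisy), target))
    return list(pairs) + augmented
```

In such a setup the augmented pairs would simply be appended to the training corpus; the error rate and the rule table would need to be tuned to the errors actually observed in the recogniser's output.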

Contributor : Emmanuel Vincent
Submitted on : Monday, July 27, 2020 - 9:39:56 AM
Last modification on : Tuesday, July 28, 2020 - 8:53:12 AM
Long-term archiving on : Tuesday, December 1, 2020 - 7:12:38 AM


Files produced by the author(s)


  • HAL Id : hal-02907053, version 1



Diego Alves, Askars Salimbajevs, Mārcis Pinnis. Data augmentation for pipeline-based speech translation. 9th International Conference on Human Language Technologies - the Baltic Perspective (Baltic HLT 2020), Sep 2020, Kaunas, Lithuania. ⟨hal-02907053⟩


