Conference papers

Data augmentation for pipeline-based speech translation

Abstract : Pipeline-based speech translation methods may suffer from errors in the output of the speech recognition system. It is therefore crucial that machine translation systems are trained to be robust against such noise. In this paper, we propose two methods of parallel data augmentation for developing pipeline-based speech translation systems. The first method uses a speech processing workflow to introduce errors, and the second generates commonly found suffix errors with a rule-based method. We show that, in combination, the two methods significantly improve speech translation quality, by 1.87 BLEU points over a baseline system.
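As a rough illustration of the second idea in the abstract (rule-based suffix errors applied to parallel data), the sketch below corrupts word endings on the source side of each sentence pair and keeps the target side unchanged. The suffix rules, the `error_rate` parameter, and the function names `corrupt_suffix` and `augment` are assumptions made for this example. They are not taken from the paper, whose actual rules and language pair are not described on this page.

```python
import random

# Hypothetical suffix confusions, chosen only for illustration
# (English-like endings; NOT the rules used by the authors).
SUFFIX_RULES = [("ing", "in"), ("ed", ""), ("s", "")]


def corrupt_suffix(word, rules=SUFFIX_RULES):
    """Replace the first matching suffix of `word`; leave short words unchanged."""
    for old, new in rules:
        # Require a stem of at least three characters before the suffix.
        if word.endswith(old) and len(word) > len(old) + 2:
            return word[: -len(old)] + new
    return word


def augment(pairs, error_rate=0.3, seed=0):
    """Return the clean pairs plus noisy copies with corrupted source suffixes.

    `pairs` is a list of (source, target) sentence tuples. Each source word
    is corrupted with probability `error_rate`; the target is left as-is.
    """
    rng = random.Random(seed)
    augmented = list(pairs)
    for src, tgt in pairs:
        noisy = " ".join(
            corrupt_suffix(w) if rng.random() < error_rate else w
            for w in src.split()
        )
        if noisy != src:  # only add pairs that actually contain an error
            augmented.append((noisy, tgt))
    return augmented
```

Calling `augment(pairs)` on a list of clean sentence pairs yields the original pairs followed by any noisy variants, which could then be mixed into machine translation training data. Keeping the clean pairs alongside the noisy ones is a design choice of this sketch, not a detail confirmed by the abstract.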

https://hal.inria.fr/hal-02907053
Contributor : Emmanuel Vincent
Submitted on : Monday, July 27, 2020 - 9:39:56 AM
Last modification on : Tuesday, July 28, 2020 - 8:53:12 AM

File

Data_Augmentation_for_Pipeline...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02907053, version 1

Citation

Diego Alves, Askars Salimbajevs, Mārcis Pinnis. Data augmentation for pipeline-based speech translation. 9th International Conference on Human Language Technologies - the Baltic Perspective (Baltic HLT 2020), Sep 2020, Kaunas, Lithuania. ⟨hal-02907053⟩
