Analyzing the impact of speaker localization errors on speech separation for automatic speech recognition

Sunit Sivasankaran 1, Emmanuel Vincent 1, Dominique Fohr 1
1 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : We investigate the effect of speaker localization on the performance of speech recognition systems in a multispeaker, multichannel environment. Given the speaker location information, speech separation is performed in three stages. In the first stage, a simple delay-and-sum (DS) beamformer is used to enhance the signal impinging from the speaker location. In the second stage, the enhanced signal is used to estimate a time-frequency mask corresponding to the localized speaker using a neural network. In the third stage, this mask is used to compute the second-order statistics and to derive an adaptive beamformer. We generated a multichannel, multispeaker, reverberated, noisy dataset inspired by the well-studied WSJ0-2mix and studied the performance of the proposed pipeline in terms of the word error rate (WER). An average WER of 29.4% was achieved using the ground truth localization information and 42.4% using the localization information estimated via GCC-PHAT. The signal-to-interference ratio (SIR) between the speakers has a higher impact on the ASR performance, to the extent of reducing the WER by 59% relative for a SIR increase of 15 dB. By contrast, increasing the spatial distance to 50° or more improves the WER by only 23% relative.
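The first stage named in the abstract, delay-and-sum beamforming towards a known speaker direction, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a far-field source, a linear microphone array, integer-sample delays, and a speed of sound of 343 m/s; the function names `steering_delays` and `delay_and_sum` are hypothetical and chosen only for this illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second (assumed value)


def steering_delays(mic_positions, doa_deg, fs):
    """Return the lag (in whole samples) of each channel for a far-field source.

    mic_positions: microphone coordinates along the array axis, in metres.
    doa_deg: direction of arrival measured from the array axis
             (0 degrees = endfire, 90 degrees = broadside).
    fs: sampling rate in Hz.
    """
    theta = math.radians(doa_deg)
    # A plane wave reaches a microphone at coordinate p earlier than the
    # origin by p * cos(theta) / c seconds.
    arrivals = [-p * math.cos(theta) / SPEED_OF_SOUND for p in mic_positions]
    earliest = min(arrivals)
    # Express each arrival as a non-negative lag relative to the first
    # microphone hit by the wavefront, rounded to the nearest sample.
    return [round((t - earliest) * fs) for t in arrivals]


def delay_and_sum(channels, lags):
    """Align each channel by its lag and average the aligned samples."""
    n_out = min(len(x) - d for x, d in zip(channels, lags))
    return [
        sum(x[n + d] for x, d in zip(channels, lags)) / len(channels)
        for n in range(n_out)
    ]


# Toy usage: three copies of one source, each lagging by a known number of samples.
source = [1.0, 2.0, 3.0, 4.0, 5.0]
lags = [2, 1, 0]
channels = [[0.0] * d + source for d in lags]
enhanced = delay_and_sum(channels, lags)
```

When the lags match the true propagation delays, the target adds coherently while signals from other directions are averaged out of phase. An error in the estimated direction changes the lags, which is one way localization errors can propagate to the later stages.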
Document type :
Preprints, Working Papers, ...

Cited literature : 26 references

https://hal.inria.fr/hal-02355669
Contributor : Sunit Sivasankaran
Submitted on : Friday, November 8, 2019 - 1:27:32 PM
Last modification on : Wednesday, November 13, 2019 - 1:02:58 AM
Long-term archiving on: Sunday, February 9, 2020 - 3:01:11 PM

File

sivasankaran.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02355669, version 1

Citation

Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr. Analyzing the impact of speaker localization errors on speech separation for automatic speech recognition. 2019. ⟨hal-02355669⟩
