Overlapped speech detection and speaker counting using distant microphone arrays

Abstract : We study the problem of detecting and counting simultaneous, overlapping speakers in a multichannel, distant-microphone scenario. Focusing on a supervised learning approach, we treat Voice Activity Detection (VAD), Overlapped Speech Detection (OSD), joint VAD and OSD (VAD+OSD) and speaker counting in a unified way, as instances of a general Overlapped Speech Detection and Counting (OSDC) multi-class supervised learning problem. We consider a Temporal Convolutional Network (TCN) and a Transformer-based architecture for this task, and compare them with previously proposed state-of-the-art methods based on Recurrent Neural Networks (RNN) or hybrid Convolutional-Recurrent Neural Networks (CRNN). In addition, we propose ways of exploiting multichannel input by means of early or late fusion of single-channel features with spatial features extracted from one or more microphone pairs. We conduct an extensive experimental evaluation on the AMI and CHiME-6 datasets and on a purpose-built multichannel synthetic dataset. We show that the Transformer-based architecture performs best among all architectures, and that neural-network-based spatial localization features outperform signal-based spatial features and significantly improve performance compared to using single-channel features alone. Finally, we find that training with a speaker counting objective improves OSD compared to training with a VAD+OSD objective.
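
As an illustration of the unified OSDC view described in the abstract, the sketch below derives per-frame class labels for each task from a single sequence of active-speaker counts. It is a minimal example, not the authors' implementation: the function name `counts_to_labels`, the task strings, and the `max_speakers` cap are all assumptions made for illustration, and the abstract does not state how many counting classes the paper uses.

```python
from typing import List


def counts_to_labels(counts: List[int], task: str, max_speakers: int = 4) -> List[int]:
    """Map per-frame active-speaker counts to class labels for one task.

    Hypothetical helper: the task names and the max_speakers cap are
    illustrative assumptions, not taken from the paper.
    """
    if task == "vad":
        # Two classes: 0 = no speech, 1 = at least one speaker.
        return [int(c >= 1) for c in counts]
    if task == "osd":
        # Two classes: 0 = at most one speaker, 1 = overlapped speech.
        return [int(c >= 2) for c in counts]
    if task == "vad+osd":
        # Three classes: 0 = no speech, 1 = one speaker, 2 = overlap.
        return [min(c, 2) for c in counts]
    if task == "count":
        # One class per speaker count, capped at max_speakers (assumed cap).
        return [min(c, max_speakers) for c in counts]
    raise ValueError(f"unknown task: {task}")


# Example frame-level speaker counts (made-up values).
frame_counts = [0, 1, 2, 3, 5]
for task in ("vad", "osd", "vad+osd", "count"):
    print(task, counts_to_labels(frame_counts, task))
```

Under this framing every task is a coarser or finer partition of the same count sequence, so a model trained on the counting labels also carries the information needed for the OSD decision.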
Document type : Journal articles

https://hal.inria.fr/hal-03375681
Contributor : Emmanuel Vincent
Submitted on : Wednesday, October 13, 2021 - 8:09:11 AM
Last modification on : Friday, November 18, 2022 - 9:24:00 AM
Long-term archiving on : Friday, January 14, 2022 - 6:17:06 PM

File

cornell_CSL21.pdf
Files produced by the author(s)

Citation

Samuele Cornell, Maurizio Omologo, Stefano Squartini, Emmanuel Vincent. Overlapped speech detection and speaker counting using distant microphone arrays. Computer Speech and Language, 2021, 72, ⟨10.1016/j.csl.2021.101306⟩. ⟨hal-03375681⟩

Metrics

Record views : 50
Files downloads : 463