Conference papers

Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Abstract: Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking. V-VAD is useful whenever audio VAD (A-VAD) is inefficient, either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, the datasets currently available for training and testing V-VAD lack content variability. We introduce a novel methodology to automatically create and annotate a very large in-the-wild dataset – WildVVAD – by combining A-VAD with face detection and tracking. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset.
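The annotation methodology sketched in the abstract can be illustrated with a minimal, hypothetical example: a tracked face that overlaps an audio-detected speech segment while being the only face on screen is weakly labeled "speaking", and a face track with no concurrent speech is labeled "not speaking". All names and the exact labeling rules below are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of auto-labeling face tracks with audio VAD output.
# Assumed (not from the paper): segments and tracks are given as time
# intervals in seconds; tracks carry the number of faces visible.

def label_tracks(avad_segments, face_tracks):
    """Assign weak V-VAD labels to face tracks using A-VAD segments.

    avad_segments: list of (start, end) intervals where audio VAD fired.
    face_tracks: list of dicts {"start": s, "end": e, "n_faces": k}.
    Returns (track, label) pairs; tracks where speech is detected but
    several faces are visible are ambiguous and discarded.
    """
    def overlaps(track, seg):
        # Open-interval overlap test between a track and a speech segment.
        return track["start"] < seg[1] and seg[0] < track["end"]

    labeled = []
    for track in face_tracks:
        has_speech = any(overlaps(track, seg) for seg in avad_segments)
        if has_speech and track["n_faces"] == 1:
            labeled.append((track, "speaking"))
        elif not has_speech:
            labeled.append((track, "not_speaking"))
        # speech present with multiple faces: speaker identity unknown, skip
    return labeled

tracks = [
    {"start": 0.0, "end": 2.0, "n_faces": 1},  # overlaps speech, one face
    {"start": 2.0, "end": 4.0, "n_faces": 2},  # overlaps speech, ambiguous
    {"start": 5.0, "end": 6.0, "n_faces": 1},  # silent interval
]
speech = [(0.5, 1.5), (2.5, 3.0)]
result = label_tracks(speech, tracks)
# The ambiguous two-face track is dropped; the others receive weak labels.
```

In practice the paper pairs this kind of weak supervision with face detection and tracking on in-the-wild videos, which is what allows the dataset to be built without manual annotation.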





HAL Id: hal-02882229, version 4


Sylvain Guy, Stéphane Lathuilière, Pablo Mesejo, Radu Horaud. Learning Visual Voice Activity Detection with an Automatically Annotated Dataset. ICPR 2020 - 25th International Conference on Pattern Recognition, Jan 2021, Milano, Italy. pp.1-6. ⟨hal-02882229v4⟩


