Conference papers

Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Sylvain Guy 1, Stéphane Lathuilière 2,1, Pablo Mesejo 3,1, Radu Horaud 1
1 PERCEPTION - Interpretation and Modelling of Images and Videos, Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, Grenoble INP - Institut polytechnique de Grenoble
Abstract: Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient, either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, the datasets currently available for training and testing V-VAD lack content variability. We introduce a novel methodology to automatically create and annotate very large in-the-wild datasets by combining A-VAD and face detection. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with such a dataset.
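The abstract only sketches the automatic annotation methodology (combining A-VAD with face detection). As a rough, purely illustrative sketch of how such a fusion could work, the following labels frames "speaking" when audio VAD fires and exactly one face is visible, and "silent" when a face is present without speech; the function name, the single-face rule, and the minimum-run filter are all assumptions for illustration, not the authors' implementation:

```python
def auto_annotate(audio_vad, face_counts, min_run=5):
    """Hypothetical annotation sketch: fuse per-frame audio VAD
    decisions with per-frame face-detection counts.

    audio_vad   -- list of bools, True if audio VAD detects speech
    face_counts -- list of ints, number of faces detected per frame
    min_run     -- runs shorter than this are discarded as label noise
    Returns a per-frame list: "speaking", "silent", or None (discard).
    """
    labels = []
    for speech, n_faces in zip(audio_vad, face_counts):
        if n_faces != 1:
            # ambiguous: no face, or several candidate speakers
            labels.append(None)
        elif speech:
            labels.append("speaking")
        else:
            labels.append("silent")

    # Drop labeled runs shorter than min_run frames.
    out, i = [], 0
    while i < len(labels):
        j = i
        while j < len(labels) and labels[j] == labels[i]:
            j += 1
        run = labels[i:j]
        if labels[i] is not None and len(run) < min_run:
            out.extend([None] * len(run))
        else:
            out.extend(run)
        i = j
    return out
```

For example, six speech frames followed by six silent frames with one face throughout would yield six "speaking" labels then six "silent" labels, while a two-frame speech burst would be discarded by the run filter.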
Cited literature: 30 references
Contributor: Team Perception
Submitted on: Friday, October 16, 2020 - 3:25:50 PM
Last modified on: Tuesday, October 20, 2020 - 3:37:24 AM

Files produced by the author(s)

HAL Id: hal-02882229, version 3


Sylvain Guy, Stéphane Lathuilière, Pablo Mesejo, Radu Horaud. Learning Visual Voice Activity Detection with an Automatically Annotated Dataset. ICPR 2020 - 25th International Conference on Pattern Recognition, Jan 2021, Milano / Virtual, Italy. ⟨hal-02882229v3⟩