J. Ramirez, J. M. Gorriz, and J. C. Segura, Voice Activity Detection. Fundamentals and Speech Recognition System Robustness, 2007.

G. Potamianos, C. Neti, J. Luettin, and I. Matthews, Audio-visual automatic speech recognition: An overview, 2004.

P. Liu and Z. Wang, Voice activity detection using visual information, ICASSP, 2004.

F. Patrona, A. Iosifidis, A. Tefas, N. Nikolaidis, and I. Pitas, Visual voice activity detection in the wild, IEEE TMM, 2016.

S. Siatras, N. Nikolaidis, M. Krinidis, and I. Pitas, Visual lip activity detection and speaker detection using mouth region intensities, IEEE T. Circ. Syst. Vid, 2009.

Q. Liu, A. J. Aubrey, and W. Wang, Interference reduction in reverberant speech separation with visual voice activity detection, IEEE TMM, 2014.

D. Sodoyer, B. Rivet, L. Girin, J. Schwartz, and C. Jutten, An analysis of visual speech information applied to voice activity detection, ICASSP, vol.1, pp.601-604, 2006.
URL: https://hal.archives-ouvertes.fr/hal-00361750

A. Aubrey, B. Rivet, Y. Hicks, L. Girin, J. Chambers et al., Two novel visual voice activity detectors based on appearance models and retinal filtering, 2007.

Q. Liu, W. Wang, and P. Jackson, A visual voice activity detection method with AdaBoosting, SSPD, pp.1-5, 2011.

R. Sharma, K. Somandepalli, and S. Narayanan, Toward visual voice activity detection for unconstrained videos, ICIP, 2019.

E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, CUAVE: A new audio-visual database for multimodal human-computer interface research, ICASSP, 2002.

P. Tiawongsombat, M. Jeong, J. Yun, B. You, and S. Oh, Robust visual speakingness detection using bi-level HMM, Pattern Recognit, 2012.

V. P. Minotto, C. R. Jung, and B. Lee, Simultaneous-speaker voice activity detection and localization using mid-fusion of SVM and HMMs, IEEE TMM, 2014.

M. Cooke, J. Barker, S. Cunningham, and X. Shao, An audio-visual corpus for speech perception and automatic speech recognition, J. Acoust. Soc. Am, 2006.

A. Bulat and G. Tzimiropoulos, How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks), ICCV, 2017.

A. Graves and J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Netw, 2005.

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, ICML, 2015.

G. Chéron, I. Laptev, and C. Schmid, P-CNN: Pose-based CNN features for action recognition, IEEE ICCV, 2015.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet large scale visual recognition challenge, 2015.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014.

S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici et al., YouTube-8M: A large-scale video classification benchmark, 2016.

S. Salishev, A. Barabanov, D. Kocharov, P. Skrelin, and M. Moiseev, Voice activity detector (VAD) based on long-term mel frequency band features, TSD, 2016.

D. E. King, Easily create high quality object detectors with deep learning

S. Siatras, N. Nikolaidis, and I. Pitas, Visual speech detection using mouth region intensities, 2006.

R. Navarathna, D. Dean, P. Lucey, S. Sridharan, and C. Fookes, Dynamic visual features for visual-speech activity detection, 2010.

R. Navarathna, D. Dean, S. Sridharan, C. Fookes, and P. Lucey, Visual voice activity detection using frontal versus profile views, 2011.

I. Laptev, Fast implementation of space-time interest point detector and descriptors

H. Wang, A. Kläser, C. Schmid, and C. Liu, Dense trajectories video description

D. Kingma and J. Ba, Adam: A method for stochastic optimization, 2015.