Conference papers

Audiovisual Synchrony Detection with Optimized Audio Features

Sami Sieranoja 1 Md Sahidullah 2 Tomi Kinnunen 1 Jukka Komulainen 3 Abdenour Hadid 3 
2 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : Audiovisual speech synchrony detection is an important part of talking-face verification systems. Prior work has primarily focused on visual features and joint-space models, while standard mel-frequency cepstral coefficients (MFCCs) have commonly been used to represent speech. We focus more closely on audio by studying the impact of context window length for delta feature computation and by comparing MFCCs with simpler energy-based features in lip-sync detection. For the visual side we select state-of-the-art hand-crafted lip-sync features, space-time auto-correlation of gradients (STACOG), and for joint-space modeling we select canonical correlation analysis (CCA). To enhance joint-space modeling, we adopt deep CCA (DCCA), a nonlinear extension of CCA. Our results on the XM2VTS data indicate substantially enhanced audiovisual speech synchrony detection, with an equal error rate (EER) of 3.68%. Further analysis reveals that failed lip region localization and beardedness of the subjects account for most of the errors. Thus, the lip motion description is the bottleneck, while the use of novel audio features or joint-modeling techniques is unlikely to boost lip-sync detection accuracy further.
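As an illustration of the context window studied in the abstract, the following is a minimal sketch of delta feature computation using the widely used regression formula, d_t = Σ_{n=1..N} n (c_{t+n} − c_{t−n}) / (2 Σ_{n=1..N} n²), where the context spans 2N+1 frames. The record does not state which delta formula or edge handling the authors used, so the formula, the edge-replication padding, and the names `delta_features` and `half_window` are assumptions made for this example only.

```python
import numpy as np


def delta_features(features, half_window=2):
    """Regression-based delta features (illustrative sketch, not the paper's code).

    features:    array of shape (num_frames, num_coeffs), e.g. MFCCs or frame energies.
    half_window: N in the formula; the context window covers 2N + 1 frames.
    """
    if half_window < 1:
        raise ValueError("half_window must be at least 1")

    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]

    # Assumption: replicate the first and last frames so that every frame
    # has a full context window.
    padded = np.pad(features, ((half_window, half_window), (0, 0)), mode="edge")

    # Normalising constant 2 * sum(n^2) for n = 1..N.
    denom = 2.0 * sum(n * n for n in range(1, half_window + 1))

    deltas = np.zeros_like(features)
    for n in range(1, half_window + 1):
        ahead = padded[half_window + n : half_window + n + num_frames]   # c_{t+n}
        behind = padded[half_window - n : half_window - n + num_frames]  # c_{t-n}
        deltas += n * (ahead - behind)

    return deltas / denom
```

Increasing `half_window` smooths the estimated slope over a longer stretch of audio, which is the kind of trade-off a context-window study would explore; the values that work best for lip-sync detection are reported in the paper itself, not here.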
Contributor : Md Sahidullah
Submitted on : Monday, October 8, 2018 - 10:46:26 AM
Last modification on : Saturday, June 25, 2022 - 7:42:06 PM
Long-term archiving on : Wednesday, January 9, 2019 - 1:49:39 PM


Files produced by the author(s)


  • HAL Id : hal-01889918, version 1


Sami Sieranoja, Md Sahidullah, Tomi Kinnunen, Jukka Komulainen, Abdenour Hadid. Audiovisual Synchrony Detection with Optimized Audio Features. ICSIP 2018 - 3rd International Conference on Signal and Image Processing, Jul 2018, Shenzhen, China. ⟨hal-01889918⟩


