
Learning emotions latent representation with CVAE for Text-Driven Expressive AudioVisual Speech Synthesis

Sara Dahmani 1, 2 Vincent Colotte 1, 2 Valérian Girard 1, 2 Slim Ouni 1, 2
1 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract: Deep learning techniques have brought great improvement to the field of expressive audiovisual text-to-speech synthesis (EAVTTS). However, generating realistic speech remains an open issue, and recent research in this area has focused on controlling speech variability. In this paper, we use different neural architectures to synthesize emotional speech. We study the application of unsupervised learning techniques to emotional speech modeling, as well as methods for restructuring the representation of emotions to make it continuous and more flexible. This manipulation of the emotional representation should allow us to generate new styles of speech by mixing emotions.

We first present our expressive audiovisual corpus and validate its emotional content with three perceptual experiments using acoustic-only, visual-only, and audiovisual stimuli. We then analyze the performance of a fully connected neural network in learning characteristics specific to different emotions, for phone duration as well as for the acoustic and visual modalities. We also study how joint versus separate training of the acoustic and visual modalities affects the quality of the generated synthetic speech.

In the second part of this paper, we use a conditional variational auto-encoder (CVAE) architecture to learn a latent representation of emotions. We apply this method in an unsupervised manner to generate features of expressive speech, and we use a probabilistic metric measuring the degree of overlap between the emotions' latent clusters to choose the best parameters for the CVAE. By manipulating the latent vectors, we are able to generate nuances of a given emotion, as well as new emotions that do not exist in our database; for these new emotions, we obtain a coherent articulation. We conducted four perceptual experiments to evaluate our findings.
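The CVAE pipeline summarized above can be sketched minimally. The following is an illustrative sketch, not the authors' implementation: the layer sizes, the one-hot emotion conditioning, and the random (untrained) weights are all assumptions made for the example. It shows how the encoder, the reparameterization trick, and the decoder fit together, and how interpolating the latent means of two emotion clusters can blend them into a new style.

```python
import math
import random

random.seed(0)

# Hypothetical dimensions for illustration -- not taken from the paper.
FEAT_DIM = 8     # size of one acoustic/visual feature frame
COND_DIM = 4     # one-hot emotion label (e.g. joy, anger, sadness, fear)
LATENT_DIM = 2   # dimensionality of the emotion latent space

def make_layer(in_dim, out_dim):
    """Random linear layer (weights, bias) standing in for trained parameters."""
    w = [[random.gauss(0, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]
    b = [0.0] * out_dim
    return w, b

def apply_layer(x, layer):
    w, b = layer
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

# Encoder q(z | x, e): feature frame + emotion label -> Gaussian (mu, log-var).
enc_mu = make_layer(FEAT_DIM + COND_DIM, LATENT_DIM)
enc_logvar = make_layer(FEAT_DIM + COND_DIM, LATENT_DIM)
# Decoder p(x | z, e): latent sample + emotion label -> reconstructed features.
dec = make_layer(LATENT_DIM + COND_DIM, FEAT_DIM)

def encode(x, e):
    h = x + e  # concatenate feature frame and conditioning label
    return apply_layer(h, enc_mu), apply_layer(h, enc_logvar)

def reparameterize(mu, logvar):
    """z = mu + sigma * eps -- keeps the sampling step differentiable."""
    return [m + math.exp(0.5 * lv) * random.gauss(0, 1)
            for m, lv in zip(mu, logvar)]

def decode(z, e):
    return apply_layer(z + e, dec)

def mix_emotions(mu_a, mu_b, alpha, e):
    """Interpolate two emotion latent means to synthesize a blended style."""
    z = [(1 - alpha) * a + alpha * b for a, b in zip(mu_a, mu_b)]
    return decode(z, e)
```

In a trained model, the latent means for each emotion cluster would come from encoding real utterances; here the weights are random, so the outputs only illustrate the shapes and the data flow, not meaningful speech features.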

Contributor: Slim Ouni
Submitted on : Wednesday, April 21, 2021 - 1:24:10 PM
Last modification on : Friday, April 1, 2022 - 3:57:15 AM
Long-term archiving on: Thursday, July 22, 2021 - 6:38:53 PM


Sara Dahmani, Vincent Colotte, Valérian Girard, Slim Ouni. Learning emotions latent representation with CVAE for Text-Driven Expressive AudioVisual Speech Synthesis. Neural Networks, Elsevier, 2021, 141, pp.315-329. ⟨10.1016/j.neunet.2021.04.021⟩. ⟨hal-03204193⟩


