
Synthèse audiovisuelle de la parole expressive : modélisation des émotions par apprentissage profond (Audiovisual synthesis of expressive speech: modeling emotions with deep learning)

Sara Dahmani 1
1 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : This thesis concerns the modeling of emotions for expressive audiovisual text-to-speech synthesis. Today, text-to-speech synthesis systems produce good-quality results; however, audiovisual synthesis remains an open issue, and expressive synthesis is even less studied. In this thesis, we present a malleable and flexible method for modeling emotions, which allows us to mix emotions as we mix shades on a palette of colors. In the first part, we present and study two expressive corpora that we have built. The recording strategy and the expressive content of these corpora are analyzed to validate their use for audiovisual speech synthesis. In the second part, we present two neural architectures for speech synthesis. We used these two architectures to model three aspects of speech: 1) the duration of sounds, 2) the acoustic modality, and 3) the visual modality. First, we use a fully connected architecture. This architecture allowed us to study the behavior of neural networks when dealing with different contextual and linguistic descriptors. We were also able to analyze, with objective measures, the network's ability to model emotions. The second neural architecture proposed is a variational auto-encoder. This architecture is able to learn a latent representation of emotions without using emotion labels. After analyzing the latent space of emotions, we present a procedure for structuring it in order to move from a discrete representation of emotions to a continuous one. Through perceptual experiments, we validated the ability of our system to generate emotions, nuances of emotions, and mixtures of emotions for expressive audiovisual text-to-speech synthesis.
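The emotion-mixing idea described in the abstract can be illustrated with a minimal sketch: each emotion is summarized by the centroid of its utterances' latent codes (as produced by a variational auto-encoder), and nuances or mixtures are obtained as convex combinations of those centroids. All names, shapes, and the use of centroids here are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

# Hypothetical sketch of mixing emotions in a VAE latent space.
# Random vectors stand in for VAE-encoded utterances of two emotions.
rng = np.random.default_rng(0)
latent_dim = 16
joy_codes = rng.normal(loc=1.0, size=(50, latent_dim))
sadness_codes = rng.normal(loc=-1.0, size=(50, latent_dim))

def centroid(codes: np.ndarray) -> np.ndarray:
    """Centroid of the latent codes belonging to one emotion."""
    return codes.mean(axis=0)

def mix(a: np.ndarray, b: np.ndarray, alpha: float) -> np.ndarray:
    """Convex combination: alpha=1.0 gives pure a, alpha=0.0 pure b,
    intermediate values give blends or nuances between the two."""
    return alpha * a + (1.0 - alpha) * b

# A blend leaning toward sadness; the decoder (not shown) would
# condition synthesis on this latent vector.
blended = mix(centroid(joy_codes), centroid(sadness_codes), alpha=0.3)
print(blended.shape)  # (16,)
```

Varying `alpha` continuously is what turns a discrete set of emotion labels into a continuous expressive space, which is the structuring step the abstract refers to.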
Contributor : Sara Dahmani
Submitted on : Thursday, December 17, 2020 - 11:02:23 AM
Last modification on : Wednesday, November 3, 2021 - 7:08:58 AM
Long-term archiving on : Thursday, March 18, 2021 - 6:51:49 PM


Files produced by the author(s)


  • HAL Id : tel-03079349, version 1


Sara Dahmani. Synthèse audiovisuelle de la parole expressive : modélisation des émotions par apprentissage profond. Informatique [cs]. Université de Lorraine, 2020. Français. ⟨NNT : 2020LORR0137⟩. ⟨tel-03079349⟩


