Conditional Variational Auto-Encoder for Text-Driven Expressive AudioVisual Speech Synthesis
Conference Papers Year : 2019

Abstract

In recent years, deep learning-based models have improved the performance of speech synthesis systems, but generating expressive audiovisual speech remains an open issue. Variational auto-encoders (VAEs) have recently been proposed to learn latent representations of data. In this paper, we present a system for expressive text-to-audiovisual speech synthesis that learns a latent embedding space of emotions using a conditional generative model based on the variational auto-encoder framework. When conditioned on textual input, the VAE learns an embedded representation that captures emotion characteristics of the signal while being invariant to the phonetic content of the utterances. We applied this method in an unsupervised manner to generate the duration, acoustic, and visual features of speech. This conditional variational auto-encoder (CVAE) was used to blend emotions together: the model can generate nuances of a given emotion, or new emotions that do not exist in our database. We conducted three perceptual experiments to evaluate our findings.
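The emotion blending described in the abstract rests on two standard latent-space operations: sampling an embedding with the VAE reparameterization trick, and interpolating between the embeddings of two emotions. A minimal NumPy sketch of these two operations follows; the latent dimension, the emotion names, and all variable names are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    # VAE reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def blend_emotions(z_a, z_b, alpha):
    # Linear interpolation in latent space:
    # alpha = 0 recovers emotion A, alpha = 1 recovers emotion B,
    # intermediate values yield nuances or mixtures of the two.
    return (1.0 - alpha) * z_a + alpha * z_b

# Hypothetical 16-dim latent means for two learned emotions.
mu_joy, mu_sad = rng.standard_normal((2, 16))
log_var = np.zeros(16)  # unit variance, for illustration only

z_joy = reparameterize(mu_joy, log_var, rng)
z_sad = reparameterize(mu_sad, log_var, rng)
z_mix = blend_emotions(z_joy, z_sad, 0.5)  # a 50/50 blend of the two
```

In a full system, `z_mix` would be passed to the decoder (together with the text conditioning) to synthesize duration, acoustic, and visual features carrying the blended expression.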
Main file: interspeech19_sd_26_06.pdf (1.34 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-02175776 , version 1 (06-07-2019)

Cite

Sara Dahmani, Vincent Colotte, Valérian Girard, Slim Ouni. Conditional Variational Auto-Encoder for Text-Driven Expressive AudioVisual Speech Synthesis. INTERSPEECH 2019 - 20th Annual Conference of the International Speech Communication Association, Sep 2019, Graz, Austria. ⟨hal-02175776⟩