Transfer learning of the expressivity using flow metric learning in multispeaker text-to-speech synthesis

Ajinkya Kulkarni; Vincent Colotte; Denis Jouvet

Conference paper, Year: 2020

Affiliation (all three authors): Speech Modeling for Facilitating Oral-Based Communication

Abstract

In this paper, we present a novel deep metric learning architecture, combined with variational inference, incorporated in a parametric multispeaker expressive text-to-speech (TTS) system. We propose inverse autoregressive flow (IAF) as a way to perform the variational inference, thus providing a flexible approximate posterior distribution. The proposed approach conditions the text-to-speech system on speaker embeddings so that the latent space represents the emotion as a semantic characteristic. To represent the speaker, we extract speaker embeddings from an x-vector based speaker recognition model trained on speech data from many speakers. To predict the vocoder features, we use an acoustic model conditioned on the textual features as well as on the speaker embedding. We transfer the expressivity by using the mean of the latent variables for each emotion to generate expressive speech in the voices of different speakers for which no expressive speech data is available. We compare the results obtained using flow-based variational inference with a variational autoencoder as the baseline model. The performance, measured by mean opinion score (MOS), speaker MOS, and expressive MOS, shows that N-pair loss based deep metric learning together with the IAF model improves the transfer of expressivity to the desired speaker's voice in synthesized speech.
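The abstract names two ideas that a short sketch can make concrete: an N-pair metric learning loss, and the use of the per-emotion mean of the latent variables as the vector that carries expressivity to another speaker. The code below is only an illustration under stated assumptions, not the authors' implementation. The function names `n_pair_loss` and `emotion_latent_means`, the array shapes, and the use of NumPy are all hypothetical. The loss follows the standard N-pair formulation (one anchor/positive pair per class, with the other classes' positives acting as negatives); whether the paper uses exactly this variant is not stated in the abstract.

```python
import numpy as np


def n_pair_loss(anchors, positives):
    """Standard N-pair loss (illustrative; not necessarily the paper's variant).

    anchors, positives: arrays of shape (N, D). Row i of each array is
    assumed to come from the same class (e.g. the same emotion), and
    different rows from different classes.
    """
    # Similarity of every anchor to every positive, shape (N, N).
    logits = anchors @ positives.T
    # Numerically stable log-sum-exp over each row.
    row_max = logits.max(axis=1, keepdims=True)
    lse = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    # log(1 + sum_{j != i} exp(s_ij - s_ii)) equals lse_i - s_ii.
    return float(np.mean(lse - np.diag(logits)))


def emotion_latent_means(latents, emotion_labels):
    """Average the latent vectors of all utterances sharing an emotion label.

    latents: array of shape (num_utterances, latent_dim).
    emotion_labels: sequence of length num_utterances (hypothetical labels).
    Returns a dict mapping each label to its mean latent vector.
    """
    labels = np.asarray(emotion_labels)
    return {
        label.item(): latents[labels == label].mean(axis=0)
        for label in np.unique(labels)
    }
```

In a system of the kind the abstract describes, the mean vector for a given emotion would be fed to the decoder together with the embedding of a target speaker who has no expressive recordings. That wiring is omitted here because the abstract does not specify it.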

Keywords

expressivity, variational autoencoder, inverse autoregressive flow, deep metric learning, text-to-speech

Domains

Computer Science [cs], Signal and Image Processing [eess.SP]

Main file

interspeech_2020_published_version.pdf (333.9 KB)

Origin: Files produced by the author(s)

Contributor: Denis Jouvet

https://inria.hal.science/hal-02572106

Submitted on: Thursday, 22 October 2020, 16:37:45

Last modified on: Monday, 11 September 2023, 17:41:19

Dates and versions

hal-02572106 , version 1 (13-05-2020)

hal-02572106 , version 2 (15-05-2020)

hal-02572106 , version 3 (22-10-2020)

Identifiers

HAL Id : hal-02572106 , version 3

Cite

Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet. Transfer learning of the expressivity using flow metric learning in multispeaker text-to-speech synthesis. INTERSPEECH 2020, Oct 2020, Shanghai / Virtual, China. ⟨hal-02572106v3⟩

Collections

CNRS INRIA GRID5000 UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD SILECS
