Deep variational metric learning for transfer of expressivity in multispeaker text to Speech

Ajinkya Kulkarni; Vincent Colotte; Denis Jouvet

Conference paper, Year: 2020

Ajinkya Kulkarni

Role: Author
PersonId : 1069614

Speech Modeling for Facilitating Oral-Based Communication

Vincent Colotte

Role: Author
PersonId : 16268
IdHAL : vincent-colotte
IdRef : 070401683

Speech Modeling for Facilitating Oral-Based Communication

Denis Jouvet

Role: Author
PersonId : 15904
IdHAL : denis-jouvet
IdRef : 029418666

Speech Modeling for Facilitating Oral-Based Communication

Abstract

In this paper, we propose using the multi-class N-pair loss, a deep metric learning objective, for text-to-speech (TTS) synthesis. We apply this loss function in a recurrent conditional variational autoencoder (RCVAE) to transfer expressivity in a French multispeaker TTS system. To represent speaker identity, we extract speaker embeddings from an x-vector based speaker recognition model trained on speech data from many speakers. To generate expressive speech in the desired speaker's voice, we transfer expressivity using the mean of the latent variables for each emotion. In contrast to commonly used loss functions such as triplet loss or contrastive loss, the multi-class N-pair loss considers all negative examples, which helps distinguish each emotion class from the others. Furthermore, the presented approach helps create a representation of expressivity that is robust to speaker identity. The proposed approach shows improved performance for transferring expressivity to the target speaker's voice in synthesized speech. To our knowledge, this is the first time that the multi-class N-pair loss and x-vector based speaker embeddings have been used in a TTS system.
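The contrast between the multi-class N-pair loss and triplet-style losses can be illustrated with a minimal sketch. It follows the commonly cited form of the loss, log(1 + sum over negatives of exp(a·n - a·p)), where a is the anchor embedding, p the positive, and n ranges over the negatives. The function names `dot` and `n_pair_loss`, the toy 2-D embeddings, and the batch layout (one anchor/positive pair per emotion class) are assumptions made for illustration; this is not the authors' implementation, and the abstract does not say how the loss is combined with the RCVAE objective.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def n_pair_loss(anchors, positives):
    """Multi-class N-pair loss averaged over N (anchor, positive) pairs.

    anchors[i] and positives[i] are assumed to be embeddings of the same
    class (for example, the same emotion). For anchor i, the positives of
    all other classes j != i are used as negatives, so every other class
    contributes to each term instead of a single sampled negative.
    """
    n = len(anchors)
    if n != len(positives) or n < 2:
        raise ValueError("need at least two (anchor, positive) pairs")

    total = 0.0
    for i in range(n):
        pos_score = dot(anchors[i], positives[i])
        # log(1 + sum_j exp(neg_j - pos)) written as a log-sum-exp over
        # [0, neg_1 - pos, ...] for numerical stability.
        diffs = [0.0] + [
            dot(anchors[i], positives[j]) - pos_score
            for j in range(n)
            if j != i
        ]
        m = max(diffs)
        total += m + math.log(sum(math.exp(d - m) for d in diffs))
    return total / n


# Toy example with two hypothetical emotion classes.
anchors = [[1.0, 0.0], [0.0, 1.0]]
separated = n_pair_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
confused = n_pair_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(separated < confused)
```

In this toy case the loss is lower when each anchor is closest to the positive of its own class, which is the behaviour the abstract relies on to keep emotion classes distinct.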

Keywords

text-to-speech; variational autoencoder; deep metric learning; expressivity

Domains

Artificial Intelligence [cs.AI]; Computation and Language [cs.CL]; Signal and Image Processing [eess.SP]

Main file

SLSP_2020_published_version.pdf (683.84 KB)

Origin: Files produced by the author(s)

Contributor: Denis Jouvet

https://inria.hal.science/hal-02573885

Submitted on: Thursday, 22 October 2020, 17:00:59

Last modified on: Monday, 11 September 2023, 17:41:19

Dates and versions

hal-02573885 , version 1 (14-05-2020)

hal-02573885 , version 2 (22-10-2020)

Identifiers

HAL Id : hal-02573885 , version 2

Cite

Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet. Deep variational metric learning for transfer of expressivity in multispeaker text to Speech. SLSP 2020 - 8th International Conference on Statistical Language and Speech Processing, Oct 2020, Cardiff / Virtual, United Kingdom. ⟨hal-02573885v2⟩

Collections

CNRS INRIA GRID5000 UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD SILECS

292 views

681 downloads
