Conference papers

Deep variational metric learning for transfer of expressivity in multispeaker text to Speech

Ajinkya Kulkarni 1, Vincent Colotte 1, Denis Jouvet 1
1 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : In this paper, we propose using the deep metric learning based multi-class N-pair loss for text-to-speech (TTS) synthesis. We use the proposed loss function in a recurrent conditional variational autoencoder (RCVAE) to transfer expressivity in a French multispeaker TTS system. To represent speaker identity, we extract speaker embeddings from an x-vector based speaker recognition model trained on speech data from many speakers. To generate expressive speech in the desired speaker's voice, we transfer expressivity using the mean of the latent variables for each emotion. In contrast to commonly used loss functions such as triplet loss or contrastive loss, multi-class N-pair loss considers all the negative examples, which makes each emotion class distinct from the others. Furthermore, the presented approach helps create a robust representation of expressivity irrespective of speaker identity. The proposed approach demonstrates improved performance for transfer of expressivity to the target speaker's voice in synthesized speech. To our knowledge, this is the first time that multi-class N-pair loss and x-vector based speaker embeddings have been used in a TTS system.
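As a rough illustration of why the multi-class N-pair loss "considers all the negative examples", the sketch below implements the standard formulation of that loss from the metric learning literature. It is an assumption that this is the form used here, since the abstract does not state the equation or how the loss is attached to the RCVAE latent variables. The function name `n_pair_loss` and the list-of-floats embeddings are hypothetical and chosen for illustration only.

```python
import math


def n_pair_loss(anchors, positives):
    """Standard multi-class N-pair loss (illustrative sketch, not the paper's code).

    anchors[i] and positives[i] are embeddings of two samples from class i
    (for example, one emotion class). For anchor i, the positives of every
    other class j != i act as negatives, so each anchor is compared against
    all N - 1 other classes at once rather than a single negative.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        # Dot-product similarity between anchor i and every class's positive.
        sims = [
            sum(a * p for a, p in zip(anchors[i], positives[j]))
            for j in range(n)
        ]
        # log(1 + sum_{j != i} exp(s_ij - s_ii))
        total += math.log1p(
            sum(math.exp(sims[j] - sims[i]) for j in range(n) if j != i)
        )
    return total / n


# Two toy classes with well separated embeddings give a low loss ...
separated = n_pair_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# ... while anchors that sit closer to another class's positive give a higher one.
confused = n_pair_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

By contrast, a triplet loss would compare each anchor with only one negative per update, which is the distinction the abstract draws.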

Cited literature [27 references]
Contributor : Denis Jouvet
Submitted on : Thursday, October 22, 2020 - 5:00:59 PM
Last modification on : Wednesday, November 3, 2021 - 7:56:51 AM




  • HAL Id : hal-02573885, version 2


Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet. Deep variational metric learning for transfer of expressivity in multispeaker text to Speech. SLSP 2020 - 8th International Conference on Statistical Language and Speech Processing, Oct 2020, Cardiff / Virtual, United Kingdom. ⟨hal-02573885v2⟩


