Conference papers

Deep variational metric learning for transfer of expressivity in multispeaker text to Speech

Ajinkya Kulkarni 1 Vincent Colotte 1 Denis Jouvet 1
1 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : In this paper, we propose using the deep metric learning based multi-class N-pair loss for text-to-speech (TTS) synthesis. We use the proposed loss function in a recurrent conditional variational autoencoder (RCVAE) to transfer expressivity in a French multispeaker TTS system. To represent speaker identity, we extract speaker embeddings from an x-vector based speaker recognition model trained on speech data from many speakers. We use the mean of the latent variables for each emotion to transfer expressivity and generate expressive speech in the desired speaker's voice. In contrast to commonly used loss functions such as triplet loss or contrastive loss, the multi-class N-pair loss considers all the negative examples, which makes each emotion class distinct from the others. Furthermore, the presented approach helps create a robust representation of expressivity irrespective of speaker identity. Our proposed approach demonstrates improved performance in transferring expressivity to the target speaker's voice in synthesized speech. To our knowledge, this is the first time that the multi-class N-pair loss and x-vector based speaker embeddings have been used in a TTS system.
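As a rough illustration of the two ideas named in the abstract, the sketch below writes out the standard multi-class N-pair loss and a per-emotion mean of latent vectors in NumPy. It is not the authors' implementation: the function names, the array shapes, the use of one anchor/positive pair per class, and the string emotion labels are all assumptions made for this example, and the abstract does not say how pairs are formed inside the RCVAE.

```python
import numpy as np


def n_pair_loss(anchors, positives):
    """Multi-class N-pair loss for N classes (hypothetical helper).

    anchors, positives: arrays of shape (N, D). Row i of each array is
    assumed to come from class i, so positives[j] with j != i acts as a
    negative example for anchors[i].
    """
    # Similarity of every anchor with every positive: shape (N, N).
    logits = anchors @ positives.T
    # log(1 + sum_{j != i} exp(s_ij - s_ii)) equals logsumexp_j(s_ij) - s_ii.
    # Subtract the row maximum first for numerical stability.
    row_max = logits.max(axis=1, keepdims=True)
    lse = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    return float(np.mean(lse - np.diag(logits)))


def emotion_mean(latents, labels, emotion):
    """Mean latent vector over all samples labelled with `emotion`.

    latents: array of shape (M, D); labels: length-M sequence of labels.
    """
    mask = np.asarray(labels) == emotion
    return latents[mask].mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(4, 8))  # 4 emotion classes, 8-dim embeddings
    p = a + 0.1 * rng.normal(size=(4, 8))
    print("N-pair loss:", n_pair_loss(a, p))
    z = rng.normal(size=(6, 8))
    y = ["joy", "anger", "joy", "sad", "anger", "joy"]
    print("mean latent for 'joy':", emotion_mean(z, y, "joy"))
```

Because every anchor is scored against the positives of all other classes at once, each update uses N-1 negatives, whereas a triplet loss compares against a single negative per anchor.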

https://hal.inria.fr/hal-02573885
Contributor : Denis Jouvet
Submitted on : Thursday, October 22, 2020 - 5:00:59 PM
Last modification on : Saturday, November 28, 2020 - 10:24:02 AM

File

SLSP_2020_published_version.pd...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02573885, version 2

Citation

Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet. Deep variational metric learning for transfer of expressivity in multispeaker text to Speech. SLSP 2020 - 8th International Conference on Statistical Language and Speech Processing, Oct 2020, Cardiff / Virtual, United Kingdom. ⟨hal-02573885v2⟩
