Transfer learning of the expressivity using flow metric learning in multispeaker text-to-speech synthesis

Ajinkya Kulkarni 1 Vincent Colotte 1 Denis Jouvet 1
1 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : In this paper, we present a novel deep metric learning architecture along with variational inference incorporated in a parametric multispeaker expressive text-to-speech (TTS) system. We propose inverse autoregressive flow (IAF) as a way to perform the variational inference, thus providing a flexible approximate posterior distribution. The proposed approach conditions the text-to-speech system on speaker embeddings so that the latent space represents the emotion as semantic characteristics. For representing the speaker, we extracted speaker embeddings from the x-vector based speaker recognition model trained on speech data from many speakers. To predict the vocoder features, we used the acoustic model conditioned on the textual features as well as on the speaker embedding. We transferred the expressivity by using the mean of the latent variables for each emotion to generate expressive speech in different speakers' voices for which no expressive speech data is available. We compared the results obtained using flow-based variational inference with a variational autoencoder as a baseline model. The performance measured by mean opinion score (MOS), speaker MOS, and expressive MOS shows that N-pair loss based deep metric learning along with the IAF model improves the transfer of expressivity in the desired speaker's voice in synthesized speech.
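The transfer step described in the abstract (using the mean of the latent variables for each emotion) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `emotion_means`, the toy two-dimensional latent vectors, and the emotion labels are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def emotion_means(latents, labels):
    """Average the latent vectors that share an emotion label.

    `latents` is a list of equal-length lists of floats and `labels`
    is a parallel list of emotion names (both hypothetical inputs).
    """
    sums = {}
    counts = defaultdict(int)
    for z, label in zip(latents, labels):
        if label not in sums:
            sums[label] = [0.0] * len(z)
        # Accumulate the element-wise sum for this emotion.
        sums[label] = [s + v for s, v in zip(sums[label], z)]
        counts[label] += 1
    # Divide each accumulated sum by the number of samples seen.
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


# Toy latent vectors standing in for encoder outputs (assumed values).
latents = [[1.0, 2.0], [3.0, 4.0], [0.0, -2.0]]
labels = ["joy", "joy", "anger"]
means = emotion_means(latents, labels)
```

Under this reading, synthesizing expressive speech for a speaker with no expressive data would amount to feeding the acoustic model that speaker's embedding together with the mean latent vector of the desired emotion; how the two are combined inside the model is not specified in the abstract.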
Document type :
Conference papers
Contributor : Denis Jouvet
Submitted on : Thursday, October 22, 2020 - 4:37:45 PM
Last modification on : Wednesday, November 3, 2021 - 7:56:51 AM


Files produced by the author(s)


  • HAL Id : hal-02572106, version 3


Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet. Transfer learning of the expressivity using flow metric learning in multispeaker text-to-speech synthesis. INTERSPEECH 2020, Oct 2020, Shanghai / Virtual, China. ⟨hal-02572106v3⟩


