Conference Papers Year : 2022

Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems

Abstract

The main objective of this work is to study expressivity transfer to a speaker's voice for which no expressive speech data is available, in non-autoregressive end-to-end text-to-speech (TTS) systems. We investigated the expressivity transfer capability of probability density estimation based on deep generative models, namely Generative Flow (Glow) and diffusion probabilistic models (DPM). The use of deep generative models provides better log-likelihood estimates and tractability of the system, and in turn high-quality speech synthesis with faster inference. Furthermore, we propose the use of various expressivity encoders, which assist expressivity transfer in the TTS system. More precisely, we used self-attention statistical pooling and multi-scale expressivity encoder architectures to create a meaningful representation of expressivity. In addition to the traditional subjective metrics used for speech synthesis evaluation, we incorporated cosine similarity to measure the strength of attributes associated with speaker and expressivity. The non-autoregressive TTS system with a multi-scale expressivity encoder showed better expressivity transfer with both Glow and DPM-based decoders, illustrating the ability of the multi-scale architecture to capture the underlying attributes of expressivity from multiple acoustic features.
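As a rough illustration of the cosine-similarity measure mentioned in the abstract, the sketch below compares an embedding of a synthesized utterance against reference embeddings. The abstract does not say how the embeddings are extracted or how scores are aggregated, so the function name `cosine_similarity` and the toy vectors are assumptions made for this example only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (real ones would come from a speaker or
# expressivity encoder and have many more dimensions).
synthesized = [0.9, 0.1, 0.3]
target_reference = [1.0, 0.0, 0.2]   # e.g. the reference for the desired attribute
other_reference = [0.0, 1.0, 0.1]    # e.g. the reference for a different attribute

score_target = cosine_similarity(synthesized, target_reference)
score_other = cosine_similarity(synthesized, other_reference)
print(f"target: {score_target:.3f}, other: {score_other:.3f}")
```

A higher score against the intended reference than against the others would suggest that the synthesized speech carries the intended attribute more strongly, under the assumptions above.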
Main file
Interspeech_2022_expressivity_transfert.pdf (238.19 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03832870, version 1 (28-10-2022)

Identifiers

  • HAL Id: hal-03832870, version 1

Cite

Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet. Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems. INTERSPEECH 2022, Sep 2022, Incheon, South Korea. ⟨hal-03832870⟩