Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems

The main objective of this work is to study expressivity transfer, in non-autoregressive end-to-end text-to-speech (TTS) systems, to a speaker's voice for which no expressive speech data is available. We investigated the expressivity transfer capability of probability density estimation based on deep generative models, namely Generative Flow (Glow) and diffusion probabilistic models (DPM). Using deep generative models provides better log-likelihood estimates and a tractable system, which in turn yields high-quality speech synthesis with faster inference. Furthermore, we propose the use of various expressivity encoders, which assist expressivity transfer in the TTS system. More precisely, we used self-attention statistical pooling and multi-scale expressivity encoder architectures to create a meaningful representation of expressivity. In addition to the traditional subjective metrics used for speech synthesis evaluation, we used cosine similarity to measure the strength of the attributes associated with speaker and expressivity. A non-autoregressive TTS system with a multi-scale expressivity encoder showed better expressivity transfer with both Glow-based and DPM-based decoders, illustrating the ability of the multi-scale architecture to capture the underlying attributes of expressivity from multiple acoustic features.
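
The abstract mentions cosine similarity as an objective measure of how strongly speaker and expressivity attributes are present. This page does not say which embeddings are compared or how they are extracted, so the following is only a minimal sketch of the metric itself, assuming that fixed-length embeddings of a reference utterance and of a synthesized utterance are already available. The function name and the example vectors are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for illustration only. In practice they
# would come from a speaker or expressivity encoder (an assumption here).
reference_embedding = [0.2, 0.7, -0.1, 0.4]
synthesized_embedding = [0.25, 0.6, -0.05, 0.5]

score = cosine_similarity(reference_embedding, synthesized_embedding)
print(f"cosine similarity: {score:.3f}")
```

Values close to 1 indicate that the two embeddings point in nearly the same direction. Under the assumption above, that would suggest the synthesized speech retains the attribute encoded by the embedding.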

Keywords

expressivity, generative models, text-to-speech

Domains

Artificial Intelligence [cs.AI]

Main file

Interspeech_2022_expressivity_transfert.pdf (238.19 KB)

Origin : Files produced by the author(s)

Contributor : Vincent Colotte

https://inria.hal.science/hal-03832870

Submitted on : Friday, October 28, 2022, 8:59:37 AM

Last modification on : Monday, September 11, 2023, 5:41:19 PM

Long-term archiving on : Sunday, January 29, 2023, 6:09:49 PM

Dates and versions

hal-03832870 , version 1 (28-10-2022)

Identifiers

HAL Id : hal-03832870 , version 1

Cite

Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet. Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems. INTERSPEECH 2022, Sep 2022, Incheon, South Korea. ⟨hal-03832870⟩

Collections

CNRS INRIA GRID5000 UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD SILECS
