Conference paper, Year: 2022

Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems

Abstract

The main objective of this work is to study expressivity transfer to a speaker's voice for which no expressive speech data is available, in non-autoregressive end-to-end TTS systems. We investigated the expressivity transfer capability of probability density estimation based on deep generative models, namely Generative Flow (Glow) and diffusion probabilistic models (DPM). Deep generative models provide better log-likelihood estimates and tractability, which in turn yields high-quality speech synthesis with faster inference. Furthermore, we propose the use of various expressivity encoders, which assist expressivity transfer in the text-to-speech (TTS) system. More precisely, we used self-attention statistical pooling and multi-scale expressivity encoder architectures to create a meaningful representation of expressivity. In addition to the traditional subjective metrics used for speech synthesis evaluation, we incorporated cosine similarity to measure the strength of attributes associated with speaker and expressivity. The non-autoregressive TTS system with a multi-scale expressivity encoder showed better expressivity transfer with both Glow- and DPM-based decoders, illustrating the ability of the multi-scale architecture to capture the underlying attributes of expressivity from multiple acoustic features.
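The two building blocks named in the abstract, self-attention statistical pooling and cosine similarity between embeddings, can be illustrated with a minimal NumPy sketch. This is a generic textbook formulation and not the authors' implementation: the function names, the randomly initialised weights `W`, `b`, `v`, the feature dimension `D`, and the synthetic frame sequences are all assumptions made for illustration, and the abstract does not specify how the paper's encoders or embedding extractors are built.

```python
import numpy as np


def attentive_stats_pooling(frames, W, b, v):
    """Pool a (T, D) sequence of frame features into one (2*D,) vector.

    A small scoring network assigns one attention weight per frame; the
    output concatenates the attention-weighted mean and the
    attention-weighted standard deviation of the frames.
    """
    scores = np.tanh(frames @ W + b) @ v           # one score per frame, (T,)
    scores = scores - scores.max()                 # stabilise the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()  # weights over time, (T,)
    mean = alpha @ frames                          # weighted mean, (D,)
    var = alpha @ (frames ** 2) - mean ** 2        # weighted variance, (D,)
    std = np.sqrt(np.clip(var, 1e-8, None))        # guard against round-off
    return np.concatenate([mean, std])


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical sizes: 80-dimensional frames, 16 attention units.
rng = np.random.default_rng(0)
D, H = 80, 16
W = rng.normal(size=(D, H)) / np.sqrt(D)  # untrained weights, illustration only
b = np.zeros(H)
v = rng.normal(size=H)

# Synthetic stand-ins for acoustic feature sequences of different lengths.
reference = rng.normal(size=(120, D))
synthesized = reference[:100] + 0.1 * rng.normal(size=(100, D))

emb_ref = attentive_stats_pooling(reference, W, b, v)
emb_syn = attentive_stats_pooling(synthesized, W, b, v)

# Similarity between the fixed-size embeddings of the two utterances.
sim = cosine_similarity(emb_ref, emb_syn)
print(f"cosine similarity: {sim:.3f}")
```

Pooling maps utterances of different lengths to fixed-size vectors, which is what makes a single cosine score between a reference and a synthesized utterance well defined. In an evaluation like the one described, the embeddings would come from trained speaker or expressivity models rather than random weights.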
Main file
Interspeech_2022_expressivity_transfert.pdf (238.19 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03832870, version 1 (28-10-2022)

Identifiers

  • HAL Id: hal-03832870, version 1

Cite

Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet. Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems. INTERSPEECH 2022, Sep 2022, Incheon, South Korea. ⟨hal-03832870⟩