Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems

The main objective of this work is to study expressivity transfer to the voice of a speaker for whom no expressive speech data is available, in non-autoregressive end-to-end text-to-speech (TTS) systems. We investigated the expressivity transfer capability of probability density estimation based on deep generative models, namely Generative Flow (Glow) and diffusion probabilistic models (DPM). Deep generative models provide better log-likelihood estimates and tractability, and in turn high-quality speech synthesis with faster inference. Furthermore, we propose several expressivity encoders that assist expressivity transfer in the TTS system. More precisely, we used self-attention statistical pooling and multi-scale expressivity encoder architectures to create a meaningful representation of expressivity. In addition to the traditional subjective metrics used for speech synthesis evaluation, we used cosine similarity to measure the strength of attributes associated with speaker and expressivity. The non-autoregressive TTS system with a multi-scale expressivity encoder showed better expressivity transfer with both Glow and DPM-based decoders, illustrating the ability of the multi-scale architecture to capture the underlying attributes of expressivity from multiple acoustic features.
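The cosine-similarity measure mentioned in the abstract can be illustrated with a minimal sketch. It assumes that fixed-size embedding vectors have already been extracted from reference and synthesized speech by some speaker or expressivity encoder; the function name and the toy vectors are hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two embedding vectors (range -1 to 1)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Guard against division by zero for all-zero vectors.
    denom = max(np.linalg.norm(a) * np.linalg.norm(b), eps)
    return float(np.dot(a, b) / denom)


# Hypothetical embeddings (assumption: real ones would come from an encoder).
reference_embedding = np.array([0.2, 0.9, -0.1])
synthesized_embedding = np.array([0.25, 0.8, -0.05])

score = cosine_similarity(reference_embedding, synthesized_embedding)
print(f"similarity: {score:.3f}")
```

A score closer to 1 would indicate that the synthesized speech embedding points in nearly the same direction as the reference embedding, which is how such a measure can stand in for the strength of a speaker or expressivity attribute.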

Keywords

expressivity, generative models, text-to-speech

Domains

Artificial Intelligence [cs.AI]

Main file

Interspeech_2022_expressivity_transfert.pdf (238.19 KB)

Origin: Files produced by the author(s)

Contributor: Vincent Colotte

https://inria.hal.science/hal-03832870

Submitted on: Friday, October 28, 2022, 08:59:37

Last modified on: Monday, September 11, 2023, 17:41:19

Long-term archiving on: Sunday, January 29, 2023, 18:09:49

Dates and versions

hal-03832870, version 1 (28-10-2022)

Identifiers

HAL Id: hal-03832870, version 1

Cite

Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet. Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems. INTERSPEECH 2022, Sep 2022, Incheon, South Korea. ⟨hal-03832870⟩

Collections

CNRS INRIA GRID5000 UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD SILECS

84 views

131 downloads