Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss et al., Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model, CoRR, 2017.

W. Ping, K. Peng, A. Gibiansky, S. Ö. Arık et al., Deep Voice 3: 2000-Speaker Neural Text-to-Speech, CoRR, 2017.

J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner et al., Char2Wav: End-to-End Speech Synthesis, ICLR, 2017.

Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop, 2017.

Y. Zhang, S. Pan, L. He, and Z. Ling, Learning latent representations for style control and transfer in end-to-end speech synthesis, ICASSP, 2019.

K. Akuzawa, Y. Iwasawa, and Y. Matsuo, Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder, Interspeech, 2018.

W. N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu et al., Hierarchical Generative Modeling for Controllable Speech Synthesis, ICLR, 2019.

Y. Wang, D. Stanton, Y. Zhang, R. J. Skerry-Ryan, E. Battenberg et al., Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis, ICML, 2018.

R. J. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton et al., Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron, ICML, 2018.

Y. Lee and T. Kim, Robust and Fine-grained Prosody Control of End-to-End Speech Synthesis, ICASSP, 2019.

J. Parker, Y. Stylianou, and R. Cipolla, Adaptation of an Expressive Single Speaker Deep Neural Network Speech Synthesis System, ICASSP, 2018.

A. Kulkarni, V. Colotte, and D. Jouvet, Layer adaptation for transfer of expressivity in speech synthesis, Language & Technology Conference (LTC), 2019.
URL : https://hal.archives-ouvertes.fr/hal-02177945

X. Lin, Y. Duan, Q. Dong et al., Deep Variational Metric Learning, European Conference on Computer Vision (ECCV), 2018.

M. Kaya and H. Ş. Bilge, Deep Metric Learning: A Survey, Symmetry, vol. 11, 2019.

K. Sohn, Improved Deep Metric Learning with Multi-class N-pair Loss Objective, NIPS, 2016.

D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, X-Vectors: Robust DNN Embeddings for Speaker Recognition, ICASSP, 2018.

D. P. Kingma and M. Welling, Auto-Encoding Variational Bayes, CoRR, abs/1312.6114, 2013.

D. J. Rezende, S. Mohamed, and D. Wierstra, Stochastic Backpropagation and Approximate Inference in Deep Generative Models, ICML, 2014.

S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz et al., Generating Sentences from a Continuous Space, SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2016.

J. S. Chung, A. Nagrani, and A. Zisserman, VoxCeleb2: Deep Speaker Recognition, Interspeech, 2018.

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek et al., The Kaldi Speech Recognition Toolkit, IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.

M. Morise, F. Yokomori, and K. Ozawa, WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications, IEICE Transactions on Information and Systems, 2016.

J. Yamagishi, P. Honnet, P. N. Garner, and A. Lazaridis, The SIWIS French Speech Synthesis Database, 2017.

A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. J. Clark et al., TUNDRA: a multilingual corpus of found data for TTS research created with light supervision, Interspeech, 2013.

R. C. Streijl, S. Winkler, and D. S. Hands, Mean Opinion Score (MOS) Revisited: Methods and Applications, Limitations and Alternatives, Multimedia Systems, vol. 22, 2016.

S. Dahmani, V. Colotte, V. Girard, and S. Ouni, Conditional Variational Auto-Encoder for Text-Driven Expressive AudioVisual Speech Synthesis, Interspeech, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02175776

Z. Wu, O. Watts, and S. King, Merlin: An Open Source Neural Network Speech Synthesis System, ISCA Speech Synthesis Workshop (SSW9), 2016.