Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss et al., Tacotron: A fully end-to-end text-to-speech synthesis model, arXiv, 2017.

W. Ping, K. Peng, A. Gibiansky, S. Ö. Arik, A. Kannan et al., Deep Voice 3: 2000-speaker neural text-to-speech, arXiv, 2017.

J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner et al., Char2Wav: End-to-end speech synthesis, ICLR Workshop, 2017.

Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, VoiceLoop: Voice fitting and synthesis via a phonological loop, ICLR, 2018.

D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, Semi-supervised learning with deep generative models, NIPS, 2014.

M. El-Kaddoury, A. Mahmoudi, and M. M. Himmi, Deep generative models for image generation: A practical comparison between variational autoencoders and generative adversarial networks, Mobile, Secure, and Programmable Networking, pp.1-8, 2019.

A. van den Oord, O. Vinyals, and K. Kavukcuoglu, Neural discrete representation learning, NIPS, 2017.

Y. Bengio, A. C. Courville, and P. Vincent, Representation learning: A review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp.1798-1828, 2013.

D. J. Rezende and S. Mohamed, Variational inference with normalizing flows, ICML, pp.1530-1538, 2015.

D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever et al., Improved variational inference with inverse autoregressive flow, Advances in Neural Information Processing Systems, pp.4743-4751, 2016.

A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals et al., Parallel WaveNet: Fast high-fidelity speech synthesis, ICML, pp.3915-3923, 2018.

P. Esling, N. Masuda, A. Bardet, R. Despres, and A. Chemla-Romeu-Santos, Universal audio synthesizer control with normalizing flows, arXiv, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02471340

W. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu et al., Hierarchical generative modeling for controllable speech synthesis, ICLR, 2019.

Y. Wang, D. Stanton, Y. Zhang, R. J. Skerry-Ryan, E. Battenberg et al., Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis, ICML, 2018.

R. J. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton et al., Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron, ICML, pp.4700-4709, 2018.

D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, X-vectors: Robust DNN embeddings for speaker recognition, ICASSP, pp.5329-5333, 2018.

Y. Zhang, S. Pan, L. He, and Z. Ling, Learning latent representations for style control and transfer in end-to-end speech synthesis, ICASSP, pp.6945-6949, 2019.

K. Akuzawa, Y. Iwasawa, and Y. Matsuo, Expressive speech synthesis via modeling expressions with variational autoencoder, Interspeech, pp.3067-3071, 2018.

Y. Lee and T. Kim, Robust and fine-grained prosody control of end-to-end speech synthesis, ICASSP, pp.5911-5915, 2019.

S. Dahmani, V. Colotte, V. Girard, and S. Ouni, Conditional variational auto-encoder for text-driven expressive audiovisual speech synthesis, Interspeech, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02175776

M. Kaya and H. Bilge, Deep metric learning: A survey, Symmetry, vol.11, p.1066, 2019.

K. Sohn, Improved deep metric learning with multi-class n-pair loss objective, NIPS, 2016.

X. Lin, Y. Duan, Q. Dong, J. Lu, and J. Zhou, Deep variational metric learning, ECCV, 2018.

Z. Wu, O. Watts, and S. King, Merlin: An open source neural network speech synthesis system, 9th ISCA Speech Synthesis Workshop (SSW9), 2016.

A. Kulkarni, V. Colotte, and D. Jouvet, Deep variational metric learning for transfer of expressivity in multispeaker text to speech, 2020.
URL : https://hal.archives-ouvertes.fr/hal-02573885

R. van den Berg, L. Hasenclever, J. M. Tomczak, and M. Welling, Sylvester normalizing flows for variational inference, UAI, pp.393-402, 2018.

J. S. Chung, A. Nagrani, and A. Zisserman, VoxCeleb2: Deep speaker recognition, Interspeech, 2018.

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek et al., The Kaldi speech recognition toolkit, IEEE ASRU, 2011.

J. Yamagishi, P. Honnet, P. N. Garner, and A. Lazaridis, The SIWIS French speech synthesis database, 2017.

A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. Clark et al., TUNDRA: A multilingual corpus of found data for TTS research created with light supervision, Interspeech, 2013.

M. Morise, F. Yokomori, and K. Ozawa, WORLD: A vocoder-based high-quality speech synthesis system for real-time applications, IEICE Transactions on Information and Systems, pp.1877-1884, 2016.

R. C. Streijl, S. Winkler, and D. S. Hands, Mean opinion score (MOS) revisited: Methods and applications, limitations and alternatives, Multimedia Systems, pp.213-227, 2014.