H. Abelson and G. J. Sussman, Structure and interpretation of computer programs, 1985.

G. Alain and Y. Bengio, What regularized auto-encoders learn from the data-generating distribution, Journal of Machine Learning Research, vol.15, issue.1, pp.3563-3593, 2014.

G. Alain, Y. Bengio, L. Yao, J. Yosinski, E. Thibodeau-Laufer et al., GSNs: generative stochastic networks, Information and Inference: A Journal of the IMA, vol.5, issue.2, pp.210-249, 2016.

P. W. Anderson, More is different, Science, vol.177, issue.4047, pp.393-396, 1972.

S. Arora, N. Cohen, and E. Hazan, On the optimization of deep networks: Implicit acceleration by overparameterization, 2018.

A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind, Automatic differentiation in machine learning: a survey, Journal of Machine Learning Research, vol.18, issue.153, 2018.

Y. Bengio, Learning deep architectures for AI, Foundations and Trends in Machine Learning, vol.2, pp.1-127, 2009.

Y. Bengio, A. Courville, and P. Vincent, Representation learning: A review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.8, pp.1798-1828, 2013.

Y. Bengio, L. Yao, G. Alain, and P. Vincent, Generalized denoising autoencoders as generative models, Advances in Neural Information Processing Systems, pp.899-907, 2013.

M. Carreira-Perpiñán, A review of mean-shift algorithms for clustering, 2015.

M. Carreira-Perpiñán and C. K. I. Williams, On the number of modes of a Gaussian mixture, International Conference on Scale-Space Theories in Computer Vision, pp.625-640, 2003.

R. Chaudhuri and I. Fiete, Associative content-addressable networks with exponentially many robust stable states, 2017.

Y. Cheng, Mean shift, mode seeking, and clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.17, issue.8, pp.790-799, 1995.

D. Comaniciu and P. Meer, Mean shift: A robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.24, issue.5, pp.603-619, 2002.

S. Elfwing, E. Uchibe, and K. Doya, Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, Neural Networks, 2018.

K. Fukunaga and L. Hostetler, The estimation of the gradient of a density function, with applications in pattern recognition, IEEE Transactions on Information Theory, vol.21, issue.1, pp.32-40, 1975.

C. Hillar and N. M. Tran, Robust exponential memory in Hopfield networks, 2014.

G. E. Hinton and T. J. Sejnowski, Learning and relearning in Boltzmann machines, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol.1, pp.282-317, 1986.

J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences, vol.79, issue.8, pp.2554-2558, 1982.

A. Hyvärinen, Estimation of non-normalized statistical models by score matching, Journal of Machine Learning Research, vol.6, pp.695-709, 2005.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

D. Koller and N. Friedman, Probabilistic graphical models: principles and techniques, 2009.

D. Krotov and J. J. Hopfield, Dense associative memory for pattern recognition, Advances in Neural Information Processing Systems, pp.1172-1180, 2016.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol.86, issue.11, pp.2278-2324, 1998.

M. Ledoux, The concentration of measure phenomenon, vol.89, 2001.

D. MacKay, Information theory, inference and learning algorithms, 2003.

W. S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity, The Bulletin of Mathematical Biophysics, vol.5, issue.4, pp.115-133, 1943.

K. Miyasawa, An empirical Bayes estimator of the mean of a normal population, Bulletin of the International Statistical Institute, vol.38, pp.181-188, 1961.

E. Parzen, On estimation of a probability density function and mode, The Annals of Mathematical Statistics, vol.33, issue.3, pp.1065-1076, 1962.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang et al., Automatic differentiation in PyTorch, 2017.

P. Ramachandran, B. Zoph, and Q. V. Le, Swish: a self-gated activation function, 2017.

M. Raphan and E. P. Simoncelli, Least squares estimation without priors or supervision, Neural Computation, vol.23, issue.2, pp.374-420, 2011.

H. Robbins, An empirical Bayes approach to statistics, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol.1, pp.157-163, 1956.

H. Robbins and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, pp.400-407, 1951.

S. Saremi, On approximating ∇f with neural networks, 2019.

S. Saremi, A. Mehrjou, B. Schölkopf, and A. Hyvärinen, Deep energy estimator networks, 2018.

L. K. Saul and S. T. Roweis, Think globally, fit locally: unsupervised learning of low dimensional manifolds, Journal of Machine Learning Research, vol.4, pp.119-155, 2003.

T. Tao, Topics in random matrix theory, 2012.

A. W. van der Vaart, Asymptotic statistics, 2000.

N. G. van Kampen, Stochastic processes in physics and chemistry, vol.1, 1992.

V. Vapnik, The nature of statistical learning theory, 1995.

R. Vershynin, High-dimensional probability: An introduction with applications in data science, 2018.