
The scikit-feature project.

M. Aharon, M. Elad, and A. Bruckstein, K-SVD : An algorithm for designing overcomplete dictionaries for sparse representation, IEEE Transactions on Signal Processing, vol.54, issue.11, pp.4311-4322, 2006.

R. K. Ando and T. Zhang, A framework for learning predictive structures from multiple tasks and unlabeled data, Journal of Machine Learning Research, vol.6, pp.1817-1853, 2005.

F. R. Bach, Consistency of the group lasso and multiple kernel learning, Journal of Machine Learning Research, vol.9, pp.1179-1225, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00164735

M. Balasubramanian and E. L. Schwartz, The isomap algorithm and topological stability, Science, vol.295, p.7, 2002.

P. Baldi, Autoencoders, unsupervised learning, and deep architectures, Proceedings of ICML workshop on unsupervised and transfer learning, pp.37-49, 2012.

P. Baldi and K. Hornik, Neural networks and principal component analysis : Learning from examples without local minima, Neural networks, vol.2, issue.1, pp.53-58, 1989.

A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra, Clustering on the unit hypersphere using von Mises-Fisher distributions, Journal of Machine Learning Research, vol.6, pp.1345-1382, 2005.

K. W. Bauer, S. G. Alsing, and K. A. Greene, Feature screening using signal-to-noise ratios, Neurocomputing, vol.31, pp.29-44, 2000.

Y. Bengio, Deep learning of representations for unsupervised and transfer learning, Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pp.17-36, 2012.

Y. Bengio, A. Courville, and P. Vincent, Representation learning : A review and new perspectives, IEEE transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.8, pp.1798-1828, 2013.

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, Occam's razor, Information Processing Letters, vol.24, pp.377-380, 1987.

I. Borg and P. Groenen, Modern multidimensional scaling : Theory and applications, Journal of Educational Measurement, vol.40, issue.3, pp.277-280, 2003.

Y. Boureau and Y. Lecun, Sparse feature learning for deep belief networks, Advances in neural information processing systems, pp.1185-1192, 2008.

J. Bromley, I. Guyon, Y. Lecun, E. Säckinger, and R. Shah, Signature verification using a siamese time delay neural network, Advances in neural information processing systems, pp.737-744, 1994.

A. M. Bronstein, M. M. Bronstein, and R. Kimmel, Generalized multidimensional scaling : a framework for isometry-invariant partial surface matching, Proceedings of the National Academy of Sciences, vol.103, issue.5, pp.1168-1172, 2006.

D. Cai, C. Zhang, and X. He, Unsupervised feature selection for multi-cluster data, 2010.

F. Camastra and A. Vinciarelli, Estimating the intrinsic dimension of data with a fractal-based method, IEEE Transactions on pattern analysis and machine intelligence, vol.24, issue.10, pp.1404-1407, 2002.

P. Campadelli, E. Casiraghi, C. Ceruti, and A. Rozza, Intrinsic dimension estimation : Relevant techniques and a benchmark framework, 2015.

E. J. Candès, X. Li, Y. Ma, and J. Wright, Robust principal component analysis ?, Journal of the ACM, vol.58, issue.3, 2011.

X. Chang, F. Nie, Y. Yang, and H. Huang, A convex formulation for semi-supervised multi-label feature selection, Twenty-eighth AAAI conference on artificial intelligence, 2014.

C. H. Chen, Handbook of pattern recognition and computer vision, 2015.

J. Chen, M. Stern, M. J. Wainwright, and M. I. Jordan, Kernel feature selection via conditional covariance minimization, Advances in Neural Information Processing Systems, pp.6946-6955, 2017.

E. Chávez, G. Navarro, R. Baeza-yates, and J. L. Marroquín, Searching in metric spaces, ACM computing surveys (CSUR), vol.33, issue.3, pp.273-321, 2001.

C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol.20, pp.273-297, 1995.

J. Costa and A. O. Hero, Geodesic entropic graphs for dimension and entropy estimation in manifold learning, IEEE Trans. on Signal Processing, vol.52, issue.8, pp.2210-2221, 2004.

T. M. Cover and J. A. Thomas, Elements of information theory, 2012.

J. A. Crowder and J. N. Carbone, Occam learning through pattern discovery : Computational mechanics in AI systems, Proceedings on the International Conference on Artificial Intelligence (ICAI), 2011.

R. K. De, N. R. Pal, and S. K. Pal, Feature analysis : neural network and fuzzy set theoretic approaches, Pattern Recognition, vol.30, issue.10, pp.1579-1590, 1997.

Google DeepMind, 2019.

L. Devroye, Sample-based non-uniform random variate generation, 1986.

F. Doshi-Velez and B. Kim, Towards a rigorous science of interpretable machine learning, 2017.

R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification, vol.2, 2000.

R. Díaz-Uriarte and S. Alvarez de Andrés, Gene selection and classification of microarray data using random forest, BMC Bioinformatics, vol.7, issue.1, 2006.

C. Eckart and G. Young, The approximation of one matrix by another of lower rank, Psychometrika, vol.1, issue.3, pp.211-218, 1936.

E. Facco, M. d'Errico, A. Rodriguez, and A. Laio, Estimating the intrinsic dimension of datasets by a minimal neighborhood information, Scientific Reports, vol.7, issue.1, 2017.

J. Feng and N. Simon, Sparse-input neural networks for high-dimensional nonparametric regression and classification, 2017.

R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of eugenics, vol.7, issue.2, pp.179-188, 1936.

FTA, Fair, transparent and accountable learning, 2018.

A. J. Gates, I. B. Wood, W. P. Hetrick, and Y. Y. Ahn, On comparing clusterings : an element-centric framework unifies overlaps and hierarchy, 2018.

R. Gaudel and M. Sebag, Feature selection as a one-player game, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00484049

Y. A. Ghassabeh, F. Rudzicz, and H. A. Moghaddam, Fast incremental LDA feature extraction, Pattern Recognition, vol.48, issue.6, pp.1999-2012, 2015.

X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, International conference on Artificial Intelligence and Statistics, pp.249-256, 2010.

A. S. Goldberger, Classical linear regression, Econometric Theory, pp.156-212, 1964.

G. H. Golub and C. Reinsch, Singular value decomposition and least squares solutions, Linear Algebra, pp.134-151, 1971.

A. N. Gorban, A. Golubkov, B. Grechuk, E. M. Mirkes, and I. Y. Tyukin, Correction of AI systems by linear discriminants : Probabilistic foundations, Information Sciences, vol.466, pp.303-322, 2018.

O. Goudet, D. Kalainathan, P. Caillou, I. Guyon, D. Lopez-Paz et al., Learning functional causal models with generative neural networks, Explainable and Interpretable Models in Computer Vision and Machine Learning, pp.39-80, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01649153

Q. Gu, Z. Li, and J. Han, Generalized Fisher score for feature selection, 2012.

I. Guyon and C. Aliferis, Causal feature selection. In Computational methods of feature selection, 2007.

I. Guyon and A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research, vol.3, pp.1157-1182, 2003.

I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Gene selection for cancer classification using support vector machines, Machine learning, vol.46, issue.1-3, pp.389-422, 2002.

J. A. Hartigan and M. A. Wong, Algorithm AS 136 : A k-means clustering algorithm, Journal of the Royal Statistical Society. Series C (Applied Statistics), vol.28, issue.1, pp.100-108, 1979.

S. Haykin, Neural networks : a comprehensive foundation, 1994.

X. He, D. Cai, and P. Niyogi, Laplacian score for feature selection, Advances in Neural Information Processing Systems, 2005.

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, 2012.

S. Hochreiter, The vanishing gradient problem during learning recurrent neural nets and problem solutions, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol.6, issue.02, pp.107-116, 1998.

T. Hofmann, B. Scholkopf, and A. J. Smola, Kernel Methods in Machine Learning, 2008.

J. Huang, Y. Cai, and X. Xu, A hybrid genetic algorithm for feature selection wrapper based on mutual information, Pattern Recognition Letters, vol.28, issue.13, pp.1825-1844, 2007.

S. Ivanoff, F. Picard, and V. Rivoirard, Adaptive lasso and group-lasso for functional poisson regression, The Journal of Machine Learning Research, vol.17, issue.1, pp.1903-1948, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01097914

R. Jenatton, J. Audibert, and F. Bach, Structured variable selection with sparsity-inducing norms, JMLR, vol.12, pp.2777-2824, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00377732

Y. Y. Jianya, An efficient implementation of shortest path algorithm based on Dijkstra algorithm, Journal of Wuhan Technical University of Surveying and Mapping (Wtusm), issue.004, p.3, 1999.

M. Kantardzic, Data Reduction, 2003.

M. J. Kearns and U. V. Vazirani, An introduction to computational learning theory, 1994.

H. Kim and A. Mnih, Disentangling by factorising, 2018.

D. P. Kingma and M. Welling, Auto-encoding variational Bayes, 2013.

K. Kira and L. A. Rendell, The feature selection problem : Traditional methods and a new algorithm, vol.2, pp.129-134, 1992.

R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, IJCAI, vol.14, pp.1137-1145, 1995.

R. Kohavi and G. H. John, Wrappers for feature subset selection, Artificial intelligence, vol.97, issue.1-2, pp.273-324, 1997.

I. Kononenko, Estimating attributes : analysis and extensions of Relief, European conference on machine learning, pp.171-182, 1994.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in neural information processing systems, pp.1097-1105, 2012.

T. O. Kvalseth, Entropy and correlation : Some comments, IEEE Transactions on Systems, Man, and Cybernetics, vol.17, issue.3, pp.517-519, 1987.

B. Kégl, Intrinsic dimension estimation using packing numbers, Advances in neural information processing systems, pp.697-704, 2003.

Y. Lecun, The next frontier in AI : Unsupervised learning, 2016.

P. Leray and P. Gallinari, Feature selection with neural networks, Behaviormetrika, vol.26, issue.1, pp.145-166, 1999.

E. Levina and P. J. Bickel, Maximum likelihood estimation of intrinsic dimension, Advances in neural information processing systems, pp.777-784, 2005.

B. Li, C. Wang, and D. S. Huang, Supervised feature extraction based on orthogonal discriminant projection, Neurocomputing, vol.73, issue.1-3, pp.191-196, 2009.

J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino et al., Feature selection : A data perspective, ACM Computing Surveys (CSUR), vol.50, issue.6, p.94, 2018.

J. Li and J. Liu, Challenges of feature selection for big data analytics, IEEE Intelligent Systems, vol.32, pp.9-15, 2017.

J. Li, J. Tang, and H. Liu, Reconstruction-based unsupervised feature selection : an embedded approach, Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017.

Y. Li, C. Y. Chen, and W. W. Wasserman, Deep feature selection : theory and application to identify enhancers and promoters, Journal of Computational Biology, vol.23, issue.5, pp.322-336, 2016.

Z. Li, J. Liu, Y. Yang, X. Zhou, and H. Lu, Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Transactions on Knowledge and Data Engineering, vol.26, issue.9, pp.2138-2150, 2014.

Z. Li, Y. Yang, Y. Liu, X. Zhou, and H. Lu, Unsupervised feature selection using non-negative spectral analysis, 2012.

B. Liu, X. Yu, P. Zhang, A. Yu, Q. Fu et al., Supervised deep feature extraction for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing, vol.56, issue.4, pp.1909-1921, 2018.

J. Liu and J. Ye, Moreau-Yosida regularization for grouped tree structure learning, NIPS, pp.1459-1467, 2010.

Z. Liu, Z. Guo, and M. Tan, Constructing tumor progression pathways and biomarker discovery with fuzzy kernel k-means and DNA methylation data, Cancer Informatics, p.6, 2008.

Z. Ma, F. Nie, Y. Yang, J. R. Uijlings, and N. Sebe, Web image annotation via subspace-sparsity collaborated feature selection, IEEE Trans. Multimedia, vol.14, issue.4, pp.1021-1030, 2012.

B. Mandelbrot, How long is the coast of Britain ? Statistical self-similarity and fractional dimension, Science, vol.156, issue.3775, pp.636-638, 1967.

B. B. Mandelbrot, The fractal geometry of nature, W. H. Freeman, vol.173, p.51, 1983.

A. M. Martinez and M. Zhu, Where are linear feature extraction methods applicable ?, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.27, pp.1934-1944, 2005.

R. H. McCuen, Z. Knight, and A. G. Cutter, Evaluation of the Nash-Sutcliffe efficiency index, Journal of Hydrologic Engineering, vol.11, issue.6, pp.597-602, 2006.

L. Meier, S. van de Geer, and P. Bühlmann, The group lasso for logistic regression, Journal of the Royal Statistical Society : Series B (Statistical Methodology), vol.70, issue.1, pp.53-71, 2008.

M. Meilă, Comparing clusterings by the variation of information, Learning theory and kernel machines, pp.173-187, 2003.

S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. R. Mullers, Fisher discriminant analysis with kernels, Neural networks for signal processing IX : Proceedings of the 1999 IEEE signal processing society workshop, pp.41-48, 1999.

M. Muja and D. G. Lowe, Scalable nearest neighbor algorithms for high dimensional data, IEEE transactions on pattern analysis and machine intelligence, vol.36, pp.2227-2240, 2014.

A. Y. Ng, M. I. Jordan, and Y. Weiss, On spectral clustering : Analysis and an algorithm, Advances in Neural Information Processing Systems, 2001.

F. Nie, W. Zhu, and X. Li, Unsupervised feature selection with structured graph optimization, AAAI, pp.1302-1308, 2016.

B. Olshausen and D. Field, Sparse coding with an overcomplete basis set : a strategy employed by V1 ?, Vision Research, vol.37, pp.3311-3325, 1997.

C. O'Neil, Weapons of math destruction : How big data increases inequality and threatens democracy, 2016.

R. Pascanu, T. Mikolov, and Y. Bengio, On the difficulty of training recurrent neural networks, International conference on machine learning, pp.1310-1318, 2013.

J. Pearl, Causal inference in statistics : An overview, Statistics surveys, vol.3, pp.96-146, 2009.

K. Pearson, On lines and planes of closest fit to systems of points in space, Philosophical Magazine, vol.2, issue.11, pp.559-572, 1901.

V. Pestov, On the geometry of similarity search : dimensionality curse and concentration of measure, 1999.

V. Pestov, Intrinsic dimension of a dataset : what properties does one expect ?, International Joint Conference on Neural Networks, pp.2959-2964, 2007.

J. Peters, D. Janzing, and B. Schölkopf, Elements of causal inference : foundations and learning algorithms, 2017.

K. W. Pettis, T. A. Bailey, A. K. Jain, and R. C. Dubes, An intrinsic dimensionality estimator from near-neighbor information, IEEE Trans. on PAMI, vol.1, pp.25-37, 1979.

C. Poultney, S. Chopra, and Y. Lecun, Efficient learning of sparse representations with an energy-based model, Advances in neural information processing systems, pp.1137-1144, 2007.

Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li et al., Variational autoencoder for deep learning of images, labels and captions, Advances in neural information processing systems, pp.2352-2360, 2016.

M. Qian and C. Zhai, Robust unsupervised feature selection, IJCAI, pp.1621-1627, 2013.

S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, Contractive auto-encoders : Explicit invariance during feature extraction, Proceedings of the 28th International Conference on International Conference on Machine Learning, pp.833-840, 2011.

S. T. Roweis and L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science, New Series, vol.290, pp.2323-2326, 2000.

D. Roy, K. S. Murty, and C. K. Mohan, Feature selection using deep neural networks, International Joint Conference on Neural Networks (IJCNN), 2015.

S. Ruder, An overview of gradient descent optimization algorithms, 2016.

S. R. Safavian and D. Landgrebe, A survey of decision tree classifier methodology, IEEE transactions on systems, man, and cybernetics, vol.21, issue.3, pp.660-674, 1991.

L. K. Saul and S. T. Roweis, Think globally, fit locally : unsupervised learning of low dimensional manifolds, Journal of machine learning research, vol.4, pp.119-155, 2003.

B. Scholkopf and A. J. Smola, Learning with kernels : support vector machines, regularization, optimization, and beyond, 2001.

R. Setiono and H. Liu, Neural-network feature selector, IEEE transactions on neural networks, vol.8, issue.3, pp.654-662, 1997.

J. Shi and J. Malik, Normalized cuts and image segmentation, 1997.

L. Shi, L. Du, and Y. D. Shen, Robust spectral learning for unsupervised feature selection, Data Mining (ICDM), pp.977-982, 2014.

N. Simon, J. Friedman, T. Hastie, and R. Tibshirani, A sparse-group lasso, Journal of Computational and Graphical Statistics, vol.22, issue.2, pp.231-245, 2013.

E. H. Simpson, The interpretation of interaction in contingency tables, Journal of the Royal Statistical Society, vol.13, pp.238-241, 1951.

J. L. Skeem and C. T. Lowenkamp, Risk, race, and recidivism : predictive bias and disparate impact, Criminology, vol.54, issue.4, pp.680-712, 2016.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, Dropout : a simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research, vol.15, issue.1, pp.1929-1958, 2014.

J. Steppe and K. W. Bauer, Improved feature screening in feedforward neural networks, Neurocomputing, vol.13, pp.47-58, 1996.

A. Strehl and J. Ghosh, Cluster ensembles : a knowledge reuse framework for combining multiple partitions, Journal of machine learning research, vol.3, pp.583-617, 2002.

J. B. Tenenbaum, V. D. Silva, and J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science, vol.290, issue.5500, pp.2319-2323, 2000.

R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society Series B (Methodological), pp.267-288, 1996.

W. S. Torgerson, Theory and methods of scaling, 1958.

G. V. Trunk, Statistical estimation of the intrinsic dimensionality of a noisy signal collection, IEEE Transactions on Computers, vol.100, issue.2, pp.165-171, 1976.

L. J. van der Maaten, E. O. Postma, and H. J. van den Herik, Dimensionality reduction : A comparative review, 2008.

D. Varga, A. Csiszárik, and Z. Zombori, Gradient regularization improves accuracy of discriminative models, 2017.

A. Verikas and M. Bacauskiene, Feature selection with neural networks, Pattern Recognition Letters, vol.23, issue.11, pp.1323-1335, 2002.

P. Verveer and R. Duin, An evaluation of intrinsic dimensionality estimators, IEEE Trans. on PAMI, vol.17, issue.1, pp.81-86, 1995.

P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol, Extracting and composing robust features with denoising autoencoders, Proceedings of the 25th international conference on Machine learning, pp.1096-1103, 2008.

P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol, Stacked denoising autoencoders : Learning useful representations in a deep network with a local denoising criterion, Journal of machine learning research, vol.11, pp.3371-3408, 2010.

N. X. Vinh, J. Epps, and J. Bailey, Information theoretic measures for clusterings comparison, Proceedings of the 26th Annual International Conference on Machine Learning, 2009.

N. X. Vinh, J. Epps, and J. Bailey, Information theoretic measures for clusterings comparison : Variants, properties, normalization and correction for chance, 2010.

U. von Luxburg, A tutorial on spectral clustering, Statistics and Computing, vol.17, issue.4, pp.395-416, 2007.

D. Wang, F. Nie, and H. Huang, Unsupervised feature selection via unified trace ratio formulation and k-means clustering (track), Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp.306-321, 2014.

B. Xu, N. Wang, T. Chen, and M. Li, Empirical evaluation of rectified activations in convolutional network, 2015.

Y. Y. Yao, Information-theoretic measures for knowledge discovery and data mining, Entropy measures, maximum entropy principle and emerging applications, pp.115-136, 2003.

M. Ye and Y. Sun, Variable selection via penalized neural network : a drop-out-one loss approach, International Conference on Machine Learning, pp.5616-5625, 2018.

S. Yu and J. Shi, Multiclass spectral clustering, 2003.

M. Yuan and Y. Lin, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society : Series B (Statistical Methodology), vol.68, issue.1, pp.49-67, 2007.

M. L. Zhang and Z. H. Zhou, ML-kNN : A lazy learning approach to multi-label learning, Pattern Recognition, vol.40, issue.7, pp.2038-2048, 2007.

M. L. Zhang and Z. H. Zhou, A review on multi-label learning algorithms, IEEE transactions on knowledge and data engineering, vol.26, issue.8, pp.1819-1837, 2013.

H. Zhao, S. Sun, Z. Jing, and J. Yang, Local structure based supervised feature extraction, Pattern Recognition, vol.39, issue.8, pp.1546-1550, 2006.

Z. Zhao and H. Liu, Spectral feature selection for supervised and unsupervised learning, 2007.

Z. Zhao and H. Liu, Multi-source feature selection via geometry-dependent covariance analysis, New Challenges for Feature Selection in Data Mining and Knowledge Discovery, pp.36-47, 2008.

Z. Zhao, L. Wang, H. Liu, Y. , and J. , On similarity preserving feature selection, IEEE Transactions on Knowledge and Data Engineering, vol.25, issue.3, pp.619-632, 2013.

J. M. Zurada, A. Malinowski, and S. Usui, Perturbation method for deleting redundant inputs of perceptron networks, Neurocomputing, vol.14, pp.177-193, 1997.