Theses

Agnostic Feature Selection

Abstract : With the advent of Big Data, databases whose size far exceeds the human scale are becoming increasingly common. The resulting overabundance of monitored variables (friends on a social network, movies watched, nucleotides coding the DNA, monetary transactions...) has motivated the development of Dimensionality Reduction (DR) techniques. A DR algorithm such as Principal Component Analysis (PCA) or an AutoEncoder typically combines the original variables into a smaller number of new features, such that most of the information in the dataset is conveyed by the extracted feature set. A particular subcategory of DR is formed by Feature Selection (FS) methods, which directly retain the most important initial variables. How to select the best candidates is a hot topic at the crossroads of statistics and Machine Learning. Feature importance is usually inferred in a supervised context, where variables are ranked according to their usefulness for predicting a specific target feature. The present thesis focuses on the unsupervised context in FS, i.e., the challenging situation where no prediction goal is available to help assess feature relevance. Instead, unsupervised FS algorithms usually build an artificial classification goal and rank features based on their helpfulness for predicting this new target, thus falling back on the supervised context. Additionally, the efficiency of unsupervised FS approaches is typically also assessed in a supervised setting. In this work, we propose an alternative model combining unsupervised FS with data compression. Our Agnostic Feature Selection (AgnoS) algorithm does not rely on creating an artificial target and aims to retain a feature subset sufficient to recover the whole original dataset, rather than a specific variable. As a result, AgnoS does not suffer from the selection bias inherent to clustering-based techniques. The second contribution of this work (Agnostic Feature Selection, G. Doquet and M. Sebag, ECML PKDD 2019) is to establish both the brittleness of the standard supervised evaluation of unsupervised FS, and the stability of the newly proposed AgnoS.
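
The idea of retaining a feature subset sufficient to recover the whole dataset can be illustrated with a small sketch. This is not the AgnoS algorithm: as a simplifying assumption, it swaps in ordinary linear least squares for the reconstruction and a greedy search for the selection. The function names `reconstruction_error` and `greedy_select` and the toy data are hypothetical, chosen only for illustration.

```python
import numpy as np


def reconstruction_error(X, subset):
    """Mean squared error when every column of X is predicted
    linearly from the columns listed in `subset`."""
    S = X[:, subset]
    # Least-squares weights W minimizing ||S @ W - X||^2.
    W, *_ = np.linalg.lstsq(S, X, rcond=None)
    return float(np.mean((S @ W - X) ** 2))


def greedy_select(X, k):
    """Greedily pick k column indices that best reconstruct all columns.

    Assumption: a linear model and greedy search stand in for the
    compression model described in the abstract.
    """
    Xc = X - X.mean(axis=0)  # center so that no intercept term is needed
    selected = []
    remaining = list(range(Xc.shape[1]))
    for _ in range(k):
        # Add the feature whose inclusion yields the lowest error.
        best = min(remaining,
                   key=lambda j: reconstruction_error(Xc, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: five observed variables driven by three independent sources.
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 200))
X = np.column_stack([a, b, a + b, 2 * a - b, c])

selected = greedy_select(X, 3)
error = reconstruction_error(X - X.mean(axis=0), selected)
```

In this toy setting no target variable is involved: a subset is judged only by how well it recovers every column, so the independent column `c` has to be kept while two of the four mutually redundant columns can be dropped.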

Cited literature : 160 references

https://hal.archives-ouvertes.fr/tel-02436845
Contributor : Guillaume Florent Doquet
Submitted on : Monday, January 13, 2020 - 2:13:56 PM
Last modification on : Wednesday, October 14, 2020 - 3:56:44 AM
Long-term archiving on : Tuesday, April 14, 2020 - 3:11:52 PM

File

these_doquet.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : tel-02436845, version 1

Citation

Guillaume Florent Doquet. Agnostic Feature Selection. Artificial Intelligence [cs.AI]. Université Paris-Saclay/Université Paris-Sud, 2019. English. ⟨tel-02436845⟩

Metrics

Record views

136

Files downloads

267