Beyond Two-sample-tests: Localizing Data Discrepancies in High-dimensional Spaces

Frédéric Cazals (1), Alix Lhéritier (1)
(1) ABS - Algorithms, Biology, Structure
CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract: Comparing two sets of multivariate samples is a central problem in data analysis. From a statistical standpoint, the simplest way to perform such a comparison is to resort to a non-parametric two-sample test (TST), which checks whether the two sets can be seen as i.i.d. samples of an identical unknown distribution (the null hypothesis). If the null is rejected, one wishes to identify regions accounting for this difference. This paper presents a two-stage method providing feedback on this difference, based upon a combination of statistical learning (regression) and computational topology methods. Consider two populations, each given as a point cloud in R^d. In the first step, we assign a label to each set and we compute, for each sample point, a discrepancy measure based on comparing an estimate of the conditional probability distribution of the label given a position versus the global unconditional label distribution. In the second step, we study the height function defined at each point by the aforementioned estimated discrepancy. Topological persistence is used to identify persistent local minima of this height function, their basins defining regions of points with high discrepancy and in spatial proximity. Experiments are reported both on synthetic and real data (satellite images and handwritten digit images), ranging in dimension from d = 2 to d = 784, illustrating the ability of our method to localize discrepancies. From a general perspective, the ability to provide feedback downstream of a TST may prove of ubiquitous interest in exploratory statistics and data science.
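The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: a k-nearest-neighbor estimate (with Laplace smoothing) of the conditional label probability, the Kullback-Leibler divergence as the discrepancy, a height function taken as the negated discrepancy so that minima mark high discrepancy, and a neighborhood graph supplied as adjacency lists. The function names `discrepancy` and `persistent_minima` are hypothetical.

```python
import math

def discrepancy(points, labels, k):
    """Step 1 (sketch): per-point KL divergence between a k-NN estimate of
    P(label | position) and the global P(label). Labels are 0 or 1."""
    n = len(points)
    p_global = sum(labels) / n  # unconditional P(label = 1)
    scores = []
    for i in range(n):
        # k nearest neighbors of point i, excluding i itself (brute force)
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))[:k]
        # Laplace smoothing keeps the estimate strictly inside (0, 1)
        p_local = (sum(labels[j] for j in others) + 1) / (k + 2)
        scores.append(p_local * math.log(p_local / p_global)
                      + (1 - p_local) * math.log((1 - p_local) / (1 - p_global)))
    return scores

def persistent_minima(height, neighbors):
    """Step 2 (sketch): 0-dimensional sublevel-set persistence on a graph.
    Returns {index of local minimum: persistence}; the minimum of each
    connected component never dies and gets math.inf."""
    parent = {}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    persistence = {}
    # Sweep vertices by increasing height; each root is its component's minimum
    for v in sorted(range(len(height)), key=lambda i: height[i]):
        parent[v] = v
        for u in neighbors[v]:
            if u not in parent:
                continue  # neighbor not yet swept
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            old, young = (ru, rv) if height[ru] <= height[rv] else (rv, ru)
            if young != v:
                # two basins meet at v: the shallower minimum dies here
                persistence[young] = height[v] - height[young]
            parent[young] = old
    for v in parent:
        if find(v) == v:
            persistence[v] = math.inf
    return persistence

# Toy 1-D data: a mixed region (alternating labels) and two pure regions
points = ([(float(x),) for x in range(10)]
          + [(100.0 + x,) for x in range(5)]
          + [(200.0 + x,) for x in range(5)])
labels = [x % 2 for x in range(10)] + [1] * 5 + [0] * 5
scores = discrepancy(points, labels, k=4)
height = [-s for s in scores]  # assumed convention: minima = high discrepancy
```

In this toy example the points in the two pure regions receive a larger discrepancy than those in the mixed region. Passing `height` and a neighborhood graph over the sample points to `persistent_minima` would then rank the local minima by persistence, with low-persistence minima treated as noise.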
Document type:
Conference paper
P. Gallinari and J. Kwok and G. Pasi and O. Zaiane. IEEE/ACM International Conference on Data Science and Advanced Analytics, Oct 2015, Paris, France. IEEE/ACM International Conference on Data Science and Advanced Analytics, pp.29, 2015, IEEE/ACM International Conference on Data Science and Advanced Analytics

https://hal.inria.fr/hal-01245408
Contributor: Frederic Cazals
Submitted on: Thursday, December 17, 2015 - 10:44:07
Last modified on: Thursday, January 11, 2018 - 16:48:48

Identifiers

  • HAL Id : hal-01245408, version 1

Citation

Frédéric Cazals, Alix Lhéritier. Beyond Two-sample-tests: Localizing Data Discrepancies in High-dimensional Spaces. P. Gallinari and J. Kwok and G. Pasi and O. Zaiane. IEEE/ACM International Conference on Data Science and Advanced Analytics, Oct 2015, Paris, France. IEEE/ACM International Conference on Data Science and Advanced Analytics, pp.29, 2015, IEEE/ACM International Conference on Data Science and Advanced Analytics. 〈hal-01245408〉
