An Abstract Interpretation-Based Data Leakage Static Analysis

Filip Drobnjaković; Pavle Subotić; Caterina Urban

Report (Technical Report) Year: 2022


Filip Drobnjaković

Role: Author

Microsoft Development Center Serbia

Pavle Subotić

Role: Author

Microsoft Development Center Serbia

Caterina Urban

Role: Author
PersonId: 1061085
IdHAL: caterina

Département d'informatique - ENS Paris

Analyse Statique par Interprétation Abstraite

Abstract

Data leakage is a well-known problem in machine learning which occurs when the training and testing datasets are not independent. This phenomenon leads to overly optimistic accuracy estimates at training time, followed by a significant drop in performance when models are deployed in the real world. This can be dangerous, notably when models are used for risk prediction in high-stakes applications. In this paper, we propose an abstract interpretation-based static analysis to prove the absence of data leakage. We implemented it in the NBLyzer framework and we demonstrate its performance and precision on 2111 Jupyter notebooks from the Kaggle competition platform.
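To make the notion of non-independent training and testing data concrete, here is a minimal sketch of one common leakage pattern: normalizing a dataset before splitting it. It is not the static analysis proposed in the report and does not use NBLyzer; it is a small dynamic check, and every name in it (`leaky_pipeline`, `clean_pipeline`, `train_depends_on_test`) is hypothetical and chosen for illustration only.

```python
import random
import statistics


def standardize(values, mean, stdev):
    """Scale values to zero mean and unit variance using the given statistics."""
    return [(v - mean) / stdev for v in values]


def leaky_pipeline(data, n_train):
    # Leak: the statistics are computed over ALL rows, including the rows
    # that later become the test set, so the training features depend on them.
    mean = statistics.mean(data)
    stdev = statistics.pstdev(data)
    scaled = standardize(data, mean, stdev)
    return scaled[:n_train], scaled[n_train:]


def clean_pipeline(data, n_train):
    # No leak: split first, then compute the statistics on the training rows only.
    train, test = data[:n_train], data[n_train:]
    mean = statistics.mean(train)
    stdev = statistics.pstdev(train)
    return standardize(train, mean, stdev), standardize(test, mean, stdev)


def train_depends_on_test(pipeline, data, n_train):
    """Perturb only the test rows and report whether the training features change."""
    perturbed = data[:n_train] + [v + 100.0 for v in data[n_train:]]
    train_a, _ = pipeline(data, n_train)
    train_b, _ = pipeline(perturbed, n_train)
    return train_a != train_b


rng = random.Random(0)  # fixed seed so the run is reproducible
data = [rng.gauss(0.0, 1.0) for _ in range(20)]
leaky_flag = train_depends_on_test(leaky_pipeline, data, 15)
clean_flag = train_depends_on_test(clean_pipeline, data, 15)
print("leaky pipeline:", leaky_flag, "| clean pipeline:", clean_flag)
```

In the leaky variant, changing only the test rows changes the training features, which is the kind of dependence between the two datasets that the abstract describes. A static analysis aims to establish the absence of such a dependence without running the notebook.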

Domains

Programming Languages [cs.PL]

Main file

main (1).pdf (838.21 KB)

Origin: Files produced by the author(s)

Contributor: Caterina Urban

https://inria.hal.science/hal-03926245

Submitted on: Tuesday, April 23, 2024, 15:50:03

Last modified on: Thursday, April 25, 2024, 03:35:49

Dates and versions

hal-03926245, version 1 (09-01-2023)

hal-03926245, version 2 (23-04-2024)

License

Attribution

Identifiers

HAL Id: hal-03926245, version 2

Cite

Filip Drobnjaković, Pavle Subotić, Caterina Urban. An Abstract Interpretation-Based Data Leakage Static Analysis. Microsoft Research; Inria Paris; École Normale Supérieure. 2022. ⟨hal-03926245v2⟩


Collections

ENS-PARIS CNRS INRIA INRIA2 LARA PSL

32 views

32 downloads
