An Abstract Interpretation-Based Data Leakage Static Analysis - Inria - Institut national de recherche en sciences et technologies du numérique
Report (Technical Report) Year: 2022

An Abstract Interpretation-Based Data Leakage Static Analysis

Abstract

Data leakage is a well-known problem in machine learning which occurs when the training and testing datasets are not independent. This phenomenon leads to overly optimistic accuracy estimates at training time, followed by a significant drop in performance when models are deployed in the real world. This can be dangerous, notably when models are used for risk prediction in high-stakes applications. In this paper, we propose an abstract interpretation-based static analysis to prove the absence of data leakage. We implemented it in the NBLyzer framework and we demonstrate its performance and precision on 2111 Jupyter notebooks from the Kaggle competition platform.
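
To make the phenomenon concrete, the following is a minimal illustrative sketch of one common source of leakage: computing normalization statistics over the whole dataset before splitting it into training and test parts. It is not the report's analysis and does not use NBLyzer; the function names (`leaky_pipeline`, `clean_pipeline`) and the toy data are hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev

def standardize(values, mu, sigma):
    """Scale values using the given mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

def leaky_pipeline(data, n_train):
    """Normalize BEFORE splitting: the statistics depend on test rows."""
    mu, sigma = mean(data), pstdev(data)
    scaled = standardize(data, mu, sigma)
    return scaled[:n_train], scaled[n_train:]

def clean_pipeline(data, n_train):
    """Split first, then fit the statistics on the training rows only."""
    train, test = data[:n_train], data[n_train:]
    mu, sigma = mean(train), pstdev(train)
    return standardize(train, mu, sigma), standardize(test, mu, sigma)

# Hypothetical toy data: the last row plays the role of the test set.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
altered = [1.0, 2.0, 3.0, 4.0, -100.0]  # only the test row differs

leaky_a, _ = leaky_pipeline(data, 4)
leaky_b, _ = leaky_pipeline(altered, 4)
clean_a, _ = clean_pipeline(data, 4)
clean_b, _ = clean_pipeline(altered, 4)

# If training features change when only test data changes, the two
# sets are not independent.
train_depends_on_test_leaky = leaky_a != leaky_b
train_depends_on_test_clean = clean_a != clean_b
print(train_depends_on_test_leaky, train_depends_on_test_clean)
```

In the leaky variant, changing only the held-out row changes the training features, so the training and test sets are not independent. In the clean variant the training features are unaffected. This sketch only exhibits the dependency by running the code on two inputs, whereas the report proposes a static analysis to prove the absence of data leakage.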
Main file
main (1).pdf (838.21 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03926245, version 1 (09-01-2023)
hal-03926245, version 2 (23-04-2024)

License

Attribution

Identifiers

  • HAL Id: hal-03926245, version 2

Cite

Filip Drobnjaković, Pavle Subotić, Caterina Urban. An Abstract Interpretation-Based Data Leakage Static Analysis. Microsoft Research; Inria Paris; École Normale Supérieure. 2022. ⟨hal-03926245v2⟩
