Abstract Interpretation-Based Data Leakage Static Analysis

Filip Drobnjaković; Pavle Subotić; Caterina Urban

Report (Technical Report) Year: 2022


Filip Drobnjaković

Role: Author

Microsoft Development Center Serbia

Pavle Subotić

Role: Author

Microsoft Development Center Serbia

Caterina Urban

Role: Author
PersonId: 1061085
IdHAL: caterina

Département d'informatique - ENS Paris

Analyse Statique par Interprétation Abstraite

Abstract

Data leakage is a well-known problem in machine learning. It occurs when information from outside the training dataset is used to create a model. This renders a model overly optimistic or even useless in the real world, since the model tends to rely heavily on the unfairly acquired information. To date, data leakage is detected post-mortem using run-time methods. However, due to its insidious nature, it may not be apparent to a data scientist that data leakage has occurred in the first place. For this reason, it is advantageous to detect data leakage as early as possible in the development life cycle. In this paper, we propose a novel static analysis to detect several instances of data leakage at development time. We define our analysis using the framework of abstract interpretation: we define a concrete semantics that is sound and complete, from which we derive a sound and computable abstract semantics. We implement our static analysis inside the open-source NBLyzer static analysis framework and demonstrate its utility by evaluating its performance and precision on over 2000 Kaggle competition notebooks.
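
To make the notion of "information from outside the training dataset" concrete, here is a minimal Python sketch of one commonly cited form of leakage: computing a preprocessing statistic on the full dataset before the train/test split. It is an illustration only and is not taken from the report; the toy data and the helper `center` are assumptions made up for this example, and it does not reproduce the static analysis itself.

```python
from statistics import mean


def center(values, reference):
    """Subtract the mean of `reference` from every element of `values`.

    Hypothetical helper used only for this illustration.
    """
    mu = mean(reference)
    return [v - mu for v in values]


# Toy dataset (made up): the last two rows are held out for testing.
data = [1.0, 2.0, 3.0, 4.0, 100.0, 200.0]
train, test = data[:4], data[4:]

# Leaky: the centering statistic is computed on ALL rows, so the
# held-out rows influence the features the model is trained on.
leaky_train = center(train, reference=data)

# Clean: the statistic is computed on the training rows only.
clean_train = center(train, reference=train)

# Swapping in different test rows changes the leaky training features,
# which shows that test information flowed into training.
other_data = train + [0.0, 0.0]
leaky_train_other = center(train, reference=other_data)
influenced_by_test = leaky_train != leaky_train_other
```

In the leaky variant the training features depend on the held-out rows, whereas the clean variant depends on the training rows alone. A run-time check like `influenced_by_test` only reveals the problem after execution, which is the gap the abstract says a development-time analysis is meant to close.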

Domains

Programming Languages [cs.PL]

Main file

main.pdf (1.16 MB)

Contributor: Caterina Urban

https://inria.hal.science/hal-03926245

Submitted on: Monday, January 9, 2023 - 15:37:22

Last modified on: Thursday, April 25, 2024 - 03:35:49

Long-term archiving on: Monday, April 10, 2023 - 18:10:04

Dates and versions

hal-03926245, version 1 (2023-01-09)

hal-03926245, version 2 (2024-04-23)

Identifiers

HAL Id: hal-03926245, version 1

Cite

Filip Drobnjaković, Pavle Subotić, Caterina Urban. Abstract Interpretation-Based Data Leakage Static Analysis. Microsoft Research; Inria Paris; École Normale Supérieure. 2022. ⟨hal-03926245v1⟩

