Journal article, IEEE Transactions on Information Forensics and Security, Year: 2022

Debiasing Android Malware Datasets: How Can I Trust Your Results If Your Dataset Is Biased?

Abstract

Android security, and malware investigation in particular, has received considerable attention over the last decade. Researchers attempt to highlight applications' security-relevant characteristics to better understand malware and to distinguish it effectively from benign applications. The accuracy and completeness of their proposals are evaluated experimentally on malware and goodware datasets. The quality of these datasets is therefore of critical importance: if the datasets are outdated or not representative of the studied population, the conclusions may be flawed. We specify different types of experimental scenarios. Some of them require unlabeled datasets that are representative of the entire population. Others require datasets labeled with valuable characteristics that may be difficult to compute, such as malware datasets. We discuss the irregularities of the datasets used in experiments, which call into question the validity of the performance reported in the literature. This article focuses on providing guidelines for designing debiased datasets. First, we propose guidelines for building representative datasets from unlabeled ones. Second, we propose and evaluate a debiasing algorithm that, given a biased labeled dataset and a target representative dataset, builds a dataset that is both representative and labeled. Finally, from these debiased datasets, we produce datasets for experiments on Android malware detection or classification with machine learning algorithms. Experiments show that machine learning classification performs better with debiased datasets.
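To make the input/output contract of the debiasing step concrete, here is a minimal sketch in Python. It is not the algorithm from the article, whose details the abstract does not give; it swaps in plain stratified resampling as an illustration. The function name `debias_by_stratified_sampling`, the `year` characteristic, and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def debias_by_stratified_sampling(labeled, target, key, size, seed=0):
    """Draw about `size` items from `labeled` so that the distribution of
    `key` approximates its distribution in `target`.

    Illustrative stand-in only: NOT the article's debiasing algorithm.
    """
    rng = random.Random(seed)

    # Share of each stratum (value of the characteristic) in the target dataset.
    target_counts = Counter(key(item) for item in target)
    total = sum(target_counts.values())

    # Group the labeled items by the same characteristic.
    by_stratum = defaultdict(list)
    for item in labeled:
        by_stratum[key(item)].append(item)

    sample = []
    for stratum, count in target_counts.items():
        quota = round(size * count / total)
        pool = by_stratum.get(stratum, [])
        # Sampling without replacement: never take more than the pool offers.
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample


# Hypothetical toy data: the labeled set over-represents 2018,
# while the representative target is split evenly between 2018 and 2019.
labeled = [{"year": 2018, "label": "malware"} for _ in range(90)]
labeled += [{"year": 2019, "label": "malware"} for _ in range(10)]
target = [{"year": 2018} for _ in range(10)] + [{"year": 2019} for _ in range(10)]

sample = debias_by_stratified_sampling(labeled, target, key=lambda a: a["year"], size=10)
year_counts = Counter(item["year"] for item in sample)
```

In this toy run the sample keeps the labels from the biased set while following the even year split of the target. A real setting would involve several characteristics at once and strata with too few labeled items, which this sketch handles only by taking whatever is available.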

Main file
Debiasing_Android_Malware_Datasets_How_Can_I_Trust_Your_Results_If_Your_Dataset_Is_Biased-1.pdf (1.78 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03700082, version 1 (20-06-2022)

License

Attribution

Cite

Tomás Concepción Miranda, Pierre-Francois Gimenez, Jean-François Lalande, Valérie Viet Triem Tong, Pierre Wilke. Debiasing Android Malware Datasets: How Can I Trust Your Results If Your Dataset Is Biased?. IEEE Transactions on Information Forensics and Security, 2022, 17, pp.2182-2197. ⟨10.1109/tifs.2022.3180184⟩. ⟨hal-03700082⟩