Performance and explainability of feature selection-boosted tree-based classifiers for COVID-19 detection

Jesús Rufino; Juan Marcos Ramírez; Jose Aguilar; Carlos Baquero; Jaya Champati; Davide Frey; Rosa Elvira Lillo; Antonio Fernández-Anta

doi:10.1016/j.heliyon.2023.e23219

Journal article, Heliyon. Year: 2024


Jesús Rufino

Role: Author

Institute IMDEA Networks [Madrid]

Juan Marcos Ramírez

Role: Author
PersonId : 1338820
ORCID : 0000-0003-0000-1073

Institute IMDEA Networks [Madrid]

Jose Aguilar

Role: Author

Institute IMDEA Networks [Madrid]

Carlos Baquero

Role: Author

Universidade do Minho = University of Minho [Braga]

Jaya Champati

Role: Author
PersonId : 1338821
ORCID : 0000-0002-5127-8497

Institute IMDEA Networks [Madrid]

Davide Frey

Role: Author
PersonId : 842
IdHAL : dfrey
ORCID : 0000-0002-6730-5744
IdRef : 195962060

the World Is Distributed Exploring the tension between scale and coordination

Rosa Elvira Lillo

Role: Author
PersonId : 1338822
ORCID : 0000-0003-0802-4691

Universidad Carlos III de Madrid [Madrid]

Antonio Fernández-Anta

Role: Author
PersonId : 1186791
ORCID : 0000-0001-6501-2377

Universidad Carlos III de Madrid [Madrid]

Abstract

In this paper, we evaluate the performance and analyze the explainability of machine learning models boosted by feature selection in predicting COVID-19-positive cases from self-reported information. In essence, this work describes a methodology to identify COVID-19 infections that considers the large amount of information collected by the University of Maryland Global COVID-19 Trends and Impact Survey (UMD-CTIS). More precisely, this methodology performs a feature selection stage based on the recursive feature elimination (RFE) method to reduce the number of input variables without compromising detection accuracy. A tree-based supervised machine learning model is then optimized with the selected features to detect COVID-19 active cases. In contrast to previous approaches that use a limited set of selected symptoms, the proposed approach builds the detection engine considering a broad range of features including self-reported symptoms, local community information, vaccination acceptance, and isolation measures, among others. To implement the methodology, three different supervised classifiers were used: random forests (RF), light gradient boosting (LGB), and extreme gradient boosting (XGB). Based on data collected from the UMD-CTIS, we evaluated the detection performance of the methodology for four countries (Brazil, Canada, Japan, and South Africa) and two periods (2020 and 2021). The proposed approach was assessed in terms of various quality metrics: F1-score, sensitivity, specificity, precision, receiver operating characteristic (ROC), and area under the ROC curve (AUC). This work also shows the normalized daily incidence curves obtained by the proposed approach for the four countries. Finally, we perform an explainability analysis using Shapley values and feature importance to determine the relevance of each feature and the corresponding contribution for each country and each country/year.
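The quality metrics named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' evaluation code: the function names (`confusion_counts`, `detection_metrics`, `roc_auc`), the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration. Label 1 stands for a COVID-19-positive respondent.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def detection_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, precision and F1-score from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and classifier scores for 8 respondents.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = detection_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Sensitivity, specificity, precision and F1-score depend on the chosen threshold, whereas the AUC summarizes ranking quality over all thresholds, which is one reason to report both kinds of metric.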

Keywords

COVID-19 Detection; Explainability Analysis; Gradient Boosting Classifiers; Random Forest; Recursive Feature Elimination; Shapley Values

Domains

Computer Science [cs]

Main file

main.pdf (11.26 MB)

Origin: Files produced by the author(s)


https://inria.hal.science/hal-04406767

Submitted on: Friday, January 19, 2024, 21:33:59

Last modified on: Thursday, February 1, 2024, 14:26:32

Dates and versions

hal-04406767 , version 1 (19-01-2024)

License

Attribution

Identifiers

HAL Id : hal-04406767 , version 1
DOI : 10.1016/j.heliyon.2023.e23219
PUBMED : 38170121

Cite

Jesús Rufino, Juan Marcos Ramírez, Jose Aguilar, Carlos Baquero, Jaya Champati, et al. Performance and explainability of feature selection-boosted tree-based classifiers for COVID-19 detection. Heliyon, 2024, 10 (1), pp.e23219. ⟨10.1016/j.heliyon.2023.e23219⟩. ⟨hal-04406767⟩


Collections

UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA CENTRALESUPELEC INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM

21 views

7 downloads
