Efficiently Identifying Disguised Missing Values in Heterogeneous, Text-Rich Data

Théo Bouganim; Ioana Manolescu; Helena Galhardas

doi:10.1007/978-3-662-66111-6_4

Book chapter, Year: 2022

Théo Bouganim

Role: Author
PersonId: 748336
IdHAL: theo-bouganim

Rich Data Analytics at Cloud Scale

Ioana Manolescu

Role: Author
PersonId: 742652
IdHAL: ioana-manolescu
ORCID: 0000-0002-0425-2462

Rich Data Analytics at Cloud Scale

Helena Galhardas

Role: Author
PersonId: 1109600

Abstract

Digital data is produced in many data models, ranging from highly structured (typically relational) to semi-structured models (XML, JSON) to various graph formats (RDF, property graphs) or text. Most real-world datasets contain a certain amount of null values, denoting missing, unknown, or inapplicable information. While some data models allow representing nulls by special tokens, so-called disguised missing values (DMVs, in short) are also frequently encountered: these are values that are not, syntactically speaking, nulls, but which nevertheless denote the absence, unavailability, or inapplicability of the information. In this work, we tackle the detection of a particular kind of DMV: texts freely entered by human users. This problem is not tackled by DMV detection methods focused on numeric or categorical data; further, it also escapes DMV detection methods based on value frequency, since such free texts often differ from each other, so most DMVs are unique. We encountered this problem within the ConnectionLens [6,7,8,12] project, where heterogeneous data is integrated into large graphs. We present two DMV detection methods for our specific problem: (i) leveraging Information Extraction, already applied in ConnectionLens graphs; and (ii) through text embeddings and classification. We detail their performance-precision trade-offs on real-world datasets.
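
To make the abstract's argument concrete, the sketch below first shows why a frequency-based detector finds nothing when every free-text DMV is unique, and then illustrates the general idea behind method (ii), embedding a text and classifying it. It is not the authors' implementation: character trigram counts stand in for real text embeddings, a nearest-centroid comparison stands in for a trained classifier, and every sample string and function name (`frequent_values`, `embed`, `is_dmv`) is a hypothetical chosen for illustration.

```python
from collections import Counter
import math

# Hypothetical labeled samples, invented for illustration only.
DMV_SAMPLES = ["not available", "unknown", "no information provided",
               "none declared", "not applicable"]
VALUE_SAMPLES = ["consultant for a pharmaceutical company",
                 "member of the city council",
                 "shares in an energy firm",
                 "professor at a university"]

def frequent_values(values, min_count=2):
    """Frequency-based baseline: values occurring at least min_count times."""
    counts = Counter(v.strip().lower() for v in values)
    return {v for v, c in counts.items() if c >= min_count}

def embed(text, n=3):
    """Toy embedding (assumption): counts of character n-grams."""
    padded = f" {text.strip().lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b.get(gram, 0) for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(texts):
    """Sum of the embeddings; cosine similarity ignores the scale."""
    total = Counter()
    for text in texts:
        total.update(embed(text))
    return total

DMV_CENTROID = centroid(DMV_SAMPLES)
VALUE_CENTROID = centroid(VALUE_SAMPLES)

def is_dmv(text):
    """Label a text as a DMV if it is closer to the DMV centroid."""
    vec = embed(text)
    return cosine(vec, DMV_CENTROID) > cosine(vec, VALUE_CENTROID)
```

Because each sample DMV occurs only once, `frequent_values(DMV_SAMPLES)` flags none of them, whereas `is_dmv` can still label an unseen phrasing by its similarity to known examples. A realistic system would presumably rely on pretrained embeddings and a trained classifier rather than this toy similarity.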

Keywords

Disguised Missing Values; Data cleaning; Heterogeneous database

Domains

Databases [cs.DB]

Main file

paper_008.pdf (1.56 MB)

Origin: Files produced by the author(s)

https://hal.science/hal-03817900

Submitted on: Monday, October 17, 2022, 15:15:00

Last modified on: Wednesday, August 30, 2023, 12:28:41

Long-term archiving on: Wednesday, January 18, 2023, 20:25:35

Dates and versions

hal-03817900, version 1 (17-10-2022)

Identifiers

HAL Id: hal-03817900, version 1
DOI: 10.1007/978-3-662-66111-6_4

Cite

Théo Bouganim, Ioana Manolescu, Helena Galhardas. Efficiently Identifying Disguised Missing Values in Heterogeneous, Text-Rich Data. Transactions on Large-Scale Data- and Knowledge-Centered Systems LI, 13410, Springer Berlin Heidelberg, pp.97-118, 2022, Lecture Notes in Computer Science, ⟨10.1007/978-3-662-66111-6_4⟩. ⟨hal-03817900⟩

Collections

X CNRS INRIA LIX X-LIX X-DEP-INFO INRIA2 IP_PARIS ANR GS-COMPUTER-SCIENCE