Efficiently Identifying Disguised Missing Values in Heterogeneous, Text-Rich Data - Inria - Institut national de recherche en sciences et technologies du numérique
Book chapter. Year: 2022

Efficiently Identifying Disguised Missing Values in Heterogeneous, Text-Rich Data

Théo Bouganim
Ioana Manolescu
Helena Galhardas
  • Role: Author
  • PersonId : 1109600

Abstract

Digital data is produced in many data models, ranging from highly structured (typically relational) to semi-structured models (XML, JSON) to various graph formats (RDF, property graphs) or text. Most real-world datasets contain a certain amount of null values, denoting missing, unknown, or inapplicable information. While some data models allow representing nulls by special tokens, so-called disguised missing values (DMVs for short) are also frequently encountered: these are values that are not, syntactically speaking, nulls, but which nevertheless denote the absence, unavailability, or inapplicability of the information. In this work, we tackle the detection of a particular kind of DMV: texts freely entered by human users. This problem is not tackled by DMV detection methods focused on numeric or categorical data; further, it also escapes DMV detection methods based on value frequency, since such free texts are often different from each other, and thus most DMVs are unique. We encountered this problem within the ConnectionLens [6,7,8,12] project, where heterogeneous data is integrated into large graphs. We present two DMV detection methods for our specific problem: (i) leveraging Information Extraction, already applied in ConnectionLens graphs; and (ii) through text embeddings and classification. We detail their performance-precision trade-offs on real-world datasets.
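To make the abstract's two observations concrete, here is a minimal, hedged sketch in Python. It is not the authors' implementation: the character-trigram "embedding", the nearest-centroid classifier, the function names (`embed`, `cosine`, `centroid`, `train`), and all example strings are assumptions chosen for illustration, standing in for the text embeddings and classifier that the paper evaluates. The sketch first shows why a frequency-based detector finds nothing when every free-text DMV is unique, then shows how a similarity-based classifier can still flag an unseen DMV.

```python
import math
from collections import Counter

def embed(text, n=3):
    """Toy embedding (an assumption): unit-length vector of character n-gram counts."""
    s = f" {text.lower().strip()} "
    grams = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    norm = math.sqrt(sum(c * c for c in grams.values()))
    return {g: c / norm for g, c in grams.items()} if norm else {}

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())

def centroid(texts):
    """Unit-length mean direction of the embeddings of `texts`."""
    total = Counter()
    for t in texts:
        for g, w in embed(t).items():
            total[g] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {g: w / norm for g, w in total.items()} if norm else {}

def train(dmv_examples, value_examples):
    """Return a predicate that labels a text as DMV if it is closer to the DMV centroid."""
    dmv_c, val_c = centroid(dmv_examples), centroid(value_examples)
    def is_dmv(text):
        e = embed(text)
        return cosine(e, dmv_c) > cosine(e, val_c)
    return is_dmv

# Hypothetical labelled examples (not taken from the paper's datasets).
DMV_EXAMPLES = ["not available", "unknown", "none declared",
                "nothing to declare", "no information provided", "not applicable"]
VALUE_EXAMPLES = ["consulting fees from Acme Corp",
                  "member of the board of Example SA",
                  "research grant from the national agency",
                  "shares in a pharmaceutical company"]

# 1) Frequency-based detection: every free-text DMV occurs exactly once,
#    so no value stands out as suspiciously frequent.
max_frequency = max(Counter(DMV_EXAMPLES).values())

# 2) Similarity-based classification still generalises to unseen phrasings.
is_dmv = train(DMV_EXAMPLES, VALUE_EXAMPLES)
flag_unseen_dmv = is_dmv("not available yet")
flag_real_value = is_dmv("consulting fees from Example SA")
```

In this toy setup, `max_frequency` is 1, which is the situation in which frequency-based detectors have nothing to latch onto, while `is_dmv` separates the two unseen strings by surface similarity alone. A realistic system would replace the trigram vectors with learned text embeddings and the centroid rule with a trained classifier, at a higher computational cost.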
Main file
paper_008.pdf (1.56 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03817900, version 1 (17-10-2022)

Identifiers

Cite

Théo Bouganim, Ioana Manolescu, Helena Galhardas. Efficiently Identifying Disguised Missing Values in Heterogeneous, Text-Rich Data. Transactions on Large-Scale Data- and Knowledge-Centered Systems LI, 13410, Springer Berlin Heidelberg, pp.97-118, 2022, Lecture Notes in Computer Science, ⟨10.1007/978-3-662-66111-6_4⟩. ⟨hal-03817900⟩
44 views
56 downloads
