Text Indexing for Long Patterns: Anchors are All you Need - Inria - Institut national de recherche en sciences et technologies du numérique Accéder directement au contenu
Communication Dans Un Congrès Année : 2023

Text Indexing for Long Patterns: Anchors are All you Need

Résumé

In many real-world database systems, a large fraction of the data is represented by strings: sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form but also to simultaneously enable fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (i) index space; (ii) query time; (iii) construction space; and (iv) construction time. Unfortunately, however, most (if not all) widely-used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that text indexing with locally consistent anchors (lcanchors) offers remarkably good performance in all four measures, when we have at hand a lower bound ℓ on the length of the queried patterns-which is arguably a quite reasonable assumption in practical applications. Specifically, we improve on the construction of the index proposed by Loukides and Pissis, which is based on bidirectional string anchors (bd-anchors), a new type of lc-anchors, by: (i) designing an average-case linear-time algorithm to compute bd-anchors; and (ii) developing a semi-external-memory implementation to construct the index in small space using near-optimal work. We then present an extensive experimental evaluation, based on the four measures, using real benchmark datasets. The results show that, for long patterns, the index constructed using our improved algorithms compares favorably to all classic indexes: (compressed) suffix tree; (compressed) suffix array; and the FM-index.
Fichier principal
Vignette du fichier
p2117-loukides.pdf (4.01 Mo) Télécharger le fichier
Origine : Fichiers éditeurs autorisés sur une archive ouverte

Dates et versions

hal-04385571 , version 1 (10-01-2024)

Licence

Paternité

Identifiants

Citer

Lorraine a K Ayad, Grigorios Loukides, Solon P Pissis. Text Indexing for Long Patterns: Anchors are All you Need. VLDB 2023 - 49th International Conference on Very Large Data Bases, Aug 2023, Vancouver, Canada. pp.2117-2131, ⟨10.14778/3598581.3598586⟩. ⟨hal-04385571⟩
15 Consultations
13 Téléchargements

Altmetric

Partager

Gmail Facebook X LinkedIn More