Constructing Antidictionaries of Long Texts in Output-Sensitive Space

Lorraine a K Ayad; Golnaz Badkobeh; Gabriele Fici; Alice Héliou; Solon P Pissis

doi:10.1007/s00224-020-10018-5

Article Dans Une Revue Theory of Computing Systems Année : 2021

Constructing Antidictionaries of Long Texts in Output-Sensitive Space

(1) , (2) , (3) , (4) , (5, 6)

1
2
3
4
5
6

Lorraine a K Ayad

Fonction : Auteur
PersonId : 1038878

Department of Informatics [King's College London]

Golnaz Badkobeh

Fonction : Auteur

Goldsmiths, University of London

Gabriele Fici

Fonction : Auteur
PersonId : 1139982

Dipartimento di Matematica e Informatica [Palermo]

Alice Héliou

Fonction : Auteur
PersonId : 1121002

Algorithms and Models for Integrative BIOlogy

Solon P Pissis

Fonction : Auteur
PersonId : 1078597
ORCID : 0000-0002-1445-1932

Centrum Wiskunde & Informatica

Equipe de recherche européenne en algorithmique et biologie formelle et expérimentale

Résumé

A word x that is absent from a word y is called minimal if all its proper factors occur in y . Given a collection of k words y 1 , … , y k over an alphabet Σ , we are asked to compute the set $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M { y 1 , … , y k } ℓ of minimal absent words of length at most ℓ of the collection { y 1 , … , y k }. The set $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M { y 1 , … , y k } ℓ contains all the words x such that x is absent from all the words of the collection while there exist i , j , such that the maximal proper suffix of x is a factor of y i and the maximal proper prefix of x is a factor of y j . In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. Indeed, the set $\mathrm {M}^{\ell }_{y}$ M y ℓ of minimal absent words of a word y is equal to $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M { y 1 , … , y k } ℓ for any decomposition of y into a collection of words y 1 , … , y k such that there is an overlap of length at least ℓ − 1 between any two consecutive words in the collection. This computation generally requires Ω ( n ) space for n = | y | using any of the plenty available $\mathcal {O}(n)$ O ( n ) -time algorithms. This is because an Ω ( n )-sized text index is constructed over y which can be impractical for large n . We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| =o(n)$ ∥ M { y 1 , … , y N } ℓ ∥ = o ( n ) , for all N ∈ [1, k ], where ∥ S ∥ denotes the sum of the lengths of words in set S . For instance, in the human genome, n ≈ 3 × 10 9 but $\| \mathrm {M}^{12}_{\{y_1,\ldots ,y_k\}}\| \approx 10^{6}$ ∥ M { y 1 , … , y k } 12 ∥ ≈ 1 0 6 . We consider a constant-sized alphabet for stating our results. We show that all $\mathrm {M}^{\ell }_{y_{1}},\ldots ,\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M y 1 ℓ , … , M { y 1 , … , y k } ℓ can be computed in $\mathcal {O}(kn+{\sum }^{k}_{N=1}\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| )$ O ( k n + ∑ N = 1 k ∥ M { y 1 , … , y N } ℓ ∥ ) total time using $\mathcal {O}(\textsc {MaxIn}+\textsc {MaxOut})$ O ( MaxIn + MaxOut ) space, where MaxIn is the length of the longest word in { y 1 , … , y k } and $\textsc {MaxOut}=\max \limits \{\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| :N\in [1,k]\}$ MaxOut = max { ∥ M { y 1 , … , y N } ℓ ∥ : N ∈ [ 1 , k ] } . Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.

Domaines

Informatique [cs]

Fichier principal

1902.04785.pdf (622.37 Ko)

Origine : Fichiers produits par l'(les) auteur(s)

Marie-France Sagot : Connectez-vous pour contacter le contributeur

https://inria.hal.science/hal-03498331

Soumis le : mardi 21 décembre 2021-09:36:49

Dernière modification le : mercredi 21 février 2024-03:30:52

Archivage à long terme le : mardi 22 mars 2022-18:47:23

Dates et versions

hal-03498331 , version 1 (21-12-2021)

Identifiants

HAL Id : hal-03498331 , version 1
DOI : 10.1007/s00224-020-10018-5

Citer

Lorraine a K Ayad, Golnaz Badkobeh, Gabriele Fici, Alice Héliou, Solon P Pissis. Constructing Antidictionaries of Long Texts in Output-Sensitive Space. Theory of Computing Systems, 2021, 65 (5), pp.777-797. ⟨10.1007/s00224-020-10018-5⟩. ⟨hal-03498331⟩

Exporter

BibTeX XML-TEI Dublin Core DC Terms EndNote DataCite

Collections

X CNRS INRIA LIX X-LIX X-DEP-INFO INRIA2 IP_PARIS

16 Consultations

32 Téléchargements

Constructing Antidictionaries of Long Texts in Output-Sensitive Space

Résumé

Domaines

Dates et versions

Identifiants

Citer

Exporter

Collections

Altmetric

Partager