Constructing Antidictionaries of Long Texts in Output-Sensitive Space - Archive ouverte HAL Access content directly
Journal Articles Theory of Computing Systems Year : 2021

## Constructing Antidictionaries of Long Texts in Output-Sensitive Space

(1) , (2) , (3) , (4) , (4, 5)
1
2
3
4
5
• Function : Author
• PersonId : 1038878
• Function : Author
Gabriele Fici
• Function : Author
• PersonId : 1139982
Alice Héliou
• Function : Author
• PersonId : 1121002
Solon P Pissis

#### Abstract

A word x that is absent from a word y is called minimal if all its proper factors occur in y . Given a collection of k words y 1 , … , y k over an alphabet Σ , we are asked to compute the set $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M { y 1 , … , y k } ℓ of minimal absent words of length at most ℓ of the collection { y 1 , … , y k }. The set $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M { y 1 , … , y k } ℓ contains all the words x such that x is absent from all the words of the collection while there exist i , j , such that the maximal proper suffix of x is a factor of y i and the maximal proper prefix of x is a factor of y j . In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. Indeed, the set $\mathrm {M}^{\ell }_{y}$ M y ℓ of minimal absent words of a word y is equal to $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M { y 1 , … , y k } ℓ for any decomposition of y into a collection of words y 1 , … , y k such that there is an overlap of length at least ℓ − 1 between any two consecutive words in the collection. This computation generally requires Ω ( n ) space for n = | y | using any of the plenty available $\mathcal {O}(n)$ O ( n ) -time algorithms. This is because an Ω ( n )-sized text index is constructed over y which can be impractical for large n . We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| =o(n)$ ∥ M { y 1 , … , y N } ℓ ∥ = o ( n ) , for all N ∈ [1, k ], where ∥ S ∥ denotes the sum of the lengths of words in set S . For instance, in the human genome, n ≈ 3 × 10 9 but $\| \mathrm {M}^{12}_{\{y_1,\ldots ,y_k\}}\| \approx 10^{6}$ ∥ M { y 1 , … , y k } 12 ∥ ≈ 1 0 6 . We consider a constant-sized alphabet for stating our results. We show that all $\mathrm {M}^{\ell }_{y_{1}},\ldots ,\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M y 1 ℓ , … , M { y 1 , … , y k } ℓ can be computed in $\mathcal {O}(kn+{\sum }^{k}_{N=1}\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| )$ O ( k n + ∑ N = 1 k ∥ M { y 1 , … , y N } ℓ ∥ ) total time using $\mathcal {O}(\textsc {MaxIn}+\textsc {MaxOut})$ O ( MaxIn + MaxOut ) space, where MaxIn is the length of the longest word in { y 1 , … , y k } and $\textsc {MaxOut}=\max \limits \{\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| :N\in [1,k]\}$ MaxOut = max { ∥ M { y 1 , … , y N } ℓ ∥ : N ∈ [ 1 , k ] } . Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.

#### Domains

Computer Science [cs]

### Dates and versions

hal-03498331 , version 1 (21-12-2021)

### Identifiers

• HAL Id : hal-03498331 , version 1
• DOI :

### Cite

Lorraine a K Ayad, Golnaz Badkobeh, Gabriele Fici, Alice Héliou, Solon P Pissis. Constructing Antidictionaries of Long Texts in Output-Sensitive Space. Theory of Computing Systems, 2021, 65 (5), pp.777-797. ⟨10.1007/s00224-020-10018-5⟩. ⟨hal-03498331⟩

### Export

BibTeX TEI Dublin Core DC Terms EndNote Datacite

15 View