Faster Algorithms for Longest Common Substring - Inria - Institut national de recherche en sciences et technologies du numérique Accéder directement au contenu
Communication Dans Un Congrès Année : 2021

Faster Algorithms for Longest Common Substring

Résumé

In the classic longest common substring (LCS) problem, we are given two strings S and T , each of length at most n, over an alphabet of size σ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an O(n log σ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an O(n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in O(n log σ/ log n) space and read in O(n log σ/ log n) time. We show that, in this model, we can compute an LCS in time O(n log σ/ √ log n), which is sublinear in n if σ = 2 o(√ log n) (in particular, if σ = O(1)), using optimal space O(n log σ/ log n). We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in O(n log n) time [IPL 2015]. Thankachan et al. extended this result to computing a k-mismatch LCS in O(n log k n) time for k = O(1) [J. Comput. Biol. 2016]. We show an O(n log k−1/2 n)-time algorithm, for any constant k > 0 and irrespective of the alphabet size, using O(n) space as the previous approaches. We thus notably break through the well-known n log k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors.
Fichier principal
Vignette du fichier
2105.03106.pdf (1.08 Mo) Télécharger le fichier
Origine : Fichiers produits par l'(les) auteur(s)

Dates et versions

hal-03498345 , version 1 (21-12-2021)

Identifiants

Citer

Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P Pissis, Jakub Radoszewski. Faster Algorithms for Longest Common Substring. ESA 2021 - 29th Annual European Symposium on Algorithms, Sep 2021, Virtual, Portugal. ⟨hal-03498345⟩

Collections

INRIA INRIA2
34 Consultations
228 Téléchargements

Altmetric

Partager

Gmail Facebook X LinkedIn More