A Word Counting Graph

Mireille Regnier 1 Zara Kirakossian 2 Eugenia Furletova 3 Mikhail Roytberg 3
1 AMIB - Algorithms and Models for Integrative Biology
CNRS - Centre National de la Recherche Scientifique : UMR8623, X - École polytechnique, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LRI - Laboratoire de Recherche en Informatique, LIX - Laboratoire d'informatique de l'École polytechnique [Palaiseau]
Abstract : We study methods for counting occurrences of words from a given set H over an alphabet V in a given text. All words have the same length m. Our goal is the computation of the probability to find p occurrences of words from a set H in a random text of size n, assuming that the text is generated by a Bernoulli or Markov model. We have designed an algorithm solving the problem; the algorithm relies on traversals of a graph, whose set of vertices is associated with the overlaps of words from H. Edges define two oriented subgraphs that can be interpreted as equivalence relations on words of H. Let P (H) be the set of equivalence classes and S be the set of other vertices. The run time for the Bernoulli model is O(np(|P (H)| +|S|)) time and the space complexity is O(pm|S| +|P (H)|). In a Markov model of order K, additional space complexity is O(pm|V | K ) and additional time complexity is O(npm|V | K). Our preprocessing uses a variant of Aho-Corasick automaton and achieves O(m|H|) time complexity. Our algorithm is implemented and provides a significant space improvement in practice. We compare its complexity to the additional improvement due to AhoCorasick minimization.
Type de document :
Chapitre d'ouvrage
Joseph Chan, Jacqueline W. Daykin and M. Sohel Rahman. London Algorithmics 2008: Theory and Practice (Texts in Algorithmics), London College Publications, 31 p., 2009, 978-1904987970
Liste complète des métadonnées

Littérature citée [11 références]  Voir  Masquer  Télécharger

https://hal.inria.fr/inria-00437147
Contributeur : Mireille Regnier <>
Soumis le : dimanche 29 novembre 2009 - 16:17:13
Dernière modification le : jeudi 10 mai 2018 - 02:06:03
Document(s) archivé(s) le : jeudi 17 juin 2010 - 22:42:21

Fichier

Final_refukiro.pdf
Fichiers produits par l'(les) auteur(s)

Identifiants

  • HAL Id : inria-00437147, version 1

Collections

Citation

Mireille Regnier, Zara Kirakossian, Eugenia Furletova, Mikhail Roytberg. A Word Counting Graph. Joseph Chan, Jacqueline W. Daykin and M. Sohel Rahman. London Algorithmics 2008: Theory and Practice (Texts in Algorithmics), London College Publications, 31 p., 2009, 978-1904987970. 〈inria-00437147〉

Partager

Métriques

Consultations de la notice

354

Téléchargements de fichiers

165