Skip to Main content Skip to Navigation
Conference papers

q-gram analysis and urn models

Abstract : Words of fixed size q are commonly referred to as $q$-grams. We consider the problem of $q$-gram filtration, a method commonly used to speed upsequence comparison. We are interested in the statistics of the number of $q$-grams common to two random texts (where multiplicities are not counted) in the non uniform Bernoulli model. In the exact and dependent model, when omitting border effects, a $q$-gramin a random sequence depends on the $q-1$ preceding $q$-grams. In an approximate and independent model, we draw randomly a $q$-gram at each position, independently of the others positions. Using ball and urn models, we analyze the independent model. Numerical simulations show that this model is an excellent first order approximationto the dependent model. We provide an algorithm to compute the moments.
Complete list of metadata

Cited literature [32 references]  Display  Hide  Download

https://hal.inria.fr/hal-01183917
Contributor : Coordination Episciences Iam <>
Submitted on : Wednesday, August 12, 2015 - 9:06:10 AM
Last modification on : Thursday, March 5, 2020 - 6:29:18 PM
Long-term archiving on: : Friday, November 13, 2015 - 11:36:48 AM

File

dmAC0124.pdf
Publisher files allowed on an open archive

Identifiers

  • HAL Id : hal-01183917, version 1

Collections

Citation

Pierre Nicodème. q-gram analysis and urn models. Discrete Random Walks, DRW'03, 2003, Paris, France. pp.243-258. ⟨hal-01183917⟩

Share

Metrics

Record views

187

Files downloads

669