q-gram analysis and urn models

Abstract : Words of fixed size q are commonly referred to as $q$-grams. We consider the problem of $q$-gram filtration, a method commonly used to speed upsequence comparison. We are interested in the statistics of the number of $q$-grams common to two random texts (where multiplicities are not counted) in the non uniform Bernoulli model. In the exact and dependent model, when omitting border effects, a $q$-gramin a random sequence depends on the $q-1$ preceding $q$-grams. In an approximate and independent model, we draw randomly a $q$-gram at each position, independently of the others positions. Using ball and urn models, we analyze the independent model. Numerical simulations show that this model is an excellent first order approximationto the dependent model. We provide an algorithm to compute the moments.
Keywords :
Type de document :
Communication dans un congrès
Cyril Banderier and Christian Krattenthaler. Discrete Random Walks, DRW'03, 2003, Paris, France. Discrete Mathematics and Theoretical Computer Science, DMTCS Proceedings vol. AC, Discrete Random Walks (DRW'03), pp.243-258, 2003, DMTCS Proceedings
Domaine :

Littérature citée [32 références]

https://hal.inria.fr/hal-01183917
Contributeur : Coordination Episciences Iam <>
Soumis le : mercredi 12 août 2015 - 09:06:10
Dernière modification le : jeudi 10 mai 2018 - 02:06:09
Document(s) archivé(s) le : vendredi 13 novembre 2015 - 11:36:48

Fichier

dmAC0124.pdf
Fichiers éditeurs autorisés sur une archive ouverte

Identifiants

• HAL Id : hal-01183917, version 1

Citation

Pierre Nicodème. q-gram analysis and urn models. Cyril Banderier and Christian Krattenthaler. Discrete Random Walks, DRW'03, 2003, Paris, France. Discrete Mathematics and Theoretical Computer Science, DMTCS Proceedings vol. AC, Discrete Random Walks (DRW'03), pp.243-258, 2003, DMTCS Proceedings. 〈hal-01183917〉

Métriques

Consultations de la notice

160

Téléchargements de fichiers