Near Duplicate Document Detection for Large Information Flows

Abstract: Near duplicate documents and their detection are studied to identify information items that convey the same (or very similar) content, possibly surrounded by diverse side information such as metadata, advertisements, timestamps, web presentation, and navigation aids. Identifying near duplicate information makes it possible to implement selection policies that optimize an information corpus and thereby improve its quality. In this paper, we introduce a new method to find near duplicate documents based on q-grams extracted from the text. The algorithm exploits three major features: a similarity measure that compares document q-gram occurrences to evaluate the syntactic similarity of the compared texts; an indexing method that maintains an inverted index of q-grams; and an efficient allocation of bitmaps, using a window size of 24 hours, that supports the document comparison process. The proposed algorithm has been tested in a multifeed news content management system to filter out duplicated news items coming from different information channels. The experimental evaluation shows the efficiency and accuracy of our solution compared with other existing techniques. The results on a real dataset report an F-measure of 9.53 with a similarity threshold of 0.8.
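The three ingredients named in the abstract (a q-gram similarity measure, an inverted index of q-grams, and a 24-hour comparison window) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the similarity measure, so a multiset Jaccard coefficient over q-gram counts is assumed here; character 3-grams, the class name `NearDuplicateFilter`, and its methods are hypothetical; and the bitmap allocation is not modelled at all (plain sets stand in for it). Timestamps are assumed to be seconds and to arrive in non-decreasing order.

```python
from collections import Counter, defaultdict, deque

WINDOW_SECONDS = 24 * 3600  # the 24-hour window mentioned in the abstract


def qgrams(text, q=3):
    """Return the multiset (Counter) of character q-grams of `text`."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + q] for i in range(len(normalized) - q + 1))


def similarity(a, b):
    """Multiset Jaccard over q-gram counts, in [0, 1] (an assumed measure)."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0


class NearDuplicateFilter:
    """Hypothetical streaming filter; not the structure used in the paper."""

    def __init__(self, threshold=0.8, q=3):
        self.threshold = threshold
        self.q = q
        self.index = defaultdict(set)  # q-gram -> ids of documents containing it
        self.profiles = {}             # document id -> q-gram Counter
        self.arrivals = deque()        # (timestamp, document id), oldest first

    def _expire(self, now):
        """Drop documents that have fallen out of the 24-hour window."""
        while self.arrivals and now - self.arrivals[0][0] > WINDOW_SECONDS:
            _, old_id = self.arrivals.popleft()
            for gram in self.profiles.pop(old_id):
                self.index[gram].discard(old_id)
                if not self.index[gram]:
                    del self.index[gram]

    def add(self, doc_id, text, timestamp):
        """Index a document and return the ids of in-window near duplicates."""
        self._expire(timestamp)
        profile = qgrams(text, self.q)
        # The inverted index restricts comparison to documents sharing a q-gram.
        candidates = set()
        for gram in profile:
            candidates |= self.index.get(gram, set())
        matches = sorted(
            other for other in candidates
            if similarity(profile, self.profiles[other]) >= self.threshold
        )
        for gram in profile:
            self.index[gram].add(doc_id)
        self.profiles[doc_id] = profile
        self.arrivals.append((timestamp, doc_id))
        return matches
```

Calling `add` for each incoming news item returns the identifiers of earlier items, still inside the window, whose similarity reaches the threshold; an empty list means the item can be treated as new. The default threshold of 0.8 mirrors the value quoted in the abstract, although the measure it is applied to here is only an assumption.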
Document type:
Conference paper
Gerald Quirchmayr; Josef Basl; Ilsun You; Lida Xu; Edgar Weippl. International Cross-Domain Conference and Workshop on Availability, Reliability, and Security (CD-ARES), Aug 2012, Prague, Czech Republic. Springer, Lecture Notes in Computer Science, LNCS-7465, pp.203-217, 2012, Multidisciplinary Research and Practice for Information Systems. 〈10.1007/978-3-642-32498-7_16〉
Cited literature [13 references]

https://hal.inria.fr/hal-01542467
Contributor: Hal Ifip
Submitted on: Monday, June 19, 2017 - 17:01:46
Last modified on: Tuesday, June 20, 2017 - 01:06:36
Document(s) archived on: Friday, December 15, 2017 - 18:57:32

File

978-3-642-32498-7_16_Chapter.p...
Files produced by the author(s)

License

Distributed under a Creative Commons Attribution 4.0 International License

Citation

Daniele Montanari, Piera Puglisi. Near Duplicate Document Detection for Large Information Flows. Gerald Quirchmayr; Josef Basl; Ilsun You; Lida Xu; Edgar Weippl. International Cross-Domain Conference and Workshop on Availability, Reliability, and Security (CD-ARES), Aug 2012, Prague, Czech Republic. Springer, Lecture Notes in Computer Science, LNCS-7465, pp.203-217, 2012, Multidisciplinary Research and Practice for Information Systems. 〈10.1007/978-3-642-32498-7_16〉. 〈hal-01542467〉
