Chinese Web Content Extraction Based on Naïve Bayes Model

Abstract: As web content extraction becomes increasingly difficult, this paper proposes a method that uses a Naive Bayes model to train on the attribute eigenvalues of web page blocks. The method first denoises the web page, represents it as a DOM tree, and divides the page into blocks; it then uses the Naive Bayes model to obtain probability values for the statistical features of the blocks. Finally, it extracts the theme blocks to compose the content of the web page. Tests show that the algorithm extracts web page content accurately, with an average accuracy of 96.2%. The method has been adopted to extract content for the off-portal search of the Hunan Farmer Training Website, where its efficiency is good.
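
The block classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three block features (text length, link density, punctuation count), their thresholds, the class labels "content" and "noise", and the tiny training set are all hypothetical, and the denoising and DOM segmentation steps are assumed to have already produced per-block statistics. The sketch uses a categorical Naive Bayes classifier with add-one smoothing.

```python
import math
from collections import defaultdict


def featurize(text_len, link_chars, punct_count):
    """Map raw block statistics to coarse categorical features.

    The features and thresholds are illustrative assumptions,
    not values taken from the paper.
    """
    link_density = link_chars / text_len if text_len else 1.0
    return (
        "long" if text_len >= 100 else "short",
        "link_heavy" if link_density > 0.5 else "link_light",
        "punctuated" if punct_count >= 3 else "sparse",
    )


class NaiveBayes:
    """Categorical Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples, labels):
        self.class_counts = defaultdict(int)
        # feature_counts[label][feature_index][value] -> count
        self.feature_counts = defaultdict(
            lambda: defaultdict(lambda: defaultdict(int))
        )
        self.values = defaultdict(set)  # distinct values seen per feature
        for features, label in zip(samples, labels):
            self.class_counts[label] += 1
            for i, value in enumerate(features):
                self.feature_counts[label][i][value] += 1
                self.values[i].add(value)
        self.total = len(labels)
        return self

    def log_posterior(self, features, label):
        """Unnormalized log P(label | features) under the independence assumption."""
        logp = math.log(self.class_counts[label] / self.total)
        for i, value in enumerate(features):
            count = self.feature_counts[label][i][value]
            denom = self.class_counts[label] + len(self.values[i])
            logp += math.log((count + 1) / denom)
        return logp

    def predict(self, features):
        return max(
            self.class_counts,
            key=lambda label: self.log_posterior(features, label),
        )


# Hypothetical labeled blocks: (text_len, link_chars, punct_count) -> label
train = [
    (featurize(400, 10, 12), "content"),
    (featurize(250, 20, 8), "content"),
    (featurize(180, 5, 5), "content"),
    (featurize(40, 35, 0), "noise"),
    (featurize(60, 50, 1), "noise"),
    (featurize(120, 100, 0), "noise"),
]
model = NaiveBayes().fit([f for f, _ in train], [y for _, y in train])

# Two unseen blocks: a long paragraph and a short link list.
blocks = [(300, 15, 9), (30, 28, 0)]
predicted = [model.predict(featurize(*b)) for b in blocks]
print(predicted)  # ['content', 'noise']
```

In a full pipeline, the blocks predicted as "content" would be concatenated in document order to form the extracted page text.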
Document type:
Conference paper
Daoliang Li; Yingyi Chen. 7th International Conference on Computer and Computing Technologies in Agriculture (CCTA), Sep 2013, Beijing, China. Springer, IFIP Advances in Information and Communication Technology, AICT-420 (Part II), pp.404-413, 2014, Computer and Computing Technologies in Agriculture VII. 〈10.1007/978-3-642-54341-8_42〉

Cited literature [14 references]

https://hal.inria.fr/hal-01220851
Contributor: Hal Ifip
Submitted on: Tuesday, October 27, 2015 - 08:29:45
Last modified on: Wednesday, January 17, 2018 - 10:46:41
Document(s) archived on: Thursday, January 28, 2016 - 10:24:38

File

978-3-642-54341-8_42_Chapter.p...
Files produced by the author(s)

License

Distributed under a Creative Commons Attribution 4.0 International License

Citation

Wang Jinbo, Wang Lianzhi, Gao Wanlin, Yu Jian, Cui Yuntao. Chinese Web Content Extraction Based on Naïve Bayes Model. Daoliang Li; Yingyi Chen. 7th International Conference on Computer and Computing Technologies in Agriculture (CCTA), Sep 2013, Beijing, China. Springer, IFIP Advances in Information and Communication Technology, AICT-420 (Part II), pp.404-413, 2014, Computer and Computing Technologies in Agriculture VII. 〈10.1007/978-3-642-54341-8_42〉. 〈hal-01220851〉
