Chinese Web Content Extraction Based on Naïve Bayes Model - Inria - Institut national de recherche en sciences et technologies du numérique
Conference paper, Year: 2014

Chinese Web Content Extraction Based on Naïve Bayes Model

Abstract

As web content extraction becomes increasingly difficult, this paper proposes a method that uses a Naïve Bayes model trained on the attribute eigenvalues of web page blocks. The method first denoises the web page, represents it as a DOM tree, and divides the page into blocks; it then uses the Naïve Bayes model to obtain probability values for the statistical features of those blocks. Finally, it extracts the theme blocks to compose the content of the web page. Tests show that the algorithm extracts web page content accurately, with an average accuracy of 96.2%. The method has been adopted to extract content for the off-portal search of the Hunan Farmer Training Website, where it performs efficiently.
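The block-classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three per-block features (text length, link density, punctuation count), the Gaussian likelihood, the labels "content" and "noise", the toy training rows, and every function name are assumptions chosen for illustration. The denoising and DOM block-splitting steps are assumed to have already produced the feature tuples.

```python
import math
from collections import defaultdict

# Hypothetical training rows: (text_length, link_density, punctuation_count), label.
# These values are invented for illustration, not taken from the paper.
TRAINING = [
    ((820, 0.02, 31), "content"),
    ((640, 0.05, 24), "content"),
    ((910, 0.01, 40), "content"),
    ((45, 0.90, 0), "noise"),
    ((60, 0.75, 1), "noise"),
    ((30, 0.95, 0), "noise"),
]


def train(samples):
    """Estimate a prior plus per-feature mean and variance for each label."""
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append(features)
    total = len(samples)
    model = {}
    for label, rows in by_label.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor on the variance avoids division by zero
        # when a feature is constant within a class.
        variances = [
            max(sum((x - m) ** 2 for x in col) / n, 1e-6)
            for col, m in zip(columns, means)
        ]
        model[label] = (n / total, means, variances)
    return model


def log_posterior(model, features, label):
    """Unnormalized log posterior under the naive independence assumption."""
    prior, means, variances = model[label]
    score = math.log(prior)
    for x, m, v in zip(features, means, variances):
        score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
    return score


def classify(model, features):
    """Return the label with the highest log posterior."""
    return max(model, key=lambda label: log_posterior(model, features, label))


def extract_content(model, blocks):
    """Join the text of every block classified as 'content'.

    blocks is a list of (text, feature_tuple) pairs.
    """
    return "\n".join(
        text for text, features in blocks if classify(model, features) == "content"
    )
```

A possible usage would be `model = train(TRAINING)` followed by `extract_content(model, blocks)`, where `blocks` pairs each block's text with its feature tuple. Working in log space keeps the product of small per-feature probabilities from underflowing.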
Main file
978-3-642-54341-8_42_Chapter.pdf (4 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01220851, version 1 (27-10-2015)

License

Attribution

Cite

Wang Jinbo, Wang Lianzhi, Gao Wanlin, Yu Jian, Cui Yuntao. Chinese Web Content Extraction Based on Naïve Bayes Model. 7th International Conference on Computer and Computing Technologies in Agriculture (CCTA), Sep 2013, Beijing, China. pp.404-413, ⟨10.1007/978-3-642-54341-8_42⟩. ⟨hal-01220851⟩