Chinese Web Content Extraction Based on Naïve Bayes Model

Wang Jinbo; Wang Lianzhi; Gao Wanlin; Yu Jian; Cui Yuntao

doi:10.1007/978-3-642-54341-8_42

Conference paper, Year: 2014

Wang Jinbo

Role: Author
PersonId : 972195

College of Information and Electrical Engineering [Beijing]

Wang Lianzhi

Role: Author
PersonId : 972196

College of Information and Electrical Engineering [Beijing]

Gao Wanlin

Role: Author
PersonId : 972197

College of Information and Electrical Engineering [Beijing]

Yu Jian

Role: Author
PersonId : 972198

College of Information and Electrical Engineering [Beijing]

Cui Yuntao

Role: Author
PersonId : 972199

College of Information and Electrical Engineering [Beijing]

Abstract

As web content extraction becomes increasingly difficult, this paper proposes a method that uses a Naive Bayes model trained on the attribute feature values of web page blocks. The method first denoises the web page, represents it as a DOM tree, and divides the page into blocks; it then uses the Naive Bayes model to obtain probability values for the statistical features of those blocks. Finally, it extracts the theme blocks to compose the content of the web page. Tests show that the algorithm extracts web page content accurately, with an average accuracy of 96.2%. The method has been adopted to extract content for the off-portal search of the Hunan Farmer Training Website, with good efficiency.
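The classification step described in the abstract can be illustrated with a minimal sketch. It assumes each block has already been cut from the DOM tree and reduced to three numeric features (text length, link density, punctuation count); these features, the training values, and all function names are hypothetical and are not taken from the paper. A Gaussian Naive Bayes is used here purely for illustration, and the paper may model its features differently.

```python
import math
from collections import defaultdict

# Hypothetical training data: (text_length, link_density, punctuation_count).
# Values are invented for illustration only.
TRAINING = [
    ((420, 0.02, 18), "content"),
    ((380, 0.05, 15), "content"),
    ((510, 0.01, 22), "content"),
    ((40, 0.90, 0), "noise"),
    ((25, 0.80, 1), "noise"),
    ((60, 0.95, 0), "noise"),
]


def train(samples):
    """Estimate a log prior and per-feature Gaussian (mean, variance) per class."""
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append(features)
    model = {}
    for label, rows in by_label.items():
        n = len(rows)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / n
            # Small constant keeps the variance strictly positive.
            var = sum((x - mean) ** 2 for x in column) / n + 1e-6
            stats.append((mean, var))
        model[label] = (math.log(n / len(samples)), stats)
    return model


def log_posterior(model, features, label):
    """Unnormalized log posterior: log prior plus summed Gaussian log likelihoods."""
    log_prior, stats = model[label]
    score = log_prior
    for x, (mean, var) in zip(features, stats):
        score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
    return score


def classify(model, features):
    """Return the class with the highest log posterior."""
    return max(model, key=lambda label: log_posterior(model, features, label))


def extract_content(model, blocks):
    """Join the text of blocks classified as content, in document order."""
    return "\n".join(
        text for text, features in blocks if classify(model, features) == "content"
    )


MODEL = train(TRAINING)
```

Calling `extract_content(MODEL, blocks)` with a list of `(text, features)` pairs keeps only the blocks labeled `content`. Working in log space avoids numerical underflow when the per-feature likelihoods are multiplied.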

Keywords

Web Content Extraction; DOM Tree; Page Segmentation; Naive Bayes Model

Domains

Computer Science [cs]

Main file

978-3-642-54341-8_42_Chapter.pdf (4 KB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-01220851

Submitted on: Tuesday, October 27, 2015, 08:29:45

Last modified on: Wednesday, January 17, 2018, 10:46:41

Long-term archiving on: Thursday, January 28, 2016, 10:24:38

Dates and versions

hal-01220851 , version 1 (27-10-2015)

License

Attribution

Identifiers

HAL Id : hal-01220851 , version 1
DOI : 10.1007/978-3-642-54341-8_42

Cite

Wang Jinbo, Wang Lianzhi, Gao Wanlin, Yu Jian, Cui Yuntao. Chinese Web Content Extraction Based on Naïve Bayes Model. 7th International Conference on Computer and Computing Technologies in Agriculture (CCTA), Sep 2013, Beijing, China. pp.404-413, ⟨10.1007/978-3-642-54341-8_42⟩. ⟨hal-01220851⟩

Collections

IFIP IFIP-AICT IFIP-TC IFIP-AICT-420 IFIP-TC5 IFIP-WG5-14 IFIP-CCTA

69 views

75 downloads
