
Web page segmentation, evaluation and applications

Andrés Sanoja Vargas 1
1 BD - Bases de Données
LIP6 - Laboratoire d'Informatique de Paris 6
Abstract : Web pages are becoming more complex than ever, as they are generated by Content Management Systems (CMS). Analyzing them, i.e. automatically identifying and classifying the different elements of a Web page, such as the main content and menus, therefore becomes difficult. Web page segmentation addresses this issue: it is the process of dividing a Web page into visually and semantically coherent segments called blocks. The quality of a Web page segmenter is measured by its correctness and its genericity, i.e. the variety of Web page types it is able to segment. Our research focuses on enhancing this quality and on measuring it in a fair and accurate way. We first propose a conceptual model for segmentation, as well as Block-o-Matic (BoM), our Web page segmenter. We then propose an evaluation model that takes the content as well as the geometry of blocks into account in order to measure the correctness of a segmentation algorithm against a predefined ground truth. The quality of four state-of-the-art algorithms is experimentally tested on four types of pages. The evaluation framework can be used to test any segmenter, i.e. to measure its quality. The results show that BoM performs best among the four segmentation algorithms tested, and that the performance of a segmenter depends on the type of page being segmented. We present two applications of BoM. Pagelyzer uses BoM to compare two versions of a Web page and decide whether they are similar. It is the main contribution of our team to the European project Scape (FP7-IP). We also developed a tool for migrating Web pages from HTML4 to HTML5 in the context of Web archives.
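To make the idea of measuring correctness against a ground truth from block geometry more concrete, here is a minimal sketch. It is not the evaluation model of the thesis: it substitutes a plain intersection-over-union (IoU) test on block rectangles and ignores block content entirely. The names `Block`, `iou`, `correct_fraction` and the threshold of 0.8 are assumptions made for illustration only.

```python
from typing import NamedTuple, Sequence


class Block(NamedTuple):
    """Hypothetical block geometry: top-left corner plus width and height, in pixels."""
    x: float
    y: float
    w: float
    h: float


def iou(a: Block, b: Block) -> float:
    """Intersection-over-union of two axis-aligned rectangles."""
    ix = max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    inter = ix * iy
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union > 0 else 0.0


def correct_fraction(truth: Sequence[Block],
                     predicted: Sequence[Block],
                     threshold: float = 0.8) -> float:
    """Fraction of ground-truth blocks matched by a predicted block.

    Greedy one-to-one matching: each ground-truth block takes the unused
    predicted block with the highest IoU, and counts as correct only if
    that IoU reaches the (assumed) threshold.
    """
    if not truth:
        return 0.0
    unused = list(predicted)
    hits = 0
    for t in truth:
        best = max(unused, key=lambda p: iou(t, p), default=None)
        if best is not None and iou(t, best) >= threshold:
            hits += 1
            unused.remove(best)
    return hits / len(truth)


# Illustrative data: the segmenter finds the header exactly but splits the body in two.
ground_truth = [Block(0, 0, 100, 50), Block(0, 50, 100, 200)]
segmentation = [Block(0, 0, 100, 50), Block(0, 50, 100, 100), Block(0, 150, 100, 100)]
score = correct_fraction(ground_truth, segmentation)
```

In this example only the header block is matched, because each half of the over-segmented body overlaps its ground-truth block with an IoU of 0.5. A model that also considers content, as the abstract describes, would need the text or DOM elements covered by each block, which this sketch leaves out.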

Cited literature : 63 references
Contributor : Abes Star
Submitted on : Monday, March 9, 2015 - 10:03:05 AM
Last modification on : Thursday, December 9, 2021 - 6:14:07 PM
Long-term archiving on : Wednesday, June 10, 2015 - 1:00:24 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01128002, version 1


Andrés Sanoja Vargas. Web page segmentation, evaluation and applications. Other [cs.OH]. Université Pierre et Marie Curie - Paris VI, 2015. English. ⟨NNT : 2015PA066004⟩. ⟨tel-01128002⟩


