Web page segmentation, evaluation and applications

Abstract : Web pages are becoming more complex than ever, as they are generated by Content Management Systems (CMS). Analyzing them, i.e. automatically identifying and classifying their different elements, such as main content and menus, is therefore becoming difficult. Web page segmentation addresses this issue: it is the process of dividing a Web page into visually and semantically coherent segments called blocks. The quality of a Web page segmenter is measured by its correctness and its genericity, i.e. the variety of Web page types it is able to segment. Our research focuses on enhancing this quality and measuring it in a fair and accurate way. We first propose a conceptual model for segmentation, as well as Block-o-Matic (BoM), our Web page segmenter. We then propose an evaluation model that takes both the content and the geometry of blocks into account in order to measure the correctness of a segmentation algorithm against a predefined ground truth. The quality of four state-of-the-art algorithms is experimentally tested on four types of pages. Our evaluation framework can be used to test any segmenter, i.e. to measure its quality. The results show that BoM performs best among the four segmentation algorithms tested, and that the performance of a segmenter depends on the type of page being segmented. We present two applications of BoM. Pagelyzer uses BoM to compare two versions of a Web page and decide whether they are similar. It is the main contribution of our team to the European project Scape (FP7-IP). We also developed a tool for migrating Web pages from HTML4 to HTML5 in the context of Web archives.
Document type :
Theses

https://tel.archives-ouvertes.fr/tel-01128002
Contributor : Abes Star
Submitted on : Monday, March 9, 2015 - 10:03:05 AM
Last modification on : Friday, March 22, 2019 - 1:33:12 AM
Long-term archiving on : Wednesday, June 10, 2015 - 1:00:24 PM

File

2015PA066004.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01128002, version 1

Citation

Andrés Sanoja Vargas. Web page segmentation, evaluation and applications. Web. Université Pierre et Marie Curie - Paris VI, 2015. English. ⟨NNT : 2015PA066004⟩. ⟨tel-01128002⟩
