Conference papers

Change Rate Estimation and Optimal Freshness in Web Page Crawling

Abstract : To return quick and accurate results, a search engine maintains a local snapshot of the entire web. To keep this local cache fresh, it employs a crawler that tracks changes across web pages. However, finite bandwidth and server restrictions constrain the crawling frequency. The ideal crawling rates are therefore those that maximise the freshness of the local cache while respecting these constraints. Azar et al. (2018) recently proposed a tractable algorithm to solve this optimisation problem. However, they assume knowledge of the exact page change rates, which is unrealistic in practice. We address this issue here. Specifically, we provide two novel schemes for online estimation of page change rates. Both schemes need only partial information about the page change process: whether or not the page has changed since it was last crawled. For both schemes, we prove convergence and derive their convergence rates. Finally, we provide numerical experiments comparing the performance of our proposed estimators with existing ones (e.g., MLE).
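To make the "partial information" setting concrete, here is a minimal sketch of the kind of existing baseline the abstract refers to (an MLE), not of the two schemes proposed in the paper, which the abstract does not specify. It assumes that page changes follow a Poisson process with an unknown rate and that the page is crawled at equal intervals; neither assumption is stated in the abstract. All function and parameter names are hypothetical. Under these assumptions, a crawl sees a change with probability 1 - exp(-rate * interval), so the rate can be recovered from the fraction of crawls that observed a change.

```python
import math
import random


def simulate_change_flags(rate, interval, n_crawls, seed=0):
    """Simulate binary crawl observations (hypothetical helper).

    Assumes a Poisson change process: each crawl reports 1 if at least one
    change occurred since the previous crawl, which happens with
    probability 1 - exp(-rate * interval).
    """
    rng = random.Random(seed)
    p_change = 1.0 - math.exp(-rate * interval)
    return [1 if rng.random() < p_change else 0 for _ in range(n_crawls)]


def mle_change_rate(flags, interval):
    """Maximum likelihood estimate of the change rate from 0/1 flags.

    Valid only under the assumptions above (Poisson changes, equal crawl
    intervals). Returns math.inf when every crawl saw a change, because
    the likelihood then has no finite maximiser.
    """
    n = len(flags)
    if n == 0:
        raise ValueError("need at least one crawl observation")
    k = sum(flags)
    if k == 0:
        return 0.0
    if k == n:
        return math.inf
    return -math.log(1.0 - k / n) / interval


if __name__ == "__main__":
    flags = simulate_change_flags(rate=0.5, interval=1.0, n_crawls=20000)
    print(mle_change_rate(flags, interval=1.0))
```

The infinite estimate in the all-changed case is one commonly cited weakness of this baseline when only binary observations are available.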

https://hal.inria.fr/hal-03123809
Contributor : Konstantin Avrachenkov
Submitted on : Thursday, January 28, 2021 - 11:04:13 AM
Last modification on : Wednesday, December 1, 2021 - 1:32:46 PM
Long-term archiving on : Thursday, April 29, 2021 - 6:28:50 PM

File

ValueTools_2020.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-03123809, version 1

Citation

Konstantin Avrachenkov, Kishor Patil, Gugan Thoppe. Change Rate Estimation and Optimal Freshness in Web Page Crawling. VALUETOOLS 2020 - 13th EAI International Conference on Performance Evaluation Methodologies and Tools, May 2020, Tsukuba, Japan. ⟨hal-03123809⟩