Deep Reinforcement Learning for Web Crawling - Archive ouverte HAL
Conference Papers Year : 2021

Deep Reinforcement Learning for Web Crawling

Konstantin Avrachenkov, Vivek Borkar, Kishor Patil

Abstract

A search engine uses a web crawler to crawl pages from the World Wide Web (WWW) and aims to keep its local cache as fresh as possible. Unfortunately, the rates at which different pages on the WWW change are highly nonuniform and, in many real-life scenarios, unknown. In addition, the finite available bandwidth and possible server restrictions on crawling frequency make it very difficult for the crawler to find the optimal scheduling policy that maximises the freshness of the local cache. We model this problem in a restless multi-armed bandit framework, where each arm represents a web page or an aggregate of statistically identical web pages. The objective is to find the scheduling policy that gives the exact indices of the pages to be crawled at a particular instant. We provide an online learning scheme using a deep reinforcement learning (DRL) framework, which learns the unknown page change dynamics on the fly along with the optimal crawling policy. Finally, we run numerical simulations to compare our approach with state-of-the-art algorithms such as static optimisation and Thompson sampling. We observe better performance for DRL.
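The scheduling problem described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' DRL scheme or their simulation setup. It assumes a discrete-time model in which each page changes independently with a fixed per-step probability, and it uses a plain round-robin baseline. The names `simulate_freshness`, `round_robin`, `change_probs`, and `budget` are hypothetical and chosen only for illustration.

```python
import random


def round_robin(step, n_pages, budget):
    """Baseline policy (an assumption, not from the paper): cycle through pages in order."""
    return [(step * budget + k) % n_pages for k in range(budget)]


def simulate_freshness(change_probs, budget, policy, steps=2000, seed=0):
    """Return the time-averaged fraction of cached pages that are fresh.

    change_probs[i]: assumed per-step probability that page i changes at the source
    budget:          number of pages the crawler may fetch per step (bandwidth limit)
    policy:          function (step, n_pages, budget) -> indices of pages to crawl
    """
    rng = random.Random(seed)
    n_pages = len(change_probs)
    fresh = [True] * n_pages  # True while the cached copy matches the live page
    total = 0.0
    for step in range(steps):
        # Each live page changes independently; a change makes its cached copy stale.
        for i, p in enumerate(change_probs):
            if rng.random() < p:
                fresh[i] = False
        # Crawling a page refreshes its cached copy.
        for i in policy(step, n_pages, budget):
            fresh[i] = True
        total += sum(fresh) / n_pages
    return total / steps


if __name__ == "__main__":
    # Three pages with nonuniform change rates and bandwidth for one crawl per step.
    print(simulate_freshness([0.05, 0.2, 0.5], budget=1, policy=round_robin))
```

Any other scheduling rule can be passed in through the `policy` argument, which is where a learned policy would plug in. In the setting of the abstract, the change probabilities are unknown to the crawler and must be learned online.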
Main file
Deep_RL_Crawling_ICC21_Author.pdf (2.86 MB)
Origin : Files produced by the author(s)

Dates and versions

hal-03461189 , version 1 (01-12-2021)

Identifiers

  • HAL Id : hal-03461189 , version 1

Cite

Konstantin Avrachenkov, Vivek Borkar, Kishor Patil. Deep Reinforcement Learning for Web Crawling. ICC 2021 - 7th Indian Control Conference, Dec 2021, Mumbai, India. ⟨hal-03461189⟩