Deep Reinforcement Learning for Web Crawling

Konstantin Avrachenkov; Vivek Borkar; Kishor Patil

Conference paper. Year: 2021

Konstantin Avrachenkov

Role: Author
PersonId: 11963
IdHAL: konstantin-avrachenkov
ORCID: 0000-0002-8124-8272
IdRef: 087245280

Network Engineering and Operations

Vivek Borkar

Role: Author
PersonId: 994265
ORCID: 0000-0003-0756-5402

Department of Electrical Engineering [IIT-Bombay]

Kishor Patil

Role: Author
PersonId: 1118981

Network Engineering and Operations

Abstract

A search engine uses a web crawler to fetch pages from the World Wide Web (WWW) and aims to keep its local cache as fresh as possible. Unfortunately, the rates at which different pages on the WWW change are highly nonuniform and, in many real-life scenarios, unknown. In addition, finite bandwidth and possible server restrictions on crawling frequency make it very difficult for the crawler to find the optimal scheduling policy that maximises the freshness of the local cache. We model this problem in a multi-armed restless bandit framework, where each arm represents a web page or an aggregate of statistically identical web pages. The objective is to find a scheduling policy that specifies exactly which pages to crawl at each time instant. We provide an online learning scheme based on deep reinforcement learning (DRL) that learns the unknown page change dynamics on the fly along with the optimal crawling policy. Finally, we run numerical simulations to compare our approach with state-of-the-art algorithms such as static optimisation and Thompson sampling, and we observe better performance for DRL.
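The scheduling problem described in the abstract can be made concrete with a small simulation. The Python sketch below is only an illustration under stated assumptions, not the authors' DRL scheme: time is slotted, each page changes independently in every slot with a fixed probability, and the crawler may refresh at most `budget` pages per slot. The names `simulate_freshness`, `round_robin` and `make_staleness_policy` are hypothetical, and the staleness heuristic assumes the change probabilities are known, whereas the paper treats them as unknown.

```python
import random


def simulate_freshness(change_probs, budget, policy, horizon=2000, seed=0):
    """Return the time-averaged fraction of cached pages that are fresh."""
    rng = random.Random(seed)
    n = len(change_probs)
    fresh = [True] * n   # does the cached copy match the live page?
    age = [0] * n        # slots elapsed since each page was last crawled
    total = 0.0
    for t in range(horizon):
        # Assumption: each live page changes independently in every slot.
        for i, p in enumerate(change_probs):
            if rng.random() < p:
                fresh[i] = False
        # The policy chooses which pages to crawl; a crawl refreshes the copy.
        for i in policy(age, budget, t):
            fresh[i] = True
            age[i] = 0
        total += sum(fresh) / n
        for i in range(n):
            age[i] += 1
    return total / horizon


def round_robin(age, budget, t):
    """Baseline: cycle through the pages in a fixed order."""
    n = len(age)
    return [(t * budget + k) % n for k in range(budget)]


def make_staleness_policy(change_probs):
    """Hypothetical heuristic that assumes the change probabilities are known.

    It crawls the pages most likely to be stale, using 1 - (1 - p) ** age.
    """
    def policy(age, budget, t):
        stale = [1 - (1 - p) ** a for p, a in zip(change_probs, age)]
        order = sorted(range(len(age)), key=lambda i: stale[i], reverse=True)
        return order[:budget]
    return policy
```

Calling `simulate_freshness(probs, 1, round_robin)` and `simulate_freshness(probs, 1, make_staleness_policy(probs))` on the same list `probs` gives two freshness values that can be compared directly. A learning-based scheduler such as the one the paper proposes would take the place of the policy function and would have to estimate the change dynamics online.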

Keywords

Reinforcement Learning; Adaptive Web Crawling; Thompson Sampling; Multi-armed Restless Bandits

Domains

Information Retrieval [cs.IR]; Networks and Telecommunications [cs.NI]; Machine Learning [cs.LG]; Optimization and Control [math.OC]

Main file

Deep_RL_Crawling_ICC21_Author.pdf (2.86 MB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-03461189

Submitted on: Wednesday, December 1, 2021, 11:47:07

Last modified on: Monday, April 8, 2024, 16:14:39

Long-term archiving on: Wednesday, March 2, 2022, 19:01:18

Dates and versions

hal-03461189 , version 1 (01-12-2021)

Identifiers

HAL Id: hal-03461189, version 1

Cite

Konstantin Avrachenkov, Vivek Borkar, Kishor Patil. Deep Reinforcement Learning for Web Crawling. ICC 2021 - 7th Indian Control Conference, Dec 2021, Mumbai, India. ⟨hal-03461189⟩

Collections

INRIA INRIA2 TDS-MACS UNIV-COTEDAZUR

107 views

495 downloads
