Focused Crawling through Reinforcement Learning

Miyoung Han (1, 2), Pierre-Henri Wuillemin (3), Pierre Senellart (4, 2, 1)
2 VALDA - Value from Data
DI-ENS - Département d'informatique de l'École normale supérieure, Inria de Paris
3 DECISION
LIP6 - Laboratoire d'Informatique de Paris 6
Abstract : Focused crawling aims at collecting as many Web pages relevant to a target topic as possible while avoiding irrelevant pages, reflecting the limited resources available to a Web crawler. We improve on the efficiency of focused crawling by proposing an approach based on reinforcement learning. Our algorithm estimates which hyperlinks are most profitable to follow over the long run, and selects the most promising link based on this estimation. To properly model the crawling environment as a Markov decision process, we propose new representations of states and actions that consider both content information and the link structure. The size of the state-action space is reduced by a generalization process. Based on this generalization, we use a linear-function approximation to update value functions. We investigate the trade-off between synchronous and asynchronous methods. In experiments, we compare the performance of a crawling task with and without learning; crawlers based on reinforcement learning show better performance for various target topics.
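To make the idea of a value update under linear-function approximation concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not give the feature representation, the reward, or the exact update rule, so the generic temporal-difference (SARSA-style) update, the function names `q_value`, `td_update`, and `select_link`, the two-dimensional feature vectors, and the values of `alpha` and `gamma` are all assumptions chosen for illustration.

```python
from typing import Dict, List


def q_value(weights: List[float], features: List[float]) -> float:
    """Estimated long-run value of following a link: dot product w . phi."""
    return sum(w * x for w, x in zip(weights, features))


def td_update(
    weights: List[float],
    features: List[float],
    reward: float,
    next_features: List[float],
    alpha: float = 0.1,   # learning rate (assumed value)
    gamma: float = 0.9,   # discount factor (assumed value)
) -> List[float]:
    """One temporal-difference step for a linear value function.

    `features` describes the link just followed and `next_features`
    the link chosen next; both encodings are hypothetical.
    """
    td_error = (
        reward
        + gamma * q_value(weights, next_features)
        - q_value(weights, features)
    )
    # For a linear model the gradient with respect to the weights
    # is the feature vector itself.
    return [w + alpha * td_error * x for w, x in zip(weights, features)]


def select_link(weights: List[float], frontier: Dict[str, List[float]]) -> str:
    """Pick the frontier URL with the highest estimated value."""
    return max(frontier, key=lambda url: q_value(weights, frontier[url]))


if __name__ == "__main__":
    weights = [0.0, 0.0]
    # Hypothetical frontier: URL -> feature vector.
    frontier = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
    # Following a link with features [1, 0] led to a relevant page (reward 1).
    weights = td_update(weights, [1.0, 0.0], 1.0, [0.0, 0.0])
    print(select_link(weights, frontier))
```

After a single rewarded step, links that share features with the rewarded link receive a higher estimate, which is the sense in which generalization lets the crawler rank links it has not yet followed.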
Document type :
Conference papers
https://hal.inria.fr/hal-01851547
Contributor : Pierre Senellart
Submitted on : Monday, July 30, 2018 - 1:58:23 PM
Last modification on : Wednesday, June 19, 2019 - 3:10:02 PM
Long-term archiving on : Wednesday, October 31, 2018 - 1:14:22 PM

File

crawling_2018.pdf
Files produced by the author(s)

Licence


Copyright

Citation

Miyoung Han, Pierre-Henri Wuillemin, Pierre Senellart. Focused Crawling through Reinforcement Learning. 18th International Conference on Web Engineering (ICWE 2018), Jun 2018, Cáceres, Spain. pp.261-278, ⟨10.1007/978-3-319-91662-0_20⟩. ⟨hal-01851547⟩
