Design and analyses of web scraping on burstable virtual machines - Inria - Institut national de recherche en sciences et technologies du numérique
Journal article, Concurrency and Computation: Practice and Experience, Year: 2023

Design and analyses of web scraping on burstable virtual machines

Abstract

Web scraping is a widely used technique for collecting and structuring public data from the internet to support decision-making. As the volume of data continues to grow, more efficient methods of data extraction become crucial. This article introduces a novel web scraping framework that uses Burstable virtual machines (VMs) on Amazon Web Services to reduce the monetary cost of execution while ensuring compliance with service level agreements (SLAs). To achieve this, the framework combines fixed and temporary Burstable VMs in a mixed cluster, which can be elastically scaled up to fulfill the SLA and scaled down to minimize monetary costs. Two strategies for handling VM allocation are proposed and evaluated: (i) a queue and SLA-based strategy that employs queue size information and SLA criteria to determine the required number of VMs for the current scraping requests, and (ii) a credit-based strategy that incorporates information about Burstable VM credits to effectively manage instance creation and termination. Experimental tests show that the proposed framework meets the defined SLA while achieving cost reductions of up to 74% compared to an approach that executes on fixed-size clusters of Burstable instances.
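The abstract does not give the allocation rules themselves, so the following is only a minimal sketch of what the two strategies could look like. Every name, parameter, and formula here (`required_vms`, `should_terminate`, `secs_per_request`, `min_credits`, the one-request-per-VM assumption) is hypothetical and is not taken from the article.

```python
import math


def required_vms(queue_size, secs_per_request, sla_secs, fixed_vms, max_vms):
    """Hypothetical queue- and SLA-based sizing rule (not the article's formula)."""
    if sla_secs <= 0:
        raise ValueError("sla_secs must be positive")
    # Number of VMs needed to drain the whole queue within the SLA window,
    # assuming each VM processes one request at a time.
    needed = math.ceil(queue_size * secs_per_request / sla_secs)
    # Never go below the fixed part of the cluster or above the allowed cap.
    return max(fixed_vms, min(needed, max_vms))


def should_terminate(cpu_credits, min_credits, queue_size):
    """Hypothetical credit-based rule for a temporary VM.

    Retire the VM when there is no pending work or when its remaining
    CPU credits fall below an assumed threshold.
    """
    return queue_size == 0 or cpu_credits < min_credits
```

For instance, under these assumptions, `required_vms(100, 6.0, 60.0, 2, 10)` asks for the full cap of 10 VMs, while a nearly empty queue falls back to the 2 fixed VMs.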
No file deposited

Dates and versions

hal-04388372, version 1 (11-01-2024)

Identifiers

Cite

Lúcia Maria A. Drummond, Luciano Andrade, Pedro de Brito Muniz, Matheus Marotti Pereira, Thiago Do Prado Silva, et al. Design and analyses of web scraping on burstable virtual machines. Concurrency and Computation: Practice and Experience, 2023, ⟨10.1002/cpe.7999⟩. ⟨hal-04388372⟩