An Approximate L^p Difference Algorithm for Massive Data Streams - Inria - Institut national de recherche en sciences et technologies du numérique Accéder directement au contenu
Article Dans Une Revue Discrete Mathematics and Theoretical Computer Science Année : 2001

An Approximate L^p Difference Algorithm for Massive Data Streams

Résumé

Several recent papers have shown how to approximate the difference ∑ _i|a_i-b_i| or ∑ |a_i-b_i|^2 between two functions, when the function values a_i and b_i are given in a data stream, and their order is chosen by an adversary. These algorithms use little space (much less than would be needed to store the entire stream) and little time to process each item in the stream. They approximate with small relative error. Using different techniques, we show how to approximate the L^p-difference ∑ _i|a_i-b_i|^p for any rational-valued p∈(0,2], with comparable efficiency and error. We also show how to approximate ∑ _i|a_i-b_i|^p for larger values of p but with a worse error guarantee. Our results fill in gaps left by recent work, by providing an algorithm that is precisely tunable for the application at hand. These results can be used to assess the difference between two chronologically or physically separated massive data sets, making one quick pass over each data set, without buffering the data or requiring the data source to pause. For example, one can use our techniques to judge whether the traffic on two remote network routers are similar without requiring either router to transmit a copy of its traffic. A web search engine could use such algorithms to construct a library of small ''sketches,'' one for each distinct page on the web; one can approximate the extent to which new web pages duplicate old ones by comparing the sketches of the web pages. Such techniques will become increasingly important as the enormous scale, distributional nature, and one-pass processing requirements of data sets become more commonplace.
Fichier principal
Vignette du fichier
dm040217.pdf (150.64 Ko) Télécharger le fichier
Origine : Fichiers produits par l'(les) auteur(s)
Loading...

Dates et versions

hal-00958964 , version 1 (13-03-2014)

Identifiants

Citer

Jessica H. Fong, Martin Strauss. An Approximate L^p Difference Algorithm for Massive Data Streams. Discrete Mathematics and Theoretical Computer Science, 2001, Vol. 4 no. 2 (2), pp.301-322. ⟨10.46298/dmtcs.290⟩. ⟨hal-00958964⟩

Collections

TDS-MACS
89 Consultations
929 Téléchargements

Altmetric

Partager

Gmail Facebook X LinkedIn More