An Approximate L^p Difference Algorithm for Massive Data Streams

Abstract : Several recent papers have shown how to approximate the difference ∑ _i|a_i-b_i| or ∑ |a_i-b_i|^2 between two functions, when the function values a_i and b_i are given in a data stream, and their order is chosen by an adversary. These algorithms use little space (much less than would be needed to store the entire stream) and little time to process each item in the stream. They approximate with small relative error. Using different techniques, we show how to approximate the L^p-difference ∑ _i|a_i-b_i|^p for any rational-valued p∈(0,2], with comparable efficiency and error. We also show how to approximate ∑ _i|a_i-b_i|^p for larger values of p but with a worse error guarantee. Our results fill in gaps left by recent work, by providing an algorithm that is precisely tunable for the application at hand. These results can be used to assess the difference between two chronologically or physically separated massive data sets, making one quick pass over each data set, without buffering the data or requiring the data source to pause. For example, one can use our techniques to judge whether the traffic on two remote network routers are similar without requiring either router to transmit a copy of its traffic. A web search engine could use such algorithms to construct a library of small ''sketches,'' one for each distinct page on the web; one can approximate the extent to which new web pages duplicate old ones by comparing the sketches of the web pages. Such techniques will become increasingly important as the enormous scale, distributional nature, and one-pass processing requirements of data sets become more commonplace.
Type de document :
Article dans une revue
Discrete Mathematics and Theoretical Computer Science, DMTCS, 2001, 4 (2), pp.301-322
Liste complète des métadonnées

Littérature citée [9 références]  Voir  Masquer  Télécharger
Contributeur : Service Ist Inria Sophia Antipolis-Méditerranée / I3s <>
Soumis le : jeudi 13 mars 2014 - 16:53:47
Dernière modification le : lundi 20 novembre 2017 - 14:04:03
Document(s) archivé(s) le : vendredi 13 juin 2014 - 12:05:24


Fichiers produits par l'(les) auteur(s)


  • HAL Id : hal-00958964, version 1



Jessica H. Fong, Martin Strauss. An Approximate L^p Difference Algorithm for Massive Data Streams. Discrete Mathematics and Theoretical Computer Science, DMTCS, 2001, 4 (2), pp.301-322. 〈hal-00958964〉



Consultations de la notice


Téléchargements de fichiers