Skip to Main content Skip to Navigation
Journal articles

An Approximate L^p Difference Algorithm for Massive Data Streams

Abstract : Several recent papers have shown how to approximate the difference ∑ _i|a_i-b_i| or ∑ |a_i-b_i|^2 between two functions, when the function values a_i and b_i are given in a data stream, and their order is chosen by an adversary. These algorithms use little space (much less than would be needed to store the entire stream) and little time to process each item in the stream. They approximate with small relative error. Using different techniques, we show how to approximate the L^p-difference ∑ _i|a_i-b_i|^p for any rational-valued p∈(0,2], with comparable efficiency and error. We also show how to approximate ∑ _i|a_i-b_i|^p for larger values of p but with a worse error guarantee. Our results fill in gaps left by recent work, by providing an algorithm that is precisely tunable for the application at hand. These results can be used to assess the difference between two chronologically or physically separated massive data sets, making one quick pass over each data set, without buffering the data or requiring the data source to pause. For example, one can use our techniques to judge whether the traffic on two remote network routers are similar without requiring either router to transmit a copy of its traffic. A web search engine could use such algorithms to construct a library of small ''sketches,'' one for each distinct page on the web; one can approximate the extent to which new web pages duplicate old ones by comparing the sketches of the web pages. Such techniques will become increasingly important as the enormous scale, distributional nature, and one-pass processing requirements of data sets become more commonplace.
Document type :
Journal articles
Complete list of metadata

Cited literature [9 references]  Display  Hide  Download

https://hal.inria.fr/hal-00958964
Contributor : Service Ist Inria Sophia Antipolis-Méditerranée / I3s <>
Submitted on : Thursday, March 13, 2014 - 4:53:47 PM
Last modification on : Wednesday, October 14, 2020 - 4:08:34 AM
Long-term archiving on: : Friday, June 13, 2014 - 12:05:24 PM

File

dm040217.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00958964, version 1

Collections

Citation

Jessica H. Fong, Martin Strauss. An Approximate L^p Difference Algorithm for Massive Data Streams. Discrete Mathematics and Theoretical Computer Science, DMTCS, 2001, 4 (2), pp.301-322. ⟨hal-00958964⟩

Share

Metrics

Record views

200

Files downloads

1040