Products of Euclidean Metrics and Applications to Proximity Questions among Curves

The problem of Approximate Nearest Neighbor (ANN) search is fundamental in computer science and has benefited from significant progress in the past couple of decades. However, most work has been devoted to pointsets, whereas complex shapes have not been sufficiently treated. Here, we focus on distance functions between discretized curves in Euclidean space: they appear in a wide range of applications, from road segments and molecular backbones to time-series in general dimension. For $\ell_p$-products of Euclidean metrics, for any $p \ge 1$, we design simple and efficient data structures for ANN, based on randomized projections, which are of independent interest. They serve to solve proximity problems under a notion of distance between discretized curves, which generalizes both discrete Fréchet and Dynamic Time Warping distances. These are the most popular and practical approaches to comparing such curves. We offer the first data structures and query algorithms for ANN with arbitrarily good approximation factor, at the expense of increasing space usage and preprocessing time over existing methods. Query time complexity is comparable or significantly improved by our algorithms; our approach is especially efficient when the length of the curves is bounded.


Introduction
The problem of Approximate Nearest Neighbor (ANN) search is fundamental in computer science: one has to preprocess a dataset so as to answer proximity queries efficiently, for a given query object. ANN has been enjoying a lot of attention, and significant progress has been achieved in the past couple of decades. However, most work has been devoted to vector spaces, and complex objects have not been sufficiently treated. Here, we focus on distance functions for polygonal curves which lie in Euclidean space. Polygonal curves are essentially point sequences of varying length; they have a wide range of applications, ranging from road segments in low dimensions to time-series in arbitrary dimension and protein backbone structures. In general, the problem we aim to solve is as follows.
Definition 1 (ANN). The input consists of $n$ polygonal curves $V_1, \ldots, V_n$, where each $V_i$ is a sequence $v_{i1}, \ldots, v_{im_i}$ with each $v_{ij} \in \mathbb{R}^d$, and each $m_i \le m$ for some pre-specified $m$. Given a distance function $\mathrm{d}(\cdot,\cdot)$ and $\varepsilon > 0$, preprocess $V_1, \ldots, V_n$ into a data structure such that, for any query polygonal curve $Q$, the data structure reports some $V_j$ for which $\mathrm{d}(Q, V_j) \le (1+\varepsilon) \cdot \min_i \mathrm{d}(Q, V_i)$.

There are various ways to define dissimilarity or distance between two curves. Two popular dissimilarity measures are the Discrete Fréchet Distance (DFD) and the Dynamic Time Warping (DTW) distance, which are both widely studied and applied to classification and retrieval problems for various types of data. DFD satisfies the triangle inequality, unlike DTW.
It is common, in distance functions for curves, to involve the notion of a traversal of the two curves. Intuitively, a traversal corresponds to a time plan for traversing the two curves simultaneously, starting from the first point of each curve and finishing at the last point of each curve. As time advances, the traversal advances in at least one of the two curves. DFD is the minimum, over all traversals, of the maximum distance between paired points; DTW is the minimum, over all traversals, of the sum of distances between paired points.
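To make the two measures concrete, here is a minimal sketch (illustrative code, not from the paper) of the textbook dynamic programs computing DFD and DTW for curves given as lists of points in $\mathbb{R}^d$:

```python
# Textbook dynamic programs for DFD and DTW over discretized curves.
# The only difference is how paired distances are aggregated: max vs. sum.
from math import dist, inf

def discrete_frechet(V, U):
    """DFD: minimum over traversals of the maximum paired distance."""
    m1, m2 = len(V), len(U)
    D = [[inf] * m2 for _ in range(m1)]
    for i in range(m1):
        for j in range(m2):
            d = dist(V[i], U[j])
            if i == 0 and j == 0:
                D[i][j] = d
            else:
                best = min(
                    D[i - 1][j] if i > 0 else inf,                # advance V only
                    D[i][j - 1] if j > 0 else inf,                # advance U only
                    D[i - 1][j - 1] if i > 0 and j > 0 else inf,  # advance both
                )
                D[i][j] = max(best, d)
    return D[m1 - 1][m2 - 1]

def dtw(V, U):
    """DTW: minimum over traversals of the sum of paired distances."""
    m1, m2 = len(V), len(U)
    D = [[inf] * m2 for _ in range(m1)]
    for i in range(m1):
        for j in range(m2):
            d = dist(V[i], U[j])
            if i == 0 and j == 0:
                D[i][j] = d
            else:
                best = min(
                    D[i - 1][j] if i > 0 else inf,
                    D[i][j - 1] if j > 0 else inf,
                    D[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
                D[i][j] = best + d
    return D[m1 - 1][m2 - 1]
```

Both run in $O(m_1 m_2)$ time over the grid of index pairs, which is exactly the space of traversal positions.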
We denote by $\ell_p^d$ the normed space $(\mathbb{R}^d, \|\cdot\|_p)$, where for any $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, $\|x\|_p = \left(\sum_{i=1}^{d} |x_i|^p\right)^{1/p}$.

Previous work
The ANN problem has been mainly addressed for datasets consisting of points. Efficient deterministic solutions exist when the dimension is constant, e.g. [5], while for high-dimensional data the state-of-the-art solutions are mainly based on the notion of Locality Sensitive Hashing, e.g. [8, 4], or on random projections, e.g. [1, 2]. Another line of work focuses on subsets of general metrics which satisfy some sort of low intrinsic dimension assumption, e.g. [9]. This is only a small fraction of the body of work on pointsets; however, very little is known about distances between curves. Let us start with point sequences, which are closely related to curves. For metrics $M_1, \ldots, M_k$, we define the $\ell_p$-product of $M_1, \ldots, M_k$ as the metric with domain $M_1 \times \cdots \times M_k$ and distance function $d_p\big((x_1,\ldots,x_k),(y_1,\ldots,y_k)\big) = \big\|\big(d_{M_1}(x_1,y_1), \ldots, d_{M_k}(x_k,y_k)\big)\big\|_p$. If there exists an ANN data structure with approximation factor $c$ for each of $M_1, \ldots, M_k$, then one can build a data structure for the $\ell_p$-product with approximation factor $O(c \log\log n)$ [3, 10].
Let us now focus on curves: the two existing approaches both solve the approximate near neighbor problem, instead of the optimization version, ANN. It is known that a data structure for the approximate near neighbor problem can be used as a building block for solving the ANN problem. This procedure has provable guarantees for metrics [8], but it is not clear whether it can be extended to non-metric distances such as DTW.
The first result, by Indyk [10], concerns DFD defined over any metric $(X, \mathrm{d}(\cdot,\cdot))$; it achieved approximation factor $O((\log m + \log\log n)^{t-1})$, where $m$ is the maximum length of a curve and $t > 1$ is a trade-off parameter. The solution is based on an efficient data structure for $\ell_\infty$-products of arbitrary metrics, and achieves space and preprocessing time in $O(m^2|X|)^{t m^{1/t}} \cdot n^{2t}$, and query time in $(m \log n)^{O(t)}$. Table 1 states these bounds for appropriate $t = 1 + o(1)$, hence a constant approximation factor. It is not clear whether the approach may achieve a $1+\varepsilon$ approximation factor by employing more space.
Quite recently, a new data structure was devised for the DFD of curves defined by the Euclidean metric [7]. The approximation factor is $O(d^{3/2})$. The space required is $O(2^{4md}\, n \log n + mn)$, and each query costs $O(2^{4md}\, m \log n)$. They also provide a trade-off between space/query time and the approximation factor. At the other extreme of this trade-off, they achieve space in $O(n \log n + mn)$, query time in $O(m \log n)$ and approximation factor $O(m)$. Our methods can achieve any user-desired approximation factor at the expense of a reasonable increase in the space and time complexities.
Furthermore, they show in [7] that the result establishing an $O(m)$ approximation extends to DTW, whereas the other extreme of the trade-off has remained open.
Table 1 summarizes space and query time complexities, and approximation factors of the main methods for searching among discrete curves under the two main dissimilarity measures.

Our contribution
Our first contribution is a simple data structure for the ANN problem in $\ell_p$-products of finite subsets of $\ell_2^d$, for any $p$. The key ingredient is a random projection from points in $\ell_2$ to points in $\ell_p$. Although random projections have proven a relevant approach for ANN of pointsets, it is quite unusual to employ randomized embeddings from $\ell_2$ to $\ell_p$, $p > 2$, because such norms are considered "harder" than $\ell_2$ in the context of proximity searching. After the random projection, the algorithm "vectorizes" all point sequences. The original problem is then translated to the ANN problem for points in $\ell_p^{d'}$, for $d' \approx d \cdot m$ to be specified later, and can be solved by simple bucketing methods in space $\tilde{O}_{d'}(n) \cdot (1/\varepsilon)^{d'}$ and query time $\tilde{O}(d' \log n)$, which is very efficient when $d \cdot m$ is low.
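As an illustration of this pipeline, the sketch below (illustrative code with invented names, not the paper's exact construction or parameter choices) projects each point of a sequence with a Gaussian matrix $G$ and concatenates the images, mapping a sequence in $(\ell_2^d)^m$ to a single point in $\ell_p^{km}$:

```python
# Sketch of the project-then-vectorize pipeline, in pure Python.
import random

def gaussian_matrix(k, d, rng):
    """A k x d matrix with i.i.d. N(0,1) entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]

def project(G, v):
    """Map one point v in R^d to G v in R^k."""
    return [sum(g_i * v_i for g_i, v_i in zip(row, v)) for row in G]

def vectorize(G, seq):
    """Concatenate the projections of all points of a sequence."""
    out = []
    for v in seq:
        out.extend(project(G, v))
    return out

def lp_norm(x, p):
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

rng = random.Random(0)
G = gaussian_matrix(8, 3, rng)               # here k = 8, d = 3
seq = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]     # a sequence of m = 2 points
x = vectorize(G, seq)                        # one point in R^{k*m} = R^16
```

Proximity between vectorized sequences is then measured with `lp_norm`, so any pointset ANN method for $\ell_p$ applies.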
Then, we present a notion of distance between two polygonal curves which generalizes both DFD and DTW (for a formal definition see Definition 13). The $\ell_p$-distance of two curves minimizes, over all traversals, the $\ell_p$ norm of the vector of all Euclidean distances between paired points. Hence, DFD corresponds to the $\ell_\infty$-distance of polygonal curves, and DTW corresponds to the $\ell_1$-distance of polygonal curves.
Our main contribution is an ANN structure for the $\ell_p$-distance of curves, for $1 \le p < \infty$. This easily extends to the $\ell_\infty$-distance of curves by solving for the $\ell_p$-distance for sufficiently large $p$. Our target is methods with approximation factor $1+\varepsilon$. Such approximation factors are obtained for the first time, at the expense of larger space or time complexity. Moreover, a further advantage is that our methods solve ANN directly, instead of requiring a reduction to near neighbor search. While a reduction to the near neighbor problem has provable guarantees for metrics [8], we are not aware of an analogous result for non-metric distances such as DTW.
Specifically, when $p > 2$, there exists a data structure with space and preprocessing time exponential in $md$, with the exponent governed by a quantity $\alpha_{p,\varepsilon}$ depending only on $p$ and $\varepsilon$, and query time in $\tilde{O}(2^{2m} \log n)$. When specialized to DFD and compared to [7], our space and preprocessing time complexity is higher by the exponent $\log(1/\varepsilon)$, but our query time is linear instead of being exponential in $d$. When $p \in [1, 2]$, there exists a data structure with analogous space and preprocessing bounds, where again $\alpha_{p,\varepsilon}$ depends only on $p$ and $\varepsilon$, and query time in $\tilde{O}(2^{2m} \log n)$. This leads to the first approach that achieves a $1+\varepsilon$ approximation for DTW, at the expense of space, preprocessing and query time complexities being exponential in $m$. Hence our method is best suited when the curve size is small. Our results for DTW and DFD are summarized in Table 1 and juxtaposed with the existing approaches of [7, 10].

Table 1 Summary of previous results compared to this paper's: $X$ denotes the domain set of the input metric. The first method is deterministic, while the rest are randomized. All previous results are tuned to optimize the approximation factor. The parameters $\rho_u$, $\rho_q$ satisfy the trade-off of Theorem 11, for approximation factor $(1+\varepsilon)^{O(1)}$.
The rest of the paper is structured as follows. In Section 2, we present a data structure for ANN in $\ell_p$-products of $\ell_2$, which is of independent interest. In Section 3, we employ this result to address the $\ell_p$-distance of curves. We conclude with future work.

$\ell_p$-products of $\ell_2$
In this section, we present a simple data structure for ANN in $\ell_p$-products of finite subsets of $\ell_2$. Recall that the $\ell_p$-product of $X_1, \ldots, X_m$, which are finite subsets of $\ell_2$, is a metric space with ground set $X_1 \times X_2 \times \cdots \times X_m$ and distance function $d_p\big((x_1,\ldots,x_m),(y_1,\ldots,y_m)\big) = \left(\sum_{i=1}^{m} \|x_i - y_i\|_2^p\right)^{1/p}$. For ANN, the algorithm first randomly embeds points from $\ell_2$ into $\ell_p$. Then, it is easy to translate the original problem to ANN in $\ell_p$ for long vectors corresponding to point sequences.
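A direct implementation of this product distance is immediate; the sketch below (illustrative) takes two equal-length point sequences and a parameter `p`:

```python
# The l_p-product distance of two point sequences: the l_p norm of the
# vector of coordinate-wise Euclidean distances.
from math import dist

def lp_product_dist(X, Y, p):
    assert len(X) == len(Y)
    return sum(dist(x, y) ** p for x, y in zip(X, Y)) ** (1.0 / p)
```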

Concentration inequalities
In this subsection, we prove concentration inequalities for central absolute moments of the normal distribution. Most of these results are probably folklore, and the reasoning is quite similar to that followed by proofs of the Johnson-Lindenstrauss lemma, e.g. [11]. The 2-stability property of standard normal variables, along with standard facts about their absolute moments, implies the following claim.

Claim 2. Let $v \in \mathbb{R}^d$ and let $G$ be a $k \times d$ matrix with i.i.d. random variables following $N(0,1)$. Then $\mathbb{E}\big[\|Gv\|_p^p\big] = c_p \cdot k \cdot \|v\|_2^p$, where $c_p = 2^{p/2}\,\Gamma\!\left(\frac{p+1}{2}\right)/\sqrt{\pi}$ is a constant depending only on $p > 1$.
Proof. Let $g = (X_1, \ldots, X_d)$ be a vector of random variables which follow $N(0,1)$, and consider any vector $v \in \mathbb{R}^d$. The 2-stability property of Gaussian random variables implies that $\langle g, v\rangle \sim N(0, \|v\|_2^2)$. Recall the following standard fact for central absolute moments of $Z \sim N(0, \sigma^2)$: $\mathbb{E}\big[|Z|^p\big] = \sigma^p \cdot 2^{p/2}\,\Gamma\!\left(\frac{p+1}{2}\right)/\sqrt{\pi}$. Hence, summing over the $k$ rows of $G$, $\mathbb{E}\big[\|Gv\|_p^p\big] = c_p \cdot k \cdot \|v\|_2^p$.

In the following lemma, we give a simple upper bound on the moment generating function of $|X|^p$, where $X \sim N(0,1)$.
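The constant $c_p$ is the $p$-th absolute moment of a standard normal variable. The closed form below is this standard fact, checked by a quick seeded Monte Carlo experiment (an illustrative sketch, not part of the paper):

```python
# c_p = E|X|^p for X ~ N(0,1): closed form versus a Monte Carlo estimate.
import math, random

def c_p(p):
    return 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)

rng = random.Random(1)
n = 200_000
p = 3.0
# sample mean of |X|^p; should be close to c_p(3) ~ 1.596
estimate = sum(abs(rng.gauss(0.0, 1.0)) ** p for _ in range(n)) / n
```

Note the sanity checks: $c_2 = 1$ (the variance of a standard normal) and $c_1 = \sqrt{2/\pi}$ (the mean absolute value).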
Proof. We use the easily verified fact that for any $x \le 1$, $\exp(x) \le 1 + x + x^2$, and the standard inequality $1 + x \le e^x$, for all $x \in \mathbb{R}$.
Claim 4. Let $X \sim N(0,1)$. Then there exists a constant $C > 0$ such that for any $p \ge 1$, $\big(\mathbb{E}\,|X|^p\big)^{1/p} \le C\sqrt{p}$.

Proof. In the following, we denote by $f(p) \approx g(p)$ the fact that there exist constants $0 < c < C$ such that for any $p > 1$, $c \cdot g(p) \le f(p) \le C \cdot g(p)$. We make use of the Stirling approximation and standard facts about the moments of normal variables.
The following lemma is the main ingredient of our embedding, since it provides us with a lower tail inequality for one projected vector.

Proof. For $X \sim N(0,1)$ and any $t > 0$, we bound the lower tail via the moment generating function; the last inequality derives from Claim 4. Now we choose $t$ accordingly, and we assume w.l.o.g. that $C > 4$. Hence the claimed bound holds for some constant $c > 1$.
Finally, we make use of the following one-sided Johnson-Lindenstrauss lemma (see e.g. [11]).

Theorem 6. Let $G$ be a $k \times d$ matrix with i.i.d. random variables following $N(0,1)$ and consider a vector $v \in \mathbb{R}^d$. Then, for a constant $C > 0$, $\Pr\big[\|Gv\|_2 \ge (1+\varepsilon)\sqrt{k}\,\|v\|_2\big] \le \exp(-C\varepsilon^2 k)$.

Standard properties of $\ell_p$ norms imply a loose upper tail inequality.

Corollary 7. Let $G$ be a $k \times d$ matrix with i.i.d. random variables following $N(0,1)$ and let $v \in \mathbb{R}^d$. Then, for a constant $C > 0$, $\Pr\big[\|Gv\|_p \ge 2\sqrt{k}\,\|v\|_2\big] \le \exp(-Ck)$.

Proof. Since $p \ge 2$, we have $\|x\|_p \le \|x\|_2$ for all $x \in \mathbb{R}^d$. Hence, by Theorem 6, $\Pr\big[\|Gv\|_p \ge 2\sqrt{k}\,\|v\|_2\big] \le \Pr\big[\|Gv\|_2 \ge 2\sqrt{k}\,\|v\|_2\big] \le \exp(-Ck)$.

However, an improved upper tail inequality can be derived when $p \in [1, 2]$.
Lemma 8. Let $G$ be a $k \times d$ matrix with i.i.d. random variables following $N(0,1)$ and let $v \in \mathbb{R}^d$. Then $\Pr\big[\|Gv\|_p^p \ge (1+\varepsilon)\,c_p\, k\, \|v\|_2^p\big] \le \exp(-C\varepsilon^2 k)$, where $c_p = 2^{p/2}\,\Gamma\!\left(\frac{p+1}{2}\right)/\sqrt{\pi}$.
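A small seeded experiment (an illustrative sketch with parameters chosen only for the demo) shows the concentration behind these claims: for a unit vector $v$, the entries of $Gv$ are i.i.d. $N(0,1)$ by 2-stability, so $\|Gv\|_p^p / k$ should be close to $c_p$, and $\|Gv\|_2/\sqrt{k}$ close to 1:

```python
# Empirical check of the concentration of ||Gv||_p for a unit vector v.
import math, random

def c_p(p):
    return 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)

rng = random.Random(7)
k, d = 2000, 5
v = [0.6, 0.8, 0.0, 0.0, 0.0]                  # a unit vector in R^5
G = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
Gv = [sum(g * x for g, x in zip(row, v)) for row in G]

ratio_2 = math.sqrt(sum(t * t for t in Gv) / k)   # should be near 1
ratio_1 = sum(abs(t) for t in Gv) / k             # should be near c_p(1) ~ 0.798
```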

Embedding $\ell_2$ into $\ell_p$
In this subsection, we present our main results concerning ANN for $\ell_p$-products of $\ell_2$. First, we show that a simple random projection maps points from $\ell_2^d$ to $\ell_p^k$, where $k = \tilde{O}(d)$, without arbitrarily contracting norms. The probability of failure decays exponentially with $k$. For our purposes, there is no need for an almost-isometry between norms. Hence, our efforts focus on proving lower tail inequalities which imply that, with good probability, no far neighbor corresponds to an approximate nearest neighbor in the projected space.
We now prove bounds concerning the contraction of distances of the embedded points. Our proof builds upon the inequalities developed in Subsection 2.1.
Proof. By Lemma 5, the lower tail bound holds for each fixed vector. In order to bound the probability of contraction among all distances, we argue that it suffices to use the strong bound on distance contraction, which is derived in Lemma 5, and the weak bound on distance expansion from Corollary 7 or Lemma 8, for a $\delta$-dense set $N \subset S^{d-1}$, with $\delta$ to be specified later. First, a simple volumetric argument [8] shows that such a set $N$ exists with $|N| \le (3/\delta)^d$. We first consider the case $p > 2$. From now on, we assume that for any $u \in N$, $\|Gu\|_p \ge (c_p \cdot k)^{1/p}/(1 + \varepsilon)$ and $\|Gu\|_p \le 2\sqrt{k}$, which is achieved with the probability guaranteed by Lemma 5 and Corollary 7. Now let $x$ be an arbitrary vector in $\mathbb{R}^d$ with $\|x\|_2 = 1$. Then there exists $u \in N$ such that $\|x - u\|_2 \le \delta$. Also, by the triangle inequality, we obtain a bound in terms of $M = \max_{\|x\|_2 = 1} \|Gx\|_p$; the existence of $M$ is implied by the fact that $S^{d-1}$ is compact and $x \mapsto \|x\|_p$, $x \mapsto Gx$ are continuous functions. Then, by plugging $M$ into (1),

where the last inequality is implied by Corollary 7. Again, by the triangle inequality, the bound on expansion follows. In the case $p \in [1, 2]$, we are able to use a better bound on the distance expansion, namely Lemma 8. We now assume that the corresponding bounds hold for any $u \in N$. Once again, we use inequality (1) to obtain the claimed bound.

Theorem 9 implies that the ANN problem for $\ell_p$-products of $\ell_2$ translates to the ANN problem for $\ell_p$-products of $\ell_p$. The latter easily translates to the ANN problem in $\ell_p^{d'}$. One can then solve the approximate near neighbor problem in $\ell_p^{d'}$ by approximating $\ell_p^{d'}$ balls of radius 1 with a regular grid of side length $\varepsilon/(d')^{1/p}$: each approximate ball is essentially a set of $O(1/\varepsilon)^{d'}$ cells [8]. Building not-so-many approximate near neighbor data structures, for various radii, leads to an efficient solution for the ANN problem [8].
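The bucketing idea can be sketched in a toy, low-dimensional form (illustrative assumptions: unit radius, a hash table keyed by grid cells, and a crude cell-center rule for deciding which cells a ball contributes):

```python
# Toy grid bucketing for approximate near neighbor at radius 1 in l_p^{d'}.
import itertools, math

def build(points, p, eps):
    """Hash every grid cell whose center lies within (1 + eps) of a data
    point; `table[cell]` stores the index of one such point."""
    dprime = len(points[0])
    side = eps / dprime ** (1.0 / p)         # a cell then has l_p diameter eps
    reach = int(math.ceil((1.0 + eps) / side)) + 1
    table = {}
    for idx, q in enumerate(points):
        base = tuple(int(math.floor(c / side)) for c in q)
        for off in itertools.product(range(-reach, reach + 1), repeat=dprime):
            cell = tuple(b + o for b, o in zip(base, off))
            center = [(c + 0.5) * side for c in cell]
            if sum(abs(a - b) ** p for a, b in zip(center, q)) ** (1.0 / p) <= 1.0 + eps:
                table.setdefault(cell, idx)
    return table, side

def query(table, side, q):
    """One hash lookup: the cell containing q."""
    cell = tuple(int(math.floor(c / side)) for c in q)
    return table.get(cell)                   # index of a near point, or None

table, side = build([(0.0, 0.0), (5.0, 5.0)], p=2, eps=0.5)
```

Each point contributes $O(1/\varepsilon)^{d'}$ cells, matching the space bound quoted above, while a query is a single lookup.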
Theorem 10. There exists a data structure which solves the ANN problem for point sequences in $\ell_p$-products of $\ell_2$, and satisfies the following bounds on performance. For both $p \in [1, 2]$ and $2 < p < \infty$, space usage and preprocessing time are exponential in $dm$, with the exponent governed by $\alpha_{p,\varepsilon} = \log(1/\varepsilon) \cdot (2 + p\varepsilon)^2 \cdot (p\varepsilon)^{-2}$, and the query time is in $\tilde{O}(dm \log n)$. We assume $\varepsilon \in (0, 1/2]$. The probability of success is $\Omega(\varepsilon)$, and can be amplified to $1 - \delta$ by building $\Omega(\log(1/\delta)/\varepsilon)$ independent copies of the data structure.
Proof. Let $\delta_{p,\varepsilon} = p\varepsilon/(2 + p\varepsilon)$. We first consider the case $p > 2$. We employ Theorem 9 and map point sequences to point sequences in $\ell_p^k$, for a suitable $k = \tilde{O}(d)$.
Hence, Theorem 9 implies the required lower tail bound. Then, by concatenating vectors, we map point sequences to points in $\ell_p^{km}$. Now, fix a query point sequence $Q = q_1, \ldots, q_m \in (\mathbb{R}^d)^m$ and its nearest neighbor $U^* = u_1, \ldots, u_m \in (\mathbb{R}^d)^m$. By a union bound, the probability of failure for the embedding is at most the sum of the probabilities of contraction and of expansion; since the first is already bounded, we now bound the second. By Markov's inequality, we obtain the required bound on expansion. Hence, the total probability of success is $\Omega(\varepsilon)$. In the projected space, we build AVDs [8], which determine the total space usage and the preprocessing time.
The query time is $\tilde{O}(dm \log n)$. The probability of success can be amplified by repetition: by building $\Theta(\log(1/\delta)/\varepsilon)$ data structures as above, the probability of failure becomes $\delta$. The same reasoning is valid in the case $p \in [1, 2]$; it suffices to set the target dimension $k$ according to the improved upper tail bound of Lemma 8.
When $p \in [1, 2]$, we can also utilize "high-dimensional" solutions for $\ell_p$ and obtain data structures with complexities polynomial in $d \cdot m$. Combining Theorem 9 with the data structure of [4], we obtain the following result.

Theorem 11. There exists a data structure which solves the ANN problem for point sequences in $\ell_p$-products of $\ell_2$, $p \in [1, 2]$, and satisfies the following bounds on performance: space usage and preprocessing time are in $\tilde{O}(n^{1+\rho_u})$, and the query time is in $\tilde{O}(n^{\rho_q})$, where $\rho_q$, $\rho_u$ satisfy the trade-off inherited from the LSH-based structure of [4]. We assume $\varepsilon \in (0, 1/2]$. The probability of success is $\Omega(\varepsilon)$, and can be amplified to $1 - \delta$ by building $\Omega(\log(1/\delta)/\varepsilon)$ independent copies of the data structure.
Proof. We proceed as in the proof of Theorem 10. We employ Theorem 9 and, by Markov's inequality, we obtain the required bound on expansion. Then, by concatenating vectors, we map point sequences to points in $\ell_p^{km}$, where $k = \tilde{O}(d)$. For the mapped points in $\ell_p^{km}$, we build the LSH-based data structure from [4], which succeeds with high probability $1 - o(1)$. By independence, both the random projection and the LSH-based structure succeed with probability $\Omega(\varepsilon) \times (1 - o(1)) = \Omega(\varepsilon)$.

Polygonal curves
In this section, we show that one can solve the ANN problem for a certain class of distance functions defined on polygonal curves. Since this class is related to $\ell_p$-products of $\ell_2$, we invoke the results of Section 2, and we show an efficient data structure for the case of short curves, i.e. when $m$ is relatively small compared to the other complexity parameters. First, we need to introduce a formal definition of the traversal of two curves.
Definition 12. Given polygonal curves $V = v_1, \ldots, v_{m_1}$, $U = u_1, \ldots, u_{m_2}$, a traversal $T = (i_1, j_1), \ldots, (i_t, j_t)$ is a sequence of pairs of indices referring to a pairing of vertices from the two curves such that $(i_1, j_1) = (1, 1)$, $(i_t, j_t) = (m_1, m_2)$, and for each $k$, $(i_{k+1}, j_{k+1}) \in \{(i_k + 1, j_k),\ (i_k, j_k + 1),\ (i_k + 1, j_k + 1)\}$.

Now, we define a class of distance functions for polygonal curves. In this definition, it is implied that we use the Euclidean distance to measure the distance between any two points. However, the definition can easily be generalized to arbitrary metrics.

Definition 13 ($\ell_p$-distance of polygonal curves). Given polygonal curves $V = v_1, \ldots, v_{m_1}$, $U = u_1, \ldots, u_{m_2}$, we define the $\ell_p$-distance between $V$ and $U$ as the function $d_p(V, U) = \min_{T \in \mathcal{T}} \Big(\sum_{(i_k, j_k) \in T} \|v_{i_k} - u_{j_k}\|_2^p\Big)^{1/p}$, where $\mathcal{T}$ denotes the set of all possible traversals of $V$ and $U$.
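Definition 13 can be evaluated directly, if inefficiently, by searching over traversals; the following sketch (illustrative, exponential-time) does exactly that, and also covers $p = \infty$ by aggregating with max instead of a sum of $p$-th powers:

```python
# Brute-force evaluation of the l_p-distance of two polygonal curves.
from math import dist, inf

def lp_curve_dist(V, U, p):
    m1, m2 = len(V), len(U)

    def go(i, j):
        # Best value over traversals starting at pair (i, j), aggregated
        # as a sum of p-th powers (or as a max, for p = inf).
        d = dist(V[i], U[j])
        if i == m1 - 1 and j == m2 - 1:
            return d if p == inf else d ** p
        best = inf
        for ni, nj in ((i + 1, j), (i, j + 1), (i + 1, j + 1)):
            if ni < m1 and nj < m2:
                best = min(best, go(ni, nj))
        return max(d, best) if p == inf else d ** p + best

    best = go(0, 0)
    return best if p == inf else best ** (1.0 / p)
```

On any input, `lp_curve_dist(V, U, inf)` agrees with DFD and `lp_curve_dist(V, U, 1)` with DTW, as stated above.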
The above class of distances for curves includes some widely known distance functions. For instance, $d_\infty(V, U)$ coincides with the DFD of $V$ and $U$ (defined for the Euclidean distance). Moreover, $d_1(V, U)$ coincides with DTW for curves $V$, $U$.

Theorem 14. Suppose that there exists a randomized data structure for the ANN problem in $\ell_p$-products of $\ell_2$, with space in $S(n)$, preprocessing time $T(n)$ and query time $Q(n)$, with probability of failure less than $2^{-2m-1} m^{-1}$. Then there exists a data structure for the ANN problem for the $\ell_p$-distance of polygonal curves, with space, preprocessing time and query time in $m \cdot 2^{2m} \cdot S(n)$, $m \cdot 2^{2m} \cdot T(n)$ and $m \cdot 2^{2m} \cdot Q(n)$ respectively, where $m$ denotes the maximum length of a polygonal curve, and the probability of failure is less than $1/2$.

Proof. We denote by $X$ the input dataset. Given polygonal curves $V = v_1, \ldots, v_{m_1}$, $Q = q_1, \ldots, q_{m_2}$, and a traversal $T$, one can define $V_T$, $Q_T$, sequences of $l$ points (allowing consecutive duplicates), such that for all $k$, $v_{i_k} = V_T[k]$ and $q_{j_k} = Q_T[k]$ if and only if $(i_k, j_k) \in T$.
A traversal of $V$, $Q$ is uniquely defined by its length $l \in \{\max(m_1, m_2), \ldots, m_1 + m_2\}$, the set of indices $A = \{k \in \{1, \ldots, l\} \mid i_{k+1} - i_k = 0 \text{ and } j_{k+1} - j_k = 1\}$ for which only $Q$ progresses, and the set of indices $B = \{k \in \{1, \ldots, l\} \mid i_{k+1} - i_k = 1 \text{ and } j_{k+1} - j_k = 1\}$ for which both $Q$ and $V$ progress. We can now define $V_{l,A,B}$, $Q_{l,A,B}$ to be the corresponding sequences of $l$ points. In other words, if $l, A, B$ corresponds to traversal $T$, then $V_{l,A,B} = V_T$ and $Q_{l,A,B} = Q_T$. Observe that it is possible that curve $V$ is not compatible with some triple $l, A, B$.
We build one ANN data structure, for $\ell_p$-products of $\ell_2$, for each possible triple $l, A, B$. Each data structure contains at most $|X|$ point sequences, which correspond to curves that are compatible with $l, A, B$. Let $m = \max(m_1, m_2)$. The total number of data structures is upper bounded by $m \cdot 2^{2m}$. For any query curve $Q$, we create all possible combinations of $l, A, B$, perform one query per ANN data structure, and report the best answer. The probability that the construction of one of the at most $m \cdot 2^{2m}$ data structures is not successful is less than $1/2$, due to a union bound.
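The counting argument can be checked on small instances: each traversal is encoded by a distinct triple $(l, A, B)$, so the number of traversals never exceeds the $m \cdot 2^{2m}$ bound above. A brute-force count (illustrative sketch):

```python
# Count all traversals of two curves of lengths m1, m2 by direct recursion
# over the three allowed index moves.
def count_traversals(m1, m2):
    def go(i, j):
        if i == m1 - 1 and j == m2 - 1:
            return 1
        total = 0
        for ni, nj in ((i + 1, j), (i, j + 1), (i + 1, j + 1)):
            if ni < m1 and nj < m2:
                total += go(ni, nj)
        return total
    return go(0, 0)
```

For example, two curves of length 3 admit 13 traversals, comfortably below the bound $3 \cdot 2^{6} = 192$.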
We now investigate applications of the above results to the ANN problem for some popular distance functions for curves.

Conclusion
Thanks to the simplicity of our approach, it should be easy to implement, and we expect it to be of practical interest. We plan to apply it to real scenarios with data from road segments or time-series.
The key ingredient of our approach is a randomized embedding from $\ell_2$ to $\ell_p$, which is the first step of the ANN solution for $\ell_p$-products of $\ell_2$. The embedding is essentially a Gaussian projection, and it exploits the 2-stability property of normal variables, along with standard properties of their tails. We expect that a similar result can be achieved for $\ell_p$-products of $\ell_q$, where $q \in [1, 2)$. One related result for ANN [6] provides dimension reduction for $\ell_q$, $q \in [1, 2)$.