A New Topology-Preserving Distance Metric with Applications to Multi-dimensional Data Clustering

Abstract. In many cases of high-dimensional data analysis, data points may lie on manifolds of very complex shapes/geometries. Thus, the usual Euclidean distance may lead to suboptimal results when utilized in clustering or visualization operations. In this work, we introduce a new distance definition in multi-dimensional spaces that preserves the topology of the data point manifold. The parameters of the proposed distance are discussed and their physical meaning is explored through 2- and 3-dimensional synthetic datasets. A robust method for the parameterization of the algorithm is suggested. Finally, a modification of the well-known k-means clustering algorithm is introduced to exploit the benefits of the proposed distance metric for data clustering. Comparative results including other established clustering algorithms are presented in terms of cluster purity and V-measure, for a number of well-known datasets.


Introduction
Clustering high-dimensional data is an area that has attracted considerable research interest over the past two decades [1,3,6]. The existence of irrelevant features and of correlations between subsets of features, which are commonly encountered in such datasets, renders the task of identifying clusters much harder, as distances between observations become less informative about the cluster structure. Dimensionality reduction and feature embedding are widely used to improve clustering performance and to enable the visualization of the resulting cluster structure in such data. Although well-established methods like Principal Component Analysis (PCA) and metric Multi-Dimensional Scaling (MDS) [2] have been successfully applied to a plethora of high-dimensional applications, there is no guarantee that the cluster structure in the high-dimensional space will be preserved in the low-dimensional subspace, since in many cases clusters are defined by highly non-linear structures. For this purpose, nonlinear dimensionality reduction techniques have been explicitly designed to identify a lower-dimensional manifold along which the data lie, and are therefore appropriate for distinguishing nonlinearly separable clusters.
Kernel-based clustering is amongst the most popular methods for nonlinear clustering, based on the projection of the input data points into a high-dimensional kernel space in order to make nonlinear clusters linearly separable [5]. In particular, kernel k-means combines the k-means method with the kernel trick in an attempt to deal with nonlinearly separable data; however, specifying a suitable kernel function and appropriate parameters is, in most cases, a hard task. Another widely used manifold learning method is isometric mapping (Isomap). Instead of using the Euclidean distance, Isomap is based on approximating geodesic distances along the manifold [7]. However, Isomap operates on neighboring data points defined by a Euclidean distance threshold, which accelerates the algorithm but presents problems in the case of outlier points. In [8] the authors applied the k-means clustering algorithm after Isomap and proposed a modified definition of the geodesic distances, but concluded that even their modified method was unsatisfactory in real-data cases where the data is noisy or the clusters are highly nonlinear.
In this work we propose a new topology-preserving distance that follows the geodesics of the underlying manifold. Instead of imposing a threshold on the distance between points, we construct a graph with all available points and impose a penalty function that penalizes distant points. The definition of distant points uses a characteristic distance parameter whose value is automatically estimated from the available dataset. Furthermore, we propose a modification of the k-means algorithm incorporating the benefits of the newly introduced distance metric, which preserves the topology of the data point manifold. A critical advantage of the proposed approach is the feasible and robust parameterization of the algorithm. Extensive experiments on both simulated and real datasets, employing the Purity and V-measure metrics for comparison, as described in [4], provide further evidence of the wide applicability of the proposed method.

The proposed topology-preserving distance metric
Let P be a data matrix of dimensions N×K, each row of which is a feature vector (equivalently, a data point) p of dimensionality K. Any given set of such points may be arranged on an unknown manifold in the K-dimensional space. Thus, the Euclidean distance metric between any two points may not represent their actual distance along the manifold.
Let us define an auxiliary distance metric between any pair of data points p_i, p_j as:

D_ij = d_E(p_i, p_j),        if d_E(p_i, p_j) ≤ d_0
D_ij = λ · d_E(p_i, p_j),    otherwise,                  (1)

where d_E(p_i, p_j) is the Euclidean distance between the two points, λ is a sufficiently large value and d_0 is a characteristic distance, whose value is estimated from the current dataset, as will be described later. Let us stress that D_ij is not the distance metric proposed in this work, but rather an auxiliary definition.
For a given set of data points, a complete graph G = (P, E) is defined, with the set of vertices P being the set of all data points and the set of edges E being the set of all possible connections between vertices. Thus, each point is connected to all other points in the data set. The cost of the connection (edge) between any pair of points p_i, p_j is set equal to their auxiliary distance D_ij, as defined in Eq. (1). The proposed topology-preserving distance between p_i and p_j is defined as the cost of the minimum-cost path π_ij between the two points, computed by the well-known Dijkstra's algorithm.
Since any generated path π_ij between p_i and p_j consists of an ordered series of data points with indices (k_1 = i, k_2, …, k_M = j), the proposed topology-preserving distance between p_i and p_j is calculated as

A_ij = Σ_{m=1…M−1} D_{k_m k_{m+1}}.                      (2)

The parameter d_0 is a characteristic length in the K-dimensional space that defines the scale of local linearity in a given set of data points. It is self-evident that any two points with Euclidean distance less than or equal to d_0 will be connected without any intermediate points. On the other hand, for any two points with Euclidean distance greater than d_0, the proposed algorithm will generate a connecting path with intermediate points, provided that sufficient pairs of these points have Euclidean distance not greater than d_0. Fig. 1 shows the paths generated by Dijkstra's algorithm using the auxiliary distance D_ij, in the case of the Swiss roll dataset [11], for λ = 10^8 and d_0 = 2, using one randomly selected point i_0 as the source for Dijkstra's algorithm (the point from which all other distances are calculated). The dataset is constructed to contain 2 classes (N_c = 2) of 400 points each, denoted by different colors. The points lie on a manifold that is defined by a parametric equation. The paths π_{i_0,j} are also plotted as blue lines for all points j = 1, 2, …, 800, j ≠ i_0. As can be observed, the selected value of d_0 generates shortest paths that lie on the manifold, rather than crossing the gap, as would be dictated by the Euclidean distance. The use of the proposed distance metric in any clustering or classification process on this dataset is therefore expected to significantly increase the achieved accuracy. Conversely, for very large values of d_0 the Euclidean distance does not exceed d_0 for almost all pairs of points p_i, p_j, also resulting in single-edge paths, just as described previously.
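The construction described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function name `topology_preserving_distances` is my own, and SciPy's dense-graph Dijkstra routine is assumed for the minimum-cost paths.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.spatial.distance import pdist, squareform

def topology_preserving_distances(P, d0, lam=1e8):
    """Sketch of the proposed metric: build the complete graph with the
    auxiliary edge costs of Eq. (1), then take minimum-cost path costs."""
    E = squareform(pdist(P))           # Euclidean distance matrix, N x N
    D = np.where(E <= d0, E, lam * E)  # auxiliary distance D_ij, Eq. (1)
    np.fill_diagonal(D, 0.0)
    # A_ij = cost of the minimum-cost path between p_i and p_j, Eq. (2)
    A = dijkstra(D, directed=False)
    return A
```

For points forming a chain with unit steps, two points several steps apart then receive the sum of the small steps rather than a λ-penalized direct jump, as long as each step does not exceed d_0.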

 
Let i_0 be the index of a randomly selected point. The proposed algorithm is executed N−1 times, connecting i_0 with each of the remaining N−1 points p_j, calculating the distance A_{i_0 j} and generating the corresponding path π_{i_0 j}. Let us denote the sequence of points that constitute the path from i_0 to j as (k_1 = i_0, k_2, …, k_M = j), and let

d_max^{i_0 j} = max_{m=1…M−1} d_E(p_{k_m}, p_{k_{m+1}})      (3)

be the length of the longest Euclidean step along this path. The average of d_max^{i_0 j} over all other points j in the data set can then be calculated for the selected point i_0:

d̄_{i_0} = (1/(N−1)) · Σ_{j ≠ i_0} d_max^{i_0 j}.             (4)

By its definition, d̄_{i_0} is calculated for a selected point i_0 and is a function of d_0.
It is easily proven that, when d_0 takes very low values, below the minimum Euclidean distance d_min in the dataset, d̄_i is equal to the mean Euclidean distance between i_0 and all data points, since all edges are then equally penalized by λ and every minimum-cost path degenerates to a single edge. In the case of data points being equally distributed (e.g., on a regular grid), the quantity d̄_i is expected to be monotonically increasing for d_0 in [d_min, d_max]. When d_0 approaches d_max, d̄_i again becomes equal to the mean Euclidean distance between i_0 and all data points and remains constant for larger values of d_0. In the case, however, of anisotropic data point distributions, d̄_i drops sharply when d_0 takes an appropriate intermediate value, since the proposed algorithm then generates connecting paths between data points that consist of steps with smaller distances. Thus, for any point i in the dataset, the quantity d̄_i is calculated for different values of d_0, and the value of d_0 at which d̄_i attains its minimum is denoted by d̄_i^min. In the special case of data points lying on a manifold with large-scale concavities, d̄_i^min indicates the characteristic length of the concavities. Thus, setting d_0 to a value less than d̄_i^min will cause the proposed algorithm to produce connecting paths between data points that do not cross the concavities but lie on the manifold, thus behaving like geodesic curves. In order to obtain a good estimation of d̄_i^min, the calculation is repeated for many randomly selected points in the dataset and the average d̄_i^min is obtained. Figure 2 shows d̄_i as a function of d_0 for data points on a 2D regular grid and for randomly distributed data points. The effect of d_0 is demonstrated in Fig. 4. One point is randomly selected from the dataset and the minimum-cost paths connecting it to all other points are shown, for different values of d_0. It can be observed that for very low or very high values of d_0 the connecting paths cross the topological gap between points, whereas for intermediate values the paths follow the manifold, like geodesic curves.
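The parameter-selection procedure above can be sketched as follows. This is a hypothetical illustration, not the authors' code: the names `mean_max_step` and `estimate_d0` are my own, and SciPy's Dijkstra with predecessor recovery is assumed for walking back each minimum-cost path.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.spatial.distance import pdist, squareform

def mean_max_step(P, i0, d0, lam=1e8):
    """d-bar_{i0}(d0), Eq. (4): the average, over all j != i0, of the
    longest Euclidean step along the minimum-cost path from i0 to j."""
    E = squareform(pdist(P))
    D = np.where(E <= d0, E, lam * E)   # auxiliary distances, Eq. (1)
    np.fill_diagonal(D, 0.0)
    # shortest-path tree rooted at i0, with predecessors for path recovery
    dist, pred = dijkstra(D, directed=False, indices=i0,
                          return_predecessors=True)
    max_steps = []
    for j in range(len(P)):
        if j == i0:
            continue
        k, longest = j, 0.0
        while pred[k] >= 0:             # walk back from j to the source i0
            longest = max(longest, E[k, pred[k]])
            k = pred[k]
        max_steps.append(longest)
    return float(np.mean(max_steps))

def estimate_d0(P, i0, candidates):
    """Pick the candidate d0 at which d-bar_{i0} attains its minimum."""
    curve = [mean_max_step(P, i0, d0) for d0 in candidates]
    return candidates[int(np.argmin(curve))]
```

On a regular 1D chain with unit spacing, a d_0 below d_min yields single-edge paths (so the mean equals the mean Euclidean distance from i_0), while a d_0 just above the spacing yields unit-step paths, reproducing the behavior described in the text.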
Fig. 4. The minimum-cost paths generated by the proposed algorithm applied on the Swiss roll dataset, for different values of d_0. Intermediate values of d_0, as suggested by the estimation of the average d̄_i^min, produce paths that follow the underlying manifold.

A k-means variant for the proposed topology-preserving distance metric
In this section, a variant of the k-means clustering algorithm is proposed that utilizes the proposed topology-preserving distance metric. The main differences from the classic k-means algorithm can be summarized as follows: the N×N distance matrix is calculated using the proposed distance metric of Eq. (2) (which requires the characteristic length d_0); the class centers in each iteration are selected from the data points, so that they minimize the average distance (under the proposed metric) from the members of the specific class; and the algorithm is terminated when all class centers remain unchanged in two consecutive iterations. The details of the proposed algorithm are given below.
Input: the data matrix P, the number of classes N_c, and the N×N distance matrix A computed using the proposed metric.
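The assignment and center-update steps described above can be sketched as a medoid-style iteration over a precomputed distance matrix. This is a minimal sketch, not the authors' implementation: `kmeans_topology` and its signature are my own naming, and A is assumed to hold the proposed pairwise distances.

```python
import numpy as np

def kmeans_topology(A, n_classes, rng=None, max_iter=100):
    """k-means variant over a precomputed N x N distance matrix A.
    Centers are data points (medoids) minimizing the average distance
    to their cluster's members; stops when centers stop changing."""
    rng = np.random.default_rng(rng)
    N = A.shape[0]
    centers = rng.choice(N, size=n_classes, replace=False)
    labels = np.zeros(N, dtype=int)
    for _ in range(max_iter):
        # assign each point to the nearest center under the given metric
        labels = np.argmin(A[:, centers], axis=1)
        new_centers = centers.copy()
        for c in range(n_classes):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # medoid: the member with minimum mean distance to the cluster
            within = A[np.ix_(members, members)].mean(axis=1)
            new_centers[c] = members[int(np.argmin(within))]
        if np.array_equal(np.sort(new_centers), np.sort(centers)):
            break
        centers = new_centers
    return labels, centers
```

Because only the matrix A is consulted, the same routine works unchanged with Euclidean distances or with the proposed topology-preserving ones.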

Results
The proposed k-means variant that uses the proposed distance metric is evaluated against the classic k-means, the kernel k-means (implemented as in [9]) and spectral clustering (implemented as in [10] and [12]), in terms of clustering purity as well as V-measure. The proposed method has been executed 20 times with random initialization, and the resulting average purity, as well as the standard deviation, are plotted for different values of d_0 in Fig. 5. For values of d_0 well above the estimated d̄_i^min, the proposed algorithm behaves very similarly to the classic k-means. This is expected since, as described above, the proposed distance definition then becomes similar to the Euclidean one.

Fig. 5. The achieved purity and V-measure using the proposed method, applied to the Swiss roll dataset, for different values of d_0, against the classic k-means, the kernel k-means [9] and the spectral clustering [10].
Fig. 6. The achieved purity and V-measure using the proposed method, applied to the COIL dataset, for different values of d_0, against the classic k-means and the kernel k-means [9]. The spectral clustering [10] produced worse results and was therefore not included in the graph.
Table 1 shows the clustering purity and V-measure achieved by the proposed method, k-means, kernel k-means and spectral clustering. The values for the proposed method were calculated using a value of d_0 slightly less than the estimated d̄_i^min. The standard Matlab implementation was used for the k-means method. Kernel k-means was used as provided in [9]. Spectral clustering was used as provided in [10] and/or in [12], which implements the algorithm described in [13].
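The purity metric used throughout the comparisons can be computed as sketched below (a hypothetical helper of my own naming, following the standard definition in [4]): each predicted cluster contributes the count of its majority ground-truth class, normalized by the total number of points.

```python
import numpy as np
from collections import Counter

def purity(y_true, y_pred):
    """Cluster purity: the fraction of points that belong to the
    majority ground-truth class of their assigned cluster."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0
    for c in np.unique(y_pred):
        members = y_true[y_pred == c]          # true labels in cluster c
        total += Counter(members).most_common(1)[0][1]  # majority count
    return total / len(y_true)
```

The V-measure, the second metric reported, is available off the shelf as `sklearn.metrics.v_measure_score(y_true, y_pred)`.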

Conclusions
A new distance metric for high-dimensional data has been presented that preserves the topology of the underlying manifold. A variant of the k-means clustering algorithm has been suggested that utilizes this metric. The value of the main parameter of the proposed distance metric can be obtained through a standard and efficient process.
The performance of the proposed method has been analyzed theoretically and validated experimentally on a number of benchmark datasets. Comparative results with well-established clustering algorithms show that the proposed method is a competent alternative with consistent behavior that systematically performs as well as or better than the other techniques under comparison. Future work includes the algorithmic fine-tuning of the proposed k-means variant and the extension of the distance metric to visualization and dimensionality reduction techniques. Comparative results will also be expanded to include more optimized implementations of other state-of-the-art methods.

Fig. 1. The paths generated by the proposed distance definition in the case of the Swiss roll dataset, for d_0 = 2. The points lie on a manifold and consist of two classes, shown in green and red.

Fig. 2. The d̄_i as a function of d_0, for a number of data points, with the determined d̄_i^min plotted as a green circle, for data points (a) on a 2D regular grid and (b) randomly distributed.

Experimentation shows (see the Results section) that the proposed variant of the k-means method produces consistently optimal results for values of d_0 slightly smaller than the estimated d̄_i^min.

The same quantities achieved by the classic k-means clustering, the kernel k-means and the spectral clustering are also shown. It can be observed that the proposed algorithm clearly outperforms the classic and the kernel k-means. The behavior of the proposed algorithm with respect to the parameter d_0 is consistent with the estimated value of d̄_i^min: for values of d_0 less than the estimate, the achieved clustering quality is consistently high, while for larger values the proposed distance approaches the Euclidean one.

Fig. 6 shows the same results for the COIL dataset. The determination of d̄_i^min, as shown in Fig. 6(a), is unambiguous. The behavior of the proposed k-means variant with respect to d_0 is also very consistent, with the best performance occurring at values of d_0 slightly smaller than the estimated d̄_i^min. The proposed k-means variant with the suggested distance metric outperforms the other methods in comparison.

Table 1. Dataset description, with the clustering purity and V-measure achieved by the proposed method, the classic and kernel k-means, and spectral clustering.