Exemplar Selection via Leave-One-Out Kernel Averaged Gradient Descent and Subtractive Clustering

Scalable data mining and machine learning require data abstractions. This work presents a scheme for the automatic selection of representative real data points as exemplars. Currently, few algorithms can select representative exemplars from the data; K-medoids and Affinity Propagation are two of them. K-medoids requires the number of exemplars to be given in advance, as well as a dissimilarity matrix in memory. Affinity Propagation automatically finds the exemplars as well as their number k, but it requires a similarity matrix in memory. A fast algorithm that works without any matrix in memory is Subtractive Clustering, but it requires user-defined bandwidth parameters. The proposed solution relies on a leave-one-out kernel averaged gradient descent that automatically estimates a suitable bandwidth parameter from the data, in conjunction with the Subtractive Clustering algorithm, which then uses this bandwidth to extract the most representative exemplars without initial knowledge of their number. Experimental simulations and comparisons of the proposed solution with Affinity Propagation exemplar selection on various benchmark datasets seem promising.


Introduction
A common problem in applications that collect and store their data is that the number of training examples may be large. Hence, many machine learning and data mining algorithms become slow [1] [2]. One solution is to select the most representative exemplars from the data. These exemplars are real data points that form an abstract view of the whole dataset, can represent the structure of the data and can also be used for recognizing patterns [2]. Finding exemplars is a hard problem [3] but is more interesting and informative than dividing data into clusters. Detecting exemplars goes beyond simple clustering, as the exemplars store compressed information [3]. Hence, exemplar selection techniques try to find additional regional information in order to extract representative k-exemplars, k-medoids or k-centers which are close to any given training point, so as to minimize the maximum distance from a point to its nearest exemplar. The first exemplar-based algorithm was k-medoids [4], which requires the number k of exemplars to be given in advance, as well as a dissimilarity matrix in memory. Yet, finding exemplars without knowing the number k is a challenge, since the k-centers problem (or k-median objective) is NP-hard [5] [6] [7].
Currently, Affinity Propagation (AP), introduced by Frey and Dueck [8], is the state-of-the-art algorithm for detecting exemplars and subsequently clustering the data around them. AP has been applied in various fields and many applications. In AP all data points are simultaneously considered as candidate exemplars and exchange deterministic messages until a good set of exemplars gradually emerges. AP finds an approximate solution by using this message-passing optimization strategy, which is based on the max-sum algorithm in a factor graph [8]. Hence, AP does not require the number of exemplars, since this number emerges automatically during the process. However, AP does require a similarity matrix in main memory as well as a user-defined parameter, the preferences, which are the diagonal values of the similarity matrix.
A fast algorithm that works without any similarity matrix in main memory is Subtractive Clustering (SC) [9] [10]. This algorithm has also been employed in RBF neural network training [11] [12]. Subtractive Clustering can determine both the exemplars and their number [10], but it requires carefully selected user-defined parameters for the bandwidth and the stopping criteria.
In this work we propose a leave-one-out kernel averaged gradient descent procedure that estimates a bandwidth parameter from the data, and we then use this bandwidth in a modified Subtractive Clustering algorithm. We demonstrate that the proposed scheme can provide an automatic estimate of the most representative exemplars from the data and at the same time can recognize the shapes of patterns.
The rest of the paper is organized as follows. Section 2 provides the basics of Subtractive Clustering. Section 3 introduces the proposed gradient descent of the leave-one-out kernel averaged regression function. Section 4 describes all the initializations and the parameter settings for the proposed scheme. Section 5 presents several experimental simulations and comparisons, while Section 6 concludes the paper.

Subtractive Clustering basics
The Subtractive Clustering algorithm [9] [10] [11] [12] selects a set of exemplars from the most representative real data points by using their density. Subtractive Clustering can work without any a priori information about the number of exemplars. In the first step it computes a density-based potential for every point, and then it selects exemplars one at a time while updating all the remaining potentials. The potential P(i) for each point $x_i$ is defined as a sum of Gaussian kernels over all the N data points as:

$$P(i) = \sum_{j=1}^{N} \exp\left(-a \, \lVert x_i - x_j \rVert^2\right) \qquad (1)$$

where $a = (2/\sigma_a)^2$ and the bandwidth $\sigma_a$ represents a neighbourhood radius. A data point will have a high potential P(i) and a high density if it has many neighbouring points.
After finding all P(i), the algorithm iteratively executes an updating cycle:
1) Find the data point $x^{*}$ (cluster center) with the highest potential value $P^{*}$.
2) Revise the potential of all other points using

$$P(i) \leftarrow P(i) - P^{*} \exp\left(-b \, \lVert x_i - x^{*} \rVert^2\right)$$

The updating cycle for the potentials P(i) terminates when the current max potential $P^{*}$ drops below a certain value: the algorithm stops if $P^{*} < e \, P_1^{*}$ [10] [11] [12], where $P_1^{*}$ is the first max potential and e is a small fraction. In each iteration the highest potential $P^{*}$ of the selected point $x^{*}$ substantially affects the revised potentials of all nearby points. Thus, the data points near the selected point $x^{*}$ will have significantly reduced density. The updates of the potentials use $b = (2/\sigma_b)^2$, where the bandwidth $\sigma_b$ is another positive constant which also defines a neighbourhood radius. Usually $\sigma_b$ is set to $1.5\sigma_a$ in order to avoid the selection of closely located exemplars.
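The potential computation and the updating cycle described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the default stopping fraction `eps = 0.15` and the pure-Python O(N^2) loops are assumptions made for readability.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((u - v) ** 2 for u, v in zip(p, q))


def subtractive_clustering(points, sigma_a, eps=0.15, ratio=1.5):
    """Return the indices of the selected exemplars, most important first.

    eps (stopping fraction) and ratio (sigma_b / sigma_a) are illustrative
    defaults chosen for this sketch.
    """
    n = len(points)
    a = (2.0 / sigma_a) ** 2
    b = (2.0 / (ratio * sigma_a)) ** 2
    # Initial potentials: sum of Gaussian kernels over all N points.
    potentials = [sum(math.exp(-a * sq_dist(p, q)) for q in points) for p in points]
    first_max = max(potentials)
    exemplars = []
    while len(exemplars) < n:
        best = max(range(n), key=potentials.__getitem__)
        p_star = potentials[best]
        if p_star < eps * first_max:  # stopping rule P* < e * P_1*
            break
        exemplars.append(best)
        # Subtract the influence of the new exemplar from every potential.
        potentials = [
            p - p_star * math.exp(-b * sq_dist(x, points[best]))
            for x, p in zip(points, potentials)
        ]
    return exemplars


blob_a = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05)]
blob_b = [(1.0, 1.0), (1.05, 1.0), (1.0, 1.05), (0.95, 1.0)]
print(subtractive_clustering(blob_a + blob_b, sigma_a=0.5))
```

On these two small, well-separated groups the sketch picks one exemplar per group, starting with the denser one. No similarity matrix is stored; only the list of N potentials is kept in memory.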
The main problem is choosing an appropriate value for the bandwidth parameter σ_a. This choice is of crucial importance and is usually made via extensive experimentation and trial and error. The potentials P(i) represent density. So, one can subjectively try to choose a bandwidth σ_a by looking at the potentials produced by a wide range of bandwidths, starting with large values of σ_a and gradually decreasing them until a reasonable density is reached. However, such an approach is impractical and needs too many validations, since there is no way to define a priori a suitable density value; this is what we are looking for in the first place. A more important issue is that the potentials affect the number of exemplars and their locations. If the bandwidth is very small, the effect of neighbouring points is neglected and all points are selected as exemplars. If the bandwidth is small, many exemplars are selected. If the bandwidth is large, the density function accounts for all the points and few exemplars are selected. If the bandwidth is too large, even fewer exemplars are selected. These limits are easy to observe by trial and error. Furthermore, the bandwidth is dataset dependent, and the previous limits depend on the structure of a given dataset. An automatic or semi-automatic process is essential as part of a more global analysis in order to avoid many user-defined parameters. In our scheme the proposed leave-one-out gradient descent provides proper bandwidth values for Subtractive Clustering automatically.

Proposed gradient descent of leave-one-out Kernel averaged
We propose gradient descent learning of the kernel averaged (or weighted average) regression function to automatically estimate a bandwidth parameter. Given a training set $\{x_i, y_i\}_{i=1}^{N}$, where $x_i$ are the points and $y_i$ are the desired labels (which we will define later in eq. 4), the conventional kernel averaged regression function $f(x_i)$ is:

$$f(x_i) = \frac{\sum_{j=1}^{N} \varphi_j(x_i)\, y_j}{\sum_{j=1}^{N} \varphi_j(x_i)} \qquad (2)$$

where $\varphi_j(x_i)$ is a Gaussian kernel of bandwidth σ evaluated on $\lVert x_i - x_j \rVert^2$, the squared Euclidean distance. The kernel averaged $f(x_i)$ has a numerator $\sum_j \varphi_j(x_i) y_j$ and a denominator $\sum_j \varphi_j(x_i)$, defined as a sum of Gaussian kernels over all N data points. Since in Subtractive Clustering the potential is $P(i) = \sum_j \varphi_j(x_i)$, this potential is actually the normalization factor of $f(x_i)$.

Gradient of the leave-one-out kernel averaged
The proposed leave-one-out kernel averaged regression function $f_{loo}(x_i, \gamma)$ (eq. 3) is obtained by leaving out from the sum in eq. 2 a fraction γ of the self-contribution of $x_i$, where γ is a small leave-one-out parameter that takes values in the range [0, 1].
The proposed method uses desired labels $y_i$ for the points $x_i$. Each desired label $y_i$ (eq. 4) is considered as the variance of the corresponding $x_i$, that is, the variance that would be obtained if this $x_i$ were the center of the training set. The gradient ∂E(σ,x)/∂σ with respect to the bandwidth σ is computed from the squared error $E(\sigma, x) = (f_{loo}(x,\gamma) - y)^2$, which is a convex function of the prediction $f_{loo}(x,\gamma)$, the leave-one-out kernel averaged regression function.
Without the leave-one-out term such a gradient will not work. Taking the gradient of the plain kernel average with respect to the bandwidth does not yield a suitable solution, since the bandwidth eventually converges to tiny values at which every point predicts only itself correctly.
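The collapse toward tiny bandwidths can be checked numerically. The sketch below evaluates the plain kernel average (self-contribution included) on a toy 1-d dataset and sums the squared errors for three bandwidths. The kernel convention `a = (2 / sigma) ** 2`, the toy points and the arbitrary target values are assumptions for illustration only.

```python
import math


def kernel_average(points, labels, i, sigma):
    """Plain kernel averaged estimate f(x_i), self-contribution included."""
    a = (2.0 / sigma) ** 2  # assumed kernel width convention
    weights = [
        math.exp(-a * sum((u - v) ** 2 for u, v in zip(points[i], q)))
        for q in points
    ]
    return sum(w * y for w, y in zip(weights, labels)) / sum(weights)


points = [(0.0,), (0.3,), (0.7,), (1.0,)]
labels = [0.5, 0.1, 0.2, 0.6]  # arbitrary illustrative targets


def total_error(sigma):
    """Sum of squared errors of the plain kernel average over all points."""
    return sum(
        (kernel_average(points, labels, i, sigma) - labels[i]) ** 2
        for i in range(len(points))
    )


for sigma in (1.0, 0.3, 0.05):
    print(sigma, total_error(sigma))
```

As the bandwidth shrinks, each point's own kernel dominates the average, so the error vanishes without the bandwidth saying anything useful about the density. This is the degenerate minimum that the leave-one-out term is meant to remove.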
The classical squared error $E_i(\sigma, x_i)$ for each $x_i$ is:

$$E_i(\sigma, x_i) = \left(f_{loo}(x_i,\gamma) - y_i\right)^2 \qquad (5)$$

The gradient descent update for the σ parameter can be defined from the gradient of the squared error as:

$$\sigma^{(t+1)} = \sigma^{(t)} - \xi \, \frac{\partial E_i(\sigma^{(t)}, x_i)}{\partial \sigma} \qquad (6)$$

The chain rule gives:

$$\frac{\partial E_i(\sigma, x_i)}{\partial \sigma} = 2 \left(f_{loo}(x_i,\gamma) - y_i\right) \frac{\partial f_{loo}(x_i,\gamma)}{\partial \sigma} \qquad (7)$$

where the derivative $\partial f_{loo}(x_i,\gamma)/\partial\sigma$ (eq. 8) is expressed through the derivatives $\partial g_k(x_i)/\partial\sigma$, so we only need to find $\partial g_k(x_i)/\partial\sigma$. Replacing the expression for $g_k(x_i)$ from eq. 2b into eq. 10 gives eq. 11, which is the general derivative for any function $g_k(x_i)$.
The derivative $\partial g_i(x_i)/\partial\sigma$ (of the contribution of $x_i$ to itself) has a shorter expression, eq. 12, which is obtained from eq. 11 after simplifications (among them setting $\delta_i(x_i) = 0$). Finally, by substituting eq. 11 and eq. 12 into eq. 8 we can compute $\partial f_{loo}(x_i,\gamma)/\partial\sigma$. The small leave-one-out parameter γ ∈ [0, 1] prevents the gradient descent from converging to tiny values of the bandwidth σ. There is a trade-off between γ = 1, which gives large bandwidths, and γ = 0, which gives tiny bandwidths.
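One reading that is consistent with the prose of this section can be written out explicitly. It is an assumption, not a recovered formula: it takes $\delta_k(x_i) = \lVert x_i - x_k \rVert^2$, a Gaussian kernel with $a = (2/\sigma)^2$ (the same convention as the potentials of Subtractive Clustering), normalized kernels $g_k$, and a leave-one-out term that removes a fraction γ of the self-contribution from the sum only.

```latex
% Assumed definitions (a hypothetical reading of the lost equations):
%   \delta_k(x_i) = \lVert x_i - x_k \rVert^2, \qquad a = (2/\sigma)^2
\begin{align*}
\varphi_k(x_i) &= \exp\!\big(-a\,\delta_k(x_i)\big), \qquad
g_k(x_i) = \frac{\varphi_k(x_i)}{\sum_{j=1}^{N}\varphi_j(x_i)} \\
f_{loo}(x_i,\gamma) &= \sum_{k=1}^{N} g_k(x_i)\,y_k \;-\; \gamma\, g_i(x_i)\,y_i \\
\frac{\partial \varphi_k(x_i)}{\partial\sigma} &= \frac{8\,\delta_k(x_i)}{\sigma^{3}}\,\varphi_k(x_i) \\
\frac{\partial g_k(x_i)}{\partial\sigma} &= \frac{8}{\sigma^{3}}\, g_k(x_i)
  \Big(\delta_k(x_i) - \sum_{j=1}^{N} g_j(x_i)\,\delta_j(x_i)\Big) \\
\frac{\partial g_i(x_i)}{\partial\sigma} &= -\frac{8}{\sigma^{3}}\, g_i(x_i)
  \sum_{j=1}^{N} g_j(x_i)\,\delta_j(x_i) \qquad \text{since } \delta_i(x_i)=0 \\
\frac{\partial f_{loo}(x_i,\gamma)}{\partial\sigma} &= \sum_{k=1}^{N} y_k\,
  \frac{\partial g_k(x_i)}{\partial\sigma} \;-\; \gamma\, y_i\,
  \frac{\partial g_i(x_i)}{\partial\sigma} \\
\frac{\partial E_i(\sigma,x_i)}{\partial\sigma} &= 2\,\big(f_{loo}(x_i,\gamma)-y_i\big)\,
  \frac{\partial f_{loo}(x_i,\gamma)}{\partial\sigma}
\end{align*}
```

Under these assumptions the derivative of $g_k$ follows from the quotient rule together with $\partial a/\partial\sigma = -8/\sigma^{3}$, and the self-term simplifies because the distance of a point to itself is zero.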
The stochastic (or online) mode of gradient descent learning computes the gradient by using a single example at a time. The algorithm randomly selects an example $x_i$ and its label $y_i$ and updates the current parameter σ by using:

$$\sigma^{(t+1)} = \sigma^{(t)} - \xi \, \frac{\partial E(\sigma^{(t)}, x_i)}{\partial \sigma}$$

Hence, an epoch ends after all examples have been introduced in a random order. Then the updated values of σ are averaged over all N examples as $\sigma_{epoch} = \mathrm{avg}(\sigma^{(t)})$ with t = 1, …, N. The learning rate ξ can be constant or can vary at each epoch. For one epoch step the leave-one-out kernel averaged gradient descent is:

for t = 1 to N
    pick randomly a point x_t without replacement
    update the parameter σ by using σ^(t+1) = σ^(t) − ξ ∂E(σ^(t), x_t)/∂σ
end for
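A runnable sketch of one stochastic epoch follows. Several parts are assumptions rather than the paper's confirmed formulas: the kernel convention `a = (2 / sigma) ** 2`, the form `f_loo = sum_k g_k y_k - gamma * g_i * y_i`, the label definition in the hypothetical helper `variance_labels` (mean squared distance from a point to all points), and the lower bound on sigma, which is only a numerical safeguard.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(p, q))


def variance_labels(points):
    """Assumed label y_i: mean squared distance from x_i to every point."""
    n = len(points)
    return [sum(sq_dist(p, q) for q in points) / n for p in points]


def f_loo_and_grad(points, labels, i, sigma, gamma):
    """Assumed leave-one-out estimate at x_i and its derivative in sigma."""
    a = (2.0 / sigma) ** 2  # assumed kernel width convention
    deltas = [sq_dist(points[i], q) for q in points]
    phis = [math.exp(-a * d) for d in deltas]
    total = sum(phis)
    g = [phi / total for phi in phis]  # normalized kernels g_k(x_i)
    mean_delta = sum(gk * d for gk, d in zip(g, deltas))
    # d g_k / d sigma = (8 / sigma^3) * g_k * (delta_k - sum_j g_j * delta_j)
    dg = [8.0 / sigma ** 3 * gk * (d - mean_delta) for gk, d in zip(g, deltas)]
    f = sum(gk * y for gk, y in zip(g, labels)) - gamma * g[i] * labels[i]
    df = sum(d * y for d, y in zip(dg, labels)) - gamma * dg[i] * labels[i]
    return f, df


def sgd_epoch(points, labels, sigma, gamma=0.1, lr=0.2, rng=random):
    """Run one stochastic epoch and return the average of the sigma iterates."""
    order = list(range(len(points)))
    rng.shuffle(order)  # visit every point once, in random order
    iterates = []
    for t in order:
        f, df = f_loo_and_grad(points, labels, t, sigma, gamma)
        grad = 2.0 * (f - labels[t]) * df  # chain rule on the squared error
        sigma = max(1e-3, sigma - lr * grad)  # floor is a safeguard, not from the text
        iterates.append(sigma)
    return sum(iterates) / len(iterates)


pts = [(0.0, 0.0), (0.1, 0.2), (0.9, 0.8), (1.0, 1.0), (0.5, 0.4)]
ys = variance_labels(pts)
print(sgd_epoch(pts, ys, sigma=1.0, rng=random.Random(0)))
```

The analytic derivative can be compared against a central finite difference of `f_loo_and_grad` to confirm that the two agree under the stated assumptions.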

Initializations and parameter settings
As usual, the first thing to do is to scale the data features into the range [0, 1]. Without scaling the gradient descent might not converge, since the learning rate ξ depends on the scale of the feature space. By scaling the data features first, we can use a fixed value of ξ for all datasets and hence avoid searching for a suitable learning rate each time we use a different dataset. Such scaling also prevents features with large numeric ranges from dominating the distance computations.
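Scaling each feature into [0, 1] amounts to a per-column min-max transformation. A minimal sketch is shown below; the function name and the choice of mapping a constant feature to 0 are assumptions of this example.

```python
def min_max_scale(rows):
    """Scale each feature (column) of a list of rows into the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant feature: map to 0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled


print(min_max_scale([[1, 10], [3, 10], [2, 30]]))
```

After this step every squared distance between two points is at most d, the number of features, which is what allows one fixed learning rate to be reused across datasets.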
In Subtractive Clustering (SC) the potential updating cycle terminates when the current max potential $P^{*}$ becomes less than a threshold ($P^{*} < e \, P_1^{*}$). If e is selected to be very small, a large number of exemplars will be selected. On the contrary, a large value of e will lead to a small exemplar set. In order to avoid any other user-defined parameter we set $e = 1/P_1^{*}$. That is, Subtractive Clustering terminates at the j-th iteration when $P_j^{*} < 1$. Thus, every point starts with potential P(i) ≥ 1 and finally ends up with potential P(i) < 1. There is a theoretical justification for this limit, since P(i) = 1 is the self-contribution of every i-th point to itself.
For 2-dimensional datasets in Subtractive Clustering we set $\sigma_b = 1.5\sigma_a$ as recommended. High-dimensional density estimates may suffer from the curse of dimensionality. For higher dimensions there is a problem, since the factor 1.5 influences the nearby points more strongly, and we use a variable $\sigma_b = \sigma_a + 0.5\,(1.0 - k_{sofar}/N)\,\sigma_a$, which starts from about $\sigma_b = 1.5\sigma_a$ and decays. As $k_{sofar}$ (the number of exemplars selected so far) increases from 1 to k during the P(i) updating cycle of SC, the parameter $\sigma_b$ gradually decreases, and in the theoretical limit k = N the value of $\sigma_b$ becomes equal to $\sigma_a$.
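The two modifications described above (stopping once the max potential falls below 1, and a decaying σ_b for higher dimensions) can be sketched as follows. The function names and the `variable_sigma_b` flag are hypothetical, and counting `k_sofar` right after an exemplar is appended is an assumption about when the schedule is evaluated.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(p, q))


def sigma_b_schedule(sigma_a, k_sofar, n_points):
    """Decaying sigma_b: close to 1.5 * sigma_a at first, sigma_a when k_sofar = N."""
    return sigma_a + 0.5 * (1.0 - k_sofar / n_points) * sigma_a


def modified_sc(points, sigma_a, variable_sigma_b=False):
    """Subtractive clustering that stops once the max potential drops below 1."""
    n = len(points)
    a = (2.0 / sigma_a) ** 2
    potentials = [sum(math.exp(-a * sq_dist(p, q)) for q in points) for p in points]
    exemplars = []
    while len(exemplars) < n:
        best = max(range(n), key=potentials.__getitem__)
        p_star = potentials[best]
        if p_star < 1.0:  # e = 1 / P_1* is equivalent to the test P* < 1
            break
        exemplars.append(best)
        if variable_sigma_b:
            sb = sigma_b_schedule(sigma_a, len(exemplars), n)
        else:
            sb = 1.5 * sigma_a
        b = (2.0 / sb) ** 2
        potentials = [
            p - p_star * math.exp(-b * sq_dist(x, points[best]))
            for x, p in zip(points, potentials)
        ]
    return exemplars


pts = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05),
       (1.0, 1.0), (1.05, 1.0), (1.0, 1.05), (0.95, 1.0)]
print(modified_sc(pts, 0.5), modified_sc(pts, 0.5, variable_sigma_b=True))
```

Because the exemplars are appended in the order in which they are found, the returned list is already sorted from the highest to the lowest potential.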
For the online gradient descent we set a fixed learning rate ξ = 0.2 and maximum epochs = 10. It usually converges after the first epoch if the dataset has more than 10000 examples, so for larger datasets we can set maximum epochs = 2.
For the leave-one-out kernel averaged regression function we set the leave-one-out parameter γ = 0.1. The value γ = 1 removes the self-contribution completely and gives a large bandwidth and very few exemplars, while γ = 0 gives a tiny bandwidth and almost all points as exemplars. Since the goal is just to avoid the latter, we found after some experimentation that a value γ = 0.1 is generally sufficient to prevent the bandwidth from converging to tiny values, so as to provide a stable solution without producing large bandwidths.
Initializing the bandwidth σ at the beginning of gradient descent (epoch = 0) is an issue, since for different datasets we may need to search for a different initial value of σ each time. However, there is a simple automatic way around this. We set the initial bandwidth equal to the trace of the covariance matrix R. Hence, given N points in d dimensions, $\sigma^{(0)} = \mathrm{trace}(R) = \sum_{i=1}^{d} r_{ii}$, where $r_{ii}$ are the diagonal elements of R. Thus, the gradient descent starts with a relatively large bandwidth σ, which decreases immediately after the first epoch and continues to decrease until it converges.
It is important to note that we use the same settings for all the datasets and no user-defined parameter is needed.
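Since the trace of a covariance matrix is the sum of the per-feature variances, the initial bandwidth can be computed without forming the full matrix. The sketch below assumes the sample (N − 1) normalization used by `statistics.variance`; the text does not say which normalization is intended.

```python
import statistics


def initial_bandwidth(points):
    """Trace of the covariance matrix, i.e. the sum of per-feature variances."""
    columns = zip(*points)
    # statistics.variance uses the N - 1 normalization (an assumption here).
    return sum(statistics.variance(col) for col in columns)


square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(initial_bandwidth(square))
```

For features scaled into [0, 1] each variance is small, so the starting value grows roughly with the number of features d.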

Experimental simulations
The first set of experimental simulations presents results for visual comparisons of AP with the proposed algorithm using four 2-d datasets. The second set presents performance comparisons and quality analysis on several real-world benchmark datasets.
The code for Affinity Propagation (AP) was downloaded from the official site (http://www.psi.toronto.edu/affinitypropagation). AP uses as input a similarity matrix S in which the pair-wise similarities between data points are defined from their distances as $s(i,k) = -\lVert x_i - x_k \rVert^2$ for every i ≠ k, as suggested in [8]. There are two more parameters: the damping factor λ and the prior preferences s(k,k), which are the diagonal values of the similarity matrix. The damping factor is usually λ = 0.5, as suggested. For the preferences, a good choice [8] is to set all the diagonal elements s(k,k) equal to the median value of all the similarities between data points. We use as preference one half of the mean value of all similarities, $\frac{1}{2N^2}\sum_{i}\sum_{k} s(i,k)$, which results in a moderate number of exemplars that emerge automatically. This choice selects many more exemplars than the median choice, while it still avoids selecting outliers.
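The two preference choices discussed above can be computed directly from the data. In this sketch the function names are hypothetical; the half-mean sums over all N² ordered pairs (the i = k terms contribute zero distance), while the median is taken over the off-diagonal similarities only, which is an assumption about what "all the similarities" covers.

```python
import statistics


def similarity(p, q):
    """Negative squared Euclidean distance, s(i,k) = -||x_i - x_k||^2."""
    return -sum((u - v) ** 2 for u, v in zip(p, q))


def half_mean_preference(points):
    """One half of the mean similarity over all N^2 ordered pairs."""
    n = len(points)
    total = sum(similarity(p, q) for p in points for q in points)
    return total / (2.0 * n * n)


def median_preference(points):
    """Median of the off-diagonal similarities."""
    n = len(points)
    values = [
        similarity(points[i], points[k])
        for i in range(n) for k in range(n) if i != k
    ]
    return statistics.median(values)


toy = [(0.0,), (1.0,), (3.0,)]
print(half_mean_preference(toy), median_preference(toy))
```

On this toy example the half-mean preference is less negative than the median. A less negative preference makes it cheaper for a point to become an exemplar, which is in line with the remark that this choice yields more exemplars.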

Evaluation Criteria and Quality indexes
The sum of squared errors (SSE), which quantifies the clustering error, is the most widely used quality criterion [8] and is given by the sum of the squared distances between each point $x_i$ and its corresponding exemplar $c(x_i)$ as:

$$SSE = \sum_{i=1}^{N} \lVert x_i - c(x_i) \rVert^2$$

The maximum distance (maxD) between any point $x_i$ and its exemplar $c(x_i)$, which quantifies whether all points are compactly represented (no cluster is larger than maxD), is:

$$maxD = \max_{i} \lVert x_i - c(x_i) \rVert$$

The normalized Hubert gamma statistic [13] is a well-known cluster evaluation criterion which is invariant to the number of clusters, given by:

$$\hat{\Gamma} = \frac{1}{M} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{\left(P(i,j) - \mu_P\right)\left(Q(i,j) - \mu_Q\right)}{\sigma_P \, \sigma_Q}$$

where M = N(N−1)/2, and it uses two proximity matrices P and Q, both of size N×N. An element P(i, j) is the distance between points $x_i$ and $x_j$. An element Q(i, j) is the distance between the cluster representatives (centroids) to which $x_i$ and $x_j$ belong. $\mu_P$ is the mean of all elements of matrix P, $\mu_Q$ is the mean of all elements of matrix Q, while $\sigma_P$ and $\sigma_Q$ are their standard deviations from their means. A high value of this statistic (close to 1) indicates the existence of well-separated compact clusters.
The net similarity cost is defined as a cost function specifically for AP [8] [14], and it is the sum of the similarities s(i,k) between data points and their exemplars, minus the exemplar costs s(k,k) (the preferences of the exemplars). AP identifies a set of exemplars K so as to maximize this cost [14].
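The three quality indexes can be computed from the points and an assignment of each point to its exemplar. The sketch below is illustrative: the function name and the `assignment` representation are hypothetical, and the means and standard deviations are taken over the M distinct pairs i < j, which is an assumption that makes the statistic a correlation coefficient.

```python
import math
from itertools import combinations


def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(p, q)))


def quality_indexes(points, assignment):
    """Return (SSE, maxD, normalized Hubert gamma).

    assignment[i] is the index (into points) of the exemplar of point i.
    """
    to_exemplar = [dist(p, points[assignment[i]]) for i, p in enumerate(points)]
    sse = sum(d * d for d in to_exemplar)
    max_d = max(to_exemplar)
    pairs = list(combinations(range(len(points)), 2))  # the M pairs with i < j
    P = [dist(points[i], points[j]) for i, j in pairs]
    Q = [dist(points[assignment[i]], points[assignment[j]]) for i, j in pairs]
    m = len(pairs)
    mu_p, mu_q = sum(P) / m, sum(Q) / m
    sd_p = math.sqrt(sum((v - mu_p) ** 2 for v in P) / m)
    sd_q = math.sqrt(sum((v - mu_q) ** 2 for v in Q) / m)
    gamma = sum((p - mu_p) * (q - mu_q) for p, q in zip(P, Q)) / (m * sd_p * sd_q)
    return sse, max_d, gamma


pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(quality_indexes(pts, [0, 0, 2, 2]))
```

The statistic is undefined when only one exemplar is selected, because all entries of Q are then zero and its standard deviation vanishes.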

Visual comparisons of AP with the proposed KG-SC
For the visual comparisons we use four datasets with 2 dimensions each. We compare the results of the Affinity Propagation (AP) algorithm with the proposed leave-one-out kernel gradient subtractive clustering (KG-SC for short). Table 1 lists the quality indexes that correspond to the exemplar selections and clustering solutions of the datasets in Figs. 1-4 for AP and KG-SC. From Table 1 it seems that both algorithms can provide high quality results for the 2-dimensional datasets, while a slight edge could be given to KG-SC.

[Figure panel titles: AP found 20 exemplars automatically; KG-SC found 39 exemplars automatically.]
In addition, it is apparent in table 1 that AP delivers exactly what it promises, that is to identify a set of exemplars K so as to maximize the net similarity cost [14].The net similarity cost is better for AP than KG-SC.So AP remains the best algorithm for the k-centers problem.
However, the clustering error, which quantifies the distortion, and the normalized Hubert index, which quantifies the cluster compactness, are better for KG-SC. Note that ideal clustering solutions usually have a normalized Hubert index close to 1, as in the last column of Table 1. So KG-SC delivers better-defined exemplars.
It is the k-centers problem itself which might not be able to guarantee the best exemplar selection. That is why KG-SC takes a different, density-based path and tries to find the most important representatives among the densest points. The better quality of the KG-SC solutions is evident from the maximum distance, clustering error and normalized Hubert statistic in Table 1.
The computational cost of the proposed KG-SC is quadratic, O(N^2), of the same order as the cost of SC. Actually, for large datasets and max epochs = 2 it is two times that of SC. This cost is much lower than the AP cost. The memory requirement of KG-SC is O(Nd), since only the dataset is needed in main memory. On the other hand, AP requires three matrices (similarities, availabilities, responsibilities) of size N×N in main memory, and this could limit the algorithm. However, one can argue that for the special case of ultra-high-dimensional datasets, where the data dimension is of the same order as the number of examples (d ≈ N), the memory requirements become the same.
A remaining question is what to do when a fixed number of exemplars, smaller than the number that KG-SC finally selects, is needed. Note that an advantage of Subtractive Clustering is that it returns the exemplars in decreasing order, from the most important to the least important. So, picking the first K exemplars in this list is one simple solution.

Quality Comparisons on real world benchmark datasets
Quality comparisons are also performed on a number of publicly available real-world benchmark problems, which were downloaded from the UCI machine learning data repository (http://archive.ics.uci.edu/ml). The specific details of these datasets (dataset name, N examples before duplicate removal, d dimensions) are listed in Table 2 together with the results.
We found that while KG-SC, as a density-based algorithm, does not suffer from the existence of duplicates, AP does. Thus, for a fair comparison we first remove all duplicates from the benchmark datasets. Also, since the net similarity is not a quality index but a specific cost suitable only for AP (it was always better for AP), we do not report it in Table 2. Note, for future reference, that we detected several duplicates in the datasets Banknote Authentication, Blood Transfusion, Phoneme, Wisconsin Breast Cancer, Haberman, Yacht Hydrodynamics, Red Wine Quality, White Wine Quality and Concrete Compressive Strength. Both algorithms find well-defined representative exemplars and deliver high quality solutions, since the normalized Hubert gamma index is very high for both of them on all benchmark datasets. On low-dimensional datasets KG-SC seems better, while on high-dimensional datasets AP seems better.
There are some limitations. AP is limited by the number of examples, while the proposed KG-SC is density-based and might be limited by the number of dimensions (features). The Dermatology dataset has many dimensions (d = 34) and the leave-one-out gradient descent could not converge for γ = 0.1, so we use a minimum value γ = 0.01. The Shuttle dataset has a large number of examples (N = 58000) and Affinity Propagation runs out of memory (it needs 39 GB). For the Shuttle dataset the proposed leave-one-out Kernel Gradient Subtractive Clustering produces 956 exemplars and a normalized Hubert gamma statistic of 0.999, which indicates very well-formed compact clusters.

Conclusions
We present a scheme that can potentially permit the automatic selection of representative exemplar points from the data without the need for any user-defined parameter. By computing a gradient descent for a simple leave-one-out kernel averaged regression function, which automatically estimates a suitable bandwidth parameter for the density-based Subtractive Clustering algorithm, we can extract the most representative exemplars without initial knowledge of their number. Evaluating the data clustering solutions around these exemplars with classical quality indexes reveals that the proposed KG-SC algorithm produces well-separated, compact and dense clusters. Experimental comparisons with the state-of-the-art Affinity Propagation exemplar selection algorithm show that both algorithms select well-defined representatives and can deliver high quality solutions. KG-SC is simple to parallelize, a point worth studying in the future. We also plan to explore the possibility of using either mini-batch gradients or a dual tree for speeding up KG-SC. Interesting future work could extend KG-SC in order to explore a possible automation of other density-based algorithms. We are currently studying the proposed KG-SC for training neural networks.

Table 1. Quality indexes for the exemplar selection and clustering solutions of the datasets in Figs. 1-4. For each algorithm (AP and KG-SC) we report the net similarity cost, maximum distance, clustering error and normalized Hubert statistic. Best indexes are marked in bold.

Table 2. Quality indexes for the exemplar selection of various benchmark datasets with N examples and d dimensions. For each algorithm (AP and KG-SC) we report the number k of selected exemplars, which emerges automatically, the maximum distance, the clustering error and the normalized Hubert gamma statistic. Best quality indexes are marked in bold.