Efficient Support Vector Machine Classification Using Prototype Selection and Generation

Although Support Vector Machines (SVMs) are considered effective supervised learning methods, their training procedure is time-consuming and has high memory requirements. Therefore, SVMs are inappropriate for large datasets. Many Data Reduction Techniques have been proposed in the context of dealing with the drawbacks of k-Nearest Neighbor classification. This paper adopts the concept of data reduction in order to cope with the high computational cost and memory requirements in the training process of SVMs. Experimental results illustrate that Data Reduction Techniques can effectively improve the performance of SVMs when applied as a preprocessing step on the training data.


Introduction
The effectiveness, efficiency and scalability of machine learning and data mining algorithms are crucial research issues that have attracted the attention of both industry and academia. Many proposed algorithms cannot handle the high volumes of data that are nowadays easily available from several data sources. For those algorithms, data reduction is an important preprocessing step.
In classification tasks data reduction processes are guided by the class labels. Many Data Reduction Techniques (DRTs) have been proposed in the context of dealing with the drawbacks of k-Nearest Neighbors (k-NN) classifier [9]. These drawbacks are: (i) the high computational cost during classification, (ii) the high memory requirements and (iii) the noise sensitivity of the classifier. However, that kind of data reduction has not been adopted by other classification methods which cannot manage large datasets.
A DRT can be either a Prototype Selection algorithm (PS) [10] or a Prototype Generation algorithm (PG) [20]. PS algorithms select representative instances from the training set. PG algorithms generate representatives by summarizing similar training instances. These representatives are called prototypes. PS algorithms can be either editing or condensing. Editing aims at improving accuracy by removing noise, outliers and mislabeled instances and by smoothing the decision boundaries between classes. PG and PS-condensing algorithms try to build a small condensing set that represents the initial training data. Using a condensing set instead of the original dataset has the benefit of low computational cost, while accuracy remains almost as high as that achieved by using the original data. Please note that some PG and PS-condensing algorithms are called hybrid because they integrate the concept of editing.
To the best of our knowledge, DRTs have been used only in the context of k-NN classification. There is no work that explores the application of data reduction on large datasets in order to make SVMs applicable to them. This observation is the key motivation of the present work. Another motive is to check whether the PG algorithms we proposed in the past can aid the development of fast and accurate SVM-based classifiers.
This paper also contributes an experimental study on several datasets where SVM-based classifiers, trained on the original training data and on the corresponding condensing sets built by state-of-the-art DRTs, are compared to each other and against the corresponding k-NN classifiers. The paper reviews in detail the algorithms that are used in the experimental study. Our study reveals that the usage of DRTs leads to fast and accurate SVM-based classifiers.
The rest of this paper is organized as follows. Section 2 briefly reviews SVMs and the k-NN classifier and Section 3 presents in detail the PG and PS-condensing algorithms that we use in our experimental setup. Section 4 presents the experimental study and results. Finally, Section 5 concludes the paper.
2 Support Vector Machines and the k-NN classifier

Support Vector Machines
SVMs are supervised learning models introduced in 1995 by Cortes and Vapnik [7] although the roots of the idea lie in the theory of statistical learning introduced by Vapnik almost two decades earlier [21]. They are suitable for pattern classification but can be easily extended to handle nonlinear regression problems in which case they are known as Support Vector Regressors (SVRs). The separating surface offered by an SVM classifier maximizes the margin, i.e., the distance of the closest patterns to it. This helps the generalization performance of the model and in fact it is related to the idea of Structural Risk Minimization [23,22] which avoids over-fitting. With the use of nonlinear kernel functions such as Gaussian (RBF) or n-th order polynomials, SVM models can produce nonlinear separating surfaces achieving very good performance in complex problems.
Due to their good generalization performance these models have become very popular with a wide range of applications, including document classification, image classification, bio-informatics, handwritten character recognition, etc. One of the major drawbacks of these models is the memory and the computational complexity requirements for large datasets. The reason is that the separating surface is obtained by solving a quadratic programming problem involving an N × N matrix, where N is the number of items in the dataset. Although there are techniques that can reduce the complexity to O(N^2) [5], the problem remains hard and the size of the problem can easily become prohibitively large, calling for methods for data reduction such as the ones discussed in the following sections.
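To make the origin of the N × N matrix concrete, the standard soft-margin SVM dual problem can be written as below. This is the usual textbook formulation rather than a formula taken from this paper; the symbols are assumptions introduced here: α_i are the Lagrange multipliers, y_i ∈ {−1, +1} the class labels, K the kernel function and C the regularization parameter.

```latex
\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0
```

The matrix with entries y_i y_j K(x_i, x_j) has N^2 elements, so both memory and computation grow at least quadratically with the number of training instances. Reducing N before training therefore attacks the cost directly.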

k-Nearest Neighbor classifier
The k-NN classifier [9] is an extensively used lazy (or instance-based) classification algorithm. Contrary to eager classifiers, it does not build any classification model. Some of its major properties are: (i) it is a quite simple and easy to implement algorithm, (ii) contrary to many other classifiers, it is easy to understand how a prediction has been made, (iii) it is analytically tractable and (iv) for k = 1 and unlimited instances the error rate is asymptotically never worse than twice the minimum possible, which is the Bayes rate [8].
The algorithm classifies a new instance by retrieving from the training set the k nearest instances to it. These instances are called neighbors. Subsequently, the algorithm assigns the new instance to the most common class among the classes of the k nearest neighbors. This class is called the major class. The process that indicates the major class is usually called nearest neighbors voting. Although any distance metric can be used, the Euclidean distance is the most commonly used one. The k-NN classifier does not spend time in training any model. However, the classification step is time-consuming because in the worst case the algorithm must compute the distances between the new instance and all the training instances.
The selection of the value of k affects the accuracy of the classifier. The value of k that achieves the highest accuracy depends on the data. Its determination implies tuning via trial-and-error. Usually, large k values are appropriate for datasets with noise since they examine larger neighborhoods, whereas small k values render the classifier noise-sensitive. In binary problems, an odd value for k should be used, so that ties in the nearest neighbors voting are avoided. In problems with more than two classes, ties are resolved by choosing a random "most common" class or the class voted by the nearest neighbor. The latter is adopted in the experimental study of this paper.
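The voting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments. The function name `knn_classify` and the `(vector, label)` data layout are assumptions made here, and the tie-break is interpreted as "the class of the nearest neighbor among the tied classes".

```python
import math
from collections import Counter

def knn_classify(train, query, k=1):
    """Classify `query` by majority vote among its k nearest neighbors.

    `train` is a list of (vector, label) pairs (hypothetical layout).
    """
    # Brute force: compute the distance to every training instance.
    neighbors = sorted(train, key=lambda inst: math.dist(inst[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    top = max(votes.values())
    tied = {label for label, count in votes.items() if count == top}
    # Assumed tie-break: the closest neighbor that belongs to a tied class wins.
    for _, label in neighbors:
        if label in tied:
            return label

train = [((0.0,), "a"), ((3.0,), "b"), ((4.0,), "b")]
print(knn_classify(train, (1.0,), k=1))
print(knn_classify(train, (1.0,), k=2))  # one vote each, nearest neighbor decides
```

The full sort makes the worst-case cost of classifying a single instance visible: every training instance is touched once, which is why smaller training sets translate directly into faster k-NN classification.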

Prototype Generation and Condensing algorithms
Several PG and PS-condensing algorithms are available in the literature. Here we review only the ones used in our experimental study. For the interested reader, abstraction and selection algorithms are reviewed, categorized and compared to each other in [20] and [10]. Other interesting reviews are presented in [19,24,4,13].

Condensing Nearest Neighbor rule
The Condensing Nearest Neighbor (CNN) rule [11] is the earliest condensing algorithm. Its condensing set is built on the following simple idea: instances that lie in the "internal" data area of a class, i.e., far from decision boundaries, can be removed without loss of accuracy. Thus, CNN-rule tries to keep only the instances that lie in the close-border areas. The close-border instances are selected as follows. Initially, an instance of the training set (TS) is moved to the condensing set (CS). CNN-rule then uses the 1-NN rule to classify the instances of TS by examining the instances of CS. If an instance is wrongly classified, it is probably close to a decision boundary. Therefore, it is moved from TS to CS. This procedure is repeated until a complete pass over TS produces no moves from TS to CS, at which point the algorithm terminates.
CNN-rule is misled by noise. It wrongly selects "noisy" instances along with their neighborhood. Consequently, noise affects the reduction rates. CNN-rule determines the number of prototypes automatically, without user-defined parameters. Another property is that the multiple passes over the data guarantee that the removed training instances are correctly classified by the 1-NN classifier using the condensing set. A disadvantage is that CNN-rule builds a different condensing set when it examines the same training instances in a different order.
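The multi-pass procedure can be sketched as follows. This is a minimal illustration under assumptions made here: instances are `(vector, label)` pairs, the first instance of TS seeds CS, and the helper names `nearest_label` and `cnn_rule` are hypothetical.

```python
import math

def nearest_label(cs, x):
    """Label of the prototype in CS that is nearest to x (1-NN rule)."""
    return min(cs, key=lambda p: math.dist(p[0], x))[1]

def cnn_rule(ts):
    """Build a condensing set from a non-empty list of (vector, label) pairs."""
    ts = list(ts)
    cs = [ts.pop(0)]      # seed CS with one training instance
    moved = True
    while moved:          # stop after a full pass with no moves
        moved = False
        remaining = []
        for x, label in ts:
            if nearest_label(cs, x) != label:
                cs.append((x, label))   # misclassified: probably near a border
                moved = True
            else:
                remaining.append((x, label))
        ts = remaining
    return cs

ts = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "a"), ((10.0,), "b"), ((11.0,), "b")]
cs = cnn_rule(ts)
print(len(cs), "prototypes kept out of", len(ts))
```

Because the loop only ends after a pass without moves, every instance left out of CS is correctly classified by 1-NN on CS, which is the guarantee mentioned above. Shuffling `ts` before the call can change the result, which illustrates the order dependence.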

The IB2 algorithm
IB2 is a one-pass version of CNN-rule. IB2 is one of the Instance-Based Learning (IBL) algorithms presented in [2,1]. Each training instance x ∈ TS is classified by the 1-NN rule on the current CS. If x is classified correctly, x is discarded. Otherwise, x is moved to CS.
IB2 determines the size of the condensing set automatically. However, the condensing set highly depends on the order of the training instances. Since it is a one-pass algorithm, it is very fast. On the other hand, IB2 does not guarantee that the removed instances can be correctly classified by the condensing set. In addition, it builds its condensing set in an incremental manner. This means that new training instances can update an existing condensing set without considering the "old" instances that had been used for the creation of the condensing set. Hence, IB2 can be applied in streaming environments where new instances gradually become available.
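A minimal sketch of the single pass, including the incremental use case, might look like this. The function name `ib2`, the optional `cs` argument and the `(vector, label)` layout are assumptions made for illustration.

```python
import math

def ib2(ts, cs=None):
    """One pass over TS; an existing condensing set may be passed in `cs`."""
    cs = list(cs) if cs else []
    for x, label in ts:
        if not cs:
            cs.append((x, label))       # the first instance seeds CS
            continue
        nearest = min(cs, key=lambda p: math.dist(p[0], x))
        if nearest[1] != label:
            cs.append((x, label))       # misclassified: keep it as a prototype
        # correctly classified instances are simply discarded
    return cs

batch1 = [((0.0,), "a"), ((1.0,), "a"), ((10.0,), "b"), ((11.0,), "b")]
cs = ib2(batch1)
# Later, new data arrives; the old instances are not needed again.
cs = ib2([((4.0,), "b")], cs)
print(cs)
```

The second call shows why the algorithm suits streaming settings: only the current condensing set and the newly arrived instances are required.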

The AIB2 algorithm
In [15] we presented a PG variation of IB2. It is called Abstraction IB2 (AIB2) and inherits all the properties of IB2. AIB2 considers that the prototypes should be close to the center of the data area they represent. Contrary to IB2, AIB2 does not ignore the instances that were correctly classified. These instances contribute to the condensing set by repositioning the nearest prototype. To achieve this, each prototype has a weight value that denotes the number of instances it represents.
In an early step, a random training instance is placed in the condensing set and its weight becomes one. For each training instance x, AIB2 fetches its nearest prototype P from the current condensing set. If x has a class label different from that of P, it is moved to the condensing set and plays the role of a prototype. Its weight becomes one. If x has the class label of P, the attributes of P are updated by taking into account the attributes of x and the weight of P. More formally, each attribute attr(i) of P becomes $P_{attr(i)} \leftarrow \frac{P_{attr(i)} \times P_{weight} + x_{attr(i)}}{P_{weight} + 1}$. Thus, P moves towards x. Of course, the weight of P is increased by one and x is discarded.
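The repositioning step can be sketched as a running weighted mean, which is consistent with the description above (the prototype moves towards x and its weight grows by one). This is an illustrative sketch only: the function name `aib2` and the `[attributes, label, weight]` prototype layout are assumptions, and the first instance of an already shuffled training set stands in for the random seed instance.

```python
import math

def aib2(ts):
    """Return prototypes as [attributes, label, weight] lists."""
    ts = list(ts)
    x0, y0 = ts[0]                      # assumes ts was shuffled beforehand
    protos = [[list(x0), y0, 1]]
    for x, label in ts[1:]:
        p = min(protos, key=lambda q: math.dist(q[0], x))
        if p[1] != label:
            protos.append([list(x), label, 1])    # new prototype, weight one
        else:
            w = p[2]
            # Weighted running mean: P moves towards x.
            p[0] = [(a * w + b) / (w + 1) for a, b in zip(p[0], x)]
            p[2] = w + 1
    return protos

ts = [((0.0,), "a"), ((2.0,), "a"), ((4.0,), "a"), ((10.0,), "b")]
for attrs, label, weight in aib2(ts):
    print(attrs, label, weight)
```

With this update, a prototype of weight w is exactly the mean of the w instances it has absorbed, so it drifts towards the center of the area it represents.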

The Reduction by Space Partitioning algorithms
Chen's algorithm. The ancestor of the Reduction by Space Partitioning (RSP) algorithms is the PG algorithm proposed by Chen and Jozwik (Chen's algorithm) [6]. Chen's algorithm retrieves the instances that define the diameter of the training data, in other words the two most distant instances, a and b. Then, the algorithm splits the training data into two subsets. All the instances that are closer to a are moved to C_a. All other instances are placed in C_b. Subsequently, Chen's algorithm selects to split the non-homogeneous subset with the largest diameter. Subsets that contain instances of more than one class are called non-homogeneous. If there are no non-homogeneous subsets, the algorithm proceeds by splitting the homogeneous subsets. The procedure ends when the number of subsets is equal to a value specified by the user. The final step is the generation of prototypes. Each subset C is replaced by its mean instance. The class label of the mean instance is the major class in C. The mean instances constitute the condensing set.
The idea of splitting the homogeneous subset with the largest diameter is based on the assumption that this subset probably has more instances and thus, if it is split first, higher reduction will be achieved. Chen's algorithm generates the same condensing set regardless of the ordering of the instances. A drawback is that the user has to specify the number of subsets. Chen and Jozwik claim that this allows the user to define the trade-off between reduction rate and accuracy. However, the determination of this parameter implies costly trial-and-error procedures. Another weak point is that the instances that do not belong to the major class of a subset are not represented in the condensing set (they are ignored).
The RSP1 algorithm. RSP1 [18] is similar to Chen's algorithm, but it does not ignore instances. It computes as many means as the number of distinct classes in the non-homogeneous subsets. RSP1 builds larger condensing sets than Chen's algorithm. However, it tries to improve the quality of the condensing set by taking into account all training instances.
The RSP2 algorithm. RSP2 selects the subset that will be split first by examining the overlapping degree. The overlapping degree of a subset is the ratio of the average distance between instances belonging to different classes to the average distance between instances that belong to the same class. This splitting criterion assumes that instances that belong to a class are as close to each other as possible, whereas instances that belong to different classes lie as far apart as possible. As stated in [18], it is better to split the subset with the highest overlapping degree than that with the largest diameter.
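The overlapping degree, as defined in the text, can be computed with a short helper. This is a sketch under assumptions made here: the function name `overlapping_degree` is hypothetical, instances are `(vector, label)` pairs, and `None` is returned when the ratio is undefined (for example, for a homogeneous subset).

```python
import math
from itertools import combinations

def overlapping_degree(subset):
    """Average between-class distance divided by average within-class distance."""
    between, within = [], []
    for (x, lx), (y, ly) in combinations(subset, 2):
        (within if lx == ly else between).append(math.dist(x, y))
    if not between or not within:
        return None   # assumption: undefined when either group of pairs is empty
    return (sum(between) / len(between)) / (sum(within) / len(within))

subset = [((0.0,), "a"), ((1.0,), "a"), ((3.0,), "b"), ((4.0,), "b")]
print(overlapping_degree(subset))
```

Like the diameter, this criterion requires all pairwise distances inside the subset, so evaluating it is quadratic in the subset size.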
The RSP3 algorithm. RSP3 [18] is the only RSP algorithm (Chen's algorithm included) that builds its condensing set without any user-specified parameter. RSP3 eliminates both weaknesses of Chen's algorithm. It splits all the non-homogeneous subsets. In other words, it terminates when all subsets become homogeneous. RSP3 can use either the diameter or the overlapping degree as splitting criterion. In effect, the selection of the splitting criterion is an issue of secondary importance because all non-homogeneous subsets are eventually split. Certainly, the order of the training instances is irrelevant.
RSP3 generates many prototypes for close-border areas and few prototypes for "internal" areas. The size of the condensing set depends on the level of noise in the data. The higher the level of noise, the smaller the subsets constructed and the lower the reduction achieved. Please note that the discovery of the most distant instances is a time-consuming procedure, since all distances between the instances of the subset must be computed. Thus, the usage of RSP3 may be prohibitive in the case of large datasets. Since we wanted to consider only non-parametric algorithms in our experimental study, we used only RSP3.
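The recursive partitioning can be sketched as follows. This is a minimal illustration rather than the reference implementation: the function name `rsp3` and the `(vector, label)` layout are assumptions, and the handling of identical points with conflicting labels (one prototype per class) is a choice made here so that the loop always terminates.

```python
import math
from itertools import combinations

def rsp3(ts):
    """Split subsets along their diameter until every subset is homogeneous."""
    pending, prototypes = [list(ts)], []
    while pending:
        subset = pending.pop()
        labels = list(dict.fromkeys(label for _, label in subset))
        if len(labels) == 1:
            # Homogeneous subset: replace it with its mean instance.
            mean = tuple(sum(c) / len(subset) for c in zip(*(x for x, _ in subset)))
            prototypes.append((mean, labels[0]))
            continue
        # Most distant pair (the diameter); needs all pairwise distances.
        a, b = max(combinations(subset, 2),
                   key=lambda pair: math.dist(pair[0][0], pair[1][0]))
        if math.dist(a[0], b[0]) == 0:
            # Assumption: identical points with different labels cannot be split.
            prototypes.extend((a[0], lab) for lab in labels)
            continue
        c_a = [i for i in subset if math.dist(i[0], a[0]) < math.dist(i[0], b[0])]
        c_b = [i for i in subset if math.dist(i[0], a[0]) >= math.dist(i[0], b[0])]
        pending += [c_a, c_b]
    return prototypes

ts = [((0.0,), "a"), ((1.0,), "a"), ((9.0,), "b"), ((10.0,), "b")]
print(sorted(rsp3(ts)))
```

The `max` over all pairs is the step that makes the preprocessing cost high on large subsets, as noted above.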

Reduction through Homogeneous Clusters
The RHC algorithm. We have recently proposed the Reduction through Homogeneous Clusters (RHC) algorithm [16,14]. It is a PG algorithm. Like RSP3, RHC is based on the concept of homogeneity, but it employs k-means clustering [12,25]. Initially, the whole training data is considered as a single non-homogeneous cluster C. The algorithm computes a mean instance for each class in C. These mean instances are called class-means. Subsequently, RHC runs k-means clustering on C by adopting the class-means as initial means. The result is the creation of as many clusters as the number of distinct classes in C. This clustering process is applied to each non-homogeneous cluster. In the end, all clusters are homogeneous and each cluster contributes to the condensing set a prototype that is constructed by averaging the instances of the cluster.
RHC generates many prototypes for close-border areas and fewer for the "internal" areas. RHC uses the class-means as initial means for the k-means clustering in order to quickly find large homogeneous clusters. This property has the advantage of achieving a high reduction rate (the larger the clusters discovered, the higher the reduction rate achieved). Obviously, noisy instances can affect the reduction rates. Since RHC is based on k-means clustering, it is fast. Also, its condensing set does not depend on the ordering of the training data. The experimental study presented in [16,14] shows that RHC has higher reduction rates and is faster than RSP3 and CNN-rule, whereas accuracy remains high. Please note that dRHC [16] is a variation of RHC that handles large datasets that cannot reside in main memory.
The ERHC algorithm. The Editing and Reduction through Homogeneous Clusters (ERHC) algorithm [17] is a simple variation of RHC that tries to deal with noisy data. ERHC differs from RHC in the following point: whenever a homogeneous cluster with only one instance is discovered, ERHC discards it. Thus, the final condensing set contains the means of the homogeneous clusters that have more than one instance. Obviously, ERHC integrates an editing mechanism. It simultaneously removes noise and reduces the size of the training set. Therefore, it can be characterized as a hybrid PG algorithm. The experimental study in [17] shows that this simple editing mechanism can improve classification performance when the data contains noise.
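Both algorithms of this subsection can be sketched with one function, since ERHC only adds the singleton check. This is an illustrative sketch, not the published implementation: the names `mean`, `kmeans`, `rhc` and the `drop_singletons` flag are assumptions, k-means is a plain Lloyd-style loop, and the fallback used when k-means fails to split a cluster (one group per class) is a choice made here to guarantee termination.

```python
import math

def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(cluster, means, max_iter=100):
    """Plain k-means on (vector, label) pairs, seeded with the given means."""
    for _ in range(max_iter):
        parts = [[] for _ in means]
        for inst in cluster:
            j = min(range(len(means)), key=lambda i: math.dist(inst[0], means[i]))
            parts[j].append(inst)
        # An empty part keeps its previous mean.
        new_means = [mean([x for x, _ in p]) if p else m
                     for p, m in zip(parts, means)]
        if new_means == means:
            break
        means = new_means
    return [p for p in parts if p]

def rhc(ts, drop_singletons=False):
    """RHC; with drop_singletons=True it behaves like ERHC."""
    queue, cs = [list(ts)], []
    while queue:
        cluster = queue.pop()
        labels = list(dict.fromkeys(label for _, label in cluster))
        if len(labels) == 1:
            if drop_singletons and len(cluster) == 1:
                continue                  # ERHC: treat the lone instance as noise
            cs.append((mean([x for x, _ in cluster]), labels[0]))
            continue
        class_means = [mean([x for x, l in cluster if l == lab]) for lab in labels]
        parts = kmeans(cluster, class_means)
        if len(parts) == 1:
            # Assumption: if k-means cannot split the cluster, group by class.
            parts = [[i for i in cluster if i[1] == lab] for lab in labels]
        queue.extend(parts)
    return cs

ts = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "b"), ((9.0,), "b"), ((10.0,), "b")]
print(sorted(rhc(ts)))
print(sorted(rhc(ts, drop_singletons=True)))
```

In the small example, the isolated "b" instance at 2.0 sits inside the "a" area. RHC keeps it as a prototype of its own, whereas the ERHC variant discards it, which is the editing effect described above.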

Experimental setup
We conducted several experiments on thirteen datasets distributed by the KEEL repository [3]. Their profiles are presented in Table 1. Five datasets do not contain noise. All the other datasets have noise of various levels (see column "Noise" in Table 1). We do not use any editing algorithm for noise removal. For each dataset, we built six condensing sets. They were built by applying the algorithms presented in Section 3. More specifically, we used CNN-rule, IB2, RSP3, RHC, ERHC and AIB2.
We trained several SVMs on the original training set (without data reduction) and on each condensing set by using several parameter values. Finally, in each case we kept the most accurate SVM. In Subsection 4.2, we report only the accuracy measurements for those SVMs. The RBF kernel was used and the hyper-parameters γ and C were obtained through grid-search. Due to space restrictions, the parameter values we adopted are not reported.
For the five "noise-free" datasets, the k-NN classifier was run over the original training data and over the six condensing sets by setting k = 1. Most of the time k = 1 is the best choice for noise-free data. For the other eight datasets, we adopted four k values, namely, 1, 5, 9 and 13.
All measurements presented in Subsection 4.2 are average values obtained via five-fold cross-validation. We used the Euclidean distance as distance metric. Since CNN-rule, IB2 and AIB2 depend on the order of class labels in the training set, we randomized all the datasets. Excluding CAR, we did not perform any other transformations. The CAR dataset has ordinal attributes. We transformed the attribute values into numerical values. Furthermore, we normalized all attribute values of CAR to the interval [0, 1].
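The normalization step is not specified further in the text. A common choice is min-max scaling per attribute, sketched below as an assumption rather than a description of the exact procedure used for CAR. The function name `min_max_normalize` is hypothetical, and constant attributes are mapped to 0.0 by convention.

```python
def min_max_normalize(rows):
    """Scale every attribute (column) of `rows` to the interval [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

rows = [(1, 10), (2, 10), (3, 10)]
print(min_max_normalize(rows))
```

Scaling matters here because all the DRTs and the k-NN classifier rely on Euclidean distances, which would otherwise be dominated by attributes with large numeric ranges.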

Experimental results
We compared the six DRTs to each other by estimating the Preprocessing Cost (PC) and the Reduction Rate (RR) that they achieved. The larger the training set, the higher the cost for the k-NN classifier to classify a new item and the higher the cost of the training procedure of SVMs (it is at least O(N^2), see Subsection 2.1). The RR measurements therefore reflect the computational cost: the higher the RR, the lower the computational cost of k-NN classification and SVM training. For this reason, we do not include time measurements in our study.
Table 2 presents the RR and PC measurements. Best measurements are in bold. The last row shows the average values. We observe that ERHC achieved the highest RR. This means that the SVM that uses the condensing set built by ERHC requires the least time for its training. AIB2 is the fastest DRT. It builds its condensing set by computing the fewest distances. On the other hand, RSP3 has the highest computational cost for building its condensing set. In addition, RSP3 seems to build the largest condensing sets. As expected, ERHC achieves higher RR than RHC, and AIB2 is better than IB2 in terms of both RR and PC.
Tables 3 and 4 show the accuracy measurements achieved by the SVM and k-NN classifiers (Table 4 is the continuation of Table 3). Both tables contain seven rows for each dataset. Each row represents a different version of the same dataset. The first one concerns the original data (i.e., without data reduction). The other six rows concern the condensing sets constructed by the DRTs. Each column of the table concerns a classifier. In particular, the third column concerns the SVM classifiers while the other columns concern the k-NN classifiers. The best accuracy measurements of the different classifiers are in bold. The best accuracy among the different condensing sets is emphasized in italics.
The results depicted in both tables are quite interesting. In almost all cases, SVM classifiers are more accurate than the k-NN classifier. In addition, none of the DRTs seems to affect the accuracy achieved by SVMs. In most cases, an SVM trained on any condensing set is as accurate as the SVM trained on the initial training set. In eight datasets, the SVMs trained on the condensing set of RSP3 are the most accurate classifiers. However, RSP3 has the highest PC and the lowest RR measurements. In the remaining five datasets, the most accurate classifier is the SVM built on the condensing set of CNN-rule. The accuracy achieved by IB2 is close to that of CNN-rule, but IB2 is faster and achieves higher RR. A final comment is that the PG and PS-condensing algorithms can effectively be used for speeding up the training process of SVMs without sacrificing accuracy. Furthermore, we observe that, in the case of SVMs, the editing mechanism of ERHC is not as effective as it is when the k-NN classifier is used. In addition, although AIB2 achieves higher accuracy than IB2 in the case of k-NN classification, this is not true in the case of SVMs. Consequently, for SVMs, ERHC and AIB2 are not efficient extensions of RHC and IB2, respectively.

Conclusions
This paper demonstrated that the DRTs proposed for the k-NN classifier can also be applied for speeding up SVMs. More specifically, the experimental measurements of our study showed that the usage of a DRT can reduce the time needed for the training process of SVMs without negatively affecting accuracy. Although the particular DRTs have been proposed for speeding up the k-NN classifier, our study illustrated that the benefits are larger when SVMs are used.
The experimental results showed that in contrast to the k-NN classifier that can be affected by data reduction, the accuracy of SVMs is not affected.