Combining Prototype Selection with Local Boosting

Real-life classification problems require an investigation of relationships between features in heterogeneous data sets, where different predictive models can be more appropriate for different regions of the data set. One solution to this problem is the local boosting of weak classifiers ensemble method. The main drawbacks of this approach are the time required to predict an unseen instance and the decrease in classification accuracy in the presence of noise in the local regions. In this work, an improved version of the local boosting of weak classifiers, which incorporates prototype selection, is presented. Experimental results on several benchmark real-world data sets show that the proposed method significantly outperforms the local boosting of weak classifiers in terms of predictive accuracy and the time needed to build a local model and classify a test instance.


Introduction
In machine learning, instance-based (or memory-based) learners classify an unseen object by comparing it to a database of pre-classified objects. The fundamental assumption is that similar instances will share similar class labels.
The assumptions of a machine learning model do not necessarily hold globally. Local learning [1] methods address this problem. They extend learning algorithms that are designed for simple models to the case of complex data, for which the models' assumptions are valid only locally. The most common case is the assumption of linear separability, which is usually not fulfilled globally in classification problems. Despite this, any supervised learning algorithm that can find only a linear separation can be used inside a local learning process, producing a model that is able to represent complex non-linear class boundaries.
A technique of boosting local weak classifiers, based on a training set that has been reduced by prototype selection [11], is proposed. Boosting algorithms are well known to be susceptible to noise [2]. In the case of local boosting, the algorithm should manage reasonable noise and be at least as good as boosting, if not better. For the experiments, we used two variants of Decision Trees [21] as weak learning models: one-level Decision Trees, which are known as Decision Stumps [12], and two-level Decision Trees. An extensive comparison over several data sets was performed and the results show that the proposed method outperforms simple and local boosting in terms of classification accuracy.
Section 2.1 discusses localized experts, while Section 2.2 describes boosting approaches. Section 3 presents the proposed method. Section 4 presents and discusses the results of the experiments on several UCI data sets, in which the proposed method is compared with standard boosting and local boosting. Finally, Section 5 concludes the paper and suggests directions for further research.

Background Material
For completeness, local weighted learning, prototype selection methods and boosting classifier techniques are briefly described in the following subsections.

Local Weighted Learning and Prototype Selection
Supervised learning algorithms are considered global if they use all available training data to build a single predictive model that is applied to any unseen test instance. On the other hand, a method is considered local if only the nearest training instances around the test instance contribute to the class probabilities.
When the size of the training data set is small in contrast to the complexity of the classifier, the predictive model frequently overfits the noise in the training data. Therefore, the successful control of the complexity of a classifier has a high impact in accomplishing good generalization. Several theoretical and experimental results [23] indicate that a local learning algorithm provides a reasonable solution to this problem.
In local learning [1], each local model is built completely independently of all other models, so that the total number of local models in the learning method indirectly influences how complex a function can be estimated; complexity can only be controlled by the level of adaptability of each local model. This feature prevents overfitting if a strong learning pattern exists for training each local model.
Prototype selection is a technique that aims to decrease the training set size without sacrificing the prediction performance of a memory-based learner [18]. In addition, reducing the training set size may decrease the computational cost of the prediction phase.
Prototype selection techniques can be grouped in three categories: preservation techniques, which aim to find a consistent subset from the training data set, ignoring the presence of noise, noise removal techniques, which aim to remove noise, and hybrid techniques, which perform both objectives concurrently [22].

Boosting Classifiers
Experimental research has shown that ensemble methods usually perform better, in terms of classification accuracy, than the individual base classifier [2], and lately several theoretical explanations have been proposed for the success of some commonly used ensemble methods [13]. In this work, a local boosting technique that is based on a reduced training set, obtained through prototype selection [11], is proposed, and for this reason this section introduces the boosting approach.
Boosting constructs the ensemble of classifiers by successively adjusting the distribution of the training set based on the accuracy of the previously created classifiers. There are several boosting variants. These methods assign a weight to each training instance. Initially, all instances are equally weighted. In each iteration a new classification model, named base classifier, is generated using the base learning algorithm. The creation of the base classifier has to take the weight distribution into account. Then, the weight of each instance is adjusted, depending on the accuracy of the prediction of the base classifier for that instance. Thus, boosting attempts to construct new classification models that are better able to classify the instances that were "hard" for the previous ensemble members. The final classification is obtained from a weighted vote of the base classifiers. AdaBoost [8] is the best-known boosting method and the one used in the experimental analysis presented in Section 4.
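The reweighting scheme described above can be illustrated for the binary case. The following is a minimal sketch of one round of discrete AdaBoost with labels in {-1, +1}; the function name `adaboost_round` and the toy predictions are assumptions made for illustration and are not taken from the implementation used in this work.

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """One round of discrete AdaBoost for labels in {-1, +1}.

    Returns the vote weight (alpha) of the base classifier and the
    renormalised instance weights for the next round.
    """
    miss = y_true != y_pred
    # Weighted training error of the current base classifier.
    err = np.sum(weights[miss]) / np.sum(weights)
    # Clip so that a perfect or useless classifier does not produce inf/NaN.
    err = np.clip(err, 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)
    # Misclassified instances are up-weighted, correct ones down-weighted.
    new_w = weights * np.exp(-alpha * y_true * y_pred)
    return alpha, new_w / new_w.sum()

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # hypothetical base classifier output
w0 = np.full(5, 0.2)                    # all instances start equally weighted
alpha, w1 = adaboost_round(w0, y_true, y_pred)
```

In this toy round the single misclassified instance ends up carrying half of the total weight, which is what forces the next base classifier to focus on it.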
AdaBoost can use the weights in two ways to generate a new training data set for the base classifier. In boosting by sampling, the training instances are sampled with replacement with probability proportional to their weights. In boosting by weighting, the base learning algorithm accepts the weighted training set directly. In [26], the authors showed empirically that a local boosting-by-resampling technique is more robust to noise than standard AdaBoost. The authors of [17] proposed a Boosted k-NN algorithm that creates an ensemble of models with locally modified distance weighting, which has increased generalization accuracy and never performs worse than standard k-NN. In [10], the authors present a novel method for instance selection based on boosting instance selection algorithms in the same way boosting is applied to classification.
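Boosting by sampling, as described at the start of this subsection, can be sketched in a few lines. This is an illustrative example only; the helper name `resample_by_weight`, the fixed seed and the toy data are assumptions.

```python
import numpy as np

def resample_by_weight(X, y, weights, rng):
    """Draw a sample of the same size with replacement, where each
    instance is picked with probability proportional to its weight."""
    p = weights / weights.sum()
    idx = rng.choice(len(X), size=len(X), replace=True, p=p)
    return X[idx], y[idx], idx

rng = np.random.default_rng(0)          # fixed seed, for illustration only
X = np.arange(10, dtype=float).reshape(5, 2)
y = np.array([0, 1, 0, 1, 1])
# Hypothetical weights: instance 1 was misclassified in the previous round.
weights = np.array([0.125, 0.5, 0.125, 0.125, 0.125])
X_s, y_s, idx = resample_by_weight(X, y, weights, rng)
```

The base learner is then trained on `X_s, y_s`, so heavily weighted instances tend to appear several times in the resampled set.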

The Proposed Algorithm
Two main disadvantages of simple local boosting are: i) When the amount of noise is large, simple local boosting does not match the performance [26] of Bagging [3] and Random Forest [4]. ii) Saving the data for each pattern increases storage complexity. This might restrict the use of this method to limited training sets [21]. The proposed algorithm incorporates prototype selection to handle, among others, these two problems. In the learning phase, a prototype selection [11] method based on the Edited Nearest Neighbor (ENN) [24] technique reduces the training set by removing the training instances that do not agree with the majority of their k nearest neighbors. In the application phase, it constructs a model for each test instance to be estimated, considering only a subset of the training instances. This subset is selected according to the distance between the test instance and the available training instances. For each test instance, a boosting ensemble of a weak learner is built using only the training instances that lie close to the current test instance. The prototype selection aims to improve both the classification accuracy and the time needed to build a model for each test instance at prediction time.
The proposed ensemble method has some free parameters, such as the number of neighbors ($k_1$) to be considered when the prototype selection is executed, the number of neighbors ($k_2$) to be selected in order to build the local model, the distance metric and the weak learner. In the experiments, the well-known Euclidean distance was used as the distance metric.
In general, the distance between points x and y in a Euclidean space $\mathbb{R}^n$ is given by (1).

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1} $$

The most common value of k for the nearest neighbor rule is 5. Thus, $k_1$ was set to 5, while $k_2$ was set to 50, since a neighborhood of about this size is appropriate for a simple algorithm to build a precise model [14]. The proposed method is presented in Algorithm 1.
Algorithm 1
procedure Training(k1, distanceMetric)
    for each training instance do
        Find the k1 nearest neighbors using the selected distanceMetric
        if instance does not agree with the majority of the k1 then
            Remove this instance from the training set
        end if
    end for
end procedure
procedure Classification(k2, distanceMetric, weakLearner)
    for each testing instance do
        Find the k2 nearest neighbors using the selected distanceMetric
        Apply boosting to the base weakLearner using the k2 nearest neighbors
        The answer of the boosting ensemble is the prediction for the testing instance
    end for
end procedure
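The two procedures of Algorithm 1 can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this work: the function names `enn_filter` and `local_boost_predict`, the toy data and the shortcut for single-class neighborhoods are all hypothetical choices. The default metric of `NearestNeighbors` is the Euclidean distance.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

def enn_filter(X, y, k1=5):
    """Edited Nearest Neighbor: drop instances whose label disagrees
    with the majority label of their k1 nearest neighbors."""
    # Ask for k1 + 1 neighbors: each point is its own nearest neighbor
    # (assumes there are no exact duplicate points).
    nn = NearestNeighbors(n_neighbors=k1 + 1).fit(X)
    _, idx = nn.kneighbors(X)
    keep = np.empty(len(X), dtype=bool)
    for i, neigh in enumerate(idx[:, 1:]):
        labels, counts = np.unique(y[neigh], return_counts=True)
        # On a tie, argmax picks the smallest label (an arbitrary choice).
        keep[i] = labels[np.argmax(counts)] == y[i]
    return X[keep], y[keep]

def local_boost_predict(X_train, y_train, X_test, k2=50, max_depth=1, n_rounds=25):
    """For each test instance, boost a shallow tree on its k2 nearest
    training instances and return the ensemble's prediction."""
    k2 = min(k2, len(X_train))
    nn = NearestNeighbors(n_neighbors=k2).fit(X_train)
    _, idx = nn.kneighbors(X_test)
    preds = []
    for x, neigh in zip(X_test, idx):
        y_loc = y_train[neigh]
        if np.unique(y_loc).size == 1:
            # Assumption: a single-class neighborhood needs no model.
            preds.append(y_loc[0])
            continue
        weak = DecisionTreeClassifier(max_depth=max_depth, criterion="gini")
        model = AdaBoostClassifier(weak, n_estimators=n_rounds)
        model.fit(X_train[neigh], y_loc)
        preds.append(model.predict(x.reshape(1, -1))[0])
    return np.array(preds)

# Toy data: two well-separated Gaussian clusters (hypothetical).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (60, 2)), rng.normal(2, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
queries = np.array([[-2.0, -2.0], [2.0, 2.0]])

X_red, y_red = enn_filter(X, y, k1=5)                 # learning phase
y_hat = local_boost_predict(X_red, y_red, queries)    # application phase
```

Setting `max_depth=2` in `local_boost_predict` gives the two-level Decision Tree variant. Note that a new ensemble is fitted for every query, which is why the size of the stored training set directly affects prediction time.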

Numerical Experiments
In order to evaluate the performance of the proposed method, an initial version was implemented and a number of experiments were conducted using several data sets from different domains. The data sets were chosen from the UCI repository [16]. Discrete features were transformed to numeric ones using a simple quantization. Each feature was scaled to have zero mean and unit standard deviation. All missing values were treated as zero. Table 1 shows the name, the number of patterns, the number of attributes and the number of classes for each data set. All experiments were run on an Intel Core i3-3217U machine at 1.8GHz, with 8GB of RAM, running Linux Mint 17.3 64bit, using Python and the scikit-learn [19] library.
For the experiments, we used two variants of Decision Trees [25] as weak learners: one-level Decision Trees [12], also known as Decision Stumps, and two-level Decision Trees [20]. We used the Gini Impurity [5] as the criterion to measure the quality of the splits in both algorithms. The boosting process for all classifiers was performed using the AdaBoost algorithm with 25 iterations in each model. In order to calculate the accuracy of the classifiers, the whole data set was divided into five mutually exclusive folds and, for each fold, the classifier was trained on the union of all the other folds. This cross-validation was run five times for each algorithm and the mean value over the five folds was calculated.
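The scaling and missing-value handling described above can be sketched as follows. The function name `standardize` is hypothetical, and the order of the two steps is an assumption: here missing values are set to zero after scaling, which is equivalent to replacing them with the feature mean.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit standard deviation,
    then replace missing values (NaN) with zero."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std[std == 0] = 1.0            # constant columns: avoid division by zero
    Z = (X - mean) / std
    return np.nan_to_num(Z, nan=0.0)

# Toy matrix with one missing entry (hypothetical data).
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])
Z = standardize(X)
```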
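The evaluation protocol for the global boosting baselines can be sketched with scikit-learn as follows. This is an illustration under assumptions: the synthetic data set stands in for the UCI data, and `RepeatedKFold` is one plausible way to run 5-fold cross-validation five times (whether the original folds were stratified is not stated).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a benchmark data set (hypothetical).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Five mutually exclusive folds, repeated five times.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)

results = {}
for name, depth in [("BDS", 1), ("BDT", 2)]:
    # One-level (stump) or two-level tree, split quality measured by Gini.
    weak = DecisionTreeClassifier(max_depth=depth, criterion="gini")
    model = AdaBoostClassifier(weak, n_estimators=25)   # 25 boosting iterations
    scores = cross_val_score(model, X, y, cv=cv)        # accuracy per fold
    results[name] = scores.mean()
```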

Prototype Selection
The prototype selection process is independent of the base classifier and takes place once, in the training phase of the proposed algorithm. It depends only on the $k_1$ parameter, that is, the number of neighbors to be considered when the prototype selection is executed. Table 2 presents the average number of training patterns, the average number of removed patterns and the average reduction for each data set. The average refers to the average over all training folds during the 5-fold cross-validation.
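The per-data-set averages described above amount to simple arithmetic over the training folds. The sketch below uses hypothetical fold sizes, and it assumes that the average reduction is the ratio of the average number of removed patterns to the average training size; the function name `reduction_summary` is likewise an assumption.

```python
def reduction_summary(train_sizes, kept_sizes):
    """Average training size, average number of removed patterns and
    average reduction (in percent) over the cross-validation folds."""
    removed = [n - m for n, m in zip(train_sizes, kept_sizes)]
    avg_train = sum(train_sizes) / len(train_sizes)
    avg_removed = sum(removed) / len(removed)
    return avg_train, avg_removed, 100.0 * avg_removed / avg_train

# Hypothetical sizes before and after prototype selection for five folds.
avg_train, avg_removed, pct = reduction_summary(
    [120, 120, 120, 120, 120], [110, 112, 108, 111, 109]
)
```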

Using Decision Stump as base classifier
In the first part of the experiments, Decision Stumps [12] were used as weak learning classifiers. Decision Stumps (DS) are one-level Decision Trees that classify instances based on the value of just a single input attribute. Each node in a decision stump represents a feature in an instance to be classified and each branch represents a value that the node can take. Instances are classified starting at the root node and are sorted based on their attribute values. In the worst case, a Decision Stump behaves as a baseline classifier, and it may perform better if the selected attribute is particularly informative. The proposed method, denoted as PSLBDS, is compared with Boosting Decision Stumps, denoted as BDS, and the Local Boosting of Decision Stumps, denoted as LBDS. Since the proposed method uses fifty neighbors, a 50-Nearest Neighbors (50NN) classifier was included in the comparisons. Table 3 presents the average accuracy of the compared methods. It indicates that the hypotheses generated by PSLBDS appear to be better, since the PSLBDS algorithm has the best mean accuracy score in nearly all cases.
Demšar [6] suggests that non-parametric tests should be preferred over parametric ones in the context of machine learning problems, since they do not assume normal distributions or homogeneity of variance. Therefore, in order to validate the significance of the results, the Friedman test [9], which is a rank-based non-parametric test for comparing several machine learning algorithms on multiple data sets, was used, with PSLBDS as the control method. The average rankings, according to the Friedman test, are presented in Table 4. Assuming a significance level of 0.05, the p-value of the Friedman test in Table 4 indicates that the null hypothesis has to be rejected. So, there is at least one method that performs statistically differently from the proposed method. To investigate this further, Finner's [7] and Li's [15] post hoc procedures were used.
Table 5 presents the p-values obtained by applying the post hoc procedures to the results of the Friedman statistical test. Finner's and Li's procedures reject those hypotheses that have a p-value ≤ 0.05. The adjusted p-values obtained through the application of the post hoc procedures are presented in Table 6. Both post hoc procedures agree that the PSLBDS algorithm performs significantly better than BDS, LBDS and the 50NN rule.
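The ranking step and the Friedman test can be sketched with SciPy as follows. The accuracy matrix is entirely hypothetical and the method labels M1 to M4 are placeholders; none of these numbers come from the experiments reported here.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical accuracies: rows are data sets, columns are methods M1..M4.
acc = np.array([
    [0.90, 0.85, 0.88, 0.80],
    [0.75, 0.70, 0.74, 0.69],
    [0.95, 0.91, 0.93, 0.90],
    [0.82, 0.78, 0.80, 0.75],
    [0.88, 0.86, 0.87, 0.81],
    [0.79, 0.72, 0.77, 0.70],
])

# Rank 1 = best accuracy on a data set; ties would share the average rank.
ranks = np.vstack([rankdata(-row) for row in acc])
avg_ranks = ranks.mean(axis=0)

# One sample per method, each holding its accuracies across the data sets.
stat, p_value = friedmanchisquare(*acc.T)
reject_null = p_value < 0.05
```

A rejected null hypothesis only says that the methods do not all perform alike; identifying which ones differ from the control method requires a post hoc procedure.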
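For reference, adjusted p-values for the two post hoc procedures can be computed as sketched below. The formulas follow the way Finner's and Li's procedures are commonly stated for comparisons against a control method; that the same variants were used here is an assumption, and the unadjusted p-values are hypothetical.

```python
import numpy as np

def finner_adjust(p_values):
    """Finner step-down adjusted p-values for m comparisons with a control."""
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)
    m = len(p)
    # 1 - (1 - p_(j))^(m / j) for the j-th smallest p-value, j = 1..m.
    raw = 1.0 - (1.0 - p[order]) ** (m / np.arange(1, m + 1))
    # Enforce monotonicity with a running maximum, then cap at 1.
    adj_sorted = np.minimum(np.maximum.accumulate(raw), 1.0)
    adj = np.empty(m)
    adj[order] = adj_sorted        # restore the original order
    return adj

def li_adjust(p_values):
    """Li's adjusted p-values: p_i / (p_i + 1 - p_max)."""
    p = np.asarray(p_values, dtype=float)
    return p / (p + 1.0 - p.max())

# Hypothetical unadjusted p-values of three comparisons with the control.
p_unadjusted = [0.04, 0.001, 0.02]
p_finner = finner_adjust(p_unadjusted)
p_li = li_adjust(p_unadjusted)
```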

Using two-level Decision Tree as a base classifier
Afterwards, two-level Decision Trees were used as weak learning base classifiers. A two-level Decision Tree is a tree with a maximum depth of 2. The proposed method, denoted as PSLBDT, is compared to the Boosting Decision Tree, denoted as BDT, and the Local Boosting of Decision Trees, denoted as LBDT. Since the proposed method uses fifty neighbors, a 50-Nearest Neighbors (50NN) classifier was included in the comparisons. Table 7 presents the average accuracy of the compared methods. It indicates that the hypotheses generated by PSLBDT appear to be better, since the PSLBDT algorithm has the best mean accuracy score in most cases.
The average rankings, according to the Friedman test, are presented in Table 8. The proposed algorithm was ranked in first place again. Assuming a significance level of 0.05, the p-value of the Friedman test in Table 8 indicates that the null hypothesis has to be rejected. So, there is at least one method that performs statistically differently from the proposed method. To investigate this further, Finner's and Li's post hoc procedures were used again. Table 9 presents the p-values obtained by applying the post hoc procedures to the results of the Friedman statistical test. Finner's and Li's procedures reject those hypotheses that have a p-value ≤ 0.05. The adjusted p-values obtained through the application of the post hoc procedures are presented in Table 10. Both post hoc procedures agree that the PSLBDT algorithm performs significantly better than BDT and the 50NN rule, but not significantly better than LBDT, as far as the tested data sets are concerned.

Time Analysis
One of the two contributions of this study was to improve the classification time over the local boosting approach. In order to assess this, the total time required to predict all instances in the test folds was recorded. Specifically, the prediction of each test fold was executed three times and the minimum time was recorded for each fold. Then, the average over all folds was calculated. Table 11 presents the average prediction time in seconds of LBDS, PSLBDS, LBDT and PSLBDT. In the case of one-level decision trees (LBDS, PSLBDS), the proposed method reduced the expected prediction time by more than 15% in 6 of 14 cases, while in the case of two-level decision trees (LBDT, PSLBDT) the proposed method reduced the expected prediction time by more than 15% in 7 of 14 cases. Figure 1 presents the absolute percentage changes.
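The timing protocol described above (minimum of three runs per fold, then the average over folds) can be sketched with the standard library. The helper names and the dummy workload standing in for a fold prediction are assumptions for illustration.

```python
import time

def min_of_repeats(predict_fn, repeats=3):
    """Run predict_fn several times and return the fastest wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

def percent_change(t_baseline, t_proposed):
    """Relative change of the proposed method's time with respect to
    the baseline, in percent (negative means faster)."""
    return 100.0 * (t_proposed - t_baseline) / t_baseline

# Hypothetical stand-in for predicting each of five test folds.
fold_times = [min_of_repeats(lambda: sum(range(10_000))) for _ in range(5)]
avg_time = sum(fold_times) / len(fold_times)
```

Taking the minimum over repeated runs reduces the influence of background load on the measurement, since interference can only make a run slower.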

Synopsis and Future Work
Local memory-based techniques delay the processing of the training set until they receive a request for an action such as classification or local modelling. A data set of observed training examples is always retained and the estimate for a new test instance is obtained from an interpolation based on a neighborhood of the query instance. In this work, a method that applies local boosting after prototype selection is presented. Experiments on several data sets show that the proposed method significantly outperforms the boosting and local boosting methods in terms of classification accuracy and the time required to build a local model and classify a test instance. Typically, boosting algorithms are well known to be susceptible to noise [2]. In the case of local boosting, the algorithm should handle a reasonable amount of noise and be at least as good as boosting, if not better. Given the promising results obtained from the experiments, one can expect that the proposed method can be successfully applied to real-world classification tasks with higher accuracy than the compared machine learning approaches. In future work, the proposed method will be investigated for regression problems, as well as for the problem of reducing the size of the stored set of instances by also applying feature selection in addition to simple prototype selection.