MeLiF+: Optimization of Filter Ensemble Algorithm with Parallel Computing

Searching for an ensemble of algorithms, that is, the best combination of algorithms, is a commonly used approach in machine learning. The MeLiF algorithm uses this technique for filter feature selection. In our research, we propose a parallel version of this algorithm and show that it not only improves the performance of the algorithm significantly, but also improves feature selection quality.


Introduction
In the modern world, machine learning has become one of the most promising and most studied areas of science, mainly because of its universal applicability to data-related problems. One example of such an application area is bioinformatics [3; 4; 6; 10], which produces huge amounts of data about gene expression in different organisms. These data could potentially make it possible to determine which pieces of DNA are responsible for a visible change in an individual or for reactions to a particular change in the environment. The main problem with such data is the huge number of features and the relatively small number of objects. Because the space is high-dimensional, it is very hard to build a model that generalizes such data well. Furthermore, many features in such datasets are unrelated to the outcome, so they should be treated as noise.
It seems logical in this case to select the most relevant features and to train a classifier on these only. This idea is implemented in the area of machine learning known as feature selection. There are three main groups of feature selection methods: filter selection, based on statistical measures of individual features or feature subsets; wrapper selection, based on subspace search with the classifier result as the optimization measure; and embedded selection, which uses inner properties of classifiers [12].
The main advantage of filter methods is their speed. As a result, they are frequently used for preprocessing, and the resulting feature subsets are then passed to a wrapper or embedded method. This is especially important for bioinformatics, where the number of features in a dataset can reach tens or hundreds of thousands.
These days, many machine learning algorithms use ensembling [1; 4; 8]. The MeLiF algorithm [13] applies this method to feature selection. It builds a linear combination of basic filters that selects the most relevant features. Structurally, MeLiF can be easily modified to work in a concurrent or distributed manner. In this research, we implemented a parallel version of MeLiF called MeLiF+ and achieved a significant speed improvement without loss of selection quality.
The remainder of the paper is organized as follows: the MeLiF algorithm is described in Section 2, the parallelization scheme is proposed in Section 3, the experimental setup and the quality measures used are outlined in Section 4, and the experimental results are presented in Section 5.

MeLiF
The algorithm treats some linear combinations of basic filters as starting points. Experiments have shown that the best choice of starting points is the following: (1, 0, ..., 0), (0, 1, ..., 0), ..., (0, 0, ..., 1), where only one basic filter matters at the beginning, and (1, 1, ..., 1), where all basic filters are weighted equally at the beginning. The algorithm iterates over the starting points and, for each point, tries to shift each coordinate value by the small constants +δ and −δ, where δ is the grid spacing. If one of these changes succeeds, i.e. the quality measure for the point after the shift is greater than the current maximum, the algorithm chooses that point and restarts the search from its first coordinate. If all coordinates have been shifted by +δ and −δ and no quality improvement has been observed, the algorithm stops.
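The search described above can be sketched in Python as follows. This is a minimal illustration of one reading of the description, not the authors' implementation: the function names `starting_points` and `coordinate_descent` are hypothetical, `evaluate` is assumed to map a weight vector to a quality value, and each starting point is compared against its own current best rather than against a shared maximum.

```python
from typing import Callable, Iterable, List, Tuple

Point = Tuple[float, ...]


def starting_points(n_filters: int) -> List[Point]:
    """Unit vectors (one basic filter each) followed by the all-ones point."""
    units = [tuple(1.0 if i == j else 0.0 for j in range(n_filters))
             for i in range(n_filters)]
    return units + [tuple(1.0 for _ in range(n_filters))]


def coordinate_descent(points: Iterable[Point],
                       delta: float,
                       evaluate: Callable[[Point], float]) -> Tuple[Point, float]:
    """Greedy grid search over filter weights; returns (best point, best quality)."""
    best_point, best_q = None, float("-inf")
    for start in points:
        p, q = start, evaluate(start)
        improved = True
        while improved:
            improved = False
            for dim in range(len(p)):
                for step in (+delta, -delta):
                    # Shift a single coordinate by one grid step.
                    cand = p[:dim] + (p[dim] + step,) + p[dim + 1:]
                    cand_q = evaluate(cand)
                    if cand_q > q:
                        p, q = cand, cand_q
                        improved = True
                        break
                if improved:
                    break  # restart the scan from the first coordinate
        # Keep the best result over all starting points.
        if q > best_q:
            best_point, best_q = p, q
    return best_point, best_q
```

The search stops for a given starting point once a full pass over all coordinates yields no improvement, which matches the stopping rule stated in the text.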

Algorithm 1 MeLiF algorithm
Input: points, delta, evaluate
q* ← 0
p* ← 0
for each p ∈ points do
    q ← evaluate(p)
    if q > q* then
        p* ← p
        q* ← q
smthChanged = true
while smthChanged do
    for each dim ∈ p.size do
        p+ ← p
        ...

Then, for each point obtained during coordinate descent, the algorithm computes the value of the resulting linear combination of basic filters for each feature in the dataset. After that, the results are sorted and the algorithm selects the N best features. Then, the algorithm runs a classifier using only that feature subset. The obtained result is saved for comparison with other points and cached, which reduces running time because points that have already been visited are not evaluated again.
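The evaluation step described above (combine filter scores, sort, keep the N best features, classify, cache) might look roughly like the sketch below. All names are hypothetical: `filter_scores[k][j]` is assumed to hold the score that basic filter k assigns to feature j, and `classify_quality` stands in for whatever routine trains the classifier on a feature subset and returns its quality.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def select_top_features(filter_scores: Sequence[Sequence[float]],
                        weights: Point,
                        n_best: int) -> List[int]:
    """Rank features by the weighted sum of basic-filter scores, keep the top n_best."""
    n_features = len(filter_scores[0])
    combined = [sum(w * scores[j] for w, scores in zip(weights, filter_scores))
                for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: combined[j], reverse=True)
    return ranked[:n_best]


def make_cached_evaluate(filter_scores: Sequence[Sequence[float]],
                         n_best: int,
                         classify_quality: Callable[[List[int]], float]):
    """Build an evaluate(point) function that scores each visited point only once."""
    cache: Dict[Point, float] = {}

    def evaluate(point: Point) -> float:
        if point not in cache:
            subset = select_top_features(filter_scores, point, n_best)
            cache[point] = classify_quality(subset)  # assumed expensive call
        return cache[point]

    return evaluate, cache
```

Because coordinate descent revisits neighbouring grid points often, the cache avoids retraining the classifier for a weight vector that has already been scored.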

MeLiF+
We propose the following improvements to the MeLiF method: each starting point is processed in a separate thread, with the global maximum maintained through a synchronization point. Moreover, the evaluate submethod is run concurrently for +δ and −δ, and the best point is selected after both results have been retrieved. We show that this not only improves the performance of the algorithm on a multicore system, but also usually improves feature selection quality.
This fact has the following explanation. The original MeLiF algorithm is greedy: it assumes that if each point it steps into is a local optimum, then the resulting point will be the global optimum. Adding the ability to look at both deltas simultaneously allows the algorithm to select a better local optimum. Also, since starting points are processed in parallel, one thread can find a local optimum that causes other threads to stop their work, even if further descent would lead to a better result. This can change the selection result for better or for worse (both cases are presented in Section 5), but experiments show that the average MeLiF+ results are better.
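The parallel scheme can be illustrated with the following Python sketch, which is an assumption-laden reading of the description rather than the authors' code. The names `SharedBest`, `shift`, `descend` and `melif_plus_sketch` are hypothetical, `evaluate` is assumed to be thread-safe, and two separate thread pools are used so that descent threads waiting on their +δ and −δ evaluations cannot starve those evaluations of workers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[float, ...]


class SharedBest:
    """Global maximum shared by all descent threads (the synchronization point)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.point: Optional[Point] = None
        self.quality = float("-inf")

    def offer(self, point: Point, quality: float) -> bool:
        """Store the point if it beats the global maximum; report whether it did."""
        with self._lock:
            if quality > self.quality:
                self.point, self.quality = point, quality
                return True
            return False


def shift(p: Point, dim: int, step: float) -> Point:
    """Return a copy of p with coordinate dim moved by step."""
    return p[:dim] + (p[dim] + step,) + p[dim + 1:]


def descend(start: Point, delta: float, evaluate: Callable[[Point], float],
            best: SharedBest, eval_pool: ThreadPoolExecutor) -> None:
    """Coordinate descent from one starting point against the shared maximum."""
    p = start
    best.offer(p, evaluate(p))
    improved = True
    while improved:
        improved = False
        for dim in range(len(p)):
            # Evaluate both neighbours concurrently, then keep the better one.
            plus, minus = shift(p, dim, delta), shift(p, dim, -delta)
            f_plus = eval_pool.submit(evaluate, plus)
            f_minus = eval_pool.submit(evaluate, minus)
            cand, cand_q = max(((plus, f_plus.result()), (minus, f_minus.result())),
                               key=lambda pair: pair[1])
            if best.offer(cand, cand_q):
                p, improved = cand, True
                break  # restart from the first coordinate
    # The thread stops once no neighbour beats the global maximum.


def melif_plus_sketch(points: Sequence[Point], delta: float,
                      evaluate: Callable[[Point], float]) -> Tuple[Point, float]:
    best = SharedBest()
    n = len(points)
    with ThreadPoolExecutor(max_workers=n) as start_pool, \
            ThreadPoolExecutor(max_workers=2 * n) as eval_pool:
        futures = [start_pool.submit(descend, s, delta, evaluate, best, eval_pool)
                   for s in points]
        for f in futures:
            f.result()  # re-raise any exception from a descent thread
    return best.point, best.quality
```

Since every thread compares its candidates against the shared maximum, a thread whose neighbourhood cannot beat a value found elsewhere stops early, which is the behaviour described above. In CPython, threads of this kind illustrate the structure only; a real speed-up for CPU-bound evaluation would need processes or native code.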

Experiments
We used SVM [5] from the WEKA [14] library with a polynomial kernel and soft margin parameter C = 1 as the classifier. To improve stability, we used 5-fold cross-validation. The number of selected features was constant: N = 100. To compare our method with the original one, we used the F1 score [11] of the SVM classifier.
As we wanted to know how much our method differs from the original one in terms of search strategy, we calculated a z-score for each dataset.
We ran our experiments on a machine with the following characteristics: a 32-core AMD Opteron 6272 CPU @ 2.1 GHz and 128 GB of RAM. We used N = 2 · p · f = 50 threads, where p is the number of starting points and f is the number of folds used for cross-validation.
As basic filters, we used Spearman Rank Correlation (SPC), Symmetric Uncertainty (SU), Fit Criterion (FC) and Value Difference Metric (VDM) [2; 9]. For each dataset, we executed MeLiF and MeLiF+ and recorded their running time and the points with the best classification result.
We used 50 datasets of different sizes: 33 datasets were taken from Gene Expression Omnibus, 5 from Kent Ridge Bio-Medical Dataset, 5 from RSCTC'2010 Discovery Challenge, 4 from Broad Institute Cancer Program Data Sets, and 3 from Feature Selection Datasets at Arizona State University. Some datasets were multi-labeled, so we split them into several derivative binary datasets with the commonly used one-versus-all technique. Then we excluded datasets that contained too few instances of one of the classes. After that, we applied standard feature scaling and discretized all features to 11 different values from -5 to 5.
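For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for binary labels and shows one simple way to split sample indices into folds. The helper names `f1_score` and `k_fold_indices` are hypothetical, and the round-robin fold assignment is an assumption, since the text does not say how the folds were formed.

```python
from typing import List, Sequence


def f1_score(true_labels: Sequence[int], predicted: Sequence[int],
             positive: int = 1) -> float:
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    tp = fp = fn = 0
    for t, p in zip(true_labels, predicted):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n_samples: int, k: int) -> List[List[int]]:
    """Assign sample indices to k test folds in round-robin order."""
    return [list(range(i, n_samples, k)) for i in range(k)]
```

With 5-fold cross-validation, each fold serves once as the test set while the classifier is trained on the remaining four.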
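The thread count follows directly from the setup described in the paper: four basic filters give four unit starting points plus the all-ones point, so p = 5, and 5-fold cross-validation gives f = 5. The small helper below (the name `thread_count` is hypothetical) spells out that arithmetic; the factor 2 corresponds to the concurrent +δ and −δ evaluations.

```python
def thread_count(n_filters: int, n_folds: int) -> int:
    """N = 2 * p * f, with p = n_filters + 1 starting points."""
    n_starting_points = n_filters + 1  # unit vectors plus the all-ones point
    return 2 * n_starting_points * n_folds


print(thread_count(n_filters=4, n_folds=5))  # 2 * 5 * 5 = 50
```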
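The preprocessing steps above could be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the scaling or the discretization rule, so the sketch assumes standardization to zero mean and unit variance followed by rounding to the nearest integer and clipping to [-5, 5], which yields 11 possible values. The function names are hypothetical.

```python
import statistics
from typing import Dict, Hashable, List, Sequence


def one_vs_all(labels: Sequence[Hashable]) -> Dict[Hashable, List[int]]:
    """Derive one binary labeling per class: 1 for that class, 0 for the rest."""
    return {cls: [1 if label == cls else 0 for label in labels]
            for cls in sorted(set(labels))}


def scale_and_discretize(column: Sequence[float],
                         low: int = -5, high: int = 5) -> List[int]:
    """Standardize one feature column, then round and clip to integers in [low, high]."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return [0 for _ in column]  # constant feature: every value maps to 0
    return [max(low, min(high, round((x - mean) / std))) for x in column]
```

Each derived binary labeling is then treated as a separate dataset, and labelings with too few instances of one class would be dropped before the experiments.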

Results
The table below contains the experimental results. All datasets are sorted by their total size, which is the product of the number of features and the number of objects. In the F1 score comparison of MeLiF and MeLiF+, the better result for each dataset is highlighted in grey; equal results are not highlighted. Runtime is given in seconds. The last column contains the z-score. As can be seen from the table, MeLiF+ is always at least 3 times faster than MeLiF, and the difference reaches 6 times for some datasets. Although MeLiF and MeLiF+ have almost the same F1 scores, the z-score indicates some difference in their behavior on 15 datasets. MeLiF+ produced worse results than the original MeLiF algorithm in only 5 cases; on 36 datasets the two algorithms performed equally, and on 11 datasets the new algorithm outperformed the original one.

Conclusion
The proposed parallelization scheme made the algorithm run 5.5 times faster on average without affecting selection quality. Unfortunately, in this research we did not achieve linear speedup because of the fixed maximum number of points processed in parallel. In future work, we plan to use a thread pool whose size is limited by the testing system and to achieve linear speedup by using an exploration and exploitation [7] strategy to spread the search points across the search space. This should also lead to a substantial increase in the optimized measure.

Acknowledgements
The authors would like to thank Julia Ugarkina and Andrey Filchenkov for useful comments and proofreading. This work was financially supported by the Government of the Russian Federation, Grant 074-U01.