A Dynamic Early Stopping Criterion for Random Search in SVM Hyperparameter Optimization

Abstract. We introduce a dynamic early stopping condition for Random Search (RS) optimization algorithms. We test our algorithm on SVM hyperparameter optimization for classification tasks, on six commonly used datasets. According to the experimental results, we significantly reduce the number of trials used. Since each trial requires a re-training of the SVM model, our method accelerates the RS optimization. The code runs on a multi-core system and we analyze the achieved scalability for an increasing number of cores.


Introduction
Most Machine Learning (ML) models are described by two sets of parameters. The first set consists of regular parameters that are learned through training. The other set, called hyperparameters or meta-parameters, consists of parameters which are set before the learning starts. It is essential to identify the combination of hyperparameter values which produces the best (or close to the best) generalization performance. This is done by re-training multiple models with different combinations of hyperparameter values and evaluating their performance. We call this re-training + evaluation for one set of hyperparameter values a trial. Since training a model can be very resource intensive, it is important to reduce the number of trials.
In the specific case of SVM classifiers, the algorithm's performance depends on several parameters and is quite sensitive to changes in any of those parameters [1]. The choice of the kernel, for example, can have a dramatic influence on the classification performance [2]. The cost parameter C, controlling the trade-off between margin maximization and error minimization, is also highly important since, for the non-separable case, the algorithm must allow training errors. For a polynomial kernel, a wrong choice of the degree can easily lead to over-fitting [3].
Random Search (RS) is another standard technique for hyperparameter optimization. A nice feature of RS is the possibility of adaptive early stopping. The key is to define a good stopping criterion, representing a trade-off between accuracy and computation time. The rise of randomized methods began with the work of Bergstra and Bengio [11,12]. Using the same number of trials, RS generally yields better results than Grid Search (GS) or more complicated hyperparameter optimization methods. Especially in higher dimensional spaces, the computation resources required by RS methods are significantly lower than for GS [13]. Also, RS methods are relatively simple and easy to implement on parallel computer architectures.
Several software libraries dedicated to hyperparameter optimization exist, some of them being autonomous, while others are built on top of existing ML software. LIBSVM [14] and scikit-learn [15] come with their own implementation of GS, with scikit-learn also offering support for RS. Spearmint [16] and Bayesopt [17] are software packages dedicated to Bayesian optimization. Auto-WEKA [18] is also able to perform Bayesian optimization but, unlike the previous two, which are standalone libraries, it is built on top of Weka [19]. Hyperopt [20] and Optunity [21] are currently two of the most advanced libraries for hyperparameter optimization.
Our contribution is an improved RS optimization technique, which reduces the number of trials without a significant impact on the prediction performance. The key is a new, dynamically calculated early stopping condition for RS. The method is implemented in parallel and achieves good scalability. Our experiments are on the SVM classification problem applied to six commonly used datasets and five hyperparameters. According to them, our method accelerates the RS optimization.
The paper proceeds as follows. Section 2 describes our algorithm and the dynamic stopping condition, with an emphasis on the algorithm's parallel nature. Section 3 presents the experimental results, and the paper is concluded with Section 4.

Proposed Algorithm and Probabilistic Properties
A highly simplified version of a hyperparameter optimization algorithm is characterized by an objective fitness function f and a generator of samples g. The fitness function returns a classification accuracy measure of the target model, computed either through cross-validation or on a separate validation set. The generator g is in charge of providing the next set of values that will be used to compute the model's fitness. A hasNext method implemented by the generator offers the possibility to terminate the algorithm before the maximum number of N evaluations is reached, if some convergence criterion is satisfied.
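This abstraction can be sketched in Go (the language of our implementation); the type and function names below are illustrative, not the actual library code:

```go
package main

// Params holds one combination of hyperparameter values.
type Params map[string]float64

// Generator supplies candidate combinations. HasNext may return
// false before N samples were produced, which allows the optimizer
// to terminate early when a convergence criterion is satisfied.
type Generator interface {
	HasNext() bool
	Next() Params
}

// Fitness returns an accuracy measure of the target model for one
// combination, computed e.g. through cross-validation.
type Fitness func(Params) float64

// Optimize evaluates combinations until the generator stops and
// returns the best one together with its fitness.
func Optimize(g Generator, f Fitness) (Params, float64) {
	var best Params
	bestFit := -1.0
	for g.HasNext() {
		p := g.Next()
		if fit := f(p); fit > bestFit {
			best, bestFit = p, fit
		}
	}
	return best, bestFit
}

// sliceGen is a toy generator replaying a fixed list of C values,
// used here only to exercise the interface.
type sliceGen struct {
	vals []float64
	i    int
}

func (s *sliceGen) HasNext() bool { return s.i < len(s.vals) }

func (s *sliceGen) Next() Params {
	p := Params{"C": s.vals[s.i]}
	s.i++
	return p
}
```

A RS optimizer is obtained by plugging in a generator whose Next draws from the hyperparameter distributions and whose HasNext encodes the stopping criterion.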
In the particular case of RS, the generator g simply draws samples from the specific distribution of each of the hyperparameters to be optimized. Our goal is to reduce the computational complexity of the RS method in terms of number of trials. In other words, we aim to compute fewer than N trials, without a significant impact on the value of the fitness function.
For this, we introduce a dynamic stopping criterion, included in a randomized optimization algorithm (Algorithm 1). The algorithm is a two-step optimizer. First, it iterates for a predefined number of steps n, with n << N, and finds the optimal combination of hyperparameter values, temp_opt. Then, it searches for the first result better than temp_opt. The optimal result, opt, is either the first result better than temp_opt or, if no such result is found within the N trials, temp_opt itself. The following problems arise: i) Can we determine a value for n that maximizes the probability of obtaining the best result?; and ii) Can the algorithm be parallelized without impacting the probability of obtaining an optimal value?

Sequential execution
Algorithm 1 finds the optimum under the assumption that opt is in some position i, i > n, and no result better than temp_opt is in the range [n+1, i-1]. We denote by $E_i^1$ the event that opt is reached on the i-th trial, and by $E_i^2$ the event that no value better than temp_opt is obtained between the n-th and the i-th trial. The probability of $E_i^1$ is

\[ P(E_i^1) = \frac{1}{N}. \qquad (1) \]

The probability that all values in the range [n+1, i-1] are worse than temp_opt is the same as the probability that the best result among the first i-1 trials lies among the first n:

\[ P(E_i^2) = \frac{n}{i-1}. \qquad (2) \]

Since the two events are independent, the probability that we hit opt after i > n attempts is:

\[ P(E_i^1 \cap E_i^2) = \frac{n}{N(i-1)}. \qquad (3) \]

The event E of finding opt after at most m attempts has probability

\[ P(E) = \sum_{i=n+1}^{m} \frac{n}{N(i-1)} = \frac{n}{N} \sum_{i=n}^{m-1} \frac{1}{i}. \qquad (4) \]

Since 1/i is monotonically decreasing, the right term of eq. (4) has a lower bound:

\[ \frac{n}{N} \sum_{i=n}^{m-1} \frac{1}{i} \geq \frac{n}{N} \ln\frac{m}{n}. \qquad (5) \]

We differentiate the right term of eq. (5):

\[ \frac{d}{dn}\left[ \frac{n}{N} \ln\frac{m}{n} \right] = \frac{1}{N}\left( \ln\frac{m}{n} - 1 \right), \qquad (6) \]

equate to zero and solve for n, obtaining:

\[ n = \frac{m}{e}. \qquad (7) \]

Fig. 1. Lower bound heatmap of the probability to obtain the best result from a target space of maximum 300 attempts while terminating faster, depending on the values of m (x axis) and n (y axis). Darker shades correspond to greater probability.

Choosing for n a value larger than the optimal one increases the probability of finding the combination of values that yields the optimal result, but with an increased risk of a greater number of trials. The result from eq. (7) can be used to implement an improved version of Algorithm 1 that automatically sets the value of n to N/e. For example, in order to maximize the chances of obtaining the best value after a target maximum of 150 attempts, we must set n to 150/e (approximately 55). For a target maximum of 100 attempts, n should be 37, and so on. Fig. 1 shows the lower bound heatmap of the probability to obtain the best results while stopping earlier, with respect to the values of m and n.
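The rule n = m/e is straightforward to compute; the sketch below reproduces the examples above (n = 55 for m = 150, n = 37 for m = 100):

```go
package main

import "math"

// OptimalN returns the length n of the initial phase that maximizes
// the lower bound (n/N)*ln(m/n) of eq. (5) for a target maximum of
// m trials, i.e. n = m/e from eq. (7), rounded to the nearest integer.
func OptimalN(m int) int {
	return int(math.Round(float64(m) / math.E))
}
```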

Parallel execution
We generate a parallel implementation of our method as follows:
- Split the work between W workers (which can be anything from lightweight threads of execution, OS threads and CPU cores to different servers). We decided to use the GOLANG [22] support for goroutines, which are basically lightweight threads managed by the GO run-time.
- Each worker w executes N/W trials using the same early stopping criterion. In this case, n_w = n/W, signifying that, on average, with m = N/2, W workers will terminate after N/(2W) trials, with N/W being the worst case.
- The manager gathers the results from all workers and selects the best candidate.
Algorithm 2 implements the above steps. The random values are either generated and distributed by the manager, or each worker generates its own random sequence. Any of the following parallel pseudo-random number generation strategies can be selected [23]: Manager-Worker (MW), Sequence Splitting (SS), Leapfrog (LF), and Parametrization (P).

The inverse problem
Given a restricted computational budget, expressed by a target number of trials m, we obtained the optimal value for n. We are now interested in solving the reverse problem: given an acceptable probability P of achieving the best result among the N trials, which is the optimal value for n? For the RS algorithm without the dynamic stopping criterion, if all trials are independent, the number of trials needed to identify the optimum with probability P is given by

\[ m = N P. \qquad (8) \]

The problem becomes interesting in the context of our stopping criterion when we are willing to compromise, by accepting a lower probability P, for a further reduction of the number of trials. In the case of Algorithm 2, according to eq. (5), probability P has a lower bound:

\[ P \geq \frac{n}{N} \ln\frac{m}{n}. \]

This, together with eq. (7), gives:

\[ P \geq \frac{m}{eN}. \qquad (9) \]

Eq. (9) represents the probability to identify the optimum regardless of the activation of the stopping criterion (opt might also be among the first n trials, in which case the algorithm will test all the possible combinations). The probability to find the optimum after a number of trials strictly lower than N has a lower bound given by relation (5), which translates to:

\[ P \geq \frac{n}{N} \ln\frac{N}{n}. \qquad (10) \]

The value of n can be adjusted in the interval [m/e, m] to increase the probability of identifying the optimal value, but this at the same time increases the computational cost (the number of trials).
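Under the lower bound P >= m/(eN) of eq. (9), the inverse problem can be solved directly for the budget m and the corresponding observation phase n from eq. (7); the helper below is an illustrative sketch of that computation, not part of the described algorithm:

```go
package main

import "math"

// BudgetFor inverts the bound P >= m/(e*N): given an acceptable
// probability p of identifying the optimum among N trials, it
// returns the target number of trials m = ceil(e*p*N), capped at N,
// and the matching observation phase n = m/e from eq. (7).
func BudgetFor(p float64, N int) (m, n int) {
	m = int(math.Ceil(math.E * p * float64(N)))
	if m > N {
		m = N
	}
	n = int(math.Round(float64(m) / math.E))
	return m, n
}
```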

Experiments
We use our method to optimize the following five hyperparameters of an SVM [1] classifier: kernel type (RBF, Polynomial or Linear, chosen with equal probability); gamma (drawn from an exponential distribution with parameter 10); cost C (drawn from an exponential distribution with parameter 10); degree (chosen with equal probability from the set {2,3,4,5}); and coef0 (uniform on [0,1]). We run our experiments on six of the most popular datasets from the UCI Machine Learning Repository: Adult (a1a), Adult (a6a), Breast Cancer, Diabetes, Iris and Wine. Adult (a1a) and Adult (a6a) are variations of the same dataset but with different numbers of samples; the second one is around six times larger. Details of the datasets are presented in Table 1. We apply ten-fold cross-validation to evaluate the classification accuracy [24] and compare the obtained results, both in terms of classification performance and number of trials. We use the following optimizers (all implemented in the Optunity library): GS, RS, Particle Swarm, and Nelder-Mead. We also use the Weka SVM, with its implicit hyperparameters.
We run Algorithm 2 with W = 8 and N = 250, which leads to n = 92. We also run the four optimizers in Optunity, for a maximum number of 250 trials.

Accuracy estimation
Table 2 presents the results of applying Algorithm 2 for the four parallelization strategies, compared with the results obtained with Optunity (RS, GS, Particle Swarm and Nelder-Mead), as well as with the results obtained using Weka with each of the three kernels (RBF, Polynomial and Linear) and the implicit values for the other parameters. The best results are marked in bold. Since we compare multiple classifiers on multiple datasets, we have to use additional statistical tests for further investigation, as suggested in [25].
We calculate the Friedman [26] and the Iman-Davenport [27] statistics using eq. (11) and eq. (12), respectively, with N being the number of datasets, k the number of algorithms and $R_j$ the average rank of algorithm j from Table 3, and obtain $\chi^2_F = 32.826$ and $F_F = 6.04$. With 11 algorithms and six datasets, $F_F$ follows an F distribution with 11 - 1 = 10 and (11 - 1)(6 - 1) = 50 degrees of freedom. The critical value of F(10, 50) at the 0.05 significance level is 2.03, so we reject the null hypothesis, which means the algorithms are not equivalent in terms of prediction performance.
The critical difference [28,25] is given by:

\[ CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}, \qquad (13) \]

where the critical values $q_\alpha$ are based on the Studentized range statistic divided by $\sqrt{2}$. At the significance level $\alpha = 0.1$ (which yields a smaller critical difference than $\alpha = 0.05$), we observe that Optunity-NM is significantly worse than GO-LF and GO-P, and also that GO-P is significantly better than Optunity-NM and Weka-RBF.

Efficiency of the stopping condition
Based on the above results, we exclude Nelder-Mead (due to its significantly worse classification performance) and the Weka SVM (since it does not perform a real hyperparameter optimization) from the analysis. Table 4 depicts the rank across all datasets in terms of number of trials.
We perform another Friedman test and, using formulas (11) and (12), obtain $\chi^2_F = 28.554$, which corresponds to an Iman-Davenport statistic well above the critical value of 2.420 for the F(6, 30) distribution at the 0.05 significance level. This means that we can rule out the null hypothesis and state that the algorithms are not equivalent with respect to the number of trials. We compute the critical difference according to formula (13) and obtain $CD_{0.05} = 3.678$. Table 5 shows the difference in the average rank values for each pair of algorithms. The values greater than $CD_{0.05}$ are marked in bold font. We can identify two groups of algorithms: the first group (GO-SS and GO-P) performs significantly better than the second group (Optunity-GS, Optunity-RS and Optunity-PS). It is not clear to which of the two groups GO-MW and GO-LF belong. One possible explanation for the better results obtained by GO-SS and GO-P may be related to the superior parallel implementation of the random generators. However, since the number of random values generated in our tests is relatively small, this difference in performance is most probably coincidental. The Holm test rejects all four hypotheses, since the corresponding p values are smaller than the adjusted alphas, leading to the conclusion that all four versions of our algorithm are significantly more efficient in terms of number of trials than the standard RS implementation.

Scalability
Besides the accuracy and the number of runs, we also measure the algorithm's speedup (the ratio of the sequential execution time to the parallel execution time) as a measure of its scalability. The values are depicted in Table 7.

Conclusions
We introduced a new dynamic stopping condition for RS-based hyperparameter optimization, together with its parallel implementation. In the context of SVM classification, on six of the most commonly used datasets, we obtained accuracy values on par with the existing mainstream hyperparameter optimization techniques. With all four of the parallel random generators used, the algorithm terminates after a significantly reduced number of trials compared to the standard implementation of RS, which leads to an important decrease in the computational budget required for the optimization.
The present work opens further research directions in terms of optimizing the hyperparameters of other ML algorithms, where the search space has a larger number of dimensions and the required computational budget is currently a major issue. The algorithm implementation is flexible enough to allow gradient-free optimization of any function.

Table 1 .
Details of the used datasets.

Table 2 .
Accuracy and number of trials for Algorithm 2 using different parallelization strategies (MW, SS, LF, P), compared with Optunity (RS, GS, Particle Swarm and Nelder-Mead) and Weka's SVM.

Table 3 .
Algorithms' accuracy ranking on the used datasets.

Table 4 .
Algorithms' ranking in terms of number of runs.

Table 5 .
Difference in algorithms' rankings in terms of number of runs.

Table 6 .
Performance in terms of number of trials for GO-MW, GO-P, GO-SS and GO-LF against Optunity-RS, according to the Holm test.

Table 7 .
Algorithm speedup with increasing number of cores.