Algorithm Selector and Prescheduler in the ICON challenge

Abstract. Algorithm portfolios are known to offer robust performance, efficiently overcoming the weaknesses of every single algorithm on particular problem instances. Two complementary approaches to getting the best out of an algorithm portfolio are algorithm selection (AS) and scheduling, i.e., sequentially launching a few algorithms, each on a limited computational budget. The presented Algorithm Selector And Prescheduler (ASAP) system relies on the joint optimization of a pre-scheduler and a per-instance AS, selecting an algorithm well suited to the problem instance at hand. ASAP has been thoroughly evaluated against the state of the art during the ICON challenge on algorithm selection, receiving an honourable mention. Its evaluation on several combinatorial optimization benchmarks exposes surprisingly good results of the simple heuristics used; some extensions thereof are presented and discussed in the paper.

Acknowledgements. This work has been carried out in the framework of IRT SystemX and was therefore granted public funds within the scope of the French Program "Investissements d'Avenir".


Introduction
In quite a few domains related to combinatorial optimization, such as satisfiability, constraint solving or operations research, it has been acknowledged for decades that there exists no universal algorithm dominating all other algorithms on all problem instances. This result, referred to as the No Free Lunch theorem [20], has prompted the scientific community to design algorithm portfolios addressing the various types of difficulties involved in the problem instances, i.e., such that at least one algorithm in the portfolio can efficiently handle any problem instance [9,6]. Algorithm portfolios thus raise a new issue, that of selecting a priori an algorithm well suited to the application domain [12]. This issue, referred to as Algorithm Selection (AS) and first formalized by Rice [18], is key to the successful transfer of algorithms outside of research labs. It has been tackled by a number of authors in recent years (more in section 2).
Algorithm selection comes in different flavors, depending on whether the goal is an optimal performance in expectation with respect to a given distribution of problem instances (global AS), or an optimal performance on a particular problem instance (per-instance AS). Note that the measure of performance depends on the domain (e.g., time-to-solution in satisfiability, or time to reach the optimal solution up to a given precision in optimization). This paper focuses on the per-instance setting, aimed at achieving peak performance on every problem instance.
In some domains, it is often the case that some problem instances can be solved in no time by some algorithms. It thus makes sense to allocate a part of the computational budget to a pre-scheduler, sequentially launching a few algorithms with a small computational budget each. The pre-scheduler is expected to solve "easy" instances in a first stage; in a second stage, AS is only launched on problem instances which have not been solved in the pre-scheduler phase. Note that the pre-scheduler makes it possible to extract some additional information characterizing the problem at hand, which can be used together with the initial information about the problem instance to support the AS phase.
This paper presents the Algorithm Selector And Prescheduler system (ASAP), aimed at algorithm selection in the domain of combinatorial optimization (section 3). The main contribution lies in the joint optimization of both a pre-scheduler and a per-instance algorithm selector. The extensive empirical validation of ASAP is conducted on the ICON challenge on algorithm selection [11]. This challenge leverages the Algorithm Selection library [1], aimed at the fair, comprehensive and reproducible benchmarking of AS approaches on 13 domains ranging from satisfiability to operations research (section 4).
The comparative empirical validation of ASAP demonstrates its good performance compared to state-of-the-art pre-schedulers and AS approaches (section 5), and its complementarity with respect to the prominent Zilla algorithms [22]. The paper concludes with a discussion of the limitations of the ASAP approach and some perspectives for further research.

Algorithm selectors
The algorithm selection issue, aimed at selecting the algorithm best suited to the problem at hand, was first formalized by Rice [18] as follows. Given a problem space mapping each problem instance onto a description $x$ thereof (usually $x \in \mathbb{R}^d$) and the set $A$ of algorithms in the portfolio, let $G(x, a)$ denote a performance model, mapping each $(x, a)$ pair onto the performance of algorithm $a$ on problem instance $x$. AS most naturally follows from such a performance model by selecting, for each problem instance $x$, the algorithm $a$ with optimal $G(x, a)$.
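In code, Rice's framework reduces to a one-line decision rule once a performance model is available. The following is a minimal sketch, assuming a performance measure where lower is better (e.g., runtime); the model `G` and the toy algorithm list are hypothetical illustrations, not ASAP's implementation.

```python
import numpy as np

def select_algorithm(G, x, algorithms):
    """Per-instance AS: query the learned performance model G(x, a) for
    every portfolio member and return the algorithm with the best
    (here: lowest, e.g. runtime) predicted performance."""
    predictions = [G(x, a) for a in algorithms]
    return algorithms[int(np.argmin(predictions))]

# Toy usage with a hypothetical model: prefer algorithm 1 on "large" instances.
G = lambda x, a: x[0] if a == 0 else 1.0 - x[0]
print(select_algorithm(G, np.array([0.8]), algorithms=[0, 1]))  # -> 1
```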
The performance model is most usually built by applying machine learning approaches to a dataset reporting the algorithm performances on a comprehensive set of benchmark problem instances (with the exception of [5], which uses a multi-armed bandit approach). Such machine learning approaches range from k-nearest neighbors [15] to ridge regression [22], random forests [23], collaborative filtering [19,14], and learning-to-rank approaches [16].
As expected, the efficiency of the machine learning approaches critically depends on the quality of the training data: the representativeness of the problem instances used to train the performance model and, even more importantly, the description of the problem instances. Considerable care has been devoted to the definition of descriptive features in the SAT and constraint domains [21].

Schedulers
Besides AS, an algorithm portfolio can also take advantage of parallel computer architectures, by launching several algorithms working independently or in cooperation on the considered problem instance (see e.g. [24], [10]). Schedulers embed such parallel solving strategies in a sequential computer architecture, by considering a sequence of $\kappa$ (algorithm $a_i$, time-out $\tau_i$) pairs, where the problem instance is successively tackled by algorithm $a_i$ with a computational budget $\tau_i$, until being solved. Notably, the famed restart strategy, launching the same algorithm with different random seeds or different initial conditions, can be viewed as a particular case of scheduling strategy [6]. Likewise, AS can be viewed as a particular case of scheduler with $\kappa = 1$ and $\tau_1$ set to the overall computational budget.
As shown by [23], schedulers and AS can be combined in a multi-stage process, where a scheduler solves easy instances in a first stage, and the remaining instances are handled by the AS and tackled by the selected algorithm in the next stage. The authors of [10] build per-instance schedules in which the AS is incorporated as one of the component algorithms.

Overview of ASAP
This section first discusses the rationale for the ASAP approach, before detailing the pre-scheduler and AS modules in ASAP.V1. Extensions thereof, forming ASAP.V2, are presented thereafter.

Analysis
A benchmark suite most generally involves both easy and hard problem instances. The difficulty is that the hardness of a problem instance depends on the considered algorithm. As shown in Fig. 1 for the SAT11-HAND dataset (section 4), while each of several algorithms might solve 20% of the problem instances within seconds, the oracle (selecting the best one out of these algorithms for each problem instance) solves about 40% of the problem instances within seconds.
Accordingly, one might want to launch each of these algorithms for a few seconds on each problem instance: after this stage, referred to as the pre-scheduler stage, circa 40% of all problem instances would be solved.
Definition 1 (Pre-scheduler). Let $A$ be a set of algorithms. A $\kappa$-component pre-scheduler, defined as a sequence of $\kappa$ (algorithm $a_i$, time-out $\tau_i$) pairs, sequentially launches algorithm $a_j$ on a problem instance $x$ until either $a_j$ solves $x$, or time $\tau_j$ is reached, or $a_j$ stops without solving $x$. If $x$ has been solved, the execution stops; otherwise, $j$ is incremented while $j \le \kappa$.
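Definition 1 translates directly into a control loop. Below is a minimal sketch, where `solve(a, x, timeout)` is a hypothetical runner returning whether algorithm `a` solved instance `x` within the budget.

```python
def run_prescheduler(schedule, solve, x):
    """Execute a kappa-component pre-scheduler (Definition 1): try each
    (algorithm a_j, time-out tau_j) pair in sequence, stopping as soon
    as the instance is solved; a_j may also stop early without solving x."""
    for a_j, tau_j in schedule:
        if solve(a_j, x, tau_j):
            return True      # x solved: execution stops
    return False             # x is left to the algorithm selector
```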
Note that a pre-scheduler helps reduce the impact of AS failures. An AS failure manifests as a problem instance $x$ for which the AS selects an inappropriate algorithm (requiring considerable computational resources to solve $x$, or even failing to solve it), although there exists another algorithm which could have solved $x$ in no time. Since a pre-scheduler increases the chance for each problem instance to be solved in no time, everything else being equal, it mitigates both the likelihood and the impact of AS failures.
Following this discussion, the ASAP system involves two modules, a pre-scheduler and an AS. The pre-scheduler is meant to solve as many problem instances as possible in a first stage, and the AS takes care of the remaining problem instances. A primary decision concerns the division of labor between the two modules: how to split the available runtime between them, and how many algorithms are involved in the pre-scheduler (parameter $\kappa$). Clearly, the number of problem instances solved by a module increases with its computational budget, everything else being equal; the pre-scheduler and the AS modules are thus interdependent. For simplicity and tractability however, the maximal runtime allocated to the pre-scheduler is fixed to $T^{max}_{ps}$ (10% of the overall computational budget in the experiments, section 5), and the number $\kappa$ of algorithms in the pre-scheduler is set to 3. [10] and [13] use a closely related setup for their fixed-split selection schedules, except that they do not constrain the AS component to occupy the last part of the schedule.
Given $T^{max}_{ps}$ and $\kappa$, ASAP tackles the optimization of the pre-scheduler and the AS modules. Both optimization problems remain inter-dependent: the AS should mostly focus on the problem instances which are not solved by the pre-scheduler, while the pre-scheduler should symmetrically focus on the problem instances which are most uncertain or badly handled by the AS. Formally, this interdependence is handled as follows (see the sketch after this list):
- A performance model $G(x, a)$ is built for each algorithm over all training problem instances, defining AS_init (Eq. 1);
- A pre-scheduler is built to optimize the joint performance (pre-scheduler, AS_init) over all training problem instances;
- Another performance model $G_2(x, a)$ is built over all training problem instances, using an additional boolean feature that indicates for each problem instance whether it was solved by the above pre-scheduler; let AS_post denote the AS based on performance model $G_2(x, a)$.
ASAP is finally composed of the pre-scheduler followed by AS_post.
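The three steps above can be sketched as follows; all helper names are hypothetical, and `fit_model` stands for any of the regressors of section 3.3.

```python
import numpy as np

def train_asap(X, Y, fit_model, optimize_prescheduler, solved_by):
    """Joint training pipeline sketch.  X: (n, d) instance features;
    Y: (n, |A|) performance matrix on the training instances."""
    G = fit_model(X, Y)                         # step 1: AS_init
    schedule = optimize_prescheduler(Y, G)      # step 2: joint optimization
    solved = solved_by(schedule, Y)             # (n,) booleans
    # step 3: append the "solved by pre-scheduler" feature, retrain -> AS_post
    X2 = np.hstack([X, solved[:, None].astype(float)])
    G2 = fit_model(X2, Y)
    return schedule, G2
```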

ASAP.V1 pre-scheduler
Let $(a_i, \tau_i)_{i=1}^{\kappa}$ denote a pre-scheduler with overall computational budget $T_{ps} = \sum_{i=1}^{\kappa} \tau_i$, and let $F((a_i, \tau_i)_{i=1}^{\kappa})$ denote the associated domain-dependent performance. ASAP.V1 considers for simplicity equal time-outs ($\tau_i = T_{ps}/\kappa$, $i = 1 \ldots \kappa$). The pre-scheduler is thus obtained by solving the following optimization problem:

$$\max_{T_{ps} \le T^{max}_{ps},\; a_1, \ldots, a_\kappa} F\big((a_i, \tau_i)_{i=1}^{\kappa}\big)$$

This mixed optimization problem is tackled in a hierarchical way, determining for each value of $T_{ps}$ the optimal $\kappa$-tuple of algorithms $a_1 \ldots a_\kappa$. Thanks to both the small $\kappa$ value ($\kappa = 3$ in the experiments) and the small number of algorithms ($\le 31$ in the ICON challenge, section 4), the optimal $\kappa$-tuple is determined by exhaustive search conditionally to the $T_{ps}$ value.
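With $\kappa = 3$ and at most 31 algorithms, the conditional search enumerates at most $\binom{31}{3} = 4495$ tuples, which the sketch below makes explicit. The runtime-matrix layout is an assumption; note that with equal time-outs, the order of the $\kappa$ algorithms does not affect the number of instances solved, so unordered combinations suffice.

```python
from itertools import combinations

def best_kappa_tuple(runtimes, T_ps, kappa=3):
    """Exhaustive search of the kappa-tuple maximizing the number of
    training instances solved with equal time-outs tau = T_ps / kappa.
    runtimes: list of per-instance dicts {algorithm: runtime}."""
    tau = T_ps / kappa
    algorithms = list(runtimes[0])
    best_tuple, best_solved = None, -1
    for tup in combinations(algorithms, kappa):
        solved = sum(any(row[a] <= tau for a in tup) for row in runtimes)
        if solved > best_solved:
            best_tuple, best_solved = tup, solved
    return best_tuple, best_solved
```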
The ASAP.V1 pre-scheduler finally relies on the one-dimensional optimization of the overall computational budget $T_{ps}$ allocated to the pre-scheduler. In all generality, the optimization of $T_{ps}$ is a multi-objective optimization problem, e.g., balancing the overall number of problems solved and the overall computational budget. Multi-objective optimization commonly proceeds by determining the so-called Pareto front, made of non-dominated solutions. In our case, the Pareto front depicts how the performance varies with the overall computational budget, as illustrated in Fig. 2, where the performance is set to the number of solved instances.
In multi-objective decision making [3,2], the choice of a solution on the Pareto front is tackled using post-optimal techniques [4], including: i) compromise programming, where one seeks the point closest to an ideal target in the objective space; ii) aggregation of the objectives into a single one, e.g., using a linear combination; or iii) the marginal rate of return. The last heuristic consists of identifying the so-called "knees", that is, the points where any small improvement on a given criterion is obtained at the expense of a large decrease on another criterion, defining the so-called marginal rate of return. The vanilla marginal rate of return is however sensitive to strong local discontinuities; for instance, it would select point A in Fig. 2. Therefore, a variant taking into account the global shape of the curve, measuring the marginal rate of improvement w.r.t. the extreme solutions on the Pareto front, is used (e.g., selecting point K instead of point A in Fig. 2).
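For illustration, a standard knee-detection heuristic in the same spirit can be sketched as follows: after normalizing both objectives, pick the Pareto point farthest from the chord joining the two extreme solutions. The exact angle criterion $\gamma$ used in ASAP may differ; this sketch is an assumption in that respect.

```python
import numpy as np

def knee_point(points):
    """Pick a knee on a Pareto front given as an (m, 2) array of
    (budget, #solved) points sorted along the front: the point farthest
    from the chord joining the two extreme solutions, after rescaling
    both objectives to [0, 1] so distances are scale-free."""
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    span = np.where(hi - lo == 0.0, 1.0, hi - lo)
    norm = (pts - lo) / span
    a, b = norm[0], norm[-1]                  # extreme solutions of the front
    chord = (b - a) / (np.linalg.norm(b - a) + 1e-12)
    rel = norm - a
    # perpendicular distance to the chord = |2-D cross product| with unit chord
    dist = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0])
    return pts[int(np.argmax(dist))]
```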

ASAP.V1 algorithm selector
As detailed in section 3.1, the AS relies on the performance model learned from the training problem instances. Two learning algorithms are considered in this paper: random forests and k-nearest neighbors. One hyper-parameter was adapted for each ML approach (all other hyper-parameters being set to their default values, using the Python scikit-learn library [17]), based on a few preliminary experiments: 35 trees are used for the random forest, and the number of neighbors is set to $k = 3$ for the k-nearest neighbors. In the latter case, the predicted value associated with problem instance $x$ is set to the weighted sum of the performances of its nearest neighbors, weighted by their relative distance to $x$:

$$\hat{G}(x, a) = \sum_{i=1}^{3} \frac{1/d(x, x_i)}{\sum_{j=1}^{3} 1/d(x, x_j)}\; G(x_i, a)$$

where $x_i$ ranges over the 3 nearest neighbors of $x$. Features were normalized (zero mean, unit variance) before selecting the neighbors.
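With scikit-learn, the two regressors described above can be instantiated as follows. This is a sketch: that ASAP used exactly scikit-learn's inverse-distance weighting is an assumption based on the description.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random forest with the single tuned hyper-parameter (35 trees).
rf = RandomForestRegressor(n_estimators=35)

# 3-NN with distance weighting, on standardized features
# (zero mean, unit variance, as stated above).
knn = make_pipeline(StandardScaler(),
                    KNeighborsRegressor(n_neighbors=3, weights="distance"))

# Usual scikit-learn API: model.fit(X_train, y_train); model.predict(X_test)
```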
A main difficulty comes from the descriptive features forming the representation of problem instances. Typically, the feature values are missing for some groups of features on quite a few problem instances, due to diverse causes (computation exceeded the time limit, exceeded memory, presolved the instance, crashed, other, unknown). Missing feature values are handled by i) replacing the missing value with the feature's average value; ii) adding to the set of descriptive features 7 additional boolean features per group of initial features, indicating whether the feature group values are available or, otherwise, the reason why they are missing.
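A minimal sketch of this imputation strategy follows, assuming a per-group status label per instance; the exact encoding in ASAP is not specified, but the seven indicators cover availability plus the six failure causes listed above.

```python
import numpy as np

STATUSES = ["available", "timeout", "memout", "presolved", "crashed",
            "other", "unknown"]

def impute_feature_group(values, status):
    """Mean-impute one feature group and append 7 boolean indicators.
    values: (n, g) array with NaN for missing entries;
    status: length-n list of per-instance computation statuses."""
    col_mean = np.nanmean(values, axis=0)
    filled = np.where(np.isnan(values), col_mean, values)
    flags = np.array([[s == st for st in STATUSES] for s in status],
                     dtype=float)
    return np.hstack([filled, flags])
```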

ASAP.V2
Several extensions of ASAP.V1 have been considered after the close of the ICON challenge, aimed at exploring a richer pre-scheduler–AS search space while limiting the risk of overfitting induced by a larger search space.
We investigated the use of different time-outs for each algorithm in the pre-scheduler, while keeping the set of algorithms $(a_1, \ldots, a_\kappa)$ and the overall computational budget $T_{ps}$. The sequential optimization strategy (section 3.2), deterministically selecting $T_{ps}$ as the solution with maximal average return rate and exhaustively determining the $\kappa$-tuple of algorithms conditionally to $T_{ps}$, is thus extended to optimize the $(\tau_1, \ldots, \tau_{\kappa-1})$ vector subject to $\sum_{i=1}^{\kappa-1} \tau_i \le T_{ps}$, using a prominent continuous black-box optimizer, the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [7].
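Using the `cma` package, this extension can be sketched as follows: only the first $\kappa - 1$ time-outs are free variables, the last one taking the remaining budget. The penalty-based constraint handling is our assumption, not necessarily ASAP's exact scheme.

```python
import cma
import numpy as np

def optimize_timeouts(neg_fitness, T_ps, kappa=3):
    """Optimize (tau_1, ..., tau_{kappa-1}) with CMA-ES; neg_fitness maps
    a full (tau_1, ..., tau_kappa) vector to a cost to minimize (e.g. -F)."""
    def wrapped(t):
        t = np.asarray(t)
        if (t < 0).any() or t.sum() > T_ps:
            return 1e9                       # penalize infeasible schedules
        return neg_fitness(np.append(t, T_ps - t.sum()))

    x0 = np.full(kappa - 1, T_ps / kappa)    # start from the uniform split
    xbest, _ = cma.fmin2(wrapped, x0, sigma0=T_ps / (4 * kappa),
                         options={"verbose": -9})
    t = np.clip(xbest, 0.0, T_ps)
    return np.append(t, max(T_ps - t.sum(), 0.0))
```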
This extended search space is first investigated by considering the raw optimization criterion $F_{raw}((a_i, \tau_i)_{i=1}^{\kappa})$ defined in section 3.2, that is, the cumulative performance of ASAP over all training problem instances. However, a richer search space entails some risk of overfitting, where a higher performance on the data used to optimize ASAP (training data) is obtained at the expense of a lower performance on test data. Generally speaking, the datasets used to train an AS are small.
A penalized optimization criterion is thus considered:

$$F_{pen}\big((a_i, \tau_i)_{i=1}^{\kappa}\big) = F_{raw}\big((a_i, \tau_i)_{i=1}^{\kappa}\big) - w \left\| \left(\tau_i - \frac{T_{ps}}{\kappa}\right)_{i=1}^{\kappa} \right\|_2$$

which penalizes the $L_2$ distance between the $(\tau_i)$ vector and the uniform time-outs ($\tau_i = T_{ps}/\kappa$). The rationale for this penalization is to prevent brittle improvements on the training set due to opportunistic adjustments of the $\tau_i$, at the expense of stable performance on further instances. The penalization weight $w$ is adjusted using a nested cross-validation process.
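In code, the penalized criterion is a one-liner on top of the raw one (a sketch consistent with the formula above; `raw_fitness` is a hypothetical hook):

```python
import numpy as np

def penalized_fitness(raw_fitness, taus, T_ps, w):
    """F_pen = F_raw - w * ||(tau_i) - uniform split||_2, discouraging
    opportunistic deviations from tau_i = T_ps / kappa."""
    taus = np.asarray(taus, dtype=float)
    uniform = np.full_like(taus, T_ps / len(taus))
    return raw_fitness(taus) - w * np.linalg.norm(taus - uniform)
```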
A randomized optimization criterion is also considered. By construction, the fitness function aggregates the performances over all training problem instances. As the training problem instances sample the problem domain, this fitness defines a noisy optimization problem. Sophisticated approaches have been proposed to address this noisy optimization issue in non-convex, optimization-based machine learning settings (see e.g. [8]). Another approach is proposed here, based on the bootstrap principle: in each CMA-ES generation, the set of n problem instances used to compute the performance is uniformly drawn with replacement from the n-sized training set. In this manner, each optimization generation considers a slightly different optimization objective, noted $F_{rand}$, thereby discarding insignificant improvements.
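The bootstrap criterion can be sketched as a fitness object whose training sample is redrawn once per CMA-ES generation, so that all candidates of a generation share one bootstrap sample (that per-generation hook is our reading of "in each CMA-ES generation"; `perf_on` is a hypothetical evaluation hook).

```python
import numpy as np

class BootstrapFitness:
    """Randomized criterion F_rand: resample the n training instances
    with replacement before each generation, washing out insignificant,
    sample-specific improvements."""
    def __init__(self, n, perf_on, seed=0):
        self.n, self.perf_on = n, perf_on
        self.rng = np.random.default_rng(seed)
        self.resample()

    def resample(self):                      # call once per generation
        self.idx = self.rng.integers(0, self.n, size=self.n)

    def __call__(self, taus):
        return self.perf_on(taus, self.idx)
```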
Finally, a probabilistic optimization criterion is considered, handling the ASAP performance on a single problem instance as a random variable with a triangular distribution (Fig. 3) centered on the actual performance $p(x)$, with support $[p(x) - \theta, p(x) + \theta]$, and taking the expectation thereof. The merit of this triangular probability distribution function is to allow for an analytical computation of the overall fitness expectation, noted $F_{pdf}$.
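The analytical convenience comes from the closed-form CDF of the symmetric triangular distribution: the probability that an instance with observed performance $p$ is solved within a time-out $t$ is simply the CDF value at $t$. The sketch below illustrates the underlying computation, not ASAP's exact code.

```python
def triangular_cdf(t, p, theta):
    """CDF at time t of a symmetric triangular distribution centered on
    the observed performance p, with support [p - theta, p + theta]:
    the probability of solving the instance within a time-out t."""
    lo, hi = p - theta, p + theta
    if t <= lo:
        return 0.0
    if t >= hi:
        return 1.0
    if t <= p:
        return (t - lo) ** 2 / (2.0 * theta ** 2)
    return 1.0 - (hi - t) ** 2 / (2.0 * theta ** 2)
```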

Experimental setting: The ICON challenge

ASlib data format
Due to the difficulty of comparing the many algorithm selection systems and the high cost of entry to the AS field, a joint effort was undertaken to build the Algorithm Selection Library (ASlib), providing comprehensive resources to facilitate the design, sharing and comparison of AS systems [1]. ASlib (version 1.0.1) involves 13 datasets, also called scenarios (Table 1), gathered from recent challenges and surveys in the operations research, artificial intelligence and optimization fields. The interested reader is referred to [1] for a more comprehensive presentation. Each dataset includes: i) the performance and computation status of each algorithm on each problem instance; ii) the description of each problem instance, as a vector of expert-designed feature values (as said, this description considerably facilitates the comparison of AS systems); iii) the computation status of each such feature (e.g., indicating whether the feature could be computed, or whether it failed due to insufficient computational or memory resources). Last but not least, each dataset is equi-partitioned into 10 subsets, to enforce the reproducibility of the 10-fold CV assessment of every AS algorithm.

The ICON Challenge on Algorithm Selection
The ICON Challenge on Algorithm Selection, within the ASlib framework, was carried out between February and July 2015 to evaluate AS systems in a fair, comprehensive and reproducible manner. Each submitted system was assessed on the 13 ASlib datasets [1] with respect to three measures: i) the number of problem instances solved; ii) the extra runtime compared with the virtual best solver (VBS, also called oracle); and iii) the Penalized Average Runtime-10 (PAR10), the cumulative runtime needed to solve all problem instances (set to ten times the overall computational budget whenever a problem instance is unsolved).
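For reference, the PAR10 score on one dataset can be computed as follows (a straightforward sketch of the stated definition):

```python
def par10(runtimes, solved, cutoff):
    """Cumulative PAR10 score: actual runtime for solved instances,
    ten times the computational budget (cutoff) for unsolved ones."""
    return sum(t if ok else 10.0 * cutoff
               for t, ok in zip(runtimes, solved))
```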
As the full datasets were available to the community from the start, the evaluation was based on hidden splits between training and test sets. Each submitted system provides a dataset-dependent, instance-dependent schedule of algorithms, optionally preceded by a dataset-dependent presolver (a single algorithm running on all instances for a given runtime before the per-instance schedule runs). Each system can also, in a dataset-dependent manner, specify the groups of features to be used (in order to save the time needed to compute useless features).
Two baselines are considered: the oracle, selecting the best algorithm for each problem instance; and the single best (SB) algorithm, with the best average performance over all problem instances in the dataset. The baselines are used to normalize every system's performance over all datasets, associating performance 0 with the oracle and performance 1 with the single best, thus supporting the aggregation of system results over all datasets.
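The normalization thus amounts to a linear rescaling between the two baselines (a sketch; scores below 0 or above 1 are possible for systems outside the oracle–single-best range):

```python
def normalized_score(perf, oracle, single_best):
    """Map a system's raw performance so that the oracle scores 0 and
    the single best solver scores 1 (lower is better after rescaling)."""
    return (perf - oracle) / (single_best - oracle)
```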

Comparative results
Table 2 reports the results of all submitted systems on all datasets (the statistical significance tests are reported in Fig. 5). The general trend is that the zilla algorithms dominate all other systems on the SAT datasets, as expected, since they have consistently dominated the SAT contests over the last decade. On non-SAT problems, however, the zilla algorithms are dominated by ASAP RF.V1.
The robustness of the ASAP approach is demonstrated by the fact that it never ranks last; it does however perform slightly worse than the single best on some datasets. The rescaled performance of ASAP RF.V1 is compared to that of zilla and autofolio (Fig. 4, left), demonstrating that ASAP RF.V1 offers a balanced performance: significantly lower than zilla and autofolio on the SAT problems, but significantly higher on the other datasets; in this respect it can be viewed as a low-risk system.

Sensitivity analysis
The sensitivity analysis conducted after the close of the challenge compares ASAP.V2 (with different time-outs in the pre-scheduler) with ASAP.V1, and examines the impact of the different optimization criteria aimed at avoiding overfitting: the raw fitness, the L2-penalized fitness, the randomized fitness and the probabilistic fitness (section 3.4).
The impact of the hyper-parameter used in the AS (number of trees set to 35, 100, 200, 300 and 500 in the random forest) is also investigated.
Table 3 summarizes the experimental results that each ASAP.V2 configuration would have obtained in the ICON challenge framework, together with the actual submission results, including systems that were not competing in the challenge: llama-regr and llama-regrPairs from the organizers, and autofolio-48, which is identical to autofolio but with 48h of training time (12h was the limit authorized in the challenge) [11].
The significance analysis, using a Wilcoxon signed-rank test, is reported in Fig. 5. A first result is that all ASAP.V2 variants improve on ASAP.V1 at the 1% significance level. A second result is that ASAP.V2 with the probabilistic optimization criterion is not statistically significantly different from zilla, autofolio and zillafolio.
A third and most surprising result is that the difference between the challenge-winner zilla and most of ASAP.V2 variants is not statistically significant.
Conclusion

This paper presented ASAP, a new hybrid algorithm selection approach combining a pre-scheduler and a per-instance algorithm selector. ASAP.V1 introduced a selector learned conditionally to a predetermined schedule, so that it focuses on instances not solved by the pre-scheduler. ASAP.V2 completes the loop, re-adapting the schedule to the new AS. The main message is that the scheduler and the AS must be optimized jointly to reflect the division of labor achieved by these two components. ASAP.V1, thoroughly evaluated in the ICON challenge on algorithm selection (ranked 4th), received an honourable mention due to its novelty and its good performance compared to the famed and long-standing Zilla algorithms.
The ASAP.V2 extension achieved significantly better results in the same challenge setting. It must be emphasized that these results should be confirmed by additional experiments on fresh data. A main lesson learned is the importance of regularization, as the amount of available data does not permit considering richer AS search spaces without incurring a high risk of overfitting. The probabilistic performance criterion successfully contributed to a more stable optimization problem. Further research will be devoted to extending this criterion.

Fig. 1. Percentage of solved instances vs. runtime on the SAT11-HAND dataset, for 5 algorithms and the oracle (selecting the best algorithm out of the 5 for each problem instance).

Fig. 2. Among a set of Pareto-optimal solutions, solution A has the best marginal rate of return; solution K, which maximizes the average rate of return w.r.t. the extreme solutions of the Pareto front (maximizing angle γ), is the knee selected in ASAP.

Fig. 3. Schedule execution difference between punctual and triangular pdf. On the left (punctual pdf), the schedule stops during step 1. On the right (triangular pdf), part of the instance's probability mass is not solved during step 1, and the schedule executes until step 2 solves the rest.

Fig. 4. On the left: per-dataset performances of ASAP RF.V1 (balls, dotted line), zilla (no marker, dashed line) and autofolio (triangles, solid line), scaled to the range of performance of all submitted systems. As a comparison, the per-dataset best submitted system (small balls, solid line) and the ASAP RF scores before rescaling are depicted on the right.

Table 2. Normalized performances of submitted systems, aggregated across all folds and all measures (the lower, the better). The ranks of zilla (challenge winner) and ASAP RF.V1 (honourable mention) are given in parentheses. Numbers were computed from the challenge outputs.