STIPI: Using Search to Prioritize Test Cases Based on Multi-objectives Derived from Industrial Practice

The importance of cost-effectively prioritizing test cases is undeniable in automated testing practice in industry. This paper focuses on prioritizing test cases developed to test product lines of Video Conferencing Systems (VCSs) at Cisco Systems, Norway. Each test case requires setting up configurations of a set of VCSs, invoking a set of test APIs with specific inputs, and checking statuses of the VCSs under test. Based on these characteristics and the available information related to test case execution (e.g., the number of faults detected), we identified that the test case prioritization problem in our particular context should focus on achieving high coverage of configurations, test APIs and statuses, and high fault detection capability, as quickly as possible. To solve this problem, we propose a search-based test case prioritization approach (named STIPI) by defining a fitness function with four objectives and integrating it with a widely applied multi-objective optimization algorithm (Non-dominated Sorting Genetic Algorithm II). We compared STIPI with random search (RS), a Greedy algorithm, and three approaches adapted from the literature, using three real sets of test cases from Cisco with four time budgets (25%, 50%, 75% and 100%). Results show that STIPI significantly outperformed the selected approaches and managed to achieve better performance than RS by on average 39.9%, 18.6%, 32.7% and 43.9% for the coverage of configurations, test APIs, statuses and fault detection capability, respectively.


Introduction
Testing is a critical activity for system and software development, through which system/software quality is ensured [1]. To improve testing efficiency, a large number of researchers have been focusing on prioritizing test cases into an optimal execution order to achieve maximum effectiveness (e.g., fault detection capability) as quickly as possible [2][3][4]. In the industrial practice of automated testing, test case prioritization is even more critical because there is usually a limited budget (e.g., time) for executing test cases, and thus executing all available test cases in a given context is infeasible [1,5].
Our industrial partner for this work is Cisco Systems, Norway, who develops product lines of Video Conferencing Systems (VCSs), which enable high quality conference meetings [4,5]. To ensure the delivery of high quality VCSs to the market, test engineers at Cisco continually develop test cases to test the software of VCSs under various hardware and software configurations and statuses (i.e., states), using dedicated test APIs. A test case is typically composed of the following parts: 1) setting up the test configurations of a set of VCSs under test; 2) invoking a set of test APIs of the VCSs; and 3) checking the statuses of the VCSs after invoking the test APIs to determine the success or failure of an execution of the test case. When executing test cases, several objectives need to be achieved, i.e., covering the maximum number of possible configurations, test APIs and statuses, and detecting as many faults as possible. However, given a number of available test cases, it is often infeasible to execute all of them in practice due to a limited budget of execution time (e.g., ten hours). It is therefore important to seek an approach for prioritizing the given test cases to cover the maximum number of configurations, test APIs and statuses, and to detect faults, as quickly as possible.
To address the above-mentioned challenge, we propose a search-based test case prioritization approach named Search-based Test case prioritization based on Incremental unique coverage and Position Impact (STIPI). STIPI defines a fitness function with four objectives to evaluate the quality of test case prioritization solutions, i.e., Configuration Coverage (CC), test API Coverage (APIC), Status Coverage (SC) and Fault Detection Capability (FDC), and integrates the fitness function with a widely applied multi-objective search algorithm (Non-dominated Sorting Genetic Algorithm II) [6]. Moreover, we propose two prioritization strategies when defining the fitness function in STIPI: 1) Incremental Unique Coverage, i.e., for a specific test case, we only consider the incremental unique elements (e.g., test APIs) covered by the test case as compared with the elements covered by the already prioritized test cases; and 2) Position Impact, i.e., a test case with a higher execution position (i.e., scheduled to be executed earlier) has more impact on the quality of a prioritization solution. Notice that both strategies are defined to help the search achieve high criteria values (i.e., CC, APIC, SC and FDC) as quickly as possible.
To evaluate STIPI, we chose five approaches for comparison: 1) Random Search (RS), to assess the complexity of the problem; 2) a Greedy approach; and 3) one existing approach [7] and two approaches modified from the existing literature [8,9]. The evaluation uses in total 211 test cases from Cisco, which are divided into three sets of varying complexity. Moreover, four different time budgets are used in our evaluation, i.e., 25%, 50%, 75% and 100% (100% refers to the total execution time of all the test cases in a given set). Notice that 12 comparisons were performed (i.e., three sets of test cases × four time budgets) for comparing STIPI with each approach, and thus in total 60 comparisons were conducted for the five approaches. Results show that STIPI significantly outperformed the selected approaches for 54 out of 60 comparisons (90%). In addition, STIPI managed to achieve higher performance than RS by on average 39.9% (configuration coverage), 18.6% (test API coverage), 32.7% (status coverage) and 43.9% (fault detection capability).
The remainder of the paper is organized as follows: Section 2 presents the context, a running example and the motivation. STIPI is presented in Section 3, followed by the experiment design (Section 4). Section 5 presents the experiment results and an overall discussion. Related work is discussed in Section 6, and we conclude the work in Section 7.

Context, Running Example and Motivation
Figure 1 presents a simplified context of testing VCSs (Systems Under Test (SUTs)), and Figure 2 illustrates (partial) configuration, test API and status information for testing a VCS. First, one VCS consists of one or more configuration variables (e.g., attribute protocol of class VCS in Figure 2), each of which can take two or more configuration variable values (e.g., literal SIP of enumeration Protocol). Second, a VCS holds one or more status variables defining the statuses of the VCS (e.g., NumberOfActiveCalls), and each status variable can have two or more status variable values (e.g., NumberOfActiveCalls taking values of 0, 1, 2, 3 and 4). Third, testing a VCS requires employing one or more test API commands (e.g., dial), each of which includes zero or more test API parameters (e.g., callType for dial). Each test API parameter can take two or more test API parameter values (e.g., Video and Audio for callType). A test case exercises these elements in three steps. First, it sets up the configurations; for example, the test case in Figure 3 configures the configuration variable protocol with SIP (Line 1). Second, a test API command is invoked with appropriate values assigned to its input parameters, if any; for example, the test case in Figure 3 invokes the test API command dial with the two test API parameter values Video for callType and SIP for protocol (Line 2). Third, the test case checks the actual statuses of the VCSs; for example, the test case in Figure 3 checks the status of the VCS to see if NumberOfActiveCalls equals 1 (Line 4).

Figure 3. An Excerpt of a Sanitized and Simplified Test Case
In the context of testing VCSs, test case prioritization is a critical task, since it is practically infeasible to execute all the available test cases within a given time budget (e.g., five hours). Therefore, it is essential to cover the maximum number of configurations (i.e., configuration variables and their values), test APIs (i.e., test API commands, parameters and their values) and statuses (i.e., status variables and their values), and to detect faults, as quickly as possible. For instance, Table 1 lists five test cases (T1 … T5) with information about their configurations, test APIs and statuses. The test case in Figure 3 is represented as T1 in Table 1, which 1) sets the configuration variable protocol to SIP; 2) uses three test API commands: dial with two parameters (callType, protocol), accept and disconnect; and 3) checks the values of three status variables (e.g., MaxVideoCalls).
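To make the three-part test case structure concrete, the following minimal, runnable sketch mimics it in Python. All names here (VCS, configure, dial) are hypothetical stand-ins for illustration, not Cisco's real test infrastructure:

```python
# Hedged sketch of the three-part test case structure; VCS, configure and
# dial are hypothetical stand-ins, not Cisco's real test APIs.
class VCS:
    """Minimal stand-in for a Video Conferencing System under test."""
    def __init__(self):
        self.config = {}
        self.status = {"NumberOfActiveCalls": 0}

def configure(vcs, **settings):
    # 1) Set up the test configuration of the VCS under test.
    vcs.config.update(settings)

def dial(vcs, callType, protocol):
    # 2) Invoke a test API command with specific parameter values.
    vcs.status["NumberOfActiveCalls"] += 1

def run_test_case():
    vcs = VCS()
    configure(vcs, protocol="SIP")               # cf. Line 1 of Figure 3
    dial(vcs, callType="Video", protocol="SIP")  # cf. Line 2
    # 3) Check the actual status to decide success or failure.
    return vcs.status["NumberOfActiveCalls"] == 1

print(run_test_case())  # True
```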
Notice that the five test cases in Table 1 can be executed in 325 possible orders (i.e., C(5,1) × 1! + C(5,2) × 2! + … + C(5,5) × 5! = 325). When there is a time budget, each particular order can be considered as a prioritization solution. Given two prioritization solutions S1 = {T5, T1, T4, T2, T3} and S2 = {T1, T3, T5, T2, T4}, one can observe that S1 is better than S2, since the first three test cases in S1 cover all the configuration variables and their values, test API commands, test API parameters, test API parameter values, status variables and status variable values, while S2 needs to execute all five test cases to achieve the same coverage as S1. Therefore, it is important to seek an efficient approach to find an optimal order for executing a given number of test cases to achieve high coverage of configurations, test APIs and statuses, and to detect faults, as quickly as possible, which forms the motivation of this work.
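The count of possible orders follows directly from the formula above: for each subset size k, there are C(n, k) ways to choose the test cases and k! ways to order them. A short check in Python:

```python
from math import comb, factorial

def num_orders(n):
    """Number of possible execution orders of n test cases when any
    non-empty subset of them may be scheduled: sum over k of C(n, k) * k!."""
    return sum(comb(n, k) * factorial(k) for k in range(1, n + 1))

print(num_orders(5))  # 325, as for the five test cases in Table 1
```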

STIPI: Search-based Test case prioritization based on Incremental unique coverage and Position Impact
This section presents the problem representation (Section 3.1), the four defined objectives and the fitness function (Section 3.2), and the solution encoding (Section 3.3).

Basic Notations and Problem Representation
Basic Notations. We provide below the basic notations used throughout the paper. T = {tc1 … tcn} defines the set of n test cases to be prioritized (n = |T|). EM = {em1 … emk} defines a set of effectiveness measures. S = {s1, s2 … sm} represents the set of potential solutions, such that m = C(n,1) × 1! + C(n,2) × 2! + … + C(n,n) × n!. Each solution sj consists of a set of prioritized test cases in T: sj = {tc1 … tcln}, where tci ∈ T refers to the test case with execution position i in the prioritized solution sj. Note that it is possible for the number of test cases in sj (i.e., ln) to be less than the total number of test cases in T, since only a subset of T is prioritized under a limited budget (e.g., time).
Problem Representation. We aim to prioritize the test cases in T in two contexts: 1) a 100% time budget and 2) less than a 100% time budget (i.e., time-aware [1]). Therefore, we formulate the test case prioritization problem as follows: a) search for a solution sj with ln test cases, out of the m solutions in S, that obtains the highest effectiveness; and b) a test case tcp in a particular solution sj with a higher position p has more influence on the effectiveness than a test case tcq with a lower position q, where em(tcp, sj) and em(tcq, sj) refer to the effectiveness measure em for the test cases at positions p and q, respectively, in a particular solution sj. With a 100% time budget, a solution sj is preferred over a solution su if EM(sj) is greater than EM(su), where EM(sj) and EM(su) return the effectiveness measure for solutions sj and su, respectively.

Fitness Function
Recall that we aim at maximizing the overall coverage of configurations, test APIs and statuses, and at detecting faults, as quickly as possible (Section 2). Therefore, we define four objective functions for the fitness function to guide the search towards optimal solutions, as presented in detail below. Each objective weights the contribution of a test case by its execution position, defined in terms of n, the total number of test cases, and p, a specific execution position in a prioritization solution. Thus, test cases with higher execution positions (i.e., executed earlier) have a higher impact on the quality of a prioritization solution, which fits the scope of test case prioritization: achieving high criteria values as quickly as possible. For instance, this strategy is applied when computing CC for s1 in the running example.

Maximize Configuration Coverage (CC), test API Coverage (APIC) and Status Coverage (SC)
A higher value of CC indicates a higher coverage of configurations; APIC and SC are defined in the same manner, with a higher value indicating a higher test API and status coverage, respectively, and therefore a better solution.

Maximize Fault Detection Capability (FDC)
In the context of Cisco, FDC is defined as the number of faults detected by the test cases in a solution sj [4,5,10,11,12], denoted mfdc. fdci denotes the FDC of a test case tci, mfdc represents the sum of the FDC values of all prioritized test cases, and a higher value of mfdc implies a better solution. Notice that we cannot apply the incremental unique coverage strategy when calculating mfdc, since the relations between faults and test cases are not known in our case (i.e., we only know whether a test case detected faults after executing it a certain number of times, rather than having access to the detailed faults detected).
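The two prioritization strategies (incremental unique coverage and position impact) can be sketched for a single coverage objective as follows. This is a minimal illustration under the assumption of a linear position weight (n − p) / n; the paper's exact position-impact function is not reproduced here, and covered_elements is a hypothetical helper:

```python
def coverage_objective(solution, covered_elements):
    """Illustrative combination of the two STIPI strategies for one
    coverage objective (e.g., APIC): only the incrementally unique
    elements covered by each test case count, and earlier execution
    positions carry more weight. The linear weight (n - p) / n is an
    assumption for illustration, not necessarily the paper's exact
    position-impact function."""
    n = len(solution)
    seen = set()
    score = 0.0
    for p, tc in enumerate(solution):      # p = 0 is the earliest position
        new = covered_elements(tc) - seen  # incremental unique coverage
        seen |= new
        score += len(new) * (n - p) / n    # earlier position => larger weight
    return score
```

Under this sketch, an order that covers all elements within its first few test cases scores higher than one that defers coverage to the end, matching the intuition behind S1 versus S2 in Section 2.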

Solution Representation
The test cases in T are encoded as an array V = {v1, v2 … vn}, where each variable vi represents one test case in T and holds a unique value between 0 and 1. We prioritize the test cases in T by sorting the variables in V in descending order, such that 1 is the highest order and 0 is the lowest. Initially, each variable in V is assigned a random value between 0 and 1, and during the search our approach returns solutions with optimal values of V, guided by the fitness function defined in Section 3.2. For time-aware test case prioritization (i.e., with a time budget of less than 100%), we pick the maximum number of test cases that fit the given time budget. For example, in Table 1, for T = {T1 … T5} with V = {0.6, 0.2, 0.4, 0.9, 0.3} and execution times (in minutes) of E = {4, 5, 6, 4, 3}, the prioritized test cases are {T4, T1, T3, T5, T2} based on our encoding. If we have a time budget of 11 minutes, the first two test cases (8 minutes of execution in total) are first added to the prioritized solution sj, leaving 3 minutes, which is not sufficient for executing T3 (6 minutes). Thus, T3 is not added to sj, and the next test case is evaluated to see if the total execution time fits the given time budget. T5, with three minutes, is added to sj, since its inclusion does not make the total execution time exceed the time budget. Therefore, the final prioritized solution is {T4, T1, T5}. Moreover, we integrate our fitness function with a widely applied multi-objective search algorithm named Non-dominated Sorting Genetic Algorithm II (NSGA-II) [6,13,14]. The tournament selection operator [6] selects the individual solutions with the best fitness for inclusion in the next generation. The crossover operator produces offspring solutions from parent solutions by swapping some of their parts (e.g., test cases in our context). The mutation operator randomly changes the values of one or more variables (each representing a test case in our context) based on a pre-defined mutation probability, e.g., 1/(total number of test cases) in our context. We also compare the running time of STIPI with that of the five chosen approaches, since STIPI is invoked very frequently (e.g., more than 50 times per day) in our context, i.e., the test cases often need to be prioritized and executed. Therefore, it would be practically infeasible if applying STIPI took too much time.
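The decoding of the random-key vector into a time-aware prioritized solution can be sketched as below, reproducing the running example from Table 1:

```python
def decode(values, durations, budget):
    """Decode a random-key vector V into a time-aware prioritized
    solution: sort test case indices by key in descending order, then
    greedily keep each test case whose execution time still fits the
    remaining budget. A sketch of the encoding described above."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    solution, remaining = [], budget
    for i in order:
        if durations[i] <= remaining:
            solution.append(i)
            remaining -= durations[i]
    return solution

# Running example: V = {0.6, 0.2, 0.4, 0.9, 0.3}, times = {4, 5, 6, 4, 3},
# 11-minute budget; indices 3, 0, 4 correspond to T4, T1, T5.
print(decode([0.6, 0.2, 0.4, 0.9, 0.3], [4, 5, 6, 4, 3], 11))  # [3, 0, 4]
```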

Experiment tasks
As shown in Table 2 (Experiment Task column), we designed two tasks (T1, T2) to address RQ1 and RQ2. Task T1 is designed to compare STIPI with RS for the four time budgets (i.e., 100%, 75%, 50% and 25%) and the three sets of test cases (i.e., 100, 150 and 211 test cases). Similarly, task T2 is designed to compare STIPI with the other four test case prioritization approaches, and is divided into four sub-tasks for the comparisons with Greedy, A1, A2 and A3, respectively.
Moreover, we employed 211 real test cases from Cisco for the evaluation, dividing them into three sets of varying complexity (#Test Cases column in Table 2). For the first set, we used all 211 test cases. For the second set, we used 100 test cases chosen at random from the 211. Finally, for the third set, we used 150 test cases, consisting of the 111 test cases not selected for the second set and 39 random test cases from the second set. Notice that the goal of using three test case sets is to evaluate our approach with test data of different complexity.

Evaluation metrics
To answer the RQs, we defined in total seven EMs (Table 3). Six are used to assess how quickly the configurations, test APIs and statuses can be covered. When there is a limited time budget, it is possible that not all of the configurations, test APIs and statuses can be covered. Therefore, we defined APCCp, APACp and APSCp to penalize missing configurations, test APIs and statuses for time-aware prioritization (i.e., the 25%, 50% and 75% time budgets), based on the variant of the APFD metric used for time-aware prioritization [1,16]. For example, for a solution sj with ln test cases, pos(cv, sj) gives the position of the test case in sj that covers the configuration variable value cv for APCCp in Table 3; if sj does not contain a test case that covers cv, then pos(cv, sj) = ln + 1. Notice that in our context, we only have information about how many times in a given period (e.g., a week) a test case was successful in finding faults. Therefore, it is not possible to use the APFD metric to evaluate FDC. Hence, we defined a metric, Measured Fault Detection Capability (MFDC), to measure the percentage of faults detected for the time budgets of 25%, 50% and 75%.
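An APFD-style average-percentage metric of the kind the APCC/APAC/APSC family is described as, with the n + 1 penalty for uncovered elements, might look as follows. This is an assumed reconstruction for illustration; the paper's exact formulas are in its Table 3:

```python
def ap_coverage(solution, elements, covers):
    """APFD-style average-percentage coverage metric, in the spirit of
    the APCC/APAC/APSC family described above (an assumed reconstruction;
    the paper's exact formulas are in its Table 3). covers(tc) returns
    the set of elements covered by test case tc; an element never covered
    is assigned position n + 1 as a penalty, mirroring the time-aware
    variants."""
    n = len(solution)
    first_positions = []
    for e in elements:
        pos = next((p + 1 for p, tc in enumerate(solution) if e in covers(tc)),
                   n + 1)  # penalty when e stays uncovered within the budget
        first_positions.append(pos)
    return 1 - sum(first_positions) / (n * len(elements)) + 1 / (2 * n)
```

A solution that covers every element with its first test case scores higher than one that spreads coverage over the whole order, so the metric rewards covering early.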

Quality Indicator, Statistical Tests and Parameter Settings
When comparing the overall performance of multi-objective search algorithms (e.g., NSGA-II [6]), it is common to apply quality indicators such as hypervolume (HV). Following the guideline in [10], we employ HV based on the defined EMs to address RQ2.2-RQ2.4 (i.e., tasks T2.2-T2.4 in Table 2). HV calculates the volume in the objective space covered by the members of a non-dominated set of solutions (i.e., the Pareto front) produced by a search algorithm, measuring both convergence and diversity [17]. A higher value of HV indicates a better performance of the algorithm. The Vargha and Delaney Â12 statistic [18] and the Mann-Whitney U test are used to compare the EMs (T1 and T2) and HV (T2.2-T2.4), as shown in Table 2, following the guidelines in [19]. The Vargha and Delaney Â12 statistic is a non-parametric effect size measure, and the Mann-Whitney U test tells whether the results are statistically significant [20]. For two algorithms A and B, A performs better than B if Â12 is greater than 0.5, and the difference is significant if the p-value is less than 0.05.
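The Â12 statistic itself is simple to compute: it is the probability that a value drawn from one sample exceeds a value drawn from the other, with ties counting half. A minimal sketch:

```python
def a12(a, b):
    """Vargha-Delaney A12 effect size: the probability that a value drawn
    from sample a exceeds one drawn from sample b (ties count half).
    A12 > 0.5 means the first algorithm tends to perform better."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)
    return wins / (len(a) * len(b))

# E.g., comparing per-run HV values of two algorithms:
print(a12([0.90, 0.85, 0.88], [0.60, 0.70, 0.65]))  # 1.0
```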
Notice that STIPI, A1, A2 and A3 are all combined with NSGA-II. Since tuning parameters to different settings might result in different performance of search algorithms, standard settings are recommended [19]. We used the standard settings (i.e., population size = 100, crossover rate = 0.9, mutation rate = 1/(number of test cases)) as implemented in jMetal [21]. The search process terminates after the fitness function has been evaluated 50,000 times. Since A2 does not support prioritization with a time budget, we collect the maximum number of its prioritized test cases that can fit a given time budget.

RQ1: Sanity Check (STIPI vs. RS)
Results in Table 4 and Table 5 show that on average STIPI scores higher than RS for all the EMs across the three sets of test cases. Moreover, for the three test sets and four time budgets, STIPI managed to achieve higher performance than RS by on average 39.9% (configuration coverage), 18.6% (test API coverage), 32.7% (status coverage) and 43.9% (FDC). In addition, the results of the Vargha and Delaney statistics and the Mann-Whitney U test show that STIPI significantly outperformed RS for all the EMs, since all the values of Â12 are greater than 0.5 and all the p-values are less than 0.05.

RQ2: Comparison with the selected approaches
We compared STIPI with Greedy, A1, A2 and A3 using the statistical tests (Vargha and Delaney statistics and the Mann-Whitney U test) for the four time budgets (25%, 50%, 75% and 100%) and the three sets of test cases (i.e., 100, 150, 211). The results are summarized in Figure 4. For example, the first bar (i.e., Gr) in Figure 4 refers to the comparison between STIPI and Greedy for the 100% time budget, where A = STIPI and B = Greedy. A>B gives the percentage of EMs for which STIPI has significantly better performance than Greedy (Â12 > 0.5 and p < 0.05), A<B means the opposite (Â12 < 0.5 and p < 0.05), and A=B implies there is no significant difference in performance (p ≥ 0.05).

RQ2.1 (STIPI vs. Greedy).
From Table 4 and Table 5, we can observe that the average values of STIPI are higher than Greedy for 93.3% (42/45) of the EMs across the three sets of test cases with the four time budgets. Moreover, from Figure 4, we can observe that STIPI performed significantly better than Greedy for an average of 93.1% of the EMs across the four time budgets (i.e., 88.9% for the 100%, 91.7% for the 75%, 91.7% for the 50%, and 100% for the 25% time budget). Detailed results are available in [15].

RQ2.2 (STIPI vs. A1).
Based on Table 4 and Table 5, we can see that STIPI has a higher average value than A1 for 82.2% (37/45) of the EMs, and Figure 4 shows that STIPI performed significantly better than A1 for an average of 76.4% of the EMs across the four time budgets, while there was no difference in performance for 14.6%. Figure 5 shows that for HV, STIPI outperformed A1 for all three sets of test cases with the four time budgets, and these better results are statistically significant. Detailed results are in [15].

RQ2.3 (STIPI vs. A2).
Table 4 shows that the two approaches had similar averages for the EMs with the 100% time budget. Moreover, for the 100% time budget, there was no significant difference in the performance between STIPI and A2 in terms of EMs and HV (Figure 4 and Figure 5). However, when considering the time budgets of 25%, 50% and 75%, STIPI had higher performance for 96.3% (26/27) of the EMs (Table 4 and Table 5). Furthermore, the statistical tests in Figure 4 and Figure 5 show that STIPI significantly outperformed A2 for an average of 88.9% of the EM and HV values across the three time budgets (25%, 50%, 75%), while there was no significant difference for 11.1%.

RQ2.4 (STIPI vs. A3).
Based on the results (Table 4 and Table 5), STIPI held higher average values for 75% (27/36) of the EM values for the four time budgets and three sets of test cases. For the 100%, 75% and 50% time budgets, we can observe from Figure 4 that STIPI performed significantly better than A3 for an average of 74.1% of the EMs, while there was no significant difference for 22.2%. For the 25% time budget, there was no statistically significant difference between STIPI and A3 in terms of EMs. However, when comparing the HV values, STIPI significantly outperformed A3 for an average of 91.7% across the four time budgets and three sets of test cases. Notice that 12 comparisons were performed when comparing STIPI with each of the five selected approaches (i.e., three test case sets × four time budgets), and thus in total 60 comparisons were conducted. Based on the results, we can observe that STIPI significantly outperformed the five selected approaches for 54 out of 60 comparisons (90%), which indicates that STIPI has a good capability for solving our test case prioritization problem. In addition, STIPI took an average time of 36.5, 51.6 and 82 seconds (secs) for the three sets of test cases. The average running times for the five chosen approaches are: 1) RS: 18, 24.7 and 33.2 secs; 2) Greedy: 42, 48 and 54 milliseconds; 3) A1: 35.7, 42.8 and 65.5 secs; 4) A2: 35.2, 42.2 and 55.4 secs; and 5) A3: 8.9, 33.4 and 41.2 secs. Notice that there is no practical difference in the running times of the approaches except for Greedy; however, the performance of Greedy is significantly worse than that of STIPI (Section 5.2), and thus Greedy cannot be employed to solve our test case prioritization problem. In addition, based on the domain knowledge of VCS testing, a running time in seconds is acceptable when deployed in practice.

Overall Discussion
For RQ1, we observed that STIPI performed significantly better than RS for all the EMs with the three sets of test cases under the four time budgets. Such an observation reveals that solving our test case prioritization problem is not trivial and requires an efficient approach. As for RQ2, we compared STIPI with Greedy, A1, A2 and A3 (Section 4.1).
Results show that STIPI performed significantly better than Greedy. This can be explained by the fact that Greedy is a local search algorithm that may get stuck in a local region of the search space, while STIPI employs a mutation operator (Section 4.4) to explore the whole search space towards finding optimal solutions. In addition, Greedy converts our multi-objective optimization problem into a single-objective one by assigning weights to each objective, which may lose many other optimal solutions of equivalent quality [22], while STIPI (integrating NSGA-II) produces a set of non-dominated solutions (i.e., solutions with equivalent quality). When comparing STIPI with A1, A2 and A3, the results of RQ2 showed that STIPI performed significantly better than A1, A2 and A3 for 83.3% (30/36) of the comparisons. Overall, STIPI outperformed the five selected approaches for 90% (54/60) of the comparisons. That might be due to two main reasons: 1) STIPI considers the coverage of incremental unique elements (e.g., test API commands) when evaluating prioritization solutions, i.e., only the incremental unique elements covered by a certain test case are taken into account as compared with the already prioritized test cases; and 2) STIPI gives test cases with higher execution positions more influence on the quality of a given prioritization solution. Furthermore, A2 and A3 usually work under the assumption that the relations between detected faults and test cases are known beforehand, which is sometimes not the situation in practice; e.g., in our case, we only know how many times a test case detected faults rather than having access to the detailed faults detected. However, STIPI defines FDC to measure the fault detection capability (Section 3.2) without knowing the detailed relations between faults and test cases, which may make it applicable to other similar contexts where the detailed faults cannot be accessed. It is worth mentioning that the current practice of Cisco does not have an efficient approach for test case prioritization, and thus we are working on deploying our approach in their current practice to further strengthen STIPI.

Threats to Validity
The internal validity threat arises from using the search algorithms with only one configuration setting for their parameters in our experiment [23]. However, we used the default parameter settings from the literature [24], and based on our previous experience [5,10], good performance can be achieved by various search algorithms with the default settings. To mitigate the construct validity threat, we used the same stopping criterion (50,000 fitness evaluations) for finding the optimal solutions. To avoid the conclusion validity threat due to random variation in the search algorithms, we repeated the experiments 10 times to reduce the possibility that the results were obtained by chance. Following the guidelines for reporting the results of randomized algorithms [19], we employed the Vargha and Delaney statistic as the effect size measure and the Mann-Whitney U test to determine the statistical significance of the results. The first external validity threat is that the comparison only included RS, Greedy, one existing approach and two modified versions of existing approaches, which may not be sufficient. Notice that we discussed and justified why we chose these approaches in Section 4.1; comparing our approach with further existing approaches requires additional investigation as a next step. The second external validity threat is due to the fact that we only performed the evaluation using one industrial case study. However, we conducted the experiment using three sets of test cases with four distinct time budgets based on the domain knowledge of VCS testing.

Related Work
In the last several decades, test case prioritization has attracted a lot of attention, and a considerable amount of work has been done [1][2][3][8]. Several survey papers [25,26] present results that compare existing test case prioritization techniques from different aspects, e.g., based on coverage criteria. Following the aspects presented in [25], we summarize the related work closest to our approach and highlight the key differences from three aspects: coverage criteria, search-based prioritization techniques (which are most related to our approach) and evaluation metrics.
Coverage Criteria. Existing works have defined a number of coverage criteria for evaluating the quality of prioritization solutions [2,3,26], such as branch coverage, statement coverage, function coverage, function-level fault-exposing potential, block coverage, modified condition/decision coverage, transition coverage and round-trip coverage. As compared with the state of the art, we propose three new coverage criteria driven by the industrial problem (Section 3.2): 1) Configuration Coverage (CC); 2) test API Coverage (APIC); and 3) Status Coverage (SC).
Search-Based Prioritization Techniques. Search-based techniques have been widely applied to the test case prioritization problem [3][4][5][10]. For instance, Zhang et al. [3] defined a fitness function with three objectives (i.e., block, decision and statement coverage) and integrated it with hill climbing and GA for test case prioritization. Arrieta et al. [7] proposed to prioritize test cases by defining a two-objective fitness function (i.e., test case execution time and fault detection capability) and evaluated the performance of several search algorithms. The authors of [7] also proposed a strategy to give higher importance to test cases with higher positions (to be executed earlier). A number of research papers have focused on addressing the test case prioritization problem within a limited budget (e.g., time or test resources) using search-based approaches. For instance, Walcott et al. [1] proposed to combine selection (of a subset of test cases) and prioritization (of the selected test cases) for prioritizing test cases within a limited time budget; different weights are assigned to the selection part and the prioritization part when defining the fitness function, and the problem is then solved with GA. Wang et al. [5] focused on test case prioritization within a limited test resource budget (i.e., hardware, which differs from the time budget used in this work), defined four cost-effectiveness measures (e.g., test resource usage), and evaluated several search algorithms (e.g., NSGA-II).
As compared with the existing works, our approach (i.e., STIPI) defines a fitness function that considers configurations, test APIs and statuses, which have not been addressed in the current literature. When defining the fitness function, STIPI applies two strategies: 1) only counting the incremental unique elements (e.g., configurations) covered; and 2) taking into account the impact of test case execution orders on the quality of prioritization solutions, which is not the case in the existing works.

Evaluation Metrics (EMs).
APFD is widely used in the literature as an EM [2,3,8,16]. Moreover, a modified version of APFD (i.e., APFDp) using a time penalty [1,16] is usually applied for test case prioritization with a time budget. Other metrics have also been defined and applied as EMs [9,26], such as Average Severity of Faults Detected, Total Percentage of Faults Detected and Average Percentage of Faults Detected per Cost (APFDc). As compared with the existing EMs, we defined in total six new EMs driven by our industrial problem for configurations, test APIs and statuses (Table 3): 1) APCC, APAC and APSC, inspired by APFD, for the 100% time budget; and 2) APCCp, APACp and APSCp, inspired by APFDp, for a limited time budget (e.g., a 25% time budget). Furthermore, we defined a seventh EM (MFDC) to assess to what extent faults can be detected when the time budget is less than 100% (Table 3). To the best of our knowledge, no existing work applies these seven EMs for assessing the quality of test case prioritization solutions.

Conclusion and Future Work
Driven by our industrial problem, we proposed a multi-objective search-based test case prioritization approach named STIPI, which aims to cover the maximum number of configurations, test APIs and statuses, and to achieve high fault detection capability, as quickly as possible.
We compared STIPI with five test case prioritization approaches using three sets of test cases with four time budgets. The results show that STIPI performed significantly better than the chosen approaches in 90% of the cases. STIPI managed to achieve a higher performance than random search by on average 39.9% (configuration coverage), 18.6% (test API coverage), 32.7% (status coverage) and 43.9% (FDC). In the future, we plan to compare STIPI with more prioritization approaches from the literature using additional, larger-scale case studies to further generalize the results.

Figure 4. Results of Comparing STIPI with Greedy, A1, A2 and A3 for EMs

RQ2.2 (STIPI vs. A1). Based on Table 4 and Table 5, we can see that STIPI has a higher average value than A1 for 82.2% (37/45) of the EMs. From Figure 4, STIPI performed significantly better than A1 for an average of 76.4% of the EMs across the four time budgets, while there was no difference in performance for 14.6%. Figure 5 shows that for HV, STIPI outperformed A1 for all three sets of test cases with the four time budgets, and the better results are statistically significant. Detailed results are in [15].

Figure 5. Results of Comparing STIPI with A1, A2 and A3 for HV

RQ2.3 (STIPI vs. A2). RQ2.3 is designed to compare STIPI with the approach A2 (Section 4.1). Table 4 shows that the two approaches had similar average values for the EMs with a 100% time budget. Moreover, for the 100% time budget, there was no significant difference in performance between STIPI and A2 in terms of EMs and HV (Figure 4 and Figure 5). However, when considering the time budgets of 25%, 50% and 75%, STIPI had a higher performance for 96.3% (26/27) of the EMs (Table 4 and Table 5). Furthermore, the statistical tests in Figure 4 and Figure 5 show that STIPI significantly outperformed A2 for an average of 88.9% of the EMs and HV values across the three time budgets (25%, 50%, 75%), while there was no significant difference for 11.1%.

Table 1. Illustrating Test Case Prioritization* (columns: Test Case, Configuration, Test API, Status)
{sv_1, sv_2, …, sv_m} represents the set of status variables covered by a test case. For each status variable sv_i, its status variable values are given by the set {svv_1, …, svv_k}.
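Putting the definitions together, the coverage information of one test case can be pictured as a nested mapping. This is a hypothetical sketch; the concrete variable names and values are illustrative only, not the paper's data:

```python
# One test case's coverage information, mirroring the definitions above:
# configuration variables (CV) mapped to configuration variable values (CVV),
# test API commands (AC) mapped to parameters (AP) and parameter values (AV),
# and status variables (SV) mapped to status variable values (SVV).
test_case = {
    "configuration": {"protocol": "SIP"},       # CV -> CVV
    "test_api": {"dial": {"rate": "6000"}},     # AC -> {AP: AV}
    "status": {"call_status": "connected"},     # SV -> SVV
}
```

The prioritization objectives below are all computed over collections of such per-test-case coverage records.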
(e.g., mcvv = 3 in Table 1). Moreover, we propose two prioritization strategies for calculating CVC and CVVC. The first is Incremental Unique Coverage, i.e., ucv_j and ucvv_j represent the number of incremental unique CV and CVV covered by tc_j (Section 3.1). For example, in Table 1, for one test case prioritization solution S1 = {tc5, tc1, tc4, tc2, tc3}, ucvv for tc5 is 1, since tc5 is in the first execution position and covers one CVV (i.e., H320). tc1 and tc4 are at the second and third positions and cover one CVV each (i.e., SIP, H323). However, ucvv for tc2 and tc3 is 0, since their CVVs are already covered by tc1. This strategy is defined because test case prioritization in our case concerns how many configurations, test APIs and statuses can be covered, rather than how many times they are covered. The second prioritization strategy is Position Impact, which takes the execution position of each test case into account, so that test cases executed earlier contribute more to the objective value. CC measures the overall configuration coverage of a solution S_j with n test cases, and is composed of Configuration Variable Coverage (CVC) and Configuration Variable Value Coverage (CVVC), where mcv and mcvv represent the total number of unique Configuration Variables (CV) and Configuration Variable Values (CVV) respectively covered by all the test cases in the test suite (e.g., in Table 1).
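The Incremental Unique Coverage strategy for the Table 1 example can be sketched as follows. The CVV sets assigned to each test case are assumptions mirroring the example above, not the paper's exact data:

```python
def incremental_unique_counts(solution, covered_values):
    """Incremental Unique Coverage: for each test case, in its execution
    position, count only the configuration variable values (CVV) that no
    earlier test case in the solution has already covered."""
    seen = set()
    counts = []
    for tc in solution:
        new = covered_values[tc] - seen  # values not yet covered
        counts.append(len(new))
        seen |= new
    return counts

# Hypothetical CVV sets mirroring the Table 1 example: tc5 covers H320,
# tc1 covers SIP, tc4 covers H323, while tc2 and tc3 only repeat SIP,
# which tc1 already covered.
cvv = {"tc5": {"H320"}, "tc1": {"SIP"}, "tc4": {"H323"},
       "tc2": {"SIP"}, "tc3": {"SIP"}}
incremental_unique_counts(["tc5", "tc1", "tc4", "tc2", "tc3"], cvv)
# → [1, 1, 1, 0, 0]
```

The same counting scheme applies to configuration variables, test API elements and status elements; only the sets of covered values change.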
APIC measures the overall test API coverage of a solution S_j with n test cases. It consists of three sub-measures: Test API Command Coverage (ACC), Test API Parameter Coverage (APC) and Test API Parameter Value Coverage (AVC). The same two strategies (i.e., Incremental Unique Coverage and Position Impact) are applied for calculating ACC, APC and AVC, where uac_j, uap_j and uav_j denote the number of unique test API commands (AC), test API parameters (AP) and test API parameter values (AV) respectively covered by tc_j (Section 3.1); they are measured similarly to ucvv_j in CVVC. Correspondingly, mac, map and mav refer to the total number of unique AC, AP and AV covered by the total number of test cases, as explained for mcvv in CVVC. A higher value of APIC shows a higher coverage of test APIs. Maximize Status Coverage (SC). SC measures the total status coverage of a solution S_j. It consists of two sub-measures: Status Variable Coverage (SVC) and Status Variable Value Coverage (SVVC), where usv_j and usvv_j are the number of unique Status Variables (SV) and Status Variable Values (SVV) respectively covered by tc_j (Section 3.1), measured similarly to ucvv_j in CVVC, and msv and msvv represent the total number of unique SV and SVV respectively, measured similarly to mcvv in CVVC.
The FDC for a test case tc_j is calculated based on the historical information of executing tc_j. For example, if tc_j was executed 10 times and it detected faults 4 times, the FDC for tc_j is 0.4. We then calculate FDC for a solution S_j based on the FDC values of its test cases. RQ1: Is STIPI effective for test case prioritization as compared with RS (i.e., random prioritization)? We compare STIPI with RS for four time budgets: 100% (i.e., the total execution time of all the test cases in a given set), 75%, 50% and 25%, to assess the complexity of the problem such that the use of search algorithms is justified. RQ2: Is STIPI effective for test case prioritization as compared with four selected approaches, in the contexts of four time budgets: 100%, 75%, 50% and 25%? RQ2.1: Is STIPI effective as compared with the Greedy approach (a local search approach)? RQ2.2: Is STIPI effective as compared with the approach used in [7] (named A1 in this paper)? Notice that we chose A1 since it also proposed a strategy to give higher importance to test cases with higher execution positions.
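The per-test-case FDC computation from execution history can be sketched as below. The solution-level aggregation shown here as a plain average is an assumption for illustration; the paper's exact solution-level formula was not reproduced in this excerpt:

```python
def fdc_test_case(executions, detections):
    """FDC of a single test case: fraction of its historical executions
    in which it detected at least one fault."""
    return detections / executions if executions else 0.0

def fdc_solution(history):
    """Illustrative solution-level FDC as the mean of per-test-case FDC
    values over (executions, detections) pairs (assumed aggregation)."""
    values = [fdc_test_case(e, d) for e, d in history]
    return sum(values) / len(values)

fdc_test_case(10, 4)  # → 0.4, matching the example in the text
```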
RQ2.3: Is STIPI effective as compared with the modified version of the approach proposed in [8] (named A2 in this paper)? We chose A2 since it combines the Average Percentage of Faults Detected (APFD) metric and NSGA-II for test case prioritization without considering a time budget. We modified it by defining Average Percentage of Configuration Coverage (APCC), Average Percentage of test API Coverage (APAC) and Average Percentage of Status Coverage (APSC) (Section 4.3) for assessing the quality of prioritization solutions for configurations, test APIs and statuses. RQ2.4: Is STIPI effective as compared with the modified version of the approach in [9] (named A3 in this paper)? We chose A3 since it combines the APFD with cost (APFDc) metric and NSGA-II for addressing the time-aware test case prioritization problem. We revised A3 by defining Average Percentage of Configuration Coverage with cost (APCCc), Average Percentage of test API Coverage with cost (APACc) and Average Percentage of Status Coverage with cost (APSCc). For illustration, Average Percentage of Configuration Variable Value Coverage with cost (APCVVCc), a sub-metric of APCCc, is defined over a solution S_j with jn test cases, where tc_i is the first test case from S_j that covers cvv_i (i.e., the i-th configuration variable value), mcvv is the total number of unique configuration variable values, and t_k is the execution time of the k-th test case. Notice that the detailed formulas for APCCc, APACc and APSCc can be consulted in our technical report [15].
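Following the standard structure of APFDc, APCVVCc can be sketched as below. This is a hedged reconstruction: the paper's exact formula is in the technical report [15], so this mirrors the usual APFDc form (each covered element contributes the execution cost remaining from the first test case covering it, minus half that test case's own cost); all identifiers are illustrative:

```python
def apcvvcc(order, cvv_sets, times):
    """APFDc-style Average Percentage of Configuration Variable Value
    Coverage with cost (assumed form, mirroring APFDc)."""
    total_time = sum(times[tc] for tc in order)
    all_cvv = set().union(*cvv_sets.values())
    # index of the first test case in the order covering each CVV
    first_idx = {}
    for idx, tc in enumerate(order):
        for v in cvv_sets.get(tc, ()):
            first_idx.setdefault(v, idx)
    score = 0.0
    for v in all_cvv:
        i = first_idx[v]
        # cost remaining from the covering test, minus half its own cost
        score += sum(times[tc] for tc in order[i:]) - 0.5 * times[order[i]]
    return score / (total_time * len(all_cvv))

# With equal costs, covering one value per test yields 0.5
apcvvcc(["a", "b"], {"a": {"x"}, "b": {"y"}}, {"a": 1.0, "b": 1.0})  # → 0.5
```

APACc and APSCc follow the same scheme with test API elements and status elements in place of CVVs.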