Efficient Parallel Algorithms for Linear RankSVM on GPU

Abstract. Linear RankSVM is one of the widely used methods for learning to rank. Although using an Order-Statistic Tree (OST) together with the Trust Region Newton Method (TRON) is effective for training linear RankSVM on CPU, it becomes less effective when dealing with large-scale training data sets. Furthermore, linear RankSVM training with L2-loss contains considerably more matrix manipulations than training with L1-loss, so it has great potential for parallelism on GPU. In this paper, we design efficient parallel algorithms on GPU for linear RankSVM training with L2-loss based on different queries. The experimental results show that, compared with the state-of-the-art algorithms for linear RankSVM training with L2-loss on CPU, our proposed parallel algorithm not only significantly enhances the training speed but also maintains high prediction accuracy.


Introduction
As a promising parallel device for general-purpose computing, the Graphics Processing Unit (GPU) not only provides tens of thousands of threads for applications with data-level or task-level parallelism, but also shows superb floating-point performance in comparison with current multicore CPUs [1]. Additionally, combined with the Compute Unified Device Architecture (CUDA) programming model [2] released by NVIDIA in 2007, many existing applications can be conveniently programmed and ported to GPUs. In particular, machine learning algorithms can be highly parallelizable on GPUs since they typically contain a large number of matrix manipulations [3].
Given a set of training label-query-instance tuples (y_i, q_i, x_i), y_i ∈ K ⊂ R, q_i ∈ Q ⊂ Z, x_i ∈ R^n, i = 1, ..., l, where K is the set of possible relevance levels with |K| = k, Q is the set of queries with |Q| = m, l is the total number of training instances and n is the number of features per instance, as well as a defined set of preference pairs P ≡ {(i, j) | q_i = q_j, y_i > y_j} with p ≡ |P|, where (i, j) indicates (x_i, x_j) for short, the objective function f(w) of linear RankSVM with L2-loss is

f(w) = (1/2) w^T w + C Σ_{(i,j)∈P} max(0, 1 − w^T(x_i − x_j))^2,   (1)

where w ∈ R^n is a vector of parameters and C > 0 is a regularization parameter. The goal of RankSVM is to learn w such that w^T x_i > w^T x_j if (i, j) ∈ P.
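For reference, (1) can be evaluated directly by a brute-force sketch (dense NumPy; the function and variable names are ours, and a practical implementation would enumerate pairs per query rather than over all i, j):

```python
import numpy as np

def ranksvm_l2_objective(w, X, y, q, C=1.0):
    """L2-loss linear RankSVM objective:
    f(w) = 0.5 * w^T w + C * sum over (i,j) in P of max(0, 1 - w^T(x_i - x_j))^2,
    where P contains pairs with the same query id and y[i] > y[j]."""
    scores = X @ w
    loss = 0.0
    l = len(y)
    for i in range(l):
        for j in range(l):
            if q[i] == q[j] and y[i] > y[j]:          # (i, j) is a preference pair
                loss += max(0.0, 1.0 - (scores[i] - scores[j])) ** 2
    return 0.5 * (w @ w) + C * loss
```

With w = 0 every preference pair contributes a full unit of loss, which matches the intuition that an untrained model violates every margin.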
Although there have been many serial algorithms for linear RankSVM, no empirical research exists on implementing linear RankSVM on parallel systems. Moreover, on the one hand, although linear RankSVM training using the TRust regiON Newton method (TRON) [20] instead of the Cutting Plane Method (CPM) may converge faster, it becomes less effective when dealing with large-scale training data sets; on the other hand, linear RankSVM with L2-loss contains more matrix-matrix and matrix-vector operations than that with L1-loss, so training linear RankSVM with L2-loss can be accelerated effectively on GPU. This motivates us to design efficient GPU algorithms to train linear RankSVM with L2-loss.
The main contributions of this paper can be summarized as follows: (1) We define a new rule of how to determine the preference pairs in terms of the different queries (Please see Definition 1); (2) Based on the new rule, we propose a parallel algorithm P-SWX for linear RankSVM training with L2-loss; (3) We propose an efficient GPU sorting algorithm, GPU-quicksorting, that can sort multiple sequences within a single GPU kernel. Meanwhile, we conduct extensive comparison experiments to prove the effectiveness of our proposed algorithms. To the best of our knowledge, this is the first work that achieves RankSVM training on GPU.
The rest of the paper is organized as follows: Section 2 briefly introduces the basic principle of linear RankSVM with L2-loss that we are interested in solving, as well as its effective solver TRON. Section 3 is devoted to designing the parallel algorithms P-SWX and GPU-quicksorting on GPU to accelerate the training of linear RankSVM with L2-loss. Experiments demonstrating the performance of our proposed algorithms are given and analysed in Section 4. Finally, Section 5 summarizes the conclusions of this work and points out future research directions.

Linear RankSVM Training with L2-loss
This section briefly introduces the linear RankSVM training with L2-loss by using TRON.

Linear RankSVM Training with L2-loss by Using the Trust Region Newton Method
Typically, TRON can be viewed as an effective Newton method for minimizing f(w); its primary goal, at the d-th iteration, is to find a w_{d+1} such that f(w_{d+1}) is less than f(w_d). To update w_d by w_{d+1} = w_d + v, TRON uses an improved Conjugate Gradient (CG) method to find an optimal direction v ∈ R^n by iteratively minimizing the second-order approximation

F_d(v) ≡ ∇f(w_d)^T v + (1/2) v^T ∇²f(w_d) v,  subject to ‖v‖ ≤ Δ_d,

where Δ_d is the size of the trust region, and ∇f(w_d) and ∇²f(w_d) denote the gradient and the Hessian of f at w_d, respectively. Apparently, TRON contains two levels of iterations: the inner CG iterations, which find an optimal v within the trust region for updating w, and the outer Newton iterations, which generate a more optimal w for f(w). The whole framework of TRON is presented in Algorithm 1. Our setting for updating Δ_d follows the work done by Lin et al. [21]. For the stopping condition, we follow that of TRON in the package LIBLINEAR 1 [22] and check whether the gradient is small enough compared with the initial gradient:
‖∇f(w_d)‖ ≤ s ‖∇f(w_0)‖,

where w_0 is the initial iterate and s is the stopping tolerance given by users. Optimizing f(w) by TRON requires computing ∇f(w) and ∇²f(w). However, ∇²f(w) does not exist because ∇f(w) is not differentiable everywhere, so a generalized Hessian is used instead. To derive a faster method to calculate the Hessian-vector product ∇²f(w)v, Lee et al. [16] explored the structure of ∇²f(w)v by defining some expressions as follows.
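Algorithm 1's inner loop can be sketched on CPU as a conjugate gradient solve of ∇²f(w_d) v = −∇f(w_d) capped by the trust region. This is a simplified illustration under our own naming (the production TRON computes the exact step to the trust-region boundary rather than simply stopping there):

```python
import numpy as np

def conjugate_gradient(hess_vec, g, delta, tol=1e-6, max_iter=100):
    """Approximately solve H v = -g by CG, stopping early if the next
    iterate would leave the trust region of radius delta.
    hess_vec(v) returns the Hessian-vector product H v."""
    v = np.zeros_like(g)
    r = -g.copy()            # residual -g - H v (v = 0 initially)
    d = r.copy()             # search direction
    rr = r @ r
    for _ in range(max_iter):
        if np.sqrt(rr) < tol:
            break
        Hd = hess_vec(d)
        alpha = rr / (d @ Hd)
        if np.linalg.norm(v + alpha * d) >= delta:
            break            # step would leave the trust region
        v = v + alpha * d
        r = r - alpha * Hd
        rr_new = r @ r
        d = r + (rr_new / rr) * d
        rr = rr_new
    return v
```

Note that only Hessian-vector products are required, never the Hessian itself; this is exactly why the structure of ∇²f(w)v matters for efficiency.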

Algorithm 1 Trust Region Newton Method
Following the above definitions, Lee et al. [16] converted f(w), ∇f(w) and ∇²f(w)v into the following expressions.
The Hessian-vector product ∇²f(w)v, according to the derivation done by Lee et al. [16], can be viewed as the computational bottleneck since it is required in every CG iteration as well as every outer iteration of TRON. According to the definitions of SV+_i and SV−_i, computing all parameter variables in (5) requires determining the relative order of the values w^T x_i, so sorting all w^T x_i before the CG iterations of TRON is a reasonable way to make this decision quickly. If all w^T x_i are already sorted, then computing all parameter variables in (5) by DCM costs O(lk) [23]. If the favourable search performance of OST is exploited instead, the O(lk) term is reduced to O(l log(k)) [15]. Therefore, by using TRON along with OST, the computation cost of linear RankSVM training with L2-loss is dominated by the O(l log(l)) sorting per outer iteration plus the O(l log(k)) order-statistic computation per CG iteration.

Novel Parallel Algorithms for Linear RankSVM Training with L2-loss on Graphic Processing Units
In this section, we design efficient parallel algorithms for training linear RankSVM with L2-loss on GPU.

Efficient Parallel Algorithm for Computing Hessian-vector Product on Graphic Processing Units
As shown in (5), each x_i has a one-to-one relationship with its parameter variables, so we can assign a thread to calculate the variables that correspond to x_i. Although this rough parallel method can achieve data-level parallelism on GPU, each assigned thread has to execute O(l) steps, which is less effective than DCM (O(lk)) or OST (O(l log(k))) if k is small.
However, according to the definition of P, x_i and x_j can form a preference pair if and only if q_i = q_j holds. Hence, all x_i (and all y_i) can be divided into m subsets, since there are m different queries in Q. We assume that, for a query Q(t), X_t = [x_t1, ..., x_t|X_t|] and Y_t = [y_t1, ..., y_t|Y_t|] indicate the corresponding subsets of the training instances and labels, respectively.
Theorem 1. Any two of the subsets X_t (or Y_t) are independent of each other in computing the parameter variables in (5).

Proof. According to SV+_i and SV−_i, computing the parameter variables corresponding to x_i (or y_i) refers only to those x_j (or y_j) satisfying q_j = q_i. Therefore, any two of X_t (Y_t) are independent of each other.
According to Theorem 1, if x_i ∈ X_t and y_i ∈ Y_t, then computing the parameter variables corresponding to x_i needs to go through only the subsets X_t and Y_t rather than all x_i and all y_i, which reduces the computation complexity significantly. Meanwhile, sorting all w^T x_i together is no longer suitable because it refers to all x_i and all y_i; instead, each w^T X_t should be sorted separately.
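The per-query partition that Theorem 1 licenses can be sketched as follows (a CPU illustration with our own names; on GPU each resulting subset maps to one thread block):

```python
from collections import defaultdict

def split_by_query(X, y, q):
    """Partition instances and labels into per-query subsets (X_t, Y_t).
    Since preference pairs only form within a query, each subset can be
    processed independently, e.g. by its own GPU thread block."""
    groups = defaultdict(lambda: ([], []))
    for xi, yi, qi in zip(X, y, q):
        groups[qi][0].append(xi)   # instances of this query
        groups[qi][1].append(yi)   # labels of this query
    return dict(groups)
```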

Definition 1. For a query Q(t), the preference pairs are determined by P(t) ≡ {(i, j) | x_ti, x_tj ∈ X_t, y_ti > y_tj}.
SV_t(w) is similar to but essentially different from SV(w), because it involves only the data related to Q(t). So if all X_t (and all Y_t) have been obtained, the expressions of (4) and (5) are transformed into (9) and (10), respectively.
Accordingly, f(w), ∇f(w) and ∇²f(w)v have to be converted into per-query forms. Of course, combined with (9) and (10), the products Ã_w^T Ã_w X v, Ã_w^T Ã_w X w and Ã_w^T e_w can then be computed query by query. According to Theorem 1, m thread blocks are assigned to compute all parameter variables shown in (10) in parallel on GPU. Moreover, each x_ti corresponds to β+_ti, β−_ti, α+_ti, α−_ti, γ+_ti and γ−_ti, so this computation can effectively achieve data-level parallelism on GPU in terms of Definition 1. Assume that, for the t-th query Q(t), w^T X̄_t indicates the sorted w^T X_t, i.e., w^T X̄_t = [w^T x_tπ(1), ..., w^T x_tπ(|X_t|)], with the correspondingly permuted labels Ȳ_t = [y_tπ(1), ..., y_tπ(|X_t|)]. Then we map w^T X̄_t, X_t v and Ȳ_t jointly into the t-th thread block of GPU for parallel computing.
Based on the above discussion, we propose an efficient parallel algorithm, P-SWX, to compute the parameter variables shown in (10) on GPU. The specific steps of P-SWX are shown in Algorithm 2, in which the threads in the t-th thread block execute at most O(|X_t|) steps. Let l_L denote the largest |X_t|; then the threads on GPU execute at most O(l_L) steps.
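The exact expressions for the parameter variables in (10) are garbled in this copy, so purely as an illustration of the per-query pattern that P-SWX parallelizes (one thread per instance, one thread block per query), the following sketch computes, for each instance of a single query, the count and score-sum of higher-ranked instances that violate the margin against it. The function name and the margin condition are our assumptions, modeled on the SV+_i/SV−_i style definitions, not the paper's exact formulas:

```python
import numpy as np

def margin_violation_stats(scores, y):
    """For one query: cnt[i] counts instances j with y[j] > y[i] whose
    margin is violated (scores[j] - scores[i] < 1); ssum[i] sums their
    scores.  On GPU, thread i of the query's block would compute row i."""
    n = len(scores)
    cnt = np.zeros(n, dtype=int)
    ssum = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if y[j] > y[i] and scores[j] - scores[i] < 1.0:
                cnt[i] += 1
                ssum[i] += scores[j]
    return cnt, ssum
```

Because each thread scans only its own query's instances, the per-thread work is O(|X_t|) rather than O(l), matching the bound stated above.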
However, to obtain a more favourable training speed, the matrix operations, including matrix-matrix products, matrix-vector products and vector additions, should be calculated by adopting the Sgemm, Sgemv and Saxpy subroutines in CUBLAS [24], respectively. So if all parameter variables shown in (10) have been calculated, computing ∇²f(w)v, ∇f(w) and f(w) on GPU by invoking CUBLAS is a reasonable choice. Although P-SWX and CUBLAS can effectively improve the training speed of linear RankSVM with L2-loss, the sorting cost on CPU, the O(l log(l)) term, is still high when addressing large-scale training data sets.

Efficient GPU Sorting for Linear RankSVM Training with L2-loss
As discussed in Section 3.1, we assign m thread blocks to sort all w^T X_t concurrently on GPU. The subscript i of each w^T x_ti needs to be applied in the subsequent computations, so each element must carry its original index through the sorting.

Algorithm 2 P-SWX: m thread blocks are assigned on GPU.
In computer memory, the whole w^T X is always stored instead of the individual w^T X_t, so w^T X is converted into a struct sequence d_pri, and each w^T X_t corresponds to a subsequence d_pri_t of d_pri. Apparently, locating the boundaries of each d_pri_t in d_pri is crucial to achieving the sorting in parallel on GPU. Thus, we define a struct parameter workset with two elements, beg and end, to record the boundaries of each subsequence in d_pri, where beg and end store the starting and ending positions of a subsequence, respectively. Taking the t-th subsequence d_pri_t for example, workset(t) can be calculated as workset(t).beg = Σ_{s<t} |X_s| and workset(t).end = workset(t).beg + |X_t|.
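Assuming elements of the same query are stored contiguously in d_pri, the boundary computation can be sketched on CPU as a single scan (names follow the workset struct above; the half-open convention [beg, end) is our choice):

```python
def compute_workset(query_ids):
    """Given the query id of each element of d_pri (same-query elements
    stored contiguously), record the [beg, end) boundaries of every
    query's subsequence, mirroring the workset struct."""
    workset = {}
    for pos, qid in enumerate(query_ids):
        if qid not in workset:
            workset[qid] = [pos, pos + 1]   # first element of this query
        else:
            workset[qid][1] = pos + 1       # extend the ending position
    return workset
```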
According to the above discussion and the work done by Cederman et al. [25], we propose an efficient GPU sorting algorithm, GPU-quicksorting, which uses an auxiliary buffer d_aux as large as d_pri. The basic principle of this GPU sorting breaks down into two steps. First, if |X_t| in the t-th thread block is larger than a user-defined minsize, then d_pri_t is partitioned into two subsequences by a randomly selected pivot; otherwise, d_pri_t is sorted directly by bitonic sorting [26]. Second, each subsequence generated in the first step whose size is ≤ minsize is sorted by bitonic sorting, while those of larger size are further divided by the first step until the size of every new subsequence is ≤ minsize. The specific steps of GPU-quicksorting are presented in Algorithm 3.
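The two-step principle can be illustrated with a small CPU sketch (Python stands in for the CUDA kernel: sorted() replaces the bitonic network, and recursion replaces the explicit workstack of Algorithm 3):

```python
import random

def hybrid_quicksort(seq, minsize=8):
    """Sketch of the GPU-quicksorting principle: partition around a
    random pivot while a subsequence is longer than minsize, then
    finish small subsequences with a direct sort (the GPU version uses
    bitonic sorting and one thread block per subsequence)."""
    if len(seq) <= minsize:
        return sorted(seq)               # stands in for bitonic sorting
    pivot = random.choice(seq)
    left = [x for x in seq if x < pivot]
    mid = [x for x in seq if x == pivot]
    right = [x for x in seq if x > pivot]
    return hybrid_quicksort(left, minsize) + mid + hybrid_quicksort(right, minsize)
```

The minsize threshold trades partitioning overhead against the cost of the small direct sort, which is why it is exposed as a user-defined parameter.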
Theoretically, our proposed GPU-quicksorting is an efficient GPU sorting algorithm since it makes full use of the unique properties of GPU to sort multiple sequences in parallel within a single GPU kernel. Consequently, by using Algorithm 1, Algorithm 2, Algorithm 3 and CUBLAS, we design an efficient GPU implementation, P-SWXRankSVM, to train linear RankSVM with L2-loss on GPU.

Performance Evaluation
In this section, we use two state-of-the-art implementations, OSTRankSVM 2 and DCMRankSVM, as baselines to evaluate P-SWXRankSVM in our experimental tests. OSTRankSVM is an effective method that solves linear RankSVM training with L2-loss by TRON along with OST, while DCMRankSVM is another effective method that solves it by TRON along with DCM.
There are six real-world training data sets, whose sizes are presented in Table 1, used in our experimental tests. We set C, s and minsize to 1, 10^−5 and 64, respectively. It is unclear whether these are the best settings, but we adopt them as our default configuration first. The measurements are carried out on a single server with four Intel(R) Xeon(R) E5620 2.40GHz four-core CPUs and 32GB of RAM running Ubuntu 9.04 (64-bit). The graphics processor used is an NVIDIA Tesla C2050 card with 448 CUDA cores, each running at 1.15 GHz. The card has 3GB of GDDR5 memory and a memory bandwidth of 144 GB/s. Besides, the CUDA driver and runtime versions used in our experiments are both 5.0, and only one Tesla C2050 card is used in all benchmark tests.

Algorithm 3 GPU-quicksorting (excerpt)
4: if end − beg < minsize then
5:   bitonic(d_aux(beg → end), d_pri(beg → end)) // Sort the data.
7: else
8:   push both beg and end to workstack
9:   while workstack ≠ ∅ do
10:    pivot ← random(d_aux(beg → end)) // Select a pivot randomly.
12:    lt_threadid, gt_threadid ← 0, 0 // threadid: the ID number of threads
13:    for ...

Performance Evaluation for P-SWXRankSVM
The specific performance comparisons among DCMRankSVM, OSTRankSVM and P-SWXRankSVM on the different training data sets are presented in Table 2. As shown in the table, OSTRankSVM performs better than DCMRankSVM, which mainly relies on the superior property of OST (O(l log(k))) in reducing the computation complexity compared with DCM (O(lk)). As expected, P-SWXRankSVM shows greater speedup over both DCMRankSVM and OSTRankSVM when addressing large-scale training data sets. Such an efficient speedup mainly depends on the fact that P-SWXRankSVM can make full use of the great computation power of GPU by our design. Moreover, to analyse the convergence performance of DCMRankSVM, OSTRankSVM and P-SWXRankSVM in more detail, we investigate the relative difference η to the optimal function value:

η = |f(w) − f(w*)| / |f(w*)|,

where w* is the optimum of (1), and s is also set to 10^−5. The measured convergence speeds on the different training data sets are illustrated in Figure 1. From the figure, we observe that OSTRankSVM converges faster than DCMRankSVM as training time goes on, but its advantage over DCMRankSVM is not marked if the data set has a small k. This is because OST becomes more efficient than DCM only if k is large enough. As expected, P-SWXRankSVM converges much faster than both DCMRankSVM and OSTRankSVM. However, "MQ2008" is a special case. The reason is that a data set of small size cannot effectively utilize the great computation power of GPU, so invoking P-SWXRankSVM to train a small data set such as "MQ2008" spends most of its time launching GPU kernels and communicating between CPU and GPU rather than computing.

Prediction Performance Evaluation for P-SWXRankSVM
If an optimum w* of (1) is obtained, we need to evaluate whether such a w* is good for prediction. In general, the Pairwise Accuracy (PA) [27] is well suited to measuring the prediction performance of a pairwise approach such as RankSVM. Thus, we choose PA as the measurement here.
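PA is simply the fraction of preference pairs that the learned scores order correctly; a brute-force sketch (names ours, pairs enumerated over all i, j for clarity):

```python
def pairwise_accuracy(scores, y, q):
    """Pairwise Accuracy: fraction of preference pairs (same query,
    y[i] > y[j]) that the model orders correctly (scores[i] > scores[j])."""
    correct = total = 0
    n = len(y)
    for i in range(n):
        for j in range(n):
            if q[i] == q[j] and y[i] > y[j]:    # (i, j) is a preference pair
                total += 1
                if scores[i] > scores[j]:
                    correct += 1
    return correct / total if total else 1.0
```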
The specific measured results are clearly presented in Table 3. As shown in the table, training linear RankSVM with L2-loss with any one of these implementations results in almost the same prediction performance, which effectively proves that P-SWXRankSVM not only accelerates linear RankSVM training with L2-loss significantly but also maintains the prediction accuracy. This is to be expected: all four of DCM, OST, P-SWX and GPU-quicksorting are devoted to improving the training speed of linear RankSVM with L2-loss, so, theoretically, they should not influence the prediction accuracy.

Conclusion and Future Work
In this paper, we have proposed two efficient parallel algorithms, P-SWX and GPU-quicksorting, to accelerate linear RankSVM training with L2-loss on GPU. To design these algorithms, we not only divide all training instances x_i and labels y_i into several independent subsets in terms of the different queries, but also define a new rule for determining the preference pairs (i, j). Because of this, our proposed parallel algorithms achieve both task-level and data-level parallelism effectively on GPU. Since this is the initial work on GPU implementations for training linear RankSVM with L2-loss on a single GPU, there are still many challenges to be addressed in exploring multi-GPU implementations. The next extension of this work will therefore use multiple GPU devices to solve even larger training problems in a parallel fashion.