Unified-Processing of Flexible Division Dealing with Positive and Negative Preferences

Current trends in universal quantification-based queries are oriented towards flexible ones (tolerant queries and/or those involving preferences). In this paper, we are interested in universal quantification-like queries dealing with positive and/or negative preferences (requirements or prohibitions), considered separately or simultaneously. We focus on improving the performance of the proposed operator by designing new variants of the classical Hash-Division algorithm, presented in [1], adapted to our context. A parallel implementation is also presented, and the issue of answer ranking is dealt with. Computational experiments are carried out for both the sequential and the parallel versions. They show the relevance of our approach and demonstrate that the new operator outperforms the conventional one with respect to performance (the gain exceeds a ratio of 40).


Introduction
Relational operators involving universal quantification are an interesting type of query. They are very useful for many applications, especially business intelligence applications and recommendation systems [2]. In relational algebra, universal quantification-like queries are the most complex operators, which is why a lot of research focuses on their implementation, algorithms and optimisation [3]. Universal quantification-like queries are often expressed with the division or anti-division operators. The division searches for elements associated with all members of a set of requirements, while the anti-division aims to find all elements that are associated with none of the members of a set of prohibitions [4]. In this paper, we are concerned with some relevant issues related to the improvement of queries combining both required and forbidden associations.

The division and anti-division operators
Relational division is used when an element that satisfies a whole set of requirements is sought, whereas the anti-division operator is used to select elements that exclude any association with a set of prohibitions [6]. In relational algebra, the division (resp. anti-division) of relation r(X, Y), called the 'dividend', by relation s(Y), called the 'divisor', is a new relation q(X), called the 'quotient', that includes the part of Project(r, X) satisfying the following condition: x is in q(X) if and only if x is in Project(r, X) and, for all (resp. none) y in s(Y), r(X, Y) contains (resp. does not contain) <x, y> [6]. X and Y are two compatible sets of attributes. More formally, the relational division is characterised by Equation 1, and the anti-division by Equation 2:
$$\mathrm{Div}(r, s, X, Y) = \{x \in \mathrm{projection}(r, X) \mid \forall y,\ (y \in s) \Rightarrow (\langle x, y \rangle \in r)\} \tag{1}$$
$$\mathrm{AntiDiv}(r, s, X, Y) = \{x \in \mathrm{projection}(r, X) \mid \forall y,\ (y \in s) \Rightarrow (\langle x, y \rangle \notin r)\} \tag{2}$$
Example 1: Consider a company distributing some products. In its commercial activity, the company wants to select its most valued customers (buyers). Customer ranking is based on some categories of products. Let Customer Order (#customer, #product, #order state), Critical Product (#product, #order state) and Golden Product (#product, #order state) be three crisp relations, as sketched in Figure 1.
Fig. 1: Division query: "Which customers have made an approved order for each golden product?"; Anti-division query: "Which customers have not made an aborted order for any of the critical products?"
In the figure above, C1 and C3 are the resulting quotients of the division because they have made an approved order for all golden products. For the anti-division, C1 and C2 are the valid quotients, since neither of them has made an aborted order for any critical product.
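As an illustration, the two definitions above can be sketched in a few lines of Python. This is only a naive set-based reading of the definitions, not an efficient implementation. The toy relations below are hypothetical (the actual tuples of Figure 1 are not reproduced here); they are merely chosen so that the results agree with Example 1. The function names `division` and `anti_division` are ours.

```python
def division(r, s):
    """Classical division: keep x if (x, y) is in r for every y in s."""
    candidates = {x for x, _ in r}
    return {x for x in candidates if all((x, y) in r for y in s)}


def anti_division(r, s):
    """Anti-division: keep x if (x, y) is in r for no y in s."""
    candidates = {x for x, _ in r}
    return {x for x in candidates if not any((x, y) in r for y in s)}


# Hypothetical toy data: y is a (product, order state) pair.
golden = {("P1", "approved"), ("P3", "approved")}
critical = {("P2", "aborted"), ("P4", "aborted")}
orders = {
    ("C1", ("P1", "approved")), ("C1", ("P3", "approved")),
    ("C2", ("P1", "approved")), ("C2", ("P2", "approved")),
    ("C3", ("P1", "approved")), ("C3", ("P3", "approved")),
    ("C3", ("P2", "aborted")),
}

division_result = division(orders, golden)
anti_division_result = anti_division(orders, critical)
```

With these assumed tuples, the division keeps C1 and C3 and the anti-division keeps C1 and C2, as in the discussion of Figure 1.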

Current trends
Both relational division and anti-division often provide an empty answer. This problem has been widely studied over the last two decades [7]. Flexible operators (tolerant operators and operators dealing with user preferences) are the most desirable technique to solve this problem and improve the quality of DBMS answers [8], especially for recommender systems [9]. Flexible division (resp. anti-division) consists in weakening the quantifier 'all' (resp. 'none') used in the classical operator [6,10].

Related work and motivation
Two main areas of research on division and anti-division can be identified. The first concerns the improvement of those operators, while the second investigates them in a flexible context.
In the literature, several studies have focused on how to implement the division efficiently, including those surveyed in [1,2,3] for the relational model and in [5] for the object-oriented model. In particular, the approach proposed in [1], called 'Hash-Division', has been shown experimentally to outperform the traditional algorithms in processing time in most cases. Further, as far as we know, the only work on the relational anti-division is that of Bosc et al. [6,11]. Nonetheless, their implementation is based upon SQL query derivation and is far from being optimal.
In the flexible area, some authors have suggested new operators for relational division [10,12] and anti-division [6,11,13] which are tailored to the flexible context. However, the performance aspect has not been adequately dealt with. Some extended variants of the hash-division algorithm have been discussed in our earlier work [14] to handle some forms of the flexible division and of the division with preferences. To the best of our knowledge, the only experiments done for the anti-division are those presented in [6,11]. Their implementation is based on the nested-loops algorithm, whose performance is far from acceptable. Moreover, query evaluation is performed on small data sets (dividend and divisor). This does not fit reality, especially for analytical processing on very large databases. In addition, the authors of [15] have suggested a way of combining the division and the anti-division operators. However, neither an implementation nor experiments are presented in that paper.

Main contributions in this paper
This paper is a continuation of our previous work detailed in [14], which was shown to process the flexible division efficiently. An extended variant is proposed in this work to cover additional forms of universal-quantification-based queries.
The main purpose of our work is to design a unified processing that handles queries involving requirements and prohibitions simultaneously, with a single operator. Such queries allow users to express several kinds of preferences, which is very useful in information systems, especially in artificial intelligence.
We also address the performance enhancement of the new operator, drawing on the Hash-Division strategy as used in our previous work [14].
Example 2: Let us take the relations of the previous example. Thanks to the mixed query, customers can be evaluated through the following query: "Find customers who have made an approved order for all golden products and have not made any aborted order for the critical products".
Here, C3 is no longer a valid quotient because he has made an aborted order for one of the critical products (P2). The same holds for C2, who has not made an approved order for all golden products. Thus, we can conclude that customers can be better distinguished through the mixed query. In addition, a unified (single-operator) and fast processing of such queries will improve them even more. This is the motivation behind our work. Hereafter we summarise our contributions:
- Investigate the performance enhancement of flexible queries involving both division and anti-division, essentially for very large volumes of data.
- Investigate the feasibility of a parallel implementation of the extended approach.
We consider in this work the flexible division and anti-division over crisp databases exclusively. Fuzzy relations will be studied in future work.

Outline of the paper
The remainder of this paper is organised as follows. In Section 2, we present the classical Hash-Division algorithm. Section 3 gives an overview of the flexible division and the flexible anti-division. In Section 4, our contribution is presented together with an analysis and discussion of the experimental results obtained. Section 5 introduces a parallel implementation of the proposed operator. Finally, Section 6 concludes the paper and suggests directions for future work.

Review of Hash-Division Algorithm
In this section, we give a brief description of the hash-division algorithm (HD) (see [1] for further details). It uses two hash tables in order to avoid the exhaustive comparison used in the traditional algorithms. The first table is for the divisor and the second for the quotient. Thanks to these two structures, both the dividend and the divisor relations are scanned exactly once, which makes the division operator faster. The Hash-Division algorithm proceeds in three stages:
Stage 01: Building the hash-divisor table: during the scan of the divisor table, we insert all divisor tuples into buckets of the hash-divisor table. Each entry in this table is stored together with an integer called the divisor number 'Num_div'. Num_div is initialized to 0 and is incremented whenever a new insertion into the hash-divisor table occurs.
Stage 02: Building the hash-quotient table: during the scan of the dividend, for each row that corresponds to one of the divisors stored in the hash-divisor table, we insert a quotient candidate into a bucket of the hash-quotient table. Together with each inserted candidate, a bitmap is kept with one bit for each divisor. All bits are initialized to 0, and a bit is set to 1 whenever a match with the corresponding divisor occurs.
Stage 03 (end): Building the result: in this last stage, we select from the constructed hash-quotient table, as valid quotients, all quotient candidates whose bitmaps contain only ones.
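The three stages above can be sketched as follows. This is a minimal Python illustration under simplifying assumptions, not the implementation of [1]: Python dictionaries stand in for the two hash tables, an integer stands in for the bitmap, and the name `hash_division` as well as the toy data are ours.

```python
def hash_division(dividend, divisor):
    """Sketch of the three Hash-Division stages (dicts play the hash tables)."""
    # Stage 01: hash-divisor table, each divisor tuple gets a divisor number.
    divisor_table = {}
    for y in divisor:
        if y not in divisor_table:
            divisor_table[y] = len(divisor_table)

    # Stage 02: hash-quotient table, one bitmap (a Python int) per candidate.
    quotient_table = {}
    for x, y in dividend:
        num_div = divisor_table.get(y)
        if num_div is None:
            continue  # this row matches no divisor tuple
        quotient_table[x] = quotient_table.get(x, 0) | (1 << num_div)

    # Stage 03: keep the candidates whose bitmap contains only ones.
    full = (1 << len(divisor_table)) - 1
    return {x for x, bitmap in quotient_table.items() if bitmap == full}


# Hypothetical toy data: (customer, (product, order state)) rows.
orders = [
    ("C1", ("P1", "approved")), ("C1", ("P3", "approved")),
    ("C2", ("P1", "approved")), ("C2", ("P2", "approved")),
    ("C3", ("P1", "approved")), ("C3", ("P3", "approved")),
    ("C3", ("P2", "aborted")),
]
golden = [("P1", "approved"), ("P3", "approved")]

quotients = hash_division(orders, golden)
```

Each relation is traversed once. Note that, in this sketch, a candidate is created only when one of its rows matches a divisor tuple, so an empty divisor yields an empty result.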

Review of Flexible Division and Flexible anti-division
Flexible (or tolerant) division and anti-division were essentially proposed in order to avoid the empty-result problem, which may occur whenever the 'for all' or 'for none' quantifiers are used [6,10]. There is a plethora of suggestions in the literature showing that the original relational division (anti-division) can be extended to different types of flexible queries. In this work, we are interested in the following forms of flexible operators: (i) exception-based tolerant division, (ii) exception-based tolerant anti-division.

Principle
This category is based on exceptions to the requirements set for the division, or to the prohibitions set for the anti-division (the divisor). The principle is to weaken the quantifier 'all' (resp. 'none') into the fuzzy quantifier 'almost all' (resp. 'almost none') to express the tolerant division (resp. anti-division) [6,10,12]. Thus, depending on the desired level of relaxation, some elements of the divisor set are allowed not to be associated (resp. to be associated) with the quotient in the dividend relation.

Modelling
A maximum number of exceptions is allowed to be ignored. The satisfaction level SL of a quotient is measured by Equation 3 for the division and by Equation 4 for the anti-division. A threshold is required for accepted quotients [10,13]. Valid quotients are sorted according to their satisfaction levels.

$$SL_{\mathrm{Division}} = \frac{\text{number of divisors associated with the candidate}}{\text{total number of divisors}} \tag{3}$$
$$SL_{\mathrm{Anti\text{-}Division}} = \frac{\text{number of divisors not associated with the candidate}}{\text{total number of divisors}} \tag{4}$$

Our proposed approach for the mixed query
This section is devoted to tolerant universal-quantification queries in which both division and anti-division are considered simultaneously. We first give a novel way of combining the two types of associations: required and forbidden. Then, the performance of the proposed approach is highlighted.
We propose to improve the efficiency of the mixed query by drawing on the strategy of the hash-division algorithm. We have made various alterations to the structures and the procedures used in the classic algorithm in order to deal with the unified mixed operator. Moreover, we describe a suitable technique to better discriminate final quotients, at no additional cost. It should be noted that our work differs from the work of Bosc et al. presented in [15] in the way the query is formulated. Here, all preferences, requirements and prohibitions alike, are expressed through a single operator. By contrast, the key issue with the approach presented in [15] is that it is based on the decomposition of the mixed operator into several successive relational division and/or anti-division operations, depending on the number of layers, which is a very time-consuming process.

Strict and gradual Mixed Query
To deal with the mixed query, the divisor is subdivided into two sets: a positive part (requirements) P and a negative part (prohibitions) N.
In the strict version, to be selected as a valid quotient, an element x must be associated with all values in P and must not be associated with any value in N. Consequently, P and N must be disjoint. In this strict version, all results are equally ranked. For the gradual mixed query, since some tolerance is allowed in both subsets P and N, results are discriminated according to their satisfaction levels. Hence, for each accepted quotient we define two sub-levels, S_p and S_n, which stand for the satisfaction level of the positive and the negative part respectively. S_p is computed as in Equation 3 with respect to the positive part, and S_n is computed as in Equation 4 with respect to the negative part.
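To make the two versions concrete, the following naive Python sketch computes S_p and S_n for each candidate and derives the strict and the gradual answers. Several points are assumptions of ours and not part of the definitions above: the two separate thresholds `min_sp` and `min_sn`, the ordering of accepted quotients (S_p first, then S_n), the function names, and the toy data. P and N are assumed to be non-empty and disjoint.

```python
def satisfaction_levels(r, x, positive, negative):
    """Return (S_p, S_n) for candidate x; P and N are assumed non-empty."""
    s_p = sum((x, y) in r for y in positive) / len(positive)
    s_n = sum((x, y) not in r for y in negative) / len(negative)
    return s_p, s_n


def strict_mixed(r, positive, negative):
    """Strict version: x matches all of P and none of N."""
    candidates = {x for x, _ in r}
    return {x for x in candidates
            if satisfaction_levels(r, x, positive, negative) == (1.0, 1.0)}


def gradual_mixed(r, positive, negative, min_sp, min_sn):
    """Gradual version: keep x when both sub-levels reach their thresholds."""
    candidates = {x for x, _ in r}
    kept = []
    for x in candidates:
        s_p, s_n = satisfaction_levels(r, x, positive, negative)
        if s_p >= min_sp and s_n >= min_sn:
            kept.append((x, s_p, s_n))
    # Assumed ordering: S_p first, then S_n, then the name for determinism.
    return sorted(kept, key=lambda t: (-t[1], -t[2], t[0]))


# Hypothetical toy data: (customer, (product, order state)) pairs.
golden = {("P1", "approved"), ("P3", "approved")}      # positive part P
critical = {("P2", "aborted"), ("P4", "aborted")}      # negative part N
orders = {
    ("C1", ("P1", "approved")), ("C1", ("P3", "approved")),
    ("C2", ("P1", "approved")), ("C2", ("P2", "approved")),
    ("C3", ("P1", "approved")), ("C3", ("P3", "approved")),
    ("C3", ("P2", "aborted")),
}

strict_result = strict_mixed(orders, golden, critical)
gradual_result = gradual_mixed(orders, golden, critical, 0.5, 0.5)
```

With these assumed tuples, only C1 survives the strict version, whereas the gradual version with both thresholds at 0.5 also accepts C3 and C2, ranked after C1.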

Hash-mixed query: an improvement of the mixed query
Here we describe how we have improved the processing time of the mixed query by relying on a Hash-Division-like algorithm. The three altered phases of the hash-mixed query are described hereafter.

The first stage:
As in the classic algorithm, we store all divisor tuples in a hash table. In our variant, however, each tuple is stored together with two integers:
- ind_lyr: index of the layer, 0 for P and 1 for N. This integer is used to determine the offset of the divisor tuple inside the bitmap. Bits corresponding to divisors in P are located, in the bitmap, before those belonging to N.
- num_div_lyr: the divisor number (rank) of the tuple in its layer (P or N).
The data structure of a divisor tuple in the hash-divisor table is shown in the following figure. Hence, for each layer, P and N, we keep its own divisor counter. These two counters are initialized to 0 and incremented whenever we insert a new divisor of the corresponding layer into the hash-divisor table.

The second stage:
In the second stage (construction of the hash-quotient table) of the hash-mixed query, we have introduced two major differences from the basic algorithm. The first is how the bitmap is updated. If a match between a divisor (in P or N) and the quotient candidate occurs, we set to 1 the bit whose position in the quotient bitmap is equal to 'offst_lyr + num_div_lyr', where:
- num_div_lyr: the divisor number stored together with the matching divisor.
- offst_lyr: set to 0 if the matching divisor belongs to P; otherwise (it belongs to N) it is set to |P| (the cardinality of the positive subset).
Therefore, the data structure of the bitmap of candidates is as shown below:
Fig. 3: Data structure of the bitmap for hash-mixed query.
The second difference is that we keep, with each quotient candidate, counters of the ones (bit = 1) in its bitmap, one per layer. We call these counters Nb_ones_1 for layer P and Nb_ones_2 for layer N. They are incremented at each bit switch (0 to 1) in the corresponding layer of the quotient candidate's bitmap. Hereafter is the pseudo-code of this stage:
In such a way, final quotients are automatically sorted in decreasing order of their satisfaction levels. The cell whose index is 0 points to the best quotients (those satisfying the whole set of requirements and violating none of the prohibitions). Hence, to select the top-k answers, we just need to browse the indexed table from the top (from the quotients with the highest satisfaction level to the lowest ones) until k quotients are found. This sorting technique offers a better discrimination between accepted quotients, at no additional cost.
In the light of the above, we have been able to combine two types of associations (positive and negative) in a single operator. The resulting operator is not complex, since it does not need to handle each operation (division and anti-division) separately. Furthermore, it requires no iterations. Hence, thanks to the new unified operator, users can introduce requirements and prohibitions simultaneously in a simple manner.
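The following Python sketch puts the stages described above together. It should be read as a hedged illustration rather than as the algorithm of this paper. In particular, three choices are assumptions of ours: (i) the indexed table is indexed by the total number of exceptions, i.e. (|P| - Nb_ones_1) + Nb_ones_2, which treats both layers symmetrically; (ii) every dividend element is registered as a candidate, even when none of its rows matches a divisor; (iii) the names `hash_mixed_query`, `top_k` and `max_exceptions` are hypothetical. P and N are assumed to be disjoint.

```python
def hash_mixed_query(dividend, positive, negative, max_exceptions=0):
    """Return an indexed table: cell k lists the quotients with k exceptions."""
    # Stage 1: hash-divisor table, y -> (ind_lyr, num_div_lyr).
    divisor_table = {}
    counters = [0, 0]  # divisor counters of layers P and N
    for ind_lyr, layer in enumerate((positive, negative)):
        for y in layer:
            if y not in divisor_table:
                divisor_table[y] = (ind_lyr, counters[ind_lyr])
                counters[ind_lyr] += 1
    size_p, size_n = counters

    # Stage 2: hash-quotient table, x -> [bitmap, nb_ones_1, nb_ones_2].
    quotient_table = {}
    for x, y in dividend:
        entry = quotient_table.setdefault(x, [0, 0, 0])  # assumption (ii)
        if y not in divisor_table:
            continue
        ind_lyr, num_div_lyr = divisor_table[y]
        offst_lyr = 0 if ind_lyr == 0 else size_p
        bit = 1 << (offst_lyr + num_div_lyr)
        if not entry[0] & bit:          # bit switches from 0 to 1
            entry[0] |= bit
            entry[1 + ind_lyr] += 1

    # Stage 3: indexed table, cell 0 holds the best quotients (assumption (i)).
    ranked = [[] for _ in range(size_p + size_n + 1)]
    for x, (_, nb_ones_1, nb_ones_2) in quotient_table.items():
        exceptions = (size_p - nb_ones_1) + nb_ones_2
        ranked[exceptions].append(x)
    return ranked[: max_exceptions + 1]


def top_k(ranked, k):
    """Browse the indexed table from the top until k quotients are found."""
    answers = []
    for cell in ranked:
        for x in sorted(cell):
            if len(answers) == k:
                return answers
            answers.append(x)
    return answers


# Hypothetical toy data: (customer, (product, order state)) rows.
orders = [
    ("C1", ("P1", "approved")), ("C1", ("P3", "approved")),
    ("C2", ("P1", "approved")), ("C2", ("P2", "approved")),
    ("C3", ("P1", "approved")), ("C3", ("P3", "approved")),
    ("C3", ("P2", "aborted")),
]
golden = [("P1", "approved"), ("P3", "approved")]
critical = [("P2", "aborted"), ("P4", "aborted")]

ranked = hash_mixed_query(orders, golden, critical, max_exceptions=1)
best_two = top_k(ranked, 2)
```

Since the ranking is obtained by a single pass over the hash-quotient table using the per-layer counters, no separate sorting step is needed, which is the point made above. Setting `max_exceptions=0` gives the strict version.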

Experimentations
We consider four sizes for the dividend relation: 3×10^4, 5×10^5, 3×10^6, and 5×10^8 tuples, randomly generated. The sizes considered for the divisor relation are 10, 20, 50, and 100, uniformly distributed over layers P and N. The results obtained are gathered in Table 1. Run-time is measured in seconds.
Table 1 shows the run-times of our variants of the mixed query, compared with the classic one presented in [15], where several successive classical divisions are involved. We can notice that our approaches run much faster than the classic one for the four dividend sizes. Indeed, the run-time is improved by several orders of magnitude (the gain factor is greater than 61 in the case of 5×10^8 dividend tuples). In addition, the implementation requires roughly the same run-time regardless of the investigated form of the mixed query. For the largest dividend relation, the run-time is approximately equal to 120 s for the three variants (the strict form, the symmetrical-impact form, and the hierarchical form).

Parallel implementation
The parallel implementation is realized with the PVM (Parallel Virtual Machine) framework, on machines based on an Intel i5 CPU with 8 GB of RAM. Experiments were performed over 2, 4 and 6 nodes. The parallelism strategy is as follows:
1. The hash-divisor table is created only once, on a single node called the master.
2. The master sends the created hash-divisor table to all other nodes.
3. The dividend table is uniformly partitioned between all nodes.
4. Each node builds its own hash-quotient table. The hash function may differ between the nodes, depending on the memory space of each one.
5. When all sub-tables of the hash-quotient table are completely constructed in all nodes, the master collects those sub-tables. Then, it merges all of them into one global hash-quotient table to select valid quotients.
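The five steps above can be sketched in Python as follows. This is not the PVM implementation used in our experiments: worker threads from the standard `concurrent.futures` module stand in for the nodes, dictionaries stand in for the hash tables, and only the strict form of the mixed query is selected at the end. All function names and the toy data are hypothetical, and P is assumed to be non-empty and disjoint from N.

```python
from concurrent.futures import ThreadPoolExecutor


def build_divisor_table(positive, negative):
    """Steps 1-2: the master builds the table once; every node reads it."""
    table, size_p = {}, len(positive)
    for num, y in enumerate(positive):
        table[y] = num                 # bits of layer P come first
    for num, y in enumerate(negative):
        table[y] = size_p + num        # bits of layer N are offset by |P|
    return table


def build_partial_quotients(partition, divisor_table):
    """Step 4: each node builds its own hash-quotient sub-table."""
    partial = {}
    for x, y in partition:
        bit = divisor_table.get(y)
        if bit is not None:
            partial[x] = partial.get(x, 0) | (1 << bit)
    return partial


def merge(partials):
    """Step 5: the master merges the sub-tables with a bitwise OR."""
    merged = {}
    for partial in partials:
        for x, bitmap in partial.items():
            merged[x] = merged.get(x, 0) | bitmap
    return merged


def parallel_strict_mixed(dividend, positive, negative, nodes=2):
    divisor_table = build_divisor_table(positive, negative)
    partitions = [dividend[i::nodes] for i in range(nodes)]   # step 3
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(
            lambda part: build_partial_quotients(part, divisor_table),
            partitions))
    merged = merge(partials)
    expected = (1 << len(positive)) - 1   # all P bits set, no N bit set
    return {x for x, bitmap in merged.items() if bitmap == expected}


# Hypothetical toy data: (customer, (product, order state)) rows.
orders = [
    ("C1", ("P1", "approved")), ("C1", ("P3", "approved")),
    ("C2", ("P1", "approved")), ("C2", ("P2", "approved")),
    ("C3", ("P1", "approved")), ("C3", ("P3", "approved")),
    ("C3", ("P2", "aborted")),
]
golden = [("P1", "approved"), ("P3", "approved")]
critical = [("P2", "aborted"), ("P4", "aborted")]

result = parallel_strict_mixed(orders, golden, critical, nodes=2)
```

The bitwise OR in the merge step is what makes the partitioning transparent: a candidate whose rows are spread over several nodes ends up with the same bitmap as in the sequential version. If per-layer counters of ones are needed for ranking, they would have to be recomputed from the merged bitmap, since summing the per-node counters could count the same bit twice.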
The pseudo-code of the last step (point 5), executed in the master, is given in Algorithm 3 (parallel implementation of the mixed query).
From the results obtained with the parallel implementation of the hash-mixed query, illustrated in Fig. 5, we observed a linear effect on speed-up in the case of a large dividend (≥ 3×10^6). However, an additional, though still negligible, cost occurs for a relatively small dividend (≤ 5×10^5).
In summary, the first results of the hash-mixed query presented in this paper are encouraging. The proposed approach has been successful in processing these complex forms of universal-quantification-based queries efficiently. Nevertheless, there is still a need for multiple implementations in a real DBMS to firmly validate the proposed hash-mixed approach.

Conclusion and perspectives
We have presented in this paper a unified operator to deal with universal-quantification-based queries involving positive and negative preferences (desired and forbidden associations) simultaneously. Our new technique is then improved by relying on the hash-division algorithm. Moreover, the issue of answer ranking is dealt with. We have conducted experiments, particularly for large relations, and compared the execution time with that of the original approaches (nested-loop algorithms) proposed in the literature. As expected, the performance obtained is very interesting: we have been able to improve the response time of some queries by several orders of magnitude. We have also presented a parallel version of the mixed query, for which we have obtained a near-linear speed-up, especially for large tables. We are currently designing new forms of complex queries in which more than two layers, several kinds of preferences, and several connectors come into play. Furthermore, there is still a need for multiple implementations in a real DBMS to firmly validate the proposed hash-mixed approach. It will also be interesting to look at other parallelism strategies that take into account the data skew issue, which causes deteriorations in performance.

Fig. 5: Speed-up for parallel algorithm of the hash-mixed query.
Pseudo-code of the hash-divisor table building for the mixed query is given hereafter:
Algorithm 1 Building of the hash-divisor table for the Hash-mixed query
    num_divisorsP ← 0; num_divisorsN ← 0; /* initialize the two counters to zero */
    for each tuple t in the divisor relation do

Table 1: Experimental results for the hash-mixed query algorithms (b: positive and negative part as hierarchical preferences).

Algorithm 3 Parallel implementation of the mixed query (merge step in the master)
    for each sub-hash-quotient table received from the slave nodes do
        for each quotient candidate in the sub-hash-quotient table do
            Compute the hash bucket (Hqot) using the master hash function, over the quotient value of the candidate;
            if the candidate (quotient value) is already contained in the hash-quotient table, constructed in the master, at the bucket Hqot then
                Update the bitmap of the candidate by calculating the result of the binary OR operator between the bitmap in the master and that received from the node;
            else
                Insert a new quotient candidate into the hash-quotient table of the master at the bucket Hqot, with a bitmap equal to that received from the node;