Top-k Querying of Unknown Values under Order Constraints



Introduction
Many data analysis tasks involve queries over ordered data, such as maximum and top-k queries, which must often be evaluated in the presence of unknown data values. This problem occurs in many real-life scenarios: retrieving or computing exact data values is often expensive, but querying the partially unknown data may still be useful to obtain approximate results, or to decide which data values should be retrieved next. In such contexts, we can often make use of order constraints relating the data values, even when they are unknown: for instance, we may know that object A is preferred to object B (though we do not know their exact ratings). This paper thus studies the following general problem. We consider a set of numerical values, some of which are unknown, and we assume a partial order on these values: we may know that $x \le y$ should hold although the values of $x$ or $y$ are unknown. Our goal is to estimate the unknown values in a principled way, and to evaluate top-k queries, namely to find the items with the highest (estimated) values.
Without further information, one may assume that every valuation compatible with the order constraints is equally likely, i.e., build a probabilistic model where valuations are uniformly distributed. Indeed, assuming a uniform distribution in the absence of prior knowledge is common in probabilistic data management [1,11,33] for continuous distributions on data values within an interval; here we generalize this to a uniform distribution over multiple unknown values. Though the distribution is uniform, the dependencies between values lead to non-trivial insights about unknown values and top-k results, as we will illustrate.
Our results also include a review of existing definitions for top-k over uncertain data, which motivates our particular choice of definition (in Section 6). We survey related work in more depth in Section 7 and conclude in Section 8. Complete proofs of our results are given in the appendix for lack of space.

Preliminaries and Problem Statement
This section introduces the formal definitions for the problem that we study in this paper. We model known and unknown item values as variables, and order constraints as equalities and inequalities over them. Then we define the possible valuations for the variables via possible-world semantics, and use this semantics to define a uniform distribution where all worlds are equally likely. The problem of top-k querying over unknown values can then be formally defined with respect to the expected values of variables in the resulting distribution.

Unknown Data Values under Constraints
Our input includes a set $X = \{x_1, \ldots, x_n\}$ of variables with unknown values $v(x_1), \ldots, v(x_n)$, which we assume to be in the range $[0, 1]$. We consider two kinds of constraints over them: order constraints, written $x_i \le x_j$ for $x_i, x_j \in X$, encoding that $v(x_i) \le v(x_j)$; and exact-value constraints to represent variables with known values, written $x_i = \alpha$ for $0 \le \alpha \le 1$ and for $x_i \in X$, encoding that $v(x_i) = \alpha$.
In what follows, a constraint set with constraints of both types is typically denoted $C$. We assume that constraints in $C$ are not contradictory (e.g., we forbid $x = 0.1$, $y = 0.2$, $y \le z$, and $z \le x$), and that they are closed under implication: e.g., if $x = \alpha$, $y = \beta$ are given, and $\alpha \le \beta$, then $x \le y$ is implied and thus should also be in $C$. We can check in PTIME that $C$ is non-contradictory by simply verifying that it does not entail a false inequality on exact values (e.g., $0.2 \le 0.1$ as in our previous example). The closure of $C$ can be found in PTIME as a transitive closure computation [26] that also considers exact-value constraints. We denote by $X_{\mathrm{exact}}$ the subset of $X$ formed of variables with exact-value constraints.
Example 1. In the product classification example from the Introduction, a variable $x_i \in X$ would represent the compatibility score of the product to the $i$-th category. If the score is known, we would encode it as a constraint $x_i = \alpha$. In addition, $C$ would contain the order constraint $x_i \le x_j$ whenever category $i$ is a sub-category of $j$ (recall that the score of a sub-category cannot be higher than that of an ancestor category).
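The consistency check and closure described above can be sketched as follows. This is a minimal illustration, not the procedure of [26]: the function name `close_constraints` and the input encoding (a list of variable names, a list of pairs `(a, b)` meaning a ≤ b, and a dictionary of exact values) are assumptions made for the example, and the sketch does not derive exact values that are forced indirectly.

```python
from fractions import Fraction
from itertools import product

def close_constraints(variables, order, exact):
    """Return the implied order relation as a set of pairs (a, b), meaning a <= b.

    Raises ValueError if the constraints entail a false inequality
    between two exact values.
    """
    leq = {(v, v) for v in variables} | set(order)
    # Variables with known values are comparable through those values.
    for a, b in product(exact, repeat=2):
        if exact[a] <= exact[b]:
            leq.add((a, b))
    # Warshall-style transitive closure (the pivot k is the outermost loop).
    for k in variables:
        for i in variables:
            if (i, k) in leq:
                for j in variables:
                    if (k, j) in leq:
                        leq.add((i, j))
    # A contradiction appears as a derived a <= b with exact[a] > exact[b].
    for a, b in product(exact, repeat=2):
        if (a, b) in leq and exact[a] > exact[b]:
            raise ValueError(f"contradiction: {a} <= {b} is entailed")
    return leq

closure = close_constraints(["x", "y", "z"], [("x", "y"), ("y", "z")], {})
print(("x", "z") in closure)
```

The three nested loops give a cubic running time in the number of variables, which is consistent with the PTIME claim but makes no attempt at efficiency.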

Possible World Semantics
The unknown data captured by $X$ and $C$ makes infinitely many valuations of $X$ possible (including the true one). We model these options via possible world semantics: a possible world $w$ for a constraint set $C$ over $X = \{x_1, \ldots, x_n\}$ is a vector of values $w = (v_1, \ldots, v_n) \in [0, 1]^n$, corresponding to setting $v(x_i) := v_i$ for all $i$, such that all the constraints of $C$ hold under this valuation. The set of all possible worlds is denoted by $\mathrm{pw}_X(C)$, or by $\mathrm{pw}(C)$ when $X$ is clear from context.
Notice that $C$ can be encoded as a set of linear constraints, i.e., a set of inequalities between linear expressions on $X$ and constants in $[0, 1]$. Thus, following common practice in linear programming, the feasible region of a set of linear constraints ($\mathrm{pw}(C)$ in our setting) can be characterized geometrically as a convex polytope, termed the admissible polytope: writing $n := |X|$, each linear constraint defines a feasible half-space of $\mathbb{R}^n$ (e.g., the half-space where $x \le y$), and the convex polytope $\mathrm{pw}(C)$ is the intersection of all half-spaces. In our setting the polytope $\mathrm{pw}(C)$ is bounded within $[0, 1]^n$, and it is non-empty by our assumption that $C$ is not contradictory. With exact-value constraints, or order constraints such as $x_i \le x_j$ and $x_j \le x_i$, it may be the case that the dimension of this admissible polytope is less than $|X|$. Computing this dimension can easily be done in PTIME (see, e.g., [43]).
Example 2. Let $X = \{x, y, z\}$. If $C = \{x \le y\}$, the admissible polytope has dimension 3 and is bounded by the planes defined by $x = y$, $x = 0$, $y = 1$, $z = 0$ and $z = 1$. If we add to $C$ the constraint $y = 0.3$, the admissible polytope is a 2-dimensional rectangle bounded by $0 \le x \le 0.3$ and $0 \le z \le 1$ on the $y = 0.3$ plane. We cannot add, for example, the constraint $x = 0.5$, because $C$ would become contradictory.

Probability Distribution
Having characterized the possible worlds of pw(C), we assume a uniform probability distribution over pw(C), as indicated in the Introduction. This captures the case when all possible worlds are equally likely, and is a natural choice when we have no information about which valuations are more probable.
Since the space of possible worlds is continuous, we formally define this distribution via a probability density function (pdf), as follows. Let $X$ and $C$ define a $d$-dimensional polytope $\mathrm{pw}_X(C)$ for some integer $d$. The $d$-volume (also called the Lebesgue measure [28] on $\mathbb{R}^d$) is a measure for continuous subsets of $d$-dimensional space, which coincides with length, area, and volume for dimensions 1, 2, and 3, respectively. We denote by $V_d(C)$ the $d$-volume of the admissible polytope, or simply $V(C)$ when $d$ is the dimension of $\mathrm{pw}(C)$.

Definition 3.
The uniform pdf $p$ maps each possible world $w \in \mathrm{pw}(C)$ to the constant $p(w) := 1/V(C)$.

Top-k Queries
We are now ready to formally define the main problem studied in this paper, namely, the evaluation of top-k queries over unknown data values. The queries that we consider retrieve the k items that are estimated to have the highest values, along with their estimated values, with ties broken arbitrarily. We further allow queries to apply a selection operator σ on the items before performing the top-k computation. In our example from the Introduction, this is what allows us to select the top-k categories among only the end categories. We denote the subset of X selected by σ as X σ .
If all item values are known, the semantics of top-k queries is clear. In presence of unknown values, however, the semantics must be redefined to determine how the top-k items and their values are estimated. In this paper, we estimate unknown items by their expected value over all possible worlds, i.e., their expected value according to the uniform pdf p defined above on pw(C). This corresponds to interpolating the unknown values from the known ones, and then querying the result. We use these interpolated values to define the top-k problem as computing the k variables with the highest expected values, but we also study on its own the interpolation problem of computing the expected values.
To summarize, the two formal problems that we study on constraint sets are:
Interpolation. Given a constraint set $C$ over $X$ and a variable $x \in X$, the interpolation problem for $x$ is to compute the expected value of $x$ in the uniform distribution over $\mathrm{pw}_X(C)$.
Top-k. Given a constraint set $C$ over $X$, a selection predicate $\sigma$, and an integer $k$, the top-k computation problem is to compute the ordered list of the $k$ maximal expected values of variables in $X_\sigma$ (or fewer if $|X_\sigma| < k$), with ties broken arbitrarily.
We review other definitions of top-k on uncertain data in Section 6, where we justify our choice of semantics.
Alternate phrasing. The Interpolation problem can also be defined geometrically, as the computation of the centroid (or center of mass) of the admissible polytope: the point $G$ such that all vectors relative to $G$ originating at points within the polytope sum to zero. The constraints that we study correspond to a special kind of polytopes, for which we will design a specific algorithm in the next section, and derive an $\mathrm{FP}^{\#\mathrm{P}}$ membership bound which does not hold for general polytopes (as explained in the Introduction). However, the geometric connection will become useful when we study the complexity of our problem in Section 4.1.
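To make the two problem statements concrete, the sketch below estimates expected values by naive rejection sampling and then ranks the selected variables. It is not one of the algorithms studied in this paper: it is only a numerical illustration of the semantics, its acceptance rate can be exponentially small, and the names `estimate_expected_values` and `top_k` as well as the input encoding are assumptions made for the example.

```python
import random

def estimate_expected_values(variables, order, exact, samples=20000, seed=0):
    """Monte Carlo estimate of the centroid of pw(C).

    Unknown variables are drawn uniformly in [0, 1], known ones are fixed,
    and worlds violating an order constraint (a, b), i.e. a <= b, are rejected.
    """
    rng = random.Random(seed)
    totals = {v: 0.0 for v in variables}
    accepted = 0
    for _ in range(samples):
        world = {v: exact[v] if v in exact else rng.random() for v in variables}
        if all(world[a] <= world[b] for a, b in order):
            accepted += 1
            for v in variables:
                totals[v] += world[v]
    if accepted == 0:
        raise ValueError("no sampled world satisfied the constraints")
    return {v: totals[v] / accepted for v in variables}

def top_k(expected, selected, k):
    """Return the k selected variables with highest estimated expected value."""
    ranked = sorted(selected, key=lambda v: expected[v], reverse=True)
    return [(v, expected[v]) for v in ranked[:k]]

est = estimate_expected_values(["x", "y"], [("x", "y")], {})
print(top_k(est, ["x", "y"], 1))
```

For the single constraint x ≤ y, the estimates approach 1/3 for x and 2/3 for y, the centroid of the triangle below the diagonal.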

An Algorithm for Interpolation and Top-k
Having formally defined the problems that we study, we begin our complexity analysis by designing an algorithm that computes the expected value of variables. The algorithm enumerates all possible orderings of the variables (to be defined formally below), but it is still nontrivial: we must handle exact-value constraints specifically, and we must compute the probability of each ordering to determine its weight in the overall expected value computation. From the algorithm, we will deduce that our interpolation and top-k problems are in $\mathrm{FP}^{\#\mathrm{P}}$.
Eliminating ties. To simplify our study, we will eliminate from the start the problem of ties, which will allow us to assume that values in all worlds are totally ordered. We say that a possible world $w = (v_1, \ldots, v_n)$ of $C$ has a tie if $v_i = v_j$ for some $i \ne j$. Note that occasional ties, not enforced by $C$, have an overall probability of 0: intuitively, if the admissible polytope is $d$-dimensional, then all the worlds where $v_i = v_j$ correspond to a $(d-1)$-dimensional hyperplane bounding or intersecting the polytope. A finite set of such hyperplanes (one for every pair of variables) has total $d$-volume 0. Since our computations (volume, expected value) involve integrating over possible worlds, a set of worlds with total probability 0 does not affect the result.
What is left is to consider ties enforced by $C$ (and thus having probability 1). In such situations, we can rewrite $C$ by merging these variables to obtain an equivalent constraint set where ties have probability 0. Formally:
Lemma 4. For any constraint set $C$, we can construct in PTIME a constraint set $C'$ such that the probability that the possible worlds of $C'$ have a tie (under the uniform distribution) is zero, and such that any interpolation or top-k computation problem on $C$ can be reduced in PTIME to the same type of problem on $C'$.
Hence, we assume from now on that ties have zero probability in $C$, so that we can ignore possible worlds with ties without affecting the correctness of our analysis. Note that this implies that all of our results also hold for strict inequality constraints, of the forms $x < y$ and $x \ne y$.
Under this assumption, we first study in Section 3.1 the case where C is a total order. We then handle arbitrary C by aggregating over possible variable orderings, in Section 3.2.

Total Orders
In this section we assume $C$ is a total order $C_1^n(\alpha, \beta)$ defined as $x_0 \le x_1 \le \cdots \le x_n \le x_{n+1}$, where $x_0 = \alpha$ and $x_{n+1} = \beta$ are variables with exact-value constraints in $X_{\mathrm{exact}}$.
We first consider unfragmented total orders, where $x_1, \ldots, x_n \notin X_{\mathrm{exact}}$. In this case, we can show that the expected value of $x_i$, for $1 \le i \le n$, corresponds to a linear interpolation of the unknown variables between $\alpha$ and $\beta$, namely: $\frac{i}{n+1} \cdot (\beta - \alpha) + \alpha$. This can be shown formally via a connection to the expected value of the order statistics of samples from a uniform distribution, which follow a Beta distribution [21].
Now consider the case of fragmented total orders, where $C$ is allowed to contain more exact-value constraints than the ones on $\alpha$ and $\beta$. We observe that we can split the total order into fragments: by cutting at each variable that has an exact-value constraint, we obtain sub-sequences of variables which follow an unfragmented total order. We can then compute the expected values of each fragment independently, and compute the total order volume as the product of the fragment volumes. The correctness of this computation follows from a more general result (Lemma 12) stated and proven in Section 5.
Hence, given a constraint set $C$ imposing a (possibly fragmented) total order, the expected value of $x_i$ can be computed as follows. If $x_i \in X_{\mathrm{exact}}$, the analysis is trivial. Otherwise, we consider the fragment $C_{p+1}^{q-1}(v_p, v_q)$ that contains $x_i$; namely, $p$ is the maximal index such that $0 \le p < i$ and $x_p \in X_{\mathrm{exact}}$, and $q$ is the minimal index such that $i < q \le n + 1$ and $x_q \in X_{\mathrm{exact}}$. The expected value of $x_i$ can then be computed within $C_{p+1}^{q-1}(v_p, v_q)$ using linear interpolation.
The following proposition summarizes our findings: Proposition 5. Given a constraint set C implying a total order, the expected value of any variable x i ∈ X can be computed in PTIME.
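The fragment-by-fragment interpolation for a total order can be sketched as below. The function name `interpolate_total_order` and the list encoding (position i holds the exact value of the i-th variable in the order, or `None` if it is unknown) are assumptions made for this example.

```python
from fractions import Fraction

def interpolate_total_order(values):
    """Expected values along a (possibly fragmented) total order.

    values[i] is the exact value of the i-th variable, or None if unknown.
    The first and last entries must be known (they play the role of
    alpha and beta), and known values are assumed to be increasing.
    """
    if values[0] is None or values[-1] is None:
        raise ValueError("both endpoints need exact values")
    result = [None if v is None else Fraction(v) for v in values]
    p = 0  # index of the last variable with an exact value seen so far
    for q in range(1, len(values)):
        if values[q] is None:
            continue
        n = q - p - 1  # number of unknown variables in the fragment
        alpha, beta = result[p], result[q]
        for i in range(p + 1, q):
            # Linear interpolation inside the fragment between p and q.
            result[i] = alpha + Fraction(i - p, n + 1) * (beta - alpha)
        p = q
    return result

print(interpolate_total_order([0, None, None, Fraction(1, 2), None, 1]))
```

In this call the two unknowns between 0 and 1/2 receive 1/6 and 1/3, and the single unknown between 1/2 and 1 receives 3/4; each variable is visited once, so the running time is linear in the length of the order.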

General Constraint Sets
We can now extend the result for total orders to an expression of the expected value for a general constraint set $C$. We apply the previous process to each possible total ordering of the variables, and aggregate the results. To do this, we define the notion of linear extensions, inspired by partial order theory:
Definition 6. Given a constraint set $C$ over $X$, we say that a constraint set $T$ is a linear extension of $C$ if (i) $T$ is a total order; (ii) the exact-value constraints of $T$ are exactly those of $C$; and (iii) $C \subseteq T$, namely every constraint $x \le y$ in $C$ also holds in $T$.
Algorithm 1 presents our general scheme to compute the expected value of a variable $x \in X$ under an arbitrary constraint set $C$, assuming the uniform distribution on $\mathrm{pw}(C)$.
The algorithm iterates over each linear extension $T$ of $C$, and computes the expected value of $x$ in $T$ and the overall probability of $T$ in $\mathrm{pw}(C)$.
Algorithm 1: Compute the expected value of a variable
Input: Constraint set $C$ on variables $X$ with $n := |X|$ where ties have probability 0 and where value range $[0, 1]$ is enforced by constraints; variable $x \in X$
Output: Expected value of $x$
1 if $x$ is in $X_{\mathrm{exact}}$, i.e., has an exact-value constraint in $C$ to some $v$ then return $v$;
[...]
7 Write $k$ the index in $T$ of the variable $x$;
8 Write $i_j$, $i_{j+1}$ the indices of variables from $X_{\mathrm{exact}}$ s.t. $i_j < k < i_{j+1}$;
[...]
14 Function ExpectedValFrag($p$, $q$, $k$, $\alpha$, $\beta$)
Input: $p < q$: indices; $k$: requested variable index; $\alpha < \beta$: exact values at $p$, $q$ resp.
Output: Expected value of variable $x_k$ in the fragment $C_{p+1}^{q-1}(\alpha, \beta)$
15 $n \leftarrow q - p - 1$; // num. of variables in the fragment which are $\notin X_{\mathrm{exact}}$
16 return $\frac{k-p}{n+1} \cdot (\beta - \alpha) + \alpha$; // linear interpolation (see Section 3.1)
17 Function VolumeFrag($p$, $q$, $\alpha$, $\beta$)
Input: $p < q$: indices; $\alpha < \beta$: exact values at $p$, $q$ resp.
Output: $V(C_{p+1}^{q-1}(\alpha, \beta))$: volume of the total order fragment between indices $p$, $q$
18 $n \leftarrow q - p - 1$; // num. of variables in the fragment which are $\notin X_{\mathrm{exact}}$
19 return $\frac{(\beta-\alpha)^n}{n!}$; // Volume of $[\alpha, \beta]^n$ divided by num. of total orders
A linear extension is a total order, so $x$ is within a particular fragment of it, namely, between the indices of two consecutive variables with exact-value constraints, $i_j$ and $i_{j+1}$. The expected value of $x$ in $T$, denoted by $E_T[x]$, is then affected only by the constraints and variables of this fragment, and can be computed using linear interpolation by the function ExpectedValFrag (line 9). Now, the final expected value of $x$ in $C$ is the average of all $E_T[x]$ weighted by the probability of each linear extension $T$, i.e., the volume of $\mathrm{pw}(T)$ divided by the volume of $\mathrm{pw}(C)$. Recall that, by Lemma 4, worlds with ties have total volume 0 and do not affect this expected value. We compute the volume of $T$ as the product of volumes of its fragments (line 10).
The volume of a fragment, computed by function VolumeFrag, is the volume of $[\alpha, \beta]^n$, i.e., all assignments to the $n$ variables of the fragment in $[\alpha, \beta]$, divided by the number of orderings of these variables, to obtain the volume of one specific order (line 19).
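The scheme described above can be turned into a small brute-force program. The sketch below is an illustration under stated assumptions rather than a transcription of Algorithm 1: it enumerates linear extensions by filtering all permutations instead of using the enumeration of [40], it uses two hypothetical sentinel names `_min` and `_max` to enforce the range [0, 1], and it assumes that exact values are pairwise distinct and lie strictly between 0 and 1. The helper names mirror ExpectedValFrag and VolumeFrag.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def expected_val_frag(p, q, k, alpha, beta):
    n = q - p - 1  # unknown variables strictly between positions p and q
    return alpha + Fraction(k - p, n + 1) * (beta - alpha)

def volume_frag(p, q, alpha, beta):
    n = q - p - 1
    return (beta - alpha) ** n / factorial(n)

def expected_value(variables, order, exact, x):
    """Expected value of x under the uniform distribution on pw(C)."""
    if x in exact:
        return Fraction(exact[x])
    known = {"_min": Fraction(0), "_max": Fraction(1)}
    known.update({v: Fraction(a) for v, a in exact.items()})
    weighted_sum = total_volume = Fraction(0)
    for perm in permutations(variables):
        T = ("_min",) + perm + ("_max",)
        pos = {v: i for i, v in enumerate(T)}
        if any(pos[a] > pos[b] for a, b in order):
            continue  # perm violates an order constraint a <= b
        cuts = [i for i, v in enumerate(T) if v in known]
        bounds = list(zip(cuts, cuts[1:]))
        if any(known[T[p]] >= known[T[q]] for p, q in bounds):
            continue  # exact values must increase along T
        volume, e_x = Fraction(1), None
        for p, q in bounds:
            a, b = known[T[p]], known[T[q]]
            volume *= volume_frag(p, q, a, b)
            if p < pos[x] < q:
                e_x = expected_val_frag(p, q, pos[x], a, b)
        weighted_sum += volume * e_x
        total_volume += volume
    return weighted_sum / total_volume

print(expected_value(["a", "b"], [("a", "b")], {}, "a"))
```

For two variables with the single constraint a ≤ b, the only linear extension places a first, and the function returns 1/3. Filtering permutations costs factorial time in the number of variables, so the sketch is only usable on very small inputs.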
The complexity of Algorithm 1 is polynomial in the number of linear extensions of $C$, as we can enumerate them in constant amortized time [40]. However, in the general case, there may be up to $|X|!$ linear extensions. To obtain an upper bound in the general case, we note that we can rescale all constraints so that all numbers are integers, and then nondeterministically sum over the linear extensions. This yields our $\mathrm{FP}^{\#\mathrm{P}}$ upper bound:
Theorem 7. Given a constraint set $C$ over $X$ and $x \in X$ (resp., and a selection predicate $\sigma$, and an integer $k$), determining the expected value of $x$ in $\mathrm{pw}(C)$ under the uniform distribution (resp., the top-k computation problem over $X$, $C$, and $\sigma$) is in $\mathrm{FP}^{\#\mathrm{P}}$.
The $\mathrm{FP}^{\#\mathrm{P}}$ membership for interpolation does not extend to centroid computation in general convex polytopes, which is not in $\mathrm{FP}^{\#\mathrm{P}}$ [31,41]. Our algorithm thus relies on the fact that the polytope $\mathrm{pw}(C)$ is of a specific form, defined with order and exact-value constraints. The same upper bound for the top-k problem immediately follows. We will show in Section 4 that this $\mathrm{FP}^{\#\mathrm{P}}$ upper bound is tight.
We also provide a complete example to illustrate the constructions of this section.
Full Example. We exemplify our scheme on variables $X = \{x, y, y', z\}$ and on the constraint set $C$ generated by the order constraints $x \le y$, $y \le z$, $x \le y'$, $y' \le z$ and the exact-value constraint $y' = \gamma$ for some fixed $0 < \gamma < 1$. Remember that we necessarily have $0 \le x$ and $z \le 1$ as well. The constraints of $C$ are closed under implication, so they also include $x \le z$. The figure shows the Hasse diagram of the partial order defined by $C$ on $X$. Note that ties have a probability of zero in $\mathrm{pw}(C)$.
[Figure: Hasse diagram on the nodes $x$, $y$, $y' = \gamma$, and $z$]
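The two linear extensions of $C$ are $T_1: x \le y \le y' \le z$ and $T_2: x \le y' \le y \le z$. Now, $T_1$ is a fragmented total order, and we have $\mathrm{pw}(T_1) = \mathrm{pw}(C') \times \{\gamma\} \times [\gamma, 1]$, where $C'$ is defined on variables $\{x, y\}$ by $0 \le x \le y \le \gamma$. We can compute the volume of $\mathrm{pw}(T_1)$ as $\alpha_1 = \frac{\gamma^2}{2} \cdot (1 - \gamma)$, and similarly the volume of $\mathrm{pw}(T_2)$ as $\alpha_2 = \gamma \cdot \frac{(1-\gamma)^2}{2}$. Let us compute the expected value of $y$ for $C$. In $T_1$ its expected value is $\mu_1 = \frac{2}{3} \cdot \gamma$, by linear interpolation between $0$ and $\gamma$; in $T_2$ it is $\mu_2 = \frac{1}{3} \cdot (1 - \gamma) + \gamma$, by linear interpolation between $\gamma$ and $1$. The overall expected value of $y$ is the average of these expected values weighted by total order probabilities (volume fractions), namely $E_C[y] = \frac{\alpha_1 \mu_1 + \alpha_2 \mu_2}{\alpha_1 + \alpha_2}$.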
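As a sanity check on this example, the weighted average can be evaluated in closed form. The derivation below is a worked calculation added for illustration, not a formula from the original text: it applies the fragment volume and linear interpolation rules of Algorithm 1, writing $\alpha_i$ for the volume of $\mathrm{pw}(T_i)$ and $\mu_i$ for the expected value of $y$ in $T_i$, and it assumes that $y$ lies below the variable fixed to $\gamma$ in $T_1$ and above it in $T_2$.

```latex
\begin{align*}
\alpha_1 &= \frac{\gamma^2}{2!}\cdot\frac{1-\gamma}{1!} = \frac{\gamma^2(1-\gamma)}{2}, &
\mu_1 &= \frac{2}{3}\,\gamma,\\
\alpha_2 &= \frac{\gamma}{1!}\cdot\frac{(1-\gamma)^2}{2!} = \frac{\gamma(1-\gamma)^2}{2}, &
\mu_2 &= \gamma + \frac{1}{3}(1-\gamma) = \frac{1+2\gamma}{3},\\
\frac{\alpha_1}{\alpha_1+\alpha_2} &= \gamma, &
\frac{\alpha_2}{\alpha_1+\alpha_2} &= 1-\gamma,\\
E_C[y] &= \gamma\cdot\frac{2\gamma}{3} + (1-\gamma)\cdot\frac{1+2\gamma}{3} = \frac{1+\gamma}{3}.
\end{align*}
```

The two limits behave as expected: as $\gamma \to 0$ the value tends to $1/3$ (only $y \le z$ remains informative), and as $\gamma \to 1$ it tends to $2/3$ (only $x \le y$ remains informative).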

Hardness and Approximations
We next show that the intractability of Algorithm 1 in Section 3 is probably unavoidable. We first show matching lower bounds for interpolation and top-k in Section 4.1. We then turn in Section 4.2 to the problem of approximating expected values.

Hardness of Exact Computation
We now analyze the complexity of computing an exact solution to our two main problems. We show below a new result for the hardness of top-k. But first, we state the lower bound for the interpolation problem, which is obtained via the geometric characterization of the problem. In previous work, centroid computation is proven to be hard for order polytopes, namely, polytopes without exact-value constraints, which are a particular case of our setting.
We now show a new lower bound for top-k queries: interestingly, these queries are $\mathrm{FP}^{\#\mathrm{P}}$-hard even if they do not have to return the expected values. Recall that $\sigma$ is the selection operator (see Section 2.4), which we use to compute top-k among a restricted subset of variables. We can show hardness even for top-1 queries, and even when $\sigma$ only selects two variables:
Theorem 9. Given a constraint set $C$ over $X$, a selection predicate $\sigma$, and an integer $k$, the top-k computation problem over $X$, $C$ and $\sigma$ is $\mathrm{FP}^{\#\mathrm{P}}$-hard even if $k$ is fixed to be 1, $|X_\sigma|$ is 2, and the top-k answer does not include the expected value of the variables.
Proof sketch. To prove hardness in this case, we reduce from interpolation. We show that a top-1 computation oracle can be used as a comparison oracle to compare the expected value of a variable $x$ to any other rational value $\alpha$, by adding a fresh element $x'$ with an exact-value constraint to $\alpha$ and using $\sigma$ to compute the top-1 among $\{x, x'\}$. What is more technical is to show that, given such a comparison oracle, we can perform the reduction and determine exactly the expected value $v$ of $x$ (a rational number) using only a polynomial number of comparisons to other rationals. This follows from a bound on the denominator of $v$, and by applying the rational number identification scheme of [38]. See the Appendix.
In settings where we do not have a selection operator (i.e., X σ = X ), we can similarly show the hardness of top-k (rather than top-1). See Appendix B.1 for details.
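The idea in the proof sketch of Theorem 9, recovering a rational exactly from comparisons once its denominator is bounded, can be illustrated as follows. This is not the identification scheme of [38]: it is a simpler substitute that bisects until the interval is narrower than $1/D^2$ (two distinct rationals with denominators at most $D$ differ by at least $1/D^2$) and then rounds with `Fraction.limit_denominator`. The function name and the oracle interface are assumptions made for the example.

```python
from fractions import Fraction

def identify_rational(is_at_most, max_denominator):
    """Recover v in [0, 1], whose denominator is at most max_denominator.

    is_at_most(q) must return True when v < q and False when v > q;
    it may return either answer when v == q (ties broken arbitrarily).
    """
    lo, hi = Fraction(0), Fraction(1)
    gap = Fraction(1, max_denominator ** 2)
    comparisons = 0
    while hi - lo >= gap:
        mid = (lo + hi) / 2
        comparisons += 1
        if is_at_most(mid):
            hi = mid  # v <= mid, so v stays in [lo, hi]
        else:
            lo = mid  # v >= mid, so v stays in [lo, hi]
    # v is the only fraction with a small enough denominator near the midpoint.
    v = ((lo + hi) / 2).limit_denominator(max_denominator)
    return v, comparisons

secret = Fraction(5, 24)
print(identify_rational(lambda q: secret <= q, 24))
```

The number of comparisons is about $2 \log_2 D$, hence polynomial in the bit size of the denominator bound.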

Complexity of Approximate Computation
In light of the previous hardness results, we now review approximation algorithms, again via the geometric characterization of our setting. In Section 5, we will show a novel exact solution in PTIME for specific cases.
The interpolation problem can be shown to admit a fully polynomial-time randomized approximation scheme (FPRAS). This result follows from existing work [6,30], using a tractable almost uniform sampling scheme for convex bodies.

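Proposition 10. ([30], Algorithm 5.8). Let $C$ be a set of constraints with variable set $X$ and $x \in X$. There is an FPRAS that determines an estimate $\hat{E}_C[x]$ of the expected value $E_C[x]$ of $x$ in $\mathrm{pw}(C)$ under the uniform distribution.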
This result is mostly of theoretical interest, as the polynomial is in $|X|^7$ (see [6], Table 1), but recent improved sampling algorithms [36] may ultimately yield a practical approximate interpolation technique for general constraint sets (see [20,35]).
For completeness, we mention two natural ways to define randomized approximations for top-k computation. First, we can define the approximate top-k as an ordered list of $k$ items whose expected value does not differ by more than some $\varepsilon > 0$ from that of the item in the actual top-k at the same rank. An FPRAS for this definition of approximate top-k can be obtained from that of Proposition 10. Second, we can ask for an algorithm that returns the actual top-k with high probability. However, it is highly unlikely that there exists a PTIME algorithm that does so, even without requiring it to return the expected values. Indeed, such an algorithm would be in the BPP (bounded-error probabilistic polynomial time) complexity class; yet it follows from Theorem 9 above that deciding whether a set of variables is the top-k is NP-hard, so the existence of the algorithm would entail that NP ⊆ BPP.

Tractable Cases
Given the hardness results in the previous section and the impracticality of approximation, we now study whether exact interpolation and top-k computation can be tractable on restricted classes of constraint sets. We consider tree-shaped constraints (defined formally below) and generalizations thereof: they are relevant for practical applications (e.g., classifying items into tree- or forest-shaped taxonomies), and we will show that our problems are tractable on them. We start with a splitting lemma to decompose constraint sets into "independent" subsets of variables, and then define and study our tractable class.

Splitting Lemma
We will formalize the cases in which the valuations of two variables in $X$ are probabilistically dependent (the variables influence each other), according to $C$. This, in turn, will enable us to define independent subsets of the variables and thus independent subsets of the constraints over these variables. This abstract result will generalize the notion of fragments from total orders (see Section 3.1) to general constraint sets. In what follows, we use $x_i \prec x_j$ to denote the covering relation of the partial order $\le$, i.e., $x_i \le x_j$ is in $C$ but there exists no $x_k \notin \{x_i, x_j\}$ such that $x_i \le x_k$ and $x_k \le x_j$ are in $C$.
Definition 11. We define the influence relation $x \leftrightarrow y$ between variables of $X \setminus X_{\mathrm{exact}}$ as the equivalence relation obtained by the symmetric, reflexive, and transitive closure of the $\prec$ relation on $X \setminus X_{\mathrm{exact}}$.
The uninfluenced classes of $X$ under $C$ are the subsets $X_1, \ldots, X_m$ that partition $X \setminus X_{\mathrm{exact}}$, given by the equivalence classes of the influence relation.
The uninfluence decomposition of $C$ is the collection of constraint sets $C_1, \ldots, C_m$ of $C$ where each $C_i$ has as variables $X_i \cup X_{\mathrm{exact}}$ and contains all exact-value constraints of $C$ and all order constraints between variables of $X_i \cup X_{\mathrm{exact}}$.
We assume w.l.o.g. that $m > 0$, i.e., there are unknown variables in $X \setminus X_{\mathrm{exact}}$; otherwise the uninfluence decomposition is meaningless but any analysis is trivial. Intuitively, two unknown variables $x$, $x'$ are in different uninfluenced classes if in every linear extension there is some variable from $X_{\mathrm{exact}}$ between them, or if they belong to disconnected (and thus incomparable) parts of the partial order. In particular, uninfluenced classes correspond to the fragments of a total order: this is used in Section 3.1. The uninfluence decomposition captures only constraints between variables that influence each other, and constraints that can bound the range of a variable by making it comparable to variables from $X_{\mathrm{exact}}$. We formally prove the independence of $C_1, \ldots, C_m$ via possible-world semantics: every possible world of $C$ can be decomposed into possible worlds of $C_1, \ldots, C_m$, and vice versa.
Example 13. Let $X$ be $\{x, y, y', z, w\}$, and let $C$ be defined by $y' = 0.5$ and $x \le y \le y' \le z$. The uninfluenced classes are $X_1 = \{x, y\}$, $X_2 = \{z\}$, and $X_3 = \{w\}$. The uninfluence decomposition thus consists of $C_1$, with variables $X_1 \cup \{y'\}$, and constraints $x \le y \le y'$ and $y' = 0.5$; $C_2$, with variables $X_2 \cup \{y'\}$, and constraints $y' \le z$ and $y' = 0.5$; and $C_3$, with variables $X_3 \cup \{y'\}$, and constraint $y' = 0.5$.
We next use this independence property to analyse restricted classes of constraint sets.
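The uninfluenced classes can be computed with a union-find structure over the unknown variables, as sketched below. The function name `uninfluenced_classes` and the encoding are assumptions made for this example; the sketch takes the order relation already closed under implication, assumes that ties have been eliminated, and computes the covering relation on all variables before restricting it to the unknown ones.

```python
def uninfluenced_classes(variables, leq, exact):
    """Partition the unknown variables into uninfluenced classes.

    leq is a set of pairs (a, b) meaning a <= b, closed under implication.
    """
    strict = {(a, b) for a, b in leq if a != b}

    def covers(a, b):
        # a is covered by b: a <= b holds and no third variable lies in between.
        return (a, b) in strict and not any(
            (a, c) in strict and (c, b) in strict for c in variables)

    unknown = [v for v in variables if v not in exact]
    parent = {v: v for v in unknown}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a in unknown:
        for b in unknown:
            if covers(a, b):
                parent[find(a)] = find(b)  # merge the two classes

    classes = {}
    for v in unknown:
        classes.setdefault(find(v), []).append(v)
    return list(classes.values())

leq = {("x", "y"), ("x", "yp"), ("x", "z"),
       ("y", "yp"), ("y", "z"), ("yp", "z")}
print(uninfluenced_classes(["x", "y", "yp", "z", "w"], leq, {"yp": 0.5}))
```

On the constraint set of Example 13 (with `yp` standing for the variable fixed to 0.5), the pair (y, z) is not a covering pair because the exact-valued variable lies between them, so z ends up in its own class.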

Tree-Shaped Constraints
We define the first restricted class of constraints that we consider: tree-shaped constraints.
Recall that a Hasse diagram is a representation of a partial order as a directed acyclic graph, whose nodes correspond to X and where there is an edge (x, y) if x ≺ y. An example of such a diagram is the one used in Section 3.2.

Definition 14.
A constraint set C over X is tree-shaped if the probability of ties is zero, the Hasse diagram of the partial order induced on X by C is a directed tree, the root has exactly one child, and exactly the root and leaves are in X exact . Thus, C imposes a global minimal value, and maximal values at each leaf, and no other exact-value constraint.
We call C reverse-tree-shaped if the reverse of the Hasse diagram (obtained by reversing the direction of the edges) is tree-shaped.
Tree-shaped constraints are often encountered in practice, in particular in the context of product taxonomies. Indeed, while our example from Figure 1 is a DAG, many real-life taxonomies are trees: in particular, the Google Product Taxonomy [22] and ACM CCS [2].
We now show that for a tree-shaped constraint set $C$, unlike the general case, we can tractably compute exact expressions of the expected values of variables. In the next two results, we assume arithmetic operations on rationals to have unit cost, e.g., they are performed up to a fixed numerical precision. Otherwise, the complexities remain polynomial but the degrees may be larger. We first show:
Proof sketch. We process the tree bottom-up, propagating a piecewise-polynomial function expressing the volume of the subpolytope on the subtree rooted at each node as a function of the value of the parent node: we compute it using Lemma 12 from the child nodes.
See Appendix C.2 for the complete proof. This result can be applied to prove the tractability of computing the marginal distribution of any variable $x \in X \setminus X_{\mathrm{exact}}$ in a tree-shaped constraint set, which is defined as the pdf p
Theorem 16. For any tree-shaped constraint set $C$ on variable set $X$, for any variable $x \in X \setminus X_{\mathrm{exact}}$, the marginal distribution for $x$ is piecewise polynomial and can be computed
Proof sketch. We proceed similarly to the proof of Theorem 15 but with two functions: one for $x$ and its descendants, and one for all other nodes. The additional $|X_{\mathrm{exact}}|$ factor is because the second function depends on how the value given to $x$ compares to the leaves.
Finally, we deduce that our results for tree-shaped constraints extend to a more general tractable case: constraint sets $C$ whose uninfluence decomposition $C_1, \ldots, C_m$ is such that every $C_i$ is (reverse-)tree-shaped. By Lemma 12, each $C_i$ (and its variables) can be considered independently, and reverse-tree-shaped trees can be easily transformed into tree-shaped ones. Our previous algorithms thus apply to this general case, by executing them on each constraint set of the uninfluence decomposition that is relevant to the task (namely, containing the variable $x$ to interpolate, or top-k candidates from the selected variables $X_\sigma$):
Corollary 17. Given any constraint set $C$ and its uninfluence decomposition $C_1, \ldots, C_m$, assuming that each $C_i$ is a (reverse-)tree-shaped constraint set, we can solve the interpolation problem in time $O(\max_i |X_i|^3)$ and the top-k problem in PTIME.
On large tree-shaped taxonomies (e.g., the Google Product Taxonomy [22]), in an interactive setting where we may ask user queries (e.g., the one in the Introduction), we can improve running times by asking more queries. Indeed, each answer about a category adds an exact-value constraint, and reduces the size of the constraint sets of the uninfluence decomposition, which decreases the overall running time, thanks to the superadditivity of $x \mapsto x^3$. We do not study which variables should be queried in order to reduce the running time of the algorithm; see, e.g., [39] for tree-partitioning algorithms.

Other Variants
We have defined top-k computation on constraint sets by considering the expected value of each variable under the uniform distribution. Compared to different definitions of top-k on unknown values that have been studied in previous work, our definition has some important properties [14]: it provides a ready estimation for unknown values (namely, their expected value) and guarantees an output of size $k$. Moreover, it satisfies the containment property of [14], defined in our setting as follows:
Definition 18. A top-k definition satisfies the containment property if for any constraint set $C$ on variables $X$, for any predicate $\sigma$ (where we write $X_\sigma$ for the selected variables), and for any $k < |X_\sigma|$, letting $S_k$ and $S_{k+1}$ be the ordered lists of top-$k$ and top-$(k+1)$ variables, $S_k$ is a strict prefix of $S_{k+1}$.
The containment property is a natural desideratum: computing the top-$k$ for some $k \in \mathbb{N}$ should not give different variables or a different order for the top-$k'$ with $k' < k$. Our definition clearly satisfies the containment property (except in the case of ties). By contrast, we will now review prominent definitions of top-k on uncertain data from related work [14,46,51], and show that they do not satisfy the containment property when we apply them to the possible world distributions studied in our setting. We focus on two prominent definitions, U-top-k and global-top-k, and call our own definition local-top-k when comparing to them; we also discuss other variants in Appendix D.2.
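The containment property is easy to test mechanically for any concrete top-k procedure. The sketch below is illustrative only: `local_top_k` ranks by a given dictionary of expected values (assumed to be computed elsewhere), and `satisfies_containment` checks the strict-prefix condition of Definition 18 for every admissible k; both names are assumptions made for the example.

```python
def local_top_k(expected, selected, k):
    """Rank the selected variables by decreasing expected value, keep k."""
    ranked = sorted(selected, key=lambda v: expected[v], reverse=True)
    return ranked[:k]

def satisfies_containment(top_k_fn, selected):
    """Check that S_k is a strict prefix of S_{k+1} for all k < |selected|."""
    for k in range(1, len(selected)):
        s_k, s_next = top_k_fn(k), top_k_fn(k + 1)
        if len(s_k) >= len(s_next) or s_next[:len(s_k)] != s_k:
            return False
    return True

expected = {"a": 0.2, "b": 0.9, "c": 0.5}
selected = list(expected)
print(satisfies_containment(lambda k: local_top_k(expected, selected, k),
                            selected))
```

Because `local_top_k` truncates one fixed ranking, the check succeeds by construction; a definition that recomputes the answer for each k from sequence-level probabilities has no such guarantee.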

U-top-k.
The U-top-k variant does not study individual variables but defines the output as the sequence of k variables most likely to be the top-k (in that order), for the uniform distribution on pw(C). We call this alternative definition U-top-k by analogy with [14,46]. Interestingly, the U-top-k and local-top-k definitions sometimes disagree in our setting: There is a constraint set C and selection predicate σ such that local-top-k and U-top-k do not match, even for k = 1 and without returning expected values or probabilities.
We can easily design an algorithm to compute U-top-k in PSPACE and in polynomial time in the number of linear extensions of C: compute the probability of each linear extension as in Algorithm 1, and then sum on linear extensions depending on which top-k sequence they realize (on the variables selected by σ), to obtain the probability of each answer. Hence: Proposition 20. For any constraint set C over X , integer k and selection predicate σ, the U-top-k query for C and σ can be computed in PSPACE and in time O(poly(N )), where N is the number of linear extensions of C.
Unlike Theorem 7, however, this does not imply FP^#P-membership: when selecting the most probable sequence, the number of candidate sequences may not be polynomial (as k is not fixed). We leave to future work an investigation of the precise complexity of U-top-k.
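As a rough illustration of the enumeration scheme behind Proposition 20, the following sketch computes U-top-k by brute force. It assumes a constraint set with order constraints only, so that all linear extensions are equally likely by symmetry; with exact-value constraints each extension would instead need to be weighted by its volume. The function names and the encoding of constraints as pairs are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def linear_extensions(variables, order):
    """Yield each total order (smallest first) compatible with pairs (x, y) meaning x <= y."""
    for perm in permutations(variables):
        pos = {v: i for i, v in enumerate(perm)}
        if all(pos[x] < pos[y] for x, y in order):
            yield perm


def u_top_k(variables, order, selected, k):
    """Most likely top-k sequence (largest first) over the selected variables."""
    counts = Counter()
    for ext in linear_extensions(variables, order):
        ranked = [v for v in reversed(ext) if v in selected]
        counts[tuple(ranked[:k])] += 1
    total = sum(counts.values())
    best, hits = max(counts.items(), key=lambda item: item[1])
    return best, Fraction(hits, total)


# Constraint a <= b, with c unconstrained: b is the maximum in 2 of 3 extensions.
print(u_top_k(["a", "b", "c"], [("a", "b")], {"a", "b", "c"}, 1))
```

The loop over permutations is exponential in the number of variables, which matches the remark that the running time is polynomial only in the number of linear extensions.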
We show that in our setting U-top-k does not satisfy the containment property of [14].

Lemma 21.
There is a constraint set C without ties such that U-top-k does not satisfy the containment property for the uniform distribution on pw(C).

Global-top-k. We now study the global-top-k definition [51], and show that it does not respect the containment property either, even though it is defined on individual variables: the global-top-k query, for a constraint set C, selection predicate σ, and integer k, returns the k variables that have the highest probability in the uniform distribution on pw(C) to be among the variables with the k highest values, sorted by decreasing probability.
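A brute-force sketch of this definition is given below. It makes several assumptions that are not in the text: the constraint set has order constraints only (so all linear extensions are equally likely), the "k highest values" are taken among the selected variables, and ties in probability are broken by variable name. The function name `global_top_k` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def global_top_k(variables, order, selected, k):
    """Rank selected variables by their probability of being among the k largest."""
    in_top, total = Counter(), 0
    for perm in permutations(variables):          # perm lists variables smallest first
        pos = {v: i for i, v in enumerate(perm)}
        if not all(pos[x] < pos[y] for x, y in order):
            continue                              # not a linear extension
        total += 1
        ranked = [v for v in reversed(perm) if v in selected]
        in_top.update(ranked[:k])
    probs = {v: Fraction(in_top[v], total) for v in selected}
    ranking = sorted(selected, key=lambda v: (-probs[v], str(v)))
    return [(v, probs[v]) for v in ranking[:k]]


# Constraint a <= b, with c unconstrained.
print(global_top_k(["a", "b", "c"], [("a", "b")], {"a", "b", "c"}, 2))
```

Because the probabilities are recomputed for each k, nothing forces the answer for k to be a prefix of the answer for k + 1.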

Lemma 23.
There is a constraint set C without ties such that global-top-k does not satisfy the containment property for the uniform distribution on pw(C).

Related Work
We extend the discussion about related work from the Introduction.
Ranking queries over uncertain databases. A vast body of work has focused on providing semantics and evaluation methods for order queries over uncertain databases, including top-k and ranking queries (e.g., [14,18,24,25,27,32,42,45,49,50]). Such works consider two main uncertainty types: tuple-level uncertainty, where the existence of tuples (i.e., variables) is uncertain, and hence affects the query results [14,18,25,27,32,42,49,50]; and attribute-level uncertainty, more relevant to our problem, where the data tuples are known but some of their values are unknown or uncertain [14,24,27,45]. Top-k queries over uncertain data following [45] were recently applied to crowdsourcing applications in [12]. These studies are relevant to our work as they identify multiple possible semantics for order queries in the presence of uncertainty, and specify desired properties for such semantics [14,27]; our definition of top-k satisfies the desiderata that are relevant to attribute-level uncertainty [27]. We depart from this existing work in two main respects. First, existing work assumes that each variable is given with an independent function that describes its probability distribution. We do not assume this, and instead derive expressions for the expected values of variables in a principled way from a uniform prior on the possible worlds. Our work is thus well-suited to the many situations where probability distributions on variables are not known, or where they are not independent (e.g., when order constraints are imposed on them). For this reason, the problems that we consider are generally computationally harder. For instance, [45] is perhaps the closest to our work, since they consider the total orders compatible with given partial order constraints. However, they assume independent marginal distributions, so they can evaluate top-k queries by only considering k-sized prefixes of the linear extensions; in our setting even computing the top-1 element is hard (Theorem 9).
The second key difference is that other works do not try to estimate the top-k values, because they assume that the marginal distribution is given: they only focus on ranks. In our context, we need to compute missing values, and need to account, e.g., for exact-value constraints and their effect on the probability of possible worlds and on expected values (Section 3).
We also mention our previous work [4] which considers the estimation of uncertain values (expectation and variance), but only in a total order, and did not consider complexity issues.
Partial order search. Another relevant research topic, partial order search, considers queries over elements in a partially ordered set to find a subset of elements with a certain property [3,16,19,23,39]. This relates to many applications, e.g., crowd-assisted graph search [39], frequent itemset mining with the crowd [3], and knowledge discovery, where the unknown data is queried via oracle calls [23]. These studies are complementary to ours: when the target function can be phrased as a top-k or interpolation problem, if the search is stopped before all values are known, we can use our method to estimate the complete output.
Computational geometry. Our work reformulates the interpolation problem as a centroid computation problem in the polytope of possible worlds defined by the constraint set. This problem has been studied independently by computational geometry work [30,37,41].
Computational geometry mostly studies arbitrary convex polytopes (corresponding to polytopes defined by arbitrary linear constraint sets), and often considers the task of volume computation, which is related to the problem of computing the centroid [41]. In this context, it is known that computing the exact volume of a polytope is not in FP^#P because the output is generally not of polynomial size [31]. Nevertheless, several (generally exponential) methods for exact volume computation [10] have been developed. The problem of approximation has also been studied, both theoretically and practically [15,17,20,30,35,44]. Our problem of centroid computation is studied in [37], whose algorithm is based on the idea of computing the volume of a polytope by computing the lower-dimensional volume of its facets. This is different from our algorithm, which divides the polytope along linear extensions into subpolytopes, for which we apply a specific volume and centroid computation method.
Some works in computational geometry specifically study order polytopes, i.e., the polytopes defined by constraint sets with only order constraints and no exact-value constraints. For such polytopes, volume computation is known to be FP^#P-complete [9], leading to an FP^#P-hardness result for centroid computation [41]. However, these results do not cover exact-value constraints, as order polytopes can only express order relations between variables in [0, 1]. Exact-value constraints are highly relevant in practice (to represent numerical bounds, or known information, e.g., for crowdsourcing), allow for more general polytopes, and complicate the design of Algorithm 1, which must perform volume computation and interpolation in each fragmented linear order.
Furthermore, to our knowledge, computational geometry works do not study the top-k problem, or polytopes that correspond to tree-shaped constraint sets, since these have no clear geometric interpretation.
Tree-shaped partial orders. Our analysis of tractable schemes for tree-shaped partial orders is reminiscent of the well-known tractability of probabilistic inference in tree-shaped graphical models [7], and of the tractability of probabilistic query evaluation on trees [13] and treelike instances [5]. However, we study continuous distributions on numerical values, and the influence between variables when we interpolate does not simply follow the tree structure; so our results do not seem to follow from these settings.

Conclusion
In this paper, we have studied the problems of top-k computation and interpolation for data with unknown values and order constraints. We have provided foundational solutions, including a general computation scheme, complexity bounds, and analysis of tractable cases. One natural direction for future work is to study whether our tractable cases (tree-shaped orders, sampling) can be covered by more efficient PTIME algorithms, or whether more general tractable cases can be identified: for instance, a natural direction to study would be partial orders with a bounded-treewidth Hasse diagram, following recent tractability results for the related problem of linear extension counting [29]. Another question is to extend our scheme to request additional values from the crowd, as in [3,12], and reduce the expected error on the interpolated values or top-k query, relative to a user goal. In such a setting, how should we choose which values to retrieve, and could we update incrementally the results of interpolation when we receive new exact-value constraints? Finally, it would be interesting to study whether our results generalize to different prior distributions on the polytope.

A Proofs for Section 3 (An Algorithm for Interpolation and Top-k)
Lemma 4. For any constraint set C, we can construct in PTIME a constraint set C′ such that the probability that the possible worlds of C′ have a tie (under the uniform distribution) is zero, and such that any interpolation or top-k computation problem on C can be reduced in PTIME to the same type of problem on C′.
Proof. We consider two types of ties: persistent ties, which are enforced by C and hold in each possible world, and occasional ties, which are not enforced by C and only hold in some possible worlds.
We first formally prove that occasional ties have a total probability of 0: let d be the dimension of pw(C) and n := |X|. Assume that x, y have a tie in some worlds and do not have a tie in other worlds (in particular, x = y is not implied by C). If we add constraints to enforce a persistent tie, pw(C ∪ {x ≤ y, y ≤ x}) is now at most (d − 1)-dimensional. Geometrically, this is a projection of the admissible polytope on the x = y hyperplane. The d-volume of pw(C ∪ {x ≤ y, y ≤ x}) is thus 0, i.e., this set of worlds has probability 0. Then, taking the union of O(n²) such sets, to obtain all worlds with occasional ties between any pair of variables, the result has total probability 0.
Next, to handle persistent ties, we define a constraint set where persistently tied variables are replaced by a single variable. Let X /∼ be the set of equivalence classes of X . We define C/∼ to be the constraint set where every occurrence of a variable x i ∈ X is replaced by some representative s(x i ) of its equivalence class. By definition C/∼ does not have persistent ties; and there is clearly a bijection between pw(C) and pw(C/∼).
It remains to show that problems on C can be reduced (in PTIME) to problems on C/∼. The interpolation problem for a variable x_i on C clearly reduces to interpolation for s(x_i) on C/∼, since due to the bijection the expected value of x_i on C equals that of s(x_i) on C/∼. If no selection is applied, the reduction for top-k works by selecting the top-k variables in C/∼, replacing each s(x_i) by the variables of its equivalence class [s(x_i)], and truncating the obtained ranked list to length k (recall that ties can be broken arbitrarily). If a selection σ is applied, then apply top-k only to representatives of selected variables from X_σ, and replace each representative s(x_i) in the top-k result only by the variables from [s(x_i)] ∩ X_σ.
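The collapsing of persistent ties can be sketched as follows. This is an illustrative sketch under assumptions of ours: persistent ties are detected as mutual reachability in the graph of order constraints, augmented with the order between exact values, and the representative s(x) is simply the smallest name in the class. The names `tie_representatives` and `quotient` are hypothetical.

```python
from fractions import Fraction


def tie_representatives(variables, order, exact):
    """Map each variable to a representative of its class of persistently tied variables.

    order: pairs (x, y) meaning x <= y; exact: dict from variable to known value.
    """
    succ = {v: {v} for v in variables}
    for x, y in order:
        succ[x].add(y)
    for a in exact:                       # order implied between exact values
        for b in exact:
            if exact[a] <= exact[b]:
                succ[a].add(b)
    changed = True                        # naive transitive closure
    while changed:
        changed = False
        for v in variables:
            reach = set().union(*(succ[w] for w in succ[v]))
            if not reach <= succ[v]:
                succ[v] |= reach
                changed = True
    return {v: min(w for w in succ[v] if v in succ[w]) for v in variables}


def quotient(order, exact, rep):
    """Constraint set C/~ where each variable is replaced by its representative."""
    new_order = {(rep[x], rep[y]) for x, y in order if rep[x] != rep[y]}
    new_exact = {rep[x]: value for x, value in exact.items()}
    return new_order, new_exact


half = Fraction(1, 2)
rep = tie_representatives(["a", "b", "x", "y"],
                          [("a", "x"), ("x", "b"), ("x", "y")],
                          {"a": half, "b": half})
print(rep, quotient([("a", "x"), ("x", "b"), ("x", "y")], {"a": half, "b": half}, rep))
```

In the example, x is squeezed between two variables fixed to 1/2, so a, b, and x collapse into one class while y stays on its own.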

A.1 Total Orders (Section 3.1)
We now give the omitted formal details about the "Fragmented Total Orders" paragraph. The soundness of splitting can be proved via the possible world semantics: assume that x_i ∈ X_exact. Then the possible valuations of x_1, …, x_{i−1} are affected only by the constraints in C on x_1, …, x_i, and similarly for x_{i+1}, …, x_n. The formal claim, which follows from Lemma 12, is as follows:
Lemma 24. Let C be a constraint set implying a fragmented total order x_1 ≤ ⋯ ≤ x_n, and x_i = v a constraint in C. Let C′_i, C″_i ⊆ C be the constraints restricted to the variable subsets x_0, …, x_i and x_i, …, x_{n+1} respectively. Then there is a bijective correspondence between pw_{x_1…x_i}(C′_i) × pw_{x_i…x_n}(C″_i) and pw(C).
The correspondence is defined from the left-hand side to the right-hand side as merging the duplicated component for x i to obtain worlds of exactly n values.

A.2 General Constraint Sets (Section 3.2)
Theorem 7. Given a constraint set C over X and x ∈ X (resp., and a selection predicate σ, and an integer k), determining the expected value of x in pw(C) under the uniform distribution (resp., the top-k computation problem over X, C, and σ) is in FP^#P.
Proof. Let C be an arbitrary constraint set with order constraints and exact-value constraints on a variable set X, and let n := |X|. Use Lemma 4 to ensure that there are no ties. To simplify the reasoning, we will make all values occurring in exact-value constraints be integers that are multiples of (n + 1)!, as follows. Let ∆ be (n + 1)! times the product of the denominators of all exact-value constraints occurring in C, which can be computed in PTIME, and consider the constraint set C′ defined on [0, ∆]^n by keeping the same variables and order constraints, and replacing any exact-value constraint x_i = v by x_i = v∆. The constraint set C′ is computable in PTIME from C, and the polytope pw(C′) is obtained by scaling pw(C) by a factor of ∆ along all dimensions. Hence, if we can compute the expected value of x_i in C′ (which is the coordinate of the center of mass of pw(C′) on the component corresponding to x_i), we can compute the expected value of x_i in C by dividing by ∆. We can thus assume that pw(C′) is a polytope of [0, ∆]^n where all exact-value constraints are integers which are multiples of (n + 1)!.
We use Lemma 5.2 of [ACK+11] to argue that the volume of pw(C′) can be computed in #P. The PTIME generating Turing machine T, given the constraint set C′, chooses nondeterministically a linear extension of (X, C′), which can clearly be represented in polynomial space and checked in PTIME. The PTIME-computable function g computes, as a rational, the volume of the polytope for that linear extension, and does so according to the scheme of Section 3: the volume is the product of the volumes of each C_1^n(α, β), whose volume is (β − α)^n / n!. This is clearly PTIME-computable, and as α and β are values occurring in exact-value constraints, they are integers and multiples of n!, so the result is an integer, so the overall result is a product of integers, hence an integer. By Lemma 5.2 of [ACK+11], V(C′), which is the sum of V(T) across all linear extensions T of C′ (because there are no ties), is computable in #P.
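The volume formula for a single linear extension can be sketched as follows. The encoding is an assumption of ours: a total order is given as a list of values in increasing variable order, with `None` for unknown variables, and implicit bounds 0 and 1 are added at both ends (no rescaling by ∆ is performed). The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def fragment_volume(alpha, beta, m):
    """Volume of {alpha <= x_1 <= ... <= x_m <= beta}, that is (beta - alpha)^m / m!."""
    return Fraction(beta - alpha) ** m / factorial(m)


def total_order_volume(chain):
    """Volume of a fragmented total order, as a product over its fragments."""
    volume = Fraction(1)
    low, unknown = Fraction(0), 0
    for value in list(chain) + [Fraction(1)]:     # upper bound 1 closes the last fragment
        if value is None:
            unknown += 1
            continue
        if value < low:
            return Fraction(0)                    # exact values contradict the order
        volume *= fragment_volume(low, value, unknown)
        low, unknown = Fraction(value), 0
    return volume


print(total_order_volume([None, None]))
print(total_order_volume([None, Fraction(1, 2), None, None]))
```

The second call splits the chain at the exact value 1/2, giving one fragment with one unknown and one fragment with two unknowns.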
We now apply the same reasoning to show that the sum, across all linear extensions T, of V(T) times the expected value of x_i in T, is computable in #P. Again, we use Lemma 5.2 of [ACK+11], with T enumerating linear extensions, and with a function g that computes the volume of the linear extension as above, and multiplies it by the expected value of x_i, by linear interpolation in the right C_{p+1}^{q−1} as in the previous section (it is an integer, as all values of exact-value constraints are multiples of q − p ≤ n + 1). So this concludes, as the expected value v of x_i is
$v = \frac{1}{V(C')} \sum_{T} V(T)\, v_T$
where v_T is the expected value of x_i for T, and we can compute both the sum and the denominator in #P. Hence, the result of the division, and reducing back to the answer for the original C, can be done in FP^#P.
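The volume-weighted average used in this proof can be illustrated by a brute-force sketch. It enumerates permutations rather than using a counting oracle, works directly on [0, 1] without the rescaling by ∆, assumes exact values are given as `Fraction` objects with no ties, and uses hypothetical names throughout.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def interpolate(variables, order, exact, target):
    """Expected value of `target` under the uniform distribution on pw(C)."""
    total_volume, weighted_sum = Fraction(0), Fraction(0)
    for perm in permutations(variables):              # perm lists variables smallest first
        pos = {v: i for i, v in enumerate(perm)}
        if not all(pos[x] < pos[y] for x, y in order):
            continue
        # Sentinel values 0 and 1 bound the chain on both sides.
        values = [Fraction(0)] + [exact.get(v) for v in perm] + [Fraction(1)]
        names = [None] + list(perm) + [None]
        volume, expected, start, consistent = Fraction(1), exact.get(target), 0, True
        for end in range(1, len(values)):
            if values[end] is None:
                continue
            alpha, beta, m = values[start], values[end], end - start - 1
            if beta < alpha:
                consistent = False
                break
            volume *= (beta - alpha) ** m / factorial(m)
            for j in range(1, m + 1):                 # linear interpolation in the fragment
                if names[start + j] == target:
                    expected = alpha + Fraction(j, m + 1) * (beta - alpha)
            start = end
        if consistent:
            total_volume += volume
            weighted_sum += volume * expected
    return weighted_sum / total_volume


# x <= y and x <= z, with z fixed to 1/2.
print(interpolate(["x", "y", "z"], [("x", "y"), ("x", "z")], {"z": Fraction(1, 2)}, "x"))
```

In the example, the two linear extensions have volumes 1/8 and 1/4, and the weighted average of their interpolated values for x is 2/9.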

A.3 Marginal Distributions
Beyond expected values, our results from Section 3 can serve to compute the marginal distributions of variables under the uniform distribution; these can be used for other estimation schemes of unknown variables (e.g., other moments of the variables, median values, etc.).
We now formalize this notion. Letting C be a constraint set, x ∈ X \X exact a variable with no exact-value constraint, and v ∈ [0, 1] a value, we write C |x=v the marginalized constraint set C ∪ {x = v}. Note that, if the admissible polytope pw(C) is d-dimensional, then pw(C |x=v ) is (d − 1)-dimensional. We can now define: Proof. The distribution on n uniform independent samples in [α, β] can be described as first choosing a total order for the samples, uniformly among all permutations of {1, . . . , n} (with ties having a probability of 0 and thus being neglected). Then, the distribution for each total order is exactly that of x i when variables are relabeled.
Note that the mean of this distribution corresponds to a linear interpolation of the unknown variables between α and β, which we use in Section 3.1.
We summarize these findings, as follows: Proposition 29. Given C, a constraint set implying a total order, the marginal distribution of any variable x i ∈ X is a polynomial of degree at most n and can be computed in PTIME.
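For intuition, the sketch below uses the standard density of the i-th smallest of n independent uniform samples on [0, 1], namely n · C(n−1, i−1) · t^(i−1) · (1−t)^(n−i), and checks that its mean is the linear interpolation i/(n + 1). Treating this as the polynomial meant by the proposition is an assumption; the function names are hypothetical, and the range [α, β] is normalized to [0, 1].

```python
from fractions import Fraction
from math import comb


def order_statistic_density(i, n):
    """Coefficients (lowest degree first) of the density of the i-th smallest of n samples."""
    lead = n * comb(n - 1, i - 1)
    coeffs = [Fraction(0)] * n
    for j in range(n - i + 1):                    # expand (1 - t)^(n - i)
        coeffs[i - 1 + j] = Fraction(lead * comb(n - i, j) * (-1) ** j)
    return coeffs


def moment(coeffs, power):
    """Integral over [0, 1] of t^power times the polynomial with these coefficients."""
    return sum(c / (d + power + 1) for d, c in enumerate(coeffs))


density = order_statistic_density(2, 3)
print(density, moment(density, 0), moment(density, 1))
```

The zeroth moment is 1 (the density is normalized) and the first moment is the expected value, which for the second of three unknowns is 2/4.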
Algorithm 1 can be adapted to compute marginal distributions, by replacing line 9 to compute instead the marginal distribution of x in T on variables x_{i_j}, …, x_{i_{j+1}}: according to Observation 28, this marginal distribution is polynomial and defined on the range [v_{i_j}, v_{i_{j+1}}]. We modify line 11 to compute the sum of the marginal distributions instead, seeing them as piecewise-polynomial functions (which are zero outside of their range), still weighted by the volume of T. We deduce that the marginal distribution of x in C (computed by modifying line 13) is a piecewise-polynomial function with at most |X| pieces and degree at most |X|.

B.1 Hardness of Exact Computation (Section 4.1)
The following lemma provides a direct connection between the expected value of a variable to its expected rank, for constraint sets that consist only of order constraints. By the hardness of computing expected rank in partial orders [BW91] this provides an alternative proof to the hardness of computing expected value (Theorem 8). We use this connection to obtain an upper bound on the denominator of the expected value, when we reduce to the problem of top-k computation below.
Lemma 30. Given X and a set of order constraints C over X, if the expected rank of x ∈ X is r then its expected value is r/(n + 1).
Proof. Let N be the number of linear extensions of C. Since C only consists of order constraints, by symmetry, the probability of each of its linear extensions is identical, namely 1/N. Denote by r_i the rank of x in the linear extension T_i. Then $r = \sum_{i=1}^{N} \frac{1}{N} r_i$. By Algorithm 1, the expected value of x relative to T_i is r_i/(n + 1), and thus the overall expected value of x is $\sum_{i=1}^{N} \frac{1}{N} \cdot \frac{r_i}{n+1} = \frac{r}{n+1}$.
Theorem 9. Given a constraint set C over X, a selection predicate σ, and an integer k, the top-k computation problem over X, C and σ is FP^#P-hard even if k is fixed to be 1, |X_σ| is 2, and the top-k answer does not include the expected value of the variables.
Proof. We will perform a reduction from the interpolation problem (i.e., expected value computation) on sets of order constraints, which is FP #P -hard by Theorem 8. Write n := |X |. Assume using Lemma 4 that the probability of ties in pw(C) is zero.
We first observe, using Lemma 30, that the expected value v of x_i can be written as $v = \frac{1}{N(n+1)} \sum_{T} r_T$, where the sum is over all the linear extensions T of C, r_T ∈ {1, …, n} is the rank of x_i in the linear extension T, and N ≤ n! is the number of linear extensions. This implies that v can be written as a rational p/q with 0 ≤ p ≤ q and 0 < q ≤ M, where we write M := (n + 1)!.
We determine this fraction p/q using the algorithm of [Pap79], which proceeds by making queries of the form "is p/q ≤ p′/q′" with 0 ≤ p′, q′ ≤ M, and runs in time logarithmic in the value M, so polynomial in the input C. To do so, we must describe how to decide in PTIME whether v ≤ p′/q′ for 0 ≤ p′, q′ ≤ M, using an oracle for the top-1 computation problem that does not return the expected values.
Fix v′ = p′/q′ the query value and let v = p/q be the unknown target value, the expected value of x_i. We illustrate how to decide whether v ≤ v′. The general idea is to add a variable with an exact-value constraint to v′ and compute the top-1 between x_i and the new variable, but we need a slightly more complicated scheme because the top-1 answer variable can be arbitrary in the case where v = v′ (i.e., we have a tie in computing the top-1). Let ε := 1/(2(M² + 1)), which is computable in PTIME in the value of n (so in PTIME in the size of the input C). Construct C′ (resp., C⁻, C⁺) by adding an exact-value constraint x′ = v′ (resp., x⁻ = v′ − ε, x⁺ = v′ + ε) for a fresh variable x′ (resp., x⁻, x⁺). Now use the oracle for C′ (resp., C⁻, C⁺) and the selection predicate that selects x_i and x′ (resp., x⁻, x⁺), taking k = 1 in all cases. The additional variables do not affect the expected value of x_i in C′, C⁺, and C⁻, so it is also v in them. Further, we know that either v = v′ or |v − v′| ≥ 1/M² > ε, as v and v′ are rationals with denominators at most M. Thus, if the top-1 variable in all oracle calls is always x_i (resp., never x_i), then we are sure that v > v′ (resp., v < v′). If some oracle calls return x_i but not all of them, we are sure that v = v′. Hence, we can find out in PTIME using the oracle whether v ≤ v′. This concludes the proof, as we then have an overall PTIME reduction from the FP^#P-hard problem of interpolation (i.e., expected value computation) to the top-1 computation problem, showing that the latter is also FP^#P-hard.
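The three-call comparison at the heart of this reduction can be sketched as follows. The oracle is modelled as a hypothetical callable `top1(value)` that returns True when x_i is reported as the top-1 against a fresh variable fixed to `value`, with arbitrary behaviour on ties; the two simulated oracles below are stand-ins with opposite tie-breaking, not part of the construction.

```python
from fractions import Fraction


def compare_to_query(top1, v_query, eps):
    """Return 1, 0 or -1 according to whether the hidden expected value is >, = or < v_query."""
    answers = [top1(v_query - eps), top1(v_query), top1(v_query + eps)]
    if all(answers):
        return 1        # x_i wins even against v_query + eps
    if not any(answers):
        return -1       # x_i loses even against v_query - eps
    return 0            # mixed answers can only happen when the two values coincide


# Simulated oracles for a hidden expected value 1/3, with M = 6.
hidden = Fraction(1, 3)
eps = Fraction(1, 2 * (6 ** 2 + 1))
ties_lost = lambda value: hidden > value
ties_won = lambda value: hidden >= value

print(compare_to_query(ties_lost, Fraction(1, 3), eps),
      compare_to_query(ties_won, Fraction(1, 2), eps),
      compare_to_query(ties_lost, Fraction(1, 6), eps))
```

The result does not depend on how the oracle breaks ties, which is the reason for querying at v′ − ε and v′ + ε in addition to v′.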
for different values of k. Note that we can assume that the expected value v of x_i is not equal to the exact value v′ of x′ (otherwise, we can resolve it using the technique of Theorem 9). Under this assumption, we have v < v′ iff the smallest k such that x′ occurs in the top-k is less than the smallest k′ such that x_i occurs in the top-k′, and likewise v > v′ whenever k > k′. The values k and k′ can be found, e.g., using binary search. For clarity and completeness, we give a self-contained proof of this result, using the almost uniform sampling scheme of [KLS97]:

B.2 Complexity of
Proof. We use a result by Kannan, Lovász, and Simonovits (Theorem 2.2 of [KLS97]) that shows that sampling a point almost uniformly from a convex body in dimension n can be done by Õ(n⁵) calls to an oracle deciding membership in the body, and an additional factor of O(n²) arithmetic operations (here, Õ(·) is the soft-O notation, which ignores polylogarithmic factors). "Almost uniformly" means that the total variation distance between the uniform distribution and the actual distribution realized by the sampling is less than any fixed number ε > 0. The dependency of the running time on ε is logarithmic, hence hidden in Õ(·).
Let ε, δ > 0 be two reals. For some number N that we define further, we first apply the sampling algorithm of [KLS97] N times to obtain a set P of N independent points p_1, …, p_N of pw(C). We then consider the projection of every point p_i ∈ P to the dimension defined by the variable x, giving N values v_1, …, v_N. We then compute the estimate Ê_C[x] as $\frac{1}{N} \sum_{i=1}^{N} v_i$. Clearly this has no impact on the running time. It only remains to show that this estimate satisfies the given bounds, i.e., that $|\hat{E}_C[x] - E_C[x]| \le \varepsilon$ with probability at least 1 − δ.
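The averaging estimator can be illustrated with a small sketch. Note the substitutions: it uses plain rejection sampling from the unit cube instead of the sampler of [KLS97] (so samples are exactly uniform but the running time can be exponential), and it picks the number of samples from the standard Hoeffding bound for values in [0, 1], which is not necessarily the value of N chosen in the proof. All function names are hypothetical.

```python
import math
import random


def hoeffding_samples(eps, delta):
    """Smallest N with 2 * exp(-2 * N * eps^2) <= delta, for samples in [0, 1]."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


def estimate_expected_value(variables, order, exact, target, eps, delta, seed=0):
    """Average the target coordinate over accepted uniform samples of pw(C)."""
    rng = random.Random(seed)
    needed = hoeffding_samples(eps, delta)
    total, accepted = 0.0, 0
    while accepted < needed:
        world = {v: exact[v] if v in exact else rng.random() for v in variables}
        if all(world[x] <= world[y] for x, y in order):   # membership test for pw(C)
            total += world[target]
            accepted += 1
    return total / accepted


# With the single constraint x <= y, the exact expected value of x is 1/3.
print(hoeffding_samples(0.05, 0.01))
print(estimate_expected_value(["x", "y"], [("x", "y")], {}, "x", 0.05, 0.01))
```

The membership test plays the role of the oracle deciding membership in the convex body; only the way candidate points are proposed differs from the scheme in the proof.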
Let f : pw(C) → [0; 1] be the probability density function of the distribution realized by the algorithm of [KLS97], according to which the independent samples were drawn. By definition of the total variation distance, we know that 1 If we denote by µ v the average of the (independent) x-value samples v 1 , . . . , v N , and we denote by f x=t the function obtained from f by setting to t the coordinate corresponding to x, we have:

Now, as the v i are independent and identically distributed, we obtain by Hoeffding's inequality [Hoe63]: Therefore, setting N : , we have:
Proof. Define the function M : pw(C) → pw_{X_1 ∪ X_exact}(C_1) × ⋯ × pw_{X_m ∪ X_exact}(C_m), which maps each possible world w of C to a tuple ⟨w_1, …, w_m⟩ of possible worlds of C_1, …, C_m defined as follows: for each variable x ∈ X, letting v be its value in w, we give to x the value v in all possible worlds among w_1, …, w_m where x appears. Recall that we assumed w.l.o.g. that m > 0, so each variable of X must occur in at least one of w_1, …, w_m.
It is immediate that any w ∈ pw(C) yields a tuple of possible worlds of C 1 , . . . , C m by this definition of M , since they are subsets of C.
Conversely, consider any tuple of possible worlds ⟨w_1, …, w_m⟩ of C_1, …, C_m. Note that each x_i ∈ X_exact is always consistently mapped to the same value, as each C_i contains every exact-value constraint. Every other variable x_i ∈ X_j appears only in w_j ∈ pw_{X_j ∪ X_exact}(C_j). Thus w = M⁻¹(⟨w_1, …, w_m⟩) is well-defined. Let us show that w ∈ pw(C).
First, we observe that exact-value constraints are necessarily respected, because each C_i contains all such constraints. Second, let us show that order constraints are respected. Consider an order constraint of the form x ≤ x′ in C. Clearly, if x and x′ belong to the same uninfluenced class X_i (or if at least one of them has an exact-value constraint), the constraint is reflected in C_i, so it must be respected in w. Hence, we focus on the case where x ∈ X_i and x′ ∈ X_j with i ≠ j. Now, as x ≤ x′, there exists a path P of the form x = x_1 ≺ ⋯ ≺ x_n = x′, but as they are not in the same uninfluenced class there must be a variable of X_exact in the sequence. Let x_k be this variable, with exact value v. By definition of the uninfluence decomposition, x ≤ x_k and x_k = v are in C_i; similarly, x_k ≤ x′ and x_k = v are in C_j. Thus x ≤ x′ must be respected in w overall.
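The decomposition used in this proof can be sketched as follows. The sketch assumes that two unknown variables lie in the same uninfluenced class exactly when they are linked by a chain of order constraints that avoids exact-valued variables, and that each C_i keeps the constraints over X_i ∪ X_exact; both points are our reading of the proof rather than a restatement of the definition. The function name is hypothetical.

```python
def uninfluence_decomposition(variables, order, exact):
    """Return a list of (class members, order constraints of C_i, exact-value constraints)."""
    parent = {v: v for v in variables if v not in exact}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]          # path halving
            v = parent[v]
        return v

    for x, y in order:
        if x in parent and y in parent:            # a link that avoids exact-valued variables
            parent[find(x)] = find(y)

    classes = {}
    for v in parent:
        classes.setdefault(find(v), set()).add(v)

    decomposition = []
    for members in classes.values():
        scope = members | set(exact)
        sub_order = [(x, y) for x, y in order if x in scope and y in scope]
        decomposition.append((members, sub_order, dict(exact)))
    return decomposition


parts = uninfluence_decomposition(["a", "x", "y", "z", "w"],
                                  [("x", "a"), ("a", "y"), ("y", "z")],
                                  {"a": 0.5})
print(sorted(sorted(members) for members, _, _ in parts))
```

In the example, the exact-valued variable a separates x from y and z, so their values can be handled independently, as the bijection in the proof states.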

C.2 Tree-Shaped Constraints (Section 5.2)
Theorem 15. For any tree-shaped constraint set C over X, we can compute its volume V(C) in time O(|X|²).
Proof. Let T be the tree with vertex set X which is the Hasse diagram of the order constraints imposed by C. For any variable x ∈ X that has no exact-value constraint (so it is not the root of T or a leaf of T), let C_x be the constraint set obtained as a subset of C by keeping only constraints between x and its descendants in T, as well as between x and its parent. For v ∈ [0, 1], we call V_x(v) the d-volume of pw(C_x ∪ {x′ = v}) where x′ is the parent of x and d is the dimension of pw(C_x). In other words, V_x(v) is the d-volume of the admissible polytope for the subtree T_{|x} of T rooted at x, as a function of the minimum value on x imposed by the exact-value constraint on the parent of x. It is clear that, letting x′_r be the one child of the root x_r of T, we have V(C) = V_{x′_r}(v_r), where v_r is the exact value imposed on x_r. We show by induction on T that, for any node x of T, letting m_x be the minimum exact-value among all leaves that are descendants of x, the function V_x is zero in the interval [m_x, 1] and can be expressed in [0, m_x] as a polynomial whose degree is at most the number of nodes in T_{|x}, written |T_{|x}|. Since the probability of ties is 0, we have m_x > 0 for all x.
The base case is for a node x of T which has only leaves as children; in this case it is V_x(v′) = m_x − v′ for v′ ∈ [0, m_x], and is zero otherwise. For the inductive case, let x be a variable. It is clear that $V(C_x \cup \{x' = v', x = v\}) = \prod_{i=1}^{l} V_{x_i}(v)$ where x_1, …, x_l are the children of x. Hence, by definition of the volume, we know that $V_x(v') = \int_{v'}^{m_x} \prod_{i=1}^{l} V_{x_i}(v)\, \mathrm{d}v$. Now, we use the induction hypothesis to deduce that V_{x_i}(v), for all i, in the interval [0, m_x], is a polynomial whose degree is at most |T_{|x_i}|. Hence, as the product of polynomials is a polynomial whose degree is the sum of the degrees of the input polynomials, and integrating a polynomial yields a polynomial whose degree is one plus that of the input polynomial, V_x in the interval [0, m_x] is a polynomial whose degree is at most |T_{|x}|.
Hence, we have proved the claim by induction, and we use it to determine V (C) as explained in the first paragraph.
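The induction can be turned into a short recursive sketch. It assumes the tree is given as a hypothetical `children` dictionary, that the root and the leaves (and only them) carry exact values stored as `Fraction` objects, and that a leaf child contributes only an upper bound on its parent. Polynomials are coefficient lists, lowest degree first.

```python
from fractions import Fraction


def poly_mul(a, b):
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            out[i + j] += ca * cb
    return out


def poly_eval(a, t):
    return sum(c * t ** d for d, c in enumerate(a))


def subtree_volume(x, children, exact):
    """Return (m_x, P) with V_x(v) = P(v) on [0, m_x] and V_x(v) = 0 above m_x."""
    m_x, product = None, [Fraction(1)]
    for child in children[x]:
        if child in exact:                         # leaf: only bounds x from above
            m_child = exact[child]
        else:
            m_child, p_child = subtree_volume(child, children, exact)
            product = poly_mul(product, p_child)
        m_x = m_child if m_x is None else min(m_x, m_child)
    # Antiderivative F of the product, then V_x(v) = F(m_x) - F(v).
    anti = [Fraction(0)] + [c / (d + 1) for d, c in enumerate(product)]
    poly = [-c for c in anti]
    poly[0] += poly_eval(anti, m_x)
    return m_x, poly


def tree_volume(root, children, exact):
    (child,) = children[root]                      # the root has exactly one child
    m, poly = subtree_volume(child, children, exact)
    return poly_eval(poly, exact[root]) if exact[root] <= m else Fraction(0)


# Root 0 <= x, x <= y <= 1 and x <= z <= 1: the volume is P(x <= min(y, z)) = 1/3.
children = {"r": ["x"], "x": ["y", "z"], "y": ["ly"], "z": ["lz"]}
exact = {"r": Fraction(0), "ly": Fraction(1), "lz": Fraction(1)}
print(tree_volume("r", children, exact))
```

This sketch multiplies children polynomials one after another, which mirrors the binarization argument but makes no attempt to match the cost analysis.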
We now prove that the computation is quadratic. We first assume that the tree T is binary. We show by induction that there exists a constant α ≥ 0 such that the computation of the polynomial V_{x_i} in expanded form has cost less than α n_i², where n_i is |T_{|x_i}|. The claim is clearly true for nodes where all children are leaves, because the cost is linear in the number of child nodes, as long as α is at least the number of operations per node α_0. For the induction step, if x_i is an internal node, let x_p and x_q be the two children. By induction hypothesis, computing V_{x_p} and V_{x_q} in expanded form has cost α(n_p² + n_q²). Remembering that arithmetic operations on rationals are assumed to take unit time, computing the product of V_{x_p} and V_{x_q} in expanded form has cost linear in the product of the degrees of V_{x_p} and V_{x_q}, which are less than n_p and n_q, so the cost of computing the product is α_1 n_p n_q for some constant α_1. Integrating has cost linear in the degree of the resulting polynomial, that is, n_p + n_q. So the total cost of computing V_{x_i} is α(n_p² + n_q²) + α_1 n_p n_q + α_2(n_p + n_q) + α_3 for some constants α_2, α_3. Now, as n_q = n_i − n_p − 1, computing V_{x_i} costs less than:
$\alpha n_i^2 + (\alpha_1 - 2\alpha)\, n_p n_q + (\alpha_2 - 2\alpha)\, n_i + \alpha + \alpha_3 - \alpha_2$
As long as α is set to be at least max(α_1/2, α_2/2), the second and third terms are nonpositive, which means (since n_p n_q and n_i are both ≥ 1) that V_{x_i} costs less than:
$\alpha n_i^2 + \alpha_1 + \alpha_3 - 3\alpha$
This concludes the induction case, by setting α to any arbitrary value which is greater than max(α_0, α_1/2, α_2/2, (α_1 + α_3)/3). Hence the claim is proven if T is binary.
If T is not binary, we use the associativity of product to make T binary, by adding virtual nodes that represent the computation of the product. In so doing, the size of T increases only by a constant multiplicative factor (recall that the number of internal nodes in a full binary tree is one less than the number of leaves, meaning that the total number of nodes in a binary expansion of an n-ary product is less than twice the number of operands of the product). So the claim also holds for arbitrary T.

Theorem 16. For any tree-shaped constraint set C on variable set X, for any variable x ∈ X∖X_exact, the marginal distribution for x is piecewise polynomial and can be computed in time O(|X_exact| × |X|²).
Proof. Recall that C_{|x=v} is C plus the exact-value constraint x = v. For any variable x′, we let m_{x′} be the minimum exact-value among all leaves reachable from x′. By definition, the marginal distribution for x is $v \mapsto \frac{V(C_{|x=v})}{V(C)}$. We have seen in Theorem 15 that 1/V(C) can be computed in quadratic time; we now focus on the function V(C_{|x=v}).
By Lemma 12, letting x_1, …, x_k be the children of x, D_1, …, D_k be their descendants (the x_i included), and D be all variables except x and its descendants, that is, $D := X \setminus (\{x\} \cup \bigcup_{i} D_i)$, we have $V(C_{|x=v}) = V'_x(v) \times \prod_{i=1}^{k} V_{x_i}(v)$, where V_{x_i} is as in the proof of Theorem 15, and V′_x(v) is the volume of the constraint set C_{x,v} over D obtained by keeping all constraints in C about variables of D, plus the exact-value constraint x = v. Indeed, the uninfluenced classes of C_{|x=v} are clearly D∖X_exact, D_1∖X_exact, …, D_k∖X_exact. We denote by X′ the variables of C_{x,v}.
We know by the proof of Theorem 15 that V_{x_i}, in the interval [0, m_x], is a polynomial whose degree is at most |T_{|x_i}|, and that it can be computed in O(|T_{|x_i}|²). Hence, the product of the V_{x_i}(v) can be computed in quadratic time in T_{|x} overall (as in the proof of Theorem 15) and it has linear degree. We thus focus on C_{x,v}, for which it suffices to show that V(C_{x,v}) is a piecewise-polynomial function with at most |X_exact| pieces, each piece having a linear degree and being computable in quadratic time in the number of variables of C_{x,v}. Indeed, this suffices to justify that computing the product of V(C_{x,v}) with ∏_i V_{x_i}(v), and integrating to obtain the marginal distribution, can be done in time O(|X_exact| × |X|²), and that the result is indeed piecewise polynomial.
For any node x i of D with no exact-value constraint, we let V xi,x (v, v ) be the volume of the constraint set obtained by restricting C x,v to the descendants of the parent x i of x i and adding the exact-value constraint x i = v. We let (v 1 , . . . , v q ) be the values occurring in exact-value constraints in C, in increasing order, so that q |X exact |. We show by induction on D the following claim: for any 1 i < q, for any variable x i in D with no exact-value constraint, in the intervals v ∈ [0, where P and P are polynomials of degree at most |T |xi ∩ D| and can be computed in quadratic time in |T |xi ∩ D|.
The proof is the same as in Theorem 15: for the base case where all children of For the inductive case, we do the same argument as before, noting that, clearly, taking the product of the V ·,· (v, v ) among the children of x i , the variable v occurs in at most one of them, namely the one from which x is reachable. We conclude that V x (v) is indeed a piecewise polynomial function with at most |X exact | many pieces, all of which have a linear degree, by evaluating V x ,x (v , v), where x is the one child of the root of T and v is the value to which it has an exact-value constraint. The overall computation time is then in O(|X exact | × |X | 2 ).

Corollary 17.
Given any constraint set $C$ and its uninfluence decomposition $C_1, \ldots, C_m$, assuming that each $C_i$ is a (reverse-)tree-shaped constraint set, we can solve the interpolation problem in time $O(\max_i |X_i|^3)$ and the top-k problem in PTIME.
Proof. First, notice that any reverse-tree-shaped constraint set $C$ can be transformed to a tree-shaped constraint set $C'$ such that $E_{C'}(x) = 1 - E_C(x)$ for every $x \in X$, by reversing order constraints and replacing exact-value constraints $x = \alpha$ with $x = 1 - \alpha$. Second, we also observe that, formally, the constraint sets in the uninfluence decomposition may include variables with exact-value constraints that are not connected to the tree-shaped structure, but these variables clearly have no impact on the interpolation problem. Now, we use Lemma 12 to deduce that we can indeed solve the problems by solving them separately in each constraint set of the uninfluence decomposition. For the interpolation problem, we can compute the interpolated value of a variable by looking only at its uninfluence class. The complexity of top-k follows.

D Proofs for Section 6 (Other Variants)

D.1 Proofs of Comparison Results
Lemma 19. There is a constraint set C and selection predicate σ such that local-top-k and U-top-k do not match, even for k = 1 and without returning expected values or probabilities.
Proof. Let $\mu = 2/3$, $m = 1/\sqrt{2}$, and pick any $v$ such that $\mu < v < m$. Consider variables $x$, $x'$ and $y$, with the constraint set that imposes $x' \le x$ and $y = v$. Fix $k = 1$ and consider the predicate $\sigma$ that selects all variables. It is immediate by linear interpolation that the expected value of $x$ is $\mu$. Hence $y$, which has a greater expected value, is the local-top-1. However, the marginal distribution of $x$ can easily be computed to be $p_x : t \mapsto 2t$. Intuitively, when $x$ becomes larger, it makes a larger range of values possible for $x'$, and thus also a larger range of possible worlds. By integrating, we obtain that $x$ has a probability of 0.5 of exceeding $m$ (a value greater than $v$, which is $y$'s value), and hence its probability of being the top-1 is greater than $y$'s. Namely, $x$ is the U-top-1. Intuitively, the volume of possible worlds where $x$ is the top-1 is greater due to the asymmetry of $x$'s distribution. However, there is a larger gap (on average) between $y$ and $x$ in worlds where $y$ is the top-1, which ultimately leads to a higher expected value for $y$.
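As a sanity check of the numbers in this proof, the following minimal Python sketch evaluates both criteria for one admissible choice of $v$. The value 0.69 and the helper names `prob_x_exceeds`, `local_top1` and `u_top1` are illustrative assumptions, not part of the paper; the sketch only relies on the density $p_x(t) = 2t$ and the expected value $2/3$ stated above.

```python
from math import sqrt

MU = 2 / 3          # expected value of x: integral of t * 2t over [0, 1]
M = 1 / sqrt(2)     # threshold m: P(x > m) = 1 - m^2 = 0.5

def prob_x_exceeds(v):
    """P(x > v) when x has density p_x(t) = 2t on [0, 1]."""
    return 1 - v * v

def local_top1(v):
    """Variable with the highest expected value (y is fixed to v)."""
    return "y" if v > MU else "x"

def u_top1(v):
    """Variable most likely to be the maximum (x' never beats x)."""
    return "x" if prob_x_exceeds(v) > 0.5 else "y"

v = 0.69            # hypothetical choice with MU < v < M
print(local_top1(v), u_top1(v))
```

For any $v$ strictly between $\mu$ and $m$, the two functions disagree, which is exactly the mismatch claimed by the lemma.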

Lemma 21.
There is a constraint set C without ties such that U-top-k does not satisfy the containment property for the uniform distribution on pw(C).

Proof.
Consider variables $x_l$, $x_h$, $x_f^+$, and $x_f^-$, and the constraint set $C$ that imposes $x_l \le x_h$, $x_f^+ = 0.7$, and $x_f^- = 0.69$. Consider the selection predicate $\sigma$ that selects all variables. The total volume of the constraint set is clearly $V = 1/2$.
We first set $k = 1$. The first possible answer is $(x_f^+)$ with probability $\frac{0.7 \times 0.7}{2 \cdot V} = 0.49$, and the second is $(x_h)$ with probability $0.51$, so the U-top-1 is $(x_h)$.
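The $k = 1$ probabilities can be reproduced with exact arithmetic. This is a small illustrative sketch; the variable names `p_f_plus` and `p_h` are assumptions introduced here, and the computation uses only the triangle-area argument of the proof.

```python
from fractions import Fraction as F

V = F(1, 2)              # volume of the worlds {0 <= x_l <= x_h <= 1}
x_f_plus = F(7, 10)      # exact value of x_f^+

# x_f^+ is the top-1 exactly when x_h < 0.7; the worlds with
# x_l <= x_h <= 0.7 form a triangle of area 0.7 * 0.7 / 2.
p_f_plus = (x_f_plus * x_f_plus / 2) / V
p_h = 1 - p_f_plus       # otherwise x_h is the top-1

print(p_f_plus, p_h)
```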
We then set k = 2. There are four possible answers. The first possible answer is (

Lemma 23.
There is a constraint set C without ties such that global-top-k does not satisfy the containment property for the uniform distribution on pw(C).

Proof.
Consider variables x s , x f , x l and x h and the constraint set C that imposes x l x h , x l = . 45

D.2 Other Variants
Additional variants of top-k have been studied, see [CLY09,ZC09]. However, in the context of [CLY09], these definitions do not satisfy the containment property either, except for two. The first, U-kRanks [SIC07], does not satisfy the natural property that top-k answers always contain k different variables. The second, expected ranks [CLY09], resembles local-top-k but uses ranks instead of values, so the definition is value-independent. While this makes sense for top-k queries designed to return tuples, as in [CLY09], we argue it is less sensible when focusing on the numerical value of variables; this justifies our focus on local-top-k.
Another possibility to define top-k in our context, however, is to design it based on different assumptions. One natural choice is to require a stability property, namely, adding exact-value constraints to fix some variables to their interpolated values does not change the interpolated values of the other variables. We can show that this property is not respected by our scheme, but that it can be enforced on tree-shaped constraint sets: see Appendix E.

E Alternative Interpolation Scheme
We now consider variants of the interpolation problem, thus far performed by considering the expected value under the uniform distribution. Since we are not aware of candidate variants in previous work for interpolating over partial orders with exact-value constraints, we propose a new alternative variant. Rather than imposing the connection to the uniform prior, we present natural desiderata for an interpolation scheme on partial orders. We show that our current definition does not respect them, and that, for tree-shaped partial orders (Definition 14), a definition that respects them exists, is in fact unique, and can be computed tractably. We first define the notion of interpolation scheme.

Definition 32.
An interpolation scheme is a function that maps any constraint set C on variables X to a mapping from X to its interpolated value in [0, 1].
For instance, the interpolation scheme that we have studied thus far maps each variable to its expected value under the uniform distribution on pw(C). We refer to this scheme as Uniform in the sequel.
We define the first natural desideratum for interpolation schemes, stability: intuitively, an interpolation scheme is stable if assigning variables to their interpolated value does not change the result of interpolation elsewhere. Formally:

Definition 33.
An interpolation scheme $S$ is stable if, for every constraint set $C$ over $X$ and every $x \in X$, $S$ assigns the same mapping $f : X \to [0, 1]$ to both $C$ and $C \cup \{x = f(x)\}$.
This property can be shown to be respected, e.g., by linear interpolation on total orders. However, a counterexample shows that the stability property is not respected by the Uniform scheme:

Lemma 34.
The Uniform scheme is not stable, even on tree-shaped constraint sets.
Proof. Consider the set of variables $\{x_r, x_a, x_b, x_c, x_d, x_e\}$ and the constraint set $C$ formed of the order constraints $x_r \le x_a$, $x_a \le x_b \le x_c$, $x_a \le x_d \le x_e$, and the exact-value constraints $x_r = 0$, $x_c = 0.5$ and $x_e = 1$. We can compute that the interpolated values for $x_a$ and for $x_b$ are $3/20$ and $13/40$ respectively. However, adding the exact-value constraint $x_b = 13/40$, the interpolated value for $x_a$ becomes $611/4020$, which is different from $3/20$.
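The three fractions in this counterexample can be checked by exact integration. The sketch below is only an illustration under stated assumptions: the helpers `mul` and `integrate` and the coefficient-list encoding of polynomials in $a$ (the value of $x_a$) are introduced here and are not the paper's algorithm. For a fixed $x_a = a$, the unknowns $x_b \in [a, 1/2]$ and $x_d \in [a, 1]$ are independent, so the density of $x_a$ is proportional to $(1/2 - a)(1 - a)$.

```python
from fractions import Fraction as F

def mul(p, q):
    """Product of two polynomials given as coefficient lists (low degree first)."""
    out = [F(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def integrate(p, lo, hi):
    """Exact integral of polynomial p over [lo, hi]."""
    return sum(c * (hi ** (k + 1) - lo ** (k + 1)) / (k + 1)
               for k, c in enumerate(p))

half = F(1, 2)
a_poly = [F(0), F(1)]             # a
len_b = [half, F(-1)]             # length of x_b's range: 1/2 - a
len_d = [F(1), F(-1)]             # length of x_d's range: 1 - a
mass_b = [F(1, 8), F(0), -half]   # integral of b over [a, 1/2]: 1/8 - a^2/2

# Uniform scheme on C: x_a ranges over [0, 1/2].
V = integrate(mul(len_b, len_d), F(0), half)
E_a = integrate(mul(a_poly, mul(len_b, len_d)), F(0), half) / V
E_b = integrate(mul(mass_b, len_d), F(0), half) / V

# After adding x_b = E_b: x_a ranges over [0, E_b], only x_d remains free.
V2 = integrate(len_d, F(0), E_b)
E_a_fixed = integrate(mul(a_poly, len_d), F(0), E_b) / V2

print(E_a, E_b, E_a_fixed)
```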
Let us use stability as a guide to design a different interpolation scheme. We impose another desideratum to act as a base case, specifying what one expects from an interpolation scheme when there is only a single unknown variable:

Definition 35.
Let $C$ be a (non-contradictory) constraint set such that $x$ is the only unknown variable; $y_1, \ldots, y_n$ are variables with exact-value constraints such that $y_i \le x$; and $z_1, \ldots, z_m$ are variables with exact-value constraints such that $x \le z_i$ (having $n, m \ge 1$). We say an interpolation scheme is balanced if, for each such $C$, its interpolated value for $x$ is $\frac{\max_i v(y_i) + \min_i v(z_i)}{2}$.
In particular, the Uniform scheme is balanced in this sense. However, we would like to find a scheme that is both balanced and stable. For the case of general constraint sets, this problem remains open, and we leave it for future work. For tree-shaped constraint sets, we next not only show such a scheme, but also prove it is unique, as follows.

Proposition 36.
There is at most one interpolation scheme on trees that is both stable and balanced.
Proof. We first observe that, because of the balanced requirement, in any constraint set $C$, for any unknown variable $x$ whose parent $y$ was interpolated to value $v$ and whose children $z_i$ were interpolated to $w_i$, $x$ must have been interpolated to $(v + \min_i w_i)/2$. Indeed, considering the constraint set $C'$ where $y$ and the $z_i$ have been set to those values, by stability, the interpolated value of $x$ does not change. Hence, as the interpolation scheme is balanced, we conclude that the claimed property holds.
We now show that the resulting set of equations always has at most one solution on any tree-shaped constraint set. Indeed, assume that there are two stable and balanced interpolation schemes $f$ and $g$ which yield different results on a tree-shaped constraint set $C$. For all variables $x$ of $C$, let $d(x) := g(x) - f(x)$. Calling $x_r$ the root variable of $C$, we must have $d(x_r) = 0$, because the root has an exact-value constraint by definition of tree-shaped constraint sets.
Now, as $f$ and $g$ differ on $C$, there must be a variable $x$ where $d(x) \neq 0$. Without loss of generality, we have $d(x) > 0$. Hence, let us consider a variable $x$ with parent $y$ so that we have $d(x) > d(y)$: as $d(x_r) = 0$, we can find such a variable $x$ by picking a variable which is as high as possible in the tree, such that $d(x) > 0$ but $d(y) = 0$. Necessarily $x$ is not a leaf (as leaves have exact-value constraints, so $d$ is $0$ on them), so $x$ has children. We show that $x$ has a child $x_f$ such that $d(x_f) > d(x)$. Consider $x_f$ the child of $x$ such that $f(x_f)$ is minimal among children of $x$, and $x_g$ defined in the analogous manner for $g$. Now, as $f$ and $g$ are balanced, by our preliminary observation we have $f(x) = (f(y) + f(x_f))/2$, hence $f(x_f) = 2 \cdot f(x) - f(y)$, and likewise $g(x_g) = 2 \cdot g(x) - g(y)$. But then, by minimality of $g(x_g)$, we have $g(x_f) \ge g(x_g)$, so that $d(x_f) = g(x_f) - f(x_f) \ge g(x_g) - f(x_f) = 2 \cdot d(x) - d(y)$. Now, as we have $d(y) < d(x)$, we have $d(x_f) > d(x)$, which is what we wanted to show. Now, repeating the argument on $x_f$, we obtain a child $x_f^2$ of $x_f$ such that $d(x_f^2) > d(x_f)$. Repeating the argument, we thus build a descending chain of variables in the tree-shaped constraint set $C$ along which $d$ is strictly increasing, hence strictly positive. When we reach the leaves, where $d$ must be $0$, we obtain a contradiction. This implies that we must have $d(x) = 0$ for all $x \in X$, so that $f = g$ on $C$. Hence, there cannot be two different stable and balanced interpolation schemes on tree-shaped constraint sets.
We now prove the existence of a stable and balanced interpolation scheme on trees, which we dub Stable, and show that expected values under this scheme can be computed in linear time:
Proof. We compute the interpolation scheme on a tree-shaped constraint set $C$ top-down. For each variable $x$ which has no exact-value constraint or interpolated value, but whose parent $y$ has an exact value or an interpolated value $v_y$, we consider all leaves $z$ which are descendants of $x$ (and have an exact-value constraint to some value $v_z$), and set the interpolated value of $x$ to be the minimum of linear interpolation from $y$ to $z$; namely, letting $d_x(z)$ be the depth of leaf $z$ in the subtree rooted at $x$, we set $v_x := \min_z \left( v_y + \frac{v_z - v_y}{d_x(z) + 1} \right)$. This can clearly be done in the indicated time bound.
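The top-down computation just described can be sketched as follows. This is a minimal illustration under explicit assumptions: the tree is given as a hypothetical `children` dictionary, exact values appear only at the root and at the leaves, and the function name `stable_interpolation` is introduced here. For simplicity the sketch rescans the leaves below every node, so it does not attempt to match the time bound stated in the text.

```python
from fractions import Fraction as F

def stable_interpolation(children, exact, root):
    """Top-down interpolation: v_x = min over leaves z below x of
    v_y + (v_z - v_y) / (d_x(z) + 1), where y is the parent of x.
    Assumes exact values are given exactly for the root and the leaves."""
    values = dict(exact)

    def leaf_depths(x, depth=0):
        # Yield (d_x(z), v_z) for every leaf z in the subtree rooted at x.
        kids = children.get(x, [])
        if not kids:
            yield depth, exact[x]
        for c in kids:
            yield from leaf_depths(c, depth + 1)

    def visit(x, v_parent):
        if x not in values:
            values[x] = min(v_parent + (v_z - v_parent) / (d + 1)
                            for d, v_z in leaf_depths(x))
        for c in children.get(x, []):
            visit(c, values[x])

    for c in children.get(root, []):
        visit(c, values[root])
    return values

# The tree of Lemma 34: x_r -> x_a -> {x_b -> x_c, x_d -> x_e}.
children = {"x_r": ["x_a"], "x_a": ["x_b", "x_d"],
            "x_b": ["x_c"], "x_d": ["x_e"]}
exact = {"x_r": F(0), "x_c": F(1, 2), "x_e": F(1)}
values = stable_interpolation(children, exact, "x_r")
print(values)
```

On this tree, each unknown variable ends up at the midpoint of its parent's value and the minimum of its children's values, as the balanced property requires.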
We now show that the resulting interpolation scheme is indeed balanced and stable. It is immediate to observe that this scheme is balanced. We now show that it is stable. Towards this, let us first show that for every variable $x$ with parent $y$, if $z$ is a leaf that achieved the minimum when interpolating $x$ to its value, then for all variables on the path from $x$ to $z$, $z$ was also a leaf that achieved the minimum when interpolating their value. Indeed, it suffices to show the claim for the first variable $x'$ of this path, a child of $x$, and then repeat the argument. We know that, from our choice when interpolating $x$, by minimality of the interpolation result for $x$ using $z$, we have $\frac{v_z - v_y}{d_x(z) + 1} \le \frac{v_{z'} - v_y}{d_x(z') + 1}$, where $v_y$ is the value of the parent $y$ of $x$; let us call the first quantity $\delta$ and the second $\delta'$. By definition of the interpolation of $x$, we then have $v_x := v_y + \delta$. Now, consider any leaf $z'$ reachable from $x'$. We must show that $z$ achieves the minimum when interpolating $x'$; in other words, we must compare $\eta := \frac{v_z - v_x}{d_{x'}(z) + 1}$ and $\eta' := \frac{v_{z'} - v_x}{d_{x'}(z') + 1}$, and show that $\eta \le \eta'$; note that $d_{x'}(z) + 1 = d_x(z)$ and $d_{x'}(z') + 1 = d_x(z')$. The quantity $\eta$ can then be rewritten as $\frac{d_x(z) + 1}{d_x(z)} \times \frac{1}{d_x(z) + 1} (v_z - v_y - \delta)$, i.e., $\frac{d_x(z) + 1}{d_x(z)} \times \left( \delta - \frac{\delta}{d_x(z) + 1} \right)$, which simplifies to $\delta$: hence, $\eta = \delta$. The quantity $\eta'$ can be written as $\frac{d_x(z') + 1}{d_x(z')} \times \frac{1}{d_x(z') + 1} (v_{z'} - v_y - \delta)$, i.e., $\frac{d_x(z') + 1}{d_x(z')} \times \left( \delta' - \frac{\delta}{d_x(z') + 1} \right)$, which simplifies to $\frac{(d_x(z') + 1)\delta' - \delta}{d_x(z')}$. Now, as $\delta' \ge \delta$, we deduce that $\eta' \ge \frac{(d_x(z') + 1)\delta - \delta}{d_x(z')}$, so that $\eta' \ge \delta$. Hence, we have $\eta' \ge \eta$, so that the leaf $z$ also achieves the minimum for variable $x'$. Repeating the argument on the path from $x$ to $z$, we have shown the claim.
From this initial claim, we deduce the following (*): for any variable x, letting z be a leaf that achieves the minimum when interpolating x (once the value of its parent y is known), then the variables in the path from y to z are interpolated according to linear interpolation on that chain. This is immediate by the previous claim, as all variables on that path are interpolated using linear interpolation from their parent to that same leaf (or another minimal leaf that sets them to the same value).
We now show a similar auxiliary claim. Let us define, once we have interpolated in $C$, the function $u$ that maps each non-root variable $x$ to $u(x)$ defined as the interpolated value of $x$ minus that of its parent. We show that (**): for any variables $y$, $x$, $x'$, where $y$ is the parent of $x$ and $x$ is the parent of $x'$, we have $u(x) \le u(x')$. Indeed, let $v_y$, $v_x$ and $v_{x'}$ be the interpolated values, and let $z'$ be the witness leaf used to interpolate for $x'$. By definition of the scheme, we have $u(x') = \frac{v_{z'} - v_x}{d_{x'}(z') + 1}$. Furthermore, letting $z$ be the witness leaf used to interpolate for $x$, we have $u(x) = \frac{v_z - v_y}{d_x(z) + 1}$. Using the notation above, note that $u(x') = \eta'$ and $u(x) = \delta$. By the same reasoning as for claim (*) to show $\delta = \eta \le \eta'$, we conclude that $u(x) \le u(x')$.
We are now ready to show that the scheme is stable. Consider the initial tree-shaped constraint set $C$, and let us set a variable $x$ to its interpolated value $v_x$, yielding $C'$. Note that $C'$ is no longer tree-shaped, but it can be rewritten by Lemma 12 to two tree-shaped constraint sets. It is then clear that all variables that are descendants of $x$ in $C$ are interpolated in the same manner in $C'$ and in $C$, as the scheme proceeds top-down and the value of $x$ in $C'$ is by definition the same as its interpolated value in $C$. We now show that the ancestors of $x$ in $C$ are interpolated in the same way in $C'$ as in $C$, which is clearly sufficient to justify the claim that all variables in $C'$ are interpolated in the same way as in $C$. Let us therefore pick an ancestor $x'$ of $x$, which is neither $x$ nor the root variable, otherwise the claim is trivial; we pick it as high as possible in the tree, so the interpolated value $v_y$ of its parent $y$ is the same in $C$ and in $C'$.
We first show that the interpolated value for $x'$ in $C'$ is no higher than in $C$. Assuming to the contrary that it is, then it must be the case that $x'$ was interpolated in $C$ using as minimal leaf $z$ some leaf which is a descendant of $x$ in $C$, as otherwise we can still interpolate using $z$ in $C'$ and obtain the same result. Now, if $x'$ was interpolated in $C$ using $z$ as minimal leaf, then, by our preliminary claim (*), $x$ was interpolated in $C$ following linear interpolation between the parent $y$ of $x'$ and the leaf $z$. Hence, using the new leaf $x$ in $C'$ to interpolate $x'$ in $C'$ yields the same result as the interpolation in $C$. Contradiction.
Second, we show that the interpolated value for $x'$ in $C'$ is no lower than in $C$. Assuming to the contrary that it is, then, if $x'$ was interpolated in $C'$ following a leaf $z$ which is not $x$, then we immediately reach a contradiction as we should have used the same leaf $z$ to interpolate to the same value in $C$. Hence, we must have interpolated $x'$ in $C'$ using the new leaf $x$, and $x'$ was interpolated in $C'$ following linear interpolation between $y$ and $x$. Let $\gamma$ be the value difference between two consecutive nodes in $C'$ on this path, and $l$ the length of the path. Calling $u(x'')$ for a variable $x''$ the difference between its interpolated value and the value of its parent in $C$, we must then have $u(x') > \gamma$, because the value of $x'$ in $C$ is strictly greater than in $C'$, and $y$ has the same value in $C$ and $C'$. By preliminary claim (**), we have reached a contradiction, because then the function $u$ always takes values which are $> \gamma$ on the path from $x'$ to $x$ in $C$, so that when we reach $x$ we know that the value of $x$ in $C$ exceeds $v_y$ by more than $l \cdot \gamma$, contradicting the fact that this difference is exactly $l \cdot \gamma$, as we know from $C'$.