Guaranteeing Correctness of Bulk Operations in Outsourced Databases

The adoption of public cloud services, as well as other data outsourcing solutions, raises concerns about the confidentiality and integrity of information managed by a third party. Focusing on data integrity, we propose a novel protocol that allows cloud customers to verify the correctness of results produced by key-value databases. The protocol is designed to support efficient insertion and retrieval of large sets of data through bulk operations in read and append-only workloads. In these contexts, the proposed protocol improves on the state of the art by reducing network overheads thanks to an original combination of aggregate bilinear map signatures and extractable collision resistant hash functions.


Introduction
The adoption of cloud services and other data outsourcing solutions is often hindered by data confidentiality needs and by limited trust in the correctness of operations performed by the service provider. Data confidentiality issues are addressed by several proposals based on encryption schemes (e.g., [10,12,21]). Correctness may be guaranteed through standard authenticated data structures [15,24] based on message authentication codes [1] and digital signatures [19], which suffer from large network overheads and support only a limited set of database operations. Recent proposals, such as [13,16,17,20], improve on standard protocols, but they cannot be adopted to guarantee results correctness in outsourced key-value databases because they incur either high network overheads [13,16,20] or high computational costs [16,17,9]. For these reasons, we propose Bulkopt, a novel protocol that allows us to detect unauthorized modifications to outsourced data and to verify the correctness of all results produced by a cloud database service. Bulkopt guarantees authenticity, completeness and freshness of results produced by outsourced databases, including cloud-related services. It is specifically designed to work efficiently in read and append-only workloads possibly characterized by bulk operations, where large amounts of records may be inserted in the key-value database through one write operation. Moreover, Bulkopt supports

System and threat models
We adopt popular terminology for database outsourcing [23]. We identify a data owner that stores data on a database server managed by an untrusted service provider, and many authorized users that retrieve data from the server. The server offers a query interface that can be accessed by the data owner and the authorized users to retrieve values by providing a set of keys. We consider a publicly verifiable setting [23] and assume that only the data owner knows his private key, that is required to insert data into the database, and that authorized users know the public key of the owner that is required to verify results produced by the server. We note that in this first version of the protocol, we do not consider delete and update operations and focus on efficient insert and read database operations.
Our threat model assumes that the owner and all users are honest, while the server is untrusted. In particular, we assume that the server (or any other unauthorized party that does not have legitimate access to the private key) may try to insert, modify and delete data on behalf of the owner. The Bulkopt protocol allows all users and the owner to verify the correctness of all results produced by the server. We distinguish three types of result violations:
- authenticity: results that contain records that have never been inserted by the data owner or that have been modified after insertion;
- completeness: results that do not include all keys requested by the client that have been previously inserted by the data owner;
- freshness: results that are based on an old version of the database.
In the considered operation workload the server can only violate freshness if he returns results that are both authentic and complete, but refer to an old version of the database.

Protocol overview
We describe the formal model used by Bulkopt to represent data and operations (Section 3.1) and to express authenticity and completeness guarantees as set operations (Sections 3.2 and 3.3). Since this version of the protocol does not consider delete and update operations, the server can only violate freshness if he returns results that are both authentic and complete, but that refer to an old version of the database. As a result, clients can detect freshness violations by always using an up-to-date cryptographic digest to verify authenticity and completeness proofs. For details about verification operations, please refer to the candidate implementation of the protocol described in Section 4.

Data model
We model the key-value database as a set of tuples $D = \{(k, v)\}$, where $k$ is the key and $v$ is the value associated with $k$. The owner populates the key-value database by executing one or more insert operations. For each insert operation the owner sends a set of tuples $B_i = \{(k, v)\}$, where $i$ is an incremental counter that uniquely identifies an insert operation. The set $B_i$ contains at least one tuple, and may contain several tuples in case of bulk insertions. Without loss of generality, in the following we refer to each set of tuples $B_i$ as a bulk. We define as $K_i$ the set of keys included in $B_i$, and $D_n = \bigcup_{i=1}^{n} B_i$ the set of records stored in the database after $n$ bulk insertions.
We assume that the server has access to a lookup function that given a set of keys {k} allows him to retrieve the set of insert operation identifiers {i} in which these keys were sent by the owner. Such function can be obtained by deploying any standard indexing data structure of preference (e.g., a B-tree).
Any client (including the owner) can issue a read operation requesting an arbitrary set of keys $X = \{k\}$. If the server behaves correctly, he must return the subset $A$ of the database, defined as:
$$A = \{(k, v) \in D_n : k \in X\} \tag{1}$$
We define $R$ as the set of keys included in $A$, that is:
$$R = \{k : (k, v) \in A\} \tag{2}$$
While executing read operations issued by clients, the server distinguishes two different sets of keys: $T$ and $\bar{T}$.
$T$ is the union of all sets $K_i$ that contain at least one key among those requested by a client:
$$T = \bigcup_{i \,:\, K_i \cap X \neq \emptyset} K_i \tag{3}$$
Within each such $K_i$ we identify two subsets of keys: $R_i = K_i \cap X$, the keys requested by the client, and $Q_i = K_i \setminus R_i$, the keys that were not requested. We define $Q$ as the union of all sets $Q_i$, and we note that the union of all sets $R_i$ is equal to set $R$ (see Equation (2)). Thus, set $Q$ is the complement of $R$ in $T$. $\bar{T}$ is the union of all sets $K_i$ that do not contain any key among those requested by a client:
$$\bar{T} = \bigcup_{i \,:\, K_i \cap X = \emptyset} K_i \tag{4}$$
To better explain how these sets are built and the relationships among them, we refer to a simple example shown in Figure 1. In this example we have a key-value database on which the owner already executed five bulk insert operations, each involving a different number of tuples. The keys included in the database are represented by sets $K_1$ to $K_5$. We assume that a legitimate client executes a read operation, asking to retrieve six keys belonging to three different bulks. The set of keys requested is represented by $X$. Since $X$ includes keys belonging to bulks $K_1$, $K_3$ and $K_4$, all keys of these bulks belong to $T$, while $\bar{T}$ includes all keys belonging to the remaining bulks ($K_2$ and $K_5$). Sets $R_1$, $R_3$ and $R_4$ include only the keys requested by the client and belonging to $K_1$, $K_3$ and $K_4$, respectively. Set $R$ is the union of $R_1$, $R_3$ and $R_4$. Sets $Q_1$, $Q_3$ and $Q_4$ include only the keys that were not requested by the client and that belong to $K_1$, $K_3$ and $K_4$, respectively. Finally, set $Q$ is the union of $Q_1$, $Q_3$ and $Q_4$.
Sets $Q$ and $\bar{T}$ are the main building blocks that Bulkopt leverages to identify a violation of the security properties or to prove the correctness of results produced by the server.
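The set definitions above can be illustrated with a minimal Python sketch. The function name `partition_keys` and the dictionary representation of bulks are hypothetical choices made for illustration only; they are not part of the Bulkopt protocol.

```python
from typing import Dict, Set, Tuple


def partition_keys(
    bulks: Dict[int, Set[str]], requested: Set[str]
) -> Tuple[Set[int], Dict[int, Set[str]], Dict[int, Set[str]], Set[str], Set[str]]:
    """Split the bulk key sets K_i with respect to a read request X.

    `bulks` maps each operation identifier i to its key set K_i
    (an illustrative representation, not the paper's storage format).
    Returns (I, {R_i}, {Q_i}, T, T_bar).
    """
    # I: identifiers of bulks containing at least one requested key
    touched = {i for i, keys in bulks.items() if keys & requested}
    # R_i = K_i ∩ X and Q_i = K_i \ R_i for every touched bulk
    r_sets = {i: bulks[i] & requested for i in touched}
    q_sets = {i: bulks[i] - requested for i in touched}
    # T: union of touched bulks; T_bar: union of all the other bulks
    t = set().union(*(bulks[i] for i in touched))
    t_bar = set().union(*(k for i, k in bulks.items() if i not in touched))
    return touched, r_sets, q_sets, t, t_bar


if __name__ == "__main__":
    bulks = {1: {"a", "b"}, 2: {"c"}, 3: {"d", "e", "f"}}
    I, R, Q, T, T_bar = partition_keys(bulks, {"a", "d", "e", "z"})
    print(I, R, Q, T, T_bar)
```

Note that a requested key that was never inserted (`"z"` in the usage example) simply does not appear in any $R_i$, which matches the definition of $R$ as the requested keys that exist in the database.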

Authenticity
Bulkopt builds proofs of authenticity by demonstrating that:
$$R \subseteq K_D \tag{5}$$
where $K_D$ represents the set of keys included in $D_n$. We recall from Section 2 that authenticity is violated if the server produces a result containing a key that has not been inserted by the owner. Let us assume that $R$ includes a fake key $k_f$ that has been created by the server but does not belong to $K_D$. Then Equation (5) does not hold, since $R$ is not a subset of $K_D$.
An obvious solution to demonstrate that R is a subset of K D would be for the client to have the complete set K D . Of course this is not applicable, since it would require all clients to maintain a local copy of the whole key-value database.
To overcome this issue, Bulkopt requires the owner to maintain a cryptographic accumulator $\sigma(K_D)$ that represents the state of the keys stored in the database $D_n$. This accumulator is updated after each insert operation and has to be available to all users. Moreover, the server builds two witness data structures $W_Q$ and $W_{\bar{T}}$ that represent the sets $Q$ and $\bar{T}$, and sends them to the client together with its response $A$. We remark that cryptographic accumulators and witnesses are small and fixed-size data structures that can be transmitted with minimal network overhead [3,6].
To verify Equation (5) a client can extract the set of keys $R$ from $A$ and use two accumulator verification functions. In particular, it checks whether the witness data structures received from the server validate the results with respect to the requested data and the current state of the database that is maintained locally. Intuitively, the client verification process can be represented as a check of the form given in Equation (6), where verify denotes the accumulator verification functions. If Equation (6) is verified, then the user knows that the two witnesses produced by the server are correct and that Equation (5) is also verified. Hence $R$ is a subset of $K_D$ and authenticity holds. On the other hand, if Equation (6) is not verified, either the witnesses produced by the server are not correct or $R$ is not a subset of $K_D$. In both cases, the client is able to efficiently detect a misbehavior of the server.

Completeness
Bulkopt builds proofs of completeness by demonstrating that:
$$X \cap (K_D \setminus R) = \emptyset \tag{7}$$
that is, the set of keys requested by the client $X$ and the set of keys not returned by the server $K_D \setminus R$ share no common keys. We recall that $K_D \setminus R$ is equal to $Q \cup \bar{T}$, hence Equation (7) can be expressed as the following equation:
$$X \cap (Q \cup \bar{T}) = \emptyset \tag{8}$$
Bulkopt proves such conditions by leveraging properties of ECR hash functions.
In particular, as shown by [8], ECR hash functions can be used to efficiently express set intersections by using polynomial representations of sets. That is, an empty intersection between sets corresponds to polynomials having greatest common divisor (gcd) equal to 1 (informally, since the sets do not share any common elements, the corresponding polynomials do not have common roots). Let us denote as $C_M(s)$ a polynomial representation of a generic set $M$ w.r.t. variable $s$ [8,11], and let $P = Q \cup \bar{T}$. To prove that the gcd of the polynomials is 1, the server must generate two polynomials $\dot{p}$, $\dot{x}$ such that:
$$C_X(s) \cdot \dot{x} + C_P(s) \cdot \dot{p} = 1 \tag{9}$$
The server sends witnesses $W_P$, $W_{\dot{p}}$ and $W_{\dot{x}}$ in addition to $W_Q$ and $W_{\bar{T}}$, which were already sent to prove authenticity. A user can now exploit the verification functions of the considered cryptographic signature to verify Equation (9). If Equation (9) is verified, then the client knows that the witnesses produced by the server are correct and that Equation (7) is also verified. Hence $R$ includes all keys of $X$ requested by the client that are available in the server database, and completeness holds. On the other hand, if Equation (9) is not verified, either the witnesses produced by the server are not correct or $X$ shares common elements with the sets of keys $Q$ or $\bar{T}$ that were not sent by the server, thus violating completeness. In both cases, the client is able to efficiently detect a misbehavior of the server.
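The gcd argument can be made concrete with a small sketch that computes the Bézout polynomials $\dot{x}$ and $\dot{p}$ through the extended Euclidean algorithm over a prime field. This is only an illustration under stated assumptions: the modulus `MOD` is a toy stand-in for the group order $p$, keys are plain integers, all function names are hypothetical, and no cryptographic witness is produced.

```python
MOD = 2**61 - 1  # toy prime field modulus (assumption; stands in for the order p)


def trim(a):
    """Drop trailing zero coefficients in place (coefficients are lowest degree first)."""
    while a and a[-1] == 0:
        a.pop()
    return a


def sub(a, b):
    n = max(len(a), len(b))
    pad = lambda c: c + [0] * (n - len(c))
    return trim([(x - y) % MOD for x, y in zip(pad(a), pad(b))])


def mul(a, b):
    if not a or not b:
        return []
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % MOD
    return trim(out)


def poly_divmod(a, b):
    """Polynomial long division over Z_MOD; b must be non-zero."""
    a = a[:]
    q = [0] * max(len(a) - len(b) + 1, 0)
    inv = pow(b[-1], -1, MOD)
    while len(a) >= len(b):
        c, d = a[-1] * inv % MOD, len(a) - len(b)
        q[d] = c
        for i, y in enumerate(b):
            a[i + d] = (a[i + d] - c * y) % MOD
        trim(a)
    return trim(q), a


def char_poly(keys):
    """C_M(z) = prod_{m in M} (z + m)."""
    result = [1]
    for m in keys:
        result = mul(result, [m % MOD, 1])
    return result


def evaluate(poly, s):
    return sum(c * pow(s, j, MOD) for j, c in enumerate(poly)) % MOD


def disjointness_witness(x_keys, p_keys):
    """Return (x_dot, p_dot) with C_X * x_dot + C_P * p_dot = 1, or None if the sets intersect."""
    r0, r1 = char_poly(x_keys), char_poly(p_keys)
    u0, u1, v0, v1 = [1], [], [], [1]
    while r1:  # invariant: r0 = C_X*u0 + C_P*v0 and r1 = C_X*u1 + C_P*v1
        q, r = poly_divmod(r0, r1)
        r0, r1 = r1, r
        u0, u1 = u1, sub(u0, mul(q, u1))
        v0, v1 = v1, sub(v0, mul(q, v1))
    if len(r0) != 1:  # gcd has degree >= 1: a common root, so a common key
        return None
    inv = pow(r0[0], -1, MOD)  # normalize the constant gcd to 1
    return [c * inv % MOD for c in u0], [c * inv % MOD for c in v0]
```

For disjoint sets such as `{1, 2, 3}` and `{4, 5, 6}` the function returns a pair of polynomials satisfying the identity of Equation (9) at every point `s`; for sets sharing a key it returns `None`, which reflects the fact that a dishonest server cannot produce valid witnesses when it omits a requested key.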

Protocol Implementation
In this section we describe the Bulkopt protocol by referring to its main three phases: setup and key generation (Section 4.1), insert operations (Section 4.2) and read operations (Section 4.3).

Setup and key generation
Setup. Let $g$ be a generator of the cyclic multiplicative group $G$ of prime order $p$, $G_T$ a cyclic multiplicative group of the same order, and $\hat{e} : G \times G \to G_T$ be the pairing function that satisfies the following properties:
- bilinearity: $\hat{e}(m^a, n^b) = \hat{e}(m, n)^{ab}$, $\forall m, n \in G$, $a, b \in \mathbb{Z}_p^*$;
- non-degeneracy: $\hat{e}(g, g) \neq 1$;
- computability: there exists an efficient algorithm to compute $\hat{e}(m, n)$, $\forall m, n \in G$.
Let $h$ be a cryptographic hash function and $h_z(\cdot)$, $h_g(\cdot)$ be two full domain hash functions (FDH) secure in the random oracle model [2,7], defined as follows:
$$h_z : \{0,1\}^* \to \mathbb{Z}_p^* \tag{10}$$
$$h_g : \{0,1\}^* \to G \tag{11}$$
Let us denote as $C_M(s)$ the characteristic polynomial that uniquely represents the set $M$, whose roots are the additive inverses of the elements of the set and whose variable is the secret key $s$ [22]. Polynomial $C_M(s)$ can be computed as follows:
$$C_M(s) = \prod_{m \in M} (m + s) \tag{12}$$
Let $F_M = (f(M), f'(M))$ be the output of an extractable collision resistant (ECR) hash function [4] with secret key $(s, \alpha) \in \mathbb{Z}_p^* \times \mathbb{Z}_p^*$ and public key $[g, g^s, \ldots, g^{s^q}, g^{\alpha}, g^{\alpha s}, \ldots, g^{\alpha s^q}]$, where $M$ denotes a set of values $m \in \mathbb{Z}_p^*$. The output of the function can be computed through two different algorithms depending on the knowledge of the secret key $s$. For this reason, we denote as $f_{sk}(M), f'_{sk}(M)$ the computation of $(f(M), f'(M))$ with knowledge of the secret key, and as $f_{pk}(M), f'_{pk}(M)$ the computation of $(f(M), f'(M))$ with knowledge of the public key only. We use the notation $F_M$, $f(M)$ and $f'(M)$ to identify the black-box outputs of the functions when it is irrelevant whether they were computed with or without knowledge of the secret key. Functions $f_{sk}(M)$ and $f'_{sk}(M)$ can be computed directly from the polynomial $C_M(s)$ shown in Equation (12) as follows:
$$f_{sk}(M) = g^{C_M(s)} \tag{13}$$
$$f'_{sk}(M) = g^{\alpha C_M(s)} \tag{14}$$
Functions $f_{pk}(M)$ and $f'_{pk}(M)$ can be computed by using the coefficients of the polynomial $C_M(s)$. That is, if we consider the set of the coefficients $\{a_j\}_{j=0}^{|M|}$ such that $C_M(s) = \sum_{j} a_j s^j$, the functions can be computed as follows:
$$f_{pk}(M) = \prod_{j=0}^{|M|} \left(g^{s^j}\right)^{a_j} \tag{15}$$
$$f'_{pk}(M) = \prod_{j=0}^{|M|} \left(g^{\alpha s^j}\right)^{a_j} \tag{16}$$
Although functions $(f_{sk}(\cdot), f'_{sk}(\cdot))$ and $(f_{pk}(\cdot), f'_{pk}(\cdot))$ have the same behavior, computing $(f_{sk}(\cdot), f'_{sk}(\cdot))$ is more efficient because it requires only one exponentiation in the group $G$.
Without knowledge of the secret key, ECR hash functions can be verified as follows:
$$\hat{e}(f(M), g^{\alpha}) = \hat{e}(f'(M), g) \tag{17}$$
Otherwise, the secret key allows a more efficient verification:
$$f(M)^{\alpha} = f'(M) \tag{18}$$
Although knowledge of the secret key improves the efficiency of the algorithm, it allows one to cheat in the computation of the hash function. Hence, it cannot be given to parties that would benefit from breaking the security of the ECR hash function.
Key Generation. We denote the owner's secret and public keys as sk and pk and generate them as follows:
$$sk = (u, s, \alpha) \tag{19}$$
$$pk = (U, [g^{s}, \ldots, g^{s^q}, g^{\alpha}, g^{\alpha s}, \ldots, g^{\alpha s^q}]), \quad U = g^{u} \tag{20}$$
where $q \in \mathbb{N}$ must be greater than or equal to the maximum number of records involved in each insert or read operation, and $u$, $s$ and $\alpha$ must be different from each other.
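The equivalence between the secret-key and public-key computations of the ECR hash can be checked with a toy sketch. It assumes the standard construction $f(M) = g^{C_M(s)}$, $f'(M) = g^{\alpha C_M(s)}$ and replaces the pairing-friendly group with a tiny prime-order subgroup of $\mathbb{Z}_{2039}^*$; the parameters and function names are illustrative assumptions and offer no security. Without a pairing, only the secret-key check $f(M)^{\alpha} = f'(M)$ can be shown.

```python
import secrets

# Toy parameters (assumptions for illustration only, far too small to be secure):
# P_MOD = 2 * ORDER + 1 is a safe prime and G = 4 generates the subgroup of order ORDER.
P_MOD, ORDER, G = 2039, 1019, 4


def char_poly_coeffs(keys):
    """Coefficients a_j (lowest degree first) of C_M(z) = prod (z + m), reduced mod ORDER."""
    coeffs = [1]
    for m in keys:
        nxt = [0] * (len(coeffs) + 1)
        for j, a in enumerate(coeffs):
            nxt[j] = (nxt[j] + a * m) % ORDER      # contribution of a * m * z^j
            nxt[j + 1] = (nxt[j + 1] + a) % ORDER  # contribution of a * z^(j+1)
        coeffs = nxt
    return coeffs


def keygen(q):
    """Return ((s, alpha), (powers g^{s^j}, powers g^{alpha s^j})) for j = 0..q."""
    s = secrets.randbelow(ORDER - 1) + 1
    alpha = secrets.randbelow(ORDER - 1) + 1
    while alpha == s:  # the text requires the secret exponents to differ
        alpha = secrets.randbelow(ORDER - 1) + 1
    pk_s = [pow(G, pow(s, j, ORDER), P_MOD) for j in range(q + 1)]
    pk_as = [pow(G, alpha * pow(s, j, ORDER) % ORDER, P_MOD) for j in range(q + 1)]
    return (s, alpha), (pk_s, pk_as)


def f_sk(keys, sk):
    """Secret-key evaluation: evaluate C_M(s) in the exponent directly."""
    s, alpha = sk
    c = 1
    for m in keys:
        c = c * (m + s) % ORDER
    f = pow(G, c, P_MOD)
    return f, pow(f, alpha, P_MOD)


def f_pk(keys, pk):
    """Public-key evaluation: combine the published powers with the coefficients of C_M."""
    pk_s, pk_as = pk
    coeffs = char_poly_coeffs(keys)
    if len(coeffs) > len(pk_s):
        raise ValueError("set larger than the bound q fixed at key generation")
    f = fp = 1
    for a, gs, gas in zip(coeffs, pk_s, pk_as):
        f = f * pow(gs, a, P_MOD) % P_MOD
        fp = fp * pow(gas, a, P_MOD) % P_MOD
    return f, fp
```

Calling `f_sk(keys, sk)` and `f_pk(keys, pk)` on the same key set yields the same pair, while `f_pk` needs one exponentiation per coefficient, which is the efficiency gap mentioned in the text.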

Insert operations
The owner issues an insert operation by sending the tuple $(B_i, \sigma_i, \Gamma_i)$, where:
- $i \in \mathbb{N}$ is the operation identifier, that is, the incremental counter maintained locally by the owner and by the server that identifies the insert operation (see Section 3);
- $B_i = \{(k, v)\}$ is the set of keys and records inserted in the database at operation $i$. We also denote as $K_i$ the set of the keys $\{k\}$ inserted in this operation;
- $\sigma_i$ is the bulk signature of the set of keys $K_i$ inserted at operation $i$. It is computed by the owner as:
$$\sigma_i = \left(h_g(i) \cdot f_{sk}(K_i)\right)^{u} \tag{21}$$
- $\Gamma_i$ is the set of the record signatures of the records $B_i$, computed by using a BLS aggregate signature scheme [7]:
$$\Gamma_i = \{\gamma_{k,v} = h_g(k \,\|\, v)^{u} : (k, v) \in B_i\} \tag{22}$$
where $\|$ denotes the concatenation operator. We assume that the concatenation of the values $k$ and $v$ does not compromise the security of $h_g(\cdot)$. If the security of the candidate implementation of $h_g(\cdot)$ is in doubt in this context, one should apply a collision resistant hash function or a message authentication code algorithm to the value $v$ before the concatenation operation [1].
We note that the bulk signature $\sigma_i$ (Equation (21)) is similar to the computation of a bilinear map accumulator [18]. The original scheme would compute the signature of $f_{sk}(K_i)$ as $f_{sk}(K_i)^{u}$. Our scheme differs by the factor $h_g(i)^{u}$, which can be seen as a BLS signature of the operation identifier $i$. This variant allows us to bind the bulk signature $\sigma_i$ to the operation identifier $i$ in which the insert operation is executed. As we describe in Section 4.3, this design choice also allows us to verify correctness of the server answers by using security proofs that were originally proposed for the memory checking setting [8].
Both the owner and the server keep track of the operation identifier i locally, without exchanging it in each insert operation. After each insert operation, the server stores all records B i , the bulk signatures σ i and the record signatures Γ i in the database associated to the operation identifier i.
The owner does not store any bulk signature σ i or record Γ i , but he maintains a cryptographic structure of constant size to keep track of the state of the database. We call it the database signature D = (σ last , F D last ), where last is the value of the operation identifier i for the last insert operation executed on the server, and σ last and F D last are the bulk signature and ECR hash function of all the keys inserted in the database.
The owner computes the database signature $\sigma_{last}$ as follows: after the first insertion ($i = 1$) he sets the initial value of the database signature to the first bulk signature, $\sigma_{last} = \sigma_1$; after any other insert operation ($i > 1$), he multiplies the current version of the database signature by the bulk signature $\sigma_i$ of the last executed insert operation:
$$\sigma_{last} \leftarrow \sigma_{last} \cdot \sigma_i \tag{23}$$
As a result, the value of the database signature $\sigma_{last}$ is equal to the product of all the bulk signatures $\sigma_i$ ever sent by the owner to the server:
$$\sigma_{last} = \prod_{i=1}^{last} \sigma_i \tag{24}$$
The owner computes the database ECR hash function $F_{D_{last}}$ as follows: after the first operation ($i = 1$), the database accumulator is equal to the ECR hash function of the keys included in the first bulk of data, that is $F_{D_1} = (f_{sk}(K_1), f'_{sk}(K_1))$; after any other operation ($i > 1$), the database accumulator is computed from the previous one as $F_{D_i} = \left(f(D_{i-1})^{C_{K_i}(s)}, f'(D_{i-1})^{C_{K_i}(s)}\right)$. As a result, the value of $F_{D_{last}}$ after the last insert operation is the following:
$$F_{D_{last}} = \left(g^{\prod_{i=1}^{last} C_{K_i}(s)},\; g^{\alpha \prod_{i=1}^{last} C_{K_i}(s)}\right) \tag{25}$$
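The constant-size owner state can be sketched as follows. The sketch assumes that bulk signatures have the form $(h_g(i) \cdot f_{sk}(K_i))^u$, that bulks have pairwise disjoint keys, and that the ECR hash of the database is updated by raising it to $C_{K_i}(s)$. It uses a toy group and a toy hash-to-group function (which, unlike a real FDH, exposes discrete logarithms); the class `OwnerState` and all parameters are hypothetical, and only the first component $f$ of the ECR hash is tracked.

```python
import hashlib

# Toy group (assumption, not secure): G = 4 has prime order 1019 in Z_2039^*.
P_MOD, ORDER, G = 2039, 1019, 4


def h_g(data: bytes) -> int:
    """Toy stand-in for the full domain hash h_g: hash to an exponent, then into the group."""
    e = int.from_bytes(hashlib.sha256(data).digest(), "big") % ORDER
    return pow(G, e, P_MOD)


def char_poly_at(keys, s):
    """C_K(s) = prod (k + s), reduced modulo the group order."""
    c = 1
    for k in keys:
        c = c * (k + s) % ORDER
    return c


def f_sk(keys, s):
    return pow(G, char_poly_at(keys, s), P_MOD)


def bulk_signature(i, keys, s, u):
    """sigma_i = (h_g(i) * f_sk(K_i))^u."""
    return pow(h_g(str(i).encode()) * f_sk(keys, s) % P_MOD, u, P_MOD)


class OwnerState:
    """Constant-size state kept by the owner: (last, sigma_last, f(D_last))."""

    def __init__(self):
        # 1 is the empty product; G = g^1 is the hash of the empty key set
        self.last, self.sigma_last, self.f_d = 0, 1, G

    def insert(self, keys, s, u):
        self.last += 1
        sigma_i = bulk_signature(self.last, keys, s, u)
        self.sigma_last = self.sigma_last * sigma_i % P_MOD           # running product
        self.f_d = pow(self.f_d, char_poly_at(keys, s), P_MOD)        # f(D_i) = f(D_{i-1})^{C_{K_i}(s)}
        return sigma_i  # sent to the server together with the bulk
```

After any number of insertions the owner stores three values regardless of the database size, while `sigma_last` equals the product of every bulk signature and `f_d` equals the ECR hash of all inserted keys.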

Read operations
To execute a read operation a client must send a set of keys $X = \{k\}$ to the server. The server returns the following tuple:
$$\mathrm{response}(X) := (I, A, \pi_{auth}, \pi_{comp}, \pi_{rec}) \tag{26}$$
where:
- $I = \{i\}$ is the set of the operation identifiers associated with the bulks that include at least one of the keys $X$ requested by the client;
- $A = \{A_i\}_{i \in I}$ is the set of the key-value records that compose the actual response to the client, grouped by the operation identifier $i$ from which the server retrieved them;
- $\pi_{auth}$, $\pi_{comp}$ and $\pi_{rec}$ are the keys authenticity proof, the keys completeness proof and the records authenticity proof, used to prove authenticity and completeness of the returned keys and authenticity of the values associated with the keys, respectively.
Although from a security perspective keys authenticity and completeness proofs depend on each other, we distinguish them for the sake of clarity. We also observe that guaranteeing records correctness does not require any completeness proof because we are considering a key-value database where projection queries are not allowed. We recall from Section 3 that each element of a response set $A_i$ is a key-value tuple $(k, v)$, and we denote as $R_i$ the set of the keys included in the set $A_i$. In the following we describe separately the generation and the verification processes for keys authenticity proofs, keys completeness proofs and records authenticity proofs.
Keys authenticity. The keys authenticity proof is a tuple that includes the following values:
$$\pi_{auth} = \left(\{F_{Q_i}\}_{i \in I}, F_T, W_{\bar{T}}\right) \tag{27}$$
where $\{F_{Q_i}\}_{i \in I}$ is the set of the bulk witnesses, $F_T$ is the aggregate ECR hash function of the bulks that include at least one of the keys requested by the client, and $W_{\bar{T}}$ is the aggregate bilinear signature of the bulks that do not include any of the keys requested by the client. The server generates each bulk witness $F_{Q_i}$ by computing the ECR hash function $f_{pk}$ (see Equation (16)) on the set complement $Q_i$ of $R_i$ with respect to $K_i$, as follows:
$$F_{Q_i} = \left(f_{pk}(Q_i), f'_{pk}(Q_i)\right) \tag{28}$$
Moreover, the server computes the aggregate bilinear signature $W_{\bar{T}}$ as the witness for the bulks that do not include any keys requested by the client by aggregating the owner signatures as follows:
$$W_{\bar{T}} = \prod_{i \in \{1, \ldots, last\} \setminus I} \sigma_i \tag{29}$$
The client verifies authenticity of the keys $\{R_i\}$ returned by the server by using the values included in the authenticity proof $\pi_{auth}$ and the database signature $\sigma_{last}$ stored locally (see Equation (24)). The client verifies correctness of the ECR hash function $F_T$ by using Equation (17).
Then, the client verifies that the ECR hash function $F_T$ is built correctly with respect to the aggregate bilinear signature $W_{\bar{T}}$ by using the locally maintained database signature $\sigma_{last}$, through the pairing check of Equation (30). Finally, the client uses $F_T$ to verify authenticity of the returned keys $\{R_i\}_{i \in I}$ by using the bulk witnesses $\{F_{Q_i}\}_{i \in I}$, through the check of Equation (31). After this verification process the client obtains the following guarantees:
- $F_T$ is a valid witness for the bilinear aggregate signature $W_{\bar{T}}$, as generating or extracting any other owner signature would break the non-extractability guarantees of aggregate bilinear signatures [7];
- all the returned keys $\{R_i\}_{i \in I}$ are authentic, because the server proved existence of the witnesses $Q_i$ with respect to the bulks aggregate hash function $F_T$, and generating false witnesses would break the extractable collision resistance (ECR) guarantees of the ECR hash function $(f(\cdot), f'(\cdot))$ [8];
- all the operation identifiers $i \in I$ sent by the server are authentic, as generating identifiers that satisfy Equation (31) would break either the FDH function $h_g(\cdot)$ or the collision resistance guarantees of aggregate bilinear signatures [7].
Keys completeness. As described in Section 3.3, to prove completeness of the response the server must produce witnesses that prove that the requested keys $X$ are disjoint from the complement sets $Q$ and $\bar{T}$. The completeness proof is a tuple that includes such witnesses, and additional values that allow the client to verify that the server generated them correctly:
$$\pi_{comp} = \left(F_P, F_{\dot{p}}, F_{\dot{x}}\right) \tag{32}$$
where $F_P$ is the ECR hash function of the set $P = Q \cup \bar{T}$, the union of the complement sets, and $F_{\dot{p}}$ and $F_{\dot{x}}$ are the witnesses that prove that the set of the requested keys $X$ is disjoint from sets $\bar{T}$ and $Q$. First, the server computes the ECR hash function $F_P$ of $Q \cup \bar{T}$ (Equation (33)). The two witnesses $F_{\dot{p}}$ and $F_{\dot{x}}$ of polynomials $\dot{p}$ and $\dot{x}$ are generated by the server to show that the gcd of the characteristic polynomials $C_X$ and $C_{Q \cup \bar{T}}$ of sets $X$ and $Q \cup \bar{T}$ is 1, which is equivalent to proving that $X$ is disjoint from $Q$ and $\bar{T}$, as shown in [8]:
$$\dot{x}, \dot{p} : C_X(s) \cdot \dot{x} + C_P(s) \cdot \dot{p} = 1 \tag{34}$$
The client verifies correctness of the ECR hash functions $F_P$, $F_{\dot{p}}$ and $F_{\dot{x}}$ sent by the server by using Equation (17). Then, he verifies whether $F_P$ represents the set complement of $R$ with respect to $D$ by checking the value of $F_P$ against the database accumulator $F_{D_{last}}$ (see Equation (25)) publicly distributed by the owner, through the pairing check of Equation (35). Now that the client has verified the correct generation of the witness $F_P$, he can verify that $X$ is disjoint from $Q$ and $\bar{T}$ by testing Equation (34) as follows:
$$\hat{e}(f_{pk}(X), F_{\dot{x}}) \cdot \hat{e}(F_P, F_{\dot{p}}) = \hat{e}(g, g) \tag{36}$$
Records authenticity.
The server computes the proof of authenticity $\pi_{rec}$ by aggregating all the record signatures $\gamma_{k,v} = \gamma(k, v)$ previously received from the owner for all the records returned to the client, as follows:
$$\pi_{rec} = \prod_{(k, v) \in A} \gamma_{k,v} \tag{37}$$
The client verifies authenticity of the response $A$ given the server proof $\pi_{rec}$ and the owner public key $U$ by verifying the following condition:
$$\hat{e}(\pi_{rec}, g) = \hat{e}\Big(\prod_{(k, v) \in A} h_g(k \,\|\, v),\; U\Big) \tag{38}$$
This concludes the description of the protocol: any client that is enabled to query the database and that knows the owner's public key pk and the state of the database $D$ can verify correctness of the results by using the described verification operations. We recall that if a client knows the secret key sk, such as in symmetric settings, he can verify results correctness more efficiently by using the secret exponents $u$ and $\alpha$.
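Record signature aggregation can be illustrated in the symmetric setting mentioned above, where the verifier knows the secret exponent $u$ and no pairing is needed. The sketch assumes BLS-style record signatures $h_g(k \,\|\, v)^u$, a toy group, a toy hash-to-group function and a length-prefixed encoding for the concatenation; all of these, like the function names, are illustrative assumptions rather than the protocol's actual instantiation.

```python
import hashlib

# Toy group (assumption, not secure): G = 4 has prime order 1019 in Z_2039^*.
P_MOD, ORDER, G = 2039, 1019, 4


def h_g(data: bytes) -> int:
    """Toy stand-in for the full domain hash h_g."""
    e = int.from_bytes(hashlib.sha256(data).digest(), "big") % ORDER
    return pow(G, e, P_MOD)


def encode(k: bytes, v: bytes) -> bytes:
    """Unambiguous encoding of k || v (the length prefix is an illustrative choice)."""
    return len(k).to_bytes(4, "big") + k + v


def sign_record(k: bytes, v: bytes, u: int) -> int:
    """Owner side: gamma_{k,v} = h_g(k || v)^u."""
    return pow(h_g(encode(k, v)), u, P_MOD)


def aggregate(signatures) -> int:
    """Server side: pi_rec is the product of the signatures of the returned records."""
    proof = 1
    for gamma in signatures:
        proof = proof * gamma % P_MOD
    return proof


def verify_with_secret(records, proof: int, u: int) -> bool:
    """Symmetric-setting check: (prod h_g(k || v))^u == pi_rec."""
    expected = 1
    for k, v in records:
        expected = expected * h_g(encode(k, v)) % P_MOD
    return pow(expected, u, P_MOD) == proof


if __name__ == "__main__":
    u = 77
    records = [(b"k1", b"v1"), (b"k2", b"v2")]
    proof = aggregate(sign_record(k, v, u) for k, v in records)
    print(verify_with_secret(records, proof, u))
```

The proof is a single group element regardless of how many records are returned, which is the source of the reduced network overhead; in the publicly verifiable setting the final comparison would be replaced by the pairing check against the public key $U$.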

Related work
Most literature related to security of data outsourcing and cloud services aims to protect the confidentiality of tenant data against malicious insiders of cloud providers. These works typically assume the honest-but-curious threat model, where an insider within the cloud provider may access and copy tenant data without corrupting or deleting them. To solve this issue, several works in the literature leverage architectures based on partially homomorphic and property-preserving encryption that allow cloud computations and efficient retrieval on encrypted data (e.g., [10,12,21]). Unlike these works, in this paper we do not trust the cloud provider to behave correctly, but we assume a threat model where the cloud provider can violate authenticity and completeness of tenant data, either due to hardware/software failures or deliberate attacks. The main problem in this context is to combine authenticity and completeness guarantees without affecting the database performance and functionalities. As an example, standard message authentication codes or digital signatures can guarantee authenticity of outsourced data. However, they cannot guarantee results completeness without incurring large network overhead.
A well-known solution to guarantee results correctness is to adopt Merkle hash trees [9], which allow building efficient proofs for range queries by authenticating the sorted leaves of the tree with respect to an index defined at design time. However, they do not support efficient queries on arbitrary values and efficient proofs on dispersed key values. Other solutions allow the tenant to verify authenticity and completeness of outsourced data by means of RSA accumulators [16,17,13]. Although RSA accumulators provide constant asymptotic complexity for read and update operations, their high constant computational overhead often prevents their practical application in most scenarios [9]. A different approach is proposed in [25], which relies on the insertion of a number of fake records in the database. These records are then retrieved to verify their presence, and possibly identify completeness violations. However, since no cryptographic verification is executed on the real database, such a solution provides lower security guarantees based on probabilistic completeness verification. The protocols proposed in [8] guarantee authenticity of operations in a memory-checking model by maintaining an N-ary tree of constant height. Since only the values of the nodes change (but not the number of cells), these protocols can produce proofs of constant size with respect to the cardinality of the sets stored in each memory cell. However, they cannot be easily adopted in the data outsourcing scenario because the number of sets is not constant and the tree structure would require expensive re-balancing operations.

Conclusion
This paper proposes Bulkopt, a novel protocol that provides authenticity and completeness guarantees for key-value databases. Bulkopt is specifically designed to provide data security guarantees in the context of cloud-based services subject to read and append-only workloads, and efficiently supports bulk insert operations, as well as read requests that involve the retrieval of multiple, non-contiguous keys at once. Efficient verification of bulk operations is achieved by modeling data security constraints in terms of set operations, and by leveraging cryptographic proofs for set operations. In particular, Bulkopt is the first protocol that combines extractable collision resistant hash functions and aggregate bilinear map signatures to achieve novel cryptographic constructions that allow the verification of authenticity and completeness over large sets of data by relying on small cryptographic proofs. More work is needed to tune the protocol performance by using data structures to cache partial proofs at the server side, and to support update operations.