Secure Database Using Order-Preserving Encryption Scheme Based on Arithmetic Coding and Noise Function

Order-preserving symmetric encryption (OPE) is a deterministic encryption scheme whose encryption function preserves the numerical order of the plaintexts. This allows comparison operations to be applied directly to encrypted data when, for example, decryption takes too much time or the cryptographic key is unknown. For this reason it is used in cloud databases, where range queries can be performed efficiently on the encrypted data. This paper presents an order-preserving encryption scheme based on arithmetic coding. In the first part we review the principles of arithmetic coding, which form the basis of the algorithm, as well as the changes that were made. Then we describe the noise function approach, which makes the algorithm cryptographically stronger, and show the modifications that can be made to obtain an order-preserving hash function. Finally, we analyze the resulting vulnerability to chosen-plaintext attack.


Introduction
Nowadays, the amount of information stored in various databases is steadily increasing. To store and effectively manage large amounts of data, an organization has to increase its storage capacity and allocate funds for its administration. Another way, chosen by many companies, is to hand database management over to a third party. Such a service is managed by a cloud operator and is called Database as a Service (DBaaS). This approach has its own flaws, the most important of which is security: data can be stolen by the service provider itself or by someone else from its storage. Fortunately, this problem can be solved by encryption. Of course, if we just encrypt the whole database with a conventional encryption algorithm, we have to decrypt and re-encrypt it each time we need something, so all the advantages are lost. That is why special encryption schemes, such as homomorphic encryption and order-preserving encryption, have been developed. The first allows us to process encrypted data, and the second to sort it and select the desired records.
All known order-preserving schemes have significant problems, such as a low level of security (polynomial monotonic functions [1], spline approximation [2], linear functions with random noise [3]), low performance (summation of random numbers [4], B-trees [5]) or the need to process very large numbers (the scheme by Boldyreva [6]). The proposed scheme does not have these disadvantages and, furthermore, unlike all the others, can be used to encrypt real numbers. It can also be used to obtain an order-preserving hash function.
This algorithm combines two main ideas with which the majority of OPE schemes operate: the design of monotonic functions and elements of coding theory (implicit design of monotonic functions). The scheme is described as based on arithmetic coding and a noise function but, in fact, this article considers only the case of a binary alphabet. In theory, nothing prevents the use of an arbitrary one.
First, let us define order-preserving encryption. Assume there are two sets A and B, each with an order relation <. A function f: A → B is strictly increasing if $\forall x, y \in A$: $x < y \Leftrightarrow f(x) < f(y)$. Order-preserving encryption is deterministic symmetric encryption based on a strictly increasing function.
The described order-preserving encryption scheme was developed in the Laboratory of Modern Computer Technologies of the Novosibirsk State University Research Department as a part of the "Protected Database" project and is based on arithmetic coding and a noise function. Let us consider them in detail.

Splitting procedure of arithmetic coding
Suppose c is a non-negative integer that requires n bits for its representation, i.e.
$$c = \alpha_1 2^{n-1} + \alpha_2 2^{n-2} + \dots + \alpha_n,$$
where $\alpha_1, \alpha_2, \dots, \alpha_n$ is a bit string and $\alpha_1$ is the MSB. Let us define the bijection f. Assume that the string $\alpha_1, \alpha_2, \dots, \alpha_n$ defines a certain real number $s \in [0, 1)$ as follows:
$$s = \sum_{i=1}^{n} \alpha_i 2^{-i}.$$
Let us find another representation for the number s. In order to do it, we use the idea of arithmetic coding. Notice that the number s satisfies the equation $2^n s = c$. The equation has only one solution on the interval $[0, 1]$. If we solve this equation using a standard binary search, we get the initial number s after n steps. The main idea of arithmetic coding is that the intervals can be split into parts in an arbitrary ratio rather than in half. In this case an approximate solution of the equation can be found in fewer steps, which is what allows arithmetic coding to compress data.
First of all, let us consider the splitting procedure. The initial interval is split into two parts in the ratio $\gamma : \mu$, and one of the parts is selected. This interval is again split into parts in the ratio $\gamma : \mu$. According to the sign of the function G(x) at the splitting point, one of the segments is selected. Proceeding by induction, the interval $[a_k, b_k]$ can be calculated for every k. With the ratio normalized so that $\gamma + \mu = 1$, its length is $\gamma^r \mu^{k-r}$, where r is the number of zeros in the string $\beta$. If $\gamma^r \mu^{k-r} < \frac{1}{2^n}$ for every r, then $s \in [a_k, b_k]$ and $c = 2^n s$ are uniquely defined by $\beta = (\beta_1, \dots, \beta_k)$, because an interval shorter than $\frac{1}{2^n}$ contains at most one multiple of $\frac{1}{2^n}$. It is also obvious that this mapping preserves order.
The generalization used in adaptive arithmetic coding, as well as in the proposed algorithm, is that a different ratio can be used at each step. This allows us to achieve stronger security of the encryption.
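The splitting procedure with a fixed ratio can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ratio is given by two integers `gamma` and `mu`, bit 0 is assumed to select the left part (so that r zeros contribute the factor $\gamma^r$), and the comparison `s < x` stands in for the sign test on G(x). The function names are hypothetical.

```python
import math
from fractions import Fraction

def steps_needed(n, gamma, mu):
    """Smallest k for which the longest possible interval is at most 2^-n."""
    k, worst = 0, Fraction(1)
    while worst > Fraction(1, 2**n):
        worst *= Fraction(max(gamma, mu), gamma + mu)
        k += 1
    return k

def split_encode(c, n, gamma, mu, k):
    """Encode the n-bit integer c as k bits by splitting [0, 1) in the ratio gamma:mu."""
    s = Fraction(c, 2**n)                  # s solves 2^n * s = c
    a, b = Fraction(0), Fraction(1)
    bits = []
    for _ in range(k):
        x = a + (b - a) * Fraction(gamma, gamma + mu)   # splitting point
        if s < x:                          # assumed convention: 0 = left part
            bits.append(0)
            b = x
        else:
            bits.append(1)
            a = x
    return bits, (a, b)

def split_decode(bits, n, gamma, mu):
    """Rebuild the interval from the bits and return the only integer c it admits."""
    a, b = Fraction(0), Fraction(1)
    for bit in bits:
        x = a + (b - a) * Fraction(gamma, gamma + mu)
        if bit == 0:
            b = x
        else:
            a = x
    # The invariant a <= c / 2^n < b holds, and b - a <= 2^-n, so c is unique.
    return math.ceil(a * 2**n)
```

Exact rational arithmetic (`Fraction`) is used so that the interval endpoints are not affected by rounding. Comparing the resulting bit lists lexicographically reproduces the order of the encoded integers.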

Noise function
It is known that the composition of two strictly increasing functions is strictly increasing. Therefore, to provide stronger security of the cryptographic algorithm, a special random strictly increasing function is used in addition to the splitting procedure. In fact, we use the inverse of the function that was generated. It was proved [6] that OPE schemes cannot satisfy the standard notions of security, such as indistinguishability against chosen-plaintext attack (IND-CPA) [7], since they leak the ordering information of the plaintexts. If an adversary knows plaintexts $p_1, p_2$, the corresponding ciphertexts $c_1, c_2$, and a ciphertext c such that $c_1 < c < c_2$, it is obvious that the plaintext for c lies in the interval $(p_1, p_2)$. In addition, the adversary can always find an approximation of the decryption function, for instance, using linear interpolation.
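The interpolation attack mentioned above can be illustrated with a short sketch. The function `toy_encrypt` is a hypothetical strictly increasing stand-in, not the proposed scheme, and `interpolate_plaintext` is an illustrative name; the point is only that two known pairs already bracket the unknown plaintext.

```python
def toy_encrypt(p):
    """Hypothetical strictly increasing function used only for this illustration."""
    return p**3 + 7 * p

def interpolate_plaintext(known, c):
    """Estimate the plaintext of ciphertext c from known (plaintext, ciphertext) pairs."""
    pairs = sorted(known, key=lambda pc: pc[1])
    for (p1, c1), (p2, c2) in zip(pairs, pairs[1:]):
        if c1 <= c <= c2:
            # Linear interpolation between the two neighbouring known pairs.
            return p1 + (p2 - p1) * (c - c1) / (c2 - c1)
    raise ValueError("ciphertext outside the range of the known pairs")

known_pairs = [(10, toy_encrypt(10)), (20, toy_encrypt(20))]
estimate = interpolate_plaintext(known_pairs, toy_encrypt(15))
```

The estimate always falls inside the interval spanned by the neighbouring known plaintexts; how close it is to the true value depends on how irregular the encryption function is between them.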
Moreover, in the case of, for example, the encryption method developed by David A. Singer and Sun S. Chung [1], where strictly increasing polynomial functions $f(x) = a_0 + a_1 x + \dots + a_n x^n$ are used for encryption, the adversary can calculate the exact encryption function given (n + 1) arbitrary (plaintext, ciphertext) pairs. It is enough to solve the system of equations
$$\begin{cases} a_0 + a_1 x_0 + \dots + a_n x_0^n = y_0 \\ a_0 + a_1 x_1 + \dots + a_n x_1^n = y_1 \\ \vdots \\ a_0 + a_1 x_n + \dots + a_n x_n^n = y_n \end{cases}$$
Thus, the adversary can obtain $a_0, \dots, a_n$ and, correspondingly, the encryption function f(x).
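To make the attack on polynomial schemes concrete, the system above can be solved exactly. The sketch below uses Gauss-Jordan elimination over rational numbers; the secret coefficients and sample points are made up for the illustration, and `recover_coefficients` is a hypothetical name.

```python
from fractions import Fraction

def recover_coefficients(pairs):
    """Solve sum_j a_j * x_i**j = y_i for a_0..a_n from n + 1 (x_i, y_i) pairs."""
    m = len(pairs)
    # Augmented Vandermonde matrix [1, x, x^2, ..., x^n | y].
    rows = [[Fraction(x) ** j for j in range(m)] + [Fraction(y)] for x, y in pairs]
    for col in range(m):
        # Distinct x_i make the matrix non-singular, so a pivot always exists.
        pivot = next(r for r in range(col, m) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        for r in range(m):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [vr - factor * vc for vr, vc in zip(rows[r], rows[col])]
    return [rows[i][m] for i in range(m)]

# Made-up secret polynomial f(x) = 5 + 3x + 2x^3 (degree n = 3).
secret = [5, 3, 0, 2]

def poly_encrypt(x):
    return sum(a * x**j for j, a in enumerate(secret))

observed = [(x, poly_encrypt(x)) for x in (1, 2, 4, 7)]   # n + 1 = 4 pairs
recovered = recover_coefficients(observed)
```

With n + 1 pairs at distinct plaintexts the recovered coefficients coincide with the secret ones, which is exactly why the number of required pairs and the complexity of the system need to be increased.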
In order to complicate this task it is necessary to maximize the number of pairs required for this attack and the complexity of the system of equations $f(x_i) = y_i$. Therefore, it was decided to generate the noise function from the class of functions
$$f(x) = \int_c^x (a_0 + a_1 t + a_2 t^2)\,\bigl(a_3 + a_4 \sin(a_5 + a_6 t) + a_7 \cos(a_8 + a_9 t)\bigr)\, dt,$$
where c is an arbitrary constant and the coefficients $a_i$ are selected so that $(a_0 + a_1 t + a_2 t^2)\,\bigl(a_3 + a_4 \sin(a_5 + a_6 t) + a_7 \cos(a_8 + a_9 t)\bigr) > 0$ for $\forall t \in (c; x_{max})$.
In this case f(x) is a strictly increasing function (see Fig. 1). This integral can be calculated explicitly, which speeds up the evaluation of the function.
Fig. 1. Example of a correct noise function from the class.
Due to this combination of sine and cosine, the behavior of the function is hard to predict without knowledge of the coefficients $a_0, \dots, a_9$. Consequently, the system of equations
$$\int_c^{x_i} (a_0 + a_1 t + a_2 t^2)\,\bigl(a_3 + a_4 \sin(a_5 + a_6 t) + a_7 \cos(a_8 + a_9 t)\bigr)\, dt = y_i, \quad i = 0, 1, \dots, k,$$
is difficult to solve, which indicates that the proposed algorithm is cryptographically strong against this type of attack.
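A noise function of this class can be explored numerically with the sketch below. The coefficient values are made up, the positivity check is a grid test rather than a proof, and the integral is approximated by the composite Simpson rule instead of the closed form mentioned in the text. One simple sufficient choice, assumed here and not taken from the paper, is a positive quadratic factor together with $a_3 > |a_4| + |a_7|$.

```python
import math

def integrand(t, a):
    """(a0 + a1 t + a2 t^2) * (a3 + a4 sin(a5 + a6 t) + a7 cos(a8 + a9 t))."""
    poly = a[0] + a[1] * t + a[2] * t * t
    trig = a[3] + a[4] * math.sin(a[5] + a[6] * t) + a[7] * math.cos(a[8] + a[9] * t)
    return poly * trig

def is_positive_on_grid(a, c, x_max, samples=10_000):
    """Heuristic check that the integrand is positive on (c, x_max)."""
    step = (x_max - c) / samples
    return all(integrand(c + i * step, a) > 0 for i in range(1, samples))

def noise(x, a, c, steps=2_000):
    """Composite Simpson approximation of the integral from c to x (steps is even)."""
    if x == c:
        return 0.0
    h = (x - c) / steps
    total = integrand(c, a) + integrand(x, a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(c + i * h, a)
    return total * h / 3

# Made-up coefficients: the quadratic is positive for t >= 0 and a3 > |a4| + |a7|.
coefficients = [1.0, 0.5, 0.2, 3.0, 1.0, 0.3, 2.0, 1.0, 1.1, 5.0]
values = [noise(float(x), coefficients, 0.0) for x in range(11)]
```

Because the integrand stays positive, the sampled values of f grow strictly with x, while the sine and cosine terms make the growth rate oscillate.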

Key generation
As the private key of the encryption algorithm we consider the noise function
$$f(x) = \int_c^x (a_0 + a_1 t + a_2 t^2)\,\bigl(a_3 + a_4 \sin(a_5 + a_6 t) + a_7 \cos(a_8 + a_9 t)\bigr)\, dt$$
and a set of ratios $p_i, q_i$.
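In order for an encrypted n-bit number to be uniquely decrypted, the intervals computed during decryption have to be short enough: key selection must guarantee the condition $f(b_k) - f(a_k) < \frac{1}{2^n}$ used in the Decryption section. The ratios are therefore generated as follows:
1. Generate the next pair of ratios $p_i, q_i$.
2. Check the condition on the interval length. If this condition is satisfied, go to step 3, else go back to step 1.
3. Output the set of ratios $p_1, q_1, p_2, q_2, \dots, p_k, q_k$.
The key is the set $K = [a_0, \dots, a_9, p_1, q_1, p_2, q_2, \dots, p_k, q_k]$.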
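The generate-and-check loop for the ratios can be sketched as follows. The exact bound used by the authors is not reproduced in this sketch: `max_len` is a hypothetical parameter standing for it, the worst case is estimated by always taking the larger part of each split, and `max_part` and the random source are likewise assumptions.

```python
import random
from fractions import Fraction

def generate_ratios(max_len, rng, max_part=16):
    """Draw pairs (p_i, q_i) until the longest possible interval is below max_len.

    max_len  -- hypothetical bound on the relative interval length
    max_part -- assumed upper limit for p_i and q_i
    """
    ratios = []
    worst = Fraction(1)                     # longest interval reachable so far
    while worst >= max_len:                 # step 2: condition not yet satisfied
        p = rng.randint(1, max_part)        # step 1: generate the next pair
        q = rng.randint(1, max_part)
        ratios.append((p, q))
        worst *= Fraction(max(p, q), p + q)
    return ratios                           # step 3: output the set of ratios

ratios = generate_ratios(Fraction(1, 2**8), random.Random(1))
```

Since every factor is at most `max_part / (max_part + 1)`, the loop always terminates. A production key generator would need a cryptographically secure random source rather than `random.Random`.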

Encryption
Assume we need to encrypt an n-bit integer s with the key $K = [f(x), p_1, q_1, p_2, q_2, \dots, p_k, q_k]$, where f(x) is a noise function with $f(a_0) = 0$, $f(b_0) = 2^n$, and $(p_i, q_i)$ is a set of ratios. Consider the i-th iteration of the algorithm.
The current interval $[a_{i-1}, b_{i-1}]$ is split in the ratio $p_i : q_i$. Let it be split at the point $x \in [a_{i-1}, b_{i-1}]$, i.e.
$$x = a_{i-1} + \frac{p_i}{p_i + q_i}\,(b_{i-1} - a_{i-1}).$$
One of the two resulting parts is then selected as the next interval $[a_i, b_i]$, and the choice is recorded in the bit $\beta_i$.
Notice that $\forall i$, $f^{-1}(s) \in [a_i, b_i]$ according to the selection of $a_i$ and $b_i$. After performing k iterations (where k is the size of the key, i.e. the number of ratios) we obtain the bit sequence $\beta = (\beta_1, \dots, \beta_k)$, $\beta_i \in \{0, 1\}$, which is the ciphertext for s.

Decryption
Suppose there is a bit sequence $\beta = (\beta_1, \dots, \beta_k)$, $\beta_i \in \{0, 1\}$, which is the ciphertext for s, encrypted with some key K. Let us consider the i-th iteration of the algorithm.
As in the encryption algorithm, the current interval $[a_{i-1}, b_{i-1}]$ is split in the ratio $p_i : q_i$. Let it be split at the point $x \in [a_{i-1}, b_{i-1}]$, i.e.
$$x = a_{i-1} + \frac{p_i}{p_i + q_i}\,(b_{i-1} - a_{i-1}).$$
The next interval $[a_i, b_i]$ is the part indicated by the bit $\beta_i$.
After performing k iterations, we obtain the interval $[a_k, b_k]$, and the condition $f(b_k) - f(a_k) < \frac{1}{2^n}$ is satisfied according to the key selection. As $s \in [f(a_k), f(b_k)]$, s is uniquely decoded from this interval using $\lfloor x \rfloor$, the largest integer which comes before x.
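Encryption and decryption can be sketched together as below. Several details are assumptions of the sketch rather than statements of the paper: `f` is a simple toy stand-in for the noise function, bit 0 selects the left part when $s < f(x)$, the ratios follow a fixed made-up pattern, and, on the scale where $f(b_0) = 2^n$, the final image interval is required to be no longer than 1 so that it contains a single integer. Decoding takes the largest integer strictly below $f(b_k)$, which is one reading of the rule stated in the text.

```python
import math
from fractions import Fraction

N_BITS = 8

def f(x):
    """Toy stand-in for the noise function: increasing, f(0) = 0, f(1) = 2^n."""
    return 2**N_BITS * x * (x + 1) / 2

# f'(x) = 2^n * (x + 1/2) <= 1.5 * 2^n on [0, 1].
SLOPE_BOUND = Fraction(3 * 2**N_BITS, 2)

def make_ratios():
    """Repeat a made-up pattern until f(b_k) - f(a_k) <= 1 in the worst case."""
    pattern = [(1, 2), (3, 2), (2, 5), (1, 1)]
    ratios, worst = [], Fraction(1)
    while worst * SLOPE_BOUND > 1:
        p, q = pattern[len(ratios) % len(pattern)]
        ratios.append((p, q))
        worst *= Fraction(max(p, q), p + q)
    return ratios

def encrypt(s, ratios):
    a, b = Fraction(0), Fraction(1)
    bits = []
    for p, q in ratios:
        x = a + (b - a) * Fraction(p, p + q)   # split [a, b] in the ratio p:q
        if s < f(x):                           # keeps f(a) <= s < f(b)
            bits.append(0)
            b = x
        else:
            bits.append(1)
            a = x
    return bits

def decrypt(bits, ratios):
    a, b = Fraction(0), Fraction(1)
    for bit, (p, q) in zip(bits, ratios):
        x = a + (b - a) * Fraction(p, p + q)
        if bit == 0:
            b = x
        else:
            a = x
    return math.ceil(f(b)) - 1                 # largest integer strictly below f(b)

ratios = make_ratios()
ciphertexts = [encrypt(s, ratios) for s in range(2**N_BITS)]
```

Ciphertexts are compared lexicographically as bit sequences; since all of them have the same length k, this is the same as comparing them as k-bit integers.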

Application of the scheme for fixed-point arithmetic
It is easy to see that this scheme can be generalized to rational numbers in fixed-point representation. The encryption and decryption algorithms are the same except for the final operation: the length of the segment $[a_k, b_k]$ that determines the encrypted number is reduced $2^l$ times, where l is the number of binary digits after the point. The value l must be known at the key generation stage, and the condition from point 2 is tightened accordingly. After key generation the number l cannot be modified and is a part of the key. So the secret key K is now the set $[l, a_0, \dots, a_9, p_1, q_1, p_2, q_2, \dots, p_k, q_k]$.

Strictly increasing hash function
This algorithm can also be modified to produce a strictly increasing hash function. It can be used, for example, in an encrypted database that stores two entities for each data item: a ciphertext obtained from a cryptographically strong algorithm and a hash value returned by the hash function. This makes it possible both to be sure that the data will not be decrypted by an adversary (the first entity is secure and the second cannot be decrypted at all) and to apply comparison operations to the encrypted data to some extent.
To begin, we note that the size of the output in bits equals the number of ratios $p_i, q_i$ in the secret key. So, in order to obtain a hash function, it is enough to change the key generation procedure or, more precisely, its ratio generation part.
Instead of checking the condition from point 2, whose satisfaction guaranteed that the data could be decrypted, we now perform the first point (generation of a pair $p_i, q_i$) a fixed number of times. This number is equal to the number of bits that the hash function returns.
Thus, the key generation algorithm for an order-preserving m-bit hash function is:
1. Select a strictly increasing noise function f(x). To do this, generate $a_0, \dots, a_9$ so that $(a_0 + a_1 t + a_2 t^2)\,\bigl(a_3 + a_4 \sin(a_5 + a_6 t) + a_7 \cos(a_8 + a_9 t)\bigr) > 0$ for $\forall t \in (c; x_{max})$, where c is a fixed constant.
2. Generate a random set of ratios $p_1, q_1, p_2, q_2, \dots, p_m, q_m$.
3. The key is the set $K = [a_0, \dots, a_9, p_1, q_1, p_2, q_2, \dots, p_m, q_m]$.
To avoid processing big numbers, for instance when we need the hash of a large file, it is possible to split the input data into parts of acceptable size and calculate the hash of each part. The hash value of the whole file is then the concatenation of these hashes. This approach allows us to hash data of any predetermined size. So there are three parameters that we can select arbitrarily depending on our purpose: $s_1$, the size of the processed parts; $s_2$, the hash size for each of them ($s_2 < s_1$); and $s_3$, the maximum file size. The final hash therefore has $\frac{s_2 s_3}{s_1}$ bits. Since the encryption algorithm remains the same, the running time of the hash function depends linearly on its output size (which is equal to the number of algorithm iterations). Therefore, it is not recommended to choose too large a value of $s_2$.
In order to process files smaller than the maximum size, they can be padded with zeros on the left. In this case, order is still preserved. Since this is a hash function algorithm, decryption no longer exists.
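The hash variant can be sketched by running the same splitting loop with exactly m ratios and no decodability check. The function `f` is again a toy stand-in for the noise function, the six ratios are made up, and the name `op_hash` is hypothetical. Because m is smaller than the input size, distinct inputs can share a hash value, so the order check in this sketch is non-strict.

```python
from fractions import Fraction

N_BITS = 16                                   # size of the hashed integers
RATIOS = [(1, 2), (3, 2), (2, 5), (1, 1), (4, 3), (2, 3)]   # m = 6 made-up pairs

def f(x):
    """Toy stand-in for the noise function: increasing, f(0) = 0, f(1) = 2^n."""
    return 2**N_BITS * x * (x + 1) / 2

def op_hash(s):
    """Return an m-bit value h such that s1 < s2 implies op_hash(s1) <= op_hash(s2)."""
    a, b = Fraction(0), Fraction(1)
    h = 0
    for p, q in RATIOS:
        x = a + (b - a) * Fraction(p, p + q)
        if s < f(x):
            h = 2 * h                         # append bit 0, keep the left part
            b = x
        else:
            h = 2 * h + 1                     # append bit 1, keep the right part
            a = x
    return h

hashes = [op_hash(s) for s in range(0, 2**N_BITS, 257)]
```

For inputs larger than one part, the per-part hashes would be concatenated as described above; that step is omitted here.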

Encryption security
As we have seen in the Noise function section, OPE schemes cannot satisfy the standard notions of security against chosen-plaintext attack. Different methods of cryptanalysis are considered to determine the notion of order-preserving encryption security [2], [8], [9], [10]. Generally, the security of such schemes rests on the requirement that the monotonic function on which the scheme is based be completely indistinguishable from a truly random monotonic function. This means that only access to the private key allows accurate decryption of the data. So let us check this condition for the algorithm in practice. To do that, we encrypted all 16-bit numbers (from 0 to 65535) with the same random key and analyzed the results.
As the subject of analysis we chose the difference between the ciphertexts of adjacent integers. For example, if $f(x) = 2186003864819$ and $f(x + 1) = 2186004033407$, where f(x) is the encryption function, then $f(x + 1) - f(x) = 168588$ is considered. One of the reasons for this choice is that the success of a chosen-plaintext attack by interpolation depends on these differences (see Fig. 2). As a result, we obtained the following data (see Fig. 3). In this chart the Y-axis displays the value of the difference between two ciphertexts (higher values were rounded), and the X-axis shows the number of times that value was found. As we see, this chart resembles the hyperbola $y = \frac{1}{x}$. This is typical for monotonic functions that were generated randomly and suggests that the maximum available security of the algorithm was achieved.
But the distribution of the differences itself is also important (see Fig. 4). The Y-axis displays $f(x + 1) - f(x)$, while the X-axis shows x (from 0 to 65535). We can see that the differences are distributed very irregularly. As this is a feature of secure encryption, it supports the claim that the proposed algorithm is cryptographically strong.
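The analysis procedure itself is easy to reproduce. The sketch below computes the differences between adjacent ciphertexts and counts how often each value occurs. The ciphertexts are produced by a made-up random increasing sequence, not by the proposed scheme, so the resulting histogram only illustrates the bookkeeping and says nothing about the figures in the paper.

```python
import random
from collections import Counter

def adjacent_differences(ciphertexts):
    """Return f(x + 1) - f(x) for every pair of adjacent ciphertexts."""
    return [high - low for low, high in zip(ciphertexts, ciphertexts[1:])]

# Worked example from the text.
example_difference = adjacent_differences([2186003864819, 2186004033407])[0]

# Toy stand-in for "encrypt all 16-bit numbers with the same key":
# a strictly increasing sequence built from random positive increments.
rng = random.Random(0)
ciphertexts = []
current = 0
for _ in range(2**16):
    current += rng.randint(1, 1000)
    ciphertexts.append(current)

differences = adjacent_differences(ciphertexts)   # data for the Fig. 4 style plot
histogram = Counter(differences)                  # data for the Fig. 3 style plot
```

Plotting `histogram` (difference value against its count) and `differences` (difference against x) gives charts of the same kind as those discussed above.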