High-performance Elliptic Curve Cryptography by Using the CIOS Method for Modular Multiplication

Elliptic Curve Cryptography (ECC) is becoming unavoidable and should be used for public key protocols. It has gained increasing acceptance in practice due to the significantly smaller bit size of the operands compared to RSA for the same security level. Most protocols based on ECC imply the computation of a scalar multiplication. ECC can be performed in affine, projective, Jacobian or other coordinate models. The arithmetic in a finite field constitutes the core of ECC Public Key Cryptography. This paper discusses an efficient hardware implementation of scalar multiplication in Jacobian coordinates by using the Coarsely Integrated Operand Scanning (CIOS) method of Montgomery Modular Multiplication (MMM) combined with an effective systolic architecture designed with a two-dimensional array of Processing Elements (PE). As far as we know, this is the first implementation of such a design for large prime fields. The proposed architectures are designed for Field Programmable Gate Array (FPGA) platforms. The objective is to reduce the number of clock cycles of the modular multiplication, which implies a good performance for ECC. The presented implementation results focus on various security levels useful for cryptography. This architecture has been designed in order to use the flexible DSP48 on Xilinx FPGAs. Our architecture for MMM is scalable and depends only on the number and size of words.


Introduction
The search for the most optimised architecture for arithmetic has always fascinated the embedded system world. In recent years this has been especially the case in finite fields for cyber security, due to the invention of asymmetric encryption systems based on modular arithmetic operations. Throughout the history of cryptography for embedded systems, there has been a need for efficient architectures for these operations. The implementations must be cost-effective, both in terms of area and latency. Finite field arithmetic is the most important primitive of ECC, pairing and RSA. Since 1976, many Public Key Cryptosystems (PKC) have been proposed, and all these cryptosystems base their security on the difficulty of some mathematical problem. The hardness of this underlying mathematical problem is essential for security. Elliptic Curve Cryptosystems, which were proposed by Koblitz [9] and Miller [15], RSA [21] and Pairing-Based Cryptography [8] are examples of PKCs. All these systems require an efficient finite field multiplication. As a consequence, the development of efficient architectures for modular multiplication has been a very popular subject of research.

In 1985, Montgomery presented a new method for modular multiplication [17]. It is one of the most suitable algorithms for performing modular multiplications in hardware and software implementations. The efficient implementation of the Montgomery Modular Multiplication (MMM) in hardware was considered by many authors [3,6,7,18,19,23]. There is a variety of ways to perform the MMM, depending on whether multiplication and reduction are separated or integrated. A systolic array architecture [12,24] is one possibility for the implementation of the Montgomery algorithm in hardware, with a design both parallel and pipelined [3,18,19,20,23]. A similar work [20] has been done for binary fields (fields of characteristic 2), which avoids having to deal with carry propagation. These architectures use a Processing Element (PE) array where each Processing Element performs arithmetic additions and multiplications. Depending on the number of words used, the architecture can employ a variable number of PEs. The systolic architecture uses very simple Processing Elements (as in a pipeline). As a consequence, the systolic architecture decreases the need for logic elements in hardware implementations.

Our contribution is to combine a systolic architecture, which is assumed to be the best choice for hardware implementations, with the CIOS method of Montgomery modular multiplication. We optimize the number of clock cycles required to compute an n-bit MMM and we reduce the utilization of FPGA resources. We have implemented the modular multiplication in a fixed number of clock cycles. To the best of our knowledge, this is the first time that a hardware or software Montgomery modular multiplier, suitable for various security levels, performs the multiplication in just 33 clock cycles. Furthermore, as far as we know, our work is the first one dealing with a systolic architecture and the CIOS method over large prime characteristic finite fields. Using our efficient MMM hardware implementation, we propose an efficient design for ECC operations: point addition and doubling.

This paper is organized as follows: Section 2 discusses related state-of-the-art works. Section 3 presents the CIOS method of the Montgomery Modular Multiplication algorithm. The proposed architectures and results are presented in Section 4 and Section 5. Finally, the conclusion is drawn in Section 6.

Elliptic Curve Cryptography
The use of elliptic curves in cryptography was independently introduced by Victor S. Miller [16] and N. Koblitz [10] during the 1980s. The main advantage of ECC is that the key sizes are smaller for the same security level when compared with the classical RSA algorithm [22]. For instance, at the AES 128-bit security level the key for ECC is 256 bits, while RSA requires a 3072-bit key. The main operation in ECC is the scalar multiplication over the elliptic curve. This scalar multiplication consists in computing α × P, for α an integer and P a point of an elliptic curve. When such an operation is implemented on an embedded system such as an FPGA, it is subject to constraints of area and speed. Efficient scalar multiplication arithmetic is hence a central issue for cryptography. The interested reader is referred to [2] for a good overview of the question.
ECC preliminaries
Let $E$ be an elliptic curve defined over $\mathbb{F}_p$ with $p > 3$ according to the following short Weierstrass equation:

\[ E : y^2 = x^3 + ax + b \qquad (1) \]

where $a, b \in \mathbb{F}_p$ such that $4a^3 + 27b^2 \neq 0$. The elliptic curve $E(\mathbb{F}_p)$ is the set of points $(x, y) \in \mathbb{F}_p^2$ whose coordinates satisfy Equation 1. The rational points of $E$, augmented with a neutral element $O$ called the point at infinity, have an abelian group structure. The associated addition law computes the sum of two points in affine coordinates $P = (x_1, y_1)$ and $Q = (x_2, y_2)$ as $P + Q = (x_3, y_3)$ where:

\[ x_3 = \lambda^2 - x_1 - x_2, \qquad y_3 = \lambda (x_1 - x_3) - y_1, \qquad \lambda = \begin{cases} \dfrac{y_2 - y_1}{x_2 - x_1} & \text{if } P \neq Q, \\[2mm] \dfrac{3x_1^2 + a}{2y_1} & \text{if } P = Q. \end{cases} \qquad (2) \]

The scalar multiplication of a point $P$ by a natural integer $\alpha$ is denoted $\alpha P$. The discrete logarithm problem consists in finding the value of $\alpha$, given $P$ and $\alpha P$. The security of ECC is based on the hardness of the discrete logarithm problem. Point addition formulae such as in Equation 2 are based on several operations over $\mathbb{F}_p$ (e.g. multiplication, inversion, addition, and subtraction) which have different computational costs.
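A minimal Python sketch of the affine group law and of a double-and-add scalar multiplication may help fix ideas. It uses the textbook chord-and-tangent formulas for a short Weierstrass curve. The toy curve y^2 = x^3 + 2x + 2 over F_17, the base point (5, 1) and the names `add` and `scalar_mult` are assumptions chosen for illustration and do not come from this paper.

```python
# Toy curve y^2 = x^3 + 2x + 2 over F_17 (illustrative parameters only).
P_MOD, A, B = 17, 2, 2
INFINITY = None  # neutral element O (point at infinity)


def add(P, Q):
    """Affine point addition with the chord-and-tangent formulas."""
    if P is INFINITY:
        return Q
    if Q is INFINITY:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INFINITY  # P + (-P) = O, also covers doubling a point with y = 0
    if P == Q:
        # Tangent slope; the field inversion is the costly operation here.
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        # Chord slope.
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mult(alpha, P):
    """Left-to-right double-and-add computing alpha * P."""
    result = INFINITY
    for bit in bin(alpha)[2:]:
        result = add(result, result)   # doubling at every bit
        if bit == "1":
            result = add(result, P)    # addition only for set bits
    return result


G = (5, 1)
print(scalar_mult(2, G), scalar_mult(3, G))
```

Each affine addition or doubling needs one inversion in F_p, which is the motivation for inversion-free coordinate systems such as the Jacobian one.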

ECC in Jacobian coordinates
In Jacobian coordinates, we use $(x : y : z)$ to represent the affine point $(x/z^2, y/z^3)$. The elliptic curve equation becomes:

\[ y^2 = x^3 + a x z^4 + b z^6 \]

Doubling step: we represent the point $Q$ as $(X_Q, Y_Q, Z_Q)$. The double $T = 2Q = (X_T, Y_T, Z_T)$ is computed directly on these coordinates, without any inversion in $\mathbb{F}_p$.

Algorithm 1: Scalar Multiplication

The Coarsely Integrated Operand Scanning (CIOS) method, presented in Algorithm 2, improves the Montgomery algorithm by integrating the multiplication and the reduction. More specifically, instead of computing the product a · b and then reducing the result, this method alternates between iterations of the outer loops for multiplication and reduction. The integers p, a and b are seen as lists of s words of size w. This algorithm requires an array T of only s + 2 words to store the intermediate state. The final result of the CIOS algorithm consists of the s + 1 least significant words of this array at the end.
Algorithm 2: CIOS algorithm for Montgomery multiplication [11]

The alternation between multiplications and reductions is possible since the value of m (in line 11 of Algorithm 2) in the i-th iteration of the outer loop for reduction depends only on the value T[0], which is computed by the first iteration of the corresponding inner loop. In order to perform the multiplication, we have modified the CIOS algorithm of [11] and designed this method with a systolic architecture. Indeed, instead of using an array to store the intermediate result, we replace T by input and output signals for each Processing Element. As a consequence, our design uses fewer multiplexers, which in turn decreases the number of slices it occupies.
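As a software reference point, the word-level CIOS loop structure can be sketched in Python as follows. This follows the usual formulation of CIOS in the literature and does not reproduce the line numbering of Algorithm 2. The function names, the little-endian word order and the final conditional subtraction are assumptions of this sketch, and it stores T in an array rather than in the signals used by the systolic design.

```python
def to_words(x, s, w):
    """Split x into s little-endian words of w bits."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(s)]


def cios_montgomery(a, b, p, s, w):
    """Return a * b * 2^(-s*w) mod p, for odd p < 2^(s*w) and a, b < p."""
    W = 1 << w
    mask = W - 1
    aw, bw, pw = to_words(a, s, w), to_words(b, s, w), to_words(p, s, w)
    p_inv = (-pow(p, -1, W)) % W          # p' = -p^(-1) mod 2^w
    T = [0] * (s + 2)                     # intermediate state, s + 2 words
    for i in range(s):
        # Multiplication step: T <- T + a * b[i]
        C = 0
        for j in range(s):
            t = T[j] + aw[j] * bw[i] + C
            T[j], C = t & mask, t >> w
        t = T[s] + C
        T[s], T[s + 1] = t & mask, t >> w
        # Reduction step: m depends only on T[0]
        m = (T[0] * p_inv) & mask
        C = (T[0] + m * pw[0]) >> w       # low word is zero by construction
        for j in range(1, s):
            t = T[j] + m * pw[j] + C
            T[j - 1], C = t & mask, t >> w   # shift right by one word
        t = T[s] + C
        T[s - 1], C = t & mask, t >> w
        T[s] = T[s + 1] + C
    # The result is held in the s + 1 least significant words of T.
    result = sum(T[j] << (w * j) for j in range(s + 1))
    return result - p if result >= p else result
```

With s = 16 and w = 16 this handles 256-bit operands. The same function with s = 8 and w = 32 returns the same value, since the Montgomery constant 2^(s·w) is identical in both cases.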

Block DSP in Xilinx FPGAs
Modern FPGA devices like Xilinx Virtex-4, Virtex-5 and Artix-7, as well as Altera Stratix FPGAs, have been equipped with hard-core arithmetic extensions to accelerate digital signal processing applications. These DSP blocks can be used to build a more efficient implementation in terms of performance while reducing the demand for area. DSP blocks can be programmed to perform basic arithmetic functions: multiplication, addition and subtraction of unsigned integers. Figure 1 shows the generic DSP structure in advanced FPGAs. The DSP can operate on the external inputs A, B and C as well as on feedback values from P or the result PCIN.

Proposed Architecture
The idea of our design is to combine the CIOS method for the MMM presented in [11] with a two-dimensional systolic architecture modelled on [13,20,24]. As seen in Section 3, the CIOS method alternates between iterations of the loops for multiplication and reduction. The concept of the two-dimensional systolic architecture presented in Section 2 combines Processing Elements with local connections, which take external inputs and handle them in a predetermined manner in a pipelined fashion. This new architecture is directly based on the arithmetic operations of the CIOS method of the Montgomery algorithm. The arithmetic is performed in radix 2^w. The input operands are processed in s words of w bits. We present several versions of this method. We illustrate our design with the s = 16 architecture, denoted NW-16 (for Number of Words). We describe it in detail, as well as the behaviour of the various Processing Elements. In order to have fewer states in our Finite State Machine (FSM), we divided Algorithm 2 into five kinds of PEs (α, β, γ, α_f, γ_f). The efficiency of our architecture comes from the fact that the data dependencies in the CIOS algorithm allow several operations to be performed in parallel. Figure 2 presents the dependencies between the different cells. Below we describe each cell precisely. MSW stands for the Most Significant Word and LSW for the Least Significant Word. In our notation, the letter C denotes the MSW of the result and the letter S the LSW.
α cell
The α cell corresponds to lines 5 and 6 of Algorithm 2. Its operations are described in Algorithm 3. Notice how registers are embedded into the cell, which avoids the use of an external memory. This cell corresponds to one iteration of the first inner loop. As such, it must be used several times in a row (once per iteration of the inner loop). This chain of α cells is terminated by an α_f cell (alpha final) corresponding to lines 7, 8 and 9 of Algorithm 2. Once the first α cell (for each outer loop) is computed, data is available for the β cell (which requires T[0]). It is preferable to consume this data as soon as it is available so as to minimize memory usage.
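The behaviour of the α chain can be sketched in Python as below. This is a purely functional model under assumptions: the word size of 16 bits, the function names and the one-cell-per-iteration chaining are illustrative, and the sketch does not model the registers, the reuse of physical cells or the clock-level scheduling of the hardware.

```python
W_BITS = 16                      # word size w (illustrative choice)
MASK = (1 << W_BITS) - 1


def alpha_cell(t_in, a_j, b_i, c_in):
    """One iteration of the first inner loop: (C, S) = T[j] + a[j]*b[i] + C."""
    t = t_in + a_j * b_i + c_in
    return t >> W_BITS, t & MASK          # C is the MSW, S is the LSW


def alpha_f_cell(s_in, c_in):
    """End of the first inner loop: a single addition, (C, S) = T[s] + C."""
    t = s_in + c_in
    return t >> W_BITS, t & MASK


def first_inner_loop(T, a_words, b_i):
    """Chain s alpha cells and one alpha_f cell; T holds s + 2 words."""
    s = len(a_words)
    out, carry = [0] * (s + 2), 0
    for j in range(s):
        carry, out[j] = alpha_cell(T[j], a_words[j], b_i, carry)
    out[s + 1], out[s] = alpha_f_cell(T[s], carry)
    return out
```

The first output word `out[0]` is available after a single `alpha_cell` call, which is why the β cell can start before the rest of the chain has finished.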
Once m is available from the β cell, the second inner loop can be computed by the chain of γ cells.
γ cell
The γ cell corresponds to one iteration of the second inner loop. As such, these cells must be chained so as to complete the whole second inner loop. Each cell in this chain requires the same m value, which changes at each outer loop iteration (m is computed by the β cell). Since a new m value may be computed before the end of the γ chain, the value must be propagated along the cells of the same chain. Its operations are described in Algorithm 6.

Algorithm 6: γ cell
γ_f cell
Finally, the γ_f cell terminates each γ cell chain. It consists of two additions, as shown in Algorithm 7. The difference with the α_f cell is that in this case both output values S and C are used in the rest of the computation: C is used by the α_f cell and S by an α cell.

Algorithm 7: γ_f cell
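The reduction half of one outer iteration, as carried out by the β, γ and γ_f cells, can be modelled in the same spirit. The sketch below is a hedged behavioural model: the cell bodies are inferred from the standard CIOS reduction loop rather than copied from Algorithms 5 to 7, and the word size and all names are assumptions.

```python
W_BITS = 16                      # word size w (illustrative choice)
MASK = (1 << W_BITS) - 1


def beta_cell(t0, p0, p_inv):
    """Compute m from T[0], plus the first special iteration of the loop."""
    m = (t0 * p_inv) & MASK
    c = (t0 + m * p0) >> W_BITS           # low word is zero by construction
    return m, c


def gamma_cell(t_j, p_j, m, c_in):
    """One iteration of the second inner loop; m is passed down the chain."""
    t = t_j + m * p_j + c_in
    return t >> W_BITS, t & MASK, m       # (C, S, propagated m)


def gamma_f_cell(t_s, t_s1, c_in):
    """Two additions closing the chain; both C and S are used afterwards."""
    t = t_s + c_in
    return t_s1 + (t >> W_BITS), t & MASK  # (C = new T[s], S = new T[s-1])


def reduction_step(T, p_words, p_inv):
    """Return (T + m*p) / 2^w as s + 2 words, the top word being cleared."""
    s = len(p_words)
    out = [0] * (s + 2)
    m, c = beta_cell(T[0], p_words[0], p_inv)
    for j in range(1, s):
        c, out[j - 1], m = gamma_cell(T[j], p_words[j], m, c)
    out[s], out[s - 1] = gamma_f_cell(T[s], T[s + 1], c)
    return out
```

Here `p_inv` stands for -p^(-1) mod 2^w, which in Python can be obtained as `(-pow(p, -1, 1 << W_BITS)) % (1 << W_BITS)` for an odd modulus p.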

Our architectures
Firstly, we start with the NW-16 architecture, which contains 6 PEs of type α and 6 of type γ. An MMM can be performed with this architecture in 66 clock cycles. Similarly, implementing the NW-32 and NW-64 architectures requires doubling the number of cells each time. We provide a comparison of our architectures at the end of this section.

NW-16 Architecture
In this architecture, the operands and the modulus are divided into 16 words. The NW-16 architecture is designed in the same way as the NW-32 and NW-64. This example illustrates the scalability of our design. The NW-16 architecture is composed of 15 Processing Elements distributed in a two-dimensional array, where each Processing Element is responsible for the computations involving w-bit words of the input operands.
The 15 Processing Elements are divided as follows: 6 α cells, 1 α_f cell, 1 β cell, 6 γ cells and 1 γ_f cell. As said previously, the number of PEs of the other types (α_f, β, γ_f) remains unchanged whatever the number of words in the design. In order to evaluate the number of clock cycles of the NW-16 architecture, the first parameter is 6 = max{number of α cells, number of γ cells}, implying that our algorithm requires s + 6 loop iterations. We can perform the multiplication with our design in 66 clock cycles since our design requires three states (66 = 3 × (s + 6) with s = 16). The different results of this architecture for bit-length 256 are given in Table 1.
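The cycle count above is simple enough to tabulate. The sketch below evaluates 3 × (s + nb) together with the word size w = k / s for a k-bit operand. The relation k = s · w is inferred from the statement that operands are processed in s words of w bits, and the function names are hypothetical.

```python
def mmm_clock_cycles(s, nb, states=3):
    """Cycle count states * (s + nb), with nb = max(#alpha cells, #gamma cells)."""
    return states * (s + nb)


def word_size(k, s):
    """Word size w in bits when a k-bit operand is split into s words."""
    if k % s:
        raise ValueError("k must be a multiple of s")
    return k // s


# NW-16: s = 16 words, 6 alpha cells and 6 gamma cells.
for k in (256, 512, 1024):
    print(k, word_size(k, 16), mmm_clock_cycles(s=16, nb=6))
```

For NW-16 the count stays at 66 for every bit length because only w changes, while s and nb are fixed by the architecture.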

Architectures comparison
Table 2 shows a comparison between the different architectures. The number of clock cycles for every architecture is equal to 3 × (s + nb), where nb = max{number of α cells, number of γ cells}. To check the correctness of the hardware results, we compare the results given by the FPGA with a Sage software implementation.

Experimental Results
Table 3 summarizes the FPGA post-implementation results for the proposed versions of MMM architectures. We present results for the NW-16 architecture. The designs were described in a hardware description language (VHDL) and synthesized for Artix-7 and Virtex-5 Xilinx FPGAs. The results after implementation for each bit length k are given in Table 3. As shown in Table 3, an interesting property of our design is that the number of clock cycles is independent of the bit length. This property makes our design suitable for different security levels. In order to implement the Montgomery modular multiplication for a fixed security level, we must choose the most suitable architecture. The results presented in this work are compared with the previous works [4,5,18,19] in Table 3. We can notice that our results are better than [19] on every point of comparison, i.e. the number of slices and the number of clock cycles. Considering the number of slices, we recall that [19] used an external memory to optimize the number of slices used by their algorithms. Compared with [18], our design requires fewer slices and can run at a better frequency, even without considering the huge progress in the number of clock cycles. Our design performs the MMM in 66 clock cycles for the 512 and 1024 bit lengths corresponding to the AES-256 and AES-512 security levels, while [18] performs the multiplication in 1540 clock cycles for the AES-256 security level and 3076 for the AES-512 security level.

Conclusion
In this paper we have presented an efficient hardware implementation of the CIOS method of MMM over large prime characteristic finite fields F_p. We give the results of our design after placement and routing on Artix-7 and Virtex-5 Xilinx FPGAs. Our systolic implementation is suitable for every application involving a modular multiplication, for example RSA, ECC and pairing-based cryptography. Our architectures and designs were matched with the features of the FPGAs. The NW-16 design presented a good performance considering latency × area efficiency. This architecture can run for all the bit lengths corresponding to classical security levels (128, 256, 512 or 1024 bits) in just 66 clock cycles. Our systolic design using the CIOS method is scalable to any other number of words. We then showed that, using this multiplier, it is possible to achieve an efficient scalar multiplication.

Figure 1: Structure of DSP block in modern FPGA device.

Figure 2: Data dependency in general systolic architecture.

α_f cell
The α_f cell corresponds to the end of the first inner loop. It is just an addition, as shown in Algorithm 4.

Algorithm 4: α_f cell
Input: CIn (= C), SIn (= T[s])
Output: COut (= T[s + 1]), SOut (= T[s])
  t1 ← SIn + CIn
  COut ← MSW(t1)
  SOut ← LSW(t1)
  return COut, SOut

β cell
The β cell is used to compute the m value for each outer loop as well as the first special iteration of the second inner loop. Its operations are described in Algorithm 5. Once the β cell computation has been done, it becomes possible to compute the second inner loop with γ cells.

Algorithm 5: β cell

Figure 4: DFG for the point addition algorithm.

Figure 5: DFG for the point doubling algorithm.

Table 1: Implementations of cells and MMM (NW-16).

Our algorithm requires s + nb loop iterations. It is interesting to notice that all our architectures are scalable and can target the different security levels useful in cryptography.

Table 2: Comparison of our architectures

ECC implementation
ECC algorithms, when implemented in a sequential way, have the advantage that the number of finite field arithmetic modules can be reduced to a minimum. For example, only one adder, one multiplier and one subtraction unit (which can be an adder) are needed for point addition and doubling. Parallelization between multiplication, addition and subtraction was achieved. Figure 4 and Figure 5 give the dataflow graphs for point addition and doubling. In this design we use our efficient systolic architecture of MMM to perform squaring or multiplication. Table 4 summarizes the implementation results of the scalar multiplication. We present results with both the NW-8 and NW-16 architectures.
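To make the operation mix concrete, the following Python sketch doubles a point in Jacobian coordinates using only field multiplications, additions and subtractions, and counts them. It uses the textbook doubling formulas for a general coefficient a, which may differ from the formulas and scheduling actually used in this design. The `Field` class, `jacobian_double` and `to_affine` are hypothetical names, the toy curve y^2 = x^3 + 2x + 2 over F_17 is an illustrative assumption, and plain modular multiplication stands in for the MMM.

```python
class Field:
    """Counts the F_p operations a sequential ECC datapath would issue."""

    def __init__(self, p):
        self.p, self.mults, self.adds = p, 0, 0

    def mul(self, x, y):          # a squaring is a multiplication with x == y
        self.mults += 1
        return x * y % self.p

    def add(self, x, y):
        self.adds += 1
        return (x + y) % self.p

    def sub(self, x, y):          # a subtraction unit can be an adder
        self.adds += 1
        return (x - y) % self.p


def jacobian_double(F, a, Q):
    """Double Q = (X, Y, Z) on y^2 = x^3 + a*x*z^4 + b*z^6, without inversion."""
    X, Y, Z = Q
    two = lambda v: F.add(v, v)   # small constant multiples done with adders
    YY = F.mul(Y, Y)
    S = two(two(F.mul(X, YY)))                                # 4*X*Y^2
    XX = F.mul(X, X)
    ZZ = F.mul(Z, Z)
    M = F.add(F.add(two(XX), XX), F.mul(a, F.mul(ZZ, ZZ)))    # 3*X^2 + a*Z^4
    X3 = F.sub(F.mul(M, M), two(S))                           # M^2 - 2*S
    Y3 = F.sub(F.mul(M, F.sub(S, X3)), two(two(two(F.mul(YY, YY)))))
    Z3 = two(F.mul(Y, Z))                                     # 2*Y*Z
    return X3, Y3, Z3


def to_affine(p, Q):
    """Map (X, Y, Z) back to (X/Z^2, Y/Z^3); the only inversion needed."""
    X, Y, Z = Q
    z_inv = pow(Z, -1, p)
    return X * z_inv**2 % p, Y * z_inv**3 % p


F = Field(17)
T = jacobian_double(F, 2, (5, 1, 1))
print(to_affine(17, T), F.mults, F.adds)
```

In this formulation every multiplication is a candidate for the single shared multiplier, while the additions and subtractions can be overlapped with it, which is the kind of parallelism a dataflow graph makes explicit.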