Hardware implementation of a Montgomery modular multiplier in a systolic array

Modular arithmetic is widely used in many applications, especially in public-key cryptography. These operations are very time-consuming because of their long operands, and many methods have been introduced to improve their performance. Montgomery modular multiplication is one such method for speeding up modular multiplication and modular exponentiation. Its radix-2 version is simple and fast in hardware, but its implementation requires multi-operand adders. So far, the carry-save adder (CSA) has given the best performance for multi-operand addition. In this paper, we propose a new recoding method for the Montgomery modular multiplier to enhance its performance. This is done by replacing the CSA blocks with new blocks that outperform CSA in multi-operand addition. With this replacement, we can theoretically achieve up to a 40% reduction in gate area. In our experiments, a hardware implementation achieved a 5.8% area reduction and a 3% speed improvement. The idea behind the proposed method is the use of a bitwise subtraction operator, which requires no carry propagation. This operand recoding method can also be applied in many areas of computer arithmetic, algorithms, and computational hardware, such as multiplication and exponentiation, to enhance their performance.
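For reference, the CSA baseline this abstract improves on can be sketched briefly: textbook radix-2 Montgomery multiplication with the running sum held in carry-save form, so each iteration uses only 3:2 carry-save additions and a single carry-propagate addition happens at the end. This is a minimal Python sketch under the usual assumptions (odd modulus m < 2^n, operands already reduced below m). It is not the authors' proposed recoding, which the abstract does not specify, and the function names are illustrative.

```python
def csa(a: int, b: int, c: int):
    """3:2 carry-save adder on non-negative ints: a + b + c == s + cy."""
    s = a ^ b ^ c                            # per-bit sum, no carry propagation
    cy = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, moved one position up
    return s, cy


def montgomery_radix2_csa(x: int, y: int, m: int, n: int) -> int:
    """Return x * y * 2**(-n) mod m for odd m < 2**n and 0 <= x, y < m.

    The running sum S is kept redundantly as (s, c) with S == s + c,
    so every iteration uses carry-save additions only.
    """
    assert m % 2 == 1 and m < (1 << n) and 0 <= x < m and 0 <= y < m
    s, c = 0, 0
    for i in range(n):
        # S += x_i * y
        s, c = csa(s, c, y if (x >> i) & 1 else 0)
        # c is even after csa, so the parity of S is the low bit of s
        q = s & 1
        # S += q * m makes S even (m is odd)
        s, c = csa(s, c, m if q else 0)
        # S /= 2: s and c are both even here, so each shifts exactly
        s, c = s >> 1, c >> 1
    # one carry-propagate addition; S < 2m, so one conditional subtraction suffices
    result = s + c
    return result - m if result >= m else result
```

The division by two stays exact because adding q·m makes the sum even and the carry vector is always even right after a CSA step. In hardware, each loop iteration corresponds to one row of carry-save adders, which is the part the abstract proposes to replace.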


High-Speed and Unified ECC Processor for Generic Weierstrass Curves over GF(p) on FPGA

Sensors, 2021, Vol 21 (4), pp. 1451, DOI: 10.3390/s21041451

Author(s): Asep Muhamad Awaludin, Harashta Tatimma Larasati, Howon Kim

Keyword(s): Elliptic Curve, Digital Signal Processor, Execution Time, High Speed, Hardware Implementation, Digital Signal, Low Complexity, Montgomery Ladder, Modular Multiplier, Pipelined Multiplier

In this paper, we present a high-speed, unified elliptic curve cryptography (ECC) processor for arbitrary Weierstrass curves over GF(p) which, to the best of our knowledge, outperforms similar works in terms of execution time. Our approach combines schoolbook long multiplication and Karatsuba multiplication for elliptic curve point multiplication (ECPM) to achieve better parallelization while retaining low complexity. In the hardware implementation, a substantial part of the speed gain comes from our n-bit pipelined Montgomery modular multiplier (pMMM), which is built from our n-bit pipelined multiplier-accumulators that use digital signal processor (DSP) primitives as digit multipliers. We also introduce a unified, pipelined modular adder/subtractor (pMAS) for the underlying field arithmetic, and use a more efficient yet compact scheduling of the Montgomery ladder algorithm. The implementation for a 256-bit modulus on the 7-series FPGAs (Virtex-7, Kintex-7, and XC7Z020) yields execution times of 0.139, 0.138, and 0.206 ms, respectively. Furthermore, since our pMMM module is generic for any curve in Weierstrass form, we support multiple curve parameters, resulting in a unified ECC architecture. Lastly, our method runs in constant time, making it suitable for applications that require high speed and resistance to side-channel attacks (SCA).
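The abstract does not say exactly how the two multiplication algorithms are combined. One common arrangement, shown here purely as an assumption, applies a single Karatsuba split at the top level (three half-size products instead of four) and computes each half-size product with schoolbook long multiplication over fixed-width digits, the kind of digit products a DSP primitive would perform. `DIGIT_BITS` and both function names are hypothetical, and this Python sketch models only the arithmetic, not the authors' pipelining or scheduling.

```python
DIGIT_BITS = 16  # hypothetical digit width for the schoolbook stage


def schoolbook(a: int, b: int, n_digits: int) -> int:
    """Schoolbook long multiplication of two operands of n_digits digits each."""
    mask = (1 << DIGIT_BITS) - 1
    a_d = [(a >> (DIGIT_BITS * i)) & mask for i in range(n_digits)]
    b_d = [(b >> (DIGIT_BITS * j)) & mask for j in range(n_digits)]
    acc = 0
    for i, ai in enumerate(a_d):
        for j, bj in enumerate(b_d):
            # one digit-by-digit product, accumulated at its weight
            acc += (ai * bj) << (DIGIT_BITS * (i + j))
    return acc


def karatsuba_one_level(a: int, b: int, n_bits: int) -> int:
    """Multiply n_bits-bit operands with one Karatsuba split over schoolbook."""
    assert 0 <= a < (1 << n_bits) and 0 <= b < (1 << n_bits)
    h = n_bits // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    # high halves have n_bits - h bits; the sums a0 + a1 and b0 + b1 need one more
    n_digits = -(-(n_bits - h + 1) // DIGIT_BITS)
    z0 = schoolbook(a0, b0, n_digits)
    z2 = schoolbook(a1, b1, n_digits)
    # Karatsuba: the cross term a0*b1 + a1*b0 costs one product instead of two
    z1 = schoolbook(a0 + a1, b0 + b1, n_digits) - z0 - z2
    return (z2 << (2 * h)) + (z1 << h) + z0
```

With 256-bit operands and 16-bit digits, this sketch sizes all three products for the wider middle operands (9 digits), giving 3 × 81 = 243 digit products versus 16 × 16 = 256 for plain schoolbook; the saving grows if the split is applied recursively or the outer products are sized more tightly.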


High-Speed and Unified ECC Processor for Generic Weierstrass Curves over GF(p) on FPGA

2021, DOI: 10.20944/preprints202101.0250.v1

Author(s): Asep Muhamad Awaludin, Harashta Tatimma Larasati, Howon Kim

Keyword(s): Elliptic Curve, Elliptic Curve Cryptography, Execution Time, High Speed, Hardware Implementation, Low Complexity, Multiplication Algorithm, Montgomery Ladder, Modular Multiplier, Pipelined Multiplier

In this paper, we present a high-speed, unified elliptic curve cryptography (ECC) processor for arbitrary Weierstrass curves over GF(p) which, to the best of our knowledge, outperforms similar works in terms of execution time. Our approach combines schoolbook long multiplication and Karatsuba multiplication for elliptic curve point multiplication (ECPM) to achieve better parallelization while retaining low complexity. In the hardware implementation, a substantial part of the speed gain comes from our n-bit pipelined Montgomery modular multiplier (pMMM), which is built from our n-bit pipelined multiplier-accumulators that use DSP primitives as digit multipliers. We also introduce a unified, pipelined modular adder/subtractor (pMAS) for the underlying field arithmetic, and use a more efficient yet compact scheduling of the Montgomery ladder algorithm. The implementation on the 7-series FPGAs (Virtex-7, Kintex-7, and XC7Z020) yields execution times of 0.139, 0.138, and 0.206 ms, respectively. Furthermore, since our pMMM module is generic for any curve in Weierstrass form, we support multiple curve parameters, resulting in a unified ECC architecture. Lastly, our method runs in constant time, making it suitable for applications that require high speed and resistance to side-channel attacks (SCA).
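The Montgomery ladder mentioned in this abstract performs one point addition and one point doubling for every scalar bit regardless of the bit's value, which is what makes a constant-time schedule possible. The following Python sketch shows the ladder on a short Weierstrass curve y^2 = x^3 + ax + b over GF(p) using plain affine formulas. The names `INF`, `ec_add`, `montgomery_ladder`, the affine representation, and the fixed bit length `n_bits` are illustrative assumptions. It is not the authors' hardware schedule and, being Python with data-dependent branches inside `ec_add`, it is not itself constant-time.

```python
INF = None  # point at infinity


def ec_add(P, Q, a, p):
    """Add affine points on y^2 = x^3 + a*x + b over GF(p), p an odd prime.

    The coefficient b is not needed by the addition formulas.
    """
    if P is INF:
        return Q
    if Q is INF:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return INF  # Q == -P, including doubling a point with y == 0
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p  # tangent slope
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p  # chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


def montgomery_ladder(k, P, a, p, n_bits):
    """Compute k*P, scanning a fixed number n_bits of scalar bits from the top."""
    assert 0 <= k < (1 << n_bits)
    R0, R1 = INF, P  # invariant: R1 == R0 + P
    for i in reversed(range(n_bits)):
        if (k >> i) & 1:
            R0, R1 = ec_add(R0, R1, a, p), ec_add(R1, R1, a, p)
        else:
            R0, R1 = ec_add(R0, R0, a, p), ec_add(R0, R1, a, p)
    return R0
```

As a usage example, the toy curve with p = 97, a = 2, b = 3 contains the point (3, 6), since 6^2 = 36 = 3^3 + 2·3 + 3; `montgomery_ladder(5, (3, 6), 2, 97, 8)` then returns 5·(3, 6). Fixing `n_bits` rather than using `k.bit_length()` keeps the number of ladder steps independent of the scalar.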


An Efficient Hardware Implementation of Smith-Waterman Algorithm Based on the Incremental Approach

International Journal of Electronics and Telecommunications, 2011, Vol 57 (4), pp. 489-496, DOI: 10.2478/v10177-011-0069-9


Author(s): Andrzej Pułka, Adam Milik

Keyword(s): Dynamic Programming, Systolic Array, Hardware Implementation, Main Idea, Evaluation Process, Processing Unit, Genome Alignment, Trade Off, Incremental Approach, Resource Requirements

The paper presents an optimized hardware structure for genome alignment search. The proposed methodology is based on dynamic programming. The authors show how, starting from the original Smith-Waterman approach, the algorithm can be optimized and the evaluation process simplified and sped up. The main idea is based on observing growth trends in adjacent cells of the systolic array, which leads to the incremental approach. Moreover, various coding styles are discussed, and the technique that allows a further reduction of resources is selected. The entire processing unit uses a fully pipelined structure that offers a well-balanced trade-off between performance and resource requirements. The proposed technique is implemented in modern FPGA structures, and the results obtained demonstrate the efficiency of the methodology compared with other approaches in the field.
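The work above starts from the standard Smith-Waterman recurrence, in which cell H[i][j] depends only on its left, upper, and upper-left neighbors. All cells on one anti-diagonal (constant i + j) are therefore independent, and a linear systolic array can evaluate one anti-diagonal per step. The Python sketch below shows only that textbook recurrence with a linear gap penalty, filled in anti-diagonal order; the scoring parameters and function names are arbitrary assumptions, and the authors' incremental encoding of neighbor differences is not reproduced here.

```python
def sw_scores(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = 1):
    """Smith-Waterman score matrix with a linear gap penalty,
    filled one anti-diagonal (constant i + j) at a time."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for d in range(2, n + m + 1):  # d = i + j selects the anti-diagonal
        # every cell on this anti-diagonal only needs diagonals d-1 and d-2,
        # so a systolic array could compute them all in parallel
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # match or mismatch
                          H[i - 1][j] - gap,    # gap in b
                          H[i][j - 1] - gap)    # gap in a
    return H


def best_local_score(a: str, b: str, **scoring) -> int:
    """Best local-alignment score: the maximum entry of the matrix."""
    return max(max(row) for row in sw_scores(a, b, **scoring))
```

For example, with the default scores `best_local_score("ACGT", "ACT")` returns 5, from aligning ACGT against AC-T (2 + 2 - 1 + 2).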
