1. Introduction
The need to supply energy to a large and growing number of loads, to cope with their dynamics, and to include renewable energy sources in the energy matrix has motivated worldwide academic and industry research efforts to put forward the Smart Grid (SG) [1]. The development of SG technologies and solutions has become a priority in the electricity sector because SG is a powerful concept to ensure energy reliability and efficiency, reduce technical and non-technical losses, enable the integration of different types of renewable energy sources [2], and fulfill the current and future needs and demands of all stakeholders and consumers.
In this sense, the use of the Internet of Things (IoT) in electric power systems constitutes a remarkable advance to quickly spread SG solutions worldwide and, consequently, introduces the so-called digital transformation in electric power systems [
3]. Also, the upgrade of existing technologies in electric power systems offers a new perspective for accelerating the implementation of SG concepts [
4,
5]. In this scenario, Smart Meters (SMs) have emerged as enabling technologies for SGs because they can monitor and control bidirectional electricity flows and generate consumer or prosumer data. These functionalities are essential to the stakeholders in the electricity sector because they go beyond the traditional unidirectional power billing functionality [
6]. It is recognized that SM data is of great value for allowing effective and dynamic energy planning to enhance electric power systems’ stability, efficiency, predictability, and reliability. Also, SM equipment allows utilities, consumers, and prosumers to access their data in near real-time.
In this scenario, security and privacy awareness have emerged to deal with security threats against the SM systems because unauthorized access to SM data can severely impact the operation of electric power systems and the life of consumers. For example, [
7] showed that an attacker could exploit a flaw to change electric power system data without being noticed. This is because key generation and distribution in a cryptographic scheme may result in a security breach when SM data is transmitted through data networks [
8,
9]. To deal with this scenario, research efforts have been focused on privacy of information, security issues and countermeasures, see [
10,
11,
12,
13,
14] and references therein. In this sense, the research effort to secure public-key encryption schemes against classical computing attacks is worth mentioning. This is particularly true for practical public-key cryptosystems based on the hardness of factoring integers or computing discrete logarithms (e.g., Rivest-Shamir-Adleman (RSA), which is considerably more complex because of its modular exponentiation, and Elliptic Curve Cryptography (ECC)) [
15], which are used to ensure the security of SM data traveling through data networks. However, such cryptographic schemes will be easily broken when a sufficiently powerful and stable quantum computer runs Shor’s algorithm, which is a polynomial time quantum algorithm for integer factorization and discrete logarithms [
16].
The design of post-quantum cryptography (PQC) schemes for SM systems is a timely issue because SM data comprise sensitive information. Furthermore, the feasibility of implementing PQC schemes in SM equipment is also important because the large-scale implementation of SM systems imposes a constraint on the cost of these devices. In this sense, the research community has been working towards implementations of PQC schemes. For instance, ref. [
17] proposed a software implementation of CRYSTALS-Kyber, FrodoKEM, and NewHope using Graphics Processing Unit (GPU) seeking higher performance. On the other hand, ref. [
18] presented a pure hardware implementation of the CRYSTALS-Kyber scheme using a Field Programmable Gate Array (FPGA), focusing on high performance. Note that these solutions are not suitable for SM systems because they are based on high-end platforms. Solutions using System-on-a-Chip (SoC) devices can also be found in the literature. For instance, ref. [
19] discussed a RISC-V co-processor for lattice-based cryptography that uses dedicated hardware to accelerate the Number Theoretic Transform (NTT) and hash generation in an SoC device, and [
20] presented an Instruction Set Architecture (ISA) for lattice-based cryptography, also based on SoC devices. Note that these SoC-based solutions, which focus on a general-purpose architecture for lattice-based cryptography, lack optimizations for a specific scheme, which is of utmost importance when the final target is SM equipment (i.e., hardware-constrained equipment).
In this regard, the following issues must be addressed to consider the effective and optimized implementation of PQC schemes for smart metering applications:
To identify the most time-consuming routines, which are good candidates to be implemented in hardware, leading to better performance.
To certify the PQC scheme can be implemented in hardware-constrained equipment, keeping in mind that a PQC scheme uses long keys and long ciphertext.
If the implementation is viable, it must also be executed in a period that does not impact the performance of data exchanges between nodes in data networks.
In this regard, this paper focuses on the feasibility of using a PQC scheme as a Key Encapsulation Mechanism (KEM) for ensuring the security of SM data traveling through data networks; we do not aim to propose a new authentication scheme for SM equipment. When the KEM is completed, each party (SM and Meter Data Management System (MDMS)) will share a symmetric key, enabling any kind of secure communication using symmetric cryptography [21], such as the Advanced Encryption Standard (AES) in Galois/Counter Mode. In this context, the implementation of the FrodoKEM scheme in an SoC device is detailed because this scheme demands the most extensive use of hardware resources [22] in comparison to other candidates for the National Institute of Standards and Technology (NIST) post-quantum standardization process [
23]. Consequently, the successful implementation of the FrodoKEM scheme in SM equipment, which is supposed to be hardware-constrained, constitutes a baseline for fostering the adoption of PQC in SM systems. Note that [
24] presented the initial results of this investigation. The main contributions of this work are as follows:
A description of a proposed hardware/software implementation of the FrodoKEM scheme for hardware-constrained SM equipment, relying on an SoC device. In this sense, hardware accelerators for the matrix-by-matrix multiplications and for the Secure Hash Algorithm and Keccak 128 (SHAKE128) hash function are presented.
A performance comparison between the proposed hardware/software implementation and the software implementation (i.e., benchmark implementation) in terms of processing and execution times. Also, an analysis of the impact of the communication burden between the Advanced RISC Machine (ARM) and the FPGA when the SoC device is used to hardware accelerate matrix-by-matrix multiplications and the SHAKE128 hash function.
An evaluation of the suitability of PQC schemes for hardware-constrained equipment, such as SM, which relies on SoC devices.
Based on the use of the Xilinx Zynq-7000 SoC device [
25], the attained results show that the execution time of the proposed hardware/software implementation of the FrodoKEM scheme is around
compared to the benchmark implementation, which is the official specification of the FrodoKEM scheme, while remaining fully compliant with it. In other words, a PQC scheme based on standard lattices (e.g., FrodoKEM) can run efficiently in hardware resource-constrained equipment. The attained results show that the proposed hardware/software implementation is suitable for SM equipment and, consequently, can remarkably benefit the security of SM data traveling through data networks. Last, this analysis can be extended to a dedicated cryptographic module present in the MDMS server. Usually, this module is used to physically separate the cryptography part (especially private and shared keys) from the management part, adding another security layer.
The rest of this paper is organized as follows:
Section 2 formulates the investigated problem;
Section 3 pays attention to the background of the FrodoKEM scheme;
Section 4 details the benchmark and proposed hardware/software implementations of the FrodoKEM scheme;
Section 5 analyzes the hardware resource usage, execution time, and processing time of benchmark and proposed hardware/software implementations; finally,
Section 6 outlines concluding remarks.
Notation
For the sake of simplicity, we adopted the same notation used in the submitted specification of FrodoKEM [26]. Uppercase and lowercase bold letters are used for matrices and vectors, respectively. The set of all integers is indicated by $\mathbb{Z}$, and $\mathbb{Z}_q$ denotes the quotient ring of integers modulo $q$. The inner product of two $n$-dimensional vectors $\mathbf{a}$, $\mathbf{b}$ is represented by $\langle \mathbf{a}, \mathbf{b} \rangle$. Finally, the concatenation of two vectors is denoted by the $\|$ symbol.
2. Problem Formulation
Security and privacy are among the main challenges faced by SM systems as sensitive data is constantly produced and exchanged between consumers, prosumers, and electric utilities [
27]. Energy consumption and other data collected from SMs are sensitive because they carry private information about consumers or prosumers, from which their lifestyles and economic status can be inferred. Consequently, the concern about security breaches in SM systems and data has attracted more attention from the electricity sector. The design of privacy-preserving solutions that enable SM equipment to perform billing, operations, and value-added services is of utmost importance because SM equipment is one of the leading enablers of SG. Besides, SM data is exchanged with the MDMS over Internet Protocol (IP)-based and dedicated networks and will soon be exchanged over non-dedicated ones (i.e., the Internet); consequently, the risk of a security breach will be significant if an eavesdropper uses a quantum computer.
The security of SM systems involves several aspects related to consumers, prosumers, and data networks [
28]. It is well-established that data networks must present protection against threats to system-level security (e.g., credential compromise, denial of service attacks), threats to services (e.g., SMs cloning, location migration), and privacy threats (e.g., interception/eavesdroppers, misuse of private data). In this context, cryptography has emerged as one of the most applied methods to protect SM data against eavesdroppers [
29,
30,
31,
32,
33] when they are transmitted through data networks.
The vast majority of public-key cryptographic schemes rely on the hardness of the integer factorization problem or the discrete logarithm problem because of their simplicity and lightweight implementation. Data networks relying on these schemes are secure nowadays because the underlying operations are easy for a classical computer to perform but hard for it to invert, which prevents unauthorized disclosure and data breaches. This scenario will change significantly with the advent of powerful quantum computers, which, by running Shor's algorithm, can easily solve the problems mentioned above. Consequently, an eavesdropper who uses a quantum computer will be able to access sensitive data transmitted through data networks serving SM systems, with severe consequences for stakeholders in the electricity sector. Even more alarming, not only will SM data communication be compromised in the future, but an attacker who stores smart meter data today might also be able to decipher it later using a quantum computer. That said, recent advances in quantum computing have sparked interest in the research and development of PQC schemes, which are thought to be secure against quantum and classical computer attacks [
34]. Therefore, preparing for this scenario is paramount to providing long-term security of SM data.
Figure 1 illustrates the scenario we are interested in addressing in this paper. The focus is on the security of SM data, transmitted between SM equipment and MDMS, against eavesdroppers that are capable of performing sniffer attacks. SM equipment, which has the lowest processing power in an SG, may constitute a security flaw if it is not capable of embedding a PQC scheme. Note that a PQC scheme is more complex than the current ones, such as RSA and ECC, and requires more processing power, which might be unfeasible for hardware-constrained devices such as SMs. One approach to overcome this hardware constraint is to use an SoC device, enabling a hardware/software implementation. This kind of implementation allows a remarkable acceleration of the most time-consuming and complex operations, which are hardware-implemented, while the less complex and less time-consuming tasks are software-implemented.
Based on this discussion, the following research questions arise: Is it possible to implement a PQC scheme in hardware-constrained equipment? Can a hardware/software implementation bring significant advantages over a software implementation? If yes, how much faster will it be? Is this implementation feasible in terms of hardware resource usage? The following sections present answers to these research questions.
3. Background of the FrodoKEM Scheme
Recently, research about lattice-based cryptography has grown significantly, achieving important advances. For instance, one of the most relevant contributions was made by Lyubashevsky [
35,
36], who proposed a new class of lattices called ideal lattices. A lattice is considered ideal when it corresponds to an ideal in a particular algebraic structure, such as polynomial rings. This new class is the basis of the Ring-Learning With Errors (R-LWE) problem, in contrast to the standard lattices, which are the basis of the Learning With Errors (LWE) problem. LWE is a mature and well-studied cryptographic primitive that relies on the worst-case hardness of the Shortest Vector Problem (SVP) in a standard lattice. On the other hand, R-LWE has an additional algebraic structure and relies on the worst-case hardness of problems in an ideal lattice. Due to this additional algebraic structure, schemes based on ideal lattices are more efficient than those based on standard lattices because they need less memory and perform better. Furthermore, while standard lattices are based mainly on matrix-vector (or matrix-matrix) multiplications, ideal lattices are based on polynomial multiplication, which considerably reduces the complexity and increases efficiency [37]. Despite this, it cannot be ruled out that a future attack will exploit this added algebraic structure in ideal lattices and consequently break the system. On the other hand, standard lattices do not suffer from this potential vulnerability and can be considered a more conservative choice. Based on this analysis, the German Federal Office for Information Security (BSI) recommends the FrodoKEM scheme to protect confidential information on a long-term basis [
38].
Moreover, lattice-based cryptosystems are attractive because they are based on the worst-case hardness of lattice problems. If one succeeds in breaking the cryptosystem with even a small probability, one can also solve any instance of a particular lattice problem. This brings a strong notion of security because the average-case instance is at least as hard as the worst-case instance of a related lattice problem [
37].
The NIST initiated a process to solicit, evaluate, and standardize one or more quantum-secure public-key cryptography schemes [
39]. More than 40% of the candidate schemes submitted to the NIST PQC standardization process were based on lattice-based cryptography, initially proposed by Ajtai [
40]. Simplicity and parallelizable operations (e.g., addition, multiplication, and modular reduction) heavily influenced such submissions.
The FrodoCCS key exchange scheme [
41] was designed by trading a little efficiency for higher confidence in its security in the post-quantum era. Its simplicity is confirmed by its use of only basic operations, such as addition and multiplication. Furthermore, its parameters are more flexible and easier to scale than those of ideal lattice-based schemes, such as NewHope [
42]. The latter has more restrictions as it uses the NTT algorithm for polynomial multiplication. Consequently, FrodoCCS can achieve different security levels with linear resource expenditure.
Based on FrodoCCS, the FrodoKEM scheme [
26], which is a KEM, was submitted to the NIST post-quantum standardization process. It was selected for the third round of the NIST competition as one of eight alternate candidates. A benchmark implementation and a vectorized implementation for high-end Intel CPUs were posted along with its submission. There have been a few research studies on the feasibility of Frodo variants on embedded devices, such as ARMs and FPGAs [
43]. However, none of these studies used an analysis based on an ARM processor and an FPGA device together. In this paper, we fill this gap by evaluating standard lattice-based cryptography and its feasibility for constrained embedded devices, with an eye on SM data communications. In this sense, we analyzed the submitted benchmark implementation and identified the most time-consuming operations. Subsequently, these routines are hardware accelerated using an FPGA with a direct communication path to an ARM, which is responsible for performing the remaining tasks.
Before proceeding to the implementation section, a theoretical background review of the LWE problem, the basis of FrodoKEM, and how it applies to the FrodoKEM scheme is outlined.
3.1. Learning with Errors
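The security of the proposed FrodoKEM relies on the hardness of the LWE problem. According to Regev [44], the LWE problem asks to recover a secret $\mathbf{s} \in \mathbb{Z}_q^n$ given a sequence of “approximate” random linear equations on $\mathbf{s}$. The formal definition is as follows. Fix a size parameter $n \geq 1$, a modulus $q \geq 2$, and an error probability distribution $\chi$ on $\mathbb{Z}_q$. Now, let $A_{\mathbf{s},\chi}$ on $\mathbb{Z}_q^n \times \mathbb{Z}_q$ be the probability distribution obtained by choosing a vector $\mathbf{a} \in \mathbb{Z}_q^n$ uniformly at random, choosing $e \in \mathbb{Z}_q$ according to $\chi$, and outputting $(\mathbf{a}, \langle \mathbf{a}, \mathbf{s} \rangle + e)$, where additions are performed in $\mathbb{Z}_q$. Finally, it can be said that an algorithm solves LWE with modulus $q$ and error distribution $\chi$ if, for any $\mathbf{s} \in \mathbb{Z}_q^n$, given an arbitrary number of independent samples from $A_{\mathbf{s},\chi}$, it outputs $\mathbf{s}$ (with high probability).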
Therefore, the LWE problem is nothing more than a noisy system of linear equations. In general, this problem is not trivial to solve. No quantum algorithms are currently known to solve the LWE problem in polynomial time [
44]. Consequently, schemes based on LWE are considered quantum-secure.
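As an illustration of the definition above, the following minimal Python sketch draws a few LWE samples. The parameter values and the uniform error in a small interval are assumptions chosen for readability (a toy stand-in for the error distribution), and the helper names `lwe_sample` and `centered` are hypothetical.

```python
import random

def lwe_sample(s, q, noise_bound, rng):
    """Return one sample (a, <a, s> + e mod q) for the secret vector s."""
    a = [rng.randrange(q) for _ in s]            # a uniform in Z_q^n
    e = rng.randint(-noise_bound, noise_bound)   # small error (toy stand-in for chi)
    b = (sum(ai * si for ai, si in zip(a, s)) + e) % q
    return a, b

def centered(x, q):
    """Map x in Z_q to its representative in (-q/2, q/2]."""
    x %= q
    return x - q if x > q // 2 else x

rng = random.Random(0)
q, n, noise_bound = 2**15, 8, 4                  # toy parameters (assumption)
s = [rng.randrange(q) for _ in range(n)]
samples = [lwe_sample(s, q, noise_bound, rng) for _ in range(5)]

# Knowing s, the residual b - <a, s> is just the small error e;
# without s, each b looks like a uniform element of Z_q.
residuals = [centered(b - sum(ai * si for ai, si in zip(a, s)), q)
             for a, b in samples]
```

Each residual stays within the noise bound, which is exactly the "approximate" nature of the linear equations in the definition.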
3.2. The Frodo Key Encapsulation Mechanism Scheme
The FrodoKEM scheme can be divided into three algorithms: key pair generation, encapsulation, and decapsulation [26], as described in Algorithms 1–3, respectively. These algorithms call a few subroutines; see [26] for more details. Briefly, the Gen(.) function receives a seed as input and outputs a matrix $\mathbf{A}$, which is generated using a hash function. Similarly, the SampleMatrix(.) function outputs a matrix sampled from the $\chi$ error probability distribution. The Pack(.) function transforms the received matrix into a bit string, while the Unpack(.) function does the opposite. The Encode(.) function encodes bit strings as mod-$q$ integer matrices, and the Decode(.) function performs the inverse operation. Finally, all bit string lengths ($\mathrm{len}_{\mathrm{seed}_A}$, $\mathrm{len}_{z}$, $\mathrm{len}_{\mu}$, $\mathrm{len}_{\mathrm{seed}_{SE}}$, $\mathrm{len}_{s}$, $\mathrm{len}_{k}$, $\mathrm{len}_{pkh}$, $\mathrm{len}_{ss}$, $\mathrm{len}_{\chi}$) are previously known constants; $D$ is the exponent that defines the scheme modulus $q = 2^D$; $n$, $\bar{m}$, and $\bar{n}$ are integer matrix dimensions with $n \equiv 0 \pmod 8$; and $T_\chi$ is the distribution table for sampling.
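Algorithm 1: Key pair generation
Input: None
Output: Public key $pk$; Secret key $sk$
Procedure:
1. Choose uniformly random seeds: $\mathbf{s} \,\|\, \mathrm{seed}_{SE} \,\|\, \mathbf{z}$
2. Generate a pseudo-random seed: $\mathrm{seed}_A \leftarrow \mathrm{SHAKE}(\mathbf{z}, \mathrm{len}_{\mathrm{seed}_A})$
3. Generate matrix $\mathbf{A} \in \mathbb{Z}_q^{n \times n}$ via $\mathbf{A} \leftarrow \mathrm{Gen}(\mathrm{seed}_A)$
4. Generate pseudo-random bit string: $(\mathbf{r}^{(0)}, \ldots, \mathbf{r}^{(2n\bar{n}-1)}) \leftarrow \mathrm{SHAKE}(\texttt{0x5F} \,\|\, \mathrm{seed}_{SE}, 2n\bar{n} \cdot \mathrm{len}_{\chi})$
5. Sample error matrix $\mathbf{S}^T \leftarrow \mathrm{SampleMatrix}((\mathbf{r}^{(0)}, \ldots, \mathbf{r}^{(n\bar{n}-1)}), \bar{n}, n, T_\chi)$
6. Sample error matrix $\mathbf{E} \leftarrow \mathrm{SampleMatrix}((\mathbf{r}^{(n\bar{n})}, \ldots, \mathbf{r}^{(2n\bar{n}-1)}), n, \bar{n}, T_\chi)$
7. Compute $\mathbf{B} \leftarrow \mathbf{A}\mathbf{S} + \mathbf{E}$
8. Compute $\mathbf{b} \leftarrow \mathrm{Pack}(\mathbf{B})$
9. Compute $\mathbf{pkh} \leftarrow \mathrm{SHAKE}(\mathrm{seed}_A \,\|\, \mathbf{b}, \mathrm{len}_{pkh})$
Return: Public key $pk \leftarrow \mathrm{seed}_A \,\|\, \mathbf{b}$; Secret key $sk \leftarrow (\mathbf{s} \,\|\, \mathrm{seed}_A \,\|\, \mathbf{b}, \mathbf{S}^T, \mathbf{pkh})$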
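The main part of the key pair generation (Algorithm 1) is the calculation of the LWE sample operation $\mathbf{B} = \mathbf{A}\mathbf{S} + \mathbf{E}$. The matrix $\mathbf{A}$ is generated from the pseudo-random seed $\mathrm{seed}_A$, which is itself created by hashing a uniformly random seed. The FrodoKEM scheme offers two options for generating $\mathbf{A}$: one based on the AES cipher and one based on the SHAKE128 hash algorithm. The matrices $\mathbf{S}$ and $\mathbf{E}$ are sampled according to the distribution $\chi$. Later, matrix $\mathbf{B}$ is packed into the bit string $\mathbf{b}$, and the bit strings $\mathrm{seed}_A$ and $\mathbf{b}$ are hashed to get a hash value $\mathbf{pkh}$. Finally, the public key $pk$ is composed of $\mathrm{seed}_A$ and $\mathbf{b}$, while the secret key $sk$ is composed of $\mathbf{s}$ (previously generated uniformly at random), $\mathrm{seed}_A$, $\mathbf{b}$, $\mathbf{S}^T$, and $\mathbf{pkh}$.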
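Algorithm 2: Encapsulation
Input: Public key $pk = \mathrm{seed}_A \,\|\, \mathbf{b}$
Output: Ciphertext $\mathbf{c}_1 \,\|\, \mathbf{c}_2$; Shared secret $\mathbf{ss}$
Procedure:
1. Choose a uniformly random key: $\mu \in \{0,1\}^{\mathrm{len}_{\mu}}$
2. Compute $\mathbf{pkh} \leftarrow \mathrm{SHAKE}(pk, \mathrm{len}_{pkh})$
3. Generate pseudo-random values $\mathrm{seed}_{SE} \,\|\, \mathbf{k} \leftarrow \mathrm{SHAKE}(\mathbf{pkh} \,\|\, \mu, \mathrm{len}_{\mathrm{seed}_{SE}} + \mathrm{len}_{k})$
4. Generate pseudo-random bit string: $(\mathbf{r}^{(0)}, \ldots, \mathbf{r}^{(2\bar{m}n+\bar{m}\bar{n}-1)}) \leftarrow \mathrm{SHAKE}(\texttt{0x96} \,\|\, \mathrm{seed}_{SE}, (2\bar{m}n + \bar{m}\bar{n}) \cdot \mathrm{len}_{\chi})$
5. Sample error matrix $\mathbf{S}' \leftarrow \mathrm{SampleMatrix}((\mathbf{r}^{(0)}, \ldots, \mathbf{r}^{(\bar{m}n-1)}), \bar{m}, n, T_\chi)$
6. Sample error matrix $\mathbf{E}' \leftarrow \mathrm{SampleMatrix}((\mathbf{r}^{(\bar{m}n)}, \ldots, \mathbf{r}^{(2\bar{m}n-1)}), \bar{m}, n, T_\chi)$
7. Generate $\mathbf{A} \leftarrow \mathrm{Gen}(\mathrm{seed}_A)$
8. Compute $\mathbf{B}' \leftarrow \mathbf{S}'\mathbf{A} + \mathbf{E}'$
9. Compute $\mathbf{c}_1 \leftarrow \mathrm{Pack}(\mathbf{B}')$
10. Sample error matrix $\mathbf{E}'' \leftarrow \mathrm{SampleMatrix}((\mathbf{r}^{(2\bar{m}n)}, \ldots, \mathbf{r}^{(2\bar{m}n+\bar{m}\bar{n}-1)}), \bar{m}, \bar{n}, T_\chi)$
11. Compute $\mathbf{B} \leftarrow \mathrm{Unpack}(\mathbf{b}, n, \bar{n})$
12. Compute $\mathbf{V} \leftarrow \mathbf{S}'\mathbf{B} + \mathbf{E}''$
13. Compute $\mathbf{C} \leftarrow \mathbf{V} + \mathrm{Encode}(\mu)$
14. Compute $\mathbf{c}_2 \leftarrow \mathrm{Pack}(\mathbf{C})$
15. Compute $\mathbf{ss} \leftarrow \mathrm{SHAKE}(\mathbf{c}_1 \,\|\, \mathbf{c}_2 \,\|\, \mathbf{k}, \mathrm{len}_{ss})$
Return: Ciphertext $\mathbf{c}_1 \,\|\, \mathbf{c}_2$; Shared secret $\mathbf{ss}$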
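In the encapsulation (Algorithm 2), three noise matrices are generated: $\mathbf{S}'$, $\mathbf{E}'$, and $\mathbf{E}''$. To create these matrices, a pseudo-random bit string is sampled according to $\chi$. The inputs of the algorithm, bit strings $\mathrm{seed}_A$ and $\mathbf{b}$, are used to retrieve matrices $\mathbf{A}$ and $\mathbf{B}$. Later, they are used to calculate $\mathbf{B}' = \mathbf{S}'\mathbf{A} + \mathbf{E}'$ and $\mathbf{V} = \mathbf{S}'\mathbf{B} + \mathbf{E}''$. By adding the encoded $\mu$ (previously generated uniformly at random) to the matrix $\mathbf{V}$, the matrix $\mathbf{C}$ is created. Then, matrices $\mathbf{B}'$ and $\mathbf{C}$ are packed, generating bit strings $\mathbf{c}_1$ and $\mathbf{c}_2$, which, concatenated, form the ciphertext. Finally, bit strings $\mathbf{c}_1$, $\mathbf{c}_2$, and $\mathbf{k}$ (pseudo-randomly generated using the hash function) are hashed, creating the shared secret $\mathbf{ss}$.
The decapsulation (Algorithm 3) aims to check whether the ciphertext ($\mathbf{c}_1 \,\|\, \mathbf{c}_2$) is valid. In short, bit strings $\mathbf{c}_1$ and $\mathbf{c}_2$ are unpacked, retrieving matrices $\mathbf{B}'$ and $\mathbf{C}$. Then, $\mathbf{M} = \mathbf{C} - \mathbf{B}'\mathbf{S}$ is calculated and decoded, yielding $\mu'$. The encapsulation steps are redone, this time generating matrices $\mathbf{B}''$ and $\mathbf{C}'$. If matrices $\mathbf{B}'$ and $\mathbf{C}$ match matrices $\mathbf{B}''$ and $\mathbf{C}'$, the shared secret returned is the hash of $\mathbf{c}_1$, $\mathbf{c}_2$, and $\mathbf{k}'$ (pseudo-randomly generated using the hash function based on $\mu'$). Otherwise, the shared secret returned is the hash of $\mathbf{c}_1$, $\mathbf{c}_2$, and $\mathbf{s}$ (part of the secret key $sk$).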
More information and details about the parameters, the error sampling procedure, and the lattice structure can be consulted in the official specification of FrodoKEM [
26,
45].
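Algorithm 3: Decapsulation
Input: Ciphertext $\mathbf{c}_1 \,\|\, \mathbf{c}_2$; Secret key $sk = (\mathbf{s} \,\|\, \mathrm{seed}_A \,\|\, \mathbf{b}, \mathbf{S}^T, \mathbf{pkh})$
Output: Shared secret $\mathbf{ss}$
Procedure:
1. Compute $\mathbf{B}' \leftarrow \mathrm{Unpack}(\mathbf{c}_1, \bar{m}, n)$
2. Compute $\mathbf{C} \leftarrow \mathrm{Unpack}(\mathbf{c}_2, \bar{m}, \bar{n})$
3. Compute $\mathbf{M} \leftarrow \mathbf{C} - \mathbf{B}'\mathbf{S}$
4. Compute $\mu' \leftarrow \mathrm{Decode}(\mathbf{M})$
5. Parse $pk \leftarrow \mathrm{seed}_A \,\|\, \mathbf{b}$
6. Generate pseudo-random values: $\mathrm{seed}_{SE}' \,\|\, \mathbf{k}' \leftarrow \mathrm{SHAKE}(\mathbf{pkh} \,\|\, \mu', \mathrm{len}_{\mathrm{seed}_{SE}} + \mathrm{len}_{k})$
7. Generate pseudo-random bit string: $(\mathbf{r}^{(0)}, \ldots, \mathbf{r}^{(2\bar{m}n+\bar{m}\bar{n}-1)}) \leftarrow \mathrm{SHAKE}(\texttt{0x96} \,\|\, \mathrm{seed}_{SE}', (2\bar{m}n + \bar{m}\bar{n}) \cdot \mathrm{len}_{\chi})$
8. Sample error matrix $\mathbf{S}' \leftarrow \mathrm{SampleMatrix}((\mathbf{r}^{(0)}, \ldots, \mathbf{r}^{(\bar{m}n-1)}), \bar{m}, n, T_\chi)$
9. Sample error matrix $\mathbf{E}' \leftarrow \mathrm{SampleMatrix}((\mathbf{r}^{(\bar{m}n)}, \ldots, \mathbf{r}^{(2\bar{m}n-1)}), \bar{m}, n, T_\chi)$
10. Generate $\mathbf{A} \leftarrow \mathrm{Gen}(\mathrm{seed}_A)$
11. Compute $\mathbf{B}'' \leftarrow \mathbf{S}'\mathbf{A} + \mathbf{E}'$
12. Sample error matrix $\mathbf{E}'' \leftarrow \mathrm{SampleMatrix}((\mathbf{r}^{(2\bar{m}n)}, \ldots, \mathbf{r}^{(2\bar{m}n+\bar{m}\bar{n}-1)}), \bar{m}, \bar{n}, T_\chi)$
13. Compute $\mathbf{B} \leftarrow \mathrm{Unpack}(\mathbf{b}, n, \bar{n})$
14. Compute $\mathbf{V} \leftarrow \mathbf{S}'\mathbf{B} + \mathbf{E}''$
15. Compute $\mathbf{C}' \leftarrow \mathbf{V} + \mathrm{Encode}(\mu')$
16. if $\mathbf{B}' \,\|\, \mathbf{C} = \mathbf{B}'' \,\|\, \mathbf{C}'$ then $\mathbf{ss} \leftarrow \mathrm{SHAKE}(\mathbf{c}_1 \,\|\, \mathbf{c}_2 \,\|\, \mathbf{k}', \mathrm{len}_{ss})$ else $\mathbf{ss} \leftarrow \mathrm{SHAKE}(\mathbf{c}_1 \,\|\, \mathbf{c}_2 \,\|\, \mathbf{s}, \mathrm{len}_{ss})$ end
Return: Shared secret $\mathbf{ss}$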
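To make the algebra behind Algorithms 1–3 concrete, the following Python sketch reproduces only the matrix arithmetic of a FrodoKEM-like scheme. It is a toy under stated assumptions: the dimensions are tiny rather than the FrodoKEM-640 ones, the error is uniform in a small interval instead of being drawn from the distribution table, and seeds, hashing, packing, and the re-encryption check are omitted. All function names are hypothetical.

```python
import random

Q, N, NBAR, B_BITS = 2**15, 16, 4, 2   # toy sizes (assumption), not FrodoKEM-640

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) % Q for col in zip(*Y)]
            for row in X]

def add(X, Y, sign=1):
    return [[(x + sign * y) % Q for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def uniform(rows, cols, rng):
    return [[rng.randrange(Q) for _ in range(cols)] for _ in range(rows)]

def noise(rows, cols, rng):
    # Toy stand-in for sampling from the error distribution.
    return [[rng.randint(-2, 2) for _ in range(cols)] for _ in range(rows)]

def encode(M):
    # Place each B_BITS-bit value in the most significant bits of a Z_q entry.
    return [[(k * (Q >> B_BITS)) % Q for k in row] for row in M]

def decode(M):
    # Round each entry to the nearest multiple of q / 2^B_BITS.
    step = Q >> B_BITS
    return [[((x + step // 2) // step) % (1 << B_BITS) for x in row] for row in M]

def keygen(rng):
    A = uniform(N, N, rng)                # carried as a seed in the real scheme
    S, E = noise(N, NBAR, rng), noise(N, NBAR, rng)
    return (A, add(matmul(A, S), E)), S   # public (A, B = AS + E), secret S

def encaps(pk, rng):
    A, B = pk
    Sp, Ep = noise(NBAR, N, rng), noise(NBAR, N, rng)
    Epp = noise(NBAR, NBAR, rng)
    mu = [[rng.randrange(1 << B_BITS) for _ in range(NBAR)] for _ in range(NBAR)]
    Bp = add(matmul(Sp, A), Ep)                        # B' = S'A + E'
    C = add(add(matmul(Sp, B), Epp), encode(mu))       # C = S'B + E'' + Encode(mu)
    return (Bp, C), mu

def decaps(ct, S):
    Bp, C = ct
    return decode(add(C, matmul(Bp, S), sign=-1))      # Decode(C - B'S)

rng = random.Random(1)
pk, sk = keygen(rng)
ct, mu = encaps(pk, rng)
mu_rec = decaps(ct, sk)
```

Decoding succeeds because $\mathbf{C} - \mathbf{B}'\mathbf{S} = \mathrm{Encode}(\mu) + \mathbf{S}'\mathbf{E} + \mathbf{E}'' - \mathbf{E}'\mathbf{S}$, and with these toy parameters the noise term is far smaller than half the encoding step.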
4. Implementation
Motivated by the performance results using an FPGA device for hybrid transceiver [
46], this section details the proposed hardware/software implementation of the FrodoKEM scheme in an SoC device. Furthermore, it takes advantage of the suitability of SoC devices for implementing and quickly running complex algorithms [
47], such as the FrodoKEM-640 variant using the SHAKE128 hash function. This variant matches or exceeds the brute-force security of AES-128.
Regarding practical applications, the literature points out the following approaches to implementing PQC schemes [
48]:
Software-based: the implementation is carried out in microprocessors, microcontrollers, and ARM-processors. In other words, it is totally in software. The implementation is concluded in a short period, and the cost is low; however, it may not offer real-time performance.
Hardware/software-based with soft-core processors: it uses an FPGA with a soft-core processor implemented inside it, typically MicroBlaze or Nios II. Usually, the processor holds the main application, and the FPGA acts as a hardware accelerator. The implementation takes more time and costs more than the software implementation; however, it offers real-time performance.
Hardware/software-based with hard-core processors: it is similar to the previous approach. The difference is that a hard-core processor, typically an ARM, is integrated into an SoC device together with an FPGA. The hard-core processor does not consume any FPGA area. The implementation takes more time and costs more than the software implementation; however, it offers better real-time performance than the previous approach.
Hardware-based: it refers to the hardware implementation in an FPGA device. The implementation takes much more time and costs much more than the previous approaches; however, it offers the best real-time performance.
It is worth mentioning that SM equipment is supposed to be a commodity in the electricity sector, which requires low production costs and capacity to embed a PQC scheme. Therefore, full software-based implementation may not be fast enough to support a PQC scheme and comply with the constraints of SM systems. On the other hand, a complete hardware-based implementation would be expensive. In this regard, a hardware/software-based implementation is the alternative with the most attractive trade-offs.
To the authors’ knowledge, there are some implementations of PQC schemes using the first and fourth approaches [
18,
49,
50,
51], but there is a gap in the literature concerning the second and third approaches. Regarding the third approach, only high-speed implementations are found in the literature [
48], rather than the lightweight implementation targeted by this paper. The second and third approaches have a high potential to perform considerably better than a software-based implementation. Also, they can offer good performance compared to a hardware-based implementation, even considering the overhead of exchanging data between the ARM and FPGA components. Between the second and third approaches, the one based on a hard-core processor has higher potential because the processor is hardware-implemented. In this sense, the third approach, hardware/software-based with a hard-core processor, seems to be the best approach to implement a PQC scheme in a hardware-constrained device, such as SM equipment.
4.1. Preliminary Analysis
To identify the main bottlenecks of a benchmark implementation of FrodoKEM, we executed the code from [39] on a Cortex-A9 ARM processor. After a detailed analysis, three operations stood out as the most time-consuming. As expected, the operation $\mathbf{A}\mathbf{S} + \mathbf{E}$, used in the key pair generation process, has a high computational burden because of its large matrix-by-matrix multiplication, which requires long loops with addition and multiplication operations. For the same reason, the operation $\mathbf{S}'\mathbf{A} + \mathbf{E}'$ is even more expensive, because it is used in both the encapsulation and decapsulation algorithms. Finally, the most expensive operation is the SHAKE128 hash function, a specific version of the SHAKE hash function. It is time-consuming in all three algorithms due to its loop-based structure. As mentioned above, all of these operations impose a considerable computational burden because of the large size of FrodoKEM keys and the use of a public or private key, directly or indirectly.
Table 1 summarizes the relative execution time of FrodoKEM-640 in the aforementioned ARM processor.
With this preliminary analysis accomplished, our strategy for accelerating the code execution is to implement the three operations above in hardware and keep the software part (including the three operations) as close as possible to the original to ensure fairness in further comparisons.
4.2. Main Components
The MicroZed 7010 board was chosen to implement the two matrix-by-matrix multiplications and the SHAKE128 hash function in an FPGA and the rest of the code in an ARM processor. The MicroZed board is a low-cost System on Module (SoM) based on the Xilinx Zynq-7000 SoC. In addition to the Zynq-7000, the module contains 1 GB of DDR3 SDRAM, 128 Mb of QSPI Flash, a 33.33 MHz oscillator, and other functions and interfaces.
The Xilinx Zynq-7000 SoC used is an XC7Z010-1CLG400C, which has a Processing System (PS) and Programmable Logic (PL). The PS is based on an ARM Cortex-A9 with the ARMv7 architecture. On the other hand, the PL is based on an FPGA with 28K Programmable Logic Cells, 17,600 Lookup Tables (LUTs), 35,200 Registers, 60 Block Random Access Memories (BRAMs) with 36 Kb each, and 80 Digital Signal Processing (DSP) blocks [
25]. Using a Phase-Locked Loop (PLL), 666.66 MHz and 100 MHz clocks are derived from the 33.33 MHz built-in oscillator to feed PS and PL, respectively.
4.3. Structure
The schematic diagram representing the operation of the FrodoKEM scheme is shown in
Figure 2. It can be divided into three main instances: PS, Interconnect, and PL.
The PS, based on an ARM Cortex-A9, is where the benchmark implementation is placed. The interconnect instance is responsible for providing an interface between the PS and the PL using the AXI-MM protocol. Finally, the PL, based on an FPGA, is where the operations $\mathbf{A}\mathbf{S} + \mathbf{E}$ and $\mathbf{S}'\mathbf{A} + \mathbf{E}'$, together with the SHAKE128 hash function, are hardware implemented.
4.4. Benchmark Implementation
The benchmark implementation refers to the software implementation of the FrodoKEM scheme in the ARM processor. In this sense, the code was transferred entirely and adapted for the ARM processor, ensuring it complies with the software implementation detailed in [
39]. The software implementation can be divided into three main parts that individually correspond to Algorithms 1–3.
Moreover, it is worth detailing the processing of the operations $\mathbf{A}\mathbf{S} + \mathbf{E}$ and $\mathbf{S}'\mathbf{A} + \mathbf{E}'$, and of the SHAKE128 hash function. As mentioned earlier, the matrix $\mathbf{A}$ is large, and consequently, it is unfeasible to generate it all at once in an embedded device due to its memory constraints. Therefore, the matrix $\mathbf{A}$ is generated slightly differently from what is presented in Algorithm 1.
To reduce the memory requirement, it is proposed to generate parts of the matrix $\mathbf{A}$ on-the-fly and to overwrite each part with a new one after use. This technique allows the FrodoKEM scheme to be embedded in a hardware-constrained device. As a result, the matrix-by-matrix multiplication $\mathbf{A}\mathbf{S}$ has three main loops. In the outer loop, four rows of the matrix $\mathbf{A}$ are generated on-the-fly using the SHAKE128 hash function. The middle loop is responsible for selecting one column from the matrix $\mathbf{S}$, previously fully generated. In the inner loop, four elements, one from each of the four generated rows of the matrix $\mathbf{A}$, are multiplied by one element from the selected column of the matrix $\mathbf{S}$, and the four results are accumulated. Algorithm 4 shows the process in detail, which is also illustrated in Figure 3.
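A rough back-of-the-envelope calculation shows why on-the-fly generation matters. The sketch below assumes that each matrix entry is stored as a 16-bit word and that 1 Kb of BRAM equals 1024 bits; both are assumptions made for this illustration, and the variable names are hypothetical.

```python
N = 640                  # matrix dimension of the FrodoKEM-640 variant
BYTES_PER_ENTRY = 2      # assumption: one 16-bit word per entry

full_matrix_bytes = N * N * BYTES_PER_ENTRY      # whole matrix A
four_rows_bytes = 4 * N * BYTES_PER_ENTRY        # four rows generated on-the-fly

# On-chip BRAM of the XC7Z010: 60 blocks of 36 Kb (1 Kb taken as 1024 bits).
bram_bytes = 60 * 36 * 1024 // 8

print(f"full A    : {full_matrix_bytes} bytes")
print(f"four rows : {four_rows_bytes} bytes")
print(f"all BRAM  : {bram_bytes} bytes")
```

Under these assumptions the full matrix needs 819,200 bytes, roughly three times the 276,480 bytes of on-chip BRAM, whereas four rows need only 5120 bytes.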
Algorithm 4: |
Input: Matrix : Matrix : Seed: Output: Matrix : Procedure: for do for do for do end end end |
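The three-loop structure described above can be sketched in Python as follows. This is an illustrative reading of the prose, not a transcription of Algorithm 4: the row generator `gen_row` is a hypothetical helper that expands a row index and a seed with SHAKE128, and the dimension is assumed to be a multiple of four.

```python
import hashlib

Q = 2**15  # modulus assumed for this sketch

def gen_row(seed, i, n):
    """Hypothetical row generator: expand (row index, seed) into n values mod Q."""
    raw = hashlib.shake_128(i.to_bytes(2, "little") + seed).digest(2 * n)
    return [int.from_bytes(raw[2 * j:2 * j + 2], "little") % Q for j in range(n)]

def as_plus_e(seed, S, E, n, nbar):
    """Compute B = A*S + E while holding only four rows of A at a time."""
    B = [row[:] for row in E]
    for i in range(0, n, 4):                 # outer loop: four fresh rows of A
        rows = [gen_row(seed, i + r, n) for r in range(4)]
        for k in range(nbar):                # middle loop: one column of S
            acc = [0, 0, 0, 0]
            for j in range(n):               # inner loop: four multiply-accumulates
                s_jk = S[j][k]
                for r in range(4):
                    acc[r] += rows[r][j] * s_jk
            for r in range(4):
                B[i + r][k] = (B[i + r][k] + acc[r]) % Q
    return B

# Small usage example with toy dimensions.
seed = b"\x01" * 16
n, nbar = 8, 2
S = [[(3 * j + k) % 5 - 2 for k in range(nbar)] for j in range(n)]
E = [[(j + 2 * k) % 3 - 1 for k in range(nbar)] for j in range(n)]
B = as_plus_e(seed, S, E, n, nbar)
```

Only the four rows held in `rows` are alive at any time, so the memory footprint is independent of the number of rows of the full matrix.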
The operation $\mathbf{S}'\mathbf{A} + \mathbf{E}'$ is handled differently from the previous one. It also (re)generates the matrix $\mathbf{A}$ on-the-fly, although the multiplication process must be adapted, because the regenerated matrix $\mathbf{A}$, used in the encapsulation and decapsulation processes, must be the same as the one used in the key pair generation. Therefore, each SHAKE128 call generates a row of the matrix $\mathbf{A}$, which slightly complicates the logic of the multiplication: the matrix $\mathbf{A}$ is now the right-hand operand, so it is no longer possible to multiply an entire row of the left-hand matrix, in this case the matrix $\mathbf{S}'$, by an entire column of the matrix $\mathbf{A}$. Algorithm 5 presents this matrix-by-matrix multiplication process, which is also pictured in Figure 4.
Algorithm 5: |
Input: Matrix : Matrix : Seed: Output: Matrix : Procedure: for do for do for do for do end end for do end end end |
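One simple way to organize this product when the right-hand matrix is only available row by row is to accumulate partial products, as in the Python sketch below. It is not a transcription of Algorithm 5, which has a more elaborate loop nest; `gen_row` is the same hypothetical SHAKE128-based row generator assumed for illustration.

```python
import hashlib

Q = 2**15  # modulus assumed for this sketch

def gen_row(seed, i, n):
    """Hypothetical row generator: expand (row index, seed) into n values mod Q."""
    raw = hashlib.shake_128(i.to_bytes(2, "little") + seed).digest(2 * n)
    return [int.from_bytes(raw[2 * j:2 * j + 2], "little") % Q for j in range(n)]

def sa_plus_e(seed, Sp, Ep, n, mbar):
    """Compute B' = S'*A + E' with A regenerated one row at a time."""
    Bp = [row[:] for row in Ep]
    for k in range(n):                   # row k of A is generated, used, discarded
        a_k = gen_row(seed, k, n)
        for i in range(mbar):            # every row of S' uses its k-th element
            s_ik = Sp[i][k]
            for j in range(n):           # spread the contribution across row i of B'
                Bp[i][j] = (Bp[i][j] + s_ik * a_k[j]) % Q
    return Bp

# Small usage example with toy dimensions.
seed = b"\x02" * 16
n, mbar = 8, 2
Sp = [[(i + 2 * k) % 5 - 2 for k in range(n)] for i in range(mbar)]
Ep = [[(2 * i + j) % 3 - 1 for j in range(n)] for i in range(mbar)]
Bp = sa_plus_e(seed, Sp, Ep, n, mbar)
```

Here row $k$ of the generated matrix contributes to every entry of the output, so each output entry is completed only after all rows have been processed.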
Finally, SHAKE128 [52] is an extendable-output hash function with a security level of 128 bits, which is reached for output lengths of at least 256 bits. SHAKE128 is an instance of SHA-3, the latest member of the Secure Hash Algorithm family of standards, released by NIST. The SHA-3 family is based on the sponge construction [53], which is shown in Figure 5. The SHAKE128 hash function uses the Keccak-f[1600] function as the transform function, which is composed of five permutation steps. Its parameters are the block size r, equal to 1344 bits, and the capacity c, equal to 256 bits, which result in an internal state of 1600 bits. The SHAKE128 hash function can be divided into two main parts: the absorbing and squeezing phases. In the absorbing phase, the input blocks (message) are XORed into the r-bit part of the internal state. Then, the internal state is passed through the Keccak-f[1600] function. When the entire input message has been absorbed, the squeezing phase begins. In this phase, the output blocks are read from the r-bit part of the state, alternating with applications of the Keccak-f[1600] function, until the desired output size is reached.
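The sponge parameters and the variable output length can be checked with Python's standard `hashlib` module. This is only an illustration of the function's behaviour; the wrapper name `shake128` and the input string are arbitrary choices.

```python
import hashlib

STATE_BITS = 1600                          # width of the Keccak-f[1600] state
CAPACITY_BITS = 256                        # capacity c
RATE_BITS = STATE_BITS - CAPACITY_BITS     # block size r
RATE_BYTES = RATE_BITS // 8                # bytes absorbed or squeezed per block


def shake128(message, out_len):
    """Return out_len bytes of SHAKE128 output for the given message."""
    return hashlib.shake_128(message).digest(out_len)


short = shake128(b"example seed", 32)      # fits in a single squeezed block
long_ = shake128(b"example seed", 400)     # needs several squeezed blocks
```

Because the output is squeezed block by block from the same state, `short` is a prefix of `long_`; requesting more output only adds further squeeze iterations.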
4.5. The Proposed Hardware/Software Implementation
Based on the results presented in
Table 1, the most time-consuming operations are implemented in hardware (i.e., the FPGA) to accelerate the execution of the FrodoKEM scheme. These implementations are detailed in the following subsections.
4.5.1. The Operation AS + E
The operation AS + E, which consumes more than 8% of the total time, has its matrix-by-matrix multiplication implemented in hardware. The addition of the matrix E is performed in software, as it is not an expensive operation and moving it to hardware would not bring a significant time saving once the time to transfer the matrix from PS to PL is taken into account.
Figure 6 shows the hardware schematic of the operation AS + E. The schematic consists of four main instances: BRAMs of the matrix A, BRAMs of the matrix S, a multiplier instance, and BRAMs of the result matrix. Note that BRAMs (Block Random Access Memories) are the memory blocks in which larger amounts of data are stored.
As mentioned earlier, the matrix A is generated on-the-fly to save memory resources. Four rows are generated and transmitted from the PS to the BRAMs of the matrix A, located in the PL, to be stored. Four -bits BRAMs are ready to receive these rows. Each generated row is composed of -bits. To speed up the data transfer, 32 bits are transferred at a time, which means that two consecutive values are concatenated and stored in the BRAMs. The matrix S is also transmitted from the PS to the PL. The whole matrix is transferred, since it has already been fully generated in the PS. The transmission follows the same 32-bit principle as for the matrix A, although a -bits BRAM is used for the matrix S.
The multiplication process begins when both matrices are completely stored in their BRAMs. The n-th element of each BRAM of the matrix A is read. As 16-bit words are stored concatenated in each 32-bit BRAM position, eight values are loaded from memory and sent to the multiplier instance. In parallel, the n-th element of the matrix S is also read. Because of the same concatenated storage, -bits values are loaded and transferred to the multiplier instance.
In the multiplier instance, each 32-bit element is split into 16-bit words using the function S. Then, the multiplications are performed, and their results are added to the results of the previous iteration. No special handling of overflow is needed, which is an advantage of the FrodoKEM scheme in saving hardware resources. While the multiplication is performed, the following values are loaded from the BRAMs, keeping the pipeline full to achieve maximum performance. When all elements of the BRAMs have been loaded and processed, the -bits accumulated values are transferred to the BRAMs of the result matrix, which has four -bits BRAMs.
The next four rows of the matrix A must then be generated and transferred to the PL. The matrix S has already been completely transferred in the first iteration and does not need to be sent again. The next iteration then begins, and the process described above is repeated. When all iterations are complete, the BRAMs of the result matrix hold the result, which is transferred to the PS using a 32-bit bus.
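The packing of two 16-bit values into one 32-bit word and the multiply-accumulate step can be modelled in a few lines of Python. The sketch assumes that the first value of each pair sits in the lower half of the word and that the accumulator is 16 bits wide; neither detail is stated in the text, and the function names are hypothetical. It also illustrates one plausible reason why overflow needs no special handling: the FrodoKEM modulus q is a power of two (2^15 or 2^16), so discarding the bits above bit 15 never changes the result modulo q.

```python
MASK16 = 0xFFFF


def pack32(lo, hi):
    """Concatenate two 16-bit values into one 32-bit word (lo in bits 0..15, assumed)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def split32(word):
    """Split a 32-bit word back into its two 16-bit halves (lower half first)."""
    return word & MASK16, (word >> 16) & MASK16


def mac16(acc, a, s):
    """Multiply-accumulate that keeps only 16 bits, like a 16-bit register."""
    return (acc + a * s) & MASK16


def dot_wrapped(a_words, s_words, q):
    """Dot product over packed words using the wrapping accumulator."""
    acc = 0
    for wa, ws in zip(a_words, s_words):
        a0, a1 = split32(wa)
        s0, s1 = split32(ws)
        acc = mac16(acc, a0, s0)
        acc = mac16(acc, a1, s1)
    return acc % q
```

Negative secret coefficients are stored in two's complement by `pack32`, and the wrapped result still matches the exact dot product reduced modulo q.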
4.5.2. The Operation S'A + E'
The operation S'A + E' is responsible for more than 30% of the execution time. This operation also has its matrix-by-matrix multiplication implemented in hardware. As in the operation AS + E, the addition of the matrix E' is carried out in software for the same reasons.
Figure 7 shows the schematic representation of the operation S'A + E'. It also has four instances: BRAMs of the matrix S', BRAMs of the matrix A, a multiplier instance, and a BRAM of the result matrix.
Two -bits BRAMs are used to store half of the matrix S'. As mentioned earlier, concatenated values are transferred from the PS to the PL using a 32-bit bus; therefore, each BRAM can be split in half (upper 16 bits and lower 16 bits), storing two columns of the matrix S' concatenated word by word, which gives a total of four columns. On the other hand, the matrix A and its transfer process have exactly the same configuration as in the operation AS + E: four -bits BRAMs and a 32-bit bus for the transfer.
After the matrices are received, the multiplication is performed. The n-th element of each BRAM of the matrix S' is loaded. Since 16-bit words are stored concatenated in each 32-bit BRAM position, four values are read and sent to the multiplier instance over 16-bit buses. In parallel, the p-th element of each BRAM of the matrix A is read. Only the upper or the lower 16 bits of the four loaded values are sent to the multiplier instance, depending on the parity of p.
Therefore, the multiplier instance receives -bits values, four from each matrix. The values are then multiplied, and the products are added, resulting in a 16-bit value. This result is immediately sent to the result matrix, stored in one -bits BRAM, where it is added to the value previously stored in the corresponding position. Simultaneously, new values from the matrices S' and A are loaded to keep the pipeline full, and the process restarts.
When all iterations are completed, the other half of the matrix S' and new rows of the matrix A must be transferred to the PL. When the process ends, the BRAM of the result matrix holds the result of the multiplication, which is transmitted to the PS using a 32-bit bus.
4.5.3. SHAKE128 Hash Function
FrodoKEM’s most time-consuming operation is the SHAKE128 hash function, which is responsible for almost 54% of the total execution time.
The proposed scheme can be organized into three main instances: a BRAM, the SHAKE128 hash function instance, and a Keccak-f[1600] instance, as illustrated in Figure 8.
The BRAM is a -bits memory. This size was chosen based on the maximum size that FrodoKEM needs. This BRAM is responsible for storing the input values (message) received by the 32-bits bus. Each 32-bits word is a concatenation of -bits characters. -bits words or -bits characters are concatenated and stored in the BRAM.
When all values have been received and stored in the BRAM, 168 bytes are sent to the SHAKE128 hash function instance via a 64-bit bus, starting the absorb phase. These bytes build the internal state. When necessary in the absorb phase, the entire internal state is sent to the Keccak-f[1600] function, which performs its five steps (θ, ρ, π, χ, and ι) in a single clock cycle. Then, the new scrambled internal state returns to the SHAKE128 hash function instance. Next, another 168 bytes are loaded from the BRAM, and the process is repeated until the entire input message has been read, which ends the absorb phase.
When the absorb phase ends, the squeeze phase starts using the Keccak-f[1600] function. To save resources, the first 168 bytes of the internal state at each step of the squeeze phase are stored in the same BRAM that previously stored the input message and now stores the output values. When the desired output length is reached, the process is complete, and the BRAM uses the 32-bit bus to send the stored output values back to the PS.
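The number of 168-byte blocks handled in each phase follows directly from the message and output lengths. The sketch below counts blocks and Keccak-f[1600] calls for the standard SHAKE128 sponge; it is an arithmetic illustration with hypothetical function names and does not model the scheduling of the proposed hardware.

```python
RATE_BYTES = 168  # 1344-bit block size of SHAKE128


def absorb_blocks(msg_len):
    """Blocks absorbed for a byte-aligned message.

    Padding always adds at least one byte, so a message whose length is an
    exact multiple of the rate still needs one extra block.
    """
    return msg_len // RATE_BYTES + 1


def squeeze_blocks(out_len):
    """Blocks read from the state to produce out_len bytes (ceiling division)."""
    return -(-out_len // RATE_BYTES)


def keccak_calls(msg_len, out_len):
    """Total Keccak-f[1600] calls: one per absorbed block, plus one per
    squeezed block after the first (the first is read right after the
    last absorb call)."""
    return absorb_blocks(msg_len) + max(squeeze_blocks(out_len) - 1, 0)
```

As an example under assumed FrodoKEM-640 sizes (an 18-byte input made of a 2-byte row index and a 16-byte seed, and a 1280-byte output for 640 values of 16 bits), generating one row of the matrix takes one absorbed block, eight squeezed blocks, and eight Keccak-f[1600] calls in total.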
6. Conclusions
This work has investigated the feasibility of implementing a PQC scheme in hardware-constrained SM equipment to ensure the security of SM data traveling through data networks. In this context, FrodoKEM, a post-quantum lattice-based scheme, was detailed and subsequently implemented in an SoC device.
Numerical results have shown that the proposed hardware/software implementation, in which the most time-consuming and complex routines of FrodoKEM run in hardware, reduces the execution time to roughly one-third of that of the benchmark implementation (i.e., the software implementation). According to the execution time analysis, three operations consume almost all of the execution time. Implementing these three operations separately in hardware yields an improvement of 68.15% (i.e., the execution time drops from 1.7 s to less than 0.6 s). Consequently, these results show that it is possible to implement the FrodoKEM scheme in hardware-constrained equipment, such as an SM, in which an SoC device is used. In such a case, the FPGA component is dedicated to running the most time-consuming routines.
Overall, this work has shown that implementing the FrodoKEM scheme is a feasible way to secure SM data traveling through data networks. Last but not least, the discussions and analyses presented here constitute baselines for adopting PQC schemes in SM systems.