CryptoDB
Weijia Wang
ORCID: 0000-0001-6982-2537
Publications
Year
Venue
Title
2024
TCHES
eLIMInate: a Leakage-focused ISE for Masked Implementation
Abstract
Even given a state-of-the-art masking scheme, masked software implementation of some cryptography functionality can pose significant challenges stemming, e.g., from simultaneous requirements for efficiency and security. In this paper we design an Instruction Set Extension (ISE) to address a specific element of said challenge, namely the elimination of leakage stemming from architectural and microarchitectural overwriting. Conceptually, the ISE allows a leakage-focused behavioural hint to be communicated from software to the micro-architecture: using it informs how computation is realised when applied to masking-specific data, which then offers an opportunity to eliminate associated leakage. We develop prototype, latencyand area-optimised implementations of the ISE design based on the RISC-V Ibex core. Using them, we demonstrate that use of the ISE can close the gap between assumptions about and actual behaviour of a device and thereby deliver an improved security guarantee.
2024
TCHES
Efficient Table-Based Masking with Pre-processing
Abstract
Masking is one of the most investigated countermeasures against sidechannel attacks. In a nutshell, it randomly encodes each sensitive variable into a number of shares, and compiles the cryptographic implementation into a masked one that operates over the shares instead of the original sensitive variables. Despite its provable security benefits, masking inevitably introduces additional overhead. Particularly, the software implementation of masking largely slows down the cryptographic implementations and requires a large number of random bits that need to be produced by a true random number generator. In this respect, reducing the< overhead of masking is still an essential and challenging task. Among various known schemes, Table-Based Masking (TBM) stands out as a promising line of work enjoying the advantages of generality to any lookup tables. It also allows the pre-processing paradigm, wherein a pre-processing phase is executed independently of the inputs, and a much more efficient online (using the precomputed tables) phase takes place to calculate the result. Obviously, practicality of pre-processing paradigm relies heavily on the efficiency of online phase and the size of precomputed tables.In this paper, we investigate the TBM scheme that offers a combination of linear complexity (in terms of the security order, denoted as d) during the online phase and small precomputed tables. We then apply our new scheme to the AES-128, and provide an implementation on the ARM Cortex architecture. Particularly, for a security order d = 8, the online phase outperforms the current state-of-the-art AES implementations on embedded processors that are vulnerable to the side-channel attacks. The security order of our scheme is proven in theory and verified by the T-test in practice. Moreover, we investigate the speed overhead associated with the random bit generation in our masking technique. Our findings indicate that the speed overhead can be effectively balanced. This is mainly because that the true random number generator operates in parallel with the processor’s execution, ensuring a constant supply of fresh random bits for the masked computation at regular intervals.
2023
EUROCRYPT
Improved Power Analysis Attacks on Falcon
Abstract
Falcon is one of the three post-quantum signature schemes selected for standardization by NIST. Due to its low bandwidth and high efficiency, Falcon is seen as an attractive option for quantum-safe embedded systems. In this work, we study Falcon’s side-channel resistance by analysing its Gaussian samplers. Our results are mainly twofold.
The first result is an improved key recovery exploiting the leakage within the base sampler investigated by Guerreau et al. (CHES 2022). Instead of resorting to the fourth moment as in former parallelepiped-learning attacks, we work with the second order statistics covariance and use its spectral decomposition to recover the secret information. Our approach substantially reduces the the requirement of measurements and computation resources: 220 000 traces is sufficient to recover the secret key of Falcon-512 within half an hour with a probability of ≈ 25%. As a comparison, even with 106 traces, the former attack still needs about 1000 hours CPU time of lattice reduction for a full key recovery. In addition, our approach is robust to inaccurate leakage classification, which is another advantage over parallelepiped-learning attacks.
Our second result is a practical power analysis targeting the integer Gaussian sampler of Falcon. The analysis relies on the leakage of random sign flip within the integer Gaussian sampling. This leakage was exposed in 2018 by Kim and Hong, but it is not considered in the Falcon’s implementation and unexploited for side-channel analysis until now. We identify the leakage within the reference implementation of Falcon on an ARM Cortex-M4 STM32F407IGT6 microprocessor. We also show that this single bit of leakage is in effect enough for practical key recovery: with 170 000 traces one can fully recover the key of Falcon-512 within half an hour. Furthermore, combining the sign leakage and the aforementioned leakage, one can recover the key with only 45 000 signature measurements in a short time.
As a by-product, we also extend our power analysis to Mitaka that is a recent variant of Falcon. The same leakages exist within the integer Gaussian samplers of Mitaka, and they can also be used to mount key recovery attacks. Nevertheless, the key recovery in Mitaka requires much more traces than it does in Falcon, due to their different lattice Gaussian samplers.
2023
TCHES
Efficient Private Circuits with Precomputation
Abstract
At CHES 2022, Wang et al. described a new paradigm for masked implementations using private circuits, where most intermediates can be precomputed before the input shares are accessed, significantly accelerating the online execution of masked functions. However, the masking scheme they proposed mainly featured (and was designed for) the cost amortization, leaving its (limited) suitability in the above precomputation-based paradigm just as a bonus. This paper aims to provide an efficient, reliable, easy-to-use, and precomputation-compatible masking scheme. We propose a new masked multiplication over the finite field Fq suitable for the precomputation, and prove its security in the composable notion called Probing-Isolating Non-Inference (PINI). Particularly, the operations (e.g., AND and XOR) in the binary field can be achieved by assigning q = 2, allowing the bitsliced implementation that has been shown to be quite efficient for the software implementations. The new masking scheme is applied to leverage the masking of AES and SKINNY block ciphers on ARM Cortex M architecture. The performance results show that the new scheme contributes to a significant speed-up compared with the state-of-the-art implementations. For SKINNY with block size 64, the speed and RAM requirement can be significantly improved (saving around 45% cycles in the online-computation and 60% RAM space for precomputed values) from AES-128, thanks to its smaller number of AND gates. Besides the security proof by hand, we provide formal verifications for the multiplication and T-test evaluations for the masked implementations of AES and SKINNY. Because of the structure of the new masked multiplication, our formal verification can be performed for security orders up to 16.
2022
TOSC
Towards Low-Latency Implementation of Linear Layers
📺
Abstract
Lightweight cryptography features a small footprint and/or low computational complexity. Low-cost implementations of linear layers usually play an important role in lightweight cryptography. Although it has been shown by Boyar et al. that finding the optimal implementation of a linear layer is a Shortest Linear Program (SLP) problem and NP-hard, there exist a variety of heuristic methods to search for near-optimal solutions. This paper considers the low-latency criteria and focuses on the heuristic search of lightweight implementation for linear layers. Most of the prior approach iteratively combines the inputs (of linear layers) to reach the output, which can be regarded as the forward search. To better adapt the low-latency criteria, we propose a new framework of backward search that attempts to iteratively split every output (into an XORing of two bits) until all inputs appear. By bounding the time of splitting, the new framework can find a sub-optimal solution with a minimized depth of circuits.We apply our new search algorithm to linear layers of block ciphers and find many low-latency candidates for implementations. Notably, for AES Mixcolumns, we provide an implementation with 103 XOR gates with a depth of 3, which is among the best hardware implementations of the AES linear layer. Besides, we obtain better implementations in XOR gates for 54.3% of 4256 Maximum Distance Separable (MDS) matrices proposed by Li et al. at FSE 2019. We also achieve an involutory MDS matrix (in M4(GL(8, F2))) whose implementation uses the lowest number (i.e., 86, saving 2 from the state-of-the-art result) of XORs with the minimum depth.
2022
TCHES
Side-Channel Masking with Common Shares
Abstract
To counter side-channel attacks, a masking scheme randomly encodes keydependent variables into several shares, and transforms operations into the masked correspondence (called gadget) operating on shares. This provably achieves the de facto standard notion of probing security.We continue the long line of works seeking to reduce the overhead of masking. Our main contribution is a new masking scheme over finite fields in which shares of different variables have a part in common. This enables the reuse of randomness / variables across different gadgets, and reduces the total cost of masked implementation. For security order d and circuit size l, the randomness requirement and computational complexity of our scheme are Õ(d2) and Õ(ld2) respectively, strictly improving upon the state-of-the-art Õ(d2) and Õ(ld3) of Coron et al. at Eurocrypt 2020.A notable feature of our scheme is that it enables a new paradigm in which many intermediates can be precomputed before executing the masked function. The precomputation consumes Õ(ld2) and produces Õ(ld) variables to be stored in RAM. The cost of subsequent (online) computation is reduced to Õ(ld), effectively speeding up e.g., challenge-response authentication protocols. We showcase our method on the AES on ARM Cortex M architecture and perform a T-test evaluation. Our results show a speed-up during the online phase compared with state-of-the-art implementations, at the cost of acceptable RAM consumption and precomputation time.To prove security for our scheme, we propose a new security notion intrinsically supporting randomness / variables reusing across gadgets, and bridging the security of parallel compositions of gadgets to general compositions, which may be of independent interest.
2022
TOSC
More Inputs Makes Difference: Implementations of Linear Layers Using Gates with More Than Two Inputs
Abstract
Lightweight cryptography ensures cryptography applications to devices with limited resources. Low-area implementations of linear layers usually play an essential role in lightweight cryptography. The previous works have provided plenty of methods to generate low-area implementations using 2-input xor gates for various linear layers. However, it is still challenging to search for smaller implementations using two or more inputs xor gates. This paper, inspired by Banik et al., proposes a novel approach to construct a quantity of lower area implementations with (n + 1)- input gates based on the given implementations with n-input gates. Based on the novel algorithm, we present the corresponding search algorithms for n = 2 and n = 3, which means that we can efficiently convert an implementation with 2-input xor gates and 3-input xor gates to lower-area implementations with 3-input xor gates and 4-input xor gates, respectively.We improve the previous implementations of linear layers for many block ciphers according to the area with these search algorithms. For example, we achieve a better implementation with 4-input xor gates for AES MixColumns, which only requires 243 GE in the STM 130 nm library, while the previous public result is 258.9 GE. Besides, we obtain better implementations for all 5500 lightweight matrices proposed by Li et al. at FSE 2019, and the area for them is decreased by about 21% on average.
2021
TOSC
Provable Security of SP Networks with Partial Non-Linear Layers
📺
Abstract
Motivated by the recent trend towards low multiplicative complexity blockciphers (e.g., Zorro, CHES 2013; LowMC, EUROCRYPT 2015; HADES, EUROCRYPT 2020; MALICIOUS, CRYPTO 2020), we study their underlying structure partial SPNs, i.e., Substitution-Permutation Networks (SPNs) with parts of the substitution layer replaced by an identity mapping, and put forward the first provable security analysis for such partial SPNs built upon dedicated linear layers. For different instances of partial SPNs using MDS linear layers, we establish strong pseudorandom security as well as practical provable security against impossible differential attacks. By extending the well-established MDS code-based idea, we also propose the first principled design of linear layers that ensures optimal differential propagation. Our results formally confirm the conjecture that partial SPNs achieve the same security as normal SPNs while consuming less non-linearity, in a well-established framework.
2020
TOSC
Efficient Side-Channel Secure Message Authentication with Better Bounds
📺
Abstract
We investigate constructing message authentication schemes from symmetric cryptographic primitives, with the goal of achieving security when most intermediate values during tag computation and verification are leaked (i.e., mode-level leakage-resilience). Existing efficient proposals typically follow the plain Hash-then-MAC paradigm T = TGenK(H(M)). When the domain of the MAC function TGenK is {0, 1}128, e.g., when instantiated with the AES, forgery is possible within time 264 and data complexity 1. To dismiss such cheap attacks, we propose two modes: LRW1-based Hash-then-MAC (LRWHM) that is built upon the LRW1 tweakable blockcipher of Liskov, Rivest, and Wagner, and Rekeying Hash-then-MAC (RHM) that employs internal rekeying. Built upon secure AES implementations, LRWHM is provably secure up to (beyond-birthday) 278.3 time complexity, while RHM is provably secure up to 2121 time. Thus in practice, their main security threat is expected to be side-channel key recovery attacks against the AES implementations. Finally, we benchmark the performance of instances of our modes based on the AES and SHA3 and confirm their efficiency.
2020
TCHES
Efficient and Private Computations with Code-Based Masking
📺
Abstract
Code-based masking is a very general type of masking scheme that covers Boolean masking, inner product masking, direct sum masking, and so on. The merits of the generalization are twofold. Firstly, the higher algebraic complexity of the sharing function decreases the information leakage in “low noise conditions” and may increase the “statistical security order” of an implementation (with linear leakages). Secondly, the underlying error-correction codes can offer improved fault resistance for the encoded variables. Nevertheless, this higher algebraic complexity also implies additional challenges. On the one hand, a generic multiplication algorithm applicable to any linear code is still unknown. On the other hand, masking schemes with higher algebraic complexity usually come with implementation overheads, as for example witnessed by inner-product masking. In this paper, we contribute to these challenges in two directions. Firstly, we propose a generic algorithm that allows us (to the best of our knowledge for the first time) to compute on data shared with linear codes. Secondly, we introduce a new amortization technique that can significantly mitigate the implementation overheads of code-based masking, and illustrate this claim with a case study. Precisely, we show that, although performing every single code-based masked operation is relatively complex, processing multiple secrets in parallel leads to much better performances. This property enables code-based masked implementations of the AES to compete with the state-of-the-art in randomness complexity. Since our masked operations can be instantiated with various linear codes, we hope that these investigations open new avenues for the study of code-based masking schemes, by specializing the codes for improved performances, better side-channel security or improved fault tolerance.
2020
TOSC
Beyond-Birthday-Bound Security for 4-round Linear Substitution-Permutation Networks
📺
Abstract
Recent works of Cogliati et al. (CRYPTO 2018) have initiated provable treatments of Substitution-Permutation Networks (SPNs), one of the most popular approach to construct modern blockciphers. Such theoretical SPN models may employ non-linear diffusion layers, which enables beyond-birthday-bound provable security. Though, for the model of real world blockciphers, i.e., SPN models with linear diffusion layers, existing provable results are capped at birthday security up to $2^{n/2}$ adversarial queries, where $n$ is the size of the idealized S-boxes.
In this paper, we overcome this birthday barrier and prove that a 4-round SPN with linear diffusion layers and independent round keys is secure up to $2^{2n/3}$ queries. For this, we identify conditions on the linear layers that are sufficient for such security, which, unsurprisingly, turns out to be slightly stronger than Cogliati et al.'s conditions for birthday security. These provides additional theoretic supports for real world SPN blockciphers.
2020
ASIACRYPT
Packed Multiplication: How to Amortize the Cost of Side-channel Masking?
📺
Abstract
Higher-order masking countermeasures provide strong provable security against side-channel attacks at the cost of incurring significant overheads, which largely hinders its applicability. Previous works towards remedying cost mostly concentrated on ``local'' calculations, i.e., optimizing the cost of computation units such as a single AND gate or a field multiplication. This paper explores a complementary ``global'' approach, i.e., considering multiple operations in the masked domain as a batch and reducing randomness and computational cost via amortization. In particular, we focus on the amortization of $\ell$ parallel field multiplications for appropriate integer $\ell > 1$, and design a kit named {\it packed multiplication} for implementing such a batch.
Higher-order masking countermeasures provide strong provable security against side-channel attacks at the cost of incurring significant overheads, which largely hinders its applicability. Previous works towards remedying cost mostly concentrated on ``local'' calculations, i.e., optimizing the cost of computation units such as a single AND gate or a field multiplication. This paper explores a complementary ``global'' approach, i.e., considering multiple operations in the masked domain as a batch and reducing randomness and computational cost via amortization.
In particular, we focus on the amortization of $\ell$ parallel field multiplications for appropriate integer $\ell > 1$, and design a kit named {\it packed multiplication} for implementing such a batch.
For $\ell+d\leq2^m$, when $\ell$ parallel multiplications over $\mathbb{F}_{2^{m}}$ with $d$-th order probing security are implemented, packed multiplication consumes $d^2+2\ell d + \ell$ bilinear multiplications and $2d^2 + d(d+1)/2$ random field variables, outperforming the state-of-the-art results with $O(\ell d^2)$ multiplications and $\ell \left \lfloor d^2/4\right \rfloor + \ell d$ randomness. To prove $d$-probing security for packed multiplications, we introduce some weaker security notions for multiple-inputs-multiple-outputs gadgets and use them as intermediate steps, which may be of independent interest.
As parallel field multiplications exist almost everywhere in symmetric cryptography, lifting optimizations from ``local'' to ``global'' substantially enlarges the space of improvements. To demonstrate, we showcase the method on the AES Subbytes step, GCM and TET (a popular disk encryption). Notably, when $d=8$, our implementation of AES Subbytes in ARM Cortex M architecture achieves a gain of up to $33\%$ in total speeds and saves up to $68\%$ random bits than the state-of-the-art bitsliced implementation reported at ASIACRYPT~2018.
Coauthors
- Joël Alwen (1)
- Gaëtan Cassiers (2)
- Hao Cheng (1)
- Yanhong Fan (2)
- Rong Fu (1)
- Yansong Gao (1)
- Dawu Gu (1)
- Zheng Guo (1)
- Chun Guo (5)
- Fanjie Ji (3)
- Lu Li (1)
- Xiuhan Lin (1)
- Junrong Liu (1)
- Qun Liu (2)
- Pierrick Méaux (1)
- Daniel Page (1)
- François-Xavier Standaert (5)
- Yang Su (1)
- Yiteng Sun (1)
- Ling Sun (2)
- Weijia Wang (13)
- Taoyun Wang (1)
- Bohan Wang (1)
- Xiao Wang (1)
- Meiqin Wang (3)
- Lixuan Wu (2)
- Sen Xu (1)
- Yang Yu (1)
- Yu Yu (7)
- Juelin Zhang (2)
- Shiduo Zhang (1)