Explore (579)Live Systems (52)Pricing
Log InGet API Key✓ Verify It Yourself
Cryptography

NTT Deep Dive: Number Theoretic Transform in Post-Quantum Cryptography

| Eric Beans, CEO | 20 min read

If you look inside any production implementation of ML-KEM (FIPS 203), ML-DSA (FIPS 204), or any lattice-based cryptographic scheme, you will find the Number Theoretic Transform at the center of the computation. The NTT is to lattice cryptography what modular exponentiation is to RSA: the single most performance-critical operation, the one that determines whether the scheme is practical at scale or merely theoretically interesting. Understanding the NTT is necessary for anyone who wants to go beyond using post-quantum libraries as black boxes and understand what is actually happening when a lattice-based key is generated, an encryption is performed, or a signature is created.

What the NTT Is

The Number Theoretic Transform is the finite field analog of the Fast Fourier Transform (FFT). Where the FFT operates on complex numbers, the NTT operates on integers modulo a prime number. Where the FFT uses complex roots of unity (points on the unit circle in the complex plane), the NTT uses roots of unity in a finite field (elements whose repeated multiplication eventually cycles back to 1 modulo the prime).

The purpose of both transforms is the same: to convert polynomial multiplication from an O(n^2) operation (where n is the polynomial degree) into an O(n log n) operation. This speedup is not merely a constant factor improvement; it is the difference between polynomial multiplication being a bottleneck and polynomial multiplication being fast enough for cryptographic use at the required security levels.

Consider the polynomials used in ML-KEM. The ring is R_q = Z_q[X]/(X^n + 1), where n = 256 and q = 3329. A polynomial in this ring has 256 coefficients, each an integer modulo 3329. Multiplying two such polynomials naively requires 256 x 256 = 65,536 multiplications and a similar number of additions, followed by reduction modulo X^256 + 1. With the NTT, the same multiplication requires three steps: transform both polynomials into the NTT domain (two transforms, each O(n log n) = O(256 x 8) = O(2048) operations), perform 256 pointwise multiplications in the NTT domain, and transform the result back (one inverse NTT, another O(2048) operations). The total is approximately 6,400 operations instead of 65,536. For n = 1024 or n = 2048, which are common in FHE schemes, the savings are even more dramatic.

Why Post-Quantum Algorithms Need NTT

Lattice-based cryptography builds its security on the hardness of problems in polynomial rings. The most important of these problems are the Module Learning With Errors (MLWE) problem and the Module Short Integer Solution (MSIS) problem. Both problems involve operations on polynomials in rings of the form R_q = Z_q[X]/(X^n + 1).

Key generation, encryption, decryption, signing, and verification in lattice-based schemes all involve multiplying polynomials in this ring. In ML-KEM, the public key is computed as t = A*s + e, where A is a matrix of polynomials, s is a vector of short polynomials (the secret key), and e is a vector of short error polynomials. Each element of the matrix-vector product A*s requires polynomial multiplication. For the ML-KEM-768 parameter set (security level 3), A is a 3x3 matrix, s is a 3x1 vector, and the computation requires 9 polynomial multiplications.

In ML-DSA, the signing operation involves polynomial multiplications in a larger ring (n = 256, q = 8380417) with additional computations for the rejection sampling loop. A single signing operation might execute dozens of polynomial multiplications before producing a valid signature.

Without the NTT, each polynomial multiplication would require O(n^2) operations. At n = 256 and with multiple multiplications per operation, the total computation would be prohibitively slow for the throughput requirements of production systems. The NTT reduces each multiplication to O(n log n), making lattice-based cryptography practical at the performance levels needed for real-world deployment.

The Cooley-Tukey Butterfly

The NTT is computed using a divide-and-conquer algorithm analogous to the Cooley-Tukey FFT. The computation is organized into log2(n) layers, each consisting of n/2 butterfly operations. Each butterfly operation takes two input values and produces two output values using one multiplication and two additions.

The butterfly operation in the NTT is: given inputs a and b and a twiddle factor w (a power of the root of unity), compute a' = a + w*b and b' = a - w*b, both modulo q. This is identical in structure to the radix-2 FFT butterfly, except that all arithmetic is performed in the finite field Z_q instead of over the complex numbers.

The algorithm proceeds as follows. In the first layer, elements at distance n/2 apart are paired and butterflied. In the second layer, elements at distance n/4 apart are paired. In each subsequent layer, the distance halves. After log2(n) layers, the polynomial is fully transformed into the NTT domain.

For n = 256, this means 8 layers of 128 butterflies each, for a total of 1,024 butterfly operations. Each butterfly requires one modular multiplication (for w*b) and two modular additions/subtractions. The total operation count is 1,024 multiplications and 2,048 additions, compared to 65,536 multiplications for naive polynomial multiplication.

The twiddle factors (powers of the root of unity) can be precomputed and stored in a lookup table. For ML-KEM with q = 3329, the primitive 512th root of unity is 17 (since X^256 + 1 splits into linear factors modulo 3329 when 3329 = 1 mod 512). The twiddle factors are 17^0, 17^1, 17^2, ..., 17^255, all reduced modulo 3329. These 256 values occupy 256 x 2 = 512 bytes and are constant for a given parameter set.

The Negacyclic NTT

There is a subtlety in the NTT as used in lattice cryptography that distinguishes it from the standard NTT. The ring R_q = Z_q[X]/(X^n + 1) requires reduction modulo X^n + 1, not modulo X^n - 1. This changes the transform from a standard cyclic NTT to a negacyclic NTT.

The standard NTT computes polynomial multiplication modulo X^n - 1 (cyclic convolution). The negacyclic NTT computes polynomial multiplication modulo X^n + 1 (negacyclic convolution). The difference is important because X^n + 1 is the defining polynomial of the cyclotomic ring used in lattice-based cryptography. Using the standard NTT would give wrong results.

The negacyclic NTT is implemented by pre-multiplying the input polynomial by powers of a primitive 2nth root of unity (psi), performing a standard length-n NTT, and post-multiplying the inverse NTT output by inverse powers of psi. In practice, these pre- and post-multiplications are folded into the butterfly twiddle factors, so the negacyclic NTT has the same computational cost as the standard NTT.

For ML-KEM with n = 256 and q = 3329, the negacyclic NTT is actually computed as a length-128 NTT on pairs of coefficients (using the fact that X^256 + 1 factors into 128 quadratic factors modulo 3329). The NIST specification uses this factored form, which produces 128 degree-1 polynomials in the NTT domain. Pointwise multiplication in the NTT domain then involves multiplying pairs of degree-1 polynomials, which requires 3 multiplications each (using Karatsuba or schoolbook) rather than the single multiplication per coefficient that a full NTT would provide.

Montgomery Multiplication

Every butterfly operation in the NTT requires a modular multiplication: computing a*b mod q. The naive approach to modular multiplication involves integer multiplication followed by division to compute the remainder. Division is expensive on modern processors, typically 5-10x slower than multiplication. In a performance-critical inner loop like the NTT butterfly, this cost is unacceptable.

Montgomery multiplication replaces division with multiplication by a precomputed constant. The idea is to represent numbers in "Montgomery form" where a number a is represented as a*R mod q, where R is a power of 2 larger than q. In Montgomery form, the product of two numbers a*R and b*R can be reduced modulo q using only multiplication, addition, and bit shifting, without any division.

The Montgomery reduction algorithm works as follows. Given a product T = (a*R) * (b*R) = a*b*R^2, we want to compute a*b*R mod q (the product in Montgomery form). The algorithm computes m = T * q_inv mod R (where q_inv is the precomputed inverse of -q modulo R), then t = (T + m*q) / R. Because R is a power of 2, the division by R is a simple right shift. The result t is a*b*R mod q, which is the product in Montgomery form.

For ML-KEM with q = 3329, R can be chosen as 2^16 = 65536 (since coefficients fit in 16 bits). The precomputed constant q_inv = -q^(-1) mod R = -3329^(-1) mod 65536 = 62209. Each Montgomery multiplication requires: one 32-bit multiplication (to compute T), one 16-bit multiplication (to compute m), one 32-bit multiplication (to compute m*q), one addition (T + m*q), and one right shift by 16 bits. Total: 3 multiplications and 1 addition, compared to 1 multiplication and 1 division for naive modular multiplication. Since the shift and the small multiplications are much cheaper than a division, Montgomery multiplication is typically 3-5x faster.

The Goldilocks Prime and Field Selection

The choice of the modulus q in a lattice-based scheme affects NTT performance. The ideal modulus for NTT is a prime q such that q = 1 mod 2n (ensuring that the required roots of unity exist) and q has a form that enables efficient modular reduction.

ML-KEM uses q = 3329, which satisfies 3329 = 1 mod 512 = 1 mod 2*256. This small prime fits in 12 bits, and all intermediate computations fit in 32-bit integers, which is efficient on 32-bit and 64-bit processors alike.

ML-DSA uses q = 8380417, which satisfies 8380417 = 1 mod 512 and has the special form 2^23 - 2^13 + 1. This form enables fast reduction: instead of computing a mod q using division, you can compute a mod (2^23 - 2^13 + 1) using shifts and additions. Specifically, if a = a_high * 2^23 + a_low, then a mod q = a_low + a_high * 2^13 - a_high (with a conditional subtraction of q if the result is negative or exceeds q). This is significantly faster than general-purpose modular reduction.

For applications requiring larger moduli, the "Goldilocks prime" p = 2^64 - 2^32 + 1 is notable. This prime is the smallest prime greater than 2^63 that satisfies p = 1 mod 2^32, making it suitable for NTTs of length up to 2^32. Its special form enables extremely fast modular reduction: for a 128-bit product a = a_high * 2^64 + a_low, we have a mod p = a_low + a_high * 2^32 - a_high, which requires only two 64-bit additions and a subtraction. The Goldilocks prime is widely used in STARK proof systems, where polynomial degrees of 2^20 or larger are common and NTT performance directly determines proving time.

Radix-4 and Higher-Radix NTT

The basic Cooley-Tukey algorithm uses radix-2 butterflies: each layer splits the problem into two halves. Radix-4 NTT splits each layer into four parts, processing four elements at a time. This reduces the number of layers from log2(n) to log4(n) = log2(n)/2 and reduces the number of twiddle factor multiplications by approximately 25% compared to radix-2.

The radix-4 butterfly takes four inputs (a0, a1, a2, a3) and four twiddle factors and produces four outputs. The computation involves 3 complex (or field) multiplications and 8 additions, compared to 4 multiplications and 8 additions for two consecutive radix-2 stages on the same four elements. The savings come from the fact that some twiddle factors in consecutive radix-2 stages are related by squaring, and the radix-4 butterfly exploits this relationship to eliminate one multiplication.

For n = 256, a radix-4 NTT has 4 layers of 64 radix-4 butterflies each, compared to 8 layers of 128 radix-2 butterflies. The fewer layers also improve cache performance because the data is traversed fewer times, which can provide additional speedup beyond the reduced arithmetic count.

Some implementations use mixed-radix approaches: radix-4 for the first several layers (where the data fits in cache) and radix-2 for the final layers (where the data access pattern becomes more sequential). The optimal choice depends on the specific hardware platform, cache sizes, and SIMD capabilities.

SIMD and Hardware Acceleration

The NTT butterfly is naturally parallel: within each layer, all n/2 (or n/4 for radix-4) butterflies are independent and can execute simultaneously. This makes the NTT an excellent candidate for SIMD (Single Instruction, Multiple Data) acceleration.

On x86-64 processors with AVX2, 256-bit SIMD registers can process 16 16-bit coefficients or 8 32-bit coefficients in a single instruction. For ML-KEM with q = 3329 (12-bit coefficients stored as 16-bit values), a single AVX2 register can hold 16 coefficients, and a single vectorized butterfly processes 8 pairs simultaneously. The AVX2 NTT for ML-KEM processes the full 256-coefficient transform in approximately 50-80 cycles on modern x86-64 processors.

On ARM processors with NEON, 128-bit SIMD registers process 8 16-bit or 4 32-bit values per instruction. ARM NEON implementations of the ML-KEM NTT achieve similar performance when measured in cycles per byte. ARM's more efficient pipeline and lower power consumption make NEON NTT implementations particularly attractive for mobile and embedded deployments.

Some processors include dedicated hardware for the NTT or for the modular arithmetic operations that the NTT requires. Academic proposals exist for NTT-specific instructions that would accelerate the twiddle factor multiplication and butterfly operations directly in the instruction set. As lattice-based cryptography becomes standard (ML-KEM is FIPS 203, ML-DSA is FIPS 204), hardware acceleration for the NTT is likely to follow the same trajectory as AES-NI did for AES: initially software-only, then ISA extensions, then dedicated hardware.

Lazy Reduction and Harvey's Algorithm

In a standard NTT implementation, every multiplication and addition is followed by a modular reduction to keep values within [0, q). This reduction (either by conditional subtraction or by Montgomery reduction) has a cost. Lazy reduction defers these reductions, allowing intermediate values to exceed q as long as they do not overflow the integer representation.

The key insight is that addition in Z_q does not require the inputs to be fully reduced. If a and b are both less than 2q, then a + b is less than 4q, and a - b is in the range (-2q, 2q). As long as the representation can hold values up to 4q (or more), reduction can be deferred until the end of the layer or until just before a multiplication (where overflow could occur).

Harvey's algorithm for modular multiplication takes this further. It precomputes q_prime = floor(w * 2^k / q) for each twiddle factor w, where k is the bit width of the representation. The modular multiplication w*b mod q is then computed as: t = w*b; t_prime = q_prime * b >> k; r = t - t_prime * q. The result r is at most one conditional subtraction away from the correct residue. This algorithm avoids the full Montgomery reduction and is particularly efficient when the twiddle factors are known at compile time (which they are in the NTT, since they depend only on the parameter set).

Combining lazy reduction with Harvey's algorithm can reduce the total number of modular reductions in an NTT by 30-50%, with corresponding performance improvements. The trade-off is increased code complexity and the need to carefully track the range of intermediate values to prevent overflow.

NTT in Fully Homomorphic Encryption

The NTT is even more critical in Fully Homomorphic Encryption (FHE) than in standard lattice-based cryptography. FHE schemes like BFV, BGV, and CKKS operate on polynomials with much larger parameters: degree n = 4096, 8192, 16384, or even 32768, with moduli q that are hundreds or thousands of bits wide.

For these large parameters, the O(n^2) vs O(n log n) difference is dramatic. At n = 16384, naive polynomial multiplication requires 268 million coefficient multiplications. NTT-based multiplication requires approximately 14 * 16384 = 229,376 operations (forward NTT on both inputs, pointwise multiply, inverse NTT on result). That is a 1,000x reduction in arithmetic operations.

FHE schemes also use the Residue Number System (RNS), which decomposes a large modulus q into a product of smaller co-prime moduli q_1 * q_2 * ... * q_k. Each small modulus supports its own NTT, and the NTTs can be computed independently and in parallel. This is a natural fit for SIMD and multi-core parallelism: each core or SIMD lane can compute the NTT for one RNS limb.

The choice of RNS moduli affects NTT performance. Ideal moduli are primes of the form q_i = 1 mod 2n (ensuring NTT-friendliness) that fit in a machine word (32 or 64 bits) and have special forms enabling fast reduction. Finding enough such primes for a given n is a constraint that influences the overall parameter selection of the FHE scheme.

Constant-Time Implementation

Cryptographic implementations must be constant-time to prevent timing side-channel attacks. An implementation is constant-time if its execution time does not depend on the values of secret data. For the NTT, this means that the butterfly operations, modular reductions, and memory access patterns must be independent of the polynomial coefficients.

The NTT is naturally constant-time in its control flow: the sequence of operations does not depend on the data. However, several implementation details can introduce timing variations. Conditional subtractions in modular reduction may take different times depending on the branch. Division operations in naive modular reduction have data-dependent latency on some processors. Table lookups for twiddle factors may have variable latency depending on cache state.

Montgomery multiplication is naturally constant-time because it uses only multiplication, addition, and shift operations, none of which have data-dependent timing on modern processors. The conditional subtraction at the end of Montgomery reduction must be implemented as a constant-time conditional move (CMOV on x86, CSEL on ARM) rather than a branch.

For the twiddle factor table, constant-time access can be ensured either by keeping the table small enough to fit in L1 cache (which is the case for ML-KEM and ML-DSA parameter sets) or by using oblivious memory access patterns that touch every cache line regardless of which entry is needed.

Performance Characteristics

The practical performance of an NTT implementation depends on the parameter set, the hardware platform, and the implementation quality. Some reference points from published literature and open-source implementations as of 2025:

For ML-KEM (n = 256, q = 3329): optimized AVX2 implementations achieve approximately 50-100 cycles per NTT on modern x86-64 processors. NEON implementations on ARM Cortex-A76 achieve approximately 80-150 cycles. Straightforward C implementations without SIMD run in 2,000-5,000 cycles.

For ML-DSA (n = 256, q = 8380417): the larger modulus increases per-multiplication cost. AVX2 implementations achieve approximately 150-300 cycles per NTT. The wider coefficients (23 bits vs 12 bits) reduce the number of coefficients per SIMD register, limiting the parallelism achievable through vectorization.

For FHE parameter sets (n = 4096-16384, q = 50-600 bit moduli with RNS): NTT performance dominates the overall computation time, typically accounting for 60-80% of the total time for operations like ciphertext multiplication. Optimized implementations on 96-core ARM Graviton4 processors can achieve millions of polynomial multiplications per second using RNS decomposition with per-core NTT computation.

Common Implementation Pitfalls

Implementing the NTT correctly and efficiently requires attention to several details that are easy to get wrong.

Root of unity selection. The NTT requires a primitive nth root of unity (or 2nth root for negacyclic). Not every element of Z_q is a root of unity. If the wrong element is used, the NTT will produce incorrect results. The root must be computed or verified during parameter setup, and it must satisfy w^n = 1 mod q and w^(n/2) != 1 mod q.

Bit-reversal permutation. The Cooley-Tukey NTT produces output in bit-reversed order (or requires input in bit-reversed order, depending on the formulation). The bit-reversal permutation must be applied either before the forward transform or after the inverse transform. Omitting or misapplying the permutation produces results that look correct in some tests but fail in others.

Inverse NTT scaling. The inverse NTT must multiply each output by n^(-1) mod q (the modular inverse of n) to produce the correct result. Omitting this scaling factor is a common bug that causes all polynomial products to be scaled by n.

Modular reduction overflow. When using lazy reduction, intermediate values can exceed the machine word size if the ranges are not carefully tracked. For 16-bit coefficients in ML-KEM, intermediate products fit in 32 bits. For 23-bit coefficients in ML-DSA, intermediate products of two unreduced values can exceed 32 bits and require 64-bit arithmetic.

Montgomery domain conversion. When using Montgomery multiplication in the NTT, the input must be converted to Montgomery form before the transform, and the output must be converted back after the inverse transform. Mixing Montgomery-form and standard-form values produces nonsensical results that are difficult to debug because they pass some structural tests but fail numerical tests.

Further Reading

The NTT is a deep topic with an extensive literature. For the mathematical foundations, the definitive reference is the original Cooley-Tukey paper (1965) and its adaptation to finite fields by Pollard (1971). For modern cryptographic NTT implementations, the pqm4 project provides optimized ARM Cortex-M4 implementations of NIST post-quantum candidates. The PQClean project provides portable C implementations. The NIST FIPS 203 and FIPS 204 standards contain complete NTT specifications for ML-KEM and ML-DSA, including the choice of roots of unity and the exact algorithm used.

For FHE-specific NTT optimization, the literature on RNS-based polynomial arithmetic by Bajard, Eynard, and others provides the theoretical framework. The Microsoft SEAL and OpenFHE libraries contain production NTT implementations for FHE parameter sets.

For related topics, see our Post-Quantum Cryptography Architecture page and the FHE Overview for how the NTT fits into the broader landscape of post-quantum and encrypted computation.

Post-Quantum Cryptography in Production

ML-KEM, ML-DSA, and FHE running at scale. See the benchmarks.

Benchmarks Documentation
Verify It Yourself