The Number Theoretic Transform (NTT) is the heart of modern FHE. Every homomorphic operation—addition, multiplication, rotation—requires multiple NTT transformations. If NTT is slow, your entire FHE system is slow.
H33's January 2026 benchmarks show production-ready NTT performance: 1.85ms roundtrip for N=16384, the most common production size. Here's how we got there.
Why NTT Dominates FHE Latency
Polynomial multiplication is the most expensive primitive in lattice-based cryptography. A naive multiply of two degree-N polynomials costs O(N²) operations. The NTT reduces this to O(N log N) by converting polynomials into their evaluation form, performing element-wise multiplication, then transforming back. In H33's BFV pipeline with N=4096 and a single 56-bit modulus, this transformation is invoked during encryption, decryption, key switching, and every homomorphic operation in between.
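The transform-multiply-transform pipeline can be shown end to end with a toy radix-2 NTT. The parameters below (N = 4, q = 17, ω = 4) are illustration-only, and the product is taken modulo x^N − 1 for brevity; BFV's negacyclic ring x^N + 1 uses a 2N-th root of unity but has the same butterfly structure:

```rust
// Toy radix-2 NTT over Z_17 with N = 4 (ω = 4 is a primitive 4th root of
// unity mod 17). Production parameters (N = 4096, ~56-bit q) follow the
// same structure.
const Q: u64 = 17;
const N: usize = 4;
const OMEGA: u64 = 4;

fn pow_mod(mut b: u64, mut e: u64, m: u64) -> u64 {
    let mut r = 1;
    b %= m;
    while e > 0 {
        if e & 1 == 1 { r = r * b % m; }
        b = b * b % m;
        e >>= 1;
    }
    r
}

/// Iterative Cooley-Tukey NTT: O(N log N) butterflies, in place.
fn ntt(a: &mut [u64; N], omega: u64) {
    // bit-reversal permutation
    let mut j = 0;
    for i in 1..N {
        let mut bit = N >> 1;
        while j & bit != 0 { j ^= bit; bit >>= 1; }
        j |= bit;
        if i < j { a.swap(i, j); }
    }
    // log2(N) stages of butterflies
    let mut len = 2;
    while len <= N {
        let w_len = pow_mod(omega, (N / len) as u64, Q);
        for start in (0..N).step_by(len) {
            let mut w = 1;
            for k in 0..len / 2 {
                let u = a[start + k];
                let v = a[start + k + len / 2] * w % Q;
                a[start + k] = (u + v) % Q;
                a[start + k + len / 2] = (u + Q - v) % Q;
                w = w * w_len % Q;
            }
        }
        len <<= 1;
    }
}

/// Polynomial product mod (x^N - 1, q): forward NTT, pointwise, inverse NTT.
fn poly_mul(a: [u64; N], b: [u64; N]) -> [u64; N] {
    let (mut fa, mut fb) = (a, b);
    ntt(&mut fa, OMEGA);
    ntt(&mut fb, OMEGA);
    let mut fc = [0u64; N];
    for i in 0..N { fc[i] = fa[i] * fb[i] % Q; } // element-wise multiply
    ntt(&mut fc, pow_mod(OMEGA, Q - 2, Q));      // inverse transform (ω^-1)
    let n_inv = pow_mod(N as u64, Q - 2, Q);     // scale by N^-1
    for c in fc.iter_mut() { *c = *c * n_inv % Q; }
    fc
}

fn main() {
    // (1 + 2x + 3x^2 + 4x^3)(1 + x) mod (x^4 - 1, 17) = 5 + 3x + 5x^2 + 7x^3
    assert_eq!(poly_mul([1, 2, 3, 4], [1, 1, 0, 0]), [5, 3, 5, 7]);
    println!("ok");
}
```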
In practice, a single biometric authentication batch touches 8–12 NTTs. At N=4096, each forward or inverse transform takes roughly 190µs, meaning NTT alone can account for 1.5–2.3ms of a batch if left unoptimized. When H33 processes 32 users per ciphertext and targets ~1,109µs per batch, every microsecond in the NTT hot path matters.
H33 keeps the secret key in NTT form at all times and stores enrolled biometric templates pre-transformed. This eliminates redundant forward NTTs during multiply_plain_accumulate, saving two full transforms per modulus per call—a technique that dropped batch latency from 1,375µs to 1,109µs in production.
NTT Across All Sizes
FHE applications use different polynomial degrees based on security and precision requirements. We benchmarked every common size:
| NTT Size | Forward | Inverse | Roundtrip | Use Case |
|---|---|---|---|---|
| N=256 | 4.2 µs | 4.5 µs | 8.7 µs | Testing/Development |
| N=1024 | 21.3 µs | 22.8 µs | 44.1 µs | Simple computations |
| N=4096 | 188.5 µs | 198.4 µs | 386.9 µs | Standard security |
| N=8192 | 412 µs | 438 µs | 850 µs | Extended precision |
| N=16384 | 892 µs | 956 µs | 1.85 ms | Production default |
| N=32768 | 1.92 ms | 2.06 ms | 3.98 ms | High security |
Montgomery Arithmetic: Eliminating Division
Traditional NTT butterflies require modular reduction after every multiply—a costly integer division. H33 replaces all modular reduction with Montgomery multiplication (REDC), which trades division by the modulus q for multiplies and shifts relative to a power-of-two constant R = 2^k, using the precomputed factor −q⁻¹ mod R. All twiddle factors are stored in Montgomery form at initialization time, so the hot loop performs only multiplies and bitwise operations—no division instruction appears anywhere in the critical path.
```rust
// Cooley-Tukey butterfly with Montgomery multiply (simplified).
// Inputs may arrive in [0, 2q); outputs stay in [0, 2q).
fn butterfly_mont(a: &mut u64, b: &mut u64, w: u64, q: u64, q_inv: u64) {
    let t = mont_reduce(*b as u128 * w as u128, q, q_inv); // t in [0, q)
    let x = if *a >= q { *a - q } else { *a };             // lazy fold into [0, q)
    *b = x + q - t; // x - t shifted by q: stays non-negative, in [0, 2q)
    *a = x + t;     // in [0, 2q)
}
```

Combined with Harvey lazy reduction, intermediate butterfly values remain in the range [0, 2q) between NTT stages. The final reduction to [0, q) happens only once after the last stage, saving one conditional subtraction per butterfly per stage—a 10–15% speedup across all polynomial sizes.
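For reference, here is a textbook REDC sketch (not H33's exact kernel) for R = 2^64, together with the Newton iteration that precomputes the constant −q⁻¹ mod 2^64:

```rust
/// Montgomery reduction (REDC) for R = 2^64: returns x * R^-1 mod q with
/// no division instruction. Requires odd q < 2^63 and x < q * R.
/// q_inv is the precomputed constant -q^-1 mod 2^64.
fn mont_reduce(x: u128, q: u64, q_inv: u64) -> u64 {
    let m = (x as u64).wrapping_mul(q_inv);               // m = (x mod R) * (-q^-1) mod R
    let t = ((x + m as u128 * q as u128) >> 64) as u64;   // (x + m*q) / R, exact by construction
    if t >= q { t - q } else { t }                        // one conditional, result in [0, q)
}

/// Precompute -q^-1 mod 2^64 by Newton iteration (correct bits double each step).
fn neg_inv(q: u64) -> u64 {
    let mut inv: u64 = 1; // correct mod 2 for odd q
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

fn main() {
    let q: u64 = 7681; // toy NTT-friendly prime, not a production modulus
    let qi = neg_inv(q);
    assert_eq!(q.wrapping_mul(qi.wrapping_neg()), 1); // sanity: q * q^-1 ≡ 1 mod 2^64
    // a*b mod q via Montgomery: map into Montgomery form, REDC the product
    let r = ((1u128 << 64) % q as u128) as u64; // R mod q
    let (a, b) = (1234u64, 5678u64);
    let am = (a as u128 * r as u128 % q as u128) as u64;
    let bm = (b as u128 * r as u128 % q as u128) as u64;
    let expect = (a as u128 * b as u128 % q as u128 * r as u128 % q as u128) as u64;
    assert_eq!(mont_reduce(am as u128 * bm as u128, q, qi), expect);
    println!("ok");
}
```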
AVX-512 Optimization
Our NTT implementation uses AVX-512 SIMD instructions to process 4 coefficients in parallel. The speedup over scalar code is consistent across sizes:
| Poly Size | Scalar | AVX-512 | Speedup |
|---|---|---|---|
| N=1024 | 68.4 µs | 54.2 µs | 1.26x |
| N=4096 | 586 µs | 412 µs | 1.42x |
| N=8192 | 1.28 ms | 892 µs | 1.43x |
| N=16384 | 2.78 ms | 1.92 ms | 1.45x |
The ~1.4x speedup might seem modest, but it compounds: a typical FHE operation requires 8–12 NTTs, so the cumulative savings are significant. On ARM Graviton4—where H33 runs its production workload—we use a radix-4 Montgomery NTT instead, since ARM NEON lacks native 64x64→128 multiply intrinsics. The radix-4 structure reduces the number of NTT stages by half (log₄N vs log₂N), cutting loop overhead and twiddle-factor loads proportionally.
SIMD Batching: 32 Users Per Ciphertext
NTT performance is only part of the picture. H33's BFV scheme operates at N=4096 with plaintext modulus t=65537, which satisfies the CRT batching condition t ≡ 1 (mod 2N). This unlocks 4096 plaintext slots per ciphertext. Each biometric template occupies 128 dimensions, so a single ciphertext packs 32 users (4096 ÷ 128). The entire batch—FHE inner product, ZKP lookup, and Dilithium attestation—completes in ~1,109µs, about 35µs per authentication.
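The slot packing can be sketched as follows; the contiguous per-user layout is an assumption for illustration, not H33's documented format:

```rust
// Pack 32 user templates (128 dims each) into the 4096 CRT slots of one
// BFV plaintext. The contiguous per-user layout here is hypothetical.
const SLOTS: usize = 4096; // N = 4096, t = 65537, and t ≡ 1 (mod 2N)
const DIMS: usize = 128;
const USERS: usize = SLOTS / DIMS; // 32

fn pack(templates: &[[u16; DIMS]; USERS]) -> Vec<u16> {
    let mut slots = vec![0u16; SLOTS];
    for (u, t) in templates.iter().enumerate() {
        // user u owns slots [128u, 128u + 128)
        slots[u * DIMS..(u + 1) * DIMS].copy_from_slice(t);
    }
    slots
}

fn main() {
    assert_eq!(65537 % (2 * 4096), 1); // CRT batching condition holds
    assert_eq!(USERS, 32);
    let templates = [[7u16; DIMS]; USERS];
    let slots = pack(&templates);
    assert_eq!(slots.len(), SLOTS);
    assert_eq!(slots[129], 7); // user 1, dim 1
    println!("ok");
}
```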
NTT-Domain Persistence
Secret keys and enrolled templates remain in NTT form permanently. multiply_plain_ntt() returns NTT-form ciphertexts, skipping two inverse transforms per modulus. This single optimization reduced batch latency by 19.3%.
Fused Inner Product
Instead of performing INTT after each chunk multiply, H33 accumulates results in NTT domain and runs a single final INTT. For a 128-dim template across 32 users, this eliminates 31 redundant inverse transforms.
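A scalar sketch of the fused accumulation (small q so plain u64 arithmetic suffices; the real kernel uses Montgomery arithmetic and operates on ciphertext components):

```rust
// Fused NTT-domain inner product: multiply-accumulate each chunk's
// pointwise product and defer the single inverse transform to the caller.
fn fused_inner_product(ct_chunks: &[Vec<u64>], pt_chunks: &[Vec<u64>], q: u64) -> Vec<u64> {
    let n = ct_chunks[0].len();
    let mut acc = vec![0u64; n];
    for (c, p) in ct_chunks.iter().zip(pt_chunks) {
        for i in 0..n {
            acc[i] = (acc[i] + c[i] * p[i] % q) % q; // still in NTT domain
        }
    }
    acc // one INTT here replaces one INTT per chunk
}

fn main() {
    let ct = vec![vec![1, 2], vec![3, 4]];
    let pt = vec![vec![5, 6], vec![7, 8]];
    // lane-wise: [1*5 + 3*7, 2*6 + 4*8] mod 97
    assert_eq!(fused_inner_product(&ct, &pt, 97), vec![26, 44]);
    println!("ok");
}
```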
Batch Attestation
One Dilithium sign+verify covers all 32 users in a batch rather than signing individually. This reduces attestation overhead from ~7,808µs (32 × 244µs) to ~244µs—a 32x reduction.
Post-Quantum Secure
The entire pipeline—BFV lattice encryption, SHA3-256 ZKP hashing, and ML-DSA (Dilithium) signatures—is post-quantum secure. No classical-only primitives appear in the authentication path.
CKKS Encoding Performance
CKKS is the scheme of choice for approximate arithmetic on encrypted data—perfect for biometrics, ML inference, and analytics. Our encoding benchmarks:
| Operation | Slots | Time | Target |
|---|---|---|---|
| encode_real | 512 | 45.2 µs | <100 µs |
| encode_complex | 512 | 52.8 µs | <100 µs |
| decode_real | 512 | 38.6 µs | <100 µs |
CKKS batching means a single ciphertext can hold 512 values that are all processed simultaneously. A 45.2µs encode time amortizes to just 88 nanoseconds per value.
Parameter Caching
NTT requires precomputed root-of-unity tables and other parameters. Regenerating these for every operation would be wasteful. Our caching architecture:
- Pre-warmed sizes: 1024, 4096, 8192, 16384, 32768 are always hot
- Cache hit rate: 99.2% in production
- Redis backend: Parameters persist across process restarts
- Memory footprint: ~50MB for all common sizes
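The table precomputation behind the cache can be sketched as follows, using the toy prime q = 65537 (production moduli are ~56-bit, and real tables are built per (size, modulus) pair):

```rust
use std::collections::HashMap;

// Precomputed twiddle (root-of-unity) table for one NTT size. Sketch only.
struct NttParams { twiddles: Vec<u64> }

fn pow_mod(mut b: u64, mut e: u64, m: u64) -> u64 {
    let mut r = 1;
    b %= m;
    while e > 0 {
        if e & 1 == 1 { r = r * b % m; }
        b = b * b % m;
        e >>= 1;
    }
    r
}

/// Build the twiddle table for size n over Z_q, where g generates Z_q^*.
fn build(n: usize, q: u64, g: u64) -> NttParams {
    let omega = pow_mod(g, (q - 1) / n as u64, q); // primitive n-th root of unity
    NttParams { twiddles: (0..n / 2).map(|i| pow_mod(omega, i as u64, q)).collect() }
}

fn main() {
    const Q: u64 = 65537; // toy prime: 3 generates Z_65537^*
    let mut cache: HashMap<usize, NttParams> = HashMap::new();
    for &n in &[1024usize, 4096, 8192, 16384, 32768] {
        cache.insert(n, build(n, Q, 3)); // pre-warm the common sizes once
    }
    let p = &cache[&4096];
    assert_eq!(p.twiddles.len(), 2048); // each stage reuses these constants
    assert_eq!(p.twiddles[0], 1);
    println!("ok");
}
```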
For the ZKP verification layer, H33 uses an in-process DashMap cache rather than a TCP-based cache proxy. At 96 concurrent workers on Graviton4, a single-container TCP proxy serialized all connections and caused an 11x throughput regression (1.51M→136K auth/sec). The in-process DashMap achieves 0.085µs lookups with zero network contention—44x faster than raw STARK proof verification.
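The lookup pattern can be shown with a std-only stand-in; DashMap shards this lock internally so concurrent readers rarely contend, and the key and value types here are hypothetical:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// std-only stand-in for the in-process DashMap pattern.
struct ProofCache {
    map: RwLock<HashMap<[u8; 32], bool>>, // proof hash -> verified?
}

impl ProofCache {
    fn new() -> Self {
        ProofCache { map: RwLock::new(HashMap::new()) }
    }
    /// Return the cached verdict, or verify once and cache the result.
    fn check(&self, key: [u8; 32], verify: impl FnOnce() -> bool) -> bool {
        if let Some(&v) = self.map.read().unwrap().get(&key) {
            return v; // hot path: no network hop, no proof re-verification
        }
        let v = verify();
        self.map.write().unwrap().insert(key, v);
        v
    }
}

fn main() {
    let cache = ProofCache::new();
    let first = cache.check([0u8; 32], || true);   // miss: runs the verifier
    let second = cache.check([0u8; 32], || false); // hit: closure never runs
    assert!(first && second);
    println!("ok");
}
```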
Production Results at Scale
These NTT optimizations feed directly into H33's production throughput. On a c8g.metal-48xl Graviton4 instance (192 vCPUs, 377 GiB RAM), the full authentication pipeline—BFV FHE batch, ZKP cache lookup, and Dilithium attestation—sustains 2,172,518 authentications per second across 96 parallel workers. Each 32-user batch completes in ~1,109µs end to end, with the NTT-domain inner product consuming the majority of that budget.
Comparison to Competitors
How does H33 compare to other FHE providers?
| Provider | N=16384 NTT | H33 Advantage |
|---|---|---|
| Zama (Concrete) | ~15ms | 8x faster |
| SEAL (Microsoft) | ~8ms | 4x faster |
| OpenFHE | ~5ms | 2.7x faster |
| H33 | 1.85ms | Baseline |
Our Rust implementation with AVX-512 intrinsics consistently outperforms C++ libraries. The combination of Rust's zero-cost abstractions and careful memory layout optimization makes a difference. On ARM, the Montgomery radix-4 NTT with Harvey lazy reduction achieves comparable throughput without relying on wide SIMD—proving that algorithmic choices matter more than instruction set width when the math is right.
Experience Production FHE Performance
Run encrypted computations at scale with sub-2ms latency.
Get API Key