The Number Theoretic Transform (NTT) is the heart of modern FHE. Every homomorphic operation—addition, multiplication, rotation—requires multiple NTT transformations. If NTT is slow, your entire FHE system is slow.
H33's January 2026 benchmarks show production-ready NTT performance: 1.85ms roundtrip for N=16384, the most common production size. Here's how we got there.
Why NTT Dominates FHE Latency
Polynomial multiplication is the most expensive primitive in lattice-based cryptography. A naive multiply of two degree-N polynomials costs O(N²) operations. The NTT reduces this to O(N log N) by converting polynomials into their evaluation form, performing element-wise multiplication, then transforming back. In H33's BFV pipeline with N=4096 and a single 56-bit modulus, this transformation is invoked during encryption, decryption, key switching, and every homomorphic operation in between.
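The transform-multiply-transform pipeline can be shown end to end with a toy radix-2 NTT. The parameters below (N = 4, q = 17, ω = 4) are illustration-only, and the product is taken modulo x^N − 1 for brevity; BFV's negacyclic ring x^N + 1 uses a 2N-th root of unity but has the same butterfly structure:

```rust
// Toy radix-2 NTT over Z_17 with N = 4 (ω = 4 is a primitive 4th root of
// unity mod 17). Production parameters (N = 4096, ~56-bit q) follow the
// same structure.
const Q: u64 = 17;
const N: usize = 4;
const OMEGA: u64 = 4;

fn pow_mod(mut b: u64, mut e: u64, m: u64) -> u64 {
    let mut r = 1;
    b %= m;
    while e > 0 {
        if e & 1 == 1 { r = r * b % m; }
        b = b * b % m;
        e >>= 1;
    }
    r
}

/// Iterative Cooley-Tukey NTT: O(N log N) butterflies, in place.
fn ntt(a: &mut [u64; N], omega: u64) {
    // bit-reversal permutation
    let mut j = 0;
    for i in 1..N {
        let mut bit = N >> 1;
        while j & bit != 0 { j ^= bit; bit >>= 1; }
        j |= bit;
        if i < j { a.swap(i, j); }
    }
    // log2(N) stages of butterflies
    let mut len = 2;
    while len <= N {
        let w_len = pow_mod(omega, (N / len) as u64, Q);
        for start in (0..N).step_by(len) {
            let mut w = 1;
            for k in 0..len / 2 {
                let u = a[start + k];
                let v = a[start + k + len / 2] * w % Q;
                a[start + k] = (u + v) % Q;
                a[start + k + len / 2] = (u + Q - v) % Q;
                w = w * w_len % Q;
            }
        }
        len <<= 1;
    }
}

/// Polynomial product mod (x^N - 1, q): forward NTT, pointwise, inverse NTT.
fn poly_mul(a: [u64; N], b: [u64; N]) -> [u64; N] {
    let (mut fa, mut fb) = (a, b);
    ntt(&mut fa, OMEGA);
    ntt(&mut fb, OMEGA);
    let mut fc = [0u64; N];
    for i in 0..N { fc[i] = fa[i] * fb[i] % Q; } // element-wise multiply
    ntt(&mut fc, pow_mod(OMEGA, Q - 2, Q));      // inverse transform (ω^-1)
    let n_inv = pow_mod(N as u64, Q - 2, Q);     // scale by N^-1
    for c in fc.iter_mut() { *c = *c * n_inv % Q; }
    fc
}

fn main() {
    // (1 + 2x + 3x^2 + 4x^3)(1 + x) mod (x^4 - 1, 17) = 5 + 3x + 5x^2 + 7x^3
    assert_eq!(poly_mul([1, 2, 3, 4], [1, 1, 0, 0]), [5, 3, 5, 7]);
    println!("ok");
}
```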
In practice, a single biometric authentication batch touches 8–12 NTTs. At N=4096, each forward or inverse transform takes roughly 190µs, meaning NTT alone can account for 1.5–2.3ms of a batch if left unoptimized. When H33 processes 32 users per ciphertext and targets ~1,109µs per batch, every microsecond in the NTT hot path matters.
H33 keeps the secret key in NTT form at all times and stores enrolled biometric templates pre-transformed. This eliminates redundant forward NTTs during multiply_plain_accumulate, saving two full transforms per modulus per call—a technique that dropped batch latency from 1,375µs to 1,109µs in production.
NTT Across All Sizes
FHE applications use different polynomial degrees based on security and precision requirements. We benchmarked every common size:
| NTT Size | Forward | Inverse | Roundtrip | Use Case |
|---|---|---|---|---|
| N=256 | 4.2 µs | 4.5 µs | 8.7 µs | Testing/Development |
| N=1024 | 21.3 µs | 22.8 µs | 44.1 µs | Simple computations |
| N=4096 | 188.5 µs | 198.4 µs | 386.9 µs | Standard security |
| N=8192 | 412 µs | 438 µs | 850 µs | Extended precision |
| N=16384 | 892 µs | 956 µs | 1.85 ms | Production default |
| N=32768 | 1.92 ms | 2.06 ms | 3.98 ms | High security |
Montgomery Arithmetic: Eliminating Division
Traditional NTT butterflies require modular reduction after every multiply—a costly integer division. H33 replaces all modular reduction with Montgomery multiplication (REDC), which trades division by the modulus q for multiplies and shifts relative to a power-of-two constant R = 2^k, using the precomputed factor −q⁻¹ mod R. All twiddle factors are stored in Montgomery form at initialization time, so the hot loop performs only multiplies and bitwise operations—no division instruction appears anywhere in the critical path.
```rust
// Cooley-Tukey butterfly with Montgomery multiply (simplified).
// Inputs may arrive in [0, 2q); outputs stay in [0, 2q).
fn butterfly_mont(a: &mut u64, b: &mut u64, w: u64, q: u64, q_inv: u64) {
    let t = mont_reduce(*b as u128 * w as u128, q, q_inv); // t in [0, q)
    let x = if *a >= q { *a - q } else { *a };             // lazy fold into [0, q)
    *b = x + q - t; // x - t shifted by q: stays non-negative, in [0, 2q)
    *a = x + t;     // in [0, 2q)
}
```

Combined with Harvey lazy reduction, intermediate butterfly values remain in the range [0, 2q) between NTT stages. The final reduction to [0, q) happens only once after the last stage, saving one conditional subtraction per butterfly per stage—a 10–15% speedup across all polynomial sizes.
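For reference, here is a textbook REDC sketch (not H33's exact kernel) for R = 2^64, together with the Newton iteration that precomputes the constant −q⁻¹ mod 2^64:

```rust
/// Montgomery reduction (REDC) for R = 2^64: returns x * R^-1 mod q with
/// no division instruction. Requires odd q < 2^63 and x < q * R.
/// q_inv is the precomputed constant -q^-1 mod 2^64.
fn mont_reduce(x: u128, q: u64, q_inv: u64) -> u64 {
    let m = (x as u64).wrapping_mul(q_inv);               // m = (x mod R) * (-q^-1) mod R
    let t = ((x + m as u128 * q as u128) >> 64) as u64;   // (x + m*q) / R, exact by construction
    if t >= q { t - q } else { t }                        // one conditional, result in [0, q)
}

/// Precompute -q^-1 mod 2^64 by Newton iteration (correct bits double each step).
fn neg_inv(q: u64) -> u64 {
    let mut inv: u64 = 1; // correct mod 2 for odd q
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

fn main() {
    let q: u64 = 7681; // toy NTT-friendly prime, not a production modulus
    let qi = neg_inv(q);
    assert_eq!(q.wrapping_mul(qi.wrapping_neg()), 1); // sanity: q * q^-1 ≡ 1 mod 2^64
    // a*b mod q via Montgomery: map into Montgomery form, REDC the product
    let r = ((1u128 << 64) % q as u128) as u64; // R mod q
    let (a, b) = (1234u64, 5678u64);
    let am = (a as u128 * r as u128 % q as u128) as u64;
    let bm = (b as u128 * r as u128 % q as u128) as u64;
    let expect = (a as u128 * b as u128 % q as u128 * r as u128 % q as u128) as u64;
    assert_eq!(mont_reduce(am as u128 * bm as u128, q, qi), expect);
    println!("ok");
}
```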
AVX-512 Optimization
Our NTT implementation uses AVX-512 SIMD instructions to process 4 coefficients in parallel. The speedup over scalar code is consistent across sizes:
| Poly Size | Scalar | AVX-512 | Speedup |
|---|---|---|---|
| N=1024 | 68.4 µs | 54.2 µs | 1.26x |
| N=4096 | 586 µs | 412 µs | 1.42x |
| N=8192 | 1.28 ms | 892 µs | 1.43x |
| N=16384 | 2.78 ms | 1.92 ms | 1.45x |
The ~1.4x speedup might seem modest, but it compounds: a typical FHE operation requires 8–12 NTTs, so the cumulative savings are significant. On ARM Graviton4—where H33 runs its production workload—we use a radix-4 Montgomery NTT instead, since ARM NEON lacks native 64x64→128 multiply intrinsics. The radix-4 structure reduces the number of NTT stages by half (log₄N vs log₂N), cutting loop overhead and twiddle-factor loads proportionally.
SIMD Batching: 32 Users Per Ciphertext
NTT performance is only part of the picture. H33's BFV scheme operates at N=4096 with plaintext modulus t=65537, which satisfies the CRT batching condition t ≡ 1 (mod 2N). This unlocks 4096 plaintext slots per ciphertext. Each biometric template occupies 128 dimensions, so a single ciphertext packs 32 users (4096 ÷ 128). The entire batch—FHE inner product, ZKP lookup, and Dilithium attestation—completes in ~1,109µs, about 35µs per authentication.
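The slot packing can be sketched as follows; the contiguous per-user layout is an assumption for illustration, not H33's documented format:

```rust
// Pack 32 user templates (128 dims each) into the 4096 CRT slots of one
// BFV plaintext. The contiguous per-user layout here is hypothetical.
const SLOTS: usize = 4096; // N = 4096, t = 65537, and t ≡ 1 (mod 2N)
const DIMS: usize = 128;
const USERS: usize = SLOTS / DIMS; // 32

fn pack(templates: &[[u16; DIMS]; USERS]) -> Vec<u16> {
    let mut slots = vec![0u16; SLOTS];
    for (u, t) in templates.iter().enumerate() {
        // user u owns slots [128u, 128u + 128)
        slots[u * DIMS..(u + 1) * DIMS].copy_from_slice(t);
    }
    slots
}

fn main() {
    assert_eq!(65537 % (2 * 4096), 1); // CRT batching condition holds
    assert_eq!(USERS, 32);
    let templates = [[7u16; DIMS]; USERS];
    let slots = pack(&templates);
    assert_eq!(slots.len(), SLOTS);
    assert_eq!(slots[129], 7); // user 1, dim 1
    println!("ok");
}
```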
NTT-Domain Persistence
Secret keys and enrolled templates remain in NTT form permanently. multiply_plain_ntt() returns NTT-form ciphertexts, skipping two inverse transforms per modulus. This single optimization reduced batch latency by 19.3%.
Fused Inner Product
Instead of performing INTT after each chunk multiply, H33 accumulates results in NTT domain and runs a single final INTT. For a 128-dim template across 32 users, this eliminates 31 redundant inverse transforms.
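A scalar sketch of the fused accumulation (small q so plain u64 arithmetic suffices; the real kernel uses Montgomery arithmetic and operates on ciphertext components):

```rust
// Fused NTT-domain inner product: multiply-accumulate each chunk's
// pointwise product and defer the single inverse transform to the caller.
fn fused_inner_product(ct_chunks: &[Vec<u64>], pt_chunks: &[Vec<u64>], q: u64) -> Vec<u64> {
    let n = ct_chunks[0].len();
    let mut acc = vec![0u64; n];
    for (c, p) in ct_chunks.iter().zip(pt_chunks) {
        for i in 0..n {
            acc[i] = (acc[i] + c[i] * p[i] % q) % q; // still in NTT domain
        }
    }
    acc // one INTT here replaces one INTT per chunk
}

fn main() {
    let ct = vec![vec![1, 2], vec![3, 4]];
    let pt = vec![vec![5, 6], vec![7, 8]];
    // lane-wise: [1*5 + 3*7, 2*6 + 4*8] mod 97
    assert_eq!(fused_inner_product(&ct, &pt, 97), vec![26, 44]);
    println!("ok");
}
```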
Batch Attestation
One Dilithium sign+verify covers all 32 users in a batch rather than signing individually. This reduces attestation overhead from ~7,808µs (32 × 244µs) to ~244µs—a 32x reduction.
Post-Quantum Secure
The entire pipeline—BFV lattice encryption, SHA3-256 ZKP hashing, and ML-DSA (Dilithium) signatures—is post-quantum secure. No classical-only primitives appear in the authentication path.
CKKS Encoding Performance
CKKS is the scheme of choice for approximate arithmetic on encrypted data—perfect for biometrics, ML inference, and analytics. Our encoding benchmarks:
| Operation | Slots | Time | Target |
|---|---|---|---|
| encode_real | 512 | 45.2 µs | <100 µs |
| encode_complex | 512 | 52.8 µs | <100 µs |
| decode_real | 512 | 38.6 µs | <100 µs |
CKKS batching means a single ciphertext can hold 512 values that are all processed simultaneously. A 45.2µs encode time amortizes to just 88 nanoseconds per value.
Parameter Caching
NTT requires precomputed root-of-unity tables and other parameters. Regenerating these for every operation would be wasteful. Our caching architecture:
- Pre-warmed sizes: 1024, 4096, 8192, 16384, 32768 are always hot
- Cache hit rate: 99.2% in production
- Redis backend: Parameters persist across process restarts
- Memory footprint: ~50MB for all common sizes
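The table precomputation behind the cache can be sketched as follows, using the toy prime q = 65537 (production moduli are ~56-bit, and real tables are built per (size, modulus) pair):

```rust
use std::collections::HashMap;

// Precomputed twiddle (root-of-unity) table for one NTT size. Sketch only.
struct NttParams { twiddles: Vec<u64> }

fn pow_mod(mut b: u64, mut e: u64, m: u64) -> u64 {
    let mut r = 1;
    b %= m;
    while e > 0 {
        if e & 1 == 1 { r = r * b % m; }
        b = b * b % m;
        e >>= 1;
    }
    r
}

/// Build the twiddle table for size n over Z_q, where g generates Z_q^*.
fn build(n: usize, q: u64, g: u64) -> NttParams {
    let omega = pow_mod(g, (q - 1) / n as u64, q); // primitive n-th root of unity
    NttParams { twiddles: (0..n / 2).map(|i| pow_mod(omega, i as u64, q)).collect() }
}

fn main() {
    const Q: u64 = 65537; // toy prime: 3 generates Z_65537^*
    let mut cache: HashMap<usize, NttParams> = HashMap::new();
    for &n in &[1024usize, 4096, 8192, 16384, 32768] {
        cache.insert(n, build(n, Q, 3)); // pre-warm the common sizes once
    }
    let p = &cache[&4096];
    assert_eq!(p.twiddles.len(), 2048); // each stage reuses these constants
    assert_eq!(p.twiddles[0], 1);
    println!("ok");
}
```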
For the ZKP verification layer, H33 uses an in-process DashMap cache rather than a TCP-based cache proxy. At 96 concurrent workers on Graviton4, a single-container TCP proxy serialized all connections and caused an 11x throughput regression (1.51M→136K auth/sec). The in-process DashMap achieves 0.085µs lookups with zero network contention—44x faster than raw STARK proof verification.
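The lookup pattern can be shown with a std-only stand-in; DashMap shards this lock internally so concurrent readers rarely contend, and the key and value types here are hypothetical:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// std-only stand-in for the in-process DashMap pattern.
struct ProofCache {
    map: RwLock<HashMap<[u8; 32], bool>>, // proof hash -> verified?
}

impl ProofCache {
    fn new() -> Self {
        ProofCache { map: RwLock::new(HashMap::new()) }
    }
    /// Return the cached verdict, or verify once and cache the result.
    fn check(&self, key: [u8; 32], verify: impl FnOnce() -> bool) -> bool {
        if let Some(&v) = self.map.read().unwrap().get(&key) {
            return v; // hot path: no network hop, no proof re-verification
        }
        let v = verify();
        self.map.write().unwrap().insert(key, v);
        v
    }
}

fn main() {
    let cache = ProofCache::new();
    let first = cache.check([0u8; 32], || true);   // miss: runs the verifier
    let second = cache.check([0u8; 32], || false); // hit: closure never runs
    assert!(first && second);
    println!("ok");
}
```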
Production Results at Scale
These NTT optimizations feed directly into H33's production throughput. On a c8g.metal-48xl Graviton4 instance (192 vCPUs, 377 GiB RAM), the full authentication pipeline—BFV FHE batch, ZKP cache lookup, and Dilithium attestation—sustains 2,172,518 authentications per second across 96 parallel workers. Each 32-user batch completes in ~1,109µs end to end, with the NTT-domain inner product consuming the majority of that budget.
Comparison to Competitors
How does H33 compare to other FHE providers?
| Provider | N=16384 NTT | H33 Advantage |
|---|---|---|
| Zama (Concrete) | ~15ms | 8x faster |
| SEAL (Microsoft) | ~8ms | 4x faster |
| OpenFHE | ~5ms | 2.7x faster |
| H33 | 1.85ms | Baseline |
Our Rust implementation with AVX-512 intrinsics consistently outperforms C++ libraries. The combination of Rust's zero-cost abstractions and careful memory layout optimization makes a difference. On ARM, the Montgomery radix-4 NTT with Harvey lazy reduction achieves comparable throughput without relying on wide SIMD—proving that algorithmic choices matter more than instruction set width when the math is right.
Experience Production FHE Performance
Run encrypted computations at scale with sub-2ms latency.
Get API Key