BenchmarksStack Ranking
APIsPricingDocsWhite PaperTokenBlogAboutSecurity Demo
Log InGet API Key
FHE · 5 min read

NTT Performance Deep Dive:
Sub-2ms FHE at Production Scale

How H33 achieves 1.85ms NTT roundtrip for N=16384. AVX-512 optimization, CKKS encoding, and polynomial multiplication benchmarks.

~42µs
Per Auth
2.17M/s
Throughput
128-bit
Security
32
Users/Batch

The Number Theoretic Transform (NTT) is the heart of modern FHE. Every homomorphic operation—addition, multiplication, rotation—requires multiple NTT transformations. If NTT is slow, your entire FHE system is slow.

H33's January 2026 benchmarks show production-ready NTT performance: 1.85ms roundtrip for N=16384, the most common production size. Here's how we got there.

1.85ms
NTT 16K Roundtrip
386.9µs
NTT 4K Roundtrip
45.2µs
CKKS Encode
99.2%
Cache Hit Rate

Why NTT Dominates FHE Latency

Polynomial multiplication is the most expensive primitive in lattice-based cryptography. A naive multiply of two degree-N polynomials costs O(N²) operations. The NTT reduces this to O(N log N) by converting polynomials into their evaluation form, performing element-wise multiplication, then transforming back. In H33's BFV pipeline with N=4096 and a single 56-bit modulus, this transformation is invoked during encryption, decryption, key switching, and every homomorphic operation in between.

In practice, a single biometric authentication batch touches 8–12 NTTs. At N=4096, each forward or inverse transform takes roughly 190µs, meaning NTT alone can account for 1.5–2.3ms of a batch if left unoptimized. When H33 processes 32 users per ciphertext and targets ~1,109µs per batch, every microsecond in the NTT hot path matters.

Key Insight

H33 keeps the secret key in NTT form at all times and stores enrolled biometric templates pre-transformed. This eliminates redundant forward NTTs during multiply_plain_accumulate, saving two full transforms per modulus per call—a technique that dropped batch latency from 1,375µs to 1,109µs in production.

NTT Across All Sizes

FHE applications use different polynomial degrees based on security and precision requirements. We benchmarked every common size:

NTT Size Forward Inverse Roundtrip Use Case
N=256 4.2 µs 4.5 µs 8.7 µs Testing/Development
N=1024 21.3 µs 22.8 µs 44.1 µs Simple computations
N=4096 188.5 µs 198.4 µs 386.9 µs Standard security
N=8192 412 µs 438 µs 850 µs Extended precision
N=16384 892 µs 956 µs 1.85 ms Production default
N=32768 1.92 ms 2.06 ms 3.98 ms High security

Montgomery Arithmetic: Eliminating Division

Traditional NTT butterflies require modular reduction after every multiply—a costly integer division. H33 replaces all modular reduction with Montgomery multiplication (REDC), which substitutes division by the modulus q with a multiply-and-shift by a precomputed constant R = 2^k. All twiddle factors are stored in Montgomery form at initialization time, so the hot loop performs only multiplies and bitwise operations—no division instruction appears anywhere in the critical path.

// Montgomery butterfly (simplified)
fn butterfly_mont(a: &mut u64, b: &mut u64, w: u64, q: u64, q_inv: u64) {
    let t = mont_reduce(*b as u128 * w as u128, q, q_inv);
    *b = a.wrapping_sub(t).wrapping_add(q & mask);  // lazy reduction
    *a = a.wrapping_add(t);                          // value in [0, 2q)
}

Combined with Harvey lazy reduction, intermediate butterfly values remain in the range [0, 2q) between NTT stages. The final reduction to [0, q) happens only once after the last stage, saving one conditional subtraction per butterfly per stage—a 10–15% speedup across all polynomial sizes.

AVX-512 Optimization

Our NTT implementation uses AVX-512 SIMD instructions to process 4 coefficients in parallel. The speedup over scalar code is consistent across sizes:

Poly Size Scalar AVX-512 Speedup
N=1024 68.4 µs 54.2 µs 1.26x
N=4096 586 µs 412 µs 1.42x
N=8192 1.28 ms 892 µs 1.43x
N=16384 2.78 ms 1.92 ms 1.45x

The ~1.4x speedup might seem modest, but it compounds: a typical FHE operation requires 8–12 NTTs, so the cumulative savings are significant. On ARM Graviton4—where H33 runs its production workload—we use a radix-4 Montgomery NTT instead, since ARM NEON lacks native 64x64→128 multiply intrinsics. The radix-4 structure reduces the number of NTT stages by half (log₄N vs log₂N), cutting loop overhead and twiddle-factor loads proportionally.

SIMD Batching: 32 Users Per Ciphertext

NTT performance is only part of the picture. H33's BFV scheme operates at N=4096 with plaintext modulus t=65537, which satisfies the CRT batching condition t ≡ 1 (mod 2N). This unlocks 4096 plaintext slots per ciphertext. Each biometric template occupies 128 dimensions, so a single ciphertext packs 32 users (4096 ÷ 128). The entire batch—FHE inner product, ZKP lookup, and Dilithium attestation—completes in ~1,109µs, yielding approximately 42µs per authentication.

NTT-Domain Persistence

Secret keys and enrolled templates remain in NTT form permanently. multiply_plain_ntt() returns NTT-form ciphertexts, skipping two inverse transforms per modulus. This single optimization reduced batch latency by 19.3%.

Fused Inner Product

Instead of performing INTT after each chunk multiply, H33 accumulates results in NTT domain and runs a single final INTT. For a 128-dim template across 32 users, this eliminates 31 redundant inverse transforms.

Batch Attestation

One Dilithium sign+verify covers all 32 users in a batch rather than signing individually. This reduces attestation overhead from ~7,808µs (32 × 244µs) to ~244µs—a 31x reduction.

Post-Quantum Secure

The entire pipeline—BFV lattice encryption, SHA3-256 ZKP hashing, and ML-DSA (Dilithium) signatures—is post-quantum secure. No classical-only primitives appear in the authentication path.

CKKS Encoding Performance

CKKS is the scheme of choice for approximate arithmetic on encrypted data—perfect for biometrics, ML inference, and analytics. Our encoding benchmarks:

Operation Slots Time Target
encode_real 512 45.2 µs <100 µs
encode_complex 512 52.8 µs <100 µs
decode_real 512 38.6 µs <100 µs
512 Slots = 512 Values in Parallel

CKKS batching means a single ciphertext can hold 512 values that are all processed simultaneously. A 45.2µs encode time amortizes to just 88 nanoseconds per value.

Parameter Caching

NTT requires precomputed root-of-unity tables and other parameters. Regenerating these for every operation would be wasteful. Our caching architecture:

For the ZKP verification layer, H33 uses an in-process DashMap cache rather than a TCP-based cache proxy. At 96 concurrent workers on Graviton4, a single-container TCP proxy serialized all connections and caused an 11x throughput regression (1.51M→136K auth/sec). The in-process DashMap achieves 0.085µs lookups with zero network contention—44x faster than raw STARK proof verification.

Production Results at Scale

These NTT optimizations feed directly into H33's production throughput. On a c8g.metal-48xl Graviton4 instance (192 vCPUs, 377 GiB RAM), the full authentication pipeline—BFV FHE batch, ZKP cache lookup, and Dilithium attestation—sustains 2,172,518 authentications per second across 96 parallel workers. Each 32-user batch completes in ~1,109µs end to end, with the NTT-domain inner product consuming the majority of that budget.

Comparison to Competitors

How does H33 compare to other FHE providers?

Provider N=16384 NTT H33 Advantage
Zama (Concrete) ~15ms 8x faster
SEAL (Microsoft) ~8ms 4x faster
OpenFHE ~5ms 2.7x faster
H33 1.85ms Baseline

Our Rust implementation with AVX-512 intrinsics consistently outperforms C++ libraries. The combination of Rust's zero-cost abstractions and careful memory layout optimization makes a difference. On ARM, the Montgomery radix-4 NTT with Harvey lazy reduction achieves comparable throughput without relying on wide SIMD—proving that algorithmic choices matter more than instruction set width when the math is right.

Experience Production FHE Performance

Run encrypted computations at scale with sub-2ms latency.

Get API Key

Build With Post-Quantum Security

Enterprise-grade FHE, ZKP, and post-quantum cryptography. One API call. Sub-millisecond latency.

Get Free API Key → Read the Docs
Free tier · 10,000 API calls/month · No credit card required
Verify It Yourself