FHE · 5 min read

FHE Performance Optimization:
From Seconds to Microseconds

Techniques for achieving production-ready FHE performance in real applications.

~42µs Per Auth · 2.17M/s Throughput · 128-bit Security · 32 Users/Batch

FHE has a reputation for being slow, but modern implementations achieve remarkable performance. H33 performs complete FHE biometric verification in 2,648 microseconds. Here's how we optimize FHE for production use — see the full numbers on our benchmarks page.

Understanding FHE Overhead

FHE operations are inherently more expensive than plaintext operations due to:

Ciphertext expansion: A single encrypted value becomes one or more large polynomials, inflating memory and bandwidth
Polynomial arithmetic: Each homomorphic add or multiply is a ring operation over thousands of coefficients, not a single CPU instruction
Noise management: Every operation grows ciphertext noise, and keeping it below the decryption threshold constrains parameter choices

However, careful optimization can reduce this overhead dramatically. The gap between a naive FHE implementation and a production-tuned one is often three or four orders of magnitude. Every layer of the stack presents opportunities for improvement, from the abstract algorithm design down to the specific CPU instructions executing the arithmetic.

Parameter Optimization

FHE parameters directly impact performance:

Key Parameters

Polynomial degree (N): Higher = more security but slower
Coefficient modulus: Larger = more multiplication depth but slower
Plaintext modulus: Affects encoding efficiency

Choose the minimum parameters that meet your security requirements. Over-provisioning wastes performance. H33 uses N=4096 with a single 56-bit modulus and plaintext modulus t=65537 for its BFV biometric mode. This configuration satisfies the CRT batching condition (t is congruent to 1 mod 2N), enabling SIMD slot packing while keeping the polynomial ring small enough for sub-millisecond operations. Doubling N to 8192 would quadruple NTT cost for security margin that biometric verification does not require.
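The batching condition quoted above is worth checking mechanically whenever parameters change. A sketch using the published values (constant and function names are illustrative, not H33's configuration API):

```rust
// BFV parameters quoted in this post (illustrative names).
const N: u64 = 4096;    // polynomial degree
const T: u64 = 65_537;  // plaintext modulus

// CRT batching needs t ≡ 1 (mod 2N): then X^N + 1 splits into N linear
// factors mod t, giving N independent SIMD slots.
fn supports_batching(n: u64, t: u64) -> bool {
    t % (2 * n) == 1
}

fn main() {
    assert!(supports_batching(N, T)); // 65537 = 8 * 8192 + 1
    println!("N = {}, t = {}: batching OK", N, T);
}
```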

NTT-Domain Persistence

The Number Theoretic Transform is the single largest performance lever in any FHE system. NTT converts polynomial multiplication from O(n^2) to O(n log n), but even O(n log n) adds up when you perform dozens of transforms per operation. The key insight is to avoid unnecessary domain conversions.

Key Optimization

Keep data in NTT (frequency) domain as long as possible. Every forward/inverse NTT pair you eliminate saves two full O(n log n) passes over your polynomial ring.

In H33's pipeline, secret keys are stored in NTT form from the moment of keygen. Public key components (pk0) are pre-transformed at key generation time, eliminating a clone-and-transform on every encrypt call. Enrolled biometric templates are also stored in NTT form, so multiply_plain_accumulate skips the forward NTT entirely. The result: our multiply_plain_ntt() path saves 2*M transforms per call, where M is the moduli count. On H33-256, this single optimization reduced batch latency from 1,375 to 1,080 microseconds -- a 21% improvement.
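The shape of that optimization is simple: once both operands already live in the NTT domain, polynomial multiplication degenerates into a pointwise modular product. A sketch with illustrative names (not H33's actual API), using a stand-in NTT prime:

```rust
const Q: u64 = 998_244_353; // stand-in NTT-friendly prime, not H33's modulus

// Multiply a ciphertext polynomial by a plaintext polynomial when BOTH are
// already stored in NTT form: a single O(n) pointwise pass, zero transforms.
// A naive path would pay a forward NTT on the plaintext plus an inverse NTT
// on the result for every call.
fn multiply_plain_ntt(ct_ntt: &[u64], pt_ntt: &[u64]) -> Vec<u64> {
    assert_eq!(ct_ntt.len(), pt_ntt.len());
    ct_ntt
        .iter()
        .zip(pt_ntt)
        .map(|(&a, &b)| ((a as u128 * b as u128) % (Q as u128)) as u64)
        .collect()
}

fn main() {
    // pointwise product of two tiny "NTT-domain" vectors
    let out = multiply_plain_ntt(&[2, 3], &[5, 7]);
    assert_eq!(out, vec![10, 21]);
}
```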

Montgomery Arithmetic and Lazy Reduction

Division is the most expensive integer operation on modern CPUs. Classical modular reduction requires division after every multiply, which dominates NTT butterfly cost. Montgomery multiplication replaces division with shifts and additions by working in a transformed domain where reduction is a single multiply-and-shift.

// Classical NTT butterfly: a modular reduction (division) per multiply
let t = (w * a[j + half]) % q;
let u = a[j];
a[j] = (u + t) % q;
a[j + half] = (u + q - t) % q;

// Montgomery butterfly (no division in hot path)
let t = montgomery_reduce(w_mont * a[j + half]); // t in [0, q)
let u = a[j];                 // read before overwriting
a[j] = u + t;                 // lazy: stays in [0, 2q)
a[j + half] = u + q - t;      // lazy: stays in [0, 2q)

H33 stores all NTT twiddle factors in Montgomery form at initialization. Combined with Harvey lazy reduction -- where intermediate butterfly values are allowed to stay in [0, 2q) between stages rather than being fully reduced to [0, q) -- this eliminates every division from the NTT hot path. The final reduction to [0, q) happens once, at the end of the transform, amortized over all log(n) stages.
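To make the "multiply-and-shift" claim concrete, here is a minimal self-contained sketch of 64-bit Montgomery arithmetic. The modulus is the well-known NTT-friendly prime 119·2^23 + 1, standing in for a production modulus; none of this is H33's actual code:

```rust
const Q: u64 = 998_244_353; // stand-in NTT prime (119 * 2^23 + 1)

// Compute -Q^{-1} mod 2^64 by Newton iteration (valid for any odd Q).
fn neg_q_inv() -> u64 {
    let mut inv: u64 = Q; // correct to 3 low bits since Q is odd
    for _ in 0..5 {
        // each step doubles the number of correct low bits: 3 -> 6 -> ... -> 96
        inv = inv.wrapping_mul(2u64.wrapping_sub(Q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

// Montgomery reduction: for t < Q * 2^64, returns t * 2^{-64} mod Q as a
// value in [0, 2Q) -- matching the lazy-reduction range discussed above.
fn montgomery_reduce(t: u128, neg_qinv: u64) -> u64 {
    let m = (t as u64).wrapping_mul(neg_qinv); // t + m*Q ≡ 0 (mod 2^64)
    ((t + (m as u128) * (Q as u128)) >> 64) as u64
}

// Full round trip: multiply two ordinary residues via the Montgomery domain.
fn mont_mul(a: u64, b: u64) -> u64 {
    let ninv = neg_q_inv();
    let to_mont = |x: u64| (((x as u128) << 64) % (Q as u128)) as u64;
    let prod = montgomery_reduce((to_mont(a) as u128) * (to_mont(b) as u128), ninv);
    let mut r = montgomery_reduce(prod as u128, ninv); // leave Montgomery domain
    if r >= Q { r -= Q; } // single final reduction to [0, Q)
    r
}

fn main() {
    let (a, b) = (123_456_789u64, 555_555_555u64 % Q);
    assert_eq!(mont_mul(a, b), ((a as u128 * b as u128) % (Q as u128)) as u64);
    println!("mont_mul agrees with direct modular multiply");
}
```

Twiddle factors stored in Montgomery form (as H33 does) pay the `to_mont` conversion once at initialization, so the hot loop is exactly one multiply and one `montgomery_reduce` per butterfly.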

Algorithmic Optimization

Structure your computation for FHE efficiency:

// Inefficient: Deep multiplication chain
result = a * b * c * d;  // Depth 3

// Better: Balanced tree
result = (a * b) * (c * d);  // Depth 2

SIMD Batching: Amortizing Cost Across Users

BFV supports SIMD (Single Instruction, Multiple Data) batching through the Chinese Remainder Theorem. With N=4096 and a 128-dimensional biometric template, each ciphertext holds 4096 / 128 = 32 users. Every FHE operation on that ciphertext processes all 32 users simultaneously for essentially the same cost as processing one.
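The slot layout is plain concatenation: user u occupies slots [u·128, (u+1)·128). A sketch of the packing step on plaintext slot vectors, before encoding and encryption (illustrative names, not H33's API):

```rust
const N: usize = 4096;  // BFV slot count at N = 4096
const DIM: usize = 128; // biometric template dimension
const USERS_PER_BATCH: usize = N / DIM; // 32

// Concatenate up to 32 templates into one slot vector; a single encryption
// (and every FHE op on the resulting ciphertext) then covers all packed users.
fn pack_batch(templates: &[Vec<i64>]) -> Vec<i64> {
    assert!(templates.len() <= USERS_PER_BATCH);
    let mut slots = vec![0i64; N];
    for (u, t) in templates.iter().enumerate() {
        assert_eq!(t.len(), DIM);
        slots[u * DIM..(u + 1) * DIM].copy_from_slice(t);
    }
    slots
}

fn main() {
    let batch = pack_batch(&[vec![1; DIM], vec![2; DIM]]);
    // user 0 in slots 0..128, user 1 in slots 128..256, rest zero
    assert_eq!((batch[0], batch[DIM], batch[2 * DIM]), (1, 2, 0));
}
```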

| Metric | Single User | 32-User Batch | Improvement |
|---|---|---|---|
| FHE inner product | ~1,109 µs | ~1,109 µs | 32x throughput |
| ZKP lookup (DashMap) | 0.085 µs | ~2.7 µs | In-process, zero contention |
| Dilithium attestation | ~244 µs | ~244 µs | 1 sign+verify per batch |
| Total per auth | ~1,356 µs | ~42 µs | 32x amortization |

This is how H33 achieves 1.595 million authentications per second on a single Graviton4 instance: 96 parallel workers, each processing 32-user batches, with the full post-quantum stack -- BFV FHE, ZKP verification, and Dilithium attestation -- completing in approximately 1,356 microseconds per batch.

Hardware Acceleration

Modern hardware significantly accelerates FHE.

Hardware choice matters more than most teams realize. H33's production runs on AWS Graviton4 (ARM aarch64), where the system allocator outperforms jemalloc by 8% because glibc's malloc is heavily optimized for ARM's flat memory model. NEON SIMD helps with add/subtract/compare operations in Galois rotations, but lacks native 64x64-to-128-bit multiply, so NTT butterflies stay scalar. The lesson: always benchmark on your actual production hardware. Microbenchmark wins on one architecture can become regressions under real workload contention.

Memory Optimization

FHE is memory-intensive, so memory behavior deserves the same scrutiny as arithmetic.

Cache behavior is particularly critical when running dozens of parallel workers. Fused optimizations that win microbenchmarks may pollute L1 cache under contention and actually increase latency in production. H33 learned this firsthand: a fused NTT pre-twist optimization showed a 15% gain in isolated benchmarks but added 24 microseconds per batch under 96-worker load due to extra lookup table pressure. Always validate optimizations under realistic concurrency.

Caching Strategies

Strategic caching eliminates redundant computation.

Beyond FHE-internal caching, the verification pipeline itself benefits from result caching. H33 uses an in-process DashMap for ZKP proof lookups, achieving 0.085-microsecond access times. This replaced a TCP-based cache proxy that serialized all connections through a single container, causing an 11x throughput regression at scale. For single-instance deployments, in-process caching is nearly always faster than network-based caching, regardless of how fast the network layer claims to be.
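The pattern is an ordinary key-value lookup kept inside the process. A std-only sketch of the idea (H33 uses DashMap, which shards the lock; a `RwLock<HashMap>` stands in here so the example has no external dependencies):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// In-process proof cache: lookups take a shared in-memory lock and never
// cross a network hop or a serialized proxy connection.
struct ProofCache {
    map: RwLock<HashMap<u64, Vec<u8>>>,
}

impl ProofCache {
    fn new() -> Self {
        Self { map: RwLock::new(HashMap::new()) }
    }

    fn insert(&self, proof_id: u64, proof: Vec<u8>) {
        self.map.write().unwrap().insert(proof_id, proof);
    }

    fn get(&self, proof_id: u64) -> Option<Vec<u8>> {
        self.map.read().unwrap().get(&proof_id).cloned()
    }
}

fn main() {
    let cache = ProofCache::new();
    cache.insert(42, vec![1, 2, 3]);
    assert_eq!(cache.get(42), Some(vec![1, 2, 3]));
    assert_eq!(cache.get(7), None);
}
```

Under many concurrent readers, a sharded or lock-free map (DashMap, or a per-worker map) avoids the read-lock becoming its own point of contention.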

H33's Optimization Stack

Our production performance comes from the accumulation of every technique described above.

No single optimization delivers the result. It is the compound effect of eliminating waste at every layer -- arithmetic, memory, concurrency, caching -- that turns seconds into microseconds.

FHE performance is no longer a barrier to production deployment. With proper optimization, you can achieve real-time encrypted computation. H33 proves this daily at 1.595 million authentications per second, with every biometric match fully encrypted under BFV, every proof verified, and every result attested with post-quantum Dilithium signatures.

Ready to Go Quantum-Secure?

Start protecting your users with post-quantum authentication today. 1,000 free auths, no credit card required.

Get Free API Key →
