FHE has a reputation for being slow, but modern implementations achieve remarkable performance. H33 performs complete FHE biometric verification in 2,648 microseconds. Here's how we optimize FHE for production use — see the full numbers on our benchmarks page.
Understanding FHE Overhead
FHE operations are inherently more expensive than plaintext operations due to:
- Large ciphertext sizes (kilobytes vs bytes)
- Complex polynomial arithmetic
- Noise management overhead
- Key switching operations
However, careful optimization can reduce this overhead dramatically. The gap between a naive FHE implementation and a production-tuned one is often three or four orders of magnitude. Every layer of the stack presents opportunities for improvement, from the abstract algorithm design down to the specific CPU instructions executing the arithmetic.
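To make the size overhead concrete, here is a back-of-the-envelope sketch using the parameters discussed later in this post (N=4096, one modulus, coefficients stored as u64). The helper name and the two-polynomial assumption are illustrative, not a description of any particular library's memory layout:

```rust
// Rough ciphertext-expansion estimate for a small BFV configuration.
// Assumes each coefficient is stored as a u64 and a fresh ciphertext
// holds two polynomials (c0, c1); figures are illustrative only.
fn ciphertext_bytes(n: usize, moduli: usize, polys: usize) -> usize {
    n * moduli * polys * std::mem::size_of::<u64>()
}

fn main() {
    // N = 4096, one modulus, two polynomials -> 64 KiB per ciphertext.
    let bytes = ciphertext_bytes(4096, 1, 2);
    println!("fresh ciphertext: {} KiB", bytes / 1024);
    // A 128-dimensional template of u16 features is ~256 bytes plaintext,
    // so the raw expansion factor is on the order of 256x.
    println!("expansion vs 256-byte template: {}x", bytes / 256);
}
```

Even at these deliberately small parameters, a single ciphertext is kilobytes, which is why the batching and memory techniques below matter so much.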
Parameter Optimization
FHE parameters directly impact performance:
Key Parameters
- Polynomial degree (N): Higher = more security but slower
- Coefficient modulus: Larger = more multiplication depth but slower
- Plaintext modulus: Affects encoding efficiency
Choose the minimum parameters that meet your security requirements. Over-provisioning wastes performance. H33 uses N=4096 with a single 56-bit modulus and plaintext modulus t=65537 for its BFV biometric mode. This configuration satisfies the CRT batching condition (t is congruent to 1 mod 2N), enabling SIMD slot packing while keeping the polynomial ring small enough for sub-millisecond operations. Doubling N to 8192 would quadruple NTT cost for security margin that biometric verification does not require.
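The batching condition mentioned above is easy to check directly. This is a minimal sketch of the arithmetic, not library code:

```rust
// CRT batching condition: t ≡ 1 (mod 2N) makes the plaintext ring
// Z_t[X]/(X^N + 1) split into N independent slots. Sketch only.
fn supports_batching(t: u64, n: u64) -> bool {
    t % (2 * n) == 1
}

fn main() {
    assert!(supports_batching(65537, 4096));  // 65537 = 16 * 4096 + 1
    assert!(!supports_batching(65536, 4096)); // an even t can never batch
    println!("t = 65537, N = 4096: batching ok");
}
```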
NTT-Domain Persistence
The Number Theoretic Transform is the single largest performance lever in any FHE system. NTT converts polynomial multiplication from O(n^2) to O(n log n), but even O(n log n) adds up when you perform dozens of transforms per operation. The key insight is to avoid unnecessary domain conversions.
Keep data in NTT (frequency) domain as long as possible. Every forward/inverse NTT pair you eliminate saves two full O(n log n) passes over your polynomial ring.
In H33's pipeline, secret keys are stored in NTT form from the moment of keygen. Public key components (pk0) are pre-transformed at key generation time, eliminating a clone-and-transform on every encrypt call. Enrolled biometric templates are also stored in NTT form, so multiply_plain_accumulate skips the forward NTT entirely. The result: our multiply_plain_ntt() path saves 2*M transforms per call, where M is the moduli count. On H33-256, this single optimization reduced batch latency from 1,375 to 1,080 microseconds -- a 21% improvement.
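The accounting behind that 2*M saving can be sketched with a toy counter. All polynomial arithmetic is elided here, and the assumption is that both the ciphertext and the result stay in the NTT domain:

```rust
// Count NTT passes in a multiply_plain call. With the plaintext operand
// persisted in NTT form and the product kept in NTT form, both the
// forward pass (M moduli) and the inverse pass (M moduli) disappear.
fn ntt_passes(moduli: usize, operand_persisted: bool) -> usize {
    if operand_persisted {
        0 // pointwise multiply only; everything stays in the NTT domain
    } else {
        2 * moduli // forward NTT of operand + inverse NTT of product
    }
}

fn main() {
    let m = 1; // H33-256: a single 56-bit modulus
    let saved = ntt_passes(m, false) - ntt_passes(m, true);
    println!("transforms saved per call: {}", saved); // 2 * M
}
```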
Montgomery Arithmetic and Lazy Reduction
Division is the most expensive integer operation on modern CPUs. Classical modular reduction requires a division after every multiply, which dominates NTT butterfly cost. Montgomery multiplication avoids division entirely by working in a transformed domain where reduction becomes a multiply, an addition, and a shift.
// Classical NTT butterfly (a division hides in every % q)
let u = a[j];
let t = (w * a[j + half]) % q;
a[j] = (u + t) % q;
a[j + half] = (u + q - t) % q;
// Montgomery butterfly (no division in hot path)
let u = a[j];
let t = montgomery_reduce(w_mont * a[j + half]);
a[j] = u + t;            // lazy: stays in [0, 2q)
a[j + half] = u + q - t; // lazy: stays in [0, 2q)
H33 stores all NTT twiddle factors in Montgomery form at initialization. Combined with Harvey lazy reduction -- where intermediate butterfly values are allowed to stay in [0, 2q) between stages rather than being fully reduced to [0, q) -- this eliminates every division from the NTT hot path. The final reduction to [0, q) happens once, at the end of the transform, amortized over all log(n) stages.
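For readers who want the reduction itself spelled out, here is a self-contained sketch of Montgomery REDC with R = 2^64, shown on the small NTT-friendly prime 12289 for readability. The constants and helper names are illustrative; a production kernel would use its real modulus and precompute the inverse once:

```rust
const Q: u64 = 12289; // small NTT-friendly prime, for illustration only

// -Q^{-1} mod 2^64 via Newton iteration (valid for any odd Q);
// each iteration doubles the number of correct low bits.
fn neg_q_inv(q: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

// Computes x * 2^{-64} mod q for x < q * 2^64: one multiply, one add,
// one shift, one conditional subtract -- no division anywhere.
fn montgomery_reduce(x: u128, q: u64, qinv: u64) -> u64 {
    let m = (x as u64).wrapping_mul(qinv);
    let t = ((x + (m as u128) * (q as u128)) >> 64) as u64;
    if t >= q { t - q } else { t }
}

fn main() {
    let qinv = neg_q_inv(Q);
    // R^2 mod q, used to enter Montgomery form: to_mont(a) = REDC(a * R^2).
    let r = ((1u128 << 64) % Q as u128) as u64;
    let r2 = ((r as u128 * r as u128) % Q as u128) as u64;
    let (a, b) = (777u64, 4242u64);
    let a_mont = montgomery_reduce(a as u128 * r2 as u128, Q, qinv);
    // One Montgomery multiply: REDC(a_mont * b) = a * b mod q.
    let prod = montgomery_reduce(a_mont as u128 * b as u128, Q, qinv);
    assert_eq!(prod, (a * b) % Q);
    println!("777 * 4242 mod 12289 = {}", prod);
}
```

Keeping one factor (the twiddle) permanently in Montgomery form, as the butterfly above does, means the domain conversion cost is paid once at initialization rather than per multiply.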
Algorithmic Optimization
Structure your computation for FHE efficiency:
- Minimize multiplication depth: Additions are efficient; multiplications are expensive
- Use SIMD batching: Process thousands of values in parallel
- Precompute when possible: Move computation to setup phase
- Reduce rotation count: Rotations are costly in batched operations
// Inefficient: Deep multiplication chain
result = a * b * c * d; // Depth 3
// Better: Balanced tree
result = (a * b) * (c * d); // Depth 2
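The depth difference above generalizes: a left-to-right chain over k factors costs k - 1 levels, while a balanced tree costs only ceil(log2 k). A small sketch of that accounting (no actual FHE operations are performed):

```rust
// Multiplicative depth of a k-term ciphertext product.
fn chain_depth(k: u32) -> u32 {
    k.saturating_sub(1) // a * b * c * ... evaluated left to right
}

fn tree_depth(k: u32) -> u32 {
    // ceil(log2 k) for k >= 1: pair up factors at every level
    k.next_power_of_two().trailing_zeros()
}

fn main() {
    assert_eq!(chain_depth(4), 3); // a * b * c * d
    assert_eq!(tree_depth(4), 2);  // (a * b) * (c * d)
    println!("8 factors: chain depth {}, tree depth {}",
             chain_depth(8), tree_depth(8));
}
```

At depth budgets as tight as FHE's, this logarithmic saving often decides whether a computation fits the parameters at all.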
SIMD Batching: Amortizing Cost Across Users
BFV supports SIMD (Single Instruction, Multiple Data) batching through the Chinese Remainder Theorem. With N=4096 and a 128-dimensional biometric template, each ciphertext holds 4096 / 128 = 32 users. Every FHE operation on that ciphertext processes all 32 users simultaneously for essentially the same cost as processing one.
| Metric | Single User | 32-User Batch | Notes |
|---|---|---|---|
| FHE inner product | ~1,109 µs | ~1,109 µs | 32x throughput |
| ZKP lookup (DashMap) | 0.085 µs | ~2.7 µs | In-process, zero contention |
| Dilithium attestation | ~244 µs | ~244 µs | 1 sign+verify per batch |
| Total per auth | ~1,356 µs | ~42 µs | 32x amortization |
This is how H33 achieves 1.595 million authentications per second on a single Graviton4 instance: 96 parallel workers, each processing 32-user batches, with the full post-quantum stack -- BFV FHE, ZKP verification, and Dilithium attestation -- completing in approximately 1,356 microseconds per batch.
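The packing and amortization arithmetic can be sketched as follows. The contiguous slot layout and helper names are assumptions for illustration, not H33's actual wire format:

```rust
// Contiguous slot layout: user u's template occupies slots
// [u * TEMPLATE_LEN, (u + 1) * TEMPLATE_LEN). Assumed layout only.
const N_SLOTS: usize = 4096;
const TEMPLATE_LEN: usize = 128;

fn users_per_ciphertext() -> usize {
    N_SLOTS / TEMPLATE_LEN
}

fn slot_index(user: usize, dim: usize) -> usize {
    user * TEMPLATE_LEN + dim
}

fn amortized_latency_us(batch_latency_us: f64) -> f64 {
    batch_latency_us / users_per_ciphertext() as f64
}

fn main() {
    assert_eq!(users_per_ciphertext(), 32);
    assert_eq!(slot_index(31, 127), N_SLOTS - 1); // last user, last dim
    // ~1,356 µs per batch amortizes to ~42 µs per authentication.
    println!("{:.1} µs/auth", amortized_latency_us(1356.0));
}
```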
Hardware Acceleration
Modern hardware significantly accelerates FHE:
- AVX-512: Vector instructions speed up polynomial operations 4-8x
- Intel HEXL: Optimized NTT library for FHE
- GPU acceleration: Massive parallelism for independent operations
- Custom ASICs: Purpose-built hardware targeting speedups of up to 10,000x in research prototypes
Hardware choice matters more than most teams realize. H33's production runs on AWS Graviton4 (ARM aarch64), where the system allocator outperforms jemalloc by 8% on this workload's allocation pattern. NEON SIMD helps with the add/subtract/compare operations in Galois rotations, but it lacks a native 64x64-to-128-bit vector multiply, so the NTT butterflies stay scalar. The lesson: always benchmark on your actual production hardware. Microbenchmark wins on one architecture can become regressions under real workload contention.
Memory Optimization
FHE is memory-intensive. Optimize memory usage:
- Reuse ciphertext objects instead of allocating new ones
- Use memory pools for frequent allocations
- Consider lazy evaluation to reduce intermediate storage
- Profile memory access patterns for cache efficiency
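A minimal sketch of the buffer-reuse idea, assuming plain `Vec<u64>` coefficient buffers; a real pool would also need thread-safety and a capacity cap:

```rust
// Recycle coefficient buffers instead of allocating per operation.
// Illustrative only -- names and structure are not any library's API.
struct PolyPool {
    free: Vec<Vec<u64>>,
    n: usize, // ring degree: every buffer holds n coefficients
}

impl PolyPool {
    fn new(n: usize) -> Self {
        Self { free: Vec::new(), n }
    }

    fn acquire(&mut self) -> Vec<u64> {
        // Reuse a returned buffer if one exists, else allocate fresh.
        self.free.pop().unwrap_or_else(|| vec![0u64; self.n])
    }

    fn release(&mut self, mut buf: Vec<u64>) {
        buf.iter_mut().for_each(|c| *c = 0); // scrub before reuse
        self.free.push(buf);
    }
}

fn main() {
    let mut pool = PolyPool::new(4096);
    let buf = pool.acquire(); // fresh allocation
    pool.release(buf);        // returned and scrubbed
    let buf = pool.acquire(); // reused: no new allocation
    assert_eq!(buf.len(), 4096);
    println!("pool holds {} free buffers", pool.free.len());
}
```

Scrubbing on release is cheap insurance in a biometric system: recycled buffers never leak a previous user's coefficients.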
Cache behavior is particularly critical when running dozens of parallel workers. Fused optimizations that win microbenchmarks may pollute L1 cache under contention and actually increase latency in production. H33 learned this firsthand: a fused NTT pre-twist optimization showed a 15% gain in isolated benchmarks but added 24 microseconds per batch under 96-worker load due to extra lookup table pressure. Always validate optimizations under realistic concurrency.
Caching Strategies
Strategic caching eliminates redundant computation:
- Cache encrypted constants
- Store precomputed rotation keys
- Reuse evaluation keys across operations
- Cache intermediate results in repeated computations
Beyond FHE-internal caching, the verification pipeline itself benefits from result caching. H33 uses an in-process DashMap for ZKP proof lookups, achieving 0.085-microsecond access times. This replaced a TCP-based cache proxy that serialized all connections through a single container, causing an 11x throughput regression at scale. For single-instance deployments, in-process caching is nearly always faster than network-based caching, regardless of how fast the network layer claims to be.
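The in-process pattern can be sketched with the standard library alone. A single `RwLock<HashMap>` illustrates the idea; DashMap gets its contention-free reads by sharding the map across many such locks, so readers on different keys never serialize. The struct and key type here are hypothetical:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// In-process proof-result cache sketch: proof id -> verified?
// Illustrative stand-in for a sharded map like DashMap.
struct ProofCache {
    map: RwLock<HashMap<u64, bool>>,
}

impl ProofCache {
    fn new() -> Self {
        Self { map: RwLock::new(HashMap::new()) }
    }

    fn get(&self, id: u64) -> Option<bool> {
        self.map.read().unwrap().get(&id).copied()
    }

    fn put(&self, id: u64, ok: bool) {
        self.map.write().unwrap().insert(id, ok);
    }
}

fn main() {
    let cache = ProofCache::new();
    cache.put(42, true);
    assert_eq!(cache.get(42), Some(true));
    assert_eq!(cache.get(7), None);
    println!("cache hit: {:?}", cache.get(42));
}
```

Every lookup here is a memory access in the caller's own address space: no serialization, no socket, no proxy hop to regress under load.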
H33's Optimization Stack
Our production performance comes from the accumulation of every technique described above:
- Custom BFV implementation optimized for biometric distances
- SIMD batching: 32 users per ciphertext
- Montgomery NTT with Harvey lazy reduction -- zero division in the hot path
- NTT-domain persistence across keygen, enrollment, and verification
- Pre-computed delta*m values eliminating u128 multiplies from encrypt
- Batch Dilithium attestation: one sign+verify per 32-user batch
- In-process DashMap ZKP caching at 0.085 microseconds per lookup
- Careful parameter selection: N=4096, single 56-bit modulus, t=65537
No single optimization delivers the result. It is the compound effect of eliminating waste at every layer -- arithmetic, memory, concurrency, caching -- that turns seconds into microseconds.
FHE performance is no longer a barrier to production deployment. With proper optimization, you can achieve real-time encrypted computation. H33 proves this daily at 1.595 million authentications per second, with every biometric match fully encrypted under BFV, every proof verified, and every result attested with post-quantum Dilithium signatures.
Ready to Go Quantum-Secure?
Start protecting your users with post-quantum authentication today. 1,000 free auths, no credit card required.
Get Free API Key →