FHE has a reputation for being slow, but modern implementations achieve remarkable performance. H33 performs complete FHE biometric verification in 2,648 microseconds. Here's how we optimize FHE for production use — see the full numbers on our benchmarks page.
Understanding FHE Overhead
FHE operations are inherently more expensive than plaintext operations due to:
- Large ciphertext sizes (kilobytes vs bytes)
- Complex polynomial arithmetic
- Noise management overhead
- Key switching operations
However, careful optimization can reduce this overhead dramatically. The gap between a naive FHE implementation and a production-tuned one is often three or four orders of magnitude. Every layer of the stack presents opportunities for improvement, from the abstract algorithm design down to the specific CPU instructions executing the arithmetic.
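To make the size overhead concrete, here is a back-of-the-envelope sketch using the parameters discussed later in this post (N=4096, one modulus, coefficients stored as u64). The helper name and the two-polynomial assumption are illustrative, not a description of any particular library's memory layout:

```rust
// Rough ciphertext-expansion estimate for a small BFV configuration.
// Assumes each coefficient is stored as a u64 and a fresh ciphertext
// holds two polynomials (c0, c1); figures are illustrative only.
fn ciphertext_bytes(n: usize, moduli: usize, polys: usize) -> usize {
    n * moduli * polys * std::mem::size_of::<u64>()
}

fn main() {
    // N = 4096, one modulus, two polynomials -> 64 KiB per ciphertext.
    let bytes = ciphertext_bytes(4096, 1, 2);
    println!("fresh ciphertext: {} KiB", bytes / 1024);
    // A 128-dimensional template of u16 features is ~256 bytes plaintext,
    // so the raw expansion factor is on the order of 256x.
    println!("expansion vs 256-byte template: {}x", bytes / 256);
}
```

Even at these deliberately small parameters, a single ciphertext is kilobytes, which is why the batching and memory techniques below matter so much.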
Parameter Optimization
FHE parameters directly impact performance:
Key Parameters
- Polynomial degree (N): Higher = more security but slower
- Coefficient modulus: Larger = more multiplication depth but slower
- Plaintext modulus: Affects encoding efficiency
Choose the minimum parameters that meet your security requirements. Over-provisioning wastes performance. H33 uses N=4096 with a single 56-bit modulus and plaintext modulus t=65537 for its BFV biometric mode. This configuration satisfies the CRT batching condition (t is congruent to 1 mod 2N), enabling SIMD slot packing while keeping the polynomial ring small enough for sub-millisecond operations. Doubling N to 8192 would quadruple NTT cost for security margin that biometric verification does not require.
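The batching condition mentioned above is easy to check directly. This is a minimal sketch of the arithmetic, not library code:

```rust
// CRT batching condition: t ≡ 1 (mod 2N) makes the plaintext ring
// Z_t[X]/(X^N + 1) split into N independent slots. Sketch only.
fn supports_batching(t: u64, n: u64) -> bool {
    t % (2 * n) == 1
}

fn main() {
    assert!(supports_batching(65537, 4096));  // 65537 = 16 * 4096 + 1
    assert!(!supports_batching(65536, 4096)); // an even t can never batch
    println!("t = 65537, N = 4096: batching ok");
}
```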
NTT-Domain Persistence
The Number Theoretic Transform is the single largest performance lever in any FHE system. NTT converts polynomial multiplication from O(n^2) to O(n log n), but even O(n log n) adds up when you perform dozens of transforms per operation. The key insight is to avoid unnecessary domain conversions.
Keep data in NTT (frequency) domain as long as possible. Every forward/inverse NTT pair you eliminate saves two full O(n log n) passes over your polynomial ring.
In H33's pipeline, secret keys are stored in NTT form from the moment of keygen. Public key components (pk0) are pre-transformed at key generation time, eliminating a clone-and-transform on every encrypt call. Enrolled biometric templates are also stored in NTT form, so multiply_plain_accumulate skips the forward NTT entirely. The result: our multiply_plain_ntt() path saves 2*M transforms per call, where M is the moduli count. On H33-256, this single optimization reduced batch latency from 1,375 to 1,080 microseconds -- a 21% improvement.
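The accounting behind that 2*M saving can be sketched with a toy counter. All polynomial arithmetic is elided here, and the assumption is that both the ciphertext and the result stay in the NTT domain:

```rust
// Count NTT passes in a multiply_plain call. With the plaintext operand
// persisted in NTT form and the product kept in NTT form, both the
// forward pass (M moduli) and the inverse pass (M moduli) disappear.
fn ntt_passes(moduli: usize, operand_persisted: bool) -> usize {
    if operand_persisted {
        0 // pointwise multiply only; everything stays in the NTT domain
    } else {
        2 * moduli // forward NTT of operand + inverse NTT of product
    }
}

fn main() {
    let m = 1; // H33-256: a single 56-bit modulus
    let saved = ntt_passes(m, false) - ntt_passes(m, true);
    println!("transforms saved per call: {}", saved); // 2 * M
}
```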
Montgomery Arithmetic and Lazy Reduction
Division is the most expensive integer operation on modern CPUs. Classical modular reduction requires a division after every multiply, which dominates NTT butterfly cost. Montgomery multiplication avoids division entirely by working in a transformed domain where reduction becomes a multiply, an addition, and a shift.
// Classical NTT butterfly (a division hides in every % q)
let u = a[j];
let t = (w * a[j + half]) % q;
a[j] = (u + t) % q;
a[j + half] = (u + q - t) % q;
// Montgomery butterfly (no division in hot path)
let u = a[j];
let t = montgomery_reduce(w_mont * a[j + half]);
a[j] = u + t;            // lazy: stays in [0, 2q)
a[j + half] = u + q - t; // lazy: stays in [0, 2q)
H33 stores all NTT twiddle factors in Montgomery form at initialization. Combined with Harvey lazy reduction -- where intermediate butterfly values are allowed to stay in [0, 2q) between stages rather than being fully reduced to [0, q) -- this eliminates every division from the NTT hot path. The final reduction to [0, q) happens once, at the end of the transform, amortized over all log(n) stages.
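For readers who want the reduction itself spelled out, here is a self-contained sketch of Montgomery REDC with R = 2^64, shown on the small NTT-friendly prime 12289 for readability. The constants and helper names are illustrative; a production kernel would use its real modulus and precompute the inverse once:

```rust
const Q: u64 = 12289; // small NTT-friendly prime, for illustration only

// -Q^{-1} mod 2^64 via Newton iteration (valid for any odd Q);
// each iteration doubles the number of correct low bits.
fn neg_q_inv(q: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

// Computes x * 2^{-64} mod q for x < q * 2^64: one multiply, one add,
// one shift, one conditional subtract -- no division anywhere.
fn montgomery_reduce(x: u128, q: u64, qinv: u64) -> u64 {
    let m = (x as u64).wrapping_mul(qinv);
    let t = ((x + (m as u128) * (q as u128)) >> 64) as u64;
    if t >= q { t - q } else { t }
}

fn main() {
    let qinv = neg_q_inv(Q);
    // R^2 mod q, used to enter Montgomery form: to_mont(a) = REDC(a * R^2).
    let r = ((1u128 << 64) % Q as u128) as u64;
    let r2 = ((r as u128 * r as u128) % Q as u128) as u64;
    let (a, b) = (777u64, 4242u64);
    let a_mont = montgomery_reduce(a as u128 * r2 as u128, Q, qinv);
    // One Montgomery multiply: REDC(a_mont * b) = a * b mod q.
    let prod = montgomery_reduce(a_mont as u128 * b as u128, Q, qinv);
    assert_eq!(prod, (a * b) % Q);
    println!("777 * 4242 mod 12289 = {}", prod);
}
```

Keeping one factor (the twiddle) permanently in Montgomery form, as the butterfly above does, means the domain conversion cost is paid once at initialization rather than per multiply.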
Algorithmic Optimization
Structure your computation for FHE efficiency:
- Minimize multiplication depth: Additions are efficient; multiplications are expensive
- Use SIMD batching: Process thousands of values in parallel
- Precompute when possible: Move computation to setup phase
- Reduce rotation count: Rotations are costly in batched operations
// Inefficient: Deep multiplication chain
result = a * b * c * d; // Depth 3
// Better: Balanced tree
result = (a * b) * (c * d); // Depth 2
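The depth difference above generalizes: a left-to-right chain over k factors costs k - 1 levels, while a balanced tree costs only ceil(log2 k). A small sketch of that accounting (no actual FHE operations are performed):

```rust
// Multiplicative depth of a k-term ciphertext product.
fn chain_depth(k: u32) -> u32 {
    k.saturating_sub(1) // a * b * c * ... evaluated left to right
}

fn tree_depth(k: u32) -> u32 {
    // ceil(log2 k) for k >= 1: pair up factors at every level
    k.next_power_of_two().trailing_zeros()
}

fn main() {
    assert_eq!(chain_depth(4), 3); // a * b * c * d
    assert_eq!(tree_depth(4), 2);  // (a * b) * (c * d)
    println!("8 factors: chain depth {}, tree depth {}",
             chain_depth(8), tree_depth(8));
}
```

At depth budgets as tight as FHE's, this logarithmic saving often decides whether a computation fits the parameters at all.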
SIMD Batching: Amortizing Cost Across Users
BFV supports SIMD (Single Instruction, Multiple Data) batching through the Chinese Remainder Theorem. With N=4096 and a 128-dimensional biometric template, each ciphertext holds 4096 / 128 = 32 users. Every FHE operation on that ciphertext processes all 32 users simultaneously for essentially the same cost as processing one.
| Metric | Single User | 32-User Batch | Notes |
|---|---|---|---|
| FHE inner product | ~1,109 µs | ~1,109 µs | 32x throughput |
| ZKP lookup (DashMap) | 0.085 µs | ~2.7 µs | In-process, zero contention |
| Dilithium attestation | ~244 µs | ~244 µs | 1 sign+verify per batch |
| Total per auth | ~1,356 µs | ~42 µs | 32x amortization |
This is how H33 achieves 1.595 million authentications per second on a single Graviton4 instance: 96 parallel workers, each processing 32-user batches, with the full post-quantum stack -- BFV FHE, ZKP verification, and Dilithium attestation -- completing in approximately 1,356 microseconds per batch.
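The packing and amortization arithmetic can be sketched as follows. The contiguous slot layout and helper names are assumptions for illustration, not H33's actual wire format:

```rust
// Contiguous slot layout: user u's template occupies slots
// [u * TEMPLATE_LEN, (u + 1) * TEMPLATE_LEN). Assumed layout only.
const N_SLOTS: usize = 4096;
const TEMPLATE_LEN: usize = 128;

fn users_per_ciphertext() -> usize {
    N_SLOTS / TEMPLATE_LEN
}

fn slot_index(user: usize, dim: usize) -> usize {
    user * TEMPLATE_LEN + dim
}

fn amortized_latency_us(batch_latency_us: f64) -> f64 {
    batch_latency_us / users_per_ciphertext() as f64
}

fn main() {
    assert_eq!(users_per_ciphertext(), 32);
    assert_eq!(slot_index(31, 127), N_SLOTS - 1); // last user, last dim
    // ~1,356 µs per batch amortizes to ~42 µs per authentication.
    println!("{:.1} µs/auth", amortized_latency_us(1356.0));
}
```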
Hardware Acceleration
Modern hardware significantly accelerates FHE:
- AVX-512: Vector instructions speed up polynomial operations 4-8x
- Intel HEXL: Optimized NTT library for FHE
- GPU acceleration: Massive parallelism for independent operations
- Custom ASICs: Purpose-built hardware targeting speedups of up to 10,000x in research prototypes
Hardware choice matters more than most teams realize. H33's production runs on AWS Graviton4 (ARM aarch64), where the system allocator outperforms jemalloc by 8% on this workload's allocation pattern. NEON SIMD helps with the add/subtract/compare operations in Galois rotations, but it lacks a native 64x64-to-128-bit vector multiply, so the NTT butterflies stay scalar. The lesson: always benchmark on your actual production hardware. Microbenchmark wins on one architecture can become regressions under real workload contention.
Memory Optimization
FHE is memory-intensive. Optimize memory usage:
- Reuse ciphertext objects instead of allocating new ones
- Use memory pools for frequent allocations
- Consider lazy evaluation to reduce intermediate storage
- Profile memory access patterns for cache efficiency
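A minimal sketch of the buffer-reuse idea, assuming plain `Vec<u64>` coefficient buffers; a real pool would also need thread-safety and a capacity cap:

```rust
// Recycle coefficient buffers instead of allocating per operation.
// Illustrative only -- names and structure are not any library's API.
struct PolyPool {
    free: Vec<Vec<u64>>,
    n: usize, // ring degree: every buffer holds n coefficients
}

impl PolyPool {
    fn new(n: usize) -> Self {
        Self { free: Vec::new(), n }
    }

    fn acquire(&mut self) -> Vec<u64> {
        // Reuse a returned buffer if one exists, else allocate fresh.
        self.free.pop().unwrap_or_else(|| vec![0u64; self.n])
    }

    fn release(&mut self, mut buf: Vec<u64>) {
        buf.iter_mut().for_each(|c| *c = 0); // scrub before reuse
        self.free.push(buf);
    }
}

fn main() {
    let mut pool = PolyPool::new(4096);
    let buf = pool.acquire(); // fresh allocation
    pool.release(buf);        // returned and scrubbed
    let buf = pool.acquire(); // reused: no new allocation
    assert_eq!(buf.len(), 4096);
    println!("pool holds {} free buffers", pool.free.len());
}
```

Scrubbing on release is cheap insurance in a biometric system: recycled buffers never leak a previous user's coefficients.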
Cache behavior is particularly critical when running dozens of parallel workers. Fused optimizations that win microbenchmarks may pollute L1 cache under contention and actually increase latency in production. H33 learned this firsthand: a fused NTT pre-twist optimization showed a 15% gain in isolated benchmarks but added 24 microseconds per batch under 96-worker load due to extra lookup table pressure. Always validate optimizations under realistic concurrency.
Caching Strategies
Strategic caching eliminates redundant computation:
- Cache encrypted constants
- Store precomputed rotation keys
- Reuse evaluation keys across operations
- Cache intermediate results in repeated computations
Beyond FHE-internal caching, the verification pipeline itself benefits from result caching. H33 uses an in-process DashMap for ZKP proof lookups, achieving 0.085-microsecond access times. This replaced a TCP-based cache proxy that serialized all connections through a single container, causing an 11x throughput regression at scale. For single-instance deployments, in-process caching is nearly always faster than network-based caching, regardless of how fast the network layer claims to be.
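The in-process pattern can be sketched with the standard library alone. A single `RwLock<HashMap>` illustrates the idea; DashMap gets its contention-free reads by sharding the map across many such locks, so readers on different keys never serialize. The struct and key type here are hypothetical:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// In-process proof-result cache sketch: proof id -> verified?
// Illustrative stand-in for a sharded map like DashMap.
struct ProofCache {
    map: RwLock<HashMap<u64, bool>>,
}

impl ProofCache {
    fn new() -> Self {
        Self { map: RwLock::new(HashMap::new()) }
    }

    fn get(&self, id: u64) -> Option<bool> {
        self.map.read().unwrap().get(&id).copied()
    }

    fn put(&self, id: u64, ok: bool) {
        self.map.write().unwrap().insert(id, ok);
    }
}

fn main() {
    let cache = ProofCache::new();
    cache.put(42, true);
    assert_eq!(cache.get(42), Some(true));
    assert_eq!(cache.get(7), None);
    println!("cache hit: {:?}", cache.get(42));
}
```

Every lookup here is a memory access in the caller's own address space: no serialization, no socket, no proxy hop to regress under load.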
H33's Optimization Stack
Our production performance comes from the accumulation of every technique described above:
- Custom BFV implementation optimized for biometric distances
- SIMD batching: 32 users per ciphertext
- Montgomery NTT with Harvey lazy reduction -- zero division in the hot path
- NTT-domain persistence across keygen, enrollment, and verification
- Pre-computed delta*m values eliminating u128 multiplies from encrypt
- Batch Dilithium attestation: one sign+verify per 32-user batch
- In-process DashMap ZKP caching at 0.085 microseconds per lookup
- Careful parameter selection: N=4096, single 56-bit modulus, t=65537
No single optimization delivers the result. It is the compound effect of eliminating waste at every layer -- arithmetic, memory, concurrency, caching -- that turns seconds into microseconds.
FHE performance is no longer a barrier to production deployment. With proper optimization, you can achieve real-time encrypted computation. H33 proves this daily at 1.595 million authentications per second, with every biometric match fully encrypted under BFV, every proof verified, and every result attested with post-quantum Dilithium signatures.
Ready to Go Quantum-Secure?
Start protecting your users with post-quantum authentication today. 1,000 free auths, no credit card required.
Get Free API Key →