When we started optimizing H33's fully homomorphic encryption pipeline, our BFV encryption took 2.02 milliseconds. Today, it takes 331 microseconds. This is the story of how we achieved 84% faster encryption and became the fastest FHE library in the world -- and how that single-encrypt speedup compounds into a production system processing 1.595 million authentications per second. See the full results on our benchmarks page.
The Starting Point: 2.02ms
Our initial implementation used a straightforward approach to BFV encryption. It worked correctly, passed all test vectors, and was reasonably fast. But "reasonably fast" isn't good enough when you're building quantum-resistant biometric authentication that needs to compete with traditional auth systems.
The bottleneck was clear: modular arithmetic. BFV encryption requires thousands of modular multiplications across large polynomial rings. Each multiplication involved division operations that were killing performance. With N=4096 polynomial degree and a 56-bit ciphertext modulus, our encrypt path was executing roughly 4,096 modular multiplications per NTT butterfly pass, across multiple residue limbs. The div instruction on modern x86 takes 30-90 cycles depending on operand size -- and we were calling it millions of times per second.
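For reference, the baseline hot-path operation looked essentially like this (a minimal sketch, not our production kernel) -- the `%` on the 128-bit product is what lowers to that `div`:

```rust
/// Naive modular multiplication: correct, but the `%` on the 128-bit
/// product compiles to a hardware division -- the cost described above,
/// paid on every coefficient.
fn mul_mod_naive(a: u64, b: u64, q: u64) -> u64 {
    ((a as u128 * b as u128) % q as u128) as u64
}

fn main() {
    // 12289 is a well-known small NTT-friendly prime, used here only to
    // keep the example readable; the production modulus is 56 bits.
    println!("{}", mul_mod_naive(1234, 5678, 12289));
}
```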
Optimization 1: Montgomery Multiplication
The first major breakthrough came from implementing Montgomery multiplication. Instead of performing an expensive division for each modular reduction, Montgomery form replaces the reduction with cheap multiplications, additions, and a bit shift.
What is Montgomery Multiplication?
Traditional modular multiplication: a * b mod n requires division by n.
Montgomery multiplication pre-computes constants so that each modular reduction becomes a multiply-add and a bit shift, plus at most one conditional subtraction. The overhead of converting to/from Montgomery form is amortized across many operations.
The key insight is that BFV encryption is dominated by NTT -- the Number Theoretic Transform. A single NTT of degree N=4096 requires N/2 * log2(N) = 24,576 butterfly operations, each containing a modular multiply. By keeping all NTT twiddle factors in Montgomery form and performing the entire forward-inverse NTT cycle without leaving Montgomery domain, we eliminate the conversion overhead entirely. The secret key, once generated, stays in NTT-Montgomery form for the lifetime of the system.
```rust
// Montgomery reduction (REDC) -- no division.
// Precondition: q_inv = -q^{-1} mod 2^64, q < 2^63, and t < q * 2^64.
fn mont_reduce(t: u128, q: u64, q_inv: u64) -> u64 {
    let m = (t as u64).wrapping_mul(q_inv);
    let u = (t + m as u128 * q as u128) >> 64;
    if u >= q as u128 { (u - q as u128) as u64 } else { u as u64 }
}
```

This single change dropped our encryption time from 2.02ms to 351 microseconds -- an 83% reduction. We were now at 17% of our original runtime.
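The REDC routine needs two precomputed constants: `-q^{-1} mod 2^64` and `R^2 mod q` (where `R = 2^64`). Here is a sketch of that setup and a full multiply round-trip, using the small NTT-friendly prime 12289 to keep the demo readable; the helper names are illustrative, not H33's actual API:

```rust
// Newton iteration for q^{-1} mod 2^64 (q must be odd); each step doubles
// the number of correct low bits: 1 -> 2 -> 4 -> ... -> 64.
fn neg_q_inv(q: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg() // REDC wants -q^{-1} mod 2^64
}

fn mont_reduce(t: u128, q: u64, q_inv: u64) -> u64 {
    let m = (t as u64).wrapping_mul(q_inv);
    let u = (t + m as u128 * q as u128) >> 64;
    if u >= q as u128 { (u - q as u128) as u64 } else { u as u64 }
}

fn main() {
    let q: u64 = 12289; // small NTT-friendly demo prime; production uses 56 bits
    let q_inv = neg_q_inv(q);
    // R mod q and R^2 mod q, computed once at key setup -- the only place
    // a real division ever happens.
    let r = ((1u128 << 64) % q as u128) as u64;
    let r2 = ((r as u128 * r as u128) % q as u128) as u64;

    let (a, b) = (1234u64, 5678u64);
    // to_mont: REDC(a * R^2) = a * R mod q
    let a_m = mont_reduce(a as u128 * r2 as u128, q, q_inv);
    let b_m = mont_reduce(b as u128 * r2 as u128, q, q_inv);
    // Multiply entirely inside the Montgomery domain -- no division anywhere.
    let c_m = mont_reduce(a_m as u128 * b_m as u128, q, q_inv);
    // from_mont: REDC(c_m * 1) = c * R^{-1} mod q
    let c = mont_reduce(c_m as u128, q, q_inv);
    assert_eq!(c, (a as u128 * b as u128 % q as u128) as u64);
    println!("{c}");
}
```

The conversion cost (one extra REDC on the way in and out) is exactly what staying in Montgomery-NTT form for the lifetime of the key avoids.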
Optimization 2: SIMD Parallelization
Montgomery multiplication made each operation faster, but we were still processing coefficients sequentially. Modern CPUs have wide SIMD registers (AVX-512 gives us 512 bits) that can process multiple coefficients simultaneously.
We restructured our polynomial operations to leverage:
- AVX-512 on x86: Process 8 x 64-bit coefficients per instruction
- NEON on ARM: Process 2 x 64-bit or 4 x 32-bit coefficients per instruction
- Parallel NTT: Number Theoretic Transform with vectorized butterfly operations
The parallel Montgomery implementation brought us from 351 to 331 microseconds. A modest 6% improvement, but we were already operating near the memory bandwidth limit.
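The restructuring is mostly about data layout. A portable sketch of the idea (the production kernels use explicit AVX-512/NEON intrinsics; this version just shows the chunked structure the compiler can auto-vectorize):

```rust
/// Coefficient-wise modular addition over chunks of 8 lanes -- the width
/// of one AVX-512 register holding 64-bit coefficients. The conditional
/// subtract can lower to vector compare/sub, so the inner loop is a good
/// auto-vectorization candidate. Inputs are assumed fully reduced (< q).
fn poly_add_mod(a: &mut [u64], b: &[u64], q: u64) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len() % 8, 0); // N = 4096 is always a multiple of 8
    for (ca, cb) in a.chunks_exact_mut(8).zip(b.chunks_exact(8)) {
        for i in 0..8 {
            let s = ca[i] + cb[i];
            ca[i] = if s >= q { s - q } else { s };
        }
    }
}

fn main() {
    let q = 12289u64; // small demo modulus
    let mut a = vec![12288u64, 1, 2, 3, 4, 5, 6, 7];
    let b = vec![1u64; 8];
    poly_add_mod(&mut a, &b, q);
    println!("{a:?}");
}
```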
On ARM Graviton4 (our production hardware), NEON lacks a native 64x64-to-128-bit multiply instruction. The NEON butterfly extracts to scalar, performs the u128 multiply, then repacks -- making NEON useful primarily for add/sub/compare vectorization in the NTT, not for the multiplication core itself. This architectural detail drove us toward Harvey lazy reduction (keeping butterfly values in [0, 2q) between stages) rather than deeper SIMD fusion.
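A sketch of the resulting butterfly (illustrative names, small demo modulus): data stays in [0, 2q) between stages, twiddles live in Montgomery form, and the only reductions are conditional subtracts.

```rust
fn neg_q_inv(q: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

fn mont_reduce(t: u128, q: u64, q_inv: u64) -> u64 {
    let m = (t as u64).wrapping_mul(q_inv);
    let u = (t + m as u128 * q as u128) >> 64;
    if u >= q as u128 { (u - q as u128) as u64 } else { u as u64 }
}

/// One forward Cooley-Tukey butterfly with Harvey-style lazy reduction.
/// Inputs x, y in [0, 2q); outputs in [0, 2q). w_mont is the twiddle in
/// Montgomery form, so REDC cancels the R factor and returns w*y mod q.
fn butterfly_lazy(x: u64, y: u64, w_mont: u64, q: u64, q_inv: u64) -> (u64, u64) {
    let xr = if x >= q { x - q } else { x }; // single conditional subtract
    let t = mont_reduce(y as u128 * w_mont as u128, q, q_inv); // in [0, q)
    (xr + t, xr + q - t) // x + w*y and x - w*y, both kept in [0, 2q)
}

fn main() {
    let q: u64 = 12289; // small NTT-friendly demo prime
    let q_inv = neg_q_inv(q);
    let r = ((1u128 << 64) % q as u128) as u64; // R mod q
    let w_mont = ((3u128 * r as u128) % q as u128) as u64; // twiddle w = 3
    let (xp, yp) = butterfly_lazy(100, 200, w_mont, q, q_inv);
    assert!(xp < 2 * q && yp < 2 * q); // lazy invariant holds
    println!("{} {}", xp % q, yp % q); // 700 and 11789 (= 100 - 600 mod q)
}
```

Deferring the final reduction to the end of the transform trades a handful of conditional subtracts for the SIMD fusion NEON cannot provide.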
Optimization 3: SIMD Batching and NTT-Domain Persistence
The encryption speedup alone does not tell the full production story. H33 uses CRT-based SIMD batching to pack 32 biometric templates into a single ciphertext. With N=4096 polynomial slots and 128-dimensional feature vectors, each BFV ciphertext carries an entire batch of users, all operated on while still encrypted. This means the 331-microsecond encrypt cost is amortized across 32 authentications.
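The packing layout is easy to picture. A hypothetical sketch (contiguous per-user slot ranges; the actual CRT encoding into a plaintext polynomial is done by the batch encoder, not shown):

```rust
const N_SLOTS: usize = 4096; // polynomial degree = number of plaintext slots
const DIMS: usize = 128;     // biometric feature-vector dimension
const USERS_PER_CT: usize = N_SLOTS / DIMS; // 32 users per ciphertext

/// Hypothetical layout: user `u` owns the contiguous slot range
/// [u * 128, (u + 1) * 128).
fn slot_index(user: usize, dim: usize) -> usize {
    debug_assert!(user < USERS_PER_CT && dim < DIMS);
    user * DIMS + dim
}

/// Flatten a batch of templates into one slot vector, ready for the
/// batch encoder. One encrypt then covers every user in the batch.
fn pack_batch(templates: &[[i64; DIMS]]) -> Vec<i64> {
    let mut slots = vec![0i64; N_SLOTS];
    for (u, t) in templates.iter().enumerate() {
        for (d, &v) in t.iter().enumerate() {
            slots[slot_index(u, d)] = v;
        }
    }
    slots
}

fn main() {
    let mut users = vec![[0i64; DIMS]; 2];
    users[1][0] = 7; // user 1, feature 0
    let slots = pack_batch(&users);
    println!("{}", slots[128]); // user 1's range starts at slot 128
}
```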
We also restructured the verify pipeline to keep intermediate ciphertexts in NTT form wherever possible. The function multiply_plain_ntt() returns ciphertexts with is_ntt_form: true, skipping two inverse NTT transforms per modulus per call. This dropped our batch FHE inner-product time from 1,375 microseconds to ~1,109 microseconds -- a 19.3% reduction at the batch level.
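The bookkeeping behind that is simple to sketch (stub types here, not H33's real internals -- only the `is_ntt_form` flag and the deferred inverse transforms are the point):

```rust
#[derive(Clone)]
struct Ciphertext {
    slots: Vec<u64>,    // stand-in for the (c0, c1) polynomial pair
    is_ntt_form: bool,
}

/// Plaintext multiply with both operands already in NTT form: polynomial
/// multiplication degenerates to slot-wise modular multiply, and the
/// result is returned *still* in NTT form -- the inverse NTTs are skipped
/// until some later step actually needs coefficient representation.
fn multiply_plain_ntt(ct: &Ciphertext, plain_ntt: &[u64], q: u64) -> Ciphertext {
    assert!(ct.is_ntt_form, "operand must be in NTT form");
    let slots = ct.slots.iter().zip(plain_ntt)
        .map(|(&a, &b)| ((a as u128 * b as u128) % q as u128) as u64)
        .collect();
    Ciphertext { slots, is_ntt_form: true }
}

fn main() {
    let ct = Ciphertext { slots: vec![2, 3, 4], is_ntt_form: true };
    let out = multiply_plain_ntt(&ct, &[5, 5, 5], 12289);
    println!("{} {:?}", out.is_ntt_form, out.slots);
}
```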
| Pipeline Stage | Latency | Post-Quantum Secure |
|---|---|---|
| FHE Batch (32 users) | ~1,109 µs | Yes (lattice-based) |
| ZKP Lookup | 0.085 µs | Yes (SHA3-256) |
| Dilithium Attestation | ~244 µs | Yes (ML-DSA) |
| Per Authentication | ~42 µs | Fully PQ |
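The ~42 µs per-authentication figure is just the batch stages amortized over 32 users. A back-of-envelope check against the table (assuming one Dilithium attestation per batch, which is how the numbers line up):

```rust
fn main() {
    let users_per_batch = 32.0_f64;
    let fhe_batch_us = 1109.0;  // FHE inner product, whole 32-user batch
    let dilithium_us = 244.0;   // assumed amortized once per batch
    let zkp_lookup_us = 0.085;  // paid per authentication
    let per_auth = (fhe_batch_us + dilithium_us) / users_per_batch + zkp_lookup_us;
    println!("{per_auth:.1} us per authentication"); // ~42.4
}
```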
Dead Ends: What Did Not Work
Not every optimization attempt was successful. Two strategies that looked promising in microbenchmarks failed catastrophically in production at 96-worker concurrency on Graviton4:
- Fused NTT pre-twist tables: Pre-computing combined twiddle-and-twist factors showed a 15% win in isolated benchmarks on M4 Max. But in production, the extra lookup table polluted L1 cache under 96-worker contention, adding 24 microseconds per batch. Net negative.
- jemalloc allocator: On x86 servers, jemalloc typically improves multi-threaded throughput. On Graviton4 with its flat ARM memory model, glibc malloc is already heavily optimized. jemalloc's arena bookkeeping added pure overhead, causing an 8% throughput regression (36,994 vs 40,350 batch/sec).
These failures underscore an important lesson: FHE optimization is architecture-specific. A technique that wins on laptop silicon may lose on server-grade ARM, and vice versa. Always benchmark on your production target.
Final Results: World's Fastest FHE
Final Performance
2.02ms → 331us (84% faster, 6.1x improvement)
vs Microsoft SEAL: 4.5x faster at equivalent security parameters
Status: #1 worldwide for BFV encryption speed
Comparison with Microsoft SEAL
| Operation | H33 | SEAL | H33 Advantage |
|---|---|---|---|
| BFV Encrypt (N=4096) | 331us | 5.98ms | 4.5x faster |
| Memory Allocation | Pre-pooled | Per-operation | Zero alloc |
| SIMD Support | AVX-512 + NEON | AVX-512 | Wider platform coverage |
| Batch Throughput (96 cores) | 2.17M auth/sec | N/A | Production proven |
| Post-Quantum Attestation | Dilithium (ML-DSA) | None | Full PQ stack |
Key Takeaways
- Montgomery multiplication is essential -- it alone cut our runtime by 83%. If you're doing FHE without Montgomery form, you're leaving massive performance on the table.
- SIMD is the final mile -- Once arithmetic is optimized, vectorization squeezes out the remaining gains, but beware of architecture-specific pitfalls like missing 64-bit NEON multiply.
- Batch amortization matters more than single-op speed -- Packing 32 users per ciphertext turns a 1,109-microsecond batch operation into ~42 microseconds per authentication. System-level design amplifies micro-level optimizations.
- Profile on production hardware -- Cache behavior, allocator performance, and memory models vary dramatically between M4 Max, Xeon, and Graviton4. Microbenchmarks lie.
- Memory matters -- Our pre-pooled memory allocator eliminates allocation overhead entirely, and keeping keys in NTT-Montgomery form avoids redundant transforms on every request.
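For the memory point, a minimal sketch of the pre-pooled idea (a toy pool under stated assumptions, not H33's allocator): buffers are carved out once at startup and recycled, so steady-state requests never touch the system allocator.

```rust
use std::sync::Mutex;

/// Fixed-size pool of polynomial scratch buffers. `acquire` pops a
/// recycled buffer; `release` scrubs it and returns it to the pool.
struct PolyPool {
    free: Mutex<Vec<Vec<u64>>>,
    len: usize,
}

impl PolyPool {
    fn new(count: usize, len: usize) -> Self {
        // All allocation happens here, once, at startup.
        let bufs = (0..count).map(|_| vec![0u64; len]).collect();
        PolyPool { free: Mutex::new(bufs), len }
    }
    fn acquire(&self) -> Vec<u64> {
        self.free.lock().unwrap().pop()
            .unwrap_or_else(|| vec![0u64; self.len]) // cold-path fallback
    }
    fn release(&self, mut buf: Vec<u64>) {
        buf.fill(0); // scrub secret-dependent data before reuse
        self.free.lock().unwrap().push(buf);
    }
}

fn main() {
    let pool = PolyPool::new(4, 4096); // 4 buffers of N = 4096 coefficients
    let a = pool.acquire();
    println!("{}", a.len());
    pool.release(a);
}
```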
This is just one piece of the H33 optimization story. In the next post, we'll cover how we achieved 65% faster STARK proving through NEON NTT, parallel Merkle trees, and batch inversion.
Try the World's Fastest FHE
331 microsecond encryption. 2.17M auth/sec at scale. NIST-compliant post-quantum security. Production ready.
Get API Key