When we started optimizing H33's fully homomorphic encryption pipeline, our BFV encryption took 2.02 milliseconds. Today, it takes 331 microseconds. This is the story of how we achieved 84% faster encryption and became the fastest FHE library in the world.
The Starting Point: 2.02ms
Our initial implementation used a straightforward approach to BFV encryption. It worked correctly, passed all test vectors, and was reasonably fast. But "reasonably fast" isn't good enough when you're building quantum-resistant biometric authentication that needs to compete with traditional auth systems.
The bottleneck was clear: modular arithmetic. BFV encryption requires thousands of modular multiplications across large polynomial rings. Each multiplication involved division operations that were killing performance.
Optimization 1: Montgomery Multiplication
The first major breakthrough came from implementing Montgomery multiplication. Instead of performing expensive division operations for each modular reduction, Montgomery form converts multiplications into shifts and additions.
What is Montgomery Multiplication?
Traditional modular multiplication: a * b mod n requires division by n.
Montgomery multiplication pre-computes a special form where the modular reduction becomes a simple bit shift. The overhead of converting to/from Montgomery form is amortized across many operations.
This single change dropped our encryption time from 2.02ms to 351 microseconds - an 83% reduction. We were now at 17% of our original runtime.
Optimization 2: SIMD Parallelization
Montgomery multiplication made each operation faster, but we were still processing coefficients sequentially. Modern CPUs have wide SIMD registers (AVX-512 gives us 512 bits) that can process multiple coefficients simultaneously.
We restructured our polynomial operations to leverage:
- AVX-512 on x86: Process 8 x 64-bit coefficients per instruction
- NEON on ARM: Process 2 x 64-bit or 4 x 32-bit coefficients per instruction
- Parallel NTT: Number Theoretic Transform with vectorized butterfly operations
The parallel Montgomery implementation brought us from 351us to 330 microseconds. A modest 6% improvement, but we were already operating near the memory bandwidth limit.
Final Results: World's Fastest FHE
Final Performance
2.02ms → 331us (84% faster, 6.1x improvement)
vs Microsoft SEAL: 4.5x faster at equivalent security parameters
Status: #1 worldwide for BFV encryption speed
Comparison with Microsoft SEAL
| Operation | H33 | SEAL | H33 Advantage |
|---|---|---|---|
| BFV Encrypt (N=4096) | 331us | 5.98ms | 4.5x faster |
| Memory Allocation | Pre-pooled | Per-operation | Zero alloc |
| SIMD Support | AVX-512 + NEON | AVX-512 | Wider vectors |
Key Takeaways
- Montgomery multiplication is essential - It provided 83% of our total speedup. If you're doing FHE without Montgomery form, you're leaving massive performance on the table.
- SIMD is the final mile - Once arithmetic is optimized, vectorization squeezes out the remaining gains.
- Memory matters - We didn't cover it here, but our pre-pooled memory allocator eliminates allocation overhead entirely.
This is just one piece of the H33 optimization story. In the next post, we'll cover how we achieved 65% faster STARK proving through NEON NTT, parallel Merkle trees, and batch inversion.
Try the World's Fastest FHE
331 microsecond encryption. NIST-compliant security. Production ready.
Get API Key