Engineering Deep Dive · 8 min read

FHE Encryption Optimization:
2.02ms to 331us (84% Faster)

How we achieved 6.1x faster FHE encryption through Montgomery multiplication and SIMD parallelization. From 2.02ms to 331 microseconds.

~50µs Auth Latency · 1.2M/s Throughput · 128-bit Security · Zero Plaintext

When we started optimizing H33's fully homomorphic encryption pipeline, our BFV encryption took 2.02 milliseconds. Today, it takes 331 microseconds. This is the story of how we achieved 84% faster encryption and became the fastest FHE library in the world.

84% Faster · 6.1x vs Baseline · 4.5x vs SEAL

The Starting Point: 2.02ms

Our initial implementation used a straightforward approach to BFV encryption. It worked correctly, passed all test vectors, and was reasonably fast. But "reasonably fast" isn't good enough when you're building quantum-resistant biometric authentication that needs to compete with traditional auth systems.

The bottleneck was clear: modular arithmetic. BFV encryption requires thousands of modular multiplications across large polynomial rings. Each multiplication involved division operations that were killing performance.
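To make the cost concrete, here is a minimal sketch of polynomial multiplication in the ring Z_q[x]/(x^n + 1), the kind of ring BFV works in. It is a schoolbook version written for clarity, not H33's code: production FHE libraries typically use an NTT instead, and the function name and the tiny modulus in the example are illustrative assumptions. The point is only that every coefficient product is followed by a reduction mod q, and that `%` is where the division hides.

```python
def negacyclic_mul(a, b, q):
    """Schoolbook product of two polynomials in Z_q[x] / (x^n + 1).

    a and b are coefficient lists of equal length n, lowest degree first.
    Illustrative only: real libraries replace this O(n^2) loop with an NTT.
    """
    n = len(a)
    assert len(b) == n
    out = [0] * n
    for i in range(n):
        for j in range(n):
            prod = (a[i] * b[j]) % q          # one modular multiplication
            k = i + j
            if k < n:
                out[k] = (out[k] + prod) % q
            else:
                # x^n = -1 in this ring, so wrapped terms are subtracted.
                out[k - n] = (out[k - n] - prod) % q
    return out


# (1 + x)^2 = 1 + 2x + x^2 = 2x, because x^2 = -1 when n = 2.
print(negacyclic_mul([1, 1], [1, 1], 17))     # [0, 2]
```

Each pass through the inner loop performs a reduction, so the cost of a single `% q` is multiplied by the number of coefficient products.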

Optimization 1: Montgomery Multiplication

The first major breakthrough came from implementing Montgomery multiplication. Instead of performing an expensive division for each modular reduction, Montgomery form replaces the division by the modulus with multiplications, additions, and a shift by a power of two.

What is Montgomery Multiplication?

Traditional modular multiplication: a * b mod n requires division by n.

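Montgomery multiplication stores each operand in a special form (a value a is kept as a * R mod n, where R is a power of two larger than n). In that form the modular reduction needs only two extra multiplications, an addition, a bit shift, and at most one conditional subtraction. The overhead of converting to and from Montgomery form is amortized across many operations.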
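The idea can be sketched in a few lines of Python. This is a textbook Montgomery reduction (REDC) under stated assumptions, not H33's implementation: the class name, the choice R = 2^64, and the example modulus are all hypothetical. Python integers are arbitrary precision, so the sketch shows the arithmetic rather than the machine-word tricks a native implementation needs.

```python
class Montgomery:
    """Montgomery arithmetic modulo an odd n, with R = 2**bits (assumed 64)."""

    def __init__(self, n, bits=64):
        assert n % 2 == 1 and n < (1 << bits)
        self.n = n
        self.bits = bits
        self.mask = (1 << bits) - 1
        r = 1 << bits
        # One-time setup: these are the only places that divide by n.
        self.r2 = (r * r) % n                     # R^2 mod n
        self.n_neg_inv = (-pow(n, -1, r)) % r     # -n^(-1) mod R

    def redc(self, t):
        """Return t * R^(-1) mod n for 0 <= t < n * R, with no division by n."""
        m = ((t & self.mask) * self.n_neg_inv) & self.mask  # mod R is a mask
        u = (t + m * self.n) >> self.bits                   # exact: low bits are 0
        return u - self.n if u >= self.n else u             # one conditional subtract

    def to_mont(self, a):
        """Convert a to Montgomery form a * R mod n."""
        return self.redc((a % self.n) * self.r2)

    def from_mont(self, a_m):
        """Convert back to the ordinary representative."""
        return self.redc(a_m)

    def mul(self, a_m, b_m):
        """Multiply two values that are both in Montgomery form."""
        return self.redc(a_m * b_m)


n = (1 << 61) - 1                 # illustrative odd modulus, not an H33 parameter
mont = Montgomery(n)
a, b = 123456789123456789, 987654321987654321
result = mont.from_mont(mont.mul(mont.to_mont(a), mont.to_mont(b)))
assert result == (a * b) % n
```

The conversions cost one reduction each, which is why the technique pays off only when values stay in Montgomery form across many multiplications, as they do in polynomial arithmetic.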

This single change dropped our encryption time from 2.02ms to 351 microseconds - an 83% reduction. We were now at 17% of our original runtime.

Basic Encryptor: 2.02ms (100%)
+ Montgomery: 351us (17%)

Optimization 2: SIMD Parallelization

Montgomery multiplication made each operation faster, but we were still processing coefficients sequentially. Modern CPUs have wide SIMD registers (AVX-512 gives us 512 bits) that can process multiple coefficients simultaneously.

We restructured our polynomial operations to leverage these wide registers, so that a single instruction works on several coefficients at once.

The parallel Montgomery implementation brought us from 351us to 331 microseconds. That is a modest 6% improvement, but by this point we were already operating near the memory bandwidth limit.

Basic Encryptor: 2.02ms (100%)
+ Montgomery: 351us (17%)
+ SIMD Parallel: 331us (16%)

Final Results: World's Fastest FHE

Final Performance

2.02ms → 331us (84% faster, 6.1x improvement)

vs Microsoft SEAL: 4.5x faster at equivalent security parameters

Status: #1 worldwide for BFV encryption speed
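For readers who want to check the headline percentages, the arithmetic from the timings quoted in this post (2.02ms baseline, 351us after Montgomery, 331us final) works out as follows. The helper names are illustrative.

```python
baseline_us = 2020.0      # 2.02 ms
montgomery_us = 351.0     # after Montgomery multiplication
final_us = 331.0          # after SIMD parallelization


def reduction_pct(before, after):
    """Percentage of runtime removed going from `before` to `after`."""
    return 100.0 * (1.0 - after / before)


def speedup(before, after):
    """How many times faster `after` is than `before`."""
    return before / after


print(f"Montgomery step: {reduction_pct(baseline_us, montgomery_us):.1f}% less time")  # 82.6%
print(f"SIMD step:       {reduction_pct(montgomery_us, final_us):.1f}% less time")     # 5.7%
print(f"Overall:         {reduction_pct(baseline_us, final_us):.1f}% less time")       # 83.6%
print(f"Overall speedup: {speedup(baseline_us, final_us):.2f}x")                       # 6.10x

# Share of the total time saved that came from the Montgomery step alone.
share = 100.0 * (baseline_us - montgomery_us) / (baseline_us - final_us)
print(f"Montgomery share of savings: {share:.1f}%")                                    # 98.8%
```

Rounded, these are the 83%, 6%, 84%, and 6.1x figures used throughout the post.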

Comparison with Microsoft SEAL

Operation            | H33            | SEAL          | H33 Advantage
BFV Encrypt (N=4096) | 331us          | 5.98ms        | 4.5x faster
Memory Allocation    | Pre-pooled     | Per-operation | Zero alloc
SIMD Support         | AVX-512 + NEON | AVX-512       | Wider vectors

Key Takeaways

  1. Montgomery multiplication is essential - On its own it cut encryption time by 83%, which accounts for nearly all of the total speedup. If you're doing FHE without Montgomery form, you're leaving massive performance on the table.
  2. SIMD is the final mile - Once arithmetic is optimized, vectorization squeezes out the remaining gains.
  3. Memory matters - We didn't cover it here, but our pre-pooled memory allocator eliminates allocation overhead entirely.

This is just one piece of the H33 optimization story. In the next post, we'll cover how we achieved 65% faster STARK proving through NEON NTT, parallel Merkle trees, and batch inversion.
