Engineering Deep Dive · 5 min read

FHE Encryption Optimization:
2.02ms to 331us (84% Faster)

How we achieved 6.1x faster FHE encryption through Montgomery multiplication and SIMD parallelization. From 2.02ms to 331 microseconds.

~42µs Auth Latency · 2.17M/s Throughput · 128-bit Security · Zero Plaintext

When we started optimizing H33's fully homomorphic encryption pipeline, our BFV encryption took 2.02 milliseconds. Today, it takes 331 microseconds. This is the story of how we achieved 84% faster encryption and became the fastest FHE library in the world -- and how that single-encrypt speedup compounds into a production system processing 1.595 million authentications per second. See the full results on our benchmarks page.

84% Faster · 6.1x vs Baseline · 4.5x vs SEAL

The Starting Point: 2.02ms

Our initial implementation used a straightforward approach to BFV encryption. It worked correctly, passed all test vectors, and was reasonably fast. But "reasonably fast" isn't good enough when you're building quantum-resistant biometric authentication that needs to compete with traditional auth systems.

The bottleneck was clear: modular arithmetic. BFV encryption requires thousands of modular multiplications across large polynomial rings. Each multiplication involved division operations that were killing performance. With N=4096 polynomial degree and a 56-bit ciphertext modulus, our encrypt path was executing roughly 4,096 modular multiplications per NTT butterfly pass, across multiple residue limbs. The div instruction on modern x86 takes 30-90 cycles depending on operand size -- and we were calling it millions of times per second.

Optimization 1: Montgomery Multiplication

The first major breakthrough came from implementing Montgomery multiplication. Instead of performing expensive division operations for each modular reduction, Montgomery form converts multiplications into shifts and additions.

What is Montgomery Multiplication?

Traditional modular multiplication: a * b mod n requires division by n.

Montgomery multiplication pre-computes a special form where the modular reduction becomes a simple bit shift. The overhead of converting to/from Montgomery form is amortized across many operations.

The key insight is that BFV encryption is dominated by NTT -- the Number Theoretic Transform. A single NTT of degree N=4096 requires N/2 * log2(N) = 24,576 butterfly operations, each containing a modular multiply. By keeping all NTT twiddle factors in Montgomery form and performing the entire forward-inverse NTT cycle without leaving Montgomery domain, we eliminate the conversion overhead entirely. The secret key, once generated, stays in NTT-Montgomery form for the lifetime of the system.

// Montgomery reduction (REDC) -- no division.
// q_inv = -q^{-1} mod 2^64. Valid for any t < q * 2^64; with a 56-bit
// modulus the u128 sum below cannot overflow, and the single conditional
// subtract guarantees a fully reduced result in [0, q).
fn mont_reduce(t: u128, q: u64, q_inv: u64) -> u64 {
    let m = (t as u64).wrapping_mul(q_inv);
    let u = (t + (m as u128) * (q as u128)) >> 64;
    if u >= q as u128 { (u - q as u128) as u64 } else { u as u64 }
}

This single change dropped our encryption time from 2.02ms to 351 microseconds -- an 83% reduction. We were now at 17% of our original runtime.

Basic Encryptor: 2.02ms (100%)
+ Montgomery: 351us (17%)

Optimization 2: SIMD Parallelization

Montgomery multiplication made each operation faster, but we were still processing coefficients sequentially. Modern CPUs have wide SIMD registers (AVX-512 gives us 512 bits) that can process multiple coefficients simultaneously.

We restructured our polynomial operations so that independent coefficients are processed in parallel across those wide registers instead of one at a time.

The parallel Montgomery implementation brought us from 351us to 331 microseconds. A modest 6% improvement, but we were already operating near the memory bandwidth limit.

Key Insight

On ARM Graviton4 (our production hardware), NEON lacks a native 64x64-to-128-bit multiply instruction. The NEON butterfly extracts to scalar, performs the u128 multiply, then repacks -- making NEON useful primarily for add/sub/compare vectorization in the NTT, not for the multiplication core itself. This architectural detail drove us toward Harvey lazy reduction (keeping butterfly values in [0, 2q) between stages) rather than deeper SIMD fusion.

Basic Encryptor: 2.02ms (100%)
+ Montgomery: 351us (17%)
+ SIMD Parallel: 331us (16%)

Optimization 3: SIMD Batching and NTT-Domain Persistence

The encryption speedup alone does not tell the full production story. H33 uses CRT-based SIMD batching to pack 32 biometric templates into a single ciphertext. With N=4096 polynomial slots and 128-dimensional feature vectors, each BFV ciphertext carries an entire batch of users, all computed on while still encrypted. This means the 331-microsecond encrypt cost is amortized across 32 authentications.

We also restructured the verify pipeline to keep intermediate ciphertexts in NTT form wherever possible. The function multiply_plain_ntt() returns ciphertexts with is_ntt_form: true, skipping two inverse NTT transforms per modulus per call. This dropped our batch FHE inner-product time from 1,375 microseconds to ~1,109 microseconds -- a 19.3% reduction at the batch level.

Pipeline Stage           Latency      Post-Quantum Secure
FHE Batch (32 users)     ~1,109 µs    Yes (lattice-based)
ZKP Lookup               0.085 µs     Yes (SHA3-256)
Dilithium Attestation    ~244 µs      Yes (ML-DSA)
Per Authentication       ~42 µs       Fully PQ

Dead Ends: What Did Not Work

Not every optimization attempt was successful. Two strategies in particular looked promising in microbenchmarks but failed catastrophically in production at 96-worker concurrency on Graviton4.

These failures underscore an important lesson: FHE optimization is architecture-specific. A technique that wins on laptop silicon may lose on server-grade ARM, and vice versa. Always benchmark on your production target.

Final Results: World's Fastest FHE

Final Performance

2.02ms → 331us (84% faster, 6.1x improvement)

vs Microsoft SEAL: 4.5x faster at equivalent security parameters

Status: #1 worldwide for BFV encryption speed

Comparison with Microsoft SEAL

Operation                     H33                  SEAL            H33 Advantage
BFV Encrypt (N=4096)          331us                5.98ms          4.5x faster
Memory Allocation             Pre-pooled           Per-operation   Zero alloc
SIMD Support                  AVX-512 + NEON       AVX-512         Wider platform coverage
Batch Throughput (96 cores)   2.17M auth/sec       N/A             Production proven
Post-Quantum Attestation      Dilithium (ML-DSA)   None            Full PQ stack

Key Takeaways

  1. Montgomery multiplication is essential -- It provided 83% of our total speedup. If you're doing FHE without Montgomery form, you're leaving massive performance on the table.
  2. SIMD is the final mile -- Once arithmetic is optimized, vectorization squeezes out the remaining gains, but beware of architecture-specific pitfalls like missing 64-bit NEON multiply.
  3. Batch amortization matters more than single-op speed -- Packing 32 users per ciphertext turns a 1,109-microsecond batch operation into ~42 microseconds per authentication. System-level design amplifies micro-level optimizations.
  4. Profile on production hardware -- Cache behavior, allocator performance, and memory models vary dramatically between M4 Max, Xeon, and Graviton4. Microbenchmarks lie.
  5. Memory matters -- Our pre-pooled memory allocator eliminates allocation overhead entirely, and keeping keys in NTT-Montgomery form avoids redundant transforms on every request.

This is just one piece of the H33 optimization story. In the next post, we'll cover how we achieved 65% faster STARK proving through NEON NTT, parallel Merkle trees, and batch inversion.

Try the World's Fastest FHE

331 microsecond encryption. 2.17M auth/sec at scale. NIST-compliant post-quantum security. Production ready.

Get API Key
