Engineering Deep Dive · 5 min read

FHE Homomorphic Multiply:
168us to 24us (86% Faster)

How we achieved 6.3x faster homomorphic multiplication through RNS representation and NEON vectorization. 168 microseconds to 24 microseconds.

~42µs Auth Latency · 2.17M/s Throughput · 128-bit Security · Zero Plaintext

Homomorphic multiplication is the most expensive operation in FHE. It's what enables computation on encrypted data, but it's also what makes FHE slow. We took our multiply operation from 168 microseconds to 24 microseconds - an 86% reduction that makes real-time encrypted computation possible.

Before: 168us → After: 24us · 86% Faster · 6.3x vs Baseline · 7.5x vs SEAL

Why Homomorphic Multiply is Hard

In BFV encryption, ciphertexts are polynomials with very large coefficients. A single coefficient might be 109 bits. When you multiply two ciphertexts, the coefficient products grow to roughly twice that width -- far beyond any native machine word -- and the tensor product expands the ciphertext from two polynomials to three, which a relinearization step must then fold back into the original shape.

Our initial implementation handled this correctly but slowly. Every multiplication involved arbitrary-precision arithmetic to handle overflow, followed by expensive modular reduction.

In H33's production BFV configuration, we use N=4096 with a single 56-bit modulus Q and plaintext modulus t=65537. This parameter set targets 128-bit security while keeping the polynomial ring small enough for sub-millisecond operations. The challenge is that even at N=4096, a naive polynomial multiply requires N-squared coefficient multiplications -- over 16 million operations. The Number Theoretic Transform (NTT) reduces this to O(N log N), but the constants still matter enormously when you need to sustain 1.595 million authentications per second across 96 parallel workers.
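To make those counts concrete, here is a back-of-the-envelope sketch (plain arithmetic, nothing FHE-specific) reproducing the figures above:

```rust
// Op counts for degree-4096 polynomial multiply: schoolbook vs NTT.
fn schoolbook_ops(n: u64) -> u64 {
    n * n // one coefficient product per (i, j) pair
}

fn ntt_ops(n: u64) -> u64 {
    n * u64::from(n.ilog2()) // O(N log N) butterfly-level operations
}

fn main() {
    let n = 4096;
    println!("schoolbook: {}", schoolbook_ops(n)); // 16,777,216
    println!("ntt:        {}", ntt_ops(n));        // 49,152
    println!("ratio:      {}x", schoolbook_ops(n) / ntt_ops(n));
}
```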

The RNS Breakthrough

The key insight is the Residue Number System (RNS). Instead of working with one giant 109-bit modulus, we decompose it into several smaller moduli that fit in 64 bits:

RNS Decomposition

Original: Q = 109-bit modulus

RNS form: Q = q1 * q2 (where q1, q2 are ~55 bits each)

Now each coefficient is represented as a tuple (x mod q1, x mod q2)

Multiplications become parallel 64-bit operations - no overflow!

The Chinese Remainder Theorem guarantees we can reconstruct the full result from the residues. But the magic is: for most operations, we never need to reconstruct. We can add, subtract, and multiply entirely in RNS form.

Key Insight

RNS representation eliminates the single biggest bottleneck in FHE multiply: arbitrary-precision integer arithmetic. By keeping every coefficient in machine-word-sized residues, we stay entirely within the CPU's native 64-bit ALU. No heap allocations, no bignum libraries, no branch-heavy division loops. Each RNS channel is also independently parallelizable -- across both SIMD lanes and CPU cores -- which is why RNS scales linearly with hardware threads in a way that bignum arithmetic never can.

// Before: Big integer multiplication
let result = bigint_mul(a, b);  // Slow, allocates
let reduced = result % Q;       // Even slower

// After: RNS multiplication (parallel, independent channels)
// Widen for the product so ~55-bit residues can't overflow u64;
// production code uses Montgomery reduction instead of `%`
let r1 = ((a.r1 as u128 * b.r1 as u128) % q1 as u128) as u64;
let r2 = ((a.r2 as u128 * b.r2 as u128) % q2 as u128) as u64;
// No reconstruction needed for most operations!
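Here is the same idea end to end as a runnable sketch. The two moduli are small toy values chosen for readability, not the production ~55-bit NTT primes (coprimality is all CRT needs):

```rust
// Toy RNS demo: two small coprime moduli stand in for the production primes.
const Q1: u64 = 1_000_003;
const Q2: u64 = 1_000_000;

fn to_rns(x: u64) -> (u64, u64) {
    (x % Q1, x % Q2)
}

// Multiply entirely in RNS form: two independent machine-word channels.
fn rns_mul(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    (
        ((a.0 as u128 * b.0 as u128) % Q1 as u128) as u64,
        ((a.1 as u128 * b.1 as u128) % Q2 as u128) as u64,
    )
}

// Extended Euclid: returns (g, x, y) with a*x + b*y = g.
fn egcd(a: i128, b: i128) -> (i128, i128, i128) {
    if b == 0 { (a, 1, 0) } else {
        let (g, x, y) = egcd(b, a % b);
        (g, y, x - (a / b) * y)
    }
}

// CRT reconstruction -- only needed when finally leaving RNS form.
fn from_rns((r1, r2): (u64, u64)) -> u128 {
    let (q1, q2) = (Q1 as i128, Q2 as i128);
    let inv_q1_mod_q2 = egcd(q1, q2).1.rem_euclid(q2);
    let t = ((r2 as i128 - r1 as i128).rem_euclid(q2) * inv_q1_mod_q2).rem_euclid(q2);
    (r1 as i128 + q1 * t) as u128
}

fn main() {
    let (a, b) = (123_456_789u64, 987_654_321u64);
    let product = from_rns(rns_mul(to_rns(a), to_rns(b)));
    assert_eq!(product, (a as u128 * b as u128) % (Q1 as u128 * Q2 as u128));
    println!("RNS multiply + CRT reconstruction: {product}");
}
```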

Montgomery Reduction: Eliminating Division

Even with RNS, each 64-bit modular multiplication still requires a modular reduction step. The textbook approach uses integer division, which on ARM Graviton4 costs 20-40 cycles per operation. Across 4,096 coefficients and multiple RNS channels, that adds up to tens of microseconds of pure division overhead.

Montgomery reduction replaces division with a sequence of multiplications and bit-shifts. The idea is to work in a transformed "Montgomery domain" where reduction is achieved by multiplying by a pre-computed inverse and right-shifting by R (a power of two). Entering and leaving the domain each costs one multiplication, but once inside, every reduction is division-free.

// Montgomery REDC: no division in the hot path.
// q_inv is precomputed as -q^(-1) mod 2^64 (q must be odd).
fn mont_reduce(t: u128, q: u64, q_inv: u64) -> u64 {
    let m = (t as u64).wrapping_mul(q_inv);
    let reduced = (t + (m as u128) * (q as u128)) >> 64;
    if reduced >= q as u128 { (reduced - q as u128) as u64 }
    else { reduced as u64 }
}

In H33's NTT pipeline, twiddle factors are stored permanently in Montgomery form. The secret key is also kept in NTT-domain Montgomery representation after keygen. This means the forward and inverse NTT transforms -- which dominate the cost of multiply -- never leave the Montgomery domain. The net effect: two REDC operations per butterfly instead of two divisions, a roughly 2x improvement in the NTT inner loop.
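That flow can be sketched self-contained: compute q_inv once, enter the domain, multiply with REDC only, leave at the end. The modulus here is a toy 61-bit value (2^61 - 1), not a production parameter:

```rust
// q_inv = -q^{-1} mod 2^64, by Newton iteration (q must be odd).
// Each pass doubles the number of correct low bits: 1 -> 2 -> ... -> 64.
fn neg_inv_u64(q: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

// Montgomery REDC with R = 2^64: returns t * R^{-1} mod q, division-free.
fn mont_reduce(t: u128, q: u64, q_inv: u64) -> u64 {
    let m = (t as u64).wrapping_mul(q_inv);
    let reduced = (t + (m as u128) * (q as u128)) >> 64;
    if reduced >= q as u128 { (reduced - q as u128) as u64 } else { reduced as u64 }
}

fn main() {
    const Q: u64 = (1u64 << 61) - 1; // toy odd modulus
    let q_inv = neg_inv_u64(Q);
    assert_eq!(Q.wrapping_mul(q_inv), u64::MAX); // q * q_inv ≡ -1 (mod 2^64)

    // Enter the Montgomery domain: x -> x * R mod q.
    let to_mont = |x: u64| (((x as u128) << 64) % (Q as u128)) as u64;
    let (a, b) = (123_456_789u64, 987_654_321u64);
    let (am, bm) = (to_mont(a), to_mont(b));

    // One REDC per multiply while inside the domain -- no division.
    let prod_mont = mont_reduce(am as u128 * bm as u128, Q, q_inv);
    let prod = mont_reduce(prod_mont as u128, Q, q_inv); // leave the domain
    assert_eq!(prod as u128, (a as u128 * b as u128) % (Q as u128));
    println!("montgomery multiply ok: {prod}");
}
```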

NEON Vectorization

With RNS, our coefficients fit in 64 bits. This unlocks SIMD parallelism. ARM NEON can process two 64-bit multiplications simultaneously:

// Process 2 coefficients per iteration (core::arch::aarch64 intrinsics)
let a_vec = vld1q_u64(&a_coeffs[i]);
let b_vec = vld1q_u64(&b_coeffs[i]);
// No 64x64->128 vector multiply on NEON: extract lanes, widen in
// scalar, Montgomery-reduce, then repack
let r0 = mont_mul(vgetq_lane_u64::<0>(a_vec), vgetq_lane_u64::<0>(b_vec));
let r1 = mont_mul(vgetq_lane_u64::<1>(a_vec), vgetq_lane_u64::<1>(b_vec));
let result = vsetq_lane_u64::<1>(r1, vdupq_n_u64(r0));
vst1q_u64(&mut out[i], result);

Combined with Montgomery reduction (from our encryption optimizations), each coefficient multiply-reduce takes just a few cycles. One important caveat: ARM NEON lacks a native 64x64-to-128-bit multiply instruction, so the full Montgomery REDC must extract to scalar for the widening multiply and then repack. NEON still wins on the add, subtract, and compare steps of the butterfly, and the vectorized loads and stores eliminate memory stalls that would otherwise dominate at N=4096.

The Complete Picture

Technique               Impact         Why It Works
RNS Representation      ~4x faster     Eliminates arbitrary-precision arithmetic
Montgomery Reduction    ~2x faster     Replaces division with shifts + multiply
NEON Vectorization      ~1.5x faster   Process 2 coefficients per instruction
Harvey Lazy Reduction   ~1.1x faster   Defer reduction between NTT stages
Combined                6.3x faster    168us → 26.7us → 24us

The final 10% gain -- from 26.7us to 24us -- came from Harvey lazy reduction, where butterfly outputs are kept in the range [0, 2q) rather than fully reduced to [0, q) between NTT stages. This eliminates one conditional subtraction per butterfly, which, at 4,096 * log2(4,096) = 49,152 butterflies per transform, removes roughly 50,000 conditional branches (and their potential mispredictions) per transform.
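A simplified sketch of the idea (the real kernel uses Shoup or Montgomery multiplication for w·y; the modulus 12289 here is a classic NTT-friendly prime used only for readability):

```rust
// Cooley-Tukey butterfly with lazy reduction: inputs and outputs both
// live in [0, 2q), so only one conditional subtraction is needed per
// butterfly instead of one per output. Final normalization to [0, q)
// happens once, after the last NTT stage.
fn lazy_butterfly(x: u64, y: u64, w: u64, q: u64) -> (u64, u64) {
    debug_assert!(x < 2 * q && y < 2 * q);
    let x = if x >= q { x - q } else { x }; // the single conditional subtract
    // stand-in for a Shoup/Montgomery multiply; t lands in [0, q)
    let t = ((w as u128 * y as u128) % q as u128) as u64;
    (x + t, x + q - t) // both outputs stay in [0, 2q): no further reduction
}

fn main() {
    let q = 12_289u64; // classic NTT-friendly prime (toy size)
    let (x, y, w) = (2 * q - 5, 123, 456);
    let (a, b) = lazy_butterfly(x, y, w, q);
    assert!(a < 2 * q && b < 2 * q);         // lazy invariant holds
    assert_eq!(a % q, (x + w * y) % q);      // a ≡ x + w*y (mod q)
    assert_eq!((b + w * y) % q, x % q);      // b ≡ x - w*y (mod q)
    println!("lazy butterfly ok: ({a}, {b})");
}
```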

vs Microsoft SEAL

SEAL is the industry standard for FHE. It's well-engineered and widely used. Here's how we compare:

H33 vs SEAL Homomorphic Multiply (N=4096)

H33: 24 microseconds

SEAL: ~180 microseconds

H33 Advantage: 7.5x faster

The difference comes from the techniques above: an RNS-native data layout, a Montgomery-domain NTT tuned for Graviton4's NEON pipeline, and Harvey lazy reduction -- all specialized for a single fixed parameter set (N=4096) rather than a general-purpose library.

Production Impact: 32 Users Per Ciphertext

Fast multiply is necessary but not sufficient. The real production win comes from SIMD batching: H33 packs 32 user biometric templates into a single BFV ciphertext using 4,096 plaintext slots (4,096 slots / 128 embedding dimensions = 32 users). This means one homomorphic inner product -- which involves multiply_plain followed by accumulate -- processes an entire batch of 32 authentications simultaneously.

Key Insight

At 24µs per homomorphic multiply, a full 32-user batch completes its FHE stage in approximately 1,109µs. Combined with in-process DashMap ZKP lookups (0.085µs) and a single batched Dilithium signature for attestation (~244µs), the total per-authentication cost drops to ~42µs. On a Graviton4 c8g.metal-48xl with 96 workers, this sustains 2,172,518 authentications per second -- entirely post-quantum secure, with zero plaintext exposure at any stage.
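The slot arithmetic can be sanity-checked with a plaintext simulation of the packing layout. This models only how 32 templates share 4,096 slots -- there is no encryption here, and the function name and toy data are illustrative:

```rust
// 4096 slots / 128 embedding dims = 32 users per ciphertext.
const SLOTS: usize = 4096;
const DIM: usize = 128;
const USERS: usize = SLOTS / DIM; // 32

// Models multiply_plain + per-user accumulate: one slot-wise pass
// yields all 32 inner products.
fn batched_inner_products(packed: &[i64], probe: &[i64; DIM]) -> Vec<i64> {
    assert_eq!(packed.len(), SLOTS);
    (0..USERS)
        .map(|u| (0..DIM).map(|d| packed[u * DIM + d] * probe[d]).sum::<i64>())
        .collect()
}

fn main() {
    // toy data: every user's template is all 1s, except user 7 (all 2s)
    let mut packed = vec![1i64; SLOTS];
    for d in 0..DIM {
        packed[7 * DIM + d] = 2;
    }
    let probe = [3i64; DIM];
    let scores = batched_inner_products(&packed, &probe);
    assert_eq!(scores.len(), USERS);
    assert_eq!(scores[0], 384); // 128 * 1 * 3
    assert_eq!(scores[7], 768); // 128 * 2 * 3
    println!("one batched pass -> {} inner products", scores.len());
}
```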

What This Enables

At 24 microseconds per multiply, encrypted computation becomes practical for real-time applications.

When your competitors are at 180+ microseconds per multiply, you can do 7x more computation in the same time budget. That's the difference between "FHE is a research toy" and "FHE is production ready."

Key Takeaways

  1. RNS is mandatory - If you're doing FHE without RNS representation, you're not competitive.
  2. Design for SIMD from the start - Retrofitting vectorization is painful. We designed our data structures for NEON from day one.
  3. Montgomery + RNS is the winning combo - Each technique gives you 2-4x; together they compound.
  4. Stay in the transform domain - Converting between coefficient and NTT form is expensive. Keep ciphertexts, keys, and twiddles in their optimal domain and minimize round-trips.
  5. Measure at production scale - Micro-benchmarks lie. An optimization that wins on a single core can regress under 96-worker contention due to L1 cache pressure. Always validate on the target hardware at full parallelism.

This is part 3 of our performance series. Next: we'll show the complete optimization journey and what it means for quantum-resistant authentication.

Try the World's Fastest FHE Multiply

24 microseconds per homomorphic multiplication. 7.5x faster than SEAL.

Get API Key
