Engineering Deep Dive · 7 min read

FHE Homomorphic Multiply: 168us to 24us (86% Faster)

How we achieved 6.3x faster homomorphic multiplication through RNS representation and NEON vectorization. 168 microseconds to 24 microseconds.


Homomorphic multiplication is the most expensive operation in FHE. It's what enables computation on encrypted data, but it's also what makes FHE slow. We took our multiply operation from 168 microseconds to 24 microseconds - an 86% reduction that makes real-time encrypted computation possible.

Before: 168us · After: 24us · 86% faster · 6.3x vs baseline · 7.5x vs SEAL

Why Homomorphic Multiply is Hard

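In BFV encryption, ciphertexts are polynomials with very large coefficients. A single coefficient might be 109 bits, so when you multiply two ciphertexts, each coefficient product can reach 218 bits before modular reduction, far beyond a 64-bit machine word.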

Our initial implementation handled this correctly but slowly. Every multiplication involved arbitrary-precision arithmetic to handle overflow, followed by expensive modular reduction.
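
To make the cost concrete, here is a minimal sketch of the kind of polynomial product that sits inside a ciphertext multiply. It assumes the ring Z_q[x]/(x^N + 1) commonly used for BFV and a toy modulus small enough for `u128` intermediates. The name `negacyclic_mul` and its parameters are illustrative, not taken from our codebase, and a full BFV multiply involves more steps than this single product.

```rust
/// Schoolbook product of two polynomials in Z_q[x]/(x^N + 1).
/// Illustrative only: assumes q fits in a u64 so every
/// intermediate value fits in a u128.
pub fn negacyclic_mul(a: &[u64], b: &[u64], q: u64) -> Vec<u64> {
    assert_eq!(a.len(), b.len());
    let n = a.len();
    let q = q as u128;
    let mut out = vec![0u128; n];
    for i in 0..n {
        for j in 0..n {
            // One coefficient product, reduced immediately.
            let prod = (a[i] as u128 * b[j] as u128) % q;
            let k = i + j;
            if k < n {
                out[k] = (out[k] + prod) % q;
            } else {
                // x^N = -1 in this ring, so wrapped terms are subtracted.
                out[k - n] = (out[k - n] + q - prod) % q;
            }
        }
    }
    out.into_iter().map(|c| c as u64).collect()
}
```

With N coefficients this performs N² coefficient products, and with 109-bit coefficients each `a[i] * b[j]` would need multi-word arithmetic instead of the single `u128` multiply used here.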

The RNS Breakthrough

The key insight is the Residue Number System (RNS). Instead of working with one giant 109-bit modulus, we decompose it into several smaller moduli that fit in 64 bits:

RNS Decomposition

Original: Q = 109-bit modulus

RNS form: Q = q1 * q2 (where q1, q2 are ~55 bits each)

Now each coefficient is represented as a tuple (x mod q1, x mod q2)

Multiplications become independent word-sized operations (64-bit residues with a 128-bit intermediate product) - no arbitrary-precision arithmetic!

The Chinese Remainder Theorem guarantees we can reconstruct the full result from the residues. But the magic is: for most operations, we never need to reconstruct. We can add, subtract, and multiply entirely in RNS form.

// Before: big-integer multiplication
let result = bigint_mul(a, b);  // Slow, allocates
let reduced = result % Q;       // Even slower

// After: RNS multiplication (each residue is independent)
// Widen to u128 first: the product of two ~55-bit residues needs ~110 bits
let r1 = ((a.r1 as u128 * b.r1 as u128) % q1 as u128) as u64;
let r2 = ((a.r2 as u128 * b.r2 as u128) % q2 as u128) as u64;
// No reconstruction needed for most operations!
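
The round trip through RNS form and back can be sketched as follows. This is a minimal illustration for two coprime moduli, not our production code. The helper names (`to_rns`, `rns_mul`, `from_rns`, `mod_inv`) are hypothetical, and the reconstruction uses Garner's form of the Chinese Remainder Theorem.

```rust
/// Modular inverse via the extended Euclidean algorithm.
/// Assumes gcd(a, m) == 1.
fn mod_inv(a: u64, m: u64) -> u64 {
    let (mut r0, mut r1) = (m as i128, (a % m) as i128);
    let (mut t0, mut t1) = (0i128, 1i128);
    while r1 != 0 {
        let quot = r0 / r1;
        (r0, r1) = (r1, r0 - quot * r1);
        (t0, t1) = (t1, t0 - quot * t1);
    }
    t0.rem_euclid(m as i128) as u64
}

/// Split x into its residues modulo two coprime moduli.
pub fn to_rns(x: u128, q1: u64, q2: u64) -> (u64, u64) {
    ((x % q1 as u128) as u64, (x % q2 as u128) as u64)
}

/// Multiply component-wise; the u128 intermediate holds the double-width product.
pub fn rns_mul(a: (u64, u64), b: (u64, u64), q1: u64, q2: u64) -> (u64, u64) {
    let m = |x: u64, y: u64, q: u64| ((x as u128 * y as u128) % q as u128) as u64;
    (m(a.0, b.0, q1), m(a.1, b.1, q2))
}

/// CRT reconstruction: returns the unique x in [0, q1 * q2) with the given residues.
pub fn from_rns(r: (u64, u64), q1: u64, q2: u64) -> u128 {
    // x = r1 + q1 * ((r2 - r1) * q1^{-1} mod q2)
    let q2w = q2 as u128;
    let inv = mod_inv(q1 % q2, q2) as u128;
    let diff = (r.1 as u128 + q2w - (r.0 as u128 % q2w)) % q2w;
    let k = (diff * inv) % q2w;
    r.0 as u128 + q1 as u128 * k
}
```

Multiplying `to_rns(a, q1, q2)` by `to_rns(b, q1, q2)` with `rns_mul` and passing the result to `from_rns` should give the same value as `(a * b) % (q1 * q2)` computed directly, while the multiply itself never touches a number wider than 128 bits.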

NEON Vectorization

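With RNS, our coefficients fit in 64 bits. This unlocks SIMD parallelism: a 128-bit ARM NEON register holds two 64-bit lanes, so each loop iteration handles two coefficients:

// Process 2 coefficient multiplications per iteration (AArch64 NEON)
// NEON has no 64x64-bit vector multiply intrinsic, so `mul_wide_neon`
// stands for a helper built from 32-bit `vmull_u32` partial products.
unsafe {
    let a_vec = vld1q_u64(a_coeffs[i..].as_ptr());
    let b_vec = vld1q_u64(b_coeffs[i..].as_ptr());
    let prod = mul_wide_neon(a_vec, b_vec); // full-width products, one per lane
    let result = montgomery_reduce_neon(prod, q_vec);
    vst1q_u64(out[i..].as_mut_ptr(), result);
}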
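
A full 64-bit by 64-bit product on NEON is typically assembled from 32x32 to 64-bit widening multiplies (`vmull_u32`). The portable scalar sketch below shows that decomposition for a single lane; a vector version applies the same steps to both lanes at once. The function name `mul_64x64_via_32` is hypothetical and this is not our production kernel.

```rust
/// Full 64x64 -> 128-bit product assembled from four 32x32 -> 64-bit
/// partial products. Returns (high 64 bits, low 64 bits).
pub fn mul_64x64_via_32(a: u64, b: u64) -> (u64, u64) {
    let (a_lo, a_hi) = (a & 0xFFFF_FFFF, a >> 32);
    let (b_lo, b_hi) = (b & 0xFFFF_FFFF, b >> 32);

    // Each operand is below 2^32, so each product fits in a u64.
    let ll = a_lo * b_lo;
    let lh = a_lo * b_hi;
    let hl = a_hi * b_lo;
    let hh = a_hi * b_hi;

    // Middle column: at most 3 * (2^32 - 1), so the sum cannot overflow.
    let mid = (ll >> 32) + (lh & 0xFFFF_FFFF) + (hl & 0xFFFF_FFFF);
    let lo = (ll & 0xFFFF_FFFF) | (mid << 32);
    let hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    (hi, lo)
}
```

The result can be checked against Rust's native `u128` multiplication, which is a convenient way to test a vectorized variant lane by lane.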

Combined with Montgomery reduction (from our encryption optimizations), each coefficient multiply-reduce takes just a few cycles.
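
For readers who have not seen it, Montgomery multiplication for one 64-bit modulus can be sketched as below. This is a scalar illustration assuming an odd modulus below 2^63 and R = 2^64. The `Mont` type and its methods are hypothetical names, not our actual API.

```rust
/// Montgomery context for an odd modulus q < 2^63, with R = 2^64.
pub struct Mont {
    q: u64,
    q_neg_inv: u64, // -q^{-1} mod 2^64
    r2: u64,        // R^2 mod q, used to enter Montgomery form
}

impl Mont {
    pub fn new(q: u64) -> Self {
        assert!(q % 2 == 1 && q < (1u64 << 63));
        // Newton iteration: q * q = 1 mod 8 for odd q, and each step
        // doubles the number of correct low bits (3 -> 6 -> ... -> 96).
        let mut inv = q;
        for _ in 0..5 {
            inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
        }
        let r = ((1u128 << 64) % q as u128) as u64;
        let r2 = ((r as u128 * r as u128) % q as u128) as u64;
        Mont { q, q_neg_inv: inv.wrapping_neg(), r2 }
    }

    /// REDC: returns t * R^{-1} mod q for t < q * R, using only
    /// multiplies, an add, and a shift.
    fn reduce(&self, t: u128) -> u64 {
        let m = (t as u64).wrapping_mul(self.q_neg_inv);
        let u = ((t + m as u128 * self.q as u128) >> 64) as u64;
        if u >= self.q { u - self.q } else { u }
    }

    pub fn to_mont(&self, x: u64) -> u64 {
        self.reduce(x as u128 * self.r2 as u128)
    }

    pub fn from_mont(&self, x: u64) -> u64 {
        self.reduce(x as u128)
    }

    /// Product of two values that are already in Montgomery form.
    pub fn mul(&self, a: u64, b: u64) -> u64 {
        self.reduce(a as u128 * b as u128)
    }
}
```

Values are converted once with `to_mont`, multiplied as often as needed with `mul`, and converted back with `from_mont` at the end. The divisions in `new` run once per modulus, so the hot path contains no division.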

The Complete Picture

Technique | Impact | Why It Works
RNS Representation | ~4x faster | Eliminates arbitrary-precision arithmetic
Montgomery Reduction | ~2x faster | Replaces division with multiplies and shifts
NEON Vectorization | ~1.5x faster | Processes 2 coefficients per vector operation
Combined | 6.3x faster | 168us → 26.7us → 24us
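
The headline figures can be sanity-checked with two small helpers. This is plain arithmetic on the latencies quoted in this post; the function names are illustrative.

```rust
/// Speedup factor: how many times faster `after_us` is than `before_us`.
pub fn speedup(before_us: f64, after_us: f64) -> f64 {
    before_us / after_us
}

/// Latency reduction as a percentage of the original latency.
pub fn reduction_pct(before_us: f64, after_us: f64) -> f64 {
    (before_us - after_us) / before_us * 100.0
}
```

On these numbers, `speedup(168.0, 26.7)` is about 6.3, `speedup(168.0, 24.0)` is 7.0, and `reduction_pct(168.0, 24.0)` is about 85.7, which rounds to the 86% quoted above.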

vs Microsoft SEAL

SEAL is the industry standard for FHE. It's well-engineered and widely used. Here's how we compare:

H33 vs SEAL Homomorphic Multiply (N=4096)

H33: 24 microseconds

SEAL: ~180 microseconds

H33 Advantage: 7.5x faster

The difference comes from:

What This Enables

At 24 microseconds per multiply, encrypted computation becomes practical for real-time applications.

When your competitors are at 180+ microseconds per multiply, you can do 7x more computation in the same time budget. That's the difference between "FHE is a research toy" and "FHE is production ready."
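
As a rough illustration of what a time budget buys, the sketch below counts how many sequential multiplies fit in a fixed window. It assumes multiplies run back to back with no other overhead, which a real workload will not quite achieve; `multiplies_in_budget` is a hypothetical helper.

```rust
/// Number of whole multiplies that fit in a latency budget,
/// assuming back-to-back execution with no other overhead.
pub fn multiplies_in_budget(budget_us: u64, per_multiply_us: u64) -> u64 {
    assert!(per_multiply_us > 0);
    budget_us / per_multiply_us
}
```

In a 1 ms budget (1000us), this gives 41 multiplies at 24us each versus 5 at 180us each.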

Key Takeaways

  1. RNS is mandatory - If you're doing FHE without RNS representation, you're not competitive.
  2. Design for SIMD from the start - Retrofitting vectorization is painful. We designed our data structures for NEON from day one.
  3. Montgomery + RNS is the winning combo - Each technique gives you 2-4x; together they compound.

This is part 3 of our performance series. Next: we'll show the complete optimization journey and what it means for quantum-resistant authentication.
