Homomorphic multiplication is the most expensive operation in FHE. It's what enables computation on encrypted data, but it's also what makes FHE slow. We took our multiply operation from 168 microseconds to 24 microseconds - an 86% reduction that makes real-time encrypted computation possible.
Why Homomorphic Multiply is Hard
In BFV encryption, ciphertexts are polynomials with very large coefficients. A single coefficient might be 109 bits. When you multiply two ciphertexts:
- You're multiplying polynomials of degree 4,096+
- Each coefficient multiplication can overflow 128 bits (two 109-bit values produce a product of up to 218 bits)
- The result needs modular reduction back to the ciphertext space
- Relinearization is required to keep ciphertext size bounded
Our initial implementation handled this correctly but slowly. Every multiplication involved arbitrary-precision arithmetic to handle overflow, followed by expensive modular reduction.
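To make the overflow concrete, here is a minimal sketch of the bit-width arithmetic. The helper names are hypothetical and are not taken from our codebase; they only show that a product of two 109-bit values cannot be held in a native 128-bit integer, while a product of two ~55-bit values can.

```rust
/// Returns the product if it fits in 128 bits, or None on overflow.
/// (Hypothetical helper, for illustration only.)
fn product_fits_u128(a: u128, b: u128) -> Option<u128> {
    a.checked_mul(b)
}

/// Worst-case bit length of the exact product of an `a_bits`-bit
/// value and a `b_bits`-bit value.
fn max_product_bits(a_bits: u32, b_bits: u32) -> u32 {
    a_bits + b_bits
}
```

Calling `product_fits_u128` with two copies of `(1u128 << 109) - 1` yields `None`, which is why the naive path needs arbitrary-precision integers.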
The RNS Breakthrough
The key insight is the Residue Number System (RNS). Instead of working with one giant 109-bit modulus, we decompose it into several smaller moduli that fit in 64 bits:
RNS Decomposition
Original: Q = 109-bit modulus
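RNS form: Q = q1 * q2 (where q1, q2 are coprime and ~55 bits each)
Now each coefficient is represented as a tuple (x mod q1, x mod q2)
Multiplications become independent word-sized operations: each ~110-bit product fits in a 128-bit intermediate before reduction, so no big integers are needed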
The Chinese Remainder Theorem guarantees we can reconstruct the full result from the residues. But the magic is: for most operations, we never need to reconstruct. We can add, subtract, and multiply entirely in RNS form.
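The round trip can be sketched in a few lines. This is an illustration under stated assumptions, not our production code: the moduli are toy 16-bit primes (65,537 and 65,521) chosen so everything fits comfortably in native integers, and the names `Rns`, `to_rns`, `rns_mul`, and `from_rns` are hypothetical.

```rust
/// Toy coprime moduli (assumption: real parameters use ~55-bit primes).
const Q1: u64 = 65_537;
const Q2: u64 = 65_521;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Rns {
    r1: u64, // x mod Q1
    r2: u64, // x mod Q2
}

fn mul_mod(a: u64, b: u64, q: u64) -> u64 {
    // Widen to u128 so the intermediate product cannot overflow.
    ((a as u128 * b as u128) % q as u128) as u64
}

fn to_rns(x: u64) -> Rns {
    Rns { r1: x % Q1, r2: x % Q2 }
}

/// Multiply channel by channel; no reconstruction is involved.
fn rns_mul(a: Rns, b: Rns) -> Rns {
    Rns { r1: mul_mod(a.r1, b.r1, Q1), r2: mul_mod(a.r2, b.r2, Q2) }
}

fn pow_mod(mut base: u64, mut exp: u64, q: u64) -> u64 {
    let mut acc = 1u64;
    base %= q;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul_mod(acc, base, q);
        }
        base = mul_mod(base, base, q);
        exp >>= 1;
    }
    acc
}

/// CRT reconstruction: x = r1 + Q1 * ((r2 - r1) * Q1^{-1} mod Q2).
fn from_rns(x: Rns) -> u64 {
    // Q2 is prime, so Fermat's little theorem gives the inverse of Q1 mod Q2.
    let q1_inv = pow_mod(Q1 % Q2, Q2 - 2, Q2);
    let diff = (x.r2 + Q2 - x.r1 % Q2) % Q2;
    x.r1 + Q1 * mul_mod(diff, q1_inv, Q2)
}
```

For any `a` and `b` below `Q1 * Q2`, `from_rns(rns_mul(to_rns(a), to_rns(b)))` equals `a * b mod (Q1 * Q2)`. Reconstruction is the expensive part, which is why staying in RNS form for as long as possible pays off.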
// Before: big-integer multiplication
let result = bigint_mul(a, b); // Slow, allocates
let reduced = result % Q; // Even slower
// After: RNS multiplication (each residue channel is independent)
// Widen to u128 first: a ~55-bit x ~55-bit product does not fit in u64.
let r1 = ((a.r1 as u128 * b.r1 as u128) % q1 as u128) as u64;
let r2 = ((a.r2 as u128 * b.r2 as u128) % q2 as u128) as u64;
// No reconstruction needed for most operations!
NEON Vectorization
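With RNS, our coefficients fit in 64 bits. This unlocks SIMD parallelism: a 128-bit ARM NEON register holds two 64-bit lanes, so two coefficients move through each load, reduce, and store together. NEON has no 64x64-bit lane multiply, though, so the wide product has to be assembled from 32-bit widening multiplies such as vmull_u32:
// Process 2 coefficients at once (runs inside an unsafe block)
let a_vec = vld1q_u64(a_coeffs[i..].as_ptr());
let b_vec = vld1q_u64(b_coeffs[i..].as_ptr());
// There is no vmulq_u64 intrinsic; mul_wide_neon (illustrative name) stands in
// for a helper that builds the products from vmull_u32 partial products.
let prod = mul_wide_neon(a_vec, b_vec);
let result = montgomery_reduce_neon(prod, q_vec);
vst1q_u64(out[i..].as_mut_ptr(), result);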
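Because NEON only offers widening multiplies on 32-bit lanes (`vmull_u32` takes two 32-bit inputs per lane and yields 64-bit results), a 64x64-bit product is built from four 32-bit partial products. The following is a portable scalar sketch of that decomposition, not NEON code and not our actual kernel; the function name is hypothetical. A vectorized version would run the same arithmetic on two lanes at once.

```rust
/// Computes the full 128-bit product of `a` and `b` as (high, low) words,
/// using only 32x32 -> 64-bit multiplies, the shape NEON's widening
/// multiply provides. (Hypothetical helper, for illustration only.)
fn mul_64x64_via_32(a: u64, b: u64) -> (u64, u64) {
    const MASK: u64 = 0xFFFF_FFFF;
    let (a_lo, a_hi) = (a & MASK, a >> 32);
    let (b_lo, b_hi) = (b & MASK, b >> 32);

    // Four partial products, each at most 64 bits wide.
    let ll = a_lo * b_lo;
    let lh = a_lo * b_hi;
    let hl = a_hi * b_lo;
    let hh = a_hi * b_hi;

    // Middle column: sum the pieces that land in bits 32..64, keeping the carry.
    let mid = (ll >> 32) + (lh & MASK) + (hl & MASK);

    let lo = (ll & MASK) | (mid << 32);
    let hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    (hi, lo)
}
```

The high and low words then feed the modular reduction step.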
Combined with Montgomery reduction (from our encryption optimizations), each coefficient multiply-reduce takes just a few cycles.
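For readers who have not seen it, here is a minimal scalar sketch of Montgomery multiplication with R = 2^64. It is the textbook REDC algorithm written under stated assumptions (an odd modulus below 2^63), and the `Montgomery` type and its method names are hypothetical rather than our API. The point is that the hot path contains only multiplies, an add, a shift, and one conditional subtract.

```rust
/// Montgomery arithmetic modulo an odd q < 2^63, with R = 2^64.
struct Montgomery {
    q: u64,
    q_neg_inv: u64, // -q^{-1} mod 2^64
}

impl Montgomery {
    fn new(q: u64) -> Self {
        assert!(q % 2 == 1 && q < (1u64 << 63));
        // Newton iteration for q^{-1} mod 2^64. For odd q, q*q = 1 (mod 8),
        // so `inv = q` starts with 3 correct bits; each step doubles them.
        let mut inv = q;
        for _ in 0..5 {
            inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
        }
        Montgomery { q, q_neg_inv: inv.wrapping_neg() }
    }

    /// REDC: returns t * R^{-1} mod q for t < q * 2^64, with no division.
    fn reduce(&self, t: u128) -> u64 {
        let m = (t as u64).wrapping_mul(self.q_neg_inv);
        // t + m*q is divisible by 2^64, so the shift is exact.
        let u = ((t + m as u128 * self.q as u128) >> 64) as u64;
        if u >= self.q { u - self.q } else { u }
    }

    /// Converts into Montgomery form (a * R mod q). Done once, off the hot path.
    fn to_mont(&self, a: u64) -> u64 {
        (((a as u128) << 64) % self.q as u128) as u64
    }

    /// Converts out of Montgomery form.
    fn from_mont(&self, a: u64) -> u64 {
        self.reduce(a as u128)
    }

    /// Multiplies two values that are already in Montgomery form.
    fn mul(&self, a: u64, b: u64) -> u64 {
        self.reduce(a as u128 * b as u128)
    }
}
```

Values are converted with `to_mont` once, multiplied any number of times with `mul`, and converted back with `from_mont` only at the end, so the division in `to_mont` is amortized away.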
The Complete Picture
| Technique | Impact | Why It Works |
|---|---|---|
| RNS Representation | ~4x faster | Eliminates arbitrary-precision arithmetic |
| Montgomery Reduction | ~2x faster | Replaces division with multiplies and shifts |
| NEON Vectorization | ~1.5x faster | Process 2 coefficients per instruction |
| Combined | 7x faster | 168us → 26.7us (6.3x) → 24us (7x) |
vs Microsoft SEAL
SEAL is the industry standard for FHE. It's well-engineered and widely used. Here's how we compare:
H33 vs SEAL Homomorphic Multiply (N=4096)
H33: 24 microseconds
SEAL: ~180 microseconds
H33 Advantage: 7.5x faster
The difference comes from:
- Better RNS implementation - Our RNS base switching is more efficient
- ARM-first design - We optimized for NEON from day one; SEAL targets AVX-512
- Zero-copy operations - Pre-allocated buffers eliminate malloc overhead
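The pre-allocated buffer pattern from the last bullet can be sketched as follows. This is a generic illustration, not our kernel: the function name and signature are hypothetical, and the reduction uses a plain `%` for brevity. What matters is that the caller owns the output buffer and reuses it across calls, so the hot loop never allocates.

```rust
/// Pointwise modular multiply that writes into a caller-provided buffer.
/// No heap allocation happens inside this function.
/// (Hypothetical helper, for illustration only.)
fn pointwise_mul_into(out: &mut [u64], a: &[u64], b: &[u64], q: u64) {
    assert!(a.len() == b.len() && a.len() == out.len());
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        // Widen to u128 so the product cannot overflow before reduction.
        *o = ((x as u128 * y as u128) % q as u128) as u64;
    }
}
```

A caller would allocate `out` once (for example `vec![0u64; 4096]`) and pass `&mut out` to every subsequent multiplication.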
What This Enables
At 24 microseconds per multiply, encrypted computation becomes practical for real-time applications:
- Biometric matching: Compare encrypted face embeddings in ~260us total
- Encrypted search: Run queries over encrypted databases
- Private ML inference: Neural network inference on encrypted data
When your competitors are at 180+ microseconds per multiply, you can do roughly 7x as much computation in the same time budget. That's the difference between "FHE is a research toy" and "FHE is production ready."
Key Takeaways
- RNS is mandatory - If you're doing FHE without RNS representation, you're not competitive.
- Design for SIMD from the start - Retrofitting vectorization is painful. We designed our data structures for NEON from day one.
- Montgomery + RNS is the winning combo - Each technique gives you 2-4x; together they compound.
This is part 3 of our performance series. Next: we'll show the complete optimization journey and what it means for quantum-resistant authentication.