
FHE Homomorphic Multiply: 168us to 24us

Homomorphic multiplication is the most expensive operation in FHE. It's what enables computation on encrypted data, but it's also what makes FHE slow. We took our multiply operation from 168 microseconds to 24 microseconds - an 86% reduction that makes real-time encrypted computation possible.

Before: 168us | After: 24us | 86% faster | 6.3x vs baseline | 7.5x vs SEAL

Why Homomorphic Multiply is Hard

In BFV encryption, ciphertexts are polynomials with very large coefficients. A single coefficient might be 109 bits. When you multiply two ciphertexts:

  • You're multiplying polynomials of degree 4,096+
  • Each product of two 109-bit coefficients can need up to 218 bits, overflowing even 128-bit integers
  • The result needs modular reduction back to the ciphertext space
  • Relinearization is required to keep ciphertext size bounded
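To make the first three bullets concrete, here is a minimal sketch of the polynomial product that sits inside a ciphertext multiply. It assumes the usual BFV ring, where polynomials are reduced modulo x^N + 1 (so terms that wrap past degree N-1 come back with a flipped sign), and it uses a toy modulus. The function name negacyclic_mul is hypothetical, and this naive O(N^2) loop is for illustration only; it omits the scaling and relinearization steps of a full BFV multiply.

```rust
/// Naive product of two polynomials modulo (x^n + 1) and modulo q.
/// Illustrative sketch: inputs must already be reduced modulo q.
fn negacyclic_mul(a: &[u64], b: &[u64], q: u64) -> Vec<u64> {
    let n = a.len();
    assert_eq!(n, b.len());
    assert!(q < (1u64 << 63)); // keeps `out[k] + q` from overflowing u64
    let mut out = vec![0u64; n];
    for i in 0..n {
        for j in 0..n {
            // Widen to u128 so the coefficient product cannot overflow.
            let prod = ((a[i] as u128 * b[j] as u128) % q as u128) as u64;
            let k = (i + j) % n;
            if i + j < n {
                out[k] = (out[k] + prod) % q;
            } else {
                // x^n = -1, so wrapped terms are subtracted.
                out[k] = (out[k] + q - prod) % q;
            }
        }
    }
    out
}
```

With N = 4,096 this loop performs about 16.8 million coefficient products, each needing a wide intermediate and a modular reduction, which is why the cost of a single coefficient multiply-reduce matters so much.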

Our initial implementation handled this correctly but slowly. Every multiplication involved arbitrary-precision arithmetic to handle overflow, followed by expensive modular reduction.

The RNS Breakthrough

The key insight is the Residue Number System (RNS). Instead of working with one giant 109-bit modulus, we decompose it into several smaller moduli that fit in 64 bits:

RNS Decomposition

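Original: Q = 109-bit modulus

RNS form: Q = q1 * q2 (where q1, q2 are coprime and ~55 bits each)

Now each coefficient is represented as a tuple (x mod q1, x mod q2)

Multiplications become independent operations on 64-bit words. Each product of two ~55-bit residues needs at most ~110 bits, which fits in a native 128-bit intermediate, so no arbitrary-precision arithmetic is required.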

The Chinese Remainder Theorem guarantees we can reconstruct the full result from the residues. But the magic is: for most operations, we never need to reconstruct. We can add, subtract, and multiply entirely in RNS form.

// Before: Big integer multiplication
let result = bigint_mul(a, b);  // Slow, allocates
let reduced = result % Q;       // Even slower

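// After: RNS multiplication (the two residues are independent, so they can run in parallel)
// Widen to u128 first: the product of two ~55-bit residues does not fit in 64 bits.
let r1 = ((a.r1 as u128 * b.r1 as u128) % q1 as u128) as u64;  // native-width, fast
let r2 = ((a.r2 as u128 * b.r2 as u128) % q2 as u128) as u64;  // native-width, fast
// No reconstruction needed for most operations!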
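The round trip described above (decompose, multiply in RNS form, reconstruct with the Chinese Remainder Theorem) can be sketched end to end. This is an illustration under stated assumptions, not our production code: the moduli are two small Mersenne primes chosen so that the result can be checked with plain u128 arithmetic, and the names Rns, to_rns, rns_mul, and from_rns are hypothetical.

```rust
// Illustrative moduli (assumptions, not production parameters): two Mersenne primes.
const Q1: u64 = (1 << 31) - 1;
const Q2: u64 = (1 << 61) - 1;

/// A value stored as its residues modulo Q1 and Q2.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rns {
    r1: u64,
    r2: u64,
}

/// Modular multiply with a 128-bit intermediate.
fn mul_mod(a: u64, b: u64, q: u64) -> u64 {
    ((a as u128 * b as u128) % q as u128) as u64
}

/// Square-and-multiply exponentiation modulo q.
fn pow_mod(mut base: u64, mut exp: u64, q: u64) -> u64 {
    let mut acc = 1u64;
    base %= q;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul_mod(acc, base, q);
        }
        base = mul_mod(base, base, q);
        exp >>= 1;
    }
    acc
}

fn to_rns(x: u128) -> Rns {
    Rns { r1: (x % Q1 as u128) as u64, r2: (x % Q2 as u128) as u64 }
}

/// Multiply entirely in RNS form: one independent product per modulus.
fn rns_mul(a: Rns, b: Rns) -> Rns {
    Rns { r1: mul_mod(a.r1, b.r1, Q1), r2: mul_mod(a.r2, b.r2, Q2) }
}

/// CRT reconstruction (Garner's form): x = r1 + Q1 * ((r2 - r1) * Q1^{-1} mod Q2).
fn from_rns(v: Rns) -> u128 {
    let q1_inv = pow_mod(Q1, Q2 - 2, Q2); // Fermat inverse; valid because Q2 is prime
    let diff = (v.r2 + Q2 - v.r1 % Q2) % Q2;
    let t = mul_mod(diff, q1_inv, Q2);
    v.r1 as u128 + Q1 as u128 * t as u128
}
```

Calling from_rns(rns_mul(to_rns(x), to_rns(y))) gives the same value as (x * y) mod (Q1 * Q2). Reconstruction is the only step that touches wide integers, which is why skipping it for most operations pays off.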

NEON Vectorization

With RNS, our coefficients fit in 64 bits. This unlocks SIMD parallelism. ARM NEON registers are 128 bits wide, so each one holds two 64-bit coefficients and a single pass handles two coefficient multiplications:

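// Process 2 coefficient multiplications at once (sketch; i advances by 2).
// Note: vmulq_u64 is not a built-in NEON intrinsic (NEON has no 64x64-bit
// lane multiply); treat it as a helper assembled from 32-bit vmull_u32
// partial products. montgomery_reduce_neon is likewise a helper.
unsafe {
    let a_vec = vld1q_u64(a_coeffs[i..].as_ptr());   // load 2 lanes
    let b_vec = vld1q_u64(b_coeffs[i..].as_ptr());
    let prod = vmulq_u64(a_vec, b_vec);
    let result = montgomery_reduce_neon(prod, q_vec);
    vst1q_u64(out[i..].as_mut_ptr(), result);        // store 2 lanes
}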

Combined with Montgomery reduction (from our encryption optimizations), each coefficient multiply-reduce takes just a few cycles.
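For readers who have not seen the earlier post, a scalar sketch of Montgomery multiplication shows why the reduction is cheap. It assumes an odd modulus below 2^63 and R = 2^64; the type and method names are hypothetical, and the vectorized version would apply the same steps per lane.

```rust
/// Montgomery arithmetic modulo an odd q < 2^63, with R = 2^64.
/// All names here are illustrative.
struct Montgomery {
    q: u64,
    q_neg_inv: u64, // -q^{-1} mod 2^64
    r2: u64,        // R^2 mod q, used to enter Montgomery form
}

impl Montgomery {
    /// One-time setup; this is the only place a division (%) is used.
    fn new(q: u64) -> Self {
        assert!(q % 2 == 1 && q < (1u64 << 63));
        // Newton iteration for q^{-1} mod 2^64: q itself is correct to 3 bits,
        // and each step doubles the number of correct bits.
        let mut inv = q;
        for _ in 0..5 {
            inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
        }
        let r = ((1u128 << 64) % q as u128) as u64;
        let r2 = ((r as u128 * r as u128) % q as u128) as u64;
        Montgomery { q, q_neg_inv: inv.wrapping_neg(), r2 }
    }

    /// REDC: returns t * R^{-1} mod q for t < q * R, using only multiplies,
    /// an add, a shift, and one conditional subtraction.
    fn reduce(&self, t: u128) -> u64 {
        let m = (t as u64).wrapping_mul(self.q_neg_inv);
        let u = ((t + m as u128 * self.q as u128) >> 64) as u64;
        if u >= self.q { u - self.q } else { u }
    }

    /// Convert a < q into Montgomery form (a * R mod q).
    fn to_mont(&self, a: u64) -> u64 {
        self.reduce(a as u128 * self.r2 as u128)
    }

    /// Convert back from Montgomery form.
    fn from_mont(&self, a: u64) -> u64 {
        self.reduce(a as u128)
    }

    /// Multiply two values that are already in Montgomery form.
    fn mul(&self, a: u64, b: u64) -> u64 {
        self.reduce(a as u128 * b as u128)
    }
}
```

Values are converted with to_mont once, stay in Montgomery form across a whole chain of multiplications, and are converted back with from_mont only at the end, so the conversion cost is amortized.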

The Complete Picture

Technique | Impact | Why It Works
RNS Representation | ~4x faster | Eliminates arbitrary-precision arithmetic
Montgomery Reduction | ~2x faster | Replaces division with multiplies and shifts
NEON Vectorization | ~1.5x faster | Processes 2 coefficients per instruction
Combined | 6.3x faster | 168us → 26.7us → 24us

vs Microsoft SEAL

SEAL is the industry standard for FHE. It's well-engineered and widely used. Here's how we compare:

H33 vs SEAL Homomorphic Multiply (N=4096)

H33: 24 microseconds

SEAL: ~180 microseconds

H33 Advantage: 7.5x faster

The difference comes from:

  • Better RNS implementation - Our RNS base switching is more efficient
  • ARM-first design - We optimized for NEON from day one; SEAL targets AVX-512
  • Zero-copy operations - Pre-allocated buffers eliminate malloc overhead
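The pre-allocated buffer idea in the last bullet can be sketched as follows. This is a simplified illustration, not our actual buffer management: the Workspace type and its method are hypothetical, and the operation shown is just a pointwise product of two residue vectors.

```rust
/// Scratch space allocated once and reused across multiplies (hypothetical type).
struct Workspace {
    buf: Vec<u64>,
}

impl Workspace {
    fn new(n: usize) -> Self {
        Workspace { buf: vec![0; n] }
    }

    /// Pointwise product of two residue vectors modulo q, written into the
    /// pre-allocated buffer. No allocation happens on this path.
    fn pointwise_mul(&mut self, a: &[u64], b: &[u64], q: u64) -> &[u64] {
        assert!(a.len() == b.len() && a.len() <= self.buf.len());
        for ((o, &x), &y) in self.buf.iter_mut().zip(a).zip(b) {
            *o = ((x as u128 * y as u128) % q as u128) as u64;
        }
        &self.buf[..a.len()]
    }
}
```

The caller creates one Workspace up front and passes it through every multiply, so the hot loop never touches the allocator.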

What This Enables

At 24 microseconds per multiply, encrypted computation becomes practical for real-time applications:

  • Biometric matching: Compare encrypted face embeddings in ~260us total
  • Encrypted search: Run queries over encrypted databases
  • Private ML inference: Neural network inference on encrypted data

When your competitors are at 180+ microseconds per multiply, you can do 7x more computation in the same time budget. That's the difference between "FHE is a research toy" and "FHE is production ready."

Key Takeaways

  1. RNS is mandatory - If you're doing FHE without RNS representation, you're not competitive.
  2. Design for SIMD from the start - Retrofitting vectorization is painful. We designed our data structures for NEON from day one.
  3. Montgomery + RNS is the winning combo - Each technique gives you 2-4x; together they compound.

This is part 3 of our performance series. Next: we'll show the complete optimization journey and what it means for quantum-resistant authentication.

Try the World's Fastest FHE Multiply

24 microseconds per homomorphic multiplication. 7.5x faster than SEAL.
