The Performance Objection
Every conversation about FHE in production starts with the same objection: "It's too slow."
For most of FHE's history, this was correct. Craig Gentry's original 2009 scheme took 30 minutes per Boolean gate. Second-generation schemes (BGV, BFV) brought that down to milliseconds. Third-generation TFHE introduced programmable bootstrapping at around 10ms per gate. GPU acceleration pushed it further. But "milliseconds per operation" is still orders of magnitude too slow for systems that need to process millions of requests per second.
The objection was not wrong. It was just not permanent. Five years of instruction-level optimization changed the math entirely.
H33 does not accelerate generic FHE. It optimizes a specific pipeline—encrypted biometric matching, identity verification, and AI inference on sensitive data—at the NTT butterfly level. The result is not "faster FHE." It is a production system that happens to use FHE internally, running at speeds that make the encryption invisible to the application layer.
How We Got Here
There is no single optimization that makes FHE production-viable. It is the compound effect of dozens of optimizations, each multiplying the gains of the ones before it. Here are the ones that moved the needle the most:
Montgomery NTT with Harvey Lazy Reduction
The Number Theoretic Transform is the core of every lattice-based FHE scheme. It is where polynomial multiplication happens. Generic NTT implementations perform a modular reduction, compiled down to integer division, in every butterfly operation, and integer division is among the most expensive instructions on modern CPUs. Montgomery form eliminates division entirely, replacing it with shifts and multiplications. Harvey lazy reduction defers the final modular reduction between NTT stages, keeping intermediate values in [0, 2q) and skipping work that would be immediately undone in the next stage. The combination eliminates the single most expensive operation in the entire pipeline.
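The trick can be sketched in a few lines of Python. The modulus and radix below (Q = 7681, R = 2^16) are toy illustrative parameters, not H33's; only the structure matters: no division instruction anywhere, and the final reduction deferred until a stage boundary.

```python
# Toy Montgomery multiplication with a deferred ("lazy") final reduction.
# Q, R, and the inputs are illustrative; real parameters differ.
Q = 7681               # small NTT-friendly prime
R = 1 << 16            # Montgomery radix, R > Q
QINV = pow(-Q, -1, R)  # -Q^(-1) mod R, precomputed once

def to_mont(a):
    """Enter Montgomery form: a -> a*R mod Q (one-time cost)."""
    return (a * R) % Q

def redc(t):
    """Montgomery reduction t -> t*R^(-1) mod Q using only multiplies,
    adds, and shifts -- no division instruction anywhere."""
    m = ((t & (R - 1)) * QINV) & (R - 1)  # m = t * (-Q^(-1)) mod R
    return (t + m * Q) >> 16              # exact shift; result in [0, 2Q)

def mont_mul(a, b):
    """Product of Montgomery-form values; stays lazily in [0, 2Q)."""
    return redc(a * b)

a, b = 1234, 5678
lazy = mont_mul(to_mont(a), to_mont(b))
assert 0 <= lazy < 2 * Q                  # unreduced between NTT stages

# The deferred conditional subtraction, paid once at a stage boundary:
out = redc(lazy)                          # leave Montgomery form
out = out - Q if out >= Q else out
assert out == (a * b) % Q
```

The lazy part is the last two statements: intermediate butterfly outputs may sit anywhere in [0, 2Q), and the single conditional subtraction that normalizes them is paid once per stage rather than once per butterfly.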
NTT-Domain Fused Inner Products
Standard FHE implementations transform data into NTT domain, multiply, then transform back—for each pair of polynomials. H33 keeps data in NTT domain through the entire inner product computation and performs a single final Inverse NTT at the end. This saves 2×M transforms per multiply, where M is the number of CRT moduli. For a 32-user batch with multiple moduli, this eliminates hundreds of expensive transforms.
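The fusion can be demonstrated end to end on a toy negacyclic ring. The parameters below (q = 17, n = 8) are illustrative, far smaller than any real scheme; the point is that the accumulator lives in NTT domain and only one inverse transform runs, regardless of how many products are summed.

```python
# Fused inner product entirely in NTT domain over the toy negacyclic ring
# Z_17[x]/(x^8 + 1). Illustrative parameters; real rings use n >= 4096.
import random

Q, N = 17, 8
PSI = 3                  # primitive 2N-th root of unity mod Q (PSI^8 = -1)
OMEGA = PSI * PSI % Q    # N-th root of unity

def ntt(a):
    """Forward negacyclic NTT (naive O(n^2) version for clarity)."""
    return [sum(a[i] * pow(PSI, i, Q) * pow(OMEGA, i * j, Q)
                for i in range(N)) % Q for j in range(N)]

def intt(A):
    """Inverse negacyclic NTT."""
    n_inv, psi_inv, w_inv = pow(N, -1, Q), pow(PSI, -1, Q), pow(OMEGA, -1, Q)
    return [n_inv * pow(psi_inv, i, Q) *
            sum(A[j] * pow(w_inv, i * j, Q) for j in range(N)) % Q
            for i in range(N)]

def fused_inner_product(pairs):
    """Pointwise multiply-accumulate in NTT domain; ONE inverse NTT at
    the end, instead of one inverse transform per product."""
    acc = [0] * N
    for a, b in pairs:
        A, B = ntt(a), ntt(b)   # in practice often precomputed and stored
        acc = [(s + x * y) % Q for s, x, y in zip(acc, A, B)]
    return intt(acc)

def negacyclic_mul(a, b):
    """Schoolbook reference multiply in Z_Q[x]/(x^N + 1)."""
    r = [0] * N
    for i in range(N):
        for j in range(N):
            v = a[i] * b[j]
            if i + j < N:
                r[i + j] = (r[i + j] + v) % Q
            else:
                r[i + j - N] = (r[i + j - N] - v) % Q
    return r

random.seed(0)
pairs = [([random.randrange(Q) for _ in range(N)],
          [random.randrange(Q) for _ in range(N)]) for _ in range(3)]
expected = [0] * N
for a, b in pairs:
    expected = [(s + p) % Q for s, p in zip(expected, negacyclic_mul(a, b))]
assert fused_inner_product(pairs) == expected
```

For three products the naive path would run three inverse transforms; the fused path runs one, and the saving scales with both the inner-product length and the number of CRT moduli.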
Pre-NTT Public Keys and Templates
Public keys and enrolled biometric templates are used repeatedly. Instead of transforming them into NTT domain on every encryption or matching operation, H33 stores them in NTT form at keygen and enrollment time. The forward NTT cost is paid once and amortized over millions of operations.
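The amortization pattern is simple enough to show with a counting stand-in for the transform (the `ntt` function and store below are illustrative, not H33 APIs):

```python
# Sketch of paying the forward NTT once at enrollment. `ntt` is a
# counting stand-in for the real transform; the point is amortization.
ntt_calls = 0

def ntt(poly):
    global ntt_calls
    ntt_calls += 1
    return tuple(poly)        # stand-in: a real NTT would transform here

class TemplateStore:
    """Enrolled templates are stored already in NTT form."""
    def __init__(self):
        self._store = {}

    def enroll(self, user_id, template):
        self._store[user_id] = ntt(template)   # forward transform paid once

    def match_input(self, user_id):
        return self._store[user_id]            # hot path: no transform

store = TemplateStore()
store.enroll("alice", [1, 2, 3, 4])
for _ in range(1_000_000):                     # a million match operations...
    _ = store.match_input("alice")
assert ntt_calls == 1                          # ...one forward NTT, ever
```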
SIMD Batching
BFV's polynomial ring supports 4,096 plaintext slots. Each biometric template uses 128 dimensions. That means 32 users fit into a single ciphertext: 4096 ÷ 128 = 32. The encryption, matching, and verification costs are identical whether you process 1 user or 32. This is a 32x throughput multiplier at zero additional cost.
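The packing arithmetic is just slot layout. A minimal sketch (the `pack` helper is hypothetical; a real implementation would encode the slot vector into a BFV plaintext):

```python
# Slot-packing arithmetic from the text: 4,096 BFV slots, 128-dim templates.
SLOTS, DIM = 4096, 128
USERS_PER_CT = SLOTS // DIM
assert USERS_PER_CT == 32

def pack(templates):
    """Lay up to 32 templates end-to-end in one slot vector."""
    assert len(templates) <= USERS_PER_CT
    slots = [0] * SLOTS
    for u, t in enumerate(templates):
        assert len(t) == DIM
        slots[u * DIM:(u + 1) * DIM] = t   # user u owns slots [128u, 128u+128)
    return slots

batch = pack([[u] * DIM for u in range(32)])
assert batch[0] == 0 and batch[128] == 1 and batch[4095] == 31
```

Every homomorphic operation on the resulting ciphertext acts on all 4,096 slots at once, which is where the free 32x multiplier comes from.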
Batch Attestation
Instead of signing and verifying a Dilithium signature for each user, H33 signs once per 32-user batch. One Dilithium sign + verify (291 microseconds) attests the entire batch, eliminating 31 of every 32 signature operations compared to individual attestation.
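The amortization argument is independent of the signature scheme, so it can be sketched with HMAC-SHA256 standing in for Dilithium (the key and transcript encoding below are purely illustrative):

```python
# Sketch of batch attestation: one signature covers a whole 32-user batch.
# HMAC-SHA256 stands in for Dilithium for illustration only.
import hashlib
import hmac

KEY = b"demo-key"  # illustrative only

def sign(msg: bytes) -> bytes:
    return hmac.new(KEY, msg, hashlib.sha256).digest()

def verify(msg: bytes, sig: bytes) -> bool:
    return hmac.compare_digest(sign(msg), sig)

results = [f"user-{u}:match".encode() for u in range(32)]

# Per-user attestation: 32 sign + 32 verify operations.
per_user = [(r, sign(r)) for r in results]
assert all(verify(r, s) for r, s in per_user)

# Batch attestation: hash the batch transcript, sign once, verify once.
transcript = hashlib.sha256(b"\x00".join(results)).digest()
sig = sign(transcript)
assert verify(transcript, sig)   # 1 signature attests all 32 results
```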
In-Process ZKP Caching
ZK-STARK proof generation is expensive. ZK-STARK proof verification against a cached lookup is not. H33 uses an in-process DashMap that resolves proof lookups in 0.059 microseconds—44x faster than the raw STARK verification and with zero TCP contention. A TCP-based cache proxy (Cachee RESP) at 96 workers caused an 11x throughput regression due to connection serialization. In-process is the only architecture that works at this scale.
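The cache pattern itself is small. In this sketch a plain dict stands in for Rust's DashMap and a counter stands in for the expensive verifier; names are illustrative, not H33's API:

```python
# Sketch of the in-process proof cache pattern. A dict stands in for
# DashMap; `verify_stark` for the expensive STARK verifier.
stark_verifications = 0

def verify_stark(proof: bytes) -> bool:
    global stark_verifications
    stark_verifications += 1   # stand-in for the expensive path
    return True

_proof_cache = {}

def verify_cached(proof: bytes) -> bool:
    hit = _proof_cache.get(proof)
    if hit is not None:
        return hit             # in-process lookup: no syscall, no TCP trip
    result = verify_stark(proof)
    _proof_cache[proof] = result
    return result

proof = b"stark-proof-bytes"
for _ in range(1000):
    assert verify_cached(proof)
assert stark_verifications == 1    # full verification ran exactly once
```

The lookup never leaves the process, which is exactly what the TCP cache proxy could not guarantee: at 96 workers, every lookup there serialized on connections instead of resolving in-memory.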
The Numbers
H33 Production Pipeline v10 — Graviton4 Benchmark
The pipeline breakdown per 32-user batch:
| Stage | Component | Latency | % Pipeline |
|---|---|---|---|
| 1 | FHE Batch (BFV inner product, 32 users/CT) | 939µs | 76.2% |
| 2 | Dilithium sign + verify (1 per batch) | 291µs | 23.6% |
| 3 | STARK cached DashMap lookup | 0.059µs | <0.01% |
| 4 | ML Agents (Harvest + SideChannel + CryptoHealth) | ~2.35µs | 0.19% |
| Total | 32-user batch | 1,232µs | 100% |
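The table's totals check out by simple arithmetic, and dividing the batch total by the 32 packed users yields the per-user latency used elsewhere in this post:

```python
# Arithmetic check on the pipeline table: stage latencies per 32-user batch.
stages_us = {
    "fhe_batch": 939.0,       # BFV inner product, 32 users/CT
    "dilithium": 291.0,       # one sign + verify per batch
    "stark_lookup": 0.059,    # cached DashMap lookup
    "ml_agents": 2.35,        # Harvest + SideChannel + CryptoHealth
}
total_us = sum(stages_us.values())
assert round(total_us) == 1232            # the 1,232µs batch total
per_user_us = total_us / 32
assert round(per_user_us, 1) == 38.5      # amortized per-user latency
```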
vs The Competition
The performance gap between H33 and other privacy-preserving computation approaches is not incremental. It is structural.
| System | Operation | Latency | Hardware | Gap vs H33 |
|---|---|---|---|---|
| H33 | Full FHE + ZK + PQ pipeline (per user, amortized over a 32-user batch) | 38.5µs | 1 ARM CPU ($2/hr) | — |
| Zama TFHE | Single bootstrap | 800µs | H100 GPU ($280K) | 20x slower |
| Generic BFV/CKKS | Single homomorphic op | 4–7ms | Server-class CPU | 100–180x slower |
| Succinct SP1 | Single proof generation | 10.3s | 16 GPUs | 267,000x slower |
Zama's TFHE bootstrap on an H100 is impressive engineering. But 800 microseconds for a single gate operation—on hardware that costs $280,000 per card—cannot match 38.5 microseconds for a complete authentication pipeline on a $2/hr ARM instance. The gap is 20x in latency and 140,000x in cost-per-operation.
Why GPU Approaches Can't Close the Gap
The natural assumption is that GPUs will eventually solve FHE performance. More parallelism, more cores, more memory bandwidth. This assumption is wrong, and the reason is architectural.
GPU parallelism helps when the bottleneck is parallelizable floating-point arithmetic—matrix multiplications, convolutions, linear algebra. GPUs have thousands of cores optimized for this pattern.
In FHE, the bottleneck is different:
- Cache utilization — NTT butterflies access data in a stride pattern that defeats GPU cache hierarchies. L1 cache on an ARM CPU core (64KB–128KB) holds the working set for an entire polynomial. GPU shared memory is smaller and shared across warps.
- Memory access patterns — NTT requires data-dependent memory accesses that cause warp divergence on GPUs. CPUs with branch prediction and prefetch handle these patterns natively.
- Modular arithmetic — FHE operates on integers modulo a prime, not floating point. GPUs are designed for IEEE 754 float/double throughput. Integer modular multiplication requires multi-instruction sequences on GPU, versus a short run of single-cycle multiply and shift instructions (Montgomery multiplication) on CPU.
- Reduction arithmetic — Montgomery REDC and Harvey lazy reduction are inherently sequential within a butterfly. You cannot parallelize the reduction step across GPU threads without introducing synchronization overhead that negates the parallelism.
More GPU cores cannot solve this. It is a fundamental mismatch between the computation pattern (tight integer arithmetic with cache-sensitive access patterns) and the hardware architecture (massively parallel floating-point with distributed cache hierarchies). CPUs with large private L1/L2 caches and fast integer pipelines are structurally better hardware for NTT-heavy FHE workloads.
H33 runs on AWS Graviton4—an ARM CPU with 192 vCPUs, large per-core caches, and a flat memory model. No GPU. No FPGA. No custom ASIC. The system allocator (glibc malloc, not jemalloc) outperforms alternatives because ARM's memory model eliminates arena bookkeeping overhead under tight FHE loops. The fastest path to production FHE is not better hardware. It is better algorithms on commodity hardware.
GPUs may eventually close part of the gap with custom ASIC-like designs (DARPA's DPRIVE program aims for this). But custom silicon is 3–5 years away from production and will cost orders of magnitude more than a Graviton4 spot instance. H33 is shipping today at $2/hr.