Performance FHE · 6 min read

Secure AI Without Slowing Down

"FHE is too slow for production." This was true in 2020. It was true in 2023. It is not true in 2026. H33 runs complete FHE + ZK + post-quantum pipelines at 38.5 microseconds per operation—faster than a human blink.

38.5µs per auth · 2.17M/s sustained · ±0.71% variance · $2/hr hardware cost
Measured on c8g.metal-48xl (96 cores, AWS Graviton4, Neoverse V2) · Criterion.rs v0.5 · 120-second sustained run · March 2026

The Performance Objection

Every conversation about FHE in production starts with the same objection: "It's too slow."

For most of FHE's history, this was correct. Craig Gentry's original 2009 scheme took 30 minutes per Boolean gate. Second-generation schemes (BGV, BFV) brought that down to milliseconds. Third-generation TFHE introduced programmable bootstrapping at around 10ms per gate. GPU acceleration pushed it further. But "milliseconds per operation" is still orders of magnitude too slow for systems that need to process millions of requests per second.

The objection was not wrong. It was just not permanent. Five years of instruction-level optimization changed the math entirely.

The Shift

H33 does not accelerate generic FHE. It optimizes a specific pipeline—encrypted biometric matching, identity verification, and AI inference on sensitive data—at the NTT butterfly level. The result is not "faster FHE." It is a production system that happens to use FHE internally, running at speeds that make the encryption invisible to the application layer.

How We Got Here

There is no single optimization that makes FHE production-viable. It is the compound effect of dozens of optimizations, each multiplying the gains of the ones before it. Here are the ones that moved the needle the most:

Montgomery NTT with Harvey Lazy Reduction

The Number Theoretic Transform is the core of every lattice-based FHE scheme. It is where polynomial multiplication happens. Generic NTT implementations use modular division in every butterfly operation—and division is the most expensive instruction on modern CPUs. Montgomery form eliminates division entirely, replacing it with shifts and multiplications. Harvey lazy reduction defers the final modular reduction between NTT stages, keeping intermediate values in [0, 2q) and skipping work that would be immediately undone in the next stage. The combination eliminates the single most expensive operation in the entire pipeline.
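The core trick can be sketched in a few lines of Python. The parameters below (Q = 12289, R = 2^16) are illustrative small values, not H33's production moduli, and production code does this with 64-bit integer instructions rather than arbitrary-precision arithmetic:

```python
# Montgomery-form modular multiplication with a lazy (deferred) final
# reduction, as used inside NTT butterflies. Q and R are illustrative.

Q = 12289                    # small NTT-friendly prime
R = 1 << 16                  # Montgomery radix; R > 4*Q, so lazy inputs are safe
Q_NEG_INV = pow(-Q, -1, R)   # -Q^(-1) mod R, precomputed once

def to_mont(x):
    return (x * R) % Q       # enter Montgomery form (done once per operand)

def redc(t):
    """Montgomery reduction: returns t * R^(-1) mod Q with no division.
    The result lands in [0, 2Q); the final conditional subtraction is
    deferred (Harvey-style laziness), so values between NTT stages are
    allowed to stay in [0, 2Q) instead of being fully reduced."""
    m = (t * Q_NEG_INV) % R          # "% R" and ">> 16" compile to mask/shift
    return (t + m * Q) >> 16         # exact: t + m*Q is divisible by R

def mont_mul(a, b):
    return redc(a * b)               # inputs may be lazy, i.e. in [0, 2Q)

def from_mont(x):
    return redc(x) % Q               # leave Montgomery form, fully reduced

# 5 * 7 mod Q, computed entirely with shifts, masks, and multiplies
product = from_mont(mont_mul(to_mont(5), to_mont(7)))
```

The "% R" and ">> 16" operations are bit masks and shifts on real hardware; the division instruction never appears, which is the entire point.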

NTT-Domain Fused Inner Products

Standard FHE implementations transform data into NTT domain, multiply, then transform back—for each pair of polynomials. H33 keeps data in NTT domain through the entire inner product computation and performs a single final Inverse NTT at the end. This saves 2×M transforms per multiply, where M is the number of CRT moduli. For a 32-user batch with multiple moduli, this eliminates hundreds of expensive transforms.
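The difference between the two strategies can be shown with a toy cyclic NTT over a small prime (illustrative parameters, not H33's; the real system works in larger negacyclic rings across multiple CRT moduli):

```python
# Inner product of polynomial pairs, sum_i a_i * b_i, computed two ways:
# transform-back-per-pair vs. staying in NTT domain with one final INTT.

Q = 12289                    # prime with Q - 1 divisible by 2^12
N = 8                        # toy polynomial length (power of two)
# A quadratic non-residue g has full 2-power order, so g^((Q-1)/N) is a
# primitive N-th root of unity.
G = next(pow(g, (Q - 1) // N, Q) for g in range(2, Q)
         if pow(g, (Q - 1) // 2, Q) == Q - 1)

def ntt(a, w):
    """Recursive radix-2 Cooley-Tukey NTT with root of unity w."""
    if len(a) == 1:
        return list(a)
    even, odd = ntt(a[0::2], w * w % Q), ntt(a[1::2], w * w % Q)
    out, wk = [0] * len(a), 1
    for k in range(len(a) // 2):
        t = wk * odd[k] % Q
        out[k] = (even[k] + t) % Q
        out[k + len(a) // 2] = (even[k] - t) % Q
        wk = wk * w % Q
    return out

def intt(a, w):
    inv_n = pow(len(a), -1, Q)
    return [x * inv_n % Q for x in ntt(a, pow(w, -1, Q))]

def inner_naive(pairs):
    """Transform in, multiply, transform back -- for every pair."""
    acc = [0] * N
    for a, b in pairs:
        prod = intt([x * y % Q for x, y in zip(ntt(a, G), ntt(b, G))], G)
        acc = [(x + y) % Q for x, y in zip(acc, prod)]
    return acc

def inner_fused(pairs):
    """Accumulate pointwise in NTT domain; one inverse NTT total."""
    acc = [0] * N
    for a, b in pairs:
        acc = [(s + x * y) % Q for s, x, y in zip(acc, ntt(a, G), ntt(b, G))]
    return intt(acc, G)
```

Both functions return identical coefficients, but the fused version runs one inverse transform in total instead of one per pair; with operands stored in NTT form ahead of time (next section), the forward transforms disappear as well.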

Pre-NTT Public Keys and Templates

Public keys and enrolled biometric templates are used repeatedly. Instead of transforming them into NTT domain on every encryption or matching operation, H33 stores them in NTT form at keygen and enrollment time. The forward NTT cost is paid once and amortized over millions of operations.
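A minimal sketch of the amortization. Here `forward_ntt` is a counting stand-in for the real transform; the point is purely structural: the transform runs once per enrollment, not once per match.

```python
# Pay the forward NTT once at enrollment; every later match reads the
# pre-transformed template. `forward_ntt` is a placeholder, not a real NTT.

calls = {"forward_ntt": 0}

def forward_ntt(poly):
    calls["forward_ntt"] += 1       # expensive path: count invocations
    return tuple(poly)              # stand-in for NTT-domain coefficients

class TemplateStore:
    def __init__(self):
        self._cache = {}

    def enroll(self, user_id, template):
        self._cache[user_id] = forward_ntt(template)   # transform once

    def get(self, user_id):
        return self._cache[user_id]                    # already in NTT form

store = TemplateStore()
store.enroll("alice", [1, 2, 3])
for _ in range(1000):               # a thousand matching operations...
    _ = store.get("alice")
assert calls["forward_ntt"] == 1    # ...but only one forward transform
```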

SIMD Batching

BFV's polynomial ring supports 4,096 plaintext slots. Each biometric template uses 128 dimensions. That means 32 users fit into a single ciphertext: 4096 ÷ 128 = 32. The encryption, matching, and verification costs are identical whether you process 1 user or 32. This is a 32x throughput multiplier at zero additional cost.
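The slot layout is plain index arithmetic. This sketch packs and unpacks the plaintext slot vector; in the real system, `slots` is what gets encrypted into a single BFV ciphertext:

```python
# 32 users' 128-dimensional templates packed into the 4,096 plaintext
# slots of one ciphertext.

SLOTS = 4096
DIM = 128
USERS_PER_CT = SLOTS // DIM        # = 32

def pack(templates):
    """Lay out up to 32 user templates contiguously in one slot vector."""
    assert len(templates) <= USERS_PER_CT
    assert all(len(t) == DIM for t in templates)
    slots = [0] * SLOTS
    for u, t in enumerate(templates):
        slots[u * DIM:(u + 1) * DIM] = t
    return slots

def unpack(slots, user):
    """Recover one user's 128 dimensions from the shared slot vector."""
    return slots[user * DIM:(user + 1) * DIM]

# deterministic dummy templates for 32 users
templates = [[(u * 131 + i) % 256 for i in range(DIM)] for u in range(USERS_PER_CT)]
slots = pack(templates)
```

Every homomorphic operation applied to the ciphertext acts on all 4,096 slots at once, which is where the 32x multiplier comes from.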

Batch Attestation

Instead of signing and verifying a Dilithium signature for each user, H33 signs once per 32-user batch. One Dilithium sign + verify (291 microseconds) attests the entire batch: a 32x reduction in signature operations compared to individual attestation.
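The batching logic can be sketched as follows. HMAC-SHA256 stands in for Dilithium here (the real scheme is an asymmetric post-quantum signature; only the one-tag-per-batch structure is the point), and the key is illustrative:

```python
# Batch attestation sketch: hash the whole batch down to one digest, then
# produce a single authentication tag for all 32 results. HMAC-SHA256 is a
# stand-in for Dilithium sign/verify.

import hashlib
import hmac

KEY = b"attestation-key"            # illustrative key material

def attest_batch(results):
    """One tag covers every result in the batch."""
    digest = hashlib.sha256(b"".join(results)).digest()
    return hmac.new(KEY, digest, hashlib.sha256).digest()

def verify_batch(results, tag):
    digest = hashlib.sha256(b"".join(results)).digest()
    expected = hmac.new(KEY, digest, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

batch = [f"user{i}:ok".encode() for i in range(32)]
tag = attest_batch(batch)           # 1 signature operation for 32 users
```

Tampering with any single result in the batch invalidates the one tag, so per-user integrity is preserved despite the shared signature.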

In-Process ZKP Caching

ZK-STARK proof generation is expensive. ZK-STARK proof verification against a cached lookup is not. H33 uses an in-process DashMap that resolves proof lookups in 0.059 microseconds—44x faster than the raw STARK verification and with zero TCP contention. A TCP-based cache proxy (Cachee RESP) at 96 workers caused an 11x throughput regression due to connection serialization. In-process is the only architecture that works at this scale.
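A sketch of the in-process design. A plain dict behind a lock stands in for Rust's DashMap, and `stark_verify` is a counting placeholder for real STARK verification; the structural point is that cache hits never touch the network and never re-run the verifier:

```python
# In-process proof cache sketch: dict + lock as a stand-in for DashMap.

import hashlib
import threading

verify_calls = {"n": 0}

def stark_verify(proof: bytes) -> bool:
    verify_calls["n"] += 1            # expensive path: count invocations
    return proof.endswith(b"|valid")  # placeholder for real verification

class ProofCache:
    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def verify(self, proof: bytes) -> bool:
        key = hashlib.sha256(proof).digest()
        with self._lock:              # in-process: no TCP, no connection pool
            if key in self._cache:
                return self._cache[key]
        ok = stark_verify(proof)      # only on a cache miss
        with self._lock:
            self._cache[key] = ok
        return ok

cache = ProofCache()
for _ in range(1000):                 # 1,000 lookups...
    assert cache.verify(b"proof-for-batch-42|valid")
```

A networked cache would pay a round trip and connection serialization on every one of those 1,000 lookups; in-process, 999 of them are a hash plus a map read.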

The Numbers

H33 Production Pipeline v10 — Graviton4 Benchmark

Per-authentication latency: 38.5µs
32-user batch latency: 1,232µs
Sustained throughput (120s): 2,172,518 auth/sec
Variance: ±0.71%
Hardware cost: $2/hr (spot)
c8g.metal-48xl · 192 vCPUs · 96 workers · Graviton4 Neoverse V2 · System allocator (not jemalloc)

The pipeline breakdown per 32-user batch:

Stage Component Latency % Pipeline
1 FHE Batch (BFV inner product, 32 users/CT) 939µs 76.2%
2 Dilithium sign + verify (1 per batch) 291µs 23.6%
3 STARK cached DashMap lookup 0.059µs <0.01%
4 ML Agents (Harvest + SideChannel + CryptoHealth) ~2.35µs 0.19%
Total 32-user batch 1,232µs 100%

vs The Competition

The performance gap between H33 and other privacy-preserving computation approaches is not incremental. It is structural.

System Operation Latency Hardware Gap vs H33
H33 Full FHE + ZK + PQ pipeline 38.5µs 1 ARM CPU ($2/hr)
Zama TFHE Single bootstrap 800µs H100 GPU ($280K) 20x slower
Generic BFV/CKKS Single homomorphic op 4–7ms Server-class CPU 100–180x slower
Succinct SP1 Single proof generation 10.3s 16 GPUs 267,000x slower

Zama's TFHE bootstrap on an H100 is impressive engineering. But 800 microseconds for a single gate operation—on hardware that costs $280,000 per card—cannot match 38.5 microseconds for a complete authentication pipeline on a $2/hr ARM instance. The gap is 20x in latency and 140,000x in cost-per-operation.

Why GPU Approaches Can't Close the Gap

The natural assumption is that GPUs will eventually solve FHE performance. More parallelism, more cores, more memory bandwidth. This assumption is wrong, and the reason is architectural.

GPU parallelism helps when the bottleneck is parallelizable floating-point arithmetic—matrix multiplications, convolutions, linear algebra. GPUs have thousands of cores optimized for this pattern.

In FHE, the bottleneck is different: tight modular integer arithmetic, long dependency chains between NTT stages, and butterfly memory-access patterns that are acutely cache-sensitive.

This is not a solvable problem with more GPU cores. It is a fundamental mismatch between the computation pattern (tight integer arithmetic with cache-sensitive access patterns) and the hardware architecture (massively parallel floating-point with distributed cache hierarchies). CPUs with large private L1/L2 caches and fast integer pipelines are structurally better hardware for NTT-heavy FHE workloads.

The Hardware Truth

H33 runs on AWS Graviton4—an ARM CPU with 192 vCPUs, large per-core caches, and a flat memory model. No GPU. No FPGA. No custom ASIC. The system allocator (glibc malloc, not jemalloc) outperforms alternatives because ARM's memory model eliminates arena bookkeeping overhead under tight FHE loops. The fastest path to production FHE is not better hardware. It is better algorithms on commodity hardware.

GPUs may eventually close part of the gap with custom ASIC-like designs (DARPA's DPRIVE program aims for this). But custom silicon is 3–5 years away from production and will cost orders of magnitude more than a Graviton4 spot instance. H33 is shipping today at $2/hr.

Production-Speed Privacy. No GPU Required.

2.17 million encrypted operations per second on a single ARM instance. See the benchmarks yourself.
