Over the past several months, we've systematically optimized every layer of H33's cryptographic stack. The result: we now have the fastest FHE library and the fastest STARK prover in the world. This post brings together the complete picture — and you can verify every number on our benchmarks page.
What makes H33 different from academic benchmarks is that these numbers come from a production authentication pipeline — not isolated microbenchmarks. Every optimization had to survive 96-core contention on AWS Graviton4, real memory pressure, and the full chain of BFV FHE encryption, ZKP verification, and Dilithium attestation running simultaneously. The single-call latency of ~42µs per authentication represents the entire post-quantum pipeline, not any single primitive in isolation.
FHE Encryption Journey
Improvement: 6.1x faster than baseline | vs SEAL: 4.5x faster
The FHE encrypt path was the first bottleneck we attacked. The baseline BFV encryptor spent the majority of its time in Number Theoretic Transforms (NTTs) — the polynomial-ring equivalent of FFTs that underpin every BFV operation. Each encryption required multiple forward and inverse NTTs across several RNS (Residue Number System) moduli, and every NTT butterfly involved a modular reduction using integer division.
Switching to Montgomery multiplication eliminated division entirely from the NTT hot path. Instead of computing a * b mod q via division, Montgomery form uses a precomputed negated inverse of the modulus mod 2^64 and bitwise shifts — reducing each modular reduction to two extra 64-bit multiplications and an addition. We then stored all twiddle factors in Montgomery form at keygen time, so the forward and inverse transforms never touch a division instruction. Combined with Harvey lazy reduction — allowing intermediate butterfly values to sit in [0, 2q) between stages rather than fully reducing after each step — this eliminated an enormous amount of redundant work.
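To make the idea concrete, here is a minimal, self-contained sketch of division-free Montgomery arithmetic for a single 64-bit channel. The modulus 2^61 − 1 is a toy choice for illustration, not one of our NTT-friendly BFV primes, and this is not our production code — just the shape of the technique: one precomputed negated inverse, then every reduction is two multiplies, a shift, and a conditional subtract.

```rust
/// Illustrative Montgomery arithmetic for one odd 64-bit modulus, R = 2^64.
/// Toy modulus (2^61 - 1, a Mersenne prime) -- NOT an NTT-friendly BFV prime.
const Q: u64 = (1u64 << 61) - 1;

/// -Q^{-1} mod 2^64 via Newton iteration (each step doubles the correct bits).
fn neg_qinv() -> u64 {
    let mut inv: u64 = Q; // for odd Q, Q*Q = 1 mod 8, so 3 bits start correct
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(Q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// Montgomery reduction: t * 2^-64 mod Q, with no division by Q anywhere.
fn redc(t: u128, nqi: u64) -> u64 {
    let m = (t as u64).wrapping_mul(nqi);                 // low 64 bits only
    let t2 = ((t + (m as u128) * (Q as u128)) >> 64) as u64; // exact shift: low word cancels
    if t2 >= Q { t2 - Q } else { t2 }                     // single conditional subtract
}

/// Enter Montgomery form: a * 2^64 mod Q (done once, e.g. at keygen for twiddles).
fn to_mont(a: u64) -> u64 {
    (((a as u128) << 64) % (Q as u128)) as u64
}

/// One modular multiply in the hot loop: a wide multiply plus a REDC.
fn mont_mul(a: u64, b: u64, nqi: u64) -> u64 {
    redc((a as u128) * (b as u128), nqi)
}

fn main() {
    let nqi = neg_qinv();
    let (a, b) = (123_456_789_012_345u64, 987_654_321_098_765u64);
    // Multiply in Montgomery form, then convert back out with one more REDC.
    let prod = redc(mont_mul(to_mont(a), to_mont(b), nqi) as u128, nqi);
    assert_eq!(prod, ((a as u128 * b as u128) % (Q as u128)) as u64);
    println!("product mod Q = {}", prod);
}
```

Storing twiddle factors already in Montgomery form, as described above, means `to_mont` never runs inside the transform — only `mont_mul` does.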
Parallelizing the NTT across RNS moduli with Rayon delivered the final push from 351µs to 331µs. Each modulus channel is completely independent, making this embarrassingly parallel. On Graviton4 with its 96 Neoverse V2 cores, the parallel NTT keeps all channels saturated.
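The fan-out pattern is simple enough to sketch. Our code uses Rayon's parallel iterators; this dependency-free version uses `std::thread::scope` to show the same structure, and the per-channel transform is abbreviated to a pointwise modular multiply standing in for the full NTT. The moduli below are arbitrary toy primes, not our parameters.

```rust
use std::thread;

/// Stand-in for the real per-channel forward NTT: a pointwise modular multiply.
/// The point of the sketch is the fan-out, not the transform itself.
fn transform_channel(coeffs: &mut [u64], q: u64, twiddle: u64) {
    for c in coeffs.iter_mut() {
        *c = ((*c as u128 * twiddle as u128) % q as u128) as u64;
    }
}

fn main() {
    // Three independent RNS channels: one modulus, one coefficient vector each.
    let moduli = [1_000_000_007u64, 998_244_353, 754_974_721];
    let mut channels: Vec<Vec<u64>> = moduli
        .iter()
        .map(|&q| (0..8u64).map(|i| (i * i) % q).collect())
        .collect();

    // Each channel borrows disjoint data, so scoped threads can mutate them
    // simultaneously with no locking -- the "embarrassingly parallel" case.
    thread::scope(|s| {
        for (chan, &q) in channels.iter_mut().zip(&moduli) {
            s.spawn(move || transform_channel(chan, q, 12345));
        }
    });

    println!("channel 0 after transform: {:?}", channels[0]);
}
```

With Rayon the loop body collapses to a `par_iter_mut().zip(...)` over the channels, and the work-stealing scheduler handles core placement.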
STARK Proving Journey
Improvement: 2.8x faster than baseline | vs Plonky3: 30-50% faster
STARK proving is dominated by two operations: Merkle tree construction (hashing) and polynomial evaluation (field arithmetic). The baseline prover built Merkle trees sequentially — a single thread hashing leaves, then internal nodes. Switching to parallel Merkle construction via Rayon cut 6ms in one shot, since each subtree is independent until the final merge at the root.
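The subtree independence is easy to demonstrate. In this sketch a toy 64-bit mixing function stands in for the real cryptographic hash, and two scoped threads stand in for Rayon's full fork-join recursion; the parallel root is bit-identical to the sequential one because the two halves share nothing until the final merge.

```rust
use std::thread;

/// Toy 64-bit mixer standing in for the tree's real cryptographic hash.
fn mix(a: u64, b: u64) -> u64 {
    let mut x = a ^ b.rotate_left(31);
    x = x.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    x ^ (x >> 29)
}

/// Sequentially fold a power-of-two slice of leaves up to a single root.
fn subtree_root(leaves: &[u64]) -> u64 {
    let mut level = leaves.to_vec();
    while level.len() > 1 {
        level = level.chunks(2).map(|p| mix(p[0], p[1])).collect();
    }
    level[0]
}

fn main() {
    let leaves: Vec<u64> = (0..1024u64).collect();
    let (lo, hi) = leaves.split_at(leaves.len() / 2);

    // Hash the two subtrees on separate threads, merge once at the root.
    // Rayon generalizes this two-way split into a recursive fork-join.
    let root = thread::scope(|s| {
        let left = s.spawn(|| subtree_root(lo));
        let right = s.spawn(|| subtree_root(hi));
        mix(left.join().unwrap(), right.join().unwrap())
    });

    assert_eq!(root, subtree_root(&leaves)); // parallel == sequential
    println!("root = {:#018x}", root);
}
```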
The breakthrough came from batch inversion. STARK FRI (Fast Reed-Solomon Interactive Oracle Proof of Proximity) queries require computing many field inversions during the quotient phase. A naive approach computes each inversion independently, costing one extended-GCD per element. Montgomery's batch inversion trick replaces N independent inversions with roughly 3N field multiplications and a single inversion — a 50x speedup for N=1024. This alone dropped the prover from 14ms to 7.1ms. Profile-guided optimization (PGO) then tuned branch prediction and instruction cache layout for the final 2% push to 6.96ms.
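Montgomery's trick is worth seeing in full: accumulate prefix products, invert the single total, then peel the individual inverses off in a backward pass. This sketch uses the Goldilocks prime as a toy field (not necessarily the prover's field) and Fermat exponentiation for the one real inversion, standing in for extended GCD.

```rust
/// Toy prime field for illustration: the Goldilocks prime 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// The single "expensive" inversion, via Fermat: a^(P-2) mod P.
fn inv(mut a: u64) -> u64 {
    let (mut e, mut r) = (P - 2, 1u64);
    while e > 0 {
        if e & 1 == 1 { r = mul(r, a); }
        a = mul(a, a);
        e >>= 1;
    }
    r
}

/// Montgomery's trick: N inverses for ~3N multiplies + 1 inversion.
/// All inputs must be nonzero.
fn batch_inverse(xs: &[u64]) -> Vec<u64> {
    // Forward pass: prefix[i] = x_0 * x_1 * ... * x_i.
    let mut prefix = Vec::with_capacity(xs.len());
    let mut acc = 1u64;
    for &x in xs {
        acc = mul(acc, x);
        prefix.push(acc);
    }
    // One real inversion of the total product...
    let mut inv_acc = inv(acc);
    // ...then a backward pass peels off one element's inverse at a time:
    // inv(x_i) = inv(prefix[i]) * prefix[i-1].
    let mut out = vec![0u64; xs.len()];
    for i in (0..xs.len()).rev() {
        out[i] = if i == 0 { inv_acc } else { mul(inv_acc, prefix[i - 1]) };
        inv_acc = mul(inv_acc, xs[i]); // now inv_acc = inv(prefix[i-1])
    }
    out
}

fn main() {
    let xs = [3u64, 5, 7, 11, 123_456_789];
    let invs = batch_inverse(&xs);
    for (&x, &ix) in xs.iter().zip(&invs) {
        assert_eq!(mul(x, ix), 1); // every pair multiplies back to the identity
    }
    println!("batch inverses verified for {} elements", xs.len());
}
```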
FHE Homomorphic Multiply Journey
Improvement: 6.3x faster | vs SEAL: 7.5x faster
The homomorphic multiply is the most expensive primitive in any BFV pipeline. Our original implementation used arbitrary-precision integer arithmetic for the tensor product — multiplying two ciphertext polynomials with coefficients that can reach hundreds of bits. Switching to RNS (Residue Number System) representation decomposed each large modulus into several 64-bit channels, eliminating big-integer math entirely. Each channel's polynomial multiply becomes a pointwise NTT-domain operation: forward NTT, element-wise multiply, done. No inverse NTT is needed until the final accumulation step, saving two full transforms per modulus channel.
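The core RNS idea can be shown in miniature with two channels: decompose each operand into residues, multiply each channel independently in 64-bit arithmetic, and reconstruct via CRT (Garner's form) — at no point does anything wider than 128 bits exist. The two primes here are toy choices for the sketch, not our modulus chain, and real BFV works on whole polynomials per channel rather than single coefficients.

```rust
/// Two coprime "channel" moduli. Toy choices for illustration; real BFV uses
/// a chain of NTT-friendly primes. Their product bounds representable values.
const P1: u64 = (1u64 << 61) - 1;      // Mersenne prime 2^61 - 1
const P2: u64 = 0xFFFF_FFFF_0000_0001; // Goldilocks prime 2^64 - 2^32 + 1

fn mulmod(a: u64, b: u64, m: u64) -> u64 {
    ((a as u128 * b as u128) % m as u128) as u64
}

fn powmod(mut a: u64, mut e: u64, m: u64) -> u64 {
    let mut r = 1u64;
    while e > 0 {
        if e & 1 == 1 { r = mulmod(r, a, m); }
        a = mulmod(a, a, m);
        e >>= 1;
    }
    r
}

fn main() {
    let (a, b) = (1_234_567_890_123u64, 987_654_321_987u64);

    // Decompose: each operand becomes one small residue per channel.
    let (a1, a2) = (a % P1, a % P2);
    let (b1, b2) = (b % P1, b % P2);

    // Each channel multiplies independently -- no big-integer math anywhere.
    let (c1, c2) = (mulmod(a1, b1, P1), mulmod(a2, b2, P2));

    // Garner/CRT reconstruction: x = c1 + P1 * ((c2 - c1) * P1^{-1} mod P2).
    let p1_inv = powmod(P1, P2 - 2, P2); // Fermat inverse, valid since P2 is prime
    let diff = if c2 >= c1 { c2 - c1 } else { c2 + (P2 - c1) }; // c1 < P1 < P2
    let t = mulmod(diff, p1_inv, P2);
    let x = c1 as u128 + (P1 as u128) * (t as u128);

    // Exact because a*b < P1*P2, so the CRT representative is unique.
    assert_eq!(x, a as u128 * b as u128);
    println!("reconstructed product = {}", x);
}
```

In the actual pipeline the reconstruction is deferred: channels stay decomposed through the NTT-domain pointwise multiplies and are only recombined at the final accumulation step, which is what saves the extra transforms.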
ARM NEON intrinsics accelerated the remaining element-wise operations — branchless Galois permutations and vectorized key-switching during relinearization. The net result: 168µs down to 26.7µs, a 6.3x improvement that leaves Microsoft SEAL 7.5x behind.
Production Pipeline: 2.17M Auth/sec
Fast primitives are necessary but not sufficient. The production authentication pipeline chains three stages into a single API call: BFV FHE batch verification, ZKP lookup, and Dilithium attestation. Each stage had to be optimized not just in isolation but under the contention of 96 parallel workers.
- FHE batch inner product: ~1,109µs — BFV encrypted biometric match across 32 SIMD-packed users in a single ciphertext (N=4096, 128 dims per user).
- ZKP cache lookup: ~0.085µs — In-process DashMap replaced the TCP-based Cachee proxy, which had caused an 11x regression at 96 workers due to connection serialization.
- Dilithium attestation: ~244µs — One SHA3-256 digest plus one Dilithium sign-and-verify per batch, not per user. Batch attestation amortizes the cost 31x.
The total is ~1,356µs per 32-user batch, or ~42µs per individual authentication. At 96 workers on a c8g.metal-48xl Graviton4 instance, this sustains 2,172,518 authentications per second — fully post-quantum, fully homomorphically encrypted, with zero plaintext exposure.
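The arithmetic is worth checking back-of-envelope, using only the numbers quoted above: per-batch latency, batch size, and worker count give a zero-overhead ideal throughput, and the measured figure lands at roughly 96% of it, the gap being scheduler and memory-bandwidth contention.

```rust
fn main() {
    // Figures from the pipeline breakdown above.
    let batch_us = 1_356.0_f64;      // ~1,356 us per 32-user batch
    let users_per_batch = 32.0_f64;
    let workers = 96.0_f64;
    let measured = 2_172_518.0_f64;  // sustained auth/sec on c8g.metal-48xl

    let per_auth_us = batch_us / users_per_batch; // ~42.4 us per authentication
    let ideal = workers * users_per_batch / (batch_us * 1e-6); // ~2.27M/s

    println!(
        "per-auth: {:.1} us, ideal: {:.0}/s, parallel efficiency: {:.1}%",
        per_auth_us,
        ideal,
        100.0 * measured / ideal
    );
}
```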
The Techniques That Mattered
Across all three optimization efforts, certain techniques appeared repeatedly:
- Montgomery Multiplication — Eliminating division in modular arithmetic gave us 2-6x improvements in FHE operations. Twiddle factors stored in Montgomery form at keygen time mean the hot loop never divides.
- RNS Representation — Decomposing large moduli into smaller ones that fit in 64 bits eliminated arbitrary-precision arithmetic entirely.
- SIMD Vectorization — NEON on ARM, AVX-512 on x86. Processing multiple coefficients per instruction. ARM NEON excels at add/sub/compare but lacks native 64x64-to-128 multiply, so we kept scalar paths where they outperform.
- Batch Inversion — Montgomery's trick: N inversions become 3N multiplications + 1 inversion. Critical for STARK proving.
- Parallel Merkle Trees — Embarrassingly parallel construction using Rayon.
- Profile-Guided Optimization — Let the compiler optimize what we couldn't.
Optimizations That Failed
Not every idea survived production testing. Several optimizations that looked promising in microbenchmarks regressed under real 96-core contention:
- jemalloc on Graviton4 — 8% throughput regression. The glibc allocator on aarch64 is already heavily optimized for ARM's flat memory model. jemalloc's arena bookkeeping adds pure overhead when 96 workers run tight FHE loops.
- Fused NTT twiddle pre-computation — 15% win in isolated microbenchmarks, but +24µs per batch in production. The extra lookup table pollutes L1 cache under heavy worker contention. Net negative.
- TCP cache proxy at scale — A single Docker container running RESP serialized all 96 connections through one socket. Throughput dropped from 1.51M to 136K auth/sec — an 11x regression. The in-process DashMap at 0.085µs per lookup was the fix.
The lesson: always benchmark under production contention. L1 cache pressure and allocator behavior at 96 cores are invisible in single-threaded benchmarks.
What This Means
We have the fastest FHE library in the world.
331µs encryption. 26.7µs homomorphic multiply. 4.5-7.5x faster than Microsoft SEAL.
We have the fastest STARK prover in the world.
6.96ms prove time. 30-50% faster than Plonky3. No trusted setup required.
We have, as far as we know, the only quantum-resistant biometric auth system.
FHE + STARK + post-quantum signatures. End-to-end in ~17-24ms. To our knowledge, nobody else ships this combination.
The remaining gains are in hardware (custom ASICs) or algorithmic breakthroughs (new proof systems). At the software level, we have reached the practical limits of today's algorithms on commodity server hardware. Every modular multiply is Montgomery-reduced, every NTT is parallelized, every cache line is accounted for.
Ship it.
Try the World's Fastest Crypto Stack
FHE encryption in 331µs. STARK proofs in 6.96ms. Quantum-resistant authentication in 17-24ms.
Get API Key