Performance Overview · 5 min read

The Complete H33 Optimization Journey:
World's Fastest Crypto Stack

How we built the world's fastest FHE library and STARK prover. STARK prove 20ms→6.96ms, FHE encrypt 2.02ms→331us, FHE multiply 168us→24us.

Auth/sec: 2.17M
Per auth: ~42µs
CPU cores: 96
Platform: Graviton4

Over the past months, we've systematically optimized every layer of H33's cryptographic stack. The result: we now have the fastest FHE library and the fastest STARK prover in the world. This post brings together the complete picture — and you can verify every number on our benchmarks page.

What makes H33 different from academic benchmarks is that these numbers come from a production authentication pipeline, not isolated microbenchmarks. Every optimization had to survive 96-core contention on AWS Graviton4, real memory pressure, and the full chain of BFV FHE encryption, ZKP verification, and Dilithium attestation running simultaneously. The ~42µs per authentication is the amortized cost of the entire post-quantum pipeline (one 32-user batch divided by 32), not the latency of any single primitive in isolation.

OPTIMIZATION COMPLETE
STARK Prove: 20.0ms → 6.96ms (65% faster, #1 worldwide)
FHE Encrypt: 2.02ms → 331µs (84% faster, #1 worldwide)
FHE HE Mul: 168µs → 24µs (86% faster, #1 worldwide)
End-to-end: ~50ms → ~17-24ms (quantum-resistant auth)

FHE Encryption Journey

🔐 FHE Encryption: #1 WORLDWIDE
Basic Encryptor: 2.02ms (100% of baseline)
+ Montgomery: 351µs (17%)
+ Montgomery PARALLEL: 330µs (16%)

Improvement: 6.1x faster than baseline | vs SEAL: 4.5x faster

The FHE encrypt path was the first bottleneck we attacked. The baseline BFV encryptor spent the majority of its time in Number Theoretic Transforms (NTTs) — the polynomial-ring equivalent of FFTs that underpin every BFV operation. Each encryption required multiple forward and inverse NTTs across several RNS (Residue Number System) moduli, and every NTT butterfly involved a modular reduction using integer division.

Switching to Montgomery multiplication eliminated division entirely from the NTT hot path. Instead of computing a * b mod q via division, Montgomery form works with a precomputed inverse of q modulo a power of two plus bitwise shifts, so the reduction after each product costs two 64-bit multiplications, an addition, and a shift. We then stored all twiddle factors in Montgomery form at keygen time, so the forward and inverse transforms never touch a division instruction. Combined with Harvey lazy reduction, which lets intermediate butterfly values sit in [0, 2q) between stages rather than fully reducing after each step, this eliminated an enormous amount of redundant work.
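To make the division-free reduction concrete, here is a minimal sketch of 64-bit Montgomery multiplication. It is not H33's code: the `Mont` type, its method names, and the restriction to odd moduli below 2^63 are assumptions made for illustration, and lazy reduction is left out to keep it short.

```rust
/// Illustrative Montgomery context for an odd modulus q < 2^63, with R = 2^64.
/// All names here are hypothetical, not taken from the H33 library.
struct Mont {
    q: u64,
    q_neg_inv: u64, // -q^{-1} mod 2^64, precomputed once
    r2: u64,        // R^2 mod q, used to convert into Montgomery form
}

impl Mont {
    fn new(q: u64) -> Self {
        assert!(q % 2 == 1 && q < (1u64 << 63));
        // Newton iteration: every step doubles the number of correct low bits
        // of q^{-1} mod 2^64. For odd q, q itself is correct to 3 bits.
        let mut inv: u64 = q;
        for _ in 0..5 {
            inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
        }
        // One-time setup, so an ordinary division is acceptable here.
        let r = ((1u128 << 64) % q as u128) as u64;
        let r2 = ((r as u128 * r as u128) % q as u128) as u64;
        Mont { q, q_neg_inv: inv.wrapping_neg(), r2 }
    }

    /// REDC: returns t * R^{-1} mod q for t < q * 2^64.
    /// Two multiplications, one addition, one shift, no division.
    fn redc(&self, t: u128) -> u64 {
        let m = (t as u64).wrapping_mul(self.q_neg_inv);
        let u = ((t + m as u128 * self.q as u128) >> 64) as u64;
        if u >= self.q { u - self.q } else { u }
    }

    /// Converts a < q into Montgomery form (a * R mod q).
    fn to_mont(&self, a: u64) -> u64 {
        self.redc(a as u128 * self.r2 as u128)
    }

    /// Converts back from Montgomery form.
    fn from_mont(&self, a: u64) -> u64 {
        self.redc(a as u128)
    }

    /// Multiplies two values that are already in Montgomery form.
    fn mul(&self, a: u64, b: u64) -> u64 {
        self.redc(a as u128 * b as u128)
    }
}
```

In this sketch, operands are converted once with `to_mont`, multiplied any number of times with `mul`, and converted back with `from_mont`. Storing twiddle factors in Montgomery form ahead of time removes the conversion from the hot loop.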

Parallelizing the NTT across RNS moduli with Rayon delivered the final push from 351µs to 330µs. Each modulus channel is completely independent, making this embarrassingly parallel. On Graviton4 with its 96 Neoverse V2 cores, the parallel NTT keeps all channels saturated.
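The per-modulus independence can be sketched as follows. This is an illustration under stated assumptions, not the production code: it uses `std::thread::scope` in place of Rayon so that it has no external dependencies, a naive O(n^2) transform in place of a butterfly NTT, and a hypothetical `Channel` struct with small moduli (below 2^31) so the arithmetic fits in `u64`.

```rust
use std::thread;

/// One RNS channel: a modulus, an n-th root of unity for it, and the
/// polynomial's coefficients reduced modulo that modulus. (Hypothetical type.)
struct Channel {
    q: u64,
    w: u64,
    coeffs: Vec<u64>,
}

/// Naive O(n^2) number-theoretic transform. Assumes q < 2^31 so that
/// every intermediate product fits in a u64.
fn ntt_naive(a: &[u64], q: u64, w: u64) -> Vec<u64> {
    let n = a.len();
    let mut out = vec![0u64; n];
    let mut wk = 1u64; // w^k
    for k in 0..n {
        let mut acc = 0u64;
        let mut p = 1u64; // w^(k*j)
        for j in 0..n {
            acc = (acc + a[j] % q * p) % q;
            p = p * wk % q;
        }
        out[k] = acc;
        wk = wk * w % q;
    }
    out
}

/// Transforms every channel on its own thread. Channels share no data,
/// so no locking is needed; results come back in channel order.
fn ntt_all_channels(channels: &[Channel]) -> Vec<Vec<u64>> {
    thread::scope(|s| {
        let handles: Vec<_> = channels
            .iter()
            .map(|c| s.spawn(move || ntt_naive(&c.coeffs, c.q, c.w)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Spawning one OS thread per channel is only reasonable for a sketch. A work-stealing pool such as Rayon, which the post names, avoids the per-call thread creation cost.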

STARK Proving Journey

🔮 STARK Proving: #1 WORLDWIDE
Baseline: 20.0ms (100% of baseline)
+ NEON NTT: 19.5ms (97%)
+ Parallel Merkle: 14.0ms (70%)
+ Batch Inversion: 7.1ms (36%)
+ PGO: 6.96ms (35%)

Improvement: 2.8x faster than baseline | vs Plonky3: 30-50% faster

STARK proving is dominated by two operations: Merkle tree construction (hashing) and polynomial evaluation (field arithmetic). The baseline prover built Merkle trees sequentially — a single thread hashing leaves, then internal nodes. Switching to parallel Merkle construction via Rayon cut 6ms in one shot, since each subtree is independent until the final merge at the root.
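The subtree independence described above can be sketched like this. Several things are assumptions for illustration: `std::thread::scope` stands in for Rayon, `DefaultHasher` stands in for a real cryptographic hash (it is not collision resistant and must not be used for actual commitments), and the leaf count is assumed to be a non-zero power of two.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

/// Stand-in leaf hash. NOT cryptographic; a real prover would use a
/// collision-resistant hash here.
fn hash_leaf(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    0u8.hash(&mut h); // domain separation: leaf
    data.hash(&mut h);
    h.finish()
}

/// Stand-in hash for an internal node.
fn hash_pair(left: u64, right: u64) -> u64 {
    let mut h = DefaultHasher::new();
    1u8.hash(&mut h); // domain separation: internal node
    left.hash(&mut h);
    right.hash(&mut h);
    h.finish()
}

/// Sequential root over already-hashed leaves (length a power of two).
fn root_seq(leaves: &[u64]) -> u64 {
    if leaves.len() == 1 {
        return leaves[0];
    }
    let (l, r) = leaves.split_at(leaves.len() / 2);
    hash_pair(root_seq(l), root_seq(r))
}

/// Parallel root: the two halves are independent until the final merge,
/// so the left half runs on a new thread while this thread does the right.
/// `depth` limits how many levels fork, i.e. at most 2^depth threads.
fn root_par(leaves: &[u64], depth: u32) -> u64 {
    if depth == 0 || leaves.len() == 1 {
        return root_seq(leaves);
    }
    let (l, r) = leaves.split_at(leaves.len() / 2);
    let (hl, hr) = thread::scope(|s| {
        let left = s.spawn(|| root_par(l, depth - 1));
        let hr = root_par(r, depth - 1);
        (left.join().unwrap(), hr)
    });
    hash_pair(hl, hr)
}
```

Both functions produce the same root for the same leaves. Only the schedule differs, which is why the change is safe to make without touching the verifier.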

The breakthrough came from batch inversion. STARK FRI (Fast Reed-Solomon Interactive Oracle Proof of Proximity) queries require computing many field inversions during the quotient phase. A naive approach computes each inversion independently, costing one extended-GCD per element. Montgomery's batch inversion trick replaces N independent inversions with 3N field multiplications and a single inversion, a 50x speedup for N=1024. This alone dropped the prover from 14ms to 7.1ms. Profile-guided optimization (PGO) then tuned branch prediction and instruction cache layout for the final 2% push to 6.96ms.
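A minimal sketch of Montgomery's batch inversion trick follows. The field is an assumption chosen for illustration (the Mersenne prime 2^61 - 1, not necessarily the prover's field), the single inversion uses Fermat's little theorem rather than extended GCD, and all inputs are assumed to be non-zero.

```rust
/// Illustrative prime field modulus: 2^61 - 1. The prover's real field may differ.
const P: u64 = (1 << 61) - 1;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Single inversion via Fermat's little theorem: a^(P-2) = a^{-1} mod P.
fn inv(a: u64) -> u64 {
    pow(a, P - 2)
}

/// Inverts every element of `xs` with one field inversion and about
/// 3N multiplications. All elements must be non-zero modulo P.
fn batch_inverse(xs: &[u64]) -> Vec<u64> {
    // Forward pass: prefix[i] = x_0 * x_1 * ... * x_{i-1}.
    let mut prefix = Vec::with_capacity(xs.len());
    let mut acc = 1u64;
    for &x in xs {
        prefix.push(acc);
        acc = mul(acc, x);
    }
    // The only inversion: (x_0 * ... * x_{N-1})^{-1}.
    let mut inv_acc = inv(acc);
    // Backward pass: peel off one factor at a time.
    let mut out = vec![0u64; xs.len()];
    for i in (0..xs.len()).rev() {
        out[i] = mul(inv_acc, prefix[i]); // equals x_i^{-1}
        inv_acc = mul(inv_acc, xs[i]);    // now (x_0 * ... * x_{i-1})^{-1}
    }
    out
}
```

The forward pass costs N multiplications and the backward pass 2N, which is where the 3N figure comes from. A single zero input would make the shared product zero, so callers need to filter or special-case zeros.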

FHE Homomorphic Multiply Journey

FHE Homomorphic Multiply: #1 WORLDWIDE
Original: 168µs (100% of baseline)
+ RNS + NEON: 26.7µs (16%)

Improvement: 6.3x faster | vs SEAL: 7.5x faster

The homomorphic multiply is the most expensive primitive in any BFV pipeline. Our original implementation used arbitrary-precision integer arithmetic for the tensor product — multiplying two ciphertext polynomials with coefficients that can reach hundreds of bits. Switching to RNS (Residue Number System) representation decomposed each large modulus into several 64-bit channels, eliminating big-integer math entirely. Each channel's polynomial multiply becomes a pointwise NTT-domain operation: forward NTT, element-wise multiply, done. No inverse NTT is needed until the final accumulation step, saving two full transforms per modulus channel.
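The idea behind the RNS switch can be shown on plain integers: a value too large for one machine word is stored as its residues modulo several small coprime moduli, multiplied channel by channel, and only reassembled at the end. The sketch below is an illustration, not the library's representation. The three moduli are arbitrary well-known primes, reconstruction uses Garner's algorithm, and it works on single integers rather than on NTT-domain polynomials.

```rust
/// Three pairwise-coprime primes below 2^30 (illustrative choice).
/// Their product is about 10^27, so any result below that is represented exactly.
const MODULI: [u64; 3] = [1_000_000_007, 998_244_353, 1_000_000_009];

/// Splits a large integer into one small residue per channel.
fn to_rns(x: u128) -> [u64; 3] {
    MODULI.map(|m| (x % m as u128) as u64)
}

/// Channel-wise multiply: each channel is independent and fits in a u64.
fn rns_mul(a: [u64; 3], b: [u64; 3]) -> [u64; 3] {
    let mut out = [0u64; 3];
    for i in 0..3 {
        out[i] = a[i] * b[i] % MODULI[i];
    }
    out
}

fn pow_mod(mut base: u64, mut exp: u64, m: u64) -> u64 {
    let mut acc = 1u64;
    base %= m;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % m;
        }
        base = base * base % m;
        exp >>= 1;
    }
    acc
}

/// Garner reconstruction: x = v0 + m0 * (v1 + m1 * v2), with x < m0 * m1 * m2.
fn from_rns(r: [u64; 3]) -> u128 {
    let [m0, m1, m2] = MODULI;
    let v0 = r[0];
    // Choose v1 so that v0 + m0 * v1 matches r[1] modulo m1.
    let inv_m0 = pow_mod(m0 % m1, m1 - 2, m1);
    let v1 = (r[1] + m1 - v0 % m1) % m1 * inv_m0 % m1;
    // Choose v2 so that the full sum matches r[2] modulo m2.
    let partial = (v0 % m2 + v1 % m2 * (m0 % m2)) % m2;
    let inv_m0m1 = pow_mod((m0 % m2) * (m1 % m2) % m2, m2 - 2, m2);
    let v2 = (r[2] + m2 - partial) % m2 * inv_m0m1 % m2;
    v0 as u128 + m0 as u128 * (v1 as u128 + m1 as u128 * v2 as u128)
}
```

The reconstruction step is the expensive part, which is why it pays to stay in residue form across as many operations as possible and convert back only once.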

ARM NEON intrinsics accelerated the remaining element-wise operations — branchless Galois permutations and vectorized key-switching during relinearization. The net result: 168µs down to 26.7µs, a 6.3x improvement that leaves Microsoft SEAL 7.5x behind.

Production Pipeline: 2.17M Auth/sec

Fast primitives are necessary but not sufficient. The production authentication pipeline chains three stages into a single API call: BFV FHE batch verification, ZKP lookup, and Dilithium attestation. Each stage had to be optimized not just in isolation but under the contention of 96 parallel workers.

Production Pipeline Breakdown (per 32-user batch)

FHE batch inner product: ~1,109µs — BFV encrypted biometric match across 32 SIMD-packed users in a single ciphertext (N=4096, 128 dims per user).

ZKP cache lookup: ~0.085µs — In-process DashMap replaced the TCP-based Cachee proxy, which had caused an 11x regression at 96 workers due to connection serialization.

Dilithium attestation: ~244µs. One SHA3-256 digest plus one Dilithium sign-and-verify per batch, not per user. Batch attestation amortizes that cost across all 32 users.

The total is ~1,356µs per 32-user batch, or ~42µs per individual authentication. At 96 workers on a c8g.metal-48xl Graviton4 instance, this sustains 2,172,518 authentications per second — fully post-quantum, fully homomorphically encrypted, with zero plaintext exposure.
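The arithmetic behind these figures can be checked with a few lines. The sketch below uses only numbers quoted in this section. The function names are hypothetical, and the "ideal" throughput assumes 96 perfectly independent workers, which is a simplification rather than a description of how the benchmark was run.

```rust
/// Users packed into one batch, as stated in the pipeline breakdown.
const BATCH_SIZE: f64 = 32.0;

/// Sum of the three per-batch stage timings quoted above, in microseconds.
fn stage_sum_us() -> f64 {
    let fhe_inner_product = 1109.0;
    let zkp_cache_lookup = 0.085;
    let dilithium_attestation = 244.0;
    fhe_inner_product + zkp_cache_lookup + dilithium_attestation
}

/// Amortized cost of one authentication given a per-batch time.
fn per_auth_us(batch_us: f64) -> f64 {
    batch_us / BATCH_SIZE
}

/// Authentications per second if every worker completes batches
/// back to back with no contention (an idealized upper bound).
fn ideal_throughput(workers: f64, batch_us: f64) -> f64 {
    workers * BATCH_SIZE / (batch_us * 1e-6)
}
```

With the quoted ~1,356µs batch time, `per_auth_us` gives about 42.4µs and `ideal_throughput(96.0, 1356.0)` gives roughly 2.27M/s, so the measured 2,172,518/s sits within a few percent of that idealized bound. The three stage timings sum to about 1,353µs, slightly under the quoted total, which is consistent with the figures being rounded.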

The Techniques That Mattered

Across all three optimization efforts, certain techniques appeared repeatedly:

  1. Montgomery Multiplication — Eliminating division in modular arithmetic gave us 2-6x improvements in FHE operations. Twiddle factors stored in Montgomery form at keygen time mean the hot loop never divides.
  2. RNS Representation — Decomposing large moduli into smaller ones that fit in 64 bits eliminated arbitrary-precision arithmetic entirely.
  3. SIMD Vectorization — NEON on ARM, AVX-512 on x86. Processing multiple coefficients per instruction. ARM NEON excels at add/sub/compare but lacks native 64x64-to-128 multiply, so we kept scalar paths where they outperform.
  4. Batch Inversion — Montgomery's trick: N inversions become 3N multiplications + 1 inversion. Critical for STARK proving.
  5. Parallel Merkle Trees — Embarrassingly parallel construction using Rayon.
  6. Profile-Guided Optimization — Let the compiler optimize what we couldn't.

Optimizations That Failed

Not every idea survived production testing. Several optimizations that looked promising in microbenchmarks regressed under real 96-core contention:

The lesson: always benchmark under production contention. L1 cache pressure and allocator behavior at 96 cores are invisible in single-threaded benchmarks.

What This Means

We have the fastest FHE library in the world.

331us encryption. 24us homomorphic multiply. 4.5-7.5x faster than Microsoft SEAL.

We have the fastest STARK prover in the world.

6.96ms prove time. 30-50% faster than Plonky3. No trusted setup required.

We have the only quantum-resistant biometric auth system.

FHE + STARK + post-quantum signatures. End-to-end in ~17-24ms. Nobody else has this.

The remaining gains are in hardware (custom ASICs) or algorithmic breakthroughs (new proof systems). At the software level, we believe we have hit the limits of what today's algorithms allow on commodity server hardware. Every modular multiply is Montgomery-reduced, every NTT is parallelized, every cache line is accounted for.

Ship it.
