STARKs are quantum-resistant, require no trusted setup, and are mathematically beautiful. They're also computationally expensive. When we started, our STARK prover took 20 milliseconds. After four major optimizations, we're at 6.96 milliseconds (see our production benchmarks for the full pipeline numbers) — making H33 the fastest STARK prover in the world.
This matters because the STARK prover sits directly in the critical path of every authentication request. In H33's production pipeline, each API call passes through BFV fully homomorphic encryption (batching 32 users per ciphertext at N=4096), then a ZKP verification via STARK lookup, then a Dilithium attestation signature. Every microsecond saved in the prover is a microsecond saved per auth — and at 1.595 million authentications per second, those microseconds compound into hours of saved compute per day.
Why STARK Proving Speed Is the Bottleneck
A STARK proof involves three compute-heavy phases: polynomial interpolation via NTT, Merkle tree commitment over evaluations, and the FRI (Fast Reed-Solomon Interactive Oracle Proof) folding rounds that require field inversions. In our initial profiling, these three phases consumed 12%, 28%, and 55% of total proving time respectively, with the remaining 5% in serialization and hashing overhead.
The STARK prover's 6.96ms latency is amortized across a 32-user batch. Since H33 uses BFV SIMD batching (4096 slots / 128 biometric dimensions = 32 users per ciphertext), the effective STARK cost per authentication is only ~0.22ms. And because runtime requests are served from cached proofs, full proving stays off the ~42µs per-auth critical path of the 96-worker Graviton4 deployment.
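The batching arithmetic can be made explicit. A minimal sanity-check sketch, using only figures from the text (variable names are illustrative):

```rust
// Amortization of the STARK prover cost across one BFV SIMD batch.
// All figures come from the surrounding text; this just spells out the arithmetic.
fn main() {
    let slots = 4096.0; // BFV ring dimension N = 4096 SIMD slots
    let dims_per_user = 128.0; // biometric template dimensions per user
    let users_per_ct = slots / dims_per_user;
    assert_eq!(users_per_ct, 32.0); // 32 users packed per ciphertext

    let prover_ms = 6.96; // optimized STARK proving time per batch
    let per_auth_ms = prover_ms / users_per_ct;
    println!("effective STARK cost per auth: {:.3} ms", per_auth_ms); // ~0.218 ms
}
```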
The Complete Journey
Optimization 1: NEON NTT (20ms → 19.5ms)
The Number Theoretic Transform is the heart of STARK proving. It's used to convert polynomials between coefficient and evaluation form, enabling fast polynomial multiplication. In H33's BFV pipeline, we already use Montgomery-form NTT with radix-4 butterflies and Harvey lazy reduction for the FHE layer. The STARK prover needed the same treatment.
Our initial NTT was correct but scalar. We rewrote it using ARM NEON intrinsics for vectorized butterfly operations:
```rust
// Before: scalar butterfly
let t = a[j + half] * twiddle % p;
a[j + half] = (a[j] + p - t) % p; // add p before subtracting to avoid u64 underflow
a[j] = (a[j] + t) % p;
```

```rust
// After: NEON vectorized (two 64-bit lanes per 128-bit register)
// montgomery_mul_neon multiplies by the twiddles and reduces; the 64x64->128
// multiply inside it falls back to scalar, as explained below.
let t = montgomery_mul_neon(vld1q_u64(&a[j + half]), twiddles);
let lo = vld1q_u64(&a[j]);
let sum = vaddq_u64(lo, t);
let diff = vsubq_u64(vaddq_u64(lo, p_vec), t);
vst1q_u64(&mut a[j], sum);
vst1q_u64(&mut a[j + half], diff);
```

One subtlety: ARM NEON lacks a native 64x64→128 multiply (`vmull_u64` does not exist), so the Montgomery reduction step extracts to scalar, performs the u128 multiply, and repacks. This limits the NEON speedup for the NTT butterfly itself. The vectorization wins come primarily from the add/sub/compare operations that surround the multiply, and from improved memory access patterns when loading contiguous coefficients two lanes at a time.
Result: 2.5% improvement. Small, but this was just the warm-up.
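For reference, the transform the NEON code accelerates can be written as a minimal scalar NTT over a toy prime. This is a sketch only: p = 17 and n = 4 replace the production prime, Montgomery form is omitted, and the helper names are illustrative.

```rust
const P: u64 = 17; // toy NTT-friendly prime; production uses a much larger prime

// Modular exponentiation via square-and-multiply (used for inverse roots).
fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut r = 1u64;
    b %= P;
    while e > 0 {
        if e & 1 == 1 { r = r * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    r
}

// Recursive Cooley-Tukey NTT: the same butterfly as the scalar snippet above,
// minus Montgomery form. omega must be a primitive n-th root of unity mod P.
fn ntt(a: &[u64], omega: u64) -> Vec<u64> {
    let n = a.len();
    if n == 1 { return a.to_vec(); }
    let even: Vec<u64> = a.iter().step_by(2).copied().collect();
    let odd: Vec<u64> = a.iter().skip(1).step_by(2).copied().collect();
    let e = ntt(&even, omega * omega % P);
    let o = ntt(&odd, omega * omega % P);
    let mut out = vec![0u64; n];
    let mut w = 1u64;
    for j in 0..n / 2 {
        let t = w * o[j] % P;
        out[j] = (e[j] + t) % P; // butterfly sum
        out[j + n / 2] = (e[j] + P - t) % P; // butterfly difference
        w = w * omega % P;
    }
    out
}

fn main() {
    let a = vec![1u64, 2, 3, 4];
    let omega = 4u64; // 4 has order 4 mod 17: a primitive 4th root of unity
    let freq = ntt(&a, omega);
    // Inverse transform: NTT with omega^-1, then scale by n^-1 mod P.
    let omega_inv = pow_mod(omega, P - 2);
    let n_inv = pow_mod(4, P - 2);
    let back: Vec<u64> = ntt(&freq, omega_inv).iter().map(|x| x * n_inv % P).collect();
    assert_eq!(back, a); // round trip recovers the coefficients
}
```

The round trip (forward NTT, inverse NTT, scale by n⁻¹) is the property the prover relies on when converting polynomials between coefficient and evaluation form.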
Optimization 2: Parallel Merkle Trees (19.5ms → 14.0ms)
STARKs commit to polynomial evaluations using Merkle trees. Our original implementation built trees sequentially, hashing one node at a time. The fix was embarrassingly parallel construction:
Parallel Merkle Strategy
- Hash all leaf nodes in parallel using Rayon
- Process each tree level in parallel
- Use BLAKE3 with SIMD acceleration
- Pre-allocate tree structure to avoid allocation during construction
The key detail is pre-allocation. In the sequential version, each level of the tree triggered a new Vec allocation. By computing the total tree size upfront (2N - 1 nodes for N leaves) and allocating a single contiguous buffer, we eliminated thousands of allocator calls per proof. On Graviton4 with 96 workers all proving concurrently, allocator contention was a hidden tax — the system allocator on aarch64 is fast, but not when 96 threads are fighting for arena locks simultaneously.
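The single-buffer layout can be sketched as follows. This is a simplified, sequential illustration: `DefaultHasher` stands in for BLAKE3, the heap indexing scheme is one reasonable choice rather than H33's actual layout, and production would hash each level with Rayon's `par_iter`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Pre-allocated Merkle build: one contiguous buffer of 2N - 1 nodes, with
// leaves at the back and the root at index 0 (binary-heap layout). No
// per-level Vec allocations, so no allocator traffic during construction.
fn merkle_build(leaves: &[u64]) -> (Vec<u64>, u64) {
    let n = leaves.len(); // assumed to be a power of two
    let mut tree = vec![0u64; 2 * n - 1]; // single upfront allocation
    for (i, leaf) in leaves.iter().enumerate() {
        let mut h = DefaultHasher::new(); // BLAKE3 stand-in
        h.write_u64(*leaf);
        tree[n - 1 + i] = h.finish();
    }
    // Internal nodes, bottom-up; each level is independent and parallelizable.
    for i in (0..n - 1).rev() {
        let mut h = DefaultHasher::new();
        h.write_u64(tree[2 * i + 1]); // left child
        h.write_u64(tree[2 * i + 2]); // right child
        tree[i] = h.finish();
    }
    let root = tree[0];
    (tree, root)
}

fn main() {
    let leaves: Vec<u64> = (0..8).collect();
    let (tree, root) = merkle_build(&leaves);
    assert_eq!(tree.len(), 2 * leaves.len() - 1); // 15 nodes for 8 leaves
    let (_, root2) = merkle_build(&leaves);
    assert_eq!(root, root2); // construction is deterministic
}
```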
Result: 28% improvement (5.5ms saved). Merkle tree construction is now memory-bandwidth bound rather than compute bound.
Optimization 3: Batch Inversion (14.0ms → 7.1ms)
This was the breakthrough. STARK proving requires computing many field inversions for the FRI protocol. Field inversion is expensive — roughly 100x more expensive than multiplication.
The insight: Montgomery's batch inversion trick lets you compute N inversions with only 3(N - 1) multiplications + 1 inversion:
```rust
// Batch invert [a, b, c, d]
// Instead of 4 inversions, use 1 inversion + 9 multiplications
let ab = a * b;
let abc = ab * c;
let abcd = abc * d;
let mut inv = abcd.invert(); // single expensive inversion: (abcd)^-1
let d_inv = abc * inv;       // abc / abcd = 1/d
inv = inv * d;               // running inverse is now (abc)^-1
let c_inv = ab * inv;        // ab / abc = 1/c
inv = inv * c;               // now (ab)^-1
let b_inv = a * inv;         // a / ab = 1/b
let a_inv = inv * b;         // b / ab = 1/a
```

We applied this to every inversion hotspot in the prover. The FRI layer had thousands of inversions that collapsed into a handful. In each FRI folding round, the prover evaluates a polynomial at a random challenge point, which requires dividing by (x - z) for each evaluation domain element. With a domain size of 8192, that is 8192 inversions per round, across multiple rounds. Batch inversion collapses each round's inversions into a single modular exponentiation plus linear-time multiplications.
Batch inversion is not just a constant-factor optimization. It changes the complexity class of the inversion step from O(N * log(p)) to O(N + log(p)), where p is the field prime and N is the number of elements. For our 128-bit security field, log(p) is roughly 250 squarings — eliminating 8,191 of those per FRI round is transformative.
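Montgomery's trick generalizes cleanly to N elements. A minimal sketch over a toy prime field, with inversion via Fermat's little theorem (the small modulus and helper names are illustrative; the production prover works over a much larger STARK-friendly field):

```rust
const P: u64 = 17; // toy modulus; the real prover uses a large prime field

// Modular exponentiation; a^(P-2) mod P is a^-1 by Fermat's little theorem.
fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut r = 1u64;
    b %= P;
    while e > 0 {
        if e & 1 == 1 { r = r * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    r
}

// Montgomery batch inversion: 3(N-1) multiplications + 1 inversion.
fn batch_invert(xs: &[u64]) -> Vec<u64> {
    let n = xs.len();
    // Forward pass: prefix[i] = x_0 * x_1 * ... * x_i
    let mut prefix = vec![1u64; n];
    prefix[0] = xs[0];
    for i in 1..n {
        prefix[i] = prefix[i - 1] * xs[i] % P;
    }
    // One exponentiation inverts the full product.
    let mut inv = pow_mod(prefix[n - 1], P - 2);
    // Backward pass: peel off one factor at a time.
    let mut out = vec![0u64; n];
    for i in (1..n).rev() {
        out[i] = prefix[i - 1] * inv % P; // (x_0..x_{i-1}) / (x_0..x_i) = 1/x_i
        inv = inv * xs[i] % P;            // drop x_i from the running inverse
    }
    out[0] = inv;
    out
}

fn main() {
    let inv = batch_invert(&[2, 3, 4]);
    assert_eq!(inv, vec![9, 6, 13]); // 2*9 = 3*6 = 4*13 = 1 (mod 17)
}
```

The FRI use case is exactly this function applied to the 8192 values (x - z) over the evaluation domain: one `pow_mod` call per round instead of 8192.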
Result: 49% improvement (6.9ms saved). This single optimization nearly halved our proving time.
Optimization 4: Profile-Guided Optimization (7.1ms → 6.96ms)
With algorithmic optimizations exhausted, we turned to the compiler. Profile-Guided Optimization (PGO) works by:
- Compiling with instrumentation (`-Cprofile-generate`)
- Running representative workloads to collect profiles
- Recompiling with profile data to optimize hot paths (`-Cprofile-use`)
We ran 10,000 proof generations to build accurate profiles, then recompiled with -Cprofile-use. The compiler optimized branch prediction, inlined hot functions, and improved cache locality. On Graviton4 specifically, PGO's branch-prediction hints are valuable because the Neoverse V2 core has a relatively shallow branch predictor compared to x86 — mispredictions cost 11-13 cycles, and the FRI folding loop has data-dependent branches that the hardware cannot easily predict without profile guidance.
Result: 2% improvement. Small but free — we were at the limit of what's algorithmically possible.
How the STARK Prover Fits the Production Pipeline
In H33's full-stack authentication pipeline, the STARK prover does not operate in isolation. Each authentication request flows through three stages, all within a single API call:
| Stage | Component | Latency | Post-Quantum |
|---|---|---|---|
| 1. FHE Batch | BFV inner product (32 users/CT) | ~1,109 µs | Yes (lattice) |
| 2. ZKP Lookup | In-process DashMap cache | 0.085 µs | Yes (SHA3-256) |
| 3. Attestation | Dilithium sign + verify | ~244 µs | Yes (ML-DSA) |
| Total | 32-user batch | ~1,356 µs | Fully PQ |
The optimized STARK prover's 6.96ms latency is used during initial proof generation at enrollment time and for periodic re-verification. The ZKP lookup stage at runtime uses cached STARK proofs via an in-process DashMap, reducing per-request ZKP cost to 0.085 microseconds. This caching strategy is only viable because the prover is fast enough to regenerate proofs on cache miss without violating the ~42µs per-auth latency budget. Before optimization, a 20ms prover would have required pre-computation of every possible proof — an impractical combinatorial explosion.
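The cache-or-regenerate pattern can be sketched as below. This is a simplified stand-in: a `Mutex<HashMap>` replaces the sharded, lock-free DashMap used in production, and `prove` is a placeholder for the 6.96ms STARK prover invoked only on miss.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Cached-proof lookup path: hit is a map lookup, miss regenerates the proof.
struct ProofCache {
    proofs: Mutex<HashMap<u64, Vec<u8>>>, // key: statement hash, value: proof bytes
}

impl ProofCache {
    fn new() -> Self {
        Self { proofs: Mutex::new(HashMap::new()) }
    }

    fn get_or_prove(&self, statement: u64) -> Vec<u8> {
        let mut map = self.proofs.lock().unwrap();
        map.entry(statement)
            .or_insert_with(|| prove(statement)) // slow path: ~7ms regeneration
            .clone()                             // fast path: sub-microsecond lookup
    }
}

// Placeholder prover: production would run the optimized STARK prover here.
fn prove(statement: u64) -> Vec<u8> {
    statement.to_le_bytes().to_vec()
}

fn main() {
    let cache = ProofCache::new();
    let p1 = cache.get_or_prove(42); // first request: proves and caches
    let p2 = cache.get_or_prove(42); // second request: served from cache
    assert_eq!(p1, p2);
}
```

The design only works because a miss is cheap enough to absorb: with a 20ms prover, a miss on the hot path would have blown the latency budget many times over.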
Final Results: World's Fastest STARK Prover
Final Performance
20ms → 6.96ms (65% faster, 2.8x improvement)
vs Plonky3: 30-50% faster on equivalent circuits
vs Target (10-11ms): 37% better than goal
Status: #1 worldwide for STARK proving speed
| Optimization | Time Saved | % Improvement (vs. prior baseline) |
|---|---|---|
| NEON NTT | 0.5ms | 2.5% |
| Parallel Merkle | 5.5ms | 28% |
| Batch Inversion | 6.9ms | 49% |
| PGO | 0.14ms | 2% |
| Total | 13.04ms | 65% |
Key Takeaways
- Batch inversion is transformative — If you're doing FRI-based proofs without batch inversion, you're leaving 2x performance on the table.
- Merkle trees parallelize trivially — Don't let tree construction be a bottleneck. Pre-allocate the buffer and let Rayon do the rest.
- SIMD helps, but algorithms matter more — Our biggest gains came from algorithmic improvements, not micro-optimizations. The NEON NTT gave 2.5%; batch inversion gave 49%.
- Measure everything — We profiled every function. The hotspots were not where we expected. FRI inversions, not NTT butterflies, were the dominant cost.
- Optimize for the full pipeline — A fast prover enables architectural decisions (like proof caching) that multiply the savings far beyond the prover itself.
Combined with our FHE optimizations, H33 now offers the world's fastest quantum-resistant authentication stack. The entire pipeline — BFV encryption, STARK verification, and Dilithium attestation — completes in ~42 microseconds per authentication, sustaining 1.595 million auth/sec on a single Graviton4 instance. Next up: how we achieved 86% faster homomorphic multiplication.
Try the World's Fastest STARK Prover
6.96ms proving time. Quantum-resistant. No trusted setup required.
Get API Key