BenchmarksStack Ranking
APIsPricingDocsWhite PaperTokenBlogAboutSecurity Demo
Log InGet API Key
Engineering Deep Dive · 5 min read

STARK Proving Optimization:
20ms to 6.96ms (65% Faster)

How we built the world's fastest STARK prover through NEON NTT, parallel Merkle trees, batch inversion, and PGO. 20ms to 6.96ms.

~42µs
Auth Latency
2.17M/s
Throughput
128-bit
Security
Zero
Plaintext

STARKs are quantum-resistant, require no trusted setup, and are mathematically beautiful. They're also computationally expensive. When we started, our STARK prover took 20 milliseconds. After four major optimizations, we're at 6.96 milliseconds (see our production benchmarks for the full pipeline numbers) — making H33 the fastest STARK prover in the world.

This matters because the STARK prover sits directly in the critical path of every authentication request. In H33's production pipeline, each API call passes through BFV fully homomorphic encryption (batching 32 users per ciphertext at N=4096), then a ZKP verification via STARK lookup, then a Dilithium attestation signature. Every microsecond saved in the prover is a microsecond saved per auth — and at 1.595 million authentications per second, those microseconds compound into hours of saved compute per day.

65%
Faster
2.8x
vs Baseline
30-50%
vs Plonky3
#1
Worldwide

Why STARK Proving Speed Is the Bottleneck

A STARK proof involves three compute-heavy phases: polynomial interpolation via NTT, Merkle tree commitment over evaluations, and the FRI (Fast Reed-Solomon Interactive Oracle Proof) folding rounds that require field inversions. In our initial profiling, these three phases consumed 12%, 28%, and 55% of total proving time respectively, with the remaining 5% in serialization and hashing overhead.

Key Insight

The STARK prover's 6.96ms latency is amortized across a 32-user batch. Since H33 uses BFV SIMD batching (4096 slots / 128 biometric dimensions = 32 users per ciphertext), the effective STARK cost per authentication is only ~0.22ms — a fraction of the total ~42µs per-auth pipeline latency when running 96 parallel workers on Graviton4.

The Complete Journey

H33 STARK Optimization Journey
20.0ms
Start
19.5ms
+ NEON NTT
14.0ms
+ Parallel Merkle
7.1ms
+ Batch Inversion
6.96ms
+ PGO

Optimization 1: NEON NTT (20ms → 19.5ms)

The Number Theoretic Transform is the heart of STARK proving. It's used to convert polynomials between coefficient and evaluation form, enabling fast polynomial multiplication. In H33's BFV pipeline, we already use Montgomery-form NTT with radix-4 butterflies and Harvey lazy reduction for the FHE layer. The STARK prover needed the same treatment.

Our initial NTT was correct but scalar. We rewrote it using ARM NEON intrinsics for vectorized butterfly operations:

// Before: Scalar butterfly
let t = a[j + half] * twiddle % p;
a[j + half] = (a[j] - t + p) % p;
a[j] = (a[j] + t) % p;

// After: NEON vectorized (4 butterflies at once)
let t = vmulq_u64(vld1q_u64(&a[j + half]), twiddles);
t = montgomery_reduce_neon(t);
let sum = vaddq_u64(vld1q_u64(&a[j]), t);
let diff = vsubq_u64(vaddq_u64(vld1q_u64(&a[j]), p_vec), t);
vst1q_u64(&mut a[j], sum);
vst1q_u64(&mut a[j + half], diff);

One subtlety: ARM NEON lacks native 64x64→128 multiply (vmull_u64 does not exist), so the Montgomery reduction step extracts to scalar, performs the u128 multiply, and repacks. This limits the NEON speedup for the NTT butterfly itself. The vectorization wins come primarily from the add/sub/compare operations that surround the multiply, and from improved memory access patterns when loading four contiguous coefficients at once.

Result: 2.5% improvement. Small, but this was just the warm-up.

Optimization 2: Parallel Merkle Trees (19.5ms → 14.0ms)

STARKs commit to polynomial evaluations using Merkle trees. Our original implementation built trees sequentially, hashing one node at a time. The fix was embarrassingly parallel construction:

Parallel Merkle Strategy

The key detail is pre-allocation. In the sequential version, each level of the tree triggered a new Vec allocation. By computing the total tree size upfront (2N - 1 nodes for N leaves) and allocating a single contiguous buffer, we eliminated thousands of allocator calls per proof. On Graviton4 with 96 workers all proving concurrently, allocator contention was a hidden tax — the system allocator on aarch64 is fast, but not when 96 threads are fighting for arena locks simultaneously.

Result: 28% improvement (5.5ms saved). Merkle tree construction is now I/O bound rather than compute bound.

Optimization 3: Batch Inversion (14.0ms → 7.1ms)

This was the breakthrough. STARK proving requires computing many field inversions for the FRI protocol. Field inversion is expensive — roughly 100x more expensive than multiplication.

The insight: Montgomery's batch inversion trick lets you compute N inversions with only 3N multiplications + 1 inversion:

// Batch invert [a, b, c, d]
// Instead of 4 inversions, use 1 inversion + 9 multiplications

let ab = a * b;
let abc = ab * c;
let abcd = abc * d;

let abcd_inv = abcd.invert();  // Single expensive inversion

let d_inv = abc * abcd_inv;
let c_inv = ab * (abcd_inv * d);
let b_inv = a * (abcd_inv * c * d);
let a_inv = abcd_inv * b * c * d;

We applied this to every inversion hotspot in the prover. The FRI layer had thousands of inversions that collapsed into a handful. In each FRI folding round, the prover evaluates a polynomial at a random challenge point, which requires dividing by (x - z) for each evaluation domain element. With a domain size of 8192, that is 8192 inversions per round, across multiple rounds. Batch inversion collapses each round's inversions into a single modular exponentiation plus linear-time multiplications.

Key Insight

Batch inversion is not just a constant-factor optimization. It changes the complexity class of the inversion step from O(N * log(p)) to O(N + log(p)), where p is the field prime and N is the number of elements. For our 128-bit security field, log(p) is roughly 250 squarings — eliminating 8,191 of those per FRI round is transformative.

Result: 49% improvement (6.9ms saved). This single optimization nearly halved our proving time.

Optimization 4: Profile-Guided Optimization (7.1ms → 6.96ms)

With algorithmic optimizations exhausted, we turned to the compiler. Profile-Guided Optimization (PGO) works by:

  1. Compiling with instrumentation (-Cprofile-generate)
  2. Running representative workloads to collect profiles
  3. Recompiling with profile data to optimize hot paths (-Cprofile-use)

We ran 10,000 proof generations to build accurate profiles, then recompiled with -Cprofile-use. The compiler optimized branch prediction, inlined hot functions, and improved cache locality. On Graviton4 specifically, PGO's branch-prediction hints are valuable because the Neoverse V2 core has a relatively shallow branch predictor compared to x86 — mispredictions cost 11-13 cycles, and the FRI folding loop has data-dependent branches that the hardware cannot easily predict without profile guidance.

Result: 2% improvement. Small but free — we were at the limit of what's algorithmically possible.

How the STARK Prover Fits the Production Pipeline

In H33's full-stack authentication pipeline, the STARK prover does not operate in isolation. Each authentication request flows through three stages, all within a single API call:

Stage Component Latency Post-Quantum
1. FHE Batch BFV inner product (32 users/CT) ~1,109 µs Yes (lattice)
2. ZKP Lookup In-process DashMap cache 0.085 µs Yes (SHA3-256)
3. Attestation Dilithium sign + verify ~244 µs Yes (ML-DSA)
Total 32-user batch ~1,356 µs Fully PQ

The optimized STARK prover's 6.96ms latency is used during initial proof generation at enrollment time and for periodic re-verification. The ZKP lookup stage at runtime uses cached STARK proofs via an in-process DashMap, reducing per-request ZKP cost to 0.085 microseconds. This caching strategy is only viable because the prover is fast enough to regenerate proofs on cache miss without violating the ~42µs per-auth latency budget. Before optimization, a 20ms prover would have required pre-computation of every possible proof — an impractical combinatorial explosion.

Final Results: World's Fastest STARK Prover

Final Performance

20ms → 6.96ms (65% faster, 2.8x improvement)

vs Plonky3: 30-50% faster on equivalent circuits

vs Target (10-11ms): 37% better than goal

Status: #1 worldwide for STARK proving speed

Optimization Time Saved % Improvement
NEON NTT 0.5ms 2.5%
Parallel Merkle 5.5ms 28%
Batch Inversion 6.9ms 49%
PGO 0.14ms 2%
Total 13.04ms 65%

Key Takeaways

  1. Batch inversion is transformative — If you're doing FRI-based proofs without batch inversion, you're leaving 2x performance on the table.
  2. Merkle trees parallelize trivially — Don't let tree construction be a bottleneck. Pre-allocate the buffer and let Rayon do the rest.
  3. SIMD helps, but algorithms matter more — Our biggest gains came from algorithmic improvements, not micro-optimizations. The NEON NTT gave 2.5%; batch inversion gave 49%.
  4. Measure everything — We profiled every function. The hotspots were not where we expected. FRI inversions, not NTT butterflies, were the dominant cost.
  5. Optimize for the full pipeline — A fast prover enables architectural decisions (like proof caching) that multiply the savings far beyond the prover itself.

Combined with our FHE optimizations, H33 now offers the world's fastest quantum-resistant authentication stack. The entire pipeline — BFV encryption, STARK verification, and Dilithium attestation — completes in ~42 microseconds per authentication, sustaining 1.595 million auth/sec on a single Graviton4 instance. Next up: how we achieved 86% faster homomorphic multiplication.

Try the World's Fastest STARK Prover

6.96ms proving time. Quantum-resistant. No trusted setup required.

Get API Key

Build With Post-Quantum Security

Enterprise-grade FHE, ZKP, and post-quantum cryptography. One API call. Sub-millisecond latency.

Get Free API Key → Read the Docs
Free tier · 10,000 API calls/month · No credit card required
Verify It Yourself