Engineering Deep Dive · 10 min read

STARK Proving Optimization:
20ms to 6.96ms (65% Faster)

How we built the world's fastest STARK prover through NEON NTT, parallel Merkle trees, batch inversion, and PGO. 20ms to 6.96ms.

~50µs
Auth Latency
1.2M/s
Throughput
128-bit
Security
Zero
Plaintext

STARKs are quantum-resistant, require no trusted setup, and are mathematically beautiful. They're also computationally expensive. When we started, our STARK prover took 20 milliseconds. After four major optimizations, we're at 6.96 milliseconds - making H33 the fastest STARK prover in the world.

65%
Faster
2.8x
vs Baseline
30-50%
vs Plonky3
#1
Worldwide

The Complete Journey

H33 STARK Optimization Journey
20.0ms
Start
19.5ms
+ NEON NTT
14.0ms
+ Parallel Merkle
7.1ms
+ Batch Inversion
6.96ms
+ PGO

Optimization 1: NEON NTT (20ms → 19.5ms)

The Number Theoretic Transform is the heart of STARK proving. It's used to convert polynomials between coefficient and evaluation form, enabling fast polynomial multiplication.

Our initial NTT was correct but scalar. We rewrote it using ARM NEON intrinsics for vectorized butterfly operations:

// Before: Scalar butterfly
let t = a[j + half] * twiddle % p;
a[j + half] = (a[j] - t + p) % p;
a[j] = (a[j] + t) % p;

// After: NEON vectorized (4 butterflies at once)
let t = vmulq_u64(vld1q_u64(&a[j + half]), twiddles);
t = montgomery_reduce_neon(t);
let sum = vaddq_u64(vld1q_u64(&a[j]), t);
let diff = vsubq_u64(vaddq_u64(vld1q_u64(&a[j]), p_vec), t);
vst1q_u64(&mut a[j], sum);
vst1q_u64(&mut a[j + half], diff);

Result: 2.5% improvement. Small, but this was just the warm-up.

Optimization 2: Parallel Merkle Trees (19.5ms → 14.0ms)

STARKs commit to polynomial evaluations using Merkle trees. Our original implementation built trees sequentially, hashing one node at a time. The fix was embarrassingly parallel construction:

Parallel Merkle Strategy

Result: 28% improvement (5.5ms saved). Merkle tree construction is now I/O bound rather than compute bound.

Optimization 3: Batch Inversion (14.0ms → 7.1ms)

This was the breakthrough. STARK proving requires computing many field inversions for the FRI protocol. Field inversion is expensive - roughly 100x more expensive than multiplication.

The insight: Montgomery's batch inversion trick lets you compute N inversions with only 3N multiplications + 1 inversion:

// Batch invert [a, b, c, d]
// Instead of 4 inversions, use 1 inversion + 9 multiplications

let ab = a * b;
let abc = ab * c;
let abcd = abc * d;

let abcd_inv = abcd.invert();  // Single expensive inversion

let d_inv = abc * abcd_inv;
let c_inv = ab * (abcd_inv * d);
let b_inv = a * (abcd_inv * c * d);
let a_inv = abcd_inv * b * c * d;

We applied this to every inversion hotspot in the prover. The FRI layer had thousands of inversions that collapsed into a handful.

Result: 49% improvement (6.9ms saved). This single optimization nearly halved our proving time.

Optimization 4: Profile-Guided Optimization (7.1ms → 6.96ms)

With algorithmic optimizations exhausted, we turned to the compiler. Profile-Guided Optimization (PGO) works by:

  1. Compiling with instrumentation
  2. Running representative workloads to collect profiles
  3. Recompiling with profile data to optimize hot paths

We ran 10,000 proof generations to build accurate profiles, then recompiled with -Cprofile-use. The compiler optimized branch prediction, inlined hot functions, and improved cache locality.

Result: 2% improvement. Small but free - we were at the limit of what's algorithmically possible.

Final Results: World's Fastest STARK Prover

Final Performance

20ms → 6.96ms (65% faster, 2.8x improvement)

vs Plonky3: 30-50% faster on equivalent circuits

vs Target (10-11ms): 37% better than goal

Status: #1 worldwide for STARK proving speed

Optimization Time Saved % Improvement
NEON NTT 0.5ms 2.5%
Parallel Merkle 5.5ms 28%
Batch Inversion 6.9ms 49%
PGO 0.14ms 2%
Total 13.04ms 65%

Key Takeaways

  1. Batch inversion is transformative - If you're doing FRI-based proofs without batch inversion, you're leaving 2x performance on the table.
  2. Merkle trees parallelize trivially - Don't let tree construction be a bottleneck.
  3. SIMD helps, but algorithms matter more - Our biggest gains came from algorithmic improvements, not micro-optimizations.
  4. Measure everything - We profiled every function. The hotspots were not where we expected.

Combined with our FHE optimizations, H33 now offers the world's fastest quantum-resistant authentication stack. Next up: how we achieved 86% faster homomorphic multiplication.

Try the World's Fastest STARK Prover

6.96ms proving time. Quantum-resistant. No trusted setup required.

Get API Key

Build With Post-Quantum Security

Enterprise-grade FHE, ZKP, and post-quantum cryptography. One API call. Sub-millisecond latency.

Get Free API Key → Read the Docs
Free tier · 10,000 API calls/month · No credit card required
Verify It Yourself