When we started building H33, ZK proof generation took seconds. FHE operations took minutes. The idea of sub-millisecond authentication with full cryptographic privacy seemed impossible. This is the story of how we got to 1.28ms full auth and, ultimately, 1.595 million authentications per second on production hardware—all while keeping every biometric template encrypted under BFV fully homomorphic encryption with post-quantum Dilithium signatures.
"The goal was never just 'fast enough.' It was authentication so fast that adding security adds zero perceived latency."
The Starting Point
Our first prototype used off-the-shelf cryptographic libraries. The numbers were... humbling:
2.3 seconds for a ZK proof. Acceptable for blockchain transactions. Completely unusable for authentication. Users would abandon the login before it completed.
Phase 1: Circuit Optimization
The first breakthrough came from rethinking our ZK circuits. Generic ZK circuits are designed for arbitrary computation. We needed circuits optimized specifically for authentication.
- Removed unnecessary constraints: Our circuits don't need to prove arbitrary computation, just identity claims
- Optimized hash functions: Switched to ZK-friendly hashes (Poseidon) for in-circuit operations
- Reduced public inputs: Minimized data that must be revealed
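The arithmetic behind the hash switch can be sketched with ballpark figures. The constraint counts and hash count below are illustrative assumptions, not measurements from our circuits: SHA-256 costs on the order of tens of thousands of R1CS constraints because its bitwise operations must be emulated in a prime field, while Poseidon's S-boxes are native field operations.

```rust
// Illustrative only: per-hash constraint counts are ballpark assumptions.
fn total_constraints(per_hash: u64, hashes: u64) -> u64 {
    per_hash * hashes
}

fn main() {
    let sha256 = 27_000u64;  // assumed ~constraints for one SHA-256 compression in R1CS
    let poseidon = 300u64;   // assumed ~constraints for one Poseidon permutation
    let hashes = 40u64;      // hypothetical number of in-circuit hashes per proof
    let before = total_constraints(sha256, hashes);
    let after = total_constraints(poseidon, hashes);
    println!("{} -> {} constraints ({}x smaller)", before, after, before / after);
}
```

Prover time scales roughly with constraint count, which is why this single substitution dominated the Phase 1 gains.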
These changes cut proof generation to 180ms: usable, but not invisible. Users would still perceive a slight delay.
Phase 2: The Rust Rewrite
Our JavaScript implementation hit a wall. Garbage collection pauses alone could exceed our latency budget. We rewrote the cryptographic core in Rust:
- Zero-copy operations: No unnecessary buffer allocations
- SIMD vectorization: AVX-512 for parallel field operations
- Memory-mapped I/O: Direct hardware access for key operations
- No GC pauses: Deterministic memory management
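A minimal sketch of the hot-path style the rewrite enforced: in-place modular addition over a coefficient slice, with no allocation and a branchless conditional subtract that the compiler can auto-vectorize. The modulus here is a common NTT-friendly prime used as a stand-in, not a production parameter.

```rust
// In-place, allocation-free modular add; branchless form is the shape
// AVX-512 intrinsics (or autovectorization) would take.
fn add_mod_inplace(a: &mut [u64], b: &[u64], q: u64) {
    for (x, &y) in a.iter_mut().zip(b) {
        let s = *x + y;                    // no overflow: both operands < q < 2^63
        *x = s - (q * ((s >= q) as u64));  // branchless conditional subtract
    }
}

fn main() {
    let q = 998_244_353u64; // NTT-friendly prime, stand-in for our real moduli
    let mut a = vec![q - 1, 5, 123];
    let b = vec![1, q - 2, 7];
    add_mod_inplace(&mut a, &b, q);
    assert_eq!(a, vec![0, 3, 130]);
    println!("{:?}", a);
}
```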
4.2ms was getting close. Sub-10ms is generally imperceptible. But we knew we could do better.
Phase 3: Parallelism and Batching
Modern CPUs have many cores. Our single-threaded prover was leaving performance on the table:
- Parallel witness generation: Independent witness components computed simultaneously
- Batched NTT operations: Number Theoretic Transforms grouped for cache efficiency
- Work stealing: Idle cores pick up work from busy cores
The batching insight became central to our production architecture. Our BFV FHE scheme uses polynomial ring dimension N=4096 with a plaintext modulus of t=65537, which satisfies the CRT batching condition t ≡ 1 (mod 2N). This gives us 4,096 SIMD slots per ciphertext. Since each biometric template occupies 128 dimensions, we pack 32 independent user templates into a single ciphertext—reducing per-user storage from roughly 32MB to 256KB and allowing one FHE inner product to verify an entire batch simultaneously.
SIMD batching is not just a throughput trick—it fundamentally changes the cost model. A single BFV ciphertext multiply costs the same whether it encodes 1 user or 32. By filling every slot, we amortize the expensive NTT and key-switching operations across the full batch.
Sub-millisecond! But our batch processing revealed even more opportunity.
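The slot layout described above can be sketched directly: 4,096 slots, 128 dimensions per template, so user `u`'s template occupies slots `u*128 .. (u+1)*128`. This is a plaintext-side sketch of the packing only (the encoding/encryption steps are omitted), and the small coefficient values stand in for entries reduced mod t.

```rust
// N = 4096 BFV slots, 128-dim templates => 32 users per ciphertext.
const N_SLOTS: usize = 4096;
const DIM: usize = 128;
const USERS_PER_CT: usize = N_SLOTS / DIM; // 32

// Pack up to 32 templates into one slot vector: slot index = user * DIM + dim.
fn pack_templates(templates: &[[u16; DIM]]) -> Vec<u16> {
    assert!(templates.len() <= USERS_PER_CT);
    let mut slots = vec![0u16; N_SLOTS];
    for (u, t) in templates.iter().enumerate() {
        slots[u * DIM..(u + 1) * DIM].copy_from_slice(t);
    }
    slots
}

fn main() {
    assert_eq!(USERS_PER_CT, 32);
    let slots = pack_templates(&[[7u16; DIM], [9u16; DIM]]);
    assert_eq!(slots[0], 7);       // user 0, dim 0
    assert_eq!(slots[DIM], 9);     // user 1, dim 0
    assert_eq!(slots[2 * DIM], 0); // unused slot block
    println!("{} users per ciphertext", USERS_PER_CT);
}
```

One slot-wise multiply against a packed query then computes 32 users' inner-product terms in a single ciphertext operation.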
Phase 4: Intelligent Caching
We noticed that verification was often redundant. The same proofs were being verified multiple times:
- Proof fingerprinting: Unique identifier for each proof via SHA3-256 digest
- Cache-aware verification: Skip full verification for known-valid proofs
- Session context caching: Reuse verification work across requests
The cache layer went through its own performance journey. Our initial approach used a TCP-based RESP proxy (Cachee) for distributed cache lookups. At low worker counts this worked fine, but at 96 concurrent workers on our Graviton4 production instance, a single Docker container serializing all TCP connections became a catastrophic bottleneck—throughput dropped from 1.51M to 136K auth/sec, an 11x regression. The solution was an in-process DashMap that delivers 0.085µs lookups with zero network overhead, giving us a net 5.5% throughput gain over raw uncached verification.
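The in-process cache pattern looks roughly like this. Production uses DashMap keyed by a SHA3-256 proof digest; the sketch below substitutes std's `RwLock<HashMap>` and `DefaultHasher` so it compiles with no external crates. Both substitutions are simplifications, not our production code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

// Stand-in for the SHA3-256 proof fingerprint.
fn fingerprint(proof_bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    proof_bytes.hash(&mut h);
    h.finish()
}

struct ProofCache {
    valid: RwLock<HashMap<u64, bool>>, // stand-in for DashMap
}

impl ProofCache {
    fn verify(&self, proof: &[u8], cold_verify: impl Fn(&[u8]) -> bool) -> bool {
        let fp = fingerprint(proof);
        if let Some(&ok) = self.valid.read().unwrap().get(&fp) {
            return ok; // cache hit: skip the expensive cold path entirely
        }
        let ok = cold_verify(proof); // cache miss: full verification
        self.valid.write().unwrap().insert(fp, ok);
        ok
    }
}

fn main() {
    use std::cell::Cell;
    let cache = ProofCache { valid: RwLock::new(HashMap::new()) };
    let cold_calls = Cell::new(0u32);
    // hypothetical cold verifier: accepts any non-empty proof
    let cold = |p: &[u8]| { cold_calls.set(cold_calls.get() + 1); !p.is_empty() };
    assert!(cache.verify(b"proof-A", &cold));
    assert!(cache.verify(b"proof-A", &cold)); // second call is a cache hit
    assert_eq!(cold_calls.get(), 1);          // cold path ran exactly once
    println!("cold verifications: {}", cold_calls.get());
}
```

Because the map lives in the worker's own address space, a lookup is a hash plus a shard lock, with no serialization or network hop to collapse under 96-way concurrency.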
The Cache Breakthrough
Cold verification: 2.14ms → Cached verification: 32µs
67x speedup for returning users
Phase 5: NTT Domain Engineering
The final wave of optimizations targeted the Number Theoretic Transform itself—the polynomial multiplication engine at the heart of BFV. Every FHE encrypt, decrypt, and key-switch operation depends on forward and inverse NTTs, so shaving microseconds here multiplied across every pipeline stage.
```rust
// Montgomery-form NTT butterfly (no division in hot path)
let t = montgomery_reduce(a[j + half] as u128 * twiddle as u128, q, q_inv);
a[j + half] = a[j].wrapping_add(q).wrapping_sub(t); // Harvey lazy: result in [0, 2q)
a[j] = a[j].wrapping_add(t);                        // defer final reduction
```

We converted all NTT twiddle factors to Montgomery form at keygen time, replacing per-butterfly modular division with a single Montgomery reduction (REDC). Combined with Harvey lazy reduction—keeping intermediate butterfly values in [0, 2q) and deferring the final mod until the last stage—this eliminated the most expensive operation in the inner loop. We also fused the INTT post-processing step with inverse Montgomery conversion, reducing three REDC operations to two per coefficient via a precomputed fused_inv_mont constant.
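For readers unfamiliar with REDC, here is the reduction itself sketched in full, with R = 2^64. `q_inv` is -q^{-1} mod 2^64, computed once up front (at keygen in our pipeline) by Newton's iteration; the modulus is a small NTT-friendly prime for illustration, not a production BFV parameter.

```rust
// Compute -q^{-1} mod 2^64 for odd q; each Newton step doubles the
// number of correct low bits, so 6 steps reach 64 bits.
fn neg_inv_mod_r(q: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// REDC: returns x * 2^-64 mod q, for x < q * 2^64. No division anywhere.
fn montgomery_reduce(x: u128, q: u64, q_inv: u64) -> u64 {
    let m = (x as u64).wrapping_mul(q_inv);                 // m = x * (-q^-1) mod 2^64
    let t = ((x + (m as u128) * (q as u128)) >> 64) as u64; // exact shift: x + m*q ≡ 0 mod 2^64
    if t >= q { t - q } else { t }                          // single conditional subtract
}

fn main() {
    let q: u64 = 998_244_353; // NTT-friendly prime, stand-in for real moduli
    let q_inv = neg_inv_mod_r(q);
    let (a, b) = (123_456_789u64, 987_654_321u64);
    let r = montgomery_reduce(a as u128 * b as u128, q, q_inv);
    // r == a*b*2^-64 mod q, so shifting back by 64 recovers a*b mod q
    assert_eq!(((r as u128) << 64) % q as u128, (a as u128 * b as u128) % q as u128);
    println!("REDC round-trip ok");
}
```

With twiddles pre-converted to Montgomery form, the product inside the butterfly is already scaled so that one REDC lands the result back in the ordinary domain.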
The biggest single win came from keeping multiply-plain results in NTT form. Our multiply_plain_ntt() returns ciphertexts with is_ntt_form: true, skipping two inverse NTTs per modulus per call. Since the batch verification pipeline chains multiple multiply-accumulate operations before a single final INTT, this saved hundreds of microseconds per batch—dropping batch latency from 1,375µs to 1,109µs on production hardware.
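The reason staying in NTT form pays: in the NTT domain, polynomial multiplication is a pointwise product, so a chain of multiply-accumulate steps can run entirely on transformed coefficients with one inverse NTT at the very end. The helper and modulus below are illustrative stand-ins, not our BFV internals.

```rust
// Pointwise multiply-accumulate on NTT-domain coefficients: acc += ct * pt (mod q).
// No inverse NTT between chained calls; a single INTT happens after the last one.
fn pointwise_mac(acc: &mut [u64], ct: &[u64], pt: &[u64], q: u64) {
    for i in 0..acc.len() {
        acc[i] = (acc[i] + (ct[i] as u128 * pt[i] as u128 % q as u128) as u64) % q;
    }
}

fn main() {
    let q = 998_244_353u64;
    let mut acc = vec![0u64; 4];
    // two chained multiply-plain steps, both in the NTT domain
    pointwise_mac(&mut acc, &[1, 2, 3, 4], &[5, 6, 7, 8], q);
    pointwise_mac(&mut acc, &[1, 1, 1, 1], &[9, 9, 9, 9], q);
    assert_eq!(acc, vec![14, 21, 30, 41]); // single final INTT would follow here
    println!("{:?}", acc);
}
```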
Phase 6: Session Resume
If a user's context hasn't changed, why re-authenticate everything? Session resume skips the full FHE and ZKP path and verifies only that an existing session is still valid, which is how it completes in 50µs rather than 1.28ms.
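A minimal sketch of the resume check, under assumptions (this is the shape of the idea, not our wire format or token scheme): the server only confirms the presented token matches the stored one and the session is unexpired.

```rust
use std::time::{Duration, SystemTime};

struct Session {
    token_mac: u64,          // stand-in for a real MAC/signature over the token
    expires_at: SystemTime,
}

// Resume succeeds iff the token matches and the session hasn't expired.
fn resume_ok(s: &Session, presented_mac: u64, now: SystemTime) -> bool {
    presented_mac == s.token_mac && now < s.expires_at
}

fn main() {
    let now = SystemTime::now();
    let s = Session { token_mac: 0xDEAD_BEEF, expires_at: now + Duration::from_secs(300) };
    assert!(resume_ok(&s, 0xDEAD_BEEF, now));                             // valid resume
    assert!(!resume_ok(&s, 0xBAD_C0DE, now));                             // wrong token
    assert!(!resume_ok(&s, 0xDEAD_BEEF, now + Duration::from_secs(301))); // expired
    println!("session resume checks pass");
}
```

In production the comparison would be constant-time and the token cryptographically bound to the original authentication, but the cost profile is the same: two cheap checks instead of an FHE pipeline.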
The Production Numbers
Our February 2026 benchmarks on a c8g.metal-48xl (AWS Graviton4, 192 vCPUs) represent the culmination of this journey. Each authentication passes through three pipeline stages—FHE batch verification, ZKP cache lookup, and Dilithium attestation—all post-quantum secure:
| Pipeline Stage | Latency | PQ-Secure |
|---|---|---|
| BFV inner product (32 users) | ~1,109 µs | Yes (lattice) |
| ZKP DashMap lookup | 0.085 µs | Yes (SHA3-256) |
| Dilithium sign + verify | ~244 µs | Yes (ML-DSA) |
| Total (32-user batch) | ~1,356 µs | — |
| Per authentication | ~42 µs | — |
- Sustained throughput: 2,172,518 auth/sec across 96 parallel workers
- Full Auth (Turbo): 1.28ms (10,000x faster than our first prototype)
- Session Resume: 50µs
- Cached Proof Verify: 32µs (67x faster than cold)
Batch attestation was a critical multiplier. Instead of signing and verifying a Dilithium signature for each of 32 users individually, we sign the entire batch digest once. That single operation costs ~244µs total instead of ~7,808µs (32 × 244µs), a 32x reduction in attestation overhead.
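The batch-digest trick sketched below: hash all 32 per-user results into one digest, then sign only that. `DefaultHasher` stands in for the real digest function and the actual Dilithium call is omitted; the point is the cost arithmetic.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Fold all per-user results into one digest; a single Dilithium signature
// over this value attests the whole batch. (DefaultHasher is a stand-in.)
fn batch_digest(results: &[[u8; 32]]) -> u64 {
    let mut h = DefaultHasher::new();
    for r in results {
        r.hash(&mut h);
    }
    h.finish()
}

fn main() {
    let per_sig_us = 244u64; // measured sign+verify cost from the table above
    let users = 32u64;
    let results = vec![[0u8; 32]; users as usize];
    let _digest = batch_digest(&results); // one digest -> one signature
    let naive = users * per_sig_us;
    println!("naive: {}us, batched: {}us ({}x)", naive, per_sig_us, naive / per_sig_us);
}
```

Any verifier who holds the 32 results can recompute the digest and check the single signature, so per-user attestation cost collapses to a hash.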
What We Learned
1. Specialize ruthlessly. Generic solutions are generic-speed. Every optimization came from understanding exactly what we needed to compute.
2. Measure everything. Intuition about performance is usually wrong. Profile, don't guess. We tried arena pooling, fused twiddle tables, and jemalloc—all of which benchmarked slower in production despite looking promising in isolation.
3. Question assumptions. "ZK proofs are slow" was an assumption. We challenged it.
4. Cache is king. The fastest computation is the one you don't do. Our 67x cache speedup proves it—but the cache architecture itself matters. TCP proxies that work at 8 workers can collapse at 96.
5. Hardware matters. SIMD, cache locality, memory bandwidth—understanding hardware unlocked our biggest gains. On Graviton4, the system allocator outperformed jemalloc by 8% because glibc's malloc is heavily optimized for ARM's flat memory model.
What's Next
We're not done. Our roadmap includes:
- GPU acceleration for batch proof generation
- Custom ASIC design for FHE operations
- Recursive proof composition for unlimited scale
- Sub-10µs session validation
The journey from milliseconds to microseconds taught us that performance limits are often just engineering challenges waiting for the right abstraction. When your inner loop runs 1.595 million times per second, every instruction counts—and every assumption is worth re-examining.
Experience the Performance
1.28ms full auth. 50µs session resume. 2.17M auth/sec sustained. Try it yourself.
Get Free API Key