Engineering · 5 min read

From Milliseconds to Microseconds:
The H33 Performance Journey

How H33 achieved 1.28ms full auth at millions of verifications per second: the engineering story behind our 2026 production benchmarks.

~42µs
Auth Latency
2.17M/s
Throughput
128-bit
Security
Zero
Plaintext

When we started building H33, ZK proof generation took seconds. FHE operations took minutes. The idea of sub-millisecond authentication with full cryptographic privacy seemed impossible. This is the story of how we got to 1.28ms full auth and, ultimately, 1.595 million authentications per second on production hardware—all while keeping every biometric template encrypted under BFV fully homomorphic encryption with post-quantum Dilithium signatures.

"The goal was never just 'fast enough.' It was authentication so fast that adding security adds zero perceived latency."

The Starting Point

Our first prototype used off-the-shelf cryptographic libraries. The numbers were... humbling:

Early 2025
2.3 seconds
Initial ZK proof generation using standard libraries

2.3 seconds for a ZK proof. Acceptable for blockchain transactions. Completely unusable for authentication. Users would abandon the login before it completed.

Phase 1: Circuit Optimization

The first breakthrough came from rethinking our ZK circuits. Generic ZK circuits are designed for arbitrary computation. We needed circuits optimized specifically for authentication.

Q2 2025
180ms
After circuit optimization (12x improvement)

180ms was usable but not invisible. Users would still perceive a slight delay.

Phase 2: The Rust Rewrite

Our JavaScript implementation hit a wall. Garbage collection pauses alone could exceed our latency budget. We rewrote the cryptographic core in Rust:

Q3 2025
4.2ms
After Rust rewrite (43x improvement)

4.2ms was getting close. Sub-10ms is generally imperceptible. But we knew we could do better.

Phase 3: Parallelism and Batching

Modern CPUs have many cores. Our single-threaded prover was leaving performance on the table:

The batching insight became central to our production architecture. Our BFV FHE scheme uses polynomial ring dimension N=4096 with a plaintext modulus of t=65537, which satisfies the CRT batching condition t ≡ 1 (mod 2N). This gives us 4,096 SIMD slots per ciphertext. Since each biometric template occupies 128 dimensions, we pack 32 independent user templates into a single ciphertext—reducing per-user storage from roughly 32MB to 256KB and allowing one FHE inner product to verify an entire batch simultaneously.

Key Insight

SIMD batching is not just a throughput trick—it fundamentally changes the cost model. A single BFV ciphertext multiply costs the same whether it encodes 1 user or 32. By filling every slot, we amortize the expensive NTT and key-switching operations across the full batch.
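The slot arithmetic behind this packing is simple enough to check directly. The sketch below uses the parameters quoted above (N=4096, t=65537, 128-dimensional templates); the constant names and row-major layout are illustrative, not H33's actual API.

```rust
const RING_DIM: usize = 4096;     // N: polynomial ring dimension = SIMD slot count
const TEMPLATE_DIM: usize = 128;  // dimensions per biometric template
const PLAINTEXT_MOD: u64 = 65537; // t: must satisfy t ≡ 1 (mod 2N) for CRT batching

fn main() {
    // CRT batching condition: t ≡ 1 (mod 2N)
    assert_eq!(PLAINTEXT_MOD % (2 * RING_DIM as u64), 1);

    // Independent templates packed into one ciphertext
    let users_per_ct = RING_DIM / TEMPLATE_DIM;
    assert_eq!(users_per_ct, 32);

    // Slot range occupied by user u under a simple stripe layout (illustrative)
    let u = 5;
    let slots = (u * TEMPLATE_DIM)..((u + 1) * TEMPLATE_DIM);
    println!("user {u} occupies slots {slots:?}");
}
```

One ciphertext multiply then touches all 4,096 slots at once, which is exactly why the cost per user drops by the batch factor.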

Q4 2025
890µs
After parallelization (4.7x improvement)

Sub-millisecond! But our batch processing revealed even more opportunity.

Phase 4: Intelligent Caching

We noticed that verification was often redundant. The same proofs were being verified multiple times:

The cache layer went through its own performance journey. Our initial approach used a TCP-based RESP proxy (Cachee) for distributed cache lookups. At low worker counts this worked fine, but at 96 concurrent workers on our Graviton4 production instance, a single Docker container serializing all TCP connections became a catastrophic bottleneck—throughput dropped from 1.51M to 136K auth/sec, an 11x regression. The solution was an in-process DashMap that delivers 0.085µs lookups with zero network overhead, giving us a net 5.5% throughput gain over raw uncached verification.

The Cache Breakthrough

Cold verification: 2.14ms → Cached verification: 32µs
67x speedup for returning users
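The shape of the in-process cache is a memoized verifier keyed by proof digest. The post names DashMap as the production structure; this dependency-free sketch substitutes a `RwLock<HashMap>` to show the same hit/miss logic, with all names illustrative.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Minimal in-process proof-verification cache, keyed by a 32-byte proof digest.
// Production uses a sharded concurrent map (DashMap); this sketch uses std only.
struct ProofCache {
    verified: RwLock<HashMap<[u8; 32], bool>>,
}

impl ProofCache {
    fn new() -> Self {
        Self { verified: RwLock::new(HashMap::new()) }
    }

    // Return the cached verdict, or run the (expensive) verifier and memoize it.
    fn verify_or_insert(&self, digest: [u8; 32], verify: impl FnOnce() -> bool) -> bool {
        if let Some(&ok) = self.verified.read().unwrap().get(&digest) {
            return ok; // warm path: no network hop, no re-verification
        }
        let ok = verify(); // cold path: full cryptographic verification
        self.verified.write().unwrap().insert(digest, ok);
        ok
    }
}

fn main() {
    let cache = ProofCache::new();
    let digest = [7u8; 32];
    assert!(cache.verify_or_insert(digest, || true)); // cold: verifier runs
    assert!(cache.verify_or_insert(digest, || panic!("should be cached"))); // warm: hit
}
```

Keeping this map in-process is the whole point: a hit costs one hash lookup, with no serialization or TCP round trip to collapse under 96 workers.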

Phase 5: NTT Domain Engineering

The final wave of optimizations targeted the Number Theoretic Transform itself—the polynomial multiplication engine at the heart of BFV. Every FHE encrypt, decrypt, and key-switch operation depends on forward and inverse NTTs, so shaving microseconds here multiplied across every pipeline stage.

// Montgomery-form NTT butterfly (no division in hot path)
let t = montgomery_reduce(a[j + half] as u128 * twiddle as u128, q, q_inv);
a[j + half] = a[j].wrapping_add(two_q).wrapping_sub(t); // Harvey lazy: +2q avoids underflow, result in [0, 2q)
a[j] = a[j].wrapping_add(t);                            // defer final reduction

We converted all NTT twiddle factors to Montgomery form at keygen time, replacing per-butterfly modular division with a single Montgomery reduction (REDC). Combined with Harvey lazy reduction—keeping intermediate butterfly values in [0, 2q) and deferring the final mod until the last stage—this eliminated the most expensive operation in the inner loop. We also fused the INTT post-processing step with inverse Montgomery conversion, reducing three REDC operations to two per coefficient via a precomputed fused_inv_mont constant.
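For readers unfamiliar with REDC, here is a self-contained sketch of the primitive the butterfly relies on, for a generic odd modulus. The modulus below is the well-known NTT-friendly prime 998244353 (119·2²³ + 1), chosen purely for illustration; H33's production moduli and precomputed constants are not published in this post.

```rust
// -q^{-1} mod 2^64, via Newton's iteration (q must be odd).
fn neg_inv(q: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        // each iteration doubles the number of correct low bits
        inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

// REDC: for t < q * 2^64, returns t * 2^{-64} mod q, result in [0, q).
fn montgomery_reduce(t: u128, q: u64, q_neg_inv: u64) -> u64 {
    let m = (t as u64).wrapping_mul(q_neg_inv);             // m = -t * q^{-1} mod 2^64
    let r = ((t + (m as u128) * (q as u128)) >> 64) as u64; // t + m*q is divisible by 2^64
    if r >= q { r - q } else { r }                          // single conditional subtract
}

fn main() {
    let q: u64 = 998_244_353; // illustrative NTT-friendly prime (119 * 2^23 + 1)
    let ninv = neg_inv(q);
    // REDC(x * 2^64) must recover x for any x < q
    for x in [0u64, 1, 123_456_789, q - 1] {
        assert_eq!(montgomery_reduce((x as u128) << 64, q, ninv), x);
    }
    println!("REDC round-trip ok for q = {q}");
}
```

The payoff is that every butterfly replaces a modular division with one multiply, one shift, and one conditional subtract.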

The biggest single win came from keeping multiply-plain results in NTT form. Our multiply_plain_ntt() returns ciphertexts with is_ntt_form: true, skipping two inverse NTTs per modulus per call. Since the batch verification pipeline chains multiple multiply-accumulate operations before a single final INTT, this saved hundreds of microseconds per batch—dropping batch latency from 1,375µs to 1,109µs on production hardware.

Phase 6: Session Resume

If a user's context hasn't changed, why re-authenticate everything? Session resume verifies only that the session is still valid:

January 2026
42µs
Session resume for returning users
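The fast path can be pictured as a constant-time validity check: no FHE inner product, no ZK proof, just token expiry and integrity. Everything in this sketch is hypothetical (the field names, the toy keyed tag in place of a real MAC); it shows only the shape of the check.

```rust
use std::time::{Duration, Instant};

struct Session {
    id: u64,
    issued: Instant,
    ttl: Duration,
    tag: u64, // stand-in for a real keyed MAC over the session state
}

// Placeholder keyed tag; production would use an HMAC/KMAC, never this.
fn tag_for(key: u64, id: u64) -> u64 {
    key.rotate_left(17) ^ id.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

// O(1) resume check: unexpired and untampered, or fall back to full auth.
fn resume(key: u64, s: &Session) -> bool {
    s.issued.elapsed() < s.ttl && s.tag == tag_for(key, s.id)
}

fn main() {
    let key = 42;
    let s = Session {
        id: 7,
        issued: Instant::now(),
        ttl: Duration::from_secs(300),
        tag: tag_for(key, 7),
    };
    assert!(resume(key, &s));            // valid session: microsecond fast path
    let forged = Session { tag: 0, ..s };
    assert!(!resume(key, &forged));      // bad tag: forced back to full auth
}
```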

The Production Numbers

Our February 2026 benchmarks on a c8g.metal-48xl (AWS Graviton4, 192 vCPUs) represent the culmination of this journey. Each authentication passes through three pipeline stages—FHE batch verification, ZKP cache lookup, and Dilithium attestation—all post-quantum secure:

Pipeline Stage                 | Latency    | PQ-Secure
BFV inner product (32 users)   | ~1,109 µs  | Yes (lattice)
ZKP DashMap lookup             | 0.085 µs   | Yes (SHA3-256)
Dilithium sign + verify        | ~244 µs    | Yes (ML-DSA)
Total (32-user batch)          | ~1,356 µs  |
Per authentication             | ~42 µs     |
Key Insight

Batch attestation was a critical multiplier. Instead of signing and verifying a Dilithium signature for each of 32 users individually, we sign the entire batch digest once. That single operation costs ~244µs total instead of ~7,808µs (32 × 244), eliminating 31 of the 32 sign/verify operations.
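The amortization above is worth a quick back-of-envelope check, using only the numbers quoted in the table (~244µs per ML-DSA sign + verify, 32-user batches):

```rust
// Arithmetic check of the batch-attestation saving, using the post's figures.
fn main() {
    let sign_verify_us = 244u64; // one Dilithium (ML-DSA) sign + verify round
    let batch = 32u64;

    let naive = sign_verify_us * batch; // one signature per user
    assert_eq!(naive, 7_808);

    let batched = sign_verify_us;                 // sign the batch digest once
    let eliminated = batch - 1;                   // 31 of 32 operations disappear
    let per_auth = batched as f64 / batch as f64; // amortized attestation cost

    println!("{naive}µs -> {batched}µs ({eliminated} ops eliminated, {per_auth}µs per auth)");
}
```

At ~7.6µs of amortized attestation per authentication, signing stops being the bottleneck and the FHE inner product dominates the budget again.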

What We Learned

1. Specialize ruthlessly. Generic solutions are generic-speed. Every optimization came from understanding exactly what we needed to compute.

2. Measure everything. Intuition about performance is usually wrong. Profile, don't guess. We tried arena pooling, fused twiddle tables, and jemalloc—all of which benchmarked slower in production despite looking promising in isolation.

3. Question assumptions. "ZK proofs are slow" was an assumption. We challenged it.

4. Cache is king. The fastest computation is the one you don't do. Our 67x cache speedup proves it—but the cache architecture itself matters. TCP proxies that work at 8 workers can collapse at 96.

5. Hardware matters. SIMD, cache locality, memory bandwidth—understanding hardware unlocked our biggest gains. On Graviton4, the system allocator outperformed jemalloc by 8% because glibc's malloc is heavily optimized for ARM's flat memory model.

What's Next

We're not done.

The journey from milliseconds to microseconds taught us that performance limits are often just engineering challenges waiting for the right abstraction. When your inner loop runs 1.595 million times per second, every instruction counts—and every assumption is worth re-examining.

Experience the Performance

1.28ms full auth. 42µs session resume. 2.17M auth/sec sustained. Try it yourself.

Get Free API Key
