When we started building H33, ZK proof generation took seconds. FHE operations took minutes. The idea of sub-millisecond authentication with full cryptographic privacy seemed impossible. This is the story of how we got to 1.28ms full auth and, ultimately, 1.595 million authentications per second on production hardware—all while keeping every biometric template encrypted under BFV fully homomorphic encryption with post-quantum Dilithium signatures.
"The goal was never just 'fast enough.' It was authentication so fast that adding security adds zero perceived latency."
The Starting Point
Our first prototype used off-the-shelf cryptographic libraries. The numbers were... humbling:
2.3 seconds for a ZK proof. Acceptable for blockchain transactions. Completely unusable for authentication. Users would abandon the login before it completed.
Phase 1: Circuit Optimization
The first breakthrough came from rethinking our ZK circuits. Generic ZK circuits are designed for arbitrary computation. We needed circuits optimized specifically for authentication.
- Removed unnecessary constraints: Our circuits don't need to prove arbitrary computation, just identity claims
- Optimized hash functions: Switched to ZK-friendly hashes (Poseidon) for in-circuit operations
- Reduced public inputs: Minimized data that must be revealed
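The arithmetic behind the hash switch can be sketched with ballpark figures. The constraint counts and hash count below are illustrative assumptions, not measurements from our circuits: SHA-256 costs on the order of tens of thousands of R1CS constraints because its bitwise operations must be emulated in a prime field, while Poseidon's S-boxes are native field operations.

```rust
// Illustrative only: per-hash constraint counts are ballpark assumptions.
fn total_constraints(per_hash: u64, hashes: u64) -> u64 {
    per_hash * hashes
}

fn main() {
    let sha256 = 27_000u64;  // assumed ~constraints for one SHA-256 compression in R1CS
    let poseidon = 300u64;   // assumed ~constraints for one Poseidon permutation
    let hashes = 40u64;      // hypothetical number of in-circuit hashes per proof
    let before = total_constraints(sha256, hashes);
    let after = total_constraints(poseidon, hashes);
    println!("{} -> {} constraints ({}x smaller)", before, after, before / after);
}
```

Prover time scales roughly with constraint count, which is why this single substitution dominated the Phase 1 gains.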
These changes cut proof generation to 180ms: usable, but not invisible. Users would still perceive a slight delay.
Phase 2: The Rust Rewrite
Our JavaScript implementation hit a wall. Garbage collection pauses alone could exceed our latency budget. We rewrote the cryptographic core in Rust:
- Zero-copy operations: No unnecessary buffer allocations
- SIMD vectorization: AVX-512 for parallel field operations
- Memory-mapped I/O: Direct hardware access for key operations
- No GC pauses: Deterministic memory management
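A minimal sketch of the hot-path style the rewrite enforced: in-place modular addition over a coefficient slice, with no allocation and a branchless conditional subtract that the compiler can auto-vectorize. The modulus here is a common NTT-friendly prime used as a stand-in, not a production parameter.

```rust
// In-place, allocation-free modular add; branchless form is the shape
// AVX-512 intrinsics (or autovectorization) would take.
fn add_mod_inplace(a: &mut [u64], b: &[u64], q: u64) {
    for (x, &y) in a.iter_mut().zip(b) {
        let s = *x + y;                    // no overflow: both operands < q < 2^63
        *x = s - (q * ((s >= q) as u64));  // branchless conditional subtract
    }
}

fn main() {
    let q = 998_244_353u64; // NTT-friendly prime, stand-in for our real moduli
    let mut a = vec![q - 1, 5, 123];
    let b = vec![1, q - 2, 7];
    add_mod_inplace(&mut a, &b, q);
    assert_eq!(a, vec![0, 3, 130]);
    println!("{:?}", a);
}
```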
4.2ms was getting close. Sub-10ms is generally imperceptible. But we knew we could do better.
Phase 3: Parallelism and Batching
Modern CPUs have many cores. Our single-threaded prover was leaving performance on the table:
- Parallel witness generation: Independent witness components computed simultaneously
- Batched NTT operations: Number Theoretic Transforms grouped for cache efficiency
- Work stealing: Idle cores pick up work from busy cores
The batching insight became central to our production architecture. Our BFV FHE scheme uses polynomial ring dimension N=4096 with a plaintext modulus of t=65537, which satisfies the CRT batching condition t ≡ 1 (mod 2N). This gives us 4,096 SIMD slots per ciphertext. Since each biometric template occupies 128 dimensions, we pack 32 independent user templates into a single ciphertext—reducing per-user storage from roughly 32MB to 256KB and allowing one FHE inner product to verify an entire batch simultaneously.
SIMD batching is not just a throughput trick—it fundamentally changes the cost model. A single BFV ciphertext multiply costs the same whether it encodes 1 user or 32. By filling every slot, we amortize the expensive NTT and key-switching operations across the full batch.
Sub-millisecond! But our batch processing revealed even more opportunity.
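The slot layout described above can be sketched directly: 4,096 slots, 128 dimensions per template, so user `u`'s template occupies slots `u*128 .. (u+1)*128`. This is a plaintext-side sketch of the packing only (the encoding/encryption steps are omitted), and the small coefficient values stand in for entries reduced mod t.

```rust
// N = 4096 BFV slots, 128-dim templates => 32 users per ciphertext.
const N_SLOTS: usize = 4096;
const DIM: usize = 128;
const USERS_PER_CT: usize = N_SLOTS / DIM; // 32

// Pack up to 32 templates into one slot vector: slot index = user * DIM + dim.
fn pack_templates(templates: &[[u16; DIM]]) -> Vec<u16> {
    assert!(templates.len() <= USERS_PER_CT);
    let mut slots = vec![0u16; N_SLOTS];
    for (u, t) in templates.iter().enumerate() {
        slots[u * DIM..(u + 1) * DIM].copy_from_slice(t);
    }
    slots
}

fn main() {
    assert_eq!(USERS_PER_CT, 32);
    let slots = pack_templates(&[[7u16; DIM], [9u16; DIM]]);
    assert_eq!(slots[0], 7);       // user 0, dim 0
    assert_eq!(slots[DIM], 9);     // user 1, dim 0
    assert_eq!(slots[2 * DIM], 0); // unused slot block
    println!("{} users per ciphertext", USERS_PER_CT);
}
```

One slot-wise multiply against a packed query then computes 32 users' inner-product terms in a single ciphertext operation.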
Phase 4: Intelligent Caching
We noticed that verification was often redundant. The same proofs were being verified multiple times:
- Proof fingerprinting: Unique identifier for each proof via SHA3-256 digest
- Cache-aware verification: Skip full verification for known-valid proofs
- Session context caching: Reuse verification work across requests
The cache layer went through its own performance journey. Our initial approach used a TCP-based RESP proxy (Cachee) for distributed cache lookups. At low worker counts this worked fine, but at 96 concurrent workers on our Graviton4 production instance, a single Docker container serializing all TCP connections became a catastrophic bottleneck—throughput dropped from 1.51M to 136K auth/sec, an 11x regression. The solution was an in-process DashMap that delivers 0.085µs lookups with zero network overhead, giving us a net 5.5% throughput gain over raw uncached verification.
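The in-process cache pattern looks roughly like this. Production uses DashMap keyed by a SHA3-256 proof digest; the sketch below substitutes std's `RwLock<HashMap>` and `DefaultHasher` so it compiles with no external crates. Both substitutions are simplifications, not our production code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

// Stand-in for the SHA3-256 proof fingerprint.
fn fingerprint(proof_bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    proof_bytes.hash(&mut h);
    h.finish()
}

struct ProofCache {
    valid: RwLock<HashMap<u64, bool>>, // stand-in for DashMap
}

impl ProofCache {
    fn verify(&self, proof: &[u8], cold_verify: impl Fn(&[u8]) -> bool) -> bool {
        let fp = fingerprint(proof);
        if let Some(&ok) = self.valid.read().unwrap().get(&fp) {
            return ok; // cache hit: skip the expensive cold path entirely
        }
        let ok = cold_verify(proof); // cache miss: full verification
        self.valid.write().unwrap().insert(fp, ok);
        ok
    }
}

fn main() {
    use std::cell::Cell;
    let cache = ProofCache { valid: RwLock::new(HashMap::new()) };
    let cold_calls = Cell::new(0u32);
    // hypothetical cold verifier: accepts any non-empty proof
    let cold = |p: &[u8]| { cold_calls.set(cold_calls.get() + 1); !p.is_empty() };
    assert!(cache.verify(b"proof-A", &cold));
    assert!(cache.verify(b"proof-A", &cold)); // second call is a cache hit
    assert_eq!(cold_calls.get(), 1);          // cold path ran exactly once
    println!("cold verifications: {}", cold_calls.get());
}
```

Because the map lives in the worker's own address space, a lookup is a hash plus a shard lock, with no serialization or network hop to collapse under 96-way concurrency.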
The Cache Breakthrough
Cold verification: 2.14ms → Cached verification: 32µs
67x speedup for returning users
Phase 5: NTT Domain Engineering
The final wave of optimizations targeted the Number Theoretic Transform itself—the polynomial multiplication engine at the heart of BFV. Every FHE encrypt, decrypt, and key-switch operation depends on forward and inverse NTTs, so shaving microseconds here multiplied across every pipeline stage.
```rust
// Montgomery-form NTT butterfly (no division in hot path)
let t = montgomery_reduce(a[j + half] as u128 * twiddle as u128, q, q_inv);
a[j + half] = a[j].wrapping_add(q).wrapping_sub(t); // Harvey lazy: result in [0, 2q)
a[j] = a[j].wrapping_add(t);                        // defer final reduction
```

We converted all NTT twiddle factors to Montgomery form at keygen time, replacing per-butterfly modular division with a single Montgomery reduction (REDC). Combined with Harvey lazy reduction—keeping intermediate butterfly values in [0, 2q) and deferring the final mod until the last stage—this eliminated the most expensive operation in the inner loop. We also fused the INTT post-processing step with inverse Montgomery conversion, reducing three REDC operations to two per coefficient via a precomputed fused_inv_mont constant.
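For readers unfamiliar with REDC, here is the reduction itself sketched in full, with R = 2^64. `q_inv` is -q^{-1} mod 2^64, computed once up front (at keygen in our pipeline) by Newton's iteration; the modulus is a small NTT-friendly prime for illustration, not a production BFV parameter.

```rust
// Compute -q^{-1} mod 2^64 for odd q; each Newton step doubles the
// number of correct low bits, so 6 steps reach 64 bits.
fn neg_inv_mod_r(q: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// REDC: returns x * 2^-64 mod q, for x < q * 2^64. No division anywhere.
fn montgomery_reduce(x: u128, q: u64, q_inv: u64) -> u64 {
    let m = (x as u64).wrapping_mul(q_inv);                 // m = x * (-q^-1) mod 2^64
    let t = ((x + (m as u128) * (q as u128)) >> 64) as u64; // exact shift: x + m*q ≡ 0 mod 2^64
    if t >= q { t - q } else { t }                          // single conditional subtract
}

fn main() {
    let q: u64 = 998_244_353; // NTT-friendly prime, stand-in for real moduli
    let q_inv = neg_inv_mod_r(q);
    let (a, b) = (123_456_789u64, 987_654_321u64);
    let r = montgomery_reduce(a as u128 * b as u128, q, q_inv);
    // r == a*b*2^-64 mod q, so shifting back by 64 recovers a*b mod q
    assert_eq!(((r as u128) << 64) % q as u128, (a as u128 * b as u128) % q as u128);
    println!("REDC round-trip ok");
}
```

With twiddles pre-converted to Montgomery form, the product inside the butterfly is already scaled so that one REDC lands the result back in the ordinary domain.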
The biggest single win came from keeping multiply-plain results in NTT form. Our multiply_plain_ntt() returns ciphertexts with is_ntt_form: true, skipping two inverse NTTs per modulus per call. Since the batch verification pipeline chains multiple multiply-accumulate operations before a single final INTT, this saved hundreds of microseconds per batch—dropping batch latency from 1,375µs to 1,109µs on production hardware.
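The reason staying in NTT form pays: in the NTT domain, polynomial multiplication is a pointwise product, so a chain of multiply-accumulate steps can run entirely on transformed coefficients with one inverse NTT at the very end. The helper and modulus below are illustrative stand-ins, not our BFV internals.

```rust
// Pointwise multiply-accumulate on NTT-domain coefficients: acc += ct * pt (mod q).
// No inverse NTT between chained calls; a single INTT happens after the last one.
fn pointwise_mac(acc: &mut [u64], ct: &[u64], pt: &[u64], q: u64) {
    for i in 0..acc.len() {
        acc[i] = (acc[i] + (ct[i] as u128 * pt[i] as u128 % q as u128) as u64) % q;
    }
}

fn main() {
    let q = 998_244_353u64;
    let mut acc = vec![0u64; 4];
    // two chained multiply-plain steps, both in the NTT domain
    pointwise_mac(&mut acc, &[1, 2, 3, 4], &[5, 6, 7, 8], q);
    pointwise_mac(&mut acc, &[1, 1, 1, 1], &[9, 9, 9, 9], q);
    assert_eq!(acc, vec![14, 21, 30, 41]); // single final INTT would follow here
    println!("{:?}", acc);
}
```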
Phase 6: Session Resume
If a user's context hasn't changed, why re-authenticate everything? Session resume skips the full FHE and ZKP path and verifies only that an existing session is still valid, which is how it completes in 50µs rather than 1.28ms.
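A minimal sketch of the resume check, under assumptions (this is the shape of the idea, not our wire format or token scheme): the server only confirms the presented token matches the stored one and the session is unexpired.

```rust
use std::time::{Duration, SystemTime};

struct Session {
    token_mac: u64,          // stand-in for a real MAC/signature over the token
    expires_at: SystemTime,
}

// Resume succeeds iff the token matches and the session hasn't expired.
fn resume_ok(s: &Session, presented_mac: u64, now: SystemTime) -> bool {
    presented_mac == s.token_mac && now < s.expires_at
}

fn main() {
    let now = SystemTime::now();
    let s = Session { token_mac: 0xDEAD_BEEF, expires_at: now + Duration::from_secs(300) };
    assert!(resume_ok(&s, 0xDEAD_BEEF, now));                             // valid resume
    assert!(!resume_ok(&s, 0xBAD_C0DE, now));                             // wrong token
    assert!(!resume_ok(&s, 0xDEAD_BEEF, now + Duration::from_secs(301))); // expired
    println!("session resume checks pass");
}
```

In production the comparison would be constant-time and the token cryptographically bound to the original authentication, but the cost profile is the same: two cheap checks instead of an FHE pipeline.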
The Production Numbers
Our February 2026 benchmarks on a c8g.metal-48xl (AWS Graviton4, 192 vCPUs) represent the culmination of this journey. Each authentication passes through three pipeline stages—FHE batch verification, ZKP cache lookup, and Dilithium attestation—all post-quantum secure:
| Pipeline Stage | Latency | PQ-Secure |
|---|---|---|
| BFV inner product (32 users) | ~1,109 µs | Yes (lattice) |
| ZKP DashMap lookup | 0.085 µs | Yes (SHA3-256) |
| Dilithium sign + verify | ~244 µs | Yes (ML-DSA) |
| Total (32-user batch) | ~1,356 µs | — |
| Per authentication | ~42 µs | — |
- Sustained throughput: 2,172,518 auth/sec across 96 parallel workers
- Full Auth (Turbo): 1.28ms (10,000x faster than our first prototype)
- Session Resume: 50µs
- Cached Proof Verify: 32µs (67x faster than cold)
Batch attestation was a critical multiplier. Instead of signing and verifying a Dilithium signature for each of 32 users individually, we sign the entire batch digest once. That single operation costs ~244µs total instead of ~7,808µs (32 × 244µs), a 32x reduction in attestation overhead.
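The batch-digest trick sketched below: hash all 32 per-user results into one digest, then sign only that. `DefaultHasher` stands in for the real digest function and the actual Dilithium call is omitted; the point is the cost arithmetic.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Fold all per-user results into one digest; a single Dilithium signature
// over this value attests the whole batch. (DefaultHasher is a stand-in.)
fn batch_digest(results: &[[u8; 32]]) -> u64 {
    let mut h = DefaultHasher::new();
    for r in results {
        r.hash(&mut h);
    }
    h.finish()
}

fn main() {
    let per_sig_us = 244u64; // measured sign+verify cost from the table above
    let users = 32u64;
    let results = vec![[0u8; 32]; users as usize];
    let _digest = batch_digest(&results); // one digest -> one signature
    let naive = users * per_sig_us;
    println!("naive: {}us, batched: {}us ({}x)", naive, per_sig_us, naive / per_sig_us);
}
```

Any verifier who holds the 32 results can recompute the digest and check the single signature, so per-user attestation cost collapses to a hash.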
What We Learned
1. Specialize ruthlessly. Generic solutions are generic-speed. Every optimization came from understanding exactly what we needed to compute.
2. Measure everything. Intuition about performance is usually wrong. Profile, don't guess. We tried arena pooling, fused twiddle tables, and jemalloc—all of which benchmarked slower in production despite looking promising in isolation.
3. Question assumptions. "ZK proofs are slow" was an assumption. We challenged it.
4. Cache is king. The fastest computation is the one you don't do. Our 67x cache speedup proves it—but the cache architecture itself matters. TCP proxies that work at 8 workers can collapse at 96.
5. Hardware matters. SIMD, cache locality, memory bandwidth—understanding hardware unlocked our biggest gains. On Graviton4, the system allocator outperformed jemalloc by 8% because glibc's malloc is heavily optimized for ARM's flat memory model.
What's Next
We're not done. Our roadmap includes:
- GPU acceleration for batch proof generation
- Custom ASIC design for FHE operations
- Recursive proof composition for unlimited scale
- Sub-10µs session validation
The journey from milliseconds to microseconds taught us that performance limits are often just engineering challenges waiting for the right abstraction. When your inner loop runs 1.595 million times per second, every instruction counts—and every assumption is worth re-examining.
Experience the Performance
1.28ms full auth. 50µs session resume. 2.17M auth/sec sustained. Try it yourself.
Get Free API Key