The Performance Excuse
For the past three years, every enterprise security team we've talked to has given us the same answer when we asked why they hadn't migrated to post-quantum cryptography yet: performance.
Not security concerns. Not compliance gaps. Not vendor availability. Performance. The widely cited figures for CRYSTALS-Kyber and Dilithium were discouraging: key sizes an order of magnitude larger than their RSA equivalents, signing times measured in hundreds of microseconds per operation, and FHE — fully homomorphic encryption — carrying latencies in the milliseconds for even simple operations.
At hyperscaler throughput, those numbers meant post-quantum cryptography was a roadmap item, not a deployment decision.
"We've been planning our PQC migration for eighteen months. We just can't get the performance to close." — VP of Security Engineering, Tier-1 bank, 2025
We built H33 to remove that excuse. As of March 10, 2026, our production pipeline delivers the complete post-quantum stack — BFV-64 fully homomorphic encryption, zero-knowledge proof verification, and Dilithium-3 (NIST FIPS 204) attestation — at 38.5 microseconds per authentication and 2,172,518 operations per second sustained. No classical crypto in the hot path. No hybrid fallback. No tricks.
Here's how we got there — and why we think the performance penalty narrative around post-quantum cryptography is, at this point, an engineering problem that's been solved.
The Pipeline
Every H33 authentication request runs three sequential stages. Understanding each one is necessary to understand why they're fast individually — and why they compose the way they do.
Stage 1: FHE Batch Verification
Fully homomorphic encryption lets us verify biometric data without ever decrypting it. The user's biometric vector arrives encrypted under their public key. We perform the similarity computation — cosine distance between the enrolled and presented templates — inside the ciphertext. The plaintext never exists on our infrastructure.
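In plaintext terms, that similarity check is an ordinary cosine distance between the enrolled and presented template vectors. A minimal sketch of the computation follows; note that in production this arithmetic is evaluated inside BFV ciphertexts, so the plaintext vectors below never exist server-side, and the two-dimensional vectors are purely illustrative:

```python
import math

def cosine_distance(enrolled, presented):
    """Cosine distance between two biometric template vectors.
    H33 evaluates the same dot-product/norm arithmetic homomorphically,
    so these plaintext values never exist on the server."""
    dot = sum(a * b for a, b in zip(enrolled, presented))
    norm = (math.sqrt(sum(a * a for a in enrolled))
            * math.sqrt(sum(b * b for b in presented)))
    return 1.0 - dot / norm

# Identical templates -> distance 0; orthogonal templates -> distance 1.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

A match decision then reduces to comparing the (still-encrypted) distance against an enrollment threshold.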
In the naïve implementation, this costs roughly 5,000 microseconds per operation using standard SEAL or OpenFHE implementations on comparable hardware. That number is where the "FHE is too slow" narrative comes from. It's accurate for naïve implementations. We don't use a naïve implementation.
The key insight is that FHE's computational cost is front-loaded and amortizable. The expensive operation, the number-theoretic transform (NTT) underlying BFV arithmetic, is paid per ciphertext rather than per plaintext, so packing multiple users' templates into the slots of a single ciphertext spreads that fixed cost across all of them. We evaluate 32 users per FHE invocation; the 939µs cost is paid once and divided across all 32.
| Configuration | FHE Cost | Per-User Cost |
|---|---|---|
| Single-user, naïve | ~4,800µs | ~4,800µs |
| Single-user, optimized NTT | ~1,375µs | ~1,375µs |
| H33 batch, 32 users — Feb baseline | ~1,375µs | ~43µs |
| H33 batch, 32 users — March | ~939µs | ~30µs |
The March improvement, from 1,375µs to 939µs per batch, came from two changes. First, a cache-aligned memory layout for the NTT butterfly network that reduced the L3 cache miss rate by approximately 34% under concurrent load. Second, explicit SIMD vectorization of the modular-reduction step, with a software fallback path for targets lacking the native instructions.
FHE doesn't get faster per-operation — it gets cheaper per user at scale. The fixed overhead of an FHE evaluation (key switching, NTT transforms, noise management) is paid once per batch, not per user. At batch size 32, a 939µs evaluation costs ~30µs per user.
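The amortization arithmetic behind the table above is worth making explicit, since it is the entire argument. A one-line check (figures taken from the table):

```python
def per_user_cost(batch_cost_us: float, batch_size: int) -> float:
    """Fixed per-batch FHE cost divided across the users packed into it."""
    return batch_cost_us / batch_size

print(per_user_cost(1375, 32))  # ~43µs per user, February baseline
print(per_user_cost(939, 32))   # ~29.3µs per user, reported above as ~30µs
```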
Stage 2: ZKP Cache Lookup
This is where our February-to-March improvement was most dramatic, and the engineering decision was embarrassingly simple once we saw the numbers.
In February, our ZKP proof cache ran as a separate process, accessible via a local TCP socket. Round-trip to the cache: approximately 2.7µs. That's genuinely fast for a network call. It's also completely unnecessary overhead when the calling process is running on the same physical machine.
We replaced the TCP cache proxy with an in-process shared-memory store using a lock-free ring buffer indexed by proof hash (SHA3-256). Reads are zero-copy — the calling thread gets a direct pointer into the shared region. No serialization, no syscall, no lock acquisition.
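The lookup structure can be sketched as a direct-mapped table keyed by the SHA3-256 proof hash. This is a toy illustration only: Python cannot express the production store's lock-free shared-memory ring buffer or its zero-copy pointer reads, and the capacity and API names here are assumptions:

```python
import hashlib

class ProofCache:
    """Toy direct-mapped, in-process proof cache keyed by SHA3-256.
    Illustrative only: the production store is a lock-free shared-memory
    ring buffer whose reads return direct pointers into the shared region."""

    def __init__(self, capacity: int = 1 << 16):
        self.capacity = capacity
        self.slots = [None] * capacity  # each slot: (proof_hash, proof_bytes)

    def _index(self, proof_hash: bytes) -> int:
        # Map the first 8 bytes of the hash to a slot index.
        return int.from_bytes(proof_hash[:8], "little") % self.capacity

    def put(self, proof: bytes) -> bytes:
        h = hashlib.sha3_256(proof).digest()
        self.slots[self._index(h)] = (h, proof)
        return h

    def get(self, proof_hash: bytes):
        slot = self.slots[self._index(proof_hash)]
        if slot is not None and slot[0] == proof_hash:
            return slot[1]
        return None  # miss: fall through to full ZKP verification

cache = ProofCache()
h = cache.put(b"example-proof")
assert cache.get(h) == b"example-proof"
```

The hash check on read makes a slot collision behave as a cache miss rather than a wrong answer, which is the property that lets the hot path skip locking entirely.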
| Architecture | Lookup Latency | Lock Contention at 96 Workers |
|---|---|---|
| External TCP cache | ~2.7µs | High (socket backlog) |
| In-process mutex map | ~0.4µs | Moderate |
| In-process zero-copy (March) | 0.059µs | None measured |
The roughly 46× improvement over the TCP baseline (2.7µs against 0.059µs) is real, but it somewhat understates the practical impact. The TCP cache also introduced non-deterministic latency under high concurrency as socket backlog built up. The in-process cache has no such behavior — lookup time is stable to within nanoseconds at full worker load.
Stage 3: Dilithium-3 Batch Attestation
Dilithium-3 (standardized as ML-DSA-65 under NIST FIPS 204) is the post-quantum digital signature algorithm we use for authentication attestation. Each authentication session receives a signed token that downstream services can verify without calling back to H33.
The naïve approach signs each authentication result individually: one Dilithium sign+verify per user. At ~131µs per operation (signing 92µs, verification 39µs), that's the dominant latency term in an unoptimized pipeline.
We instead sign the entire FHE batch evaluation output — a Merkle root over the 32 authentication results — with a single Dilithium-3 signature. The 291µs sign+verify cost is amortized across all 32 users in the batch, adding roughly 9µs per user. Each individual result token includes a Merkle inclusion proof that allows independent verification against the batch root.
The batch signing approach is fully compliant with NIST FIPS 204 (ML-DSA). The signature covers the root of a binary hash tree over the individual outputs — verification of any individual result requires only the root signature, the inclusion proof, and the verifying party's copy of the Dilithium public key. No modification to the underlying algorithm is required.
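The Merkle construction above can be sketched as follows. SHA3-256 is used here to match the proof-cache hashing; the production tree's hash function isn't stated, so treat that choice, and the helper names, as assumptions. The Dilithium-3 signing step itself is omitted — the point is that only `root` needs to be signed:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()

def merkle_root_and_proofs(leaves):
    """Build a binary hash tree over the leaves; return the root plus,
    for each leaf, an inclusion proof as (sibling_hash, sibling_is_right)."""
    nodes = [h(leaf) for leaf in leaves]
    proofs = [[] for _ in leaves]
    positions = list(range(len(leaves)))  # each leaf's ancestor index per level
    while len(nodes) > 1:
        if len(nodes) % 2:
            nodes.append(nodes[-1])       # duplicate last node on odd levels
        for i, p in enumerate(positions):
            sibling = p ^ 1
            proofs[i].append((nodes[sibling], sibling % 2 == 1))
            positions[i] = p // 2
        nodes = [h(nodes[j] + nodes[j + 1]) for j in range(0, len(nodes), 2)]
    return nodes[0], proofs

def verify_inclusion(leaf, proof, root):
    """Recompute the path from a leaf to the root using its sibling hashes."""
    acc = h(leaf)
    for sibling, sibling_is_right in proof:
        acc = h(acc + sibling) if sibling_is_right else h(sibling + acc)
    return acc == root

# 32 authentication results -> one root; only the root gets a Dilithium-3
# signature, and each token carries a 5-hash inclusion proof.
results = [f"user-{i}:ok".encode() for i in range(32)]
root, proofs = merkle_root_and_proofs(results)
assert all(verify_inclusion(results[i], proofs[i], root) for i in range(32))
```

With 32 leaves each inclusion proof is five hashes, so a verifier needs only the root signature, 160 bytes of path, and the Dilithium public key.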
Putting It Together: 38.5µs
| Stage | Batch Cost | Per-User Cost | Algorithm |
|---|---|---|---|
| FHE batch verify | ~939µs | ~30µs | BFV-64 · H33 optimized |
| ZKP cache lookup | ~1.9µs (32 × 0.059µs) | ~0.059µs | SHA3-256 · zero-copy |
| Dilithium attestation | ~291µs | ~9µs | ML-DSA · FIPS 204 |
| Total | ~1,232µs | ~38.5µs | All PQ-safe |
At 96 workers on a 192-vCPU Graviton4 instance, the per-worker throughput is approximately 22,600 authentications per second. Across 96 workers: 2,172,518 sustained.
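The headline figures can be reproduced directly from the stage numbers (the cache lookup is counted once per user, per the Stage 2 latency):

```python
BATCH = 32
fhe_us, zkp_us, sig_us = 939.0, 0.059, 291.0  # stage costs from the table

batch_total = fhe_us + sig_us + BATCH * zkp_us  # one lookup per user
per_user = batch_total / BATCH
print(round(batch_total), round(per_user, 1))   # ~1232µs batch, ~38.5µs/user

workers, sustained = 96, 2_172_518
print(round(sustained / workers))               # ≈22,630 auths/s per worker
```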
What This Means for Enterprise PQC Migration
The benchmark figures above are not theoretical. They reflect a production workload on commodity AWS infrastructure — no custom silicon, no hardware security modules in the latency path, no hybrid classical/post-quantum fallback.
National banking infrastructure handles 500k–2M authentication operations per second at peak. Government identity systems operate in the 1M+/sec range. H33's 2.17M sustained figure puts a complete post-quantum authentication stack within the performance envelope that enterprise deployments require.
The performance barrier that was delaying your quantum migration no longer exists. The question is no longer when PQC will be fast enough. It's why you're still running RSA.
The "harvest now, decrypt later" threat — where adversaries are archiving encrypted traffic today to decrypt it once quantum computers are available — is active right now. Traffic encrypted with RSA-2048 or ECC P-256 today is potentially readable by a sufficiently capable quantum computer within the decade. If your data has a secrecy requirement longer than that horizon, the migration decision is already past due.
What's Next
We're actively working on extending the batch size beyond 32 users. Theoretical analysis suggests batch sizes of 64–128 are achievable with the current BFV parameter sets, which would push per-user FHE cost below 20µs. We're also evaluating CKKS for approximate-arithmetic workloads — ML inference over encrypted embeddings — which has different optimization characteristics.
The full benchmark methodology, hardware specifications, and historical run data are documented at h33.ai/benchmarks. Enterprise customers can request full run logs and raw timing data under NDA.
If you're building infrastructure that needs to survive the next decade of cryptographic transitions, the API is open for access. Post-quantum is no longer a future problem.
Hardware & Methodology
| Parameter | Value |
|---|---|
| Instance | c8g.metal-48xl (AWS Graviton4) |
| CPU | 192 vCPUs · Neoverse V2 |
| Memory | 377 GiB |
| Workers | 96 |
| FHE Parameters | N=4096, Q=56-bit, t=65537 (BFV-64) |
| Batch Size | 32 users / ciphertext |
| NTT | Montgomery radix-4, Harvey lazy reduction |
| Allocator | glibc system (not jemalloc) |
| Benchmark Framework | Criterion.rs v0.5 |
| Sustained Duration | 120 seconds |