The Performance Excuse
For the past three years, every enterprise security team we've talked to has given us the same answer when we asked why they hadn't migrated to post-quantum cryptography yet: performance.
Not security concerns. Not compliance gaps. Not vendor availability. Performance. The widely cited figures for CRYSTALS-Kyber and Dilithium were discouraging: key sizes an order of magnitude larger than their RSA equivalents, signing times measured in hundreds of microseconds per operation, and FHE — fully homomorphic encryption — carrying latencies in the milliseconds for even simple operations.
At hyperscaler throughput, those numbers meant post-quantum cryptography was a roadmap item, not a deployment decision.
"We've been planning our PQC migration for eighteen months. We just can't get the performance to close." — VP of Security Engineering, Tier-1 bank, 2025
We built H33 to remove that excuse. As of March 10, 2026, our production pipeline delivers the complete post-quantum stack — BFV-64 fully homomorphic encryption, zero-knowledge proof verification, and Dilithium-3 (NIST FIPS 204) attestation — at 38.5 microseconds per authentication and 2,172,518 operations per second sustained. No classical crypto in the hot path. No hybrid fallback. No tricks.
Here's how we got there — and why we think the performance penalty narrative around post-quantum cryptography is, at this point, an engineering problem that's been solved.
The Pipeline
Every H33 authentication request runs three sequential stages. Understanding each one is necessary to understand why they're fast individually — and why they compose the way they do.
Stage 1: FHE Batch Verification
Fully homomorphic encryption lets us verify biometric data without ever decrypting it. The user's biometric vector arrives encrypted under their public key. We perform the similarity computation — cosine distance between the enrolled and presented templates — inside the ciphertext. The plaintext never exists on our infrastructure.
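In plaintext terms, that similarity check is an ordinary cosine distance between the enrolled and presented template vectors. A minimal sketch of the computation follows; note that in production this arithmetic is evaluated inside BFV ciphertexts, so the plaintext vectors below never exist server-side, and the two-dimensional vectors are purely illustrative:

```python
import math

def cosine_distance(enrolled, presented):
    """Cosine distance between two biometric template vectors.
    H33 evaluates the same dot-product/norm arithmetic homomorphically,
    so these plaintext values never exist on the server."""
    dot = sum(a * b for a, b in zip(enrolled, presented))
    norm = (math.sqrt(sum(a * a for a in enrolled))
            * math.sqrt(sum(b * b for b in presented)))
    return 1.0 - dot / norm

# Identical templates -> distance 0; orthogonal templates -> distance 1.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

A match decision then reduces to comparing the (still-encrypted) distance against an enrollment threshold.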
In the naïve implementation, this costs roughly 5,000 microseconds per operation using standard SEAL or OpenFHE implementations on comparable hardware. That number is where the "FHE is too slow" narrative comes from. It's accurate for naïve implementations. We don't use a naïve implementation.
The key insight is that FHE's computational cost is front-loaded and amortizable. The expensive operation, the number-theoretic transform (NTT) underlying BFV arithmetic, is paid per ciphertext rather than per plaintext, so packing multiple users' templates into the slots of a single ciphertext spreads that fixed cost across all of them. We evaluate 32 users per FHE invocation; the 939µs cost is paid once and divided across all 32.
| Configuration | FHE Cost | Per-User Cost |
|---|---|---|
| Single-user, naïve | ~4,800µs | ~4,800µs |
| Single-user, optimized NTT | ~1,375µs | ~1,375µs |
| H33 batch, 32 users — Feb baseline | ~1,375µs | ~43µs |
| H33 batch, 32 users — March | ~939µs | ~30µs |
The March improvement, from 1,375µs to 939µs per batch, came from two changes. First, a cache-aligned memory layout for the NTT butterfly network that reduced the L3 cache miss rate by approximately 34% under concurrent load. Second, explicit SIMD vectorization of the modular-reduction step, with a software fallback path for targets lacking the native instructions.
FHE doesn't get faster per-operation — it gets cheaper per user at scale. The fixed overhead of an FHE evaluation (key switching, NTT transforms, noise management) is paid once per batch, not per user. At batch size 32, a 939µs evaluation costs ~30µs per user.
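The amortization arithmetic behind the table above is worth making explicit, since it is the entire argument. A one-line check (figures taken from the table):

```python
def per_user_cost(batch_cost_us: float, batch_size: int) -> float:
    """Fixed per-batch FHE cost divided across the users packed into it."""
    return batch_cost_us / batch_size

print(per_user_cost(1375, 32))  # ~43µs per user, February baseline
print(per_user_cost(939, 32))   # ~29.3µs per user, reported above as ~30µs
```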
Stage 2: ZKP Cache Lookup
This is where our February-to-March improvement was most dramatic, and the engineering decision was embarrassingly simple once we saw the numbers.
In February, our ZKP proof cache ran as a separate process, accessible via a local TCP socket. Round-trip to the cache: approximately 2.7µs. That's genuinely fast for a network call. It's also completely unnecessary overhead when the calling process is running on the same physical machine.
We replaced the TCP cache proxy with an in-process shared-memory store using a lock-free ring buffer indexed by proof hash (SHA3-256). Reads are zero-copy — the calling thread gets a direct pointer into the shared region. No serialization, no syscall, no lock acquisition.
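The lookup structure can be sketched as a direct-mapped table keyed by the SHA3-256 proof hash. This is a toy illustration only: Python cannot express the production store's lock-free shared-memory ring buffer or its zero-copy pointer reads, and the capacity and API names here are assumptions:

```python
import hashlib

class ProofCache:
    """Toy direct-mapped, in-process proof cache keyed by SHA3-256.
    Illustrative only: the production store is a lock-free shared-memory
    ring buffer whose reads return direct pointers into the shared region."""

    def __init__(self, capacity: int = 1 << 16):
        self.capacity = capacity
        self.slots = [None] * capacity  # each slot: (proof_hash, proof_bytes)

    def _index(self, proof_hash: bytes) -> int:
        # Map the first 8 bytes of the hash to a slot index.
        return int.from_bytes(proof_hash[:8], "little") % self.capacity

    def put(self, proof: bytes) -> bytes:
        h = hashlib.sha3_256(proof).digest()
        self.slots[self._index(h)] = (h, proof)
        return h

    def get(self, proof_hash: bytes):
        slot = self.slots[self._index(proof_hash)]
        if slot is not None and slot[0] == proof_hash:
            return slot[1]
        return None  # miss: fall through to full ZKP verification

cache = ProofCache()
h = cache.put(b"example-proof")
assert cache.get(h) == b"example-proof"
```

The hash check on read makes a slot collision behave as a cache miss rather than a wrong answer, which is the property that lets the hot path skip locking entirely.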
| Architecture | Lookup Latency | Lock Contention at 96 Workers |
|---|---|---|
| External TCP cache | ~2.7µs | High (socket backlog) |
| In-process mutex map | ~0.4µs | Moderate |
| In-process zero-copy (March) | 0.059µs | None measured |
The roughly 46× improvement over the TCP baseline (2.7µs against 0.059µs) is real, but it somewhat understates the practical impact. The TCP cache also introduced non-deterministic latency under high concurrency as socket backlog built up. The in-process cache has no such behavior — lookup time is stable to within nanoseconds at full worker load.
Stage 3: Dilithium-3 Batch Attestation
Dilithium-3 (standardized as ML-DSA-65 under NIST FIPS 204) is the post-quantum digital signature algorithm we use for authentication attestation. Each authentication session receives a signed token that downstream services can verify without calling back to H33.
The naïve approach signs each authentication result individually: one Dilithium sign+verify per user. At ~131µs per operation (signing 92µs, verification 39µs), that's the dominant latency term in an unoptimized pipeline.
We instead sign the entire FHE batch evaluation output — a Merkle root over the 32 authentication results — with a single Dilithium-3 signature. The 291µs sign+verify cost is amortized across all 32 users in the batch, adding roughly 9µs per user. Each individual result token includes a Merkle inclusion proof that allows independent verification against the batch root.
The batch signing approach is fully compliant with NIST FIPS 204 (ML-DSA). The signature covers the root of a binary hash tree over the individual outputs — verification of any individual result requires only the root signature, the inclusion proof, and the verifying party's copy of the Dilithium public key. No modification to the underlying algorithm is required.
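The Merkle construction above can be sketched as follows. SHA3-256 is used here to match the proof-cache hashing; the production tree's hash function isn't stated, so treat that choice, and the helper names, as assumptions. The Dilithium-3 signing step itself is omitted — the point is that only `root` needs to be signed:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()

def merkle_root_and_proofs(leaves):
    """Build a binary hash tree over the leaves; return the root plus,
    for each leaf, an inclusion proof as (sibling_hash, sibling_is_right)."""
    nodes = [h(leaf) for leaf in leaves]
    proofs = [[] for _ in leaves]
    positions = list(range(len(leaves)))  # each leaf's ancestor index per level
    while len(nodes) > 1:
        if len(nodes) % 2:
            nodes.append(nodes[-1])       # duplicate last node on odd levels
        for i, p in enumerate(positions):
            sibling = p ^ 1
            proofs[i].append((nodes[sibling], sibling % 2 == 1))
            positions[i] = p // 2
        nodes = [h(nodes[j] + nodes[j + 1]) for j in range(0, len(nodes), 2)]
    return nodes[0], proofs

def verify_inclusion(leaf, proof, root):
    """Recompute the path from a leaf to the root using its sibling hashes."""
    acc = h(leaf)
    for sibling, sibling_is_right in proof:
        acc = h(acc + sibling) if sibling_is_right else h(sibling + acc)
    return acc == root

# 32 authentication results -> one root; only the root gets a Dilithium-3
# signature, and each token carries a 5-hash inclusion proof.
results = [f"user-{i}:ok".encode() for i in range(32)]
root, proofs = merkle_root_and_proofs(results)
assert all(verify_inclusion(results[i], proofs[i], root) for i in range(32))
```

With 32 leaves each inclusion proof is five hashes, so a verifier needs only the root signature, 160 bytes of path, and the Dilithium public key.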
Putting It Together: 38.5µs
| Stage | Batch Cost | Per-User Cost | Algorithm |
|---|---|---|---|
| FHE batch verify | ~939µs | ~30µs | BFV-64 · H33 optimized |
| ZKP cache lookup | ~1.9µs (32 × 0.059µs) | ~0.059µs | SHA3-256 · zero-copy |
| Dilithium attestation | ~291µs | ~9µs | ML-DSA · FIPS 204 |
| Total | ~1,232µs | ~38.5µs | All PQ-safe |
At 96 workers on a 192-vCPU Graviton4 instance, the per-worker throughput is approximately 22,600 authentications per second. Across 96 workers: 2,172,518 sustained.
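The headline figures can be reproduced directly from the stage numbers (the cache lookup is counted once per user, per the Stage 2 latency):

```python
BATCH = 32
fhe_us, zkp_us, sig_us = 939.0, 0.059, 291.0  # stage costs from the table

batch_total = fhe_us + sig_us + BATCH * zkp_us  # one lookup per user
per_user = batch_total / BATCH
print(round(batch_total), round(per_user, 1))   # ~1232µs batch, ~38.5µs/user

workers, sustained = 96, 2_172_518
print(round(sustained / workers))               # ≈22,630 auths/s per worker
```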
What This Means for Enterprise PQC Migration
The benchmark figures above are not theoretical. They reflect a production workload on commodity AWS infrastructure — no custom silicon, no hardware security modules in the latency path, no hybrid classical/post-quantum fallback.
National banking infrastructure handles 500k–2M authentication operations per second at peak. Government identity systems operate in the 1M+/sec range. H33's 2.17M sustained figure puts a complete post-quantum authentication stack within the performance envelope that enterprise deployments require.
The performance barrier that was delaying your quantum migration no longer exists. The question is no longer when PQC will be fast enough. It's why you're still running RSA.
The "harvest now, decrypt later" threat — where adversaries are archiving encrypted traffic today to decrypt it once quantum computers are available — is active right now. Traffic encrypted with RSA-2048 or ECC P-256 today is potentially readable by a sufficiently capable quantum computer within the decade. If your data has a secrecy requirement longer than that horizon, the migration decision is already past due.
What's Next
We're actively working on extending the batch size beyond 32 users. Theoretical analysis suggests batch sizes of 64–128 are achievable with the current BFV parameter sets, which would push per-user FHE cost below 20µs. We're also evaluating CKKS for approximate-arithmetic workloads — ML inference over encrypted embeddings — which has different optimization characteristics.
The full benchmark methodology, hardware specifications, and historical run data are documented at h33.ai/benchmarks. Enterprise customers can request full run logs and raw timing data under NDA.
If you're building infrastructure that needs to survive the next decade of cryptographic transitions, the API is open for access. Post-quantum is no longer a future problem.
Hardware & Methodology
| Parameter | Value |
|---|---|
| Instance | c8g.metal-48xl (AWS Graviton4) |
| CPU | 192 vCPUs · Neoverse V2 |
| Memory | 377 GiB |
| Workers | 96 |
| FHE Parameters | N=4096, Q=56-bit, t=65537 (BFV-64) |
| Batch Size | 32 users / ciphertext |
| NTT | Montgomery radix-4, Harvey lazy reduction |
| Allocator | glibc system (not jemalloc) |
| Benchmark Framework | Criterion.rs v0.5 |
| Sustained Duration | 120 seconds |