Why Key Exchange Is the First Thing to Migrate
Of all the cryptographic primitives in your stack, key exchange is the most urgent to migrate to post-quantum algorithms. The reason is straightforward: key exchange is the primitive directly exposed to the Harvest Now, Decrypt Later attack. Digital signatures can be forged only in real time (an attacker needs a quantum computer at the moment of forgery), but key exchange sessions recorded today can be decrypted retroactively once a quantum computer is available.
This is why NIST prioritized ML-KEM (originally CRYSTALS-Kyber) as the first post-quantum standard to finalize. FIPS 203 was published in August 2024, and CNSA 2.0 mandates its adoption for National Security Systems by January 2027. The message from every standards body is the same: migrate key exchange first, migrate it now.
How Lattice-Based Key Encapsulation Works
ML-KEM is based on the Module Learning With Errors (MLWE) problem, which is a structured variant of the Learning With Errors (LWE) problem introduced by Oded Regev in 2005. The security assumption is that given a matrix A and a vector b = As + e (where s is a secret vector and e is a small error vector), it is computationally infeasible to recover s -- even with a quantum computer.
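As a concrete illustration, here is a toy LWE instance in Python with throwaway parameters (n = 8; real ML-KEM works over rank-k modules of degree-256 polynomials). The point is only the shape of the problem: b = As + e looks uniformly random, yet anyone holding s can strip As off and recover the small error.

```python
import random

q = 3329   # ML-KEM's modulus, reused here for flavor
n = 8      # toy dimension; far too small to be secure

def small_vec(n):
    """Sample a 'small' vector with entries in {-1, 0, 1}."""
    return [random.choice([-1, 0, 1]) for _ in range(n)]

def lwe_instance(n, q):
    """Return (A, b, s, e) with b = A*s + e (mod q)."""
    A = [[random.randrange(q) for _ in range(n)] for _ in range(n)]
    s = small_vec(n)   # the secret
    e = small_vec(n)   # the error that makes the system noisy
    b = [(sum(A[i][j] * s[j] for j in range(n)) + e[i]) % q
         for i in range(n)]
    return A, b, s, e

A, b, s, e = lwe_instance(n, q)
# Without e this is plain linear algebra; with e, recovering s from (A, b)
# is the (M)LWE problem, believed hard even for large quantum computers.
```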
The "Module" in MLWE means the matrix entries are elements of a polynomial ring R_q = Z_q[X]/(X^n + 1), rather than individual integers. This provides a compact representation that keeps key sizes manageable while maintaining strong security guarantees. The ring structure enables efficient Number Theoretic Transform (NTT) operations for polynomial multiplication, which is where most of the computational work happens.
The key encapsulation flow works in three steps:
- KeyGen: Generate a random matrix A (expanded from a public 32-byte seed), a secret vector s, and an error vector e. Compute t = As + e. The public key is (seed, t), from which anyone can regenerate A; the secret key is s.
- Encapsulate: The sender samples a random message m and hashes it together with a hash of the public key to derive both the shared secret K and the encryption randomness, then computes the ciphertext components using the public key. The ciphertext is sent to the key holder.
- Decapsulate: The key holder uses their secret key to recover m from the ciphertext, re-derives the randomness, re-encrypts, and checks that the result matches the received ciphertext. If it matches, K is output as the shared secret. If not, a pseudorandom value derived from a secret rejection seed and the ciphertext is returned instead (the implicit-rejection Fujisaki-Okamoto transform, which provides CCA2 security).
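The three steps can be made concrete with a toy Regev-style KEM over plain integer vectors. This is a teaching sketch, not ML-KEM: it encapsulates a single bit, uses throwaway parameters, omits the FO re-encryption check, and stands in SHA3-256 for the spec's hash functions, but the KeyGen/Encapsulate/Decapsulate shape is the same.

```python
import hashlib, random

q, n = 3329, 8

def small(n):
    return [random.choice([-1, 0, 1]) for _ in range(n)]

def keygen():
    A = [[random.randrange(q) for _ in range(n)] for _ in range(n)]
    s, e = small(n), small(n)
    t = [(sum(A[i][j] * s[j] for j in range(n)) + e[i]) % q for i in range(n)]
    return (A, t), s

def encapsulate(pk):
    A, t = pk
    m = random.randrange(2)                       # toy: a single message bit
    r, e1 = small(n), small(n)
    e2 = random.choice([-1, 0, 1])
    u = [(sum(A[i][j] * r[i] for i in range(n)) + e1[j]) % q for j in range(n)]
    v = (sum(t[i] * r[i] for i in range(n)) + e2 + m * (q // 2)) % q
    K = hashlib.sha3_256(bytes([m])).digest()     # shared secret from m
    return (u, v), K

def decapsulate(sk, ct):
    # Real ML-KEM would also re-encrypt m and compare ciphertexts (FO).
    u, v = ct
    noisy = (v - sum(sk[j] * u[j] for j in range(n))) % q
    m = 1 if q // 4 < noisy < 3 * q // 4 else 0   # decode: near 0 vs near q/2
    return hashlib.sha3_256(bytes([m])).digest()

pk, sk = keygen()
ct, K_sender = encapsulate(pk)
assert decapsulate(sk, ct) == K_sender
```

Correctness works because the leftover noise (e·r + e2 - s·e1) is at most 17 in magnitude here, far below the q/4 decoding threshold.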
Parameter Sets: ML-KEM-512, 768, and 1024
FIPS 203 defines three parameter sets, corresponding to NIST security levels 1, 3, and 5:
| Parameter Set | Security Level | Classical Equivalent | Public Key | Ciphertext | Shared Secret |
|---|---|---|---|---|---|
| ML-KEM-512 | Level 1 | AES-128 | 800 bytes | 768 bytes | 32 bytes |
| ML-KEM-768 | Level 3 | AES-192 | 1,184 bytes | 1,088 bytes | 32 bytes |
| ML-KEM-1024 | Level 5 | AES-256 | 1,568 bytes | 1,568 bytes | 32 bytes |
For comparison, X25519 (the most common classical key exchange in TLS 1.3) uses 32-byte public keys and produces 32-byte shared secrets. The key size increase is significant -- ML-KEM-768 public keys are 37x larger than X25519. This has real implications for bandwidth, especially in high-frequency API environments or IoT deployments with constrained links.
H33 uses ML-KEM-768 as the default for all API key exchange operations, providing NIST Level 3 security (equivalent to AES-192 against classical attacks, and resistant to all known quantum algorithms). For customers requiring NIST Level 5 compliance (defense, intelligence, critical infrastructure), ML-KEM-1024 is available via configuration.
Performance: ML-KEM vs. ECDH in Production
One of the most common concerns about post-quantum migration is performance. The good news: ML-KEM is fast. In many benchmarks, it is actually faster than ECDH for the computational operations, though the larger key and ciphertext sizes add bandwidth overhead.
| Operation | X25519 (ECDH) | ML-KEM-768 | Difference |
|---|---|---|---|
| Key Generation | ~50 us | ~30 us | ML-KEM 40% faster |
| Encapsulation / DH | ~120 us | ~40 us | ML-KEM 67% faster |
| Decapsulation | ~120 us | ~45 us | ML-KEM 63% faster |
| Public Key Size | 32 bytes | 1,184 bytes | 37x larger |
| Ciphertext Size | 32 bytes | 1,088 bytes | 34x larger |
| Combined Wire Overhead | 64 bytes | 2,272 bytes | +2,208 bytes |
The computational performance advantage of ML-KEM comes from the NTT-based polynomial arithmetic, which maps efficiently to modern CPU architectures. On ARM platforms like AWS Graviton4, the NTT operations benefit from the wide pipelines and high memory bandwidth. On x86_64, AVX2 and AVX-512 provide additional vectorization opportunities for the modular arithmetic.
The 6-14% TLS handshake overhead reported in recent large-scale studies (including Cloudflare's and Google's hybrid PQ deployments) comes primarily from the additional bytes on the wire, not from computational cost. On fast links (datacenter-to-datacenter, broadband), the overhead is at the lower end. On constrained links (mobile, satellite, IoT), the overhead can be higher due to the impact of additional round-trip bytes on congestion windows.
Hybrid Mode: ML-KEM + X25519 for the Transition Period
During the transition period, many deployments use hybrid key exchange that combines ML-KEM with X25519. The shared secret is derived from both the classical and post-quantum key exchanges, so the connection is secure as long as either algorithm remains unbroken. This provides a safety net: if a flaw is discovered in ML-KEM, the classical X25519 component still protects the session (against classical adversaries). If quantum computers arrive, the ML-KEM component protects against quantum attack.
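A minimal sketch of the combiner, with stand-in byte strings for the two shared secrets and an HKDF built from the stdlib's HMAC-SHA-256 (the real TLS construction feeds the concatenated secrets into the existing key schedule, per draft-ietf-tls-hybrid-design; the labels here are illustrative):

```python
import hmac, hashlib

def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    return hmac.new(salt, ikm, hashlib.sha256).digest()

def hkdf_expand(prk: bytes, info: bytes, length: int = 32) -> bytes:
    out, block, counter = b"", b"", 1
    while len(out) < length:
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        out += block
        counter += 1
    return out[:length]

def hybrid_secret(ss_ecdh: bytes, ss_mlkem: bytes, transcript: bytes) -> bytes:
    # Both inputs are fixed-length, so plain concatenation is unambiguous.
    # Breaking the session key now requires breaking BOTH components.
    prk = hkdf_extract(b"hybrid-kex", ss_ecdh + ss_mlkem)
    return hkdf_expand(prk, b"session key|" + transcript)

k = hybrid_secret(b"\x01" * 32, b"\x02" * 32, b"client-server-transcript")
```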
TLS hybrid key exchange is defined in draft-ietf-tls-hybrid-design and is already deployed at scale by Cloudflare and Google, initially under the X25519Kyber768Draft00 codepoint and since updated to X25519MLKEM768 to match the final FIPS 203. Chrome and Firefox both ship hybrid PQ key exchange in production builds.
H33 supports hybrid mode for TLS termination but recommends pure ML-KEM for API-to-API communication where both endpoints are under your control. The hybrid overhead is small (~2,300 additional bytes per handshake), but for high-frequency API calls with session resumption, eliminating the classical component simplifies the key schedule and reduces the attack surface.
Implementation Considerations
Key Sizes and Session Management
The larger key sizes in ML-KEM have implications beyond bandwidth. TLS session tickets and resumption tokens that embed key material will be larger. DNS-based certificate retrieval (DANE) and OCSP stapling payloads increase. If your infrastructure uses UDP-based protocols (QUIC, DTLS), the larger handshake may exceed typical MTU sizes and require fragmentation.
H33 mitigates these issues through aggressive session resumption. After the initial ML-KEM handshake establishes a shared secret, subsequent API calls within the session window use symmetric-key authenticated encryption (AES-256-GCM) derived from the PQ-established key. The PQ handshake cost is amortized across hundreds or thousands of API calls per session.
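The amortization pattern can be sketched as follows; the labels, output lengths, and use of SHAKE-256 as the KDF are illustrative assumptions, not H33's actual key schedule:

```python
import hashlib

def derive(secret: bytes, label: bytes, length: int) -> bytes:
    """Labeled key derivation via SHAKE-256 (illustrative KDF choice)."""
    return hashlib.shake_256(label + b"|" + secret).digest(length)

pq_secret = b"\x42" * 32   # stand-in for the ML-KEM-established shared secret

k_c2s = derive(pq_secret, b"client->server aes-256-gcm key", 32)
k_s2c = derive(pq_secret, b"server->client aes-256-gcm key", 32)
iv_base = derive(pq_secret, b"nonce base", 12)
# One PQ handshake per session window; every subsequent request uses these
# symmetric keys, so the ~2 KB handshake cost is paid once, not per call.
```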
Constant-Time Implementation
ML-KEM implementations must be constant-time to prevent side-channel attacks. The Fujisaki-Okamoto transform in the decapsulation step is particularly sensitive: the comparison between the re-encapsulated ciphertext and the received ciphertext must not leak timing information, and the rejection path (returning a pseudorandom value instead of the real shared secret) must be indistinguishable from the success path in execution time.
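The pattern looks like this in Python terms (CPython makes no real timing guarantees, so this only illustrates the branch-free structure that a constant-time native implementation follows):

```python
def ct_eq_mask(a: bytes, b: bytes) -> int:
    """Return 0xFF if a == b else 0x00, with no early exit."""
    assert len(a) == len(b)
    acc = 0
    for x, y in zip(a, b):
        acc |= x ^ y                # accumulates every byte difference
    return ((acc - 1) >> 8) & 0xFF  # acc == 0 -> 0xFF, acc in 1..255 -> 0x00

def ct_select(mask: int, real_key: bytes, reject: bytes) -> bytes:
    """Pick real_key or reject via the mask, no data-dependent branch."""
    inv = mask ^ 0xFF
    return bytes((mask & r) | (inv & z) for r, z in zip(real_key, reject))

K, K_bar = b"\xaa" * 32, b"\x55" * 32
# Ciphertext mismatch -> the pseudorandom rejection value, same code path
out = ct_select(ct_eq_mask(b"ct-recomputed", b"ct-received!!"), K, K_bar)
```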
H33's ML-KEM implementation is written in pure Rust with no external dependencies. All comparison operations use constant-time primitives (bitwise OR accumulation, not early-exit comparison). Noise sampling uses the Centered Binomial Distribution -- which is rejection-free by construction -- with batched RNG calls, one call per 10 coefficients, eliminating variable-time sampling loops.
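A CBD sampler of this shape (eta = 2, as in ML-KEM-768; SHAKE-256 stands in for the spec's PRF, and the batching granularity is illustrative) can be sketched as:

```python
import hashlib

def cbd_eta2(seed: bytes, n: int = 256):
    """Sample n coefficients from CBD(2): (2 bits summed) - (2 bits summed)."""
    buf = hashlib.shake_256(seed).digest(n // 2)   # 4 random bits / coefficient
    coeffs = []
    for byte in buf:
        for shift in (0, 4):                       # two coefficients per byte
            nib = (byte >> shift) & 0xF
            a = (nib & 1) + ((nib >> 1) & 1)
            b = ((nib >> 2) & 1) + ((nib >> 3) & 1)
            coeffs.append(a - b)                   # always lands in [-2, 2]
    return coeffs

noise = cbd_eta2(b"\x00" * 32)
```

Every coefficient costs a fixed number of bit operations, so there is no data-dependent loop to leak timing.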
Seed Expansion and Deterministic Key Generation
ML-KEM generates the public matrix A from a 32-byte seed using SHAKE-128 (an extendable output function from the SHA-3 family). This means the matrix does not need to be transmitted or stored -- both parties can regenerate it from the seed. This is a significant space optimization: the matrix A would otherwise be k*k*n*log2(q) bits (several kilobytes -- about 3.5 KB for ML-KEM-768), but the seed is always 32 bytes.
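A simplified version of this expansion, following the 12-bit rejection-sampling shape of FIPS 203's SampleNTT (simplified: the real algorithm also domain-separates each matrix entry by appending row and column indices to the seed):

```python
import hashlib

q = 3329

def sample_poly(seed: bytes, n: int = 256):
    """Expand a seed into n coefficients uniform mod q via 12-bit rejection."""
    length = 3 * n                     # usually plenty; ~81% acceptance rate
    stream = hashlib.shake_128(seed).digest(length)
    coeffs, i = [], 0
    while len(coeffs) < n:
        if i + 3 > len(stream):        # rare: rejections exhausted the buffer
            length *= 2                # SHAKE output is prefix-stable, so the
            stream = hashlib.shake_128(seed).digest(length)  # old bytes stand
        b0, b1, b2 = stream[i], stream[i + 1], stream[i + 2]
        i += 3
        d1 = b0 + 256 * (b1 & 0xF)     # low 12 bits of the 3-byte chunk
        d2 = (b1 >> 4) + 16 * b2       # high 12 bits
        for d in (d1, d2):
            if d < q and len(coeffs) < n:
                coeffs.append(d)       # keep only candidates below q
    return coeffs

poly = sample_poly(b"\x00" * 32)
```

Because both sides run the same deterministic expansion on the same seed, the matrix never crosses the wire.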
However, the seed expansion is a non-trivial computational cost. In H33's production benchmarks on Graviton4, SHAKE-128 expansion accounts for approximately 15% of the total keygen time. We pre-expand and cache matrices for frequently-used parameter sets, reducing repeated keygen operations to a single CBD sampling plus NTT transform.
H33's Pure Rust ML-KEM Implementation
H33 implements ML-KEM natively, without wrapping OpenSSL, liboqs, or any external library. The implementation sits in our src/pqc/kyber.rs module alongside the hybrid key exchange logic in src/pqc/hybrid.rs. Key design decisions include:
- Montgomery-form NTT arithmetic: All polynomial multiplications use Montgomery reduction, eliminating division in the hot path. Twiddle factors are pre-computed in Montgomery form at initialization.
- Batch CBD sampling: Centered Binomial Distribution noise generation uses a single RNG call per 10 coefficients, reducing syscall overhead by 5x compared to per-coefficient sampling.
- Zero-copy serialization: Key material is serialized directly into wire-format buffers without intermediate allocation. On Graviton4 with the system allocator, this avoids the allocation overhead that dominates in short-lived KEM operations.
- ARM NEON acceleration: On aarch64 platforms (Graviton4, Apple Silicon), the NTT butterfly operations use NEON intrinsics for parallel modular addition and subtraction. Modular multiplication falls back to scalar u128 operations (ARM NEON lacks native 64x64->128 multiply).
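The Montgomery reduction behind the first bullet fits in a few lines; this is a generic sketch for q = 3329 and R = 2^16, not H33's Rust code:

```python
Q = 3329
R = 1 << 16
QINV = pow(-Q, -1, R)   # -Q^{-1} mod 2^16 (Python 3.8+ modular inverse)

def montgomery_reduce(t: int) -> int:
    """Return t * R^{-1} mod Q for 0 <= t < Q*R, with no division by Q."""
    m = (t * QINV) & (R - 1)        # m = t * (-Q^{-1}) mod R
    return ((t + m * Q) >> 16) % Q  # t + m*Q is divisible by R by construction

# Round-trip: convert to Montgomery form (multiply by R^2 mod Q, reduce),
# multiply, then reduce twice to land back on the plain product mod Q.
R2 = (R * R) % Q
a, b = 1234, 2999
am = montgomery_reduce(a * R2)      # a * R mod Q
bm = montgomery_reduce(b * R2)      # b * R mod Q
ab = montgomery_reduce(montgomery_reduce(am * bm))
assert ab == (a * b) % Q
```

Keeping twiddle factors in Montgomery form means every NTT butterfly is a multiply, a shift, and an add -- no division in the hot path, as the bullet describes.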
The result is ML-KEM-768 encapsulation in under 40 microseconds on Graviton4, with the full key exchange adding less than 100 microseconds to an API call. At H33's production throughput of 2.17 million authentications per second, the ML-KEM overhead is invisible in the pipeline -- the BFV FHE operations dominate at 939 microseconds per 32-user batch.
Further Reading
- H33 API Documentation -- Integration guides with code examples
- Post-Quantum Architecture -- Full PQ stack overview including ML-DSA and FHE
- HNDL Protection -- Why key exchange migration is urgent
- Pricing -- Credit-based pricing with a free tier for development
- Benchmarks -- Production performance numbers on Graviton4