
Key Pool Architecture: How Pre-Generated Keys Deliver 104x FHE Speedup

Eliminating key generation from the critical path through background generation, pool management, and warm cache strategies

Fully homomorphic encryption has a cold-start problem. Before you can encrypt a single value, you need to generate a key set: a secret key, a public key, and evaluation keys for homomorphic operations like rotation and relinearization. For BFV with a polynomial degree of 4,096 and production-grade parameters, this key generation takes milliseconds. That sounds fast until you consider that the actual FHE computation on a pre-encrypted batch takes microseconds. Keygen is not part of the per-request computation, but if it happens on the critical path, it dominates end-to-end latency by roughly two orders of magnitude.

Key pooling is the architectural pattern that solves this problem. Instead of generating keys on-demand when a request arrives, the system maintains a pool of pre-generated key sets that are ready for immediate use. A request draws a key set from the pool, performs the computation, and the pool is replenished in the background by a dedicated key generation thread. The result is that the critical path never includes key generation, and the end-to-end latency drops from milliseconds to microseconds. In H33's production pipeline, this architecture delivers a 104x speedup compared to cold-start key generation for the first request in a session.

Why FHE Key Generation Is Expensive

FHE key generation is not simple random number generation. It involves several computationally intensive steps, each of which contributes to the total keygen latency.

First, the secret key must be sampled from a specific distribution. For BFV, the secret key is a polynomial with small coefficients (typically ternary: coefficients are -1, 0, or 1). Sampling this polynomial requires generating 4,096 random ternary values from a cryptographically secure random number generator (CSRNG). The CSRNG itself adds overhead because it must be seeded properly and must produce values that are indistinguishable from uniform randomness.
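As a minimal sketch of this sampling step, the snippet below draws 4,096 ternary coefficients from the operating system's secure random source. It assumes a uniform distribution over -1, 0, and 1; the article does not say which ternary distribution H33 uses, and the function name is illustrative.

```python
import secrets

N = 4096  # polynomial degree used throughout the article


def sample_ternary_secret(n: int = N) -> list[int]:
    """Draw n coefficients uniformly from {-1, 0, 1} using the OS secure RNG."""
    # secrets.randbelow(3) is unbiased over {0, 1, 2}; subtracting 1 recenters it.
    return [secrets.randbelow(3) - 1 for _ in range(n)]


secret_key = sample_ternary_secret()
```

Using an unbiased bounded draw matters here: reducing raw random bytes modulo 3 would skew the coefficient distribution slightly.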

Second, the public key is computed from the secret key using polynomial arithmetic in the number-theoretic transform (NTT) domain. This involves sampling a uniformly random polynomial whose coefficients range over the full ciphertext modulus, performing a forward transform, multiplying by the transformed secret key, and adding a small error polynomial. The transform operations on 4,096-element polynomials are the dominant cost in this step.
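The structure of that computation can be sketched with toy parameters. The example below is an illustration under several assumptions, not H33's implementation: it uses degree 16 and modulus 12289 instead of production parameters, a schoolbook negacyclic product in place of the NTT (both give the same result), and a ternary error polynomial where real schemes typically use a discrete Gaussian or centered binomial distribution.

```python
import secrets


def negacyclic_mul(a, b, q):
    """Schoolbook product in Z_q[x]/(x^n + 1); an NTT yields the same result faster."""
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                out[k] = (out[k] + ai * bj) % q
            else:
                out[k - n] = (out[k - n] - ai * bj) % q  # wrap around: x^n = -1
    return out


def small_poly(n):
    """Ternary polynomial, used here for both the secret and the (toy) error."""
    return [secrets.randbelow(3) - 1 for _ in range(n)]


def gen_public_key(s, q):
    """Return ((pk0, pk1), e) with pk0 = -(a*s + e) mod q and pk1 = a."""
    n = len(s)
    a = [secrets.randbelow(q) for _ in range(n)]  # uniform over the full modulus
    e = small_poly(n)
    a_times_s = negacyclic_mul(a, s, q)
    pk0 = [(-(x + y)) % q for x, y in zip(a_times_s, e)]
    return (pk0, a), e


n, q = 16, 12289  # toy parameters, far too small to be secure
s = small_poly(n)
(pk0, pk1), e = gen_public_key(s, q)
```

A quick sanity check of the construction is that pk0 + pk1*s equals -e modulo q, which is exactly the small-error relation that decryption relies on.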

Third, evaluation keys are generated for each homomorphic operation the system needs to support. Relinearization keys are needed after every homomorphic multiplication to reduce the ciphertext dimension back to the standard form. Galois keys are needed for rotation operations that shift values between SIMD slots. Each evaluation key requires multiple polynomial multiplications and transform operations. For a system that supports both multiplication and rotation, the evaluation key generation can be several times more expensive than the public key generation alone.

The total key generation time for H33's BFV production parameters (polynomial degree 4,096, 56-bit modulus, plaintext modulus 65,537) is approximately 3-5 milliseconds on Graviton4 hardware. The actual FHE computation for a 32-user batch (encrypt, inner product, decrypt) takes approximately 30-50 microseconds. Without key pooling, the first request in a session pays for key generation on top of the computation, making it roughly 100x slower than subsequent requests that reuse the same keys.
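The arithmetic behind that ratio can be made explicit. The sketch below uses only the ranges quoted above; the choice of range midpoints (4 ms and 40 microseconds) is an assumption for illustration, and cache effects are not modeled.

```python
KEYGEN_US = (3_000, 5_000)  # 3-5 ms of key generation, in microseconds
COMPUTE_US = (30, 50)       # 30-50 microseconds of batch computation


def cold_to_warm_ratio(keygen_us, compute_us):
    """Cold-start latency (keygen + compute) divided by warm latency (compute only)."""
    return (keygen_us + compute_us) / compute_us


smallest = cold_to_warm_ratio(KEYGEN_US[0], COMPUTE_US[1])  # fast keygen, slow compute
largest = cold_to_warm_ratio(KEYGEN_US[1], COMPUTE_US[0])   # slow keygen, fast compute
midpoint = cold_to_warm_ratio(4_000, 40)                    # midpoints of both ranges
```

With these inputs the ratio spans roughly 61x to 168x, and the midpoints give 101x, which is where the "two orders of magnitude" framing comes from.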

Pool Architecture

H33's key pool operates as a lock-free concurrent data structure with three distinct zones: the ready zone, the generation zone, and the retirement zone.

The ready zone contains fully generated key sets that are available for immediate use by computation threads. When a request arrives, a computation thread atomically pops a key set from the ready zone. This operation takes sub-microsecond time because it is a single atomic pointer swap with no locks and no blocking. The ready zone is sized to handle burst traffic patterns, maintaining enough key sets to serve several seconds of peak throughput without requiring any new key generation.

The generation zone is managed by dedicated background threads that continuously generate new key sets and push them into the ready zone. These threads run at lower priority than computation threads, ensuring that key generation does not compete with active request processing for CPU time. The generation rate is adaptive: it increases when the ready zone drops below a configurable low-water mark and decreases when the pool is full. This adaptive rate prevents wasted computation during low-traffic periods while ensuring the pool stays full during high-traffic periods.
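The interplay between the ready zone and the generation zone can be sketched as follows. This is a simplified illustration, not H33's code: it relies on the thread-safe `append` and `pop` of `collections.deque` rather than a hand-written lock-free structure, Python offers no thread priorities, and the class name, capacity, low-water mark, sleep intervals, and `fake_keygen` stand-in are all hypothetical.

```python
import os
import threading
import time
from collections import deque


class KeyPool:
    """Background-refilled pool; keygen is any zero-argument callable."""

    def __init__(self, keygen, capacity=8, low_water=4):
        self._keygen = keygen
        self._ready = deque()  # append()/pop() on the same end gives LIFO order
        self._capacity = capacity
        self._low_water = low_water
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._refill_loop, daemon=True)
        self.cold_starts = 0  # requests that had to generate keys on demand

    def start(self):
        self._worker.start()

    def stop(self):
        self._stop.set()
        self._worker.join()

    def ready_count(self):
        return len(self._ready)

    def acquire(self):
        try:
            return self._ready.pop()  # fast path: no keygen on the critical path
        except IndexError:
            self.cold_starts += 1     # pool ran dry: fall back to on-demand keygen
            return self._keygen()

    def _refill_loop(self):
        while not self._stop.is_set():
            level = len(self._ready)
            if level >= self._capacity:
                self._stop.wait(0.005)      # pool full: stay idle
            else:
                self._ready.append(self._keygen())
                if level >= self._low_water:
                    self._stop.wait(0.001)  # above low-water mark: generate slowly


def fake_keygen():
    return os.urandom(32)  # stand-in for a real, much more expensive FHE keygen


pool = KeyPool(fake_keygen)
pool.start()
deadline = time.monotonic() + 5.0
while pool.ready_count() < 8 and time.monotonic() < deadline:
    time.sleep(0.001)  # wait for the initial fill
key_set = pool.acquire()
pool.stop()
```

Below the low-water mark the worker generates back to back; above it, the worker pauses between key sets, which is one simple way to express the adaptive rate described above.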

The retirement zone handles key sets that have been used and are no longer needed. Key retirement is not immediate because in-flight computations may still reference a key set after it has been logically released. The retirement zone uses reference counting: each key set tracks the number of active computations using it, and the key set is only deallocated when the reference count reaches zero. This ensures memory safety without requiring computation threads to block on deallocation.
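A minimal sketch of reference-counted retirement is shown below. It uses a lock for clarity rather than atomic counters, and the class and method names are hypothetical; the point is only that a retired key set is freed when the last in-flight computation releases it.

```python
import threading


class RetirableKeySet:
    """Key set that is freed only after it is retired and has no active users."""

    def __init__(self, key_material, on_free):
        self._key_material = key_material
        self._on_free = on_free
        self._lock = threading.Lock()
        self._refs = 0
        self._retired = False
        self._freed = False

    def checkout(self):
        """Register an in-flight computation and hand out the key material."""
        with self._lock:
            if self._retired:
                raise RuntimeError("key set is retired")
            self._refs += 1
            return self._key_material

    def release(self):
        """Called when a computation finishes with the key set."""
        with self._lock:
            self._refs -= 1
            self._maybe_free()

    def retire(self):
        """Logical release: no new checkouts, free once existing users finish."""
        with self._lock:
            self._retired = True
            self._maybe_free()

    def _maybe_free(self):
        if self._retired and self._refs == 0 and not self._freed:
            self._freed = True
            self._key_material = None  # drop the reference to the key data
            self._on_free()


freed = []
key_set = RetirableKeySet(b"key-bytes", on_free=lambda: freed.append("freed"))
material = key_set.checkout()  # one computation in flight
key_set.retire()               # rotation retires the key set...
freed_before_release = list(freed)
key_set.release()              # ...but it is freed only when the computation ends
```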

Warm Cache Integration

Key pooling interacts with CPU cache hierarchies in ways that further improve performance. A freshly generated key set is likely in L2 or L3 cache because the generation thread just wrote it. When a computation thread pops this key set from the ready zone and begins using it, the polynomial data may still be cache-resident, avoiding main memory reads. This cache warming effect is particularly significant for evaluation keys, which are large (tens of kilobytes) and accessed repeatedly during homomorphic operations.

H33 exploits this effect by structuring the pool as a LIFO (last-in, first-out) stack rather than a FIFO queue. The most recently generated key set is the one most likely to be cache-warm, so serving it first to the next computation thread maximizes cache hit rates. On Graviton4 with its 64MB L3 cache shared across 192 vCPUs, the cache warming effect contributes approximately 15-20% of the total speedup beyond the raw elimination of keygen latency.

Key Rotation and Security

A common concern with key pooling is whether reusing pre-generated keys weakens security. The answer is no, with an important caveat about key rotation. Each key set in the pool is independently generated with fresh randomness from the system CSRNG. The security of any individual key set is identical regardless of whether it was generated on-demand or pre-generated. Pooling changes when keys are generated, not how they are generated.

However, key rotation is still important for operational security. If a key set is compromised (through a side-channel attack, a memory dump, or a software vulnerability), all computations performed with that key set are potentially compromised. Key rotation limits the blast radius of a key compromise by ensuring that each key set is used for a bounded number of computations before being retired.

H33's key rotation policy uses a sliding window approach. Each key set has a maximum lifetime (measured in wall-clock time) and a maximum use count (measured in number of computations). When either limit is reached, the key set is moved to the retirement zone and replaced by a fresh key set from the generation zone. The lifetime and use count limits are configurable per deployment, allowing organizations with stricter security requirements to rotate keys more frequently at the cost of higher key generation overhead.
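The two-limit policy can be expressed in a few lines. The sketch below is illustrative: the type names, field names, and limit values are assumptions, and it measures age with a monotonic clock, which is a common way to track elapsed time without being affected by system clock adjustments.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RotationPolicy:
    max_age_s: float  # maximum lifetime in seconds
    max_uses: int     # maximum number of computations per key set


@dataclass
class TrackedKeySet:
    key_material: bytes
    created_at: float = field(default_factory=time.monotonic)
    uses: int = 0

    def record_use(self):
        self.uses += 1


def should_retire(key_set, policy, now=None):
    """True once either the age limit or the use-count limit is reached."""
    now = time.monotonic() if now is None else now
    too_old = (now - key_set.created_at) >= policy.max_age_s
    too_used = key_set.uses >= policy.max_uses
    return too_old or too_used


policy = RotationPolicy(max_age_s=60.0, max_uses=3)  # hypothetical limits
fresh = TrackedKeySet(b"k1", created_at=100.0)
worn = TrackedKeySet(b"k2", created_at=100.0)
for _ in range(3):
    worn.record_use()
```

Tighter limits shrink the blast radius of a compromised key set but raise the rate at which the generation zone has to produce replacements.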

Memory Management

FHE key sets are large objects. A complete BFV key set including secret key, public key, relinearization keys, and Galois keys for a standard set of rotations can total several hundred kilobytes. A key pool with 1,000 ready key sets requires hundreds of megabytes of memory. On a Graviton4 machine with 371 GiB of RAM, this is a trivial allocation. On a smaller instance, pool sizing becomes a meaningful tradeoff between latency and memory usage.
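A back-of-the-envelope version of this estimate is sketched below. The 300 KiB per key set is an assumed value standing in for "several hundred kilobytes"; the pool size and RAM figure are the ones quoted above.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

ASSUMED_KEY_SET_BYTES = 300 * KIB  # hypothetical size of one complete key set


def pool_bytes(num_key_sets, key_set_bytes):
    """Memory needed to hold the ready pool, ignoring allocator overhead."""
    return num_key_sets * key_set_bytes


total = pool_bytes(1_000, ASSUMED_KEY_SET_BYTES)
total_mib = total / MIB
share_of_ram = total / (371 * GIB)  # fraction of the 371 GiB machine
```

Under this assumption the pool needs a little under 300 MiB, which is less than a tenth of a percent of 371 GiB but a large share of a small instance's memory.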

H33 uses a tiered memory strategy for key pools. Hot keys (the most recently generated, most likely to be used next) are kept in a small pool in directly addressable memory. Warm keys (generated recently but not immediately needed) are kept in a larger pool that may be partially evicted from L3 cache to main memory. Cold keys (pre-generated during startup for burst handling) are kept in mapped memory that is paged in on demand. This tiering ensures that the most frequently accessed keys have the lowest access latency while allowing the total pool size to exceed what would fit in cache.

H33 uses the system allocator (glibc malloc on ARM-based Graviton4) for key set allocation rather than a custom allocator. Extensive benchmarking showed that the default allocator is already well suited to the allocation patterns that FHE key generation produces: infrequent large allocations (key sets) with long lifetimes and predictable sizes. Alternative allocators such as jemalloc actually regressed performance by 8% on this workload due to arena bookkeeping overhead under high concurrency.

The 104x Speedup in Context

The 104x speedup metric compares the worst case (cold-start with on-demand keygen) to the best case (warm pool with cache-resident keys). In a production deployment that has been running for more than a few seconds, the pool is always warm and the speedup is always realized. The 104x number is most relevant for understanding the difference between a naive FHE deployment that generates keys per-request and H33's production architecture that amortizes keygen across the pool lifecycle.

It is important to understand what the 104x does and does not include. It includes the elimination of keygen from the critical path and the cache warming benefit of LIFO pool serving. It does not include the speedups from SIMD batching (which packs 4,096 values per ciphertext), from the fused pipeline (which avoids serialization between FHE, ZKP, and signing stages), or from the production-optimized polynomial arithmetic. Those optimizations contribute independently to H33's total throughput of 2,293,766 operations per second at 38 microseconds per authentication.

Scaling Key Pools Across Cores

On a 192-vCPU Graviton4 machine, the key pool must support concurrent access from all cores without becoming a contention bottleneck. H33 uses a sharded pool architecture where the pool is divided into per-core segments. Each computation thread preferentially draws from its local segment, falling back to adjacent segments only when the local segment is empty. This eliminates cross-core atomic contention in the common case and reduces cache coherency traffic.
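The local-first lookup can be sketched as below. This is an illustration under assumptions: the class name is hypothetical, shards are plain deques rather than per-core lock-free stacks, and "adjacent" is interpreted as probing the following shards in order with wraparound.

```python
from collections import deque


class ShardedPool:
    """One LIFO stack per shard; pop prefers the caller's home shard."""

    def __init__(self, num_shards):
        self._shards = [deque() for _ in range(num_shards)]

    def push(self, shard, key_set):
        self._shards[shard % len(self._shards)].append(key_set)

    def pop(self, home):
        """Return (key_set, shard_index), or None if every shard is empty."""
        n = len(self._shards)
        for offset in range(n):  # offset 0 is the home shard
            index = (home + offset) % n
            try:
                return self._shards[index].pop(), index
            except IndexError:
                continue         # this shard is empty: try the next one
        return None


pool = ShardedPool(num_shards=4)
pool.push(0, "key-a")
pool.push(1, "key-b")
first = pool.pop(home=0)   # served from the home shard
second = pool.pop(home=0)  # home shard is empty: falls back to shard 1
third = pool.pop(home=0)   # nothing left anywhere
```

In the common case only offset 0 is touched, so threads on different cores never operate on the same shard.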

The background key generation threads are fewer in number than the computation threads, typically 4-8 generation threads for 192 vCPUs. Each generation thread fills multiple pool segments in round-robin order. This asymmetry works because key generation is infrequent relative to key consumption: a single key set serves many computations before rotation, so a small number of generation threads can keep the pool full even at peak throughput.

Monitoring and Observability

Key pool health is a critical operational metric. If the pool runs dry, requests must wait for on-demand key generation, and latency spikes by two orders of magnitude. H33 exposes pool metrics through its monitoring interface: current ready count, generation rate, consumption rate, pool hit rate (percentage of requests served from the pool), and average key age. An alert fires if the ready count drops below the low-water mark or if the pool hit rate drops below 99.9%.
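A minimal sketch of the two alert conditions is shown below. The metric container, alert names, and sample numbers are hypothetical; only the low-water-mark check and the 99.9% hit-rate threshold come from the description above.

```python
from dataclasses import dataclass


@dataclass
class PoolMetrics:
    ready_count: int     # key sets currently in the ready zone
    pool_hits: int       # requests served from the pool
    total_requests: int  # all requests, including on-demand keygen fallbacks

    @property
    def hit_rate(self):
        if self.total_requests == 0:
            return 1.0   # no traffic yet: nothing has missed
        return self.pool_hits / self.total_requests


def active_alerts(metrics, low_water, min_hit_rate=0.999):
    """Return the names of the alert conditions that currently hold."""
    alerts = []
    if metrics.ready_count < low_water:
        alerts.append("ready_count_below_low_water")
    if metrics.hit_rate < min_hit_rate:
        alerts.append("hit_rate_below_threshold")
    return alerts


healthy = PoolMetrics(ready_count=900, pool_hits=10_000, total_requests=10_000)
draining = PoolMetrics(ready_count=40, pool_hits=9_980, total_requests=10_000)
healthy_alerts = active_alerts(healthy, low_water=100)
draining_alerts = active_alerts(draining, low_water=100)
```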

During the Graviton4 benchmark runs that produced the 2,293,766 auth/sec sustained number, the key pool maintained a 100% hit rate throughout the 120-second benchmark. Zero requests required on-demand key generation. The pool was sized based on the expected peak throughput multiplied by the key rotation interval, with a 2x safety margin. This sizing formula is conservative but ensures that traffic bursts never drain the pool.
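The sizing rule can be written as a one-line formula. The sketch below assumes that "peak throughput" means key sets drawn from the pool per second; the function name and the example inputs are hypothetical and are not H33's benchmark values.

```python
import math


def ready_pool_size(peak_draws_per_s, rotation_interval_s, safety_margin=2.0):
    """Key sets to keep ready: peak draw rate x rotation interval x safety margin."""
    return math.ceil(peak_draws_per_s * rotation_interval_s * safety_margin)


size = ready_pool_size(peak_draws_per_s=500, rotation_interval_s=1.0)
```

With these example inputs the rule yields 1,000 key sets. Doubling either the peak rate or the rotation interval doubles the pool, so the memory cost scales linearly with both.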

Applicability Beyond BFV

Key pooling is not specific to BFV. H33 uses the same architecture for CKKS key pools (where keygen is similarly expensive because it produces the same kinds of public and evaluation keys) and for TFHE bootstrapping key pools (where bootstrapping key generation is even more expensive than BFV keygen). The pool interface is abstracted behind a trait that each FHE scheme implements, so adding a new scheme to the pool requires only implementing the key generation and key serialization methods.
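The shape of such an interface can be sketched with an abstract base class, used here as a Python analogue of a trait. The class and method names are hypothetical, and `ToyScheme` produces random bytes rather than real key material; it exists only to show that the pool code depends on the interface and not on any particular scheme.

```python
import os
from abc import ABC, abstractmethod


class PoolableScheme(ABC):
    """The two operations a scheme must provide to be pooled."""

    @abstractmethod
    def generate_key_set(self):
        """Produce one complete, independent key set."""

    @abstractmethod
    def serialize_key_set(self, key_set) -> bytes:
        """Encode a key set as bytes for storage or transfer."""


class ToyScheme(PoolableScheme):
    """Placeholder scheme: random bytes stand in for real keys."""

    def generate_key_set(self):
        return {"secret": os.urandom(16), "public": os.urandom(16)}

    def serialize_key_set(self, key_set) -> bytes:
        return key_set["secret"] + key_set["public"]


def fill_pool(scheme: PoolableScheme, count: int):
    """Pool-side code only calls the interface, never scheme internals."""
    return [scheme.generate_key_set() for _ in range(count)]


scheme = ToyScheme()
pool = fill_pool(scheme, 4)
blob = scheme.serialize_key_set(pool[0])
```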

The post-quantum signature keys used in H33's attestation layer also benefit from pooling. ML-DSA, FALCON, and SLH-DSA key generation each requires lattice sampling or hash tree computation. While signature keygen is faster than FHE keygen, it still adds microseconds to the critical path. Pooling signature keys eliminates this overhead and ensures that the attestation stage contributes minimal latency to the total pipeline.

The Bottom Line

FHE key generation is expensive because the mathematics demands it: lattice sampling, polynomial transforms, and evaluation key computation are fundamentally costly operations that cannot be optimized below a certain floor. Key pooling does not make keygen faster. It makes keygen irrelevant to the critical path by moving it to background threads that run continuously and maintain a ready supply of key material. The result is a 104x reduction in first-request latency and a production pipeline where every request is served at warm-pool speed regardless of traffic patterns. This is not a clever trick. It is a necessary architectural decision for any FHE system that claims production-grade latency.
