Encrypted Decisions on CPU in Production

No GPU required. ARM Graviton4 delivers production FHE throughput at 2,293,766 authentications per second.

The FHE industry has a GPU fixation. Conference presentations feature NVIDIA benchmarks. Research papers report speedups on A100s and H100s. Startups raise funding to build FHE-specific hardware accelerators. The implicit message is that FHE is inherently too slow for production without specialized hardware, and that GPUs are the path to practicality.

H33 disagrees. Our production system runs on ARM Graviton4 CPUs, achieving 2,293,766 authenticated encrypted computations per second at 38 microseconds per authentication. No GPU. No FPGA. No custom ASIC. Standard ARM server processors, available on demand in the cloud (Graviton4 on AWS, comparable ARM instances on other major providers), running standard Rust code compiled with standard toolchains.

This is not a position against hardware acceleration. GPUs genuinely help for specific FHE workloads, particularly those with high arithmetic intensity and regular memory access patterns. But the GPU-or-nothing narrative is wrong, and it is preventing organizations from deploying FHE today on hardware they already have.

Why CPU Works for Production FHE

FHE computation is fundamentally polynomial arithmetic. Encrypt a value, and it becomes a polynomial in a quotient ring. Multiply two encrypted values, and you multiply two polynomials. Add encrypted values, and you add polynomials. Every FHE operation reduces to operations on polynomial coefficients.
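To make the "everything is polynomial arithmetic" point concrete, here is a minimal sketch of addition and multiplication in the ring Z_q[x]/(x^n + 1), the kind of quotient ring used by schemes such as BFV. The function names and the tiny parameters in the usage note are illustrative assumptions, not H33's implementation, and the schoolbook multiplication shown here is O(n^2); production libraries typically use a number-theoretic transform instead.

```rust
/// Coefficient-wise addition in Z_q[x]/(x^n + 1).
/// Assumes all coefficients are already reduced mod q and q < 2^63.
fn poly_add(a: &[u64], b: &[u64], q: u64) -> Vec<u64> {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(&x, &y)| (x + y) % q).collect()
}

/// Schoolbook negacyclic multiplication in Z_q[x]/(x^n + 1).
/// Assumes all coefficients are already reduced mod q and q < 2^63.
fn poly_mul_negacyclic(a: &[u64], b: &[u64], q: u64) -> Vec<u64> {
    assert_eq!(a.len(), b.len());
    let n = a.len();
    let mut out = vec![0u64; n];
    for i in 0..n {
        for j in 0..n {
            // Widen to u128 so the coefficient product cannot overflow.
            let prod = ((a[i] as u128 * b[j] as u128) % q as u128) as u64;
            let k = i + j;
            if k < n {
                out[k] = (out[k] + prod) % q;
            } else {
                // x^n = -1 in this ring, so wrapped terms are subtracted.
                out[k - n] = (out[k - n] + q - prod) % q;
            }
        }
    }
    out
}
```

With n = 4 and q = 17, multiplying x by x^3 gives x^4, which wraps around to -1, so the result is the constant polynomial 16.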

Modern CPUs are excellent at this. A Graviton4 metal instance provides 192 vCPUs, each with SIMD registers capable of processing multiple coefficient operations in parallel. The memory hierarchy provides large L2 and L3 caches that keep hot polynomial data close to the compute units. The system's memory bandwidth is sufficient to feed the parallel computation units without becoming a bottleneck.

The key insight is that FHE throughput depends more on parallelism than on single-core speed. A biometric authentication pipeline processes independent user requests in parallel. Each request operates on independent ciphertexts with independent noise budgets. There are no cross-request dependencies that would serialize the computation. This makes FHE workloads embarrassingly parallel, and CPUs with high core counts excel at embarrassingly parallel workloads.

Graviton4's 192 vCPUs provide 192 independent computation threads, each processing its own batch of encrypted authentications. With SIMD batching at the FHE level (4096 slots per ciphertext, shared by the 32 users packed into each batch), each thread processes 32 users per batch operation. The product of thread-level parallelism and SIMD-level parallelism gives total throughput that exceeds what most applications require.
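The thread-level half of that parallelism can be sketched with scoped threads from the Rust standard library. In this sketch `process_batch` is a hypothetical placeholder that just sums the slots; a real worker would run the homomorphic operation there. The even split of batches across workers is also an assumption, not a description of H33's scheduler.

```rust
use std::thread;

/// Hypothetical stand-in for one homomorphic batch operation.
/// It only sums the slots so the example stays self-contained.
fn process_batch(batch: &[u64]) -> u64 {
    batch.iter().copied().fold(0u64, u64::wrapping_add)
}

/// Split independent batches across `workers` threads. No locks are needed
/// because no batch depends on any other batch.
fn process_all(batches: &[Vec<u64>], workers: usize) -> Vec<u64> {
    let workers = workers.max(1);
    // Ceiling division, with a floor of 1 so `chunks` never sees zero.
    let chunk = ((batches.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = batches
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || part.iter().map(|b| process_batch(b)).collect::<Vec<u64>>())
            })
            .collect();
        // Join in spawn order so results line up with the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

Because each worker only reads its own slice and returns its own results, adding workers does not add contention in this model; the output order matches the input order regardless of the worker count.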

The GPU Tax

GPU-based FHE has hidden costs that the benchmark numbers do not capture. The most significant is the data transfer overhead between host memory and GPU memory. FHE ciphertexts are large (tens of kilobytes each), and moving them to and from the GPU takes time that does not appear in kernel-only benchmarks.

A production FHE pipeline involves: receiving the request over the network, deserializing the ciphertext, transferring it to GPU memory, executing the homomorphic operation, transferring the result back to host memory, serializing the result ciphertext, and sending the response. The GPU kernel execution might be fast, but the transfer steps often dominate the end-to-end latency.
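A simple way to reason about these stages is an additive latency model. The sketch below is an assumption-laden illustration: the struct, the function names, and the idea of a fixed per-transfer setup cost are all hypothetical, and any numbers you feed in should come from your own measurements. Network time is left out because it is the same on both paths.

```rust
/// Host-side stage costs for one request, in microseconds.
#[derive(Clone, Copy)]
struct HostStages {
    deserialize_us: f64,
    serialize_us: f64,
}

/// Time to move `bytes` over a link of `gb_per_s` gigabytes per second,
/// plus a fixed per-transfer setup cost.
fn transfer_us(bytes: u64, gb_per_s: f64, setup_us: f64) -> f64 {
    setup_us + (bytes as f64) / (gb_per_s * 1e9) * 1e6
}

/// CPU path: deserialize, compute in place, serialize.
fn cpu_path_us(host: HostStages, compute_us: f64) -> f64 {
    host.deserialize_us + compute_us + host.serialize_us
}

/// GPU path: the same host stages plus one transfer in each direction.
fn gpu_path_us(
    host: HostStages,
    kernel_us: f64,
    ciphertext_bytes: u64,
    gb_per_s: f64,
    setup_us: f64,
) -> f64 {
    cpu_path_us(host, kernel_us) + 2.0 * transfer_us(ciphertext_bytes, gb_per_s, setup_us)
}
```

The model makes the comparison explicit: a GPU kernel only wins end to end when its speedup over the CPU computation exceeds the two transfer terms, which is why kernel-only benchmarks can be misleading.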

CPU-based FHE eliminates this transfer overhead entirely. The ciphertext is deserialized into the same memory that the computation uses. There is no bus transfer, no memory copy, no synchronization between host and device. The end-to-end latency is the computation latency, nothing more.

GPU availability is another hidden cost. NVIDIA A100 and H100 instances are expensive and frequently unavailable during peak demand. Graviton4 instances are broadly available and cost less per vCPU. For sustained production workloads that run 24/7, the cost difference compounds into significant savings over time.

GPU programming complexity adds engineering cost. CUDA kernels for FHE require careful optimization of thread blocks, shared memory usage, and memory coalescing. CPU code compiles with standard Rust toolchains and runs on any ARM instance without hardware-specific tuning. The engineering velocity difference is substantial: CPU implementations can be iterated, debugged, and deployed faster than GPU implementations.

Architectural Decisions That Enable CPU Performance

Achieving 2,293,766 authentications per second on CPU is not automatic. It requires specific architectural decisions that align the computation with the hardware's strengths.

The first decision is batch-oriented processing. Rather than processing one authentication at a time, H33 batches 32 user biometric vectors into a single ciphertext using SIMD packing. One homomorphic inner product operation simultaneously computes similarity scores for all 32 users. This amortizes the fixed overhead of polynomial operations across 32 independent results.
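The packing layout can be modeled in plaintext. The sketch below assumes 4096 slots divided evenly among 32 users, which implies 128 slots per user; that per-user dimension is an inference from the two figures, not a number stated by H33. It computes what the homomorphic operation would compute, without any encryption: in a real BFV pipeline the slot-wise multiply is one ciphertext multiplication, the per-block sums are built from slot rotations, and arithmetic is modulo the plaintext modulus.

```rust
const SLOTS: usize = 4096; // slots per ciphertext
const USERS: usize = 32; // users per batch
const DIM: usize = SLOTS / USERS; // 128 slots per user (assumed layout)

/// Lay out up to 32 user vectors back to back in one slot array.
fn pack(vectors: &[Vec<u64>]) -> Vec<u64> {
    assert!(vectors.len() <= USERS);
    let mut slots = vec![0u64; SLOTS];
    for (u, v) in vectors.iter().enumerate() {
        assert_eq!(v.len(), DIM);
        slots[u * DIM..(u + 1) * DIM].copy_from_slice(v);
    }
    slots
}

/// Plaintext model of the batched inner product: one slot-wise multiply
/// over all slots, then a sum within each user's block of slots.
/// Assumes values are small enough that u64 arithmetic does not overflow.
fn batched_inner_product(probe: &[u64], enrolled: &[u64]) -> Vec<u64> {
    assert_eq!(probe.len(), enrolled.len());
    let products: Vec<u64> = probe.iter().zip(enrolled).map(|(&a, &b)| a * b).collect();
    products.chunks(DIM).map(|blk| blk.iter().sum::<u64>()).collect()
}
```

One call to `batched_inner_product` on two packed arrays yields 32 scores, which is the amortization the paragraph above describes: the fixed cost of one polynomial operation is shared by every user in the batch.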

The second decision is the choice of FHE parameters. H33's BFV configuration uses N=4096 with a single 56-bit modulus, which provides sufficient security for biometric matching while minimizing the polynomial size. Larger parameters would provide more noise headroom but would slow down every operation. The parameter selection is tuned for the specific workload, not for generality.
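The effect of these parameters on polynomial size is easy to work out. The sketch below assumes a fresh ciphertext holds two polynomials (the usual case in BFV) and compares two storage conventions: one 64-bit word per coefficient in memory, and tightly packed 56-bit coefficients on the wire. The struct and function names are illustrative, and H33's actual serialization format is not described in this article.

```rust
/// Parameters that determine ciphertext size.
struct BfvParams {
    n: usize,                    // polynomial degree
    modulus_bits: usize,         // bits in the coefficient modulus
    polys_per_ciphertext: usize, // 2 for a fresh ciphertext (assumed)
}

/// Size when every coefficient occupies one 64-bit word.
fn bytes_in_memory(p: &BfvParams) -> usize {
    p.polys_per_ciphertext * p.n * 8
}

/// Size when coefficients are bit-packed with no padding.
fn bytes_packed(p: &BfvParams) -> usize {
    // Round each polynomial up to a whole number of bytes.
    p.polys_per_ciphertext * ((p.n * p.modulus_bits + 7) / 8)
}
```

For N=4096 and a 56-bit modulus this gives 65,536 bytes in memory and 57,344 bytes packed, consistent with ciphertexts in the tens of kilobytes. Doubling N doubles both figures, which is one reason larger parameters slow every operation.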

The third decision is algorithmic. The polynomial arithmetic uses carefully optimized implementations that exploit the specific structure of the chosen parameters. The modular arithmetic uses techniques that minimize the number of divisions and maximize the use of additions and multiplications, which are faster on modern processors. These optimizations are specific to our parameter choices and would need to be re-derived for different parameters.
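One well-known way to avoid division in modular multiplication is Barrett reduction, sketched below. This is offered as an example of the general idea, not as the technique H33 uses; the article does not name one. The modulus in the usage note is an arbitrary 56-bit value chosen for illustration, with no claim that it is prime or that it matches H33's modulus.

```rust
/// Precomputed constants for Barrett reduction modulo `q`.
struct Barrett {
    q: u64,
    k: u32,    // bit length of q
    mu: u128,  // floor(2^(2k) / q)
}

impl Barrett {
    fn new(q: u64) -> Self {
        assert!(q >= 2 && q < (1u64 << 62));
        let k = 64 - q.leading_zeros();
        // The only division, performed once at setup time.
        let mu = (1u128 << (2 * k)) / q as u128;
        Barrett { q, k, mu }
    }

    /// (a * b) mod q for a, b < q, using multiplies, shifts and subtractions.
    fn mul_mod(&self, a: u64, b: u64) -> u64 {
        debug_assert!(a < self.q && b < self.q);
        let x = a as u128 * b as u128;
        // Estimate the quotient; the estimate is at most 2 below the true value.
        let q1 = x >> (self.k - 1);
        let q3 = (q1 * self.mu) >> (self.k + 1);
        let mut r = x - q3 * self.q as u128;
        // Correct the estimate with at most two subtractions.
        while r >= self.q as u128 {
            r -= self.q as u128;
        }
        r as u64
    }

    /// (a + b) mod q for a, b < q, with a conditional subtraction.
    fn add_mod(&self, a: u64, b: u64) -> u64 {
        let s = a + b;
        if s >= self.q { s - self.q } else { s }
    }
}
```

A caller would build the constants once, for example `Barrett::new((1u64 << 56) - 5)`, and reuse them for every coefficient. Because the constants depend on the modulus, changing parameters means recomputing them, which mirrors the point above about optimizations being tied to parameter choices.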

The fourth decision is the system allocator. On ARM Graviton4, the system allocator (glibc malloc) outperforms alternatives like jemalloc. This is counterintuitive, since jemalloc is often faster on x86. But ARM's flat memory model and glibc's ARM-specific optimizations make the system allocator the best choice for tight FHE loops with 192 concurrent workers.
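In Rust, the allocator choice is a one-line declaration. The sketch below pins the platform allocator explicitly using `std::alloc::System`, which maps to glibc malloc on typical Linux targets. Rust already uses the system allocator by default, so the declaration mainly documents the decision and guards against a dependency being swapped in as the global allocator later. The `scratch_poly` helper is a hypothetical example of an allocation on the hot path.

```rust
use std::alloc::System;

// Explicitly select the platform allocator for the whole binary.
#[global_allocator]
static GLOBAL: System = System;

/// Allocate a zeroed coefficient buffer; served by the allocator above.
fn scratch_poly(n: usize) -> Vec<u64> {
    vec![0u64; n]
}
```

Whether the system allocator or an alternative is faster depends on the platform and workload, so it is worth benchmarking both on the target instance type.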

What the Numbers Mean

2,293,766 authentications per second is a throughput number measured on a single Graviton4 metal instance. Each authentication includes: BFV homomorphic inner product on encrypted biometric vectors, SHA3-256 hashing of the computation trace, Dilithium (ML-DSA) post-quantum signature of the attestation, STARK proof verification from cache, and H33-74 distillation of the attestation into 74 bytes.

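The 38 microsecond figure is an amortized per-authentication cost when the system is running at maximum throughput, not the end-to-end latency of a single request. It is also not the plain reciprocal of the aggregate throughput: with many workers running in parallel, one authentication completes somewhere on the instance roughly every 0.44 microseconds. Individual request latency is higher because each request includes a batch of 32 users, and the batch processing time is divided among the users in the batch.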
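The relationship between throughput, batch size, and measurement window is plain arithmetic, sketched below using the figures quoted in this article. The function names are illustrative. The aggregate interval computed here is an instance-wide average across all parallel workers, and should not be read as a per-worker or per-request latency.

```rust
const AUTHS_PER_SEC: f64 = 2_293_766.0; // throughput quoted in the article
const USERS_PER_BATCH: f64 = 32.0; // batch size quoted in the article

/// Average time between completed authentications across the whole
/// instance, in microseconds.
fn aggregate_interval_us(auths_per_sec: f64) -> f64 {
    1e6 / auths_per_sec
}

/// Number of 32-user batches completed per second.
fn batches_per_sec(auths_per_sec: f64, users_per_batch: f64) -> f64 {
    auths_per_sec / users_per_batch
}

/// Total authentications over a sustained measurement window.
fn total_auths(auths_per_sec: f64, seconds: f64) -> f64 {
    auths_per_sec * seconds
}
```

With the quoted figures, the instance completes about 71,680 batches per second, and a 30-second sustained run covers roughly 68.8 million authentications.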

These numbers are production numbers, measured under sustained load for 30 seconds, not burst benchmarks. They include all pipeline stages, not just the FHE computation. They use the full security parameters, not reduced parameters for better benchmark numbers. And they run on standard cloud instances, not custom hardware.

When GPU Makes Sense

CPU-first does not mean CPU-only. GPU acceleration makes sense for specific workloads where the arithmetic intensity is high enough to overcome the transfer overhead, and where the computation graph is regular enough to map efficiently to GPU thread blocks.

Bootstrapping, the operation that refreshes the noise budget in FHE ciphertexts, is the primary candidate for GPU acceleration. Bootstrapping involves a large number of polynomial multiplications with a regular structure that maps well to GPU parallelism. For workloads that require bootstrapping (deep computations that exhaust the noise budget), GPU acceleration can provide meaningful speedups.

Key generation is another GPU-friendly operation. Generating FHE keys involves sampling large random polynomials and performing polynomial multiplications. This is a one-time cost per key, but for applications that generate keys frequently, GPU acceleration reduces the latency.

For the biometric authentication pipeline that H33 runs in production, neither bootstrapping nor key generation is on the hot path. The computation depth is shallow enough to avoid bootstrapping, and keys are generated once and reused. The hot path is homomorphic inner products and attestation, both of which run efficiently on CPU.

The Deployment Advantage

CPU-based FHE has a massive deployment advantage: it runs almost everywhere. Every major cloud provider offers ARM instances. Any Kubernetes cluster with ARM nodes can schedule ARM pods. Any CI/CD pipeline can build ARM binaries. There are no driver dependencies, no CUDA toolkit installations, no GPU quota requests, no hardware-specific Docker images.

This deployment simplicity translates directly to operational reliability. CPU instances have well-understood failure modes. ARM processors do not have the thermal throttling issues that affect GPUs under sustained load. Memory management is straightforward without the complexity of GPU memory allocation and deallocation. Debugging is simpler when everything runs on the host CPU with standard profiling tools.

For organizations evaluating FHE for production deployment, the message is clear: you do not need to wait for GPU availability or specialized hardware. Production-grade encrypted computation is available today on standard ARM instances, at throughput levels that exceed most application requirements. Start with CPU, measure your actual performance needs, and add GPU acceleration only if and when the numbers demand it.

The FHE industry will eventually need hardware acceleration for the most demanding workloads. But the most demanding workloads are not the first workloads. The first workloads are biometric authentication, fraud detection, encrypted search, and privacy-preserving analytics. These workloads run on CPU today, at production scale, at production throughput. The future is encrypted, and it does not require a GPU.
