If you have spent any time evaluating fully homomorphic encryption for production workloads, someone on your team has probably said: "What about GPUs?" It is a reasonable question. GPUs transformed machine learning from an academic curiosity into a trillion-dollar industry. They accelerated molecular dynamics, weather modeling, and protein folding. Every time a computation looks embarrassingly parallel, GPUs seem like the obvious answer.
TFHE (Fully Homomorphic Encryption over the Torus) is, at first glance, one of those computations. It evaluates Boolean circuits over encrypted bits. Each gate evaluation involves a bootstrapping step that is fundamentally a sequence of large polynomial multiplications. Polynomial multiplications are parallel. GPUs are massively parallel. The conclusion writes itself.
Except it does not work the way you expect.
After two years of building production FHE systems, running sustained benchmarks on GPUs ranging from A10G to H100, and deploying TFHE at scale on ARM hardware, we have a clear picture of what actually limits TFHE throughput. The bottleneck is not compute. The bottleneck is memory bandwidth. And GPUs, for all their theoretical FLOPS, do not solve the memory bandwidth problem. They shift it to a more expensive, harder-to-manage substrate.
The Memory Bandwidth Wall
To understand why GPUs disappoint for TFHE, you need to understand what a TFHE bootstrapping operation actually demands from hardware.
A single TFHE bootstrapping step transforms a noisy ciphertext into a fresh ciphertext with reduced noise. This is what makes TFHE composable: you can chain unlimited Boolean gates because each gate refreshes the noise budget. The cost of this composability is the bootstrap itself, which multiplies large polynomials against key material, the bootstrapping and key-switching keys, stored in memory.
The key-switching key for a standard TFHE parameter set occupies between 24 and 48 megabytes. The bootstrapping key can range from 50 megabytes to several hundred megabytes, depending on the security parameters and decomposition base. For a single gate evaluation, the processor must stream through tens of megabytes of key material, multiply it against the ciphertext polynomial, accumulate the results, and write the output back. This is not a matrix-matrix multiplication where the same data is reused across thousands of output elements. The key material is consumed once per gate, per bootstrap. The reuse ratio is low.
On a CPU, the bandwidth picture looks like this: a modern server processor provides 200 to 400 GB/s of memory bandwidth, shared across all cores. A 96-core ARM Graviton4, for example, provides roughly 307 GB/s of memory bandwidth across its coherent memory fabric. Each core can independently stream key material from its L3 cache or main memory, and the cache hierarchy naturally serves frequently accessed portions of the bootstrapping key without any explicit management.
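To see why bandwidth rather than arithmetic sets the ceiling, it helps to run the numbers. The sketch below is a back-of-the-envelope bound, not a benchmark: the 20 MB of key traffic per bootstrapped gate and the 75% utilization figure are assumptions chosen for illustration; the raw keys are larger, but the cache hierarchy absorbs part of the traffic.

```rust
// Back-of-the-envelope bandwidth ceiling for TFHE gate throughput on a CPU node.
// All constants are illustrative assumptions, not measurements.
fn main() {
    let peak_bandwidth_gb_s = 307.0;     // Graviton4-class node, peak DRAM bandwidth
    let utilization = 0.75;              // assumed achievable fraction of peak on a tuned deployment
    let key_traffic_mb_per_gate = 20.0;  // assumed key material streamed from DRAM per bootstrapped
                                         // gate after cache reuse; the raw keys are larger

    let effective_gb_s = peak_bandwidth_gb_s * utilization;
    let gates_per_sec = effective_gb_s * 1000.0 / key_traffic_mb_per_gate;

    println!("effective bandwidth: {:.0} GB/s", effective_gb_s);
    println!("bandwidth-limited ceiling: ~{:.0} gates/sec", gates_per_sec);
}
```

Under those assumptions the node tops out near 11,500 bootstrapped gates per second regardless of how many FLOPS the cores deliver; change the assumed per-gate traffic and the ceiling moves proportionally.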
On a GPU, the bandwidth picture looks better on paper. An NVIDIA H100 provides roughly 3,350 GB/s of HBM3 bandwidth. An A100 provides 2,039 GB/s. These numbers are five to ten times higher than what a CPU offers. So GPUs should be five to ten times faster at TFHE, right?
No. Because bandwidth is not the only dimension. You also need to consider how that bandwidth is utilized.
Why Raw Bandwidth Does Not Translate to Raw Throughput
GPU memory bandwidth is high, but GPU memory bandwidth utilization for TFHE workloads is poor. There are four reasons for this, and they are structural, not artifacts of implementation. Better CUDA code does not fix them.
1. Occupancy vs. Working Set
GPU throughput depends on occupancy: keeping thousands of threads active so the hardware can hide memory latency by switching to ready warps while others wait for data. TFHE bootstrapping has a large working set per gate. Each active bootstrap needs its own portion of the accumulator, its own slice of the bootstrapping key, and its own intermediate buffers. With at most a few hundred kilobytes of shared memory per SM and bootstrapping keys measured in hundreds of megabytes, you cannot fit enough concurrent bootstraps into the GPU's on-chip memory to achieve the occupancy the hardware needs.
The result is that threads stall waiting for HBM reads. The 3,350 GB/s of theoretical bandwidth collapses to a fraction of that. Published benchmarks from multiple groups show that real TFHE GPU implementations achieve between 30% and 55% of peak memory bandwidth utilization. A CPU running the same workload with careful prefetching and NUMA-aware allocation achieves 70% to 85%. The absolute bandwidth gap narrows dramatically once you account for utilization.
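Putting those utilization ranges next to the peak figures makes the point concrete. The sketch below is arithmetic on the numbers quoted above, not a new measurement:

```rust
// Effective bandwidth = peak * utilization, using the ranges quoted in the text.
fn effective(peak_gb_s: f64, util_lo: f64, util_hi: f64) -> (f64, f64) {
    (peak_gb_s * util_lo, peak_gb_s * util_hi)
}

fn main() {
    let (gpu_lo, gpu_hi) = effective(3350.0, 0.30, 0.55); // H100 HBM3, 30-55% on TFHE
    let (cpu_lo, cpu_hi) = effective(307.0, 0.70, 0.85);  // ARM server DRAM, 70-85%

    println!("GPU effective: {gpu_lo:.0}-{gpu_hi:.0} GB/s");
    println!("CPU effective: {cpu_lo:.0}-{cpu_hi:.0} GB/s");
    println!("raw peak ratio: {:.1}x", 3350.0 / 307.0);
    println!("effective ratio: {:.1}x-{:.1}x", gpu_lo / cpu_hi, gpu_hi / cpu_lo);
}
```

On these ranges, the on-paper 10.9x bandwidth advantage compresses to roughly 4x to 9x before any transfer, scheduling, or cost penalty is counted.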
2. Host-Device Transfer Overhead
TFHE is not self-contained. In a real system, ciphertexts arrive over the network, must be transferred to GPU memory, processed, and the results transferred back. A PCIe 5.0 x16 link provides roughly 64 GB/s per direction. For a workload that processes hundreds of gate evaluations per request, the time spent moving data between host and device memory can equal or exceed the time spent on the actual computation.
This transfer overhead is invisible in microbenchmarks that measure a single batch of gates with data already resident on the GPU. It is very visible in production, where ciphertexts arrive continuously and results must be returned with bounded latency. You can overlap transfers with computation using CUDA streams, but this adds scheduling complexity and still does not eliminate the PCIe bottleneck for latency-sensitive workloads.
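A rough per-request model shows where the time goes. Every constant below is an illustrative assumption rather than a measurement: the ciphertext size, the fixed cost per transfer, the amortized GPU time per gate, and the request shape are placeholders chosen to make the arithmetic concrete.

```rust
// Host-device transfer budget for one request, under illustrative assumptions.
fn main() {
    let ct_bytes = 8.0 * 1024.0;          // assumed LWE ciphertext size (bytes)
    let cts_per_request = 600.0;          // assumed input + output ciphertexts per request
    let pcie_bytes_per_s = 64e9;          // PCIe 5.0 x16, one direction
    let per_transfer_overhead_s = 10e-6;  // assumed fixed cost per small DMA + sync
    let gpu_time_per_gate_s = 10e-6;      // assumed amortized GPU time per bootstrapped gate
    let gates_per_request = 300.0;

    let batched = cts_per_request * ct_bytes / pcie_bytes_per_s;          // one bulk copy
    let streamed = cts_per_request * per_transfer_overhead_s + batched;   // one copy per ciphertext
    let compute = gates_per_request * gpu_time_per_gate_s;

    println!("batched transfer:  {:.2} ms", batched * 1e3);
    println!("streamed transfer: {:.2} ms", streamed * 1e3);
    println!("GPU compute:       {:.2} ms", compute * 1e3);
}
```

The single batched copy is cheap; the damage comes from the fixed cost of many small transfers when ciphertexts arrive one at a time, which is exactly the production pattern described above.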
3. Circuit Composition and Scheduling
Real TFHE programs are not bags of independent gates. They are circuits with data dependencies. A 16-bit comparison, for example, decomposes into approximately 96 Boolean gates whose dependency graph forces many gates to wait for the outputs of earlier ones. The critical path through this graph determines latency, and no amount of parallelism reduces the length of the critical path.
On a GPU, this means you launch a kernel, wait for dependent gates to complete, launch the next kernel, and so on. Each kernel launch has overhead, typically 5 to 15 microseconds. For a circuit with 20 levels of depth, that is 100 to 300 microseconds of pure scheduling overhead added to the critical path. On a CPU, function calls cost nanoseconds, and you can interleave independent circuits across cores without any kernel launch overhead.
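Two numbers are worth writing down here: the launch-overhead tax and the bound the critical path places on any parallel speedup. The sketch below reruns the arithmetic from the paragraphs above; the 96-gate and 20-level figures are illustrative.

```rust
// Critical-path arithmetic plus the kernel-launch tax; figures are illustrative.
fn main() {
    let total_gates = 96.0;               // e.g., a 16-bit comparison
    let depth = 20.0;                     // levels on the critical path
    let launch_overhead_us = (5.0, 15.0); // per-kernel-launch overhead range

    // Even with unlimited parallel gate units, latency is bounded by the depth.
    println!("best-case parallel speedup: {:.1}x", total_gates / depth);
    println!(
        "kernel-launch overhead alone: {:.0}-{:.0} us per circuit",
        depth * launch_overhead_us.0,
        depth * launch_overhead_us.1
    );
}
```

Even with a gate engine of unlimited width, a 96-gate circuit with a 20-level critical path can run at most about 4.8 times faster than serial evaluation.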
The scheduling problem compounds when you consider multi-tenant workloads. A production system serves many clients simultaneously, each with different circuits at different stages of completion. On a CPU, this is a straightforward thread-pool problem. On a GPU, this requires managing multiple CUDA streams, coordinating memory allocations across tenants, and handling the reality that GPU memory is a fixed pool that cannot be dynamically expanded. When GPU memory fills, you spill to host memory and pay the PCIe transfer penalty on every access.
4. System Integration Gaps
A GPU is a peripheral device. It sits behind a bus, managed by a driver, accessed through an API that adds abstraction layers between your code and the hardware. For training a neural network over hours, these layers are irrelevant. For evaluating a Boolean circuit in milliseconds with a latency SLA, every layer matters.
Consider what a production TFHE deployment needs beyond raw gate evaluation: encrypted input validation, key management, ciphertext routing, result caching, audit logging, batch scheduling, health monitoring, graceful degradation under load. Every one of these operations needs to interact with the TFHE computation layer. On a CPU, they live in the same address space, share the same memory, and communicate through function calls. On a GPU, they live on different devices, and every interaction crosses the host-device boundary.
This is not a theoretical concern. We have watched teams spend months building GPU-accelerated TFHE gates that benchmark beautifully in isolation, then spend additional months building the infrastructure to actually serve those gates in production. The infrastructure overhead often exceeds the compute time, which means the GPU acceleration buys less than the microbenchmarks promised.
The Throughput-Per-Dollar Question
Even if you accept the engineering complexity, there is a straightforward economic question: what does TFHE throughput cost per dollar on GPU versus CPU?
A single NVIDIA H100 GPU costs approximately $25,000 to $30,000 at retail, plus the host system to drive it. A fully loaded GPU server with four H100s runs $120,000 to $150,000. In cloud terms, a p5.48xlarge with 8 H100s costs roughly $98 per hour on-demand.
A Graviton4 c8g.metal-48xl with 192 vCPUs and 371 GiB of RAM costs approximately $2.30 per hour on-demand. That is roughly one forty-second of the GPU instance's hourly price.
Now look at what we achieve on that $2.30/hour ARM node.
| Operation | Throughput | Latency |
|---|---|---|
| 8-bit Greater-Than | 768 TPS | 125 ms |
| 16-bit Greater-Than | 372 TPS | 258 ms |
| 32-bit Greater-Than | 182 TPS | 526 ms |
| 64-bit Greater-Than | 91 TPS | 1,058 ms |
| 16-bit Equality | 769 TPS | 125 ms |
| AND Gates (sustained) | 11,520 gates/sec | — |
These numbers are from a 96-channel deployment on a single ARM node. No GPU. No cluster. No specialized hardware. In-process caching (CACHEE_MODE=inprocess) eliminates network round-trips for bootstrapping key access. Each channel independently processes TFHE operations while the execution scheduler ensures that memory bandwidth is saturated across all 96 channels without contention.
For a GPU to justify its cost premium, it would need to deliver roughly 42 times the throughput of this ARM node. That means a single H100 would need to sustain over 32,000 8-bit comparisons per second to break even on cost. Published GPU TFHE benchmarks from leading research groups and commercial vendors do not approach this number. They typically show 2 to 5 times the raw gate throughput of a single CPU core, not 42 times the throughput of a 96-core CPU deployment.
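The break-even arithmetic is short enough to write out; the prices and the 768 TPS figure are the ones quoted above.

```rust
// Break-even throughput a GPU node would need to match the ARM node's cost efficiency.
fn main() {
    let arm_cost_per_hour = 2.30;   // c8g.metal-48xl on-demand
    let gpu_cost_per_hour = 98.0;   // p5.48xlarge (8x H100) on-demand
    let arm_tps_8bit_gt = 768.0;    // 8-bit greater-than throughput from the table above

    let price_ratio = gpu_cost_per_hour / arm_cost_per_hour;
    let break_even_tps = arm_tps_8bit_gt * price_ratio;

    println!("price ratio: {:.1}x", price_ratio);
    println!("GPU break-even: ~{:.0} 8-bit comparisons/sec", break_even_tps);
}
```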
The arithmetic is clear: for sustained TFHE throughput, ARM compute delivers more operations per dollar than GPU compute. The gap widens further when you account for the engineering investment required to build and maintain the GPU infrastructure.
What Actually Determines TFHE Performance
If the bottleneck is memory bandwidth rather than compute, then the path to higher throughput is not more FLOPS. It is more efficient use of the bandwidth you have. This is the architectural insight that drives H33's approach to TFHE.
There are four factors that matter more than hardware choice.
Execution Scheduling
On a 96-core processor, you have 96 independent execution units that share a memory subsystem. If all 96 cores request different portions of the bootstrapping key at the same time, they will contend for memory bandwidth and all of them will slow down. If they are scheduled so that their memory access patterns are staggered and cache-friendly, they can collectively approach the theoretical bandwidth limit of the machine.
This scheduling problem is where most of the performance lives. A naive deployment that launches 96 independent TFHE evaluations will achieve perhaps 40% of what a carefully scheduled deployment achieves. The scheduling must account for the circuit structure (which gates depend on which), the memory access pattern (which key material is needed when), and the cache hierarchy (which data is hot and which is cold). Getting this right is genuinely difficult engineering. But it is software engineering, not hardware procurement.
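As a toy illustration of the staggering idea, and not a sketch of H33's production scheduler, the snippet below gives each worker a phase offset so that bandwidth-heavy phases do not all hit the memory controller at once; fixed sleeps stand in for real key-streaming and bootstrap work.

```rust
// Toy sketch of staggered channel scheduling; sleeps stand in for real work.
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    let workers: u32 = 8;                    // stand-in for 96 channels
    let gates_per_worker = 4;
    let stagger = Duration::from_micros(50); // assumed phase offset between channels
    let completed = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..workers)
        .map(|id| {
            let completed = Arc::clone(&completed);
            thread::spawn(move || {
                // Offset this channel's first key-streaming phase so that DRAM
                // requests from different channels are spread out in time.
                thread::sleep(stagger * id);
                for _ in 0..gates_per_worker {
                    // Placeholder for the bandwidth-heavy bootstrap of one gate.
                    thread::sleep(Duration::from_micros(200));
                    completed.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!("gates completed: {}", completed.load(Ordering::Relaxed));
}
```

A real scheduler would derive the offsets from the bootstrap's actual bandwidth-bound phases rather than fixed delays, but the structural point is the same: the win comes from when each channel touches memory, not from how fast any one channel computes.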
Memory Locality
The bootstrapping key is too large to fit in L1 or L2 cache, but portions of it can be kept in L3 cache across successive gate evaluations. If two consecutive gate evaluations on the same core access similar portions of the key, the second evaluation benefits from the first one's cache warming. The execution scheduler can exploit this by routing related gates to the same core, turning a cold memory access into a warm cache hit.
On a GPU, you do not have this kind of fine-grained control. The GPU's cache is managed by hardware, and the thread scheduler determines which warps execute on which SMs. You can influence this with careful kernel design, but you cannot control it the way you can on a CPU where each core runs a dedicated thread with a deterministic memory access pattern.
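One minimal way to express the locality idea in code, with illustrative types rather than any real API, is to route gates by the slice of the bootstrapping key they touch, so consecutive gates on the same worker find that slice warm in cache:

```rust
// Toy affinity router: gates that stream the same slice of the bootstrapping key
// go to the same worker queue, so the slice stays warm in that core's cache.
use std::collections::VecDeque;

struct Gate {
    id: usize,
    key_slice: usize, // which region of the bootstrapping key this gate streams
}

fn route(gates: Vec<Gate>, workers: usize) -> Vec<VecDeque<Gate>> {
    let mut queues: Vec<VecDeque<Gate>> = (0..workers).map(|_| VecDeque::new()).collect();
    for gate in gates {
        let worker = gate.key_slice % workers; // same slice -> same worker -> warm cache
        queues[worker].push_back(gate);
    }
    queues
}

fn main() {
    let gates: Vec<Gate> = (0..16).map(|id| Gate { id, key_slice: id % 4 }).collect();
    for (w, q) in route(gates, 4).iter().enumerate() {
        let ids: Vec<usize> = q.iter().map(|g| g.id).collect();
        println!("worker {w}: gates {ids:?}");
    }
}
```

The modulo mapping is only a stand-in; a real router would also weigh queue depth and circuit dependencies.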
Circuit Composition
A single Boolean gate is not a useful computation. Useful computations are circuits built from hundreds or thousands of gates. The way these circuits are structured, decomposed, and mapped to parallel resources determines throughput as much as the speed of any individual gate.
Our 96-channel architecture evaluates 96 independent TFHE circuits simultaneously. Each channel processes one client's encrypted computation while the others process theirs. Within each channel, gates are evaluated in dependency order with the scheduler managing inter-gate data flow. Across channels, the scheduler ensures that memory bandwidth is distributed fairly and that no single channel monopolizes shared resources.
This is the kind of system-level optimization that does not show up in a microbenchmark of a single gate. It is also the kind of optimization that a GPU makes harder, not easier, because the GPU's programming model does not naturally support 96 independent circuits with different dependency structures running concurrently on shared hardware.
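Within a single channel, dependency-ordered evaluation is a ready-queue walk over the gate graph. The sketch below shows the generic pattern (Kahn's algorithm) with placeholder structures, not H33's internal representation:

```rust
// Dependency-ordered gate evaluation within one channel via a ready queue.
use std::collections::VecDeque;

fn main() {
    // Gate i depends on the gates listed in deps[i]; a tiny example circuit.
    let deps: Vec<Vec<usize>> = vec![vec![], vec![], vec![0, 1], vec![2]];
    let n = deps.len();

    let mut indegree = vec![0usize; n];
    let mut dependents: Vec<Vec<usize>> = vec![Vec::new(); n];
    for (g, ds) in deps.iter().enumerate() {
        indegree[g] = ds.len();
        for &d in ds {
            dependents[d].push(g);
        }
    }

    let mut ready: VecDeque<usize> = (0..n).filter(|&g| indegree[g] == 0).collect();
    let mut order = Vec::new();
    while let Some(g) = ready.pop_front() {
        order.push(g); // here the channel would bootstrap gate g
        for &next in &dependents[g] {
            indegree[next] -= 1;
            if indegree[next] == 0 {
                ready.push_back(next);
            }
        }
    }
    println!("evaluation order: {order:?}");
}
```

Across channels, this walk repeats 96 times over independent graphs; the scheduler's job is to keep those walks from colliding on memory bandwidth.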
Integration with the Rest of the Stack
TFHE does not run in a vacuum. In H33's production stack, TFHE operations are one component of a larger pipeline that includes key management, ciphertext validation, attestation, caching, and audit logging. Every TFHE result feeds into downstream processing. Every TFHE input comes from upstream validation.
When the TFHE engine lives on a CPU alongside these other components, data flows between them at memory speed. When the TFHE engine lives on a GPU, data must cross the PCIe bus to reach any other component. For a latency-bound operation like an 8-bit comparison at 125 ms, an extra 2 ms of PCIe transfer is a small but measurable 1.6% penalty; for high-throughput workloads processing thousands of operations per second, the aggregate transfer overhead can consume a significant fraction of the pipeline budget.
The Hard Part Is Not the Hardware
This is the central point, and it is often missed in discussions about FHE performance: the hard part of making TFHE fast is not choosing the right hardware. It is designing the execution model that extracts maximum throughput from whatever hardware you have.
A $25,000 GPU rack does not automatically solve the scheduling problem. It does not automatically solve the memory locality problem. It does not automatically solve the circuit composition problem. It does not automatically solve the system integration problem. You still need to solve all four, and now you need to solve them within the constraints of a GPU programming model that was designed for neural network training, not Boolean circuit evaluation.
We chose ARM for our TFHE deployment not because ARM is inherently faster for polynomial arithmetic (it is not). We chose ARM because the CPU programming model gives us fine-grained control over scheduling, memory placement, cache behavior, and system integration. That control is what enabled us to build a 96-channel execution engine that sustains 11,520 AND gates per second on a single node.
Could we run faster on a GPU? Per individual gate, probably. Could we run faster on a GPU at scale, in production, with bounded latency, serving multiple tenants, integrated with the rest of the security stack, at a cost that makes commercial sense? The data says no.
Where GPUs Do Make Sense for FHE
To be clear: GPUs are not useless for FHE. There are specific workloads where GPU acceleration delivers genuine value.
Batch BFV and CKKS operations on large vectors benefit from GPU parallelism because the working set per operation is smaller and the reuse ratio is higher. Encrypted machine learning inference, where the same model weights are applied to many encrypted inputs, has a memory access pattern that GPUs are designed to exploit. Key generation, which happens once and is not latency-sensitive, can benefit from GPU acceleration.
But TFHE Boolean circuits are a different workload. The gate-by-gate nature of the computation, the large per-gate working set, the complex dependency structures, and the latency sensitivity all push against the GPU's strengths and toward its weaknesses. The right tool depends on the workload, and for TFHE in production, the right tool is a many-core CPU with a well-designed execution scheduler.
What This Means for Your Evaluation
If you are evaluating TFHE for production use, here is what we recommend.
First, benchmark end-to-end, not gate-by-gate. A vendor who reports individual gate latency is telling you the least useful number. Ask for sustained throughput across realistic circuits (comparisons, additions, multiplications at various bit widths) running concurrently under multi-tenant load. That is the number that determines whether the system can serve your production workload.
Second, benchmark on the deployment target, not on a showcase GPU. If you are deploying to cloud instances, benchmark on cloud instances. If you are deploying on-premise, benchmark on your actual hardware. TFHE performance is acutely sensitive to memory bandwidth, cache hierarchy, and NUMA topology, all of which vary dramatically between platforms.
Third, measure total cost of ownership, not peak performance. A system that delivers 500 TPS at $2.30 per hour costs about $0.0013 per thousand operations. A system that delivers 1,500 TPS at $98 per hour costs about $0.018 per thousand operations. The faster system costs roughly 14 times more per operation. Unless you need those 1,500 TPS from a single node (and horizontal scaling is not an option), the less expensive system is the better investment.
Fourth, evaluate the integration story. How does the TFHE engine connect to your key management system? How does it handle multi-tenant isolation? How does it report metrics and health? How does it degrade under overload? These operational concerns dominate the total cost of running TFHE in production, and they are dramatically simpler on a CPU architecture than on a GPU architecture.
The Path Forward
TFHE is reaching the point where it is practical for real production workloads. Not every workload, and not without careful engineering, but for specific high-value use cases like encrypted comparisons, encrypted search, and encrypted access control, the performance is here today.
The path to making TFHE faster is not to throw more hardware at it. It is to build smarter execution engines that extract every available byte of memory bandwidth from the hardware you already have. It is to design scheduling algorithms that minimize cache misses and maximize bandwidth utilization across many concurrent circuits. It is to integrate TFHE into the broader security stack so that the overhead of encryption is amortized across the entire pipeline.
This is the approach H33 has taken, and the results speak for themselves: 768 TPS for 8-bit comparisons, 372 TPS for 16-bit comparisons, 91 TPS for full 64-bit comparisons, all on a single ARM node at $2.30 per hour with no GPU, no cluster, and no specialized hardware. The performance comes from software architecture, not hardware expenditure.
Memory bandwidth is the bottleneck. Once you accept that, the engineering decisions follow naturally. And the economics follow from the engineering.
See H33 TFHE Performance Live
We will walk you through our 96-channel ARM deployment, show you the real numbers on your workload, and explain how H33 integrates TFHE into a complete post-quantum security stack built on three independent hardness assumptions.
Schedule a Demo