What 1,129 TPS Means for Real-Time Encrypted Computation
Encrypted computation has always been defined by a single constraint: speed. The mathematics of fully homomorphic encryption guarantees that data remains encrypted during processing — no decryption, no exposure, no trust assumptions about the server. But that guarantee has historically come at a cost measured in seconds per operation. For TFHE (Torus Fully Homomorphic Encryption), the bottleneck is programmable bootstrapping: the operation that resets noise after every gate evaluation, and the operation that makes TFHE uniquely powerful for arbitrary Boolean circuits.
On April 22, 2026, H33's GPU-accelerated multi-bit TFHE engine achieved 1,129 transactions per second on a single NVIDIA A10G GPU. That is a 1.63x improvement over our previous v4 baseline of 691 TPS. More importantly, the GPU implementation achieves a 1.0% noise floor that exactly matches our CPU implementation — meaning this is not a speed-versus-correctness tradeoff. It is pure throughput gain with zero noise penalty.
To put 1,129 TPS in context: this means a single GPU can evaluate over one thousand encrypted gate operations every second. For workloads like encrypted comparison, encrypted sorting, or encrypted arithmetic on sensitive data, this moves TFHE from "theoretically possible" to "practically deployable." The entire H33-TFHE engine is proprietary — zero external TFHE dependencies, zero third-party FHE libraries. Every gate, every bootstrapping kernel, every noise management routine was written in-house specifically for this pipeline.
The Headline Numbers
- 1,129 TPS on NVIDIA A10G (production-ready)
- 1.63x over the v4 baseline (691 TPS)
- 1.0% noise, matching CPU exactly
- Three bugs fixed to reach production
- All proprietary: no external TFHE dependencies
What Is Multi-bit TFHE?
Standard TFHE operates on a single encrypted bit at a time. Each logic gate — AND, OR, XOR, NAND — takes two encrypted bits as input, produces one encrypted bit as output, and requires one programmable bootstrapping operation to refresh the noise. This is elegant and general-purpose: you can build any Boolean circuit, which means you can compute any function. The problem is that bootstrapping is expensive. For a 32-bit integer addition, you need roughly 32 full-adder circuits, each requiring multiple gates, each requiring a bootstrap. The bootstrapping count dominates the total computation time.
Multi-bit TFHE changes the economics. Instead of encoding one plaintext bit per ciphertext, multi-bit TFHE encodes multiple bits — typically 2, 3, or 4 — into a single encrypted sample. Each gate evaluation then operates on multiple plaintext bits simultaneously, which means the total number of bootstrapping operations required for a given computation drops proportionally. A 4-bit multi-bit encoding reduces the bootstrapping count by up to 4x compared to single-bit TFHE for the same logical operation.
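The arithmetic behind the "up to 4x" figure can be sketched in a few lines. This is an idealized model, not H33's cost model: the function name and the `gates_per_chunk` parameter are hypothetical, and it assumes exactly one bootstrap per gate and perfect packing of plaintext bits into ciphertexts.

```python
import math

def bootstrap_count(bit_width: int, bits_per_ciphertext: int, gates_per_chunk: int = 1) -> int:
    """Idealized bootstrap count for one logical operation.

    Assumes one bootstrap per gate and that the operand splits evenly
    into chunks of `bits_per_ciphertext` plaintext bits (hypothetical model).
    """
    chunks = math.ceil(bit_width / bits_per_ciphertext)
    return chunks * gates_per_chunk

single_bit = bootstrap_count(32, 1)   # one plaintext bit per ciphertext
four_bit = bootstrap_count(32, 4)     # four plaintext bits per ciphertext
print(single_bit, four_bit, single_bit / four_bit)
```

When the bit width is not a multiple of the packing factor (for example 32 bits at 3 bits per ciphertext), the last chunk is only partly used, which is one reason the reduction is "up to" the packing factor rather than exactly equal to it.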
The tradeoff is noise. Encoding more bits into each ciphertext narrows the noise margin — the gap between a correct decryption and a decryption error. Multi-bit TFHE requires tighter control over noise accumulation during bootstrapping, which makes the implementation harder and the parameter selection more delicate. This is precisely why GPU noise parity matters so much: if the GPU introduces even slightly more noise than the CPU, multi-bit encodings that work correctly on CPU will fail on GPU.
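A toy model makes the narrowing margin concrete. The sketch below assumes the common torus encoding in which a k-bit message m is placed at m / 2^k on the interval [0, 1), so decoding succeeds only while the noise stays under 1 / (2 * 2^k). It ignores padding bits and every other real parameter choice, and all function names are illustrative.

```python
def encode(message: int, bits: int) -> float:
    """Place a `bits`-bit message at message / 2^bits on the torus [0, 1)."""
    p = 1 << bits
    return (message % p) / p

def decode(phase: float, bits: int) -> int:
    """Round a noisy phase to the nearest message slot."""
    p = 1 << bits
    return round(phase * p) % p

def margin(bits: int) -> float:
    """Largest noise magnitude that still decodes correctly (half a slot)."""
    return 1.0 / (2 * (1 << bits))

def survives(bits: int, noise: float, message: int = 1) -> bool:
    """True if `message` still decodes correctly after adding `noise`."""
    phase = (encode(message, bits) + noise) % 1.0
    return decode(phase, bits) == message

for bits in (1, 2, 3, 4):
    print(bits, margin(bits), survives(bits, noise=0.1))
```

With the same noise of 0.1, the 1-bit and 2-bit encodings decode correctly and the 3-bit and 4-bit encodings do not, because each extra bit halves the margin.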
H33's multi-bit TFHE implementation handles this by maintaining strict noise budgets throughout the gate evaluation pipeline. The Number Theoretic Transform is used for polynomial multiplication during bootstrapping, and the entire noise management chain was validated against our CPU reference implementation to confirm the 1.0% noise floor that makes multi-bit encoding safe in production.
Why GPUs for TFHE
TFHE bootstrapping is, at its core, a polynomial multiplication problem. Each bootstrap involves multiplying polynomials of degree N (typically 1024 or 2048) in a specific ring structure. This polynomial multiplication is where most of the computation time is spent, and it is inherently parallelizable: the individual coefficient operations are independent, the butterfly stages of the underlying transforms can be batched, and the memory access patterns are regular enough to exploit GPU memory bandwidth effectively.
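As a reference point for what is being multiplied, here is a schoolbook sketch. It assumes the "specific ring structure" is the negacyclic ring Z_q[X]/(X^N + 1) conventionally used by TFHE-style schemes; the source does not name the ring, and the modulus below is a toy value. This is the O(N^2) definition, not the transform-based method a production kernel would use.

```python
def negacyclic_mul(a, b, q):
    """Multiply two polynomials in Z_q[X] / (X^N + 1), schoolbook style.

    `a` and `b` are coefficient lists of equal length N, lowest degree first.
    Because X^N = -1 in this ring, terms that wrap past degree N-1 are
    subtracted instead of added.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("operands must have the same length")
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                out[k] = (out[k] + ai * bj) % q
            else:
                out[k - n] = (out[k - n] - ai * bj) % q
    return out

# X^3 * X = X^4 = -1 in Z_17[X] / (X^4 + 1)
print(negacyclic_mul([0, 0, 0, 1], [0, 1, 0, 0], 17))
```

Every output coefficient depends only on the inputs, never on another output, which is the independence that makes the operation a good fit for wide parallel hardware.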
The NVIDIA A10G provides 24 GB of GDDR6 memory with 600 GB/s bandwidth and 31.2 TFLOPS of FP32 compute. For TFHE bootstrapping, the memory bandwidth is as important as the raw compute — each bootstrap reads and writes large polynomial buffers, and the ability to stream those buffers through the GPU's memory hierarchy without stalling is what separates a fast GPU implementation from a slow one.
There is a subtlety that many GPU-FHE implementations miss: TFHE bootstrapping is not embarrassingly parallel in the way that, say, matrix multiplication is. The sequential dependency chain within a single bootstrap (key switching, blind rotation, sample extraction) limits how much parallelism you can extract from one gate. The real GPU advantage comes from evaluating many gates simultaneously — batching hundreds or thousands of independent gate evaluations into a single GPU dispatch. This is why H33's GPU TFHE achieves 1,129 TPS rather than just making a single gate faster: we are running many gates in parallel across the GPU's streaming multiprocessors.
CPU implementations, even on hardware as powerful as the 96-vCPU Graviton4 instances we use for BFV production, cannot match this level of gate-level parallelism. CPUs excel at sequential bootstrapping with wide SIMD, but they lack the thousands of concurrent threads that a GPU brings to batched gate evaluation. The A10G is also widely available on AWS G5 instances, so this is not a theoretical result requiring exotic hardware: it deploys to standard cloud infrastructure.
The Three Bugs: How Instrumentation Was the Bottleneck
When we first ran the GPU multi-bit TFHE kernel, throughput was well below expectations. The initial hypothesis was that the Count-Min Sketch (CMS) admission layer was introducing overhead — it seemed plausible, since CMS is a probabilistic data structure that sits in the hot path. We spent time profiling the CMS layer, looking for lock contention, examining memory allocation patterns, and testing alternative sketch sizes. None of it helped.
The breakthrough came when we stopped looking at the computation itself and started looking at the measurement infrastructure. The three bugs we found and fixed tell a story about a lesson every systems engineer eventually learns the hard way: your instrumentation can be the bottleneck.
1. Instrumentation Was the Bug, Not CMS
The CMS admission layer was never the problem. The actual bottleneck was the instrumentation code we had wrapped around CMS to measure its performance. Every gate evaluation was updating a set of counters, histograms, and timing measurements that were protected by mutex locks. Under high GPU throughput, the instrumentation serialized what should have been a lock-free path. We were measuring the system so aggressively that the measurement itself became the dominant cost. Once we identified this, the path forward was clear: the instrumentation had to be rebuilt from scratch.
2. Atomic Stats Overhead
After removing the mutex-based instrumentation, we replaced it with atomic counters. This was better but still not fast enough. The atomic increment operations — even using relaxed memory ordering — were causing cache-line bouncing across CPU cores that were coordinating GPU dispatch. Every atomic fetch_add on a shared counter forces the cache line containing that counter to bounce between L1 caches of whichever cores are doing the incrementing. At 1,000+ operations per second, this cache contention was measurable. The fix was to move to a per-thread accumulator pattern: each thread maintains its own local counter, and the counters are aggregated only when stats are actually read, not on every operation.
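The per-thread accumulator pattern can be sketched as follows. This is an illustration of the shape of the fix, not H33's code: the class and method names are hypothetical, and Python's interpreter lock means it shows the structure (write locally, aggregate on read) rather than the cache-line behaviour.

```python
import threading

class ShardedCounter:
    """Counter where each thread increments its own private shard.

    The only lock is taken once per thread, when its shard is registered,
    and again on read. The increment path touches no shared state.
    """

    def __init__(self):
        self._local = threading.local()
        self._shards = []
        self._register_lock = threading.Lock()

    def _shard(self):
        shard = getattr(self._local, "shard", None)
        if shard is None:
            shard = [0]
            self._local.shard = shard
            with self._register_lock:  # once per thread, not per increment
                self._shards.append(shard)
        return shard

    def increment(self, n: int = 1) -> None:
        self._shard()[0] += n  # single writer per shard

    def read(self) -> int:
        """Aggregate all shards; only paid when stats are actually read."""
        with self._register_lock:
            return sum(shard[0] for shard in self._shards)

def worker(counter: ShardedCounter, n: int) -> None:
    for _ in range(n):
        counter.increment()

counter = ShardedCounter()
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.read())
```

A read taken while workers are still running may be slightly stale, which is usually acceptable for statistics and is the price of keeping the hot path free of shared writes.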
3. Lock-Free Metrics + Sampled Patterns
The final fix combined two changes. First, we moved all remaining metrics to a fully lock-free design using thread-local storage and periodic flush. Second, we switched from capturing every access pattern to a sampled approach: instead of recording every gate evaluation for pattern analysis, we sample 1-in-N evaluations (with N tuned to keep overhead below 0.1% of total throughput). The combination of lock-free metrics and sampled patterns brought instrumentation overhead from "dominant cost" to "statistically invisible." Throughput jumped from the v4 baseline of 691 TPS to the production number of 1,129 TPS — a 1.63x improvement, all from fixing the measurement layer.
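A minimal sketch of 1-in-N sampling, assuming a deterministic counter-based sampler (the source does not say whether sampling is deterministic or randomized). The class name and `sample_every` parameter are hypothetical, and in a threaded design each thread would own its own recorder so that no state is shared.

```python
class SampledRecorder:
    """Record only every N-th observed event instead of all of them."""

    def __init__(self, sample_every: int):
        if sample_every < 1:
            raise ValueError("sample_every must be at least 1")
        self.sample_every = sample_every
        self._seen = 0
        self.samples = []

    def observe(self, event) -> bool:
        """Count the event; keep it only if it falls on the sampling stride."""
        self._seen += 1
        if self._seen % self.sample_every == 0:
            self.samples.append(event)
            return True
        return False

recorder = SampledRecorder(sample_every=100)
for gate_id in range(1000):
    recorder.observe(gate_id)
print(len(recorder.samples))
```

The unsampled path costs one increment and one comparison, so the recording cost shrinks roughly in proportion to N.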
The 1.63x speedup from 691 to 1,129 TPS did not come from a faster kernel, a better algorithm, or more parallelism. It came entirely from fixing the instrumentation layer. The GPU kernel was already fast. We were just strangling it with our own measurement code. If your benchmarks show unexpectedly low throughput, profile the profiler before rewriting the computation.
Noise Analysis: 1.0% Matching CPU
Throughput without correctness is useless. A GPU TFHE implementation that runs 10x faster but introduces 5% more noise than the CPU version is not production-ready — it is a liability. Multi-bit TFHE operates with tighter noise margins than single-bit TFHE, which means even small noise regressions can cause decryption failures that would not appear in single-bit mode. Getting to 1.0% noise parity with the CPU was non-negotiable.
Our debugging methodology for noise analysis is strict: run an iteration-count sweep to isolate the noise source. Never guess where noise is coming from; measure the divergence point. The procedure works as follows:
- Run a single gate evaluation on both CPU and GPU with identical inputs and keys. Compare the output ciphertext coefficient-by-coefficient. If they match within tolerance, the single-gate path is clean.
- Increase the iteration count geometrically (1, 2, 4, 8, 16, 32, 64, 128 sequential gate evaluations). At each step, compare CPU and GPU output noise levels. The iteration count where GPU noise first diverges from CPU noise tells you the depth at which the accumulated discrepancy becomes measurable.
- Isolate the divergent stage. Once you know the iteration count where noise diverges, you can narrow down to the specific bootstrapping sub-operation (blind rotation, key switching, modulus switching) that is accumulating extra noise on the GPU.
- Fix and re-sweep. After each fix, re-run the full iteration-count sweep to confirm that noise parity is restored and that the fix did not introduce a regression at a different iteration depth.
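The sweep in the steps above can be sketched as a small harness. The noise functions here are simulated stand-ins (a real harness would measure noise from actual CPU and GPU ciphertexts), and the function name, tolerance, and drift rate are all hypothetical.

```python
def first_divergence(cpu_noise, gpu_noise, tolerance,
                     depths=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Return the first sweep depth where GPU and CPU noise differ by more
    than `tolerance`, or None if they agree at every depth."""
    for depth in depths:
        if abs(gpu_noise(depth) - cpu_noise(depth)) > tolerance:
            return depth
    return None

# Simulated models: the CPU holds a flat noise level, while the buggy GPU
# adds a small extra amount per sequential evaluation.
def cpu_model(depth):
    return 0.010

def buggy_gpu_model(depth):
    return 0.010 + 0.0002 * depth

def fixed_gpu_model(depth):
    return 0.010

print(first_divergence(cpu_model, buggy_gpu_model, tolerance=0.001))
print(first_divergence(cpu_model, fixed_gpu_model, tolerance=0.001))
```

In this simulation the single-gate comparison passes and the drift only crosses the tolerance several doublings later, which mirrors why a one-gate test alone is not enough. Re-running the same function after a fix is the "re-sweep" step.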
This methodology found all three noise-related issues in our GPU kernel. In each case, the single-gate test passed — the bug only manifested after multiple sequential evaluations, where tiny per-operation noise differences accumulated into measurable divergence. The iteration-count sweep pinpointed the exact depth where divergence began, which made root-causing each issue straightforward rather than requiring guesswork.
After the three bug fixes, the GPU kernel achieves a 1.0% noise floor that matches the CPU implementation exactly. This was validated across all bit widths (8, 16, 32, 64) and all gate types (Greater Than, Equality) in the 96-channel benchmark suite. The noise measurement is taken as the ratio of actual noise to the maximum tolerable noise budget, so a 1.0% floor on both sides means the GPU consumes the same fraction of the noise budget as the CPU, within measurement precision.
96-Channel Benchmark Results
The 96-channel benchmark suite evaluates H33-TFHE across every combination of bit width and gate type that we support in production. "96-channel" refers to the parallel evaluation width — 96 independent encrypted channels processed simultaneously, matching the vCPU count of our Graviton4 production instances for apples-to-apples comparison with the CPU baseline.
| Operation | Bit Width | Throughput (TPS) | Notes |
|---|---|---|---|
| Greater Than | 8-bit | 768 | Highest comparison throughput |
| Greater Than | 16-bit | 372 | 2.06x reduction from 8-bit (expected: linear in bit width) |
| Greater Than | 32-bit | 182 | 2.04x reduction from 16-bit |
| Greater Than | 64-bit | 91 | 2.0x reduction from 32-bit (perfect scaling) |
| Equality | 16-bit | 769 | Equality is faster than GT at same bit width (fewer sequential bootstraps) |
The scaling pattern is clean: doubling the bit width approximately halves the throughput for Greater Than operations. This is the expected behavior for a circuit-based comparison — a 16-bit comparison requires roughly twice as many gate evaluations as an 8-bit comparison. The consistency of this 2x factor across all bit widths confirms that the GPU is not introducing any unexpected overhead at wider bit widths.
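The ratios quoted in the table can be recomputed directly from the published throughput figures. The snippet uses only the numbers from the table; the variable names are illustrative.

```python
# Greater Than throughput from the 96-channel table, keyed by bit width.
gt_tps = {8: 768, 16: 372, 32: 182, 64: 91}

widths = sorted(gt_tps)
ratios = {
    f"{narrow}->{wide}": round(gt_tps[narrow] / gt_tps[wide], 2)
    for narrow, wide in zip(widths, widths[1:])
}
# If cost is linear in bit width, TPS * bit width should be roughly constant.
work_rate = {width: gt_tps[width] * width for width in widths}

print(ratios)
print(work_rate)
```

The product of throughput and bit width stays within about 5% across all four widths, which is another way of stating the near-linear scaling.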
The Equality result at 16-bit (769 TPS) is particularly interesting because it nearly matches the 8-bit Greater Than throughput (768 TPS). This reflects the structural difference between the two operations: equality checking can be parallelized more aggressively than ordered comparison because it does not require a sequential carry chain. Each bit position can be checked independently, and the results are combined with a tree of AND gates rather than a sequential ripple.
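The structural difference can be seen in a plaintext model of the two circuits. These are generic textbook constructions operating on ordinary bits, not H33's circuits, and the depth values count gate stages in this sketch only; in TFHE each gate would be a bootstrapped operation on encrypted bits.

```python
def to_bits(value: int, width: int):
    """Little-endian bit list (least significant bit first)."""
    return [(value >> i) & 1 for i in range(width)]

def equality_tree(a_bits, b_bits):
    """Per-bit XNOR followed by a balanced AND tree. Returns (result, depth)."""
    layer = [1 - (x ^ y) for x, y in zip(a_bits, b_bits)]  # independent per bit
    depth = 1
    while len(layer) > 1:
        reduced = [layer[i] & layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            reduced.append(layer[-1])
        layer = reduced
        depth += 1
    return layer[0], depth

def greater_than_ripple(a_bits, b_bits):
    """a > b via a sequential ripple from LSB to MSB. Returns (result, depth)."""
    gt = 0
    depth = 0
    for x, y in zip(a_bits, b_bits):
        # A differing bit decides at this position; equal bits pass the
        # lower-order result through unchanged.
        gt = (x & (1 - y)) | ((1 - (x ^ y)) & gt)
        depth += 1
    return gt, depth

print(equality_tree(to_bits(5, 16), to_bits(5, 16)))
print(greater_than_ripple(to_bits(9, 16), to_bits(5, 16)))
```

For 16-bit operands the equality tree in this sketch has 5 sequential stages while the ripple comparison has 16, which is the kind of gap in sequential depth the paragraph above describes.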
All benchmarks were run with the production noise budget and the multi-bit encoding that we ship. These are not "best-case" numbers with relaxed parameters — they reflect the actual throughput that production workloads will see.
Production Deployment on A10G
H33-TFHE GPU runs on NVIDIA A10G instances, which are available as AWS G5 instances. The A10G was chosen for three reasons: broad availability (G5 instances are in most AWS regions), cost efficiency (significantly cheaper per GPU-hour than A100 or H100 for the memory-bandwidth-bound workloads that TFHE creates), and 24 GB of GDDR6 memory (sufficient for the bootstrapping key material and ciphertext buffers that TFHE requires).
In the H33 production pipeline, TFHE serves a specific role: gate-level encrypted computation for operations that cannot be expressed as SIMD-batched arithmetic. While H33-128 (BFV) handles high-throughput batched workloads like biometric matching at over 2 million authentications per second, TFHE handles the operations that require arbitrary Boolean logic — encrypted comparisons, encrypted conditionals, encrypted sorting, and the critical SHA3-256 computation on encrypted bits that is part of H33's sealed tier architecture.
H33-TFHE is a core component of the sealed tier for SHA3-256 on encrypted bits. The sealed tier eliminates the decrypt-to-attest window entirely: data is born encrypted, processed encrypted, and attested encrypted. TFHE is designed to evaluate arbitrary Boolean circuits gate-by-gate with a bootstrap after every gate (an approach it shares with its predecessor, FHEW), which makes it the natural fit for computing a hash function (which requires bitwise operations, rotations, and non-linear functions) directly on encrypted data. The 1,129 TPS throughput on A10G makes this practical for production workloads rather than purely theoretical.
The deployment model keeps the GPU kernel isolated from the rest of the pipeline. TFHE gate evaluations are dispatched to the GPU as batched requests, with the CPU handling key management, ciphertext serialization, and the overall pipeline orchestration. This separation means that GPU failures or restarts do not affect the BFV or ZKP components of the pipeline — only TFHE-dependent operations are impacted, and they can fall back to CPU evaluation at reduced throughput while the GPU recovers.
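The fallback behaviour described above could look roughly like the following. Everything here is hypothetical: the function names, the use of `RuntimeError` as the failure signal, and the `Batch` type are placeholders, since the source does not describe the dispatch interface.

```python
from typing import Callable, List, Tuple

Batch = List[bytes]  # stand-in for a batch of serialized ciphertexts

def evaluate_with_fallback(
    batch: Batch,
    gpu_eval: Callable[[Batch], Batch],
    cpu_eval: Callable[[Batch], Batch],
) -> Tuple[Batch, str]:
    """Try the GPU path first; on failure, evaluate the same batch on CPU.

    Returns the results and which backend produced them.
    """
    try:
        return gpu_eval(batch), "gpu"
    except RuntimeError:
        # GPU unavailable or restarting: degrade to CPU at reduced throughput.
        return cpu_eval(batch), "cpu"

def failing_gpu(batch: Batch) -> Batch:
    raise RuntimeError("device lost")

def echo_cpu(batch: Batch) -> Batch:
    return list(batch)

print(evaluate_with_fallback([b"ct0", b"ct1"], failing_gpu, echo_cpu))
```

Returning the backend label alongside the results lets the caller track how often the degraded path is taken.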
What This Enables: Encrypted Computation Use Cases
At 1,129 TPS on commodity GPU hardware, multi-bit TFHE unlocks several categories of encrypted computation that were previously too slow for production deployment.
Encrypted comparison and sorting. Financial compliance requires comparing account balances against thresholds, sorting transactions by amount, and evaluating conditional logic — all operations that are trivial on plaintext but require Boolean circuits on encrypted data. The 96-channel benchmarks show that 8-bit Greater Than runs at 768 TPS and 16-bit Equality at 769 TPS, which is sufficient for real-time compliance screening on encrypted financial data.
Encrypted conditional logic. Many business rules are expressed as if-then-else conditions: if the patient's age exceeds 65, apply a different risk model; if the transaction amount exceeds $10,000, flag for review. TFHE is the natural scheme for these workloads because it evaluates each condition as a Boolean circuit, preserving the encrypted state throughout. At 1,129 TPS, a single GPU can evaluate over a thousand conditional rules per second without ever seeing the underlying data.
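An if-then-else on encrypted data is typically expressed as a multiplexer, because the evaluator cannot branch on a value it cannot read. The sketch below uses plaintext bits as stand-ins for encrypted bits; the function names are illustrative, and in a real TFHE pipeline each gate would be a bootstrapped operation and `cond` would be the encrypted output of a comparison.

```python
def mux(cond: int, if_true: int, if_false: int) -> int:
    """Select one of two bits without branching on `cond`."""
    return (cond & if_true) | ((1 - cond) & if_false)

def select_word(cond: int, a_bits, b_bits):
    """Apply the same selection bit to every position of two equal-width words."""
    return [mux(cond, x, y) for x, y in zip(a_bits, b_bits)]

# cond = 1 selects the first word, cond = 0 selects the second.
print(select_word(1, [1, 0, 1], [0, 1, 1]))
print(select_word(0, [1, 0, 1], [0, 1, 1]))
```

Both branches are always evaluated and the condition only selects between them, so the cost of a conditional rule is the comparison plus one multiplexer per output bit.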
Encrypted hash computation (sealed tier). Computing SHA3-256 on encrypted bits is the most demanding TFHE workload in the H33 pipeline. SHA3 involves 24 rounds of permutation, each requiring thousands of Boolean gate evaluations on 1,600 encrypted state bits. This is computationally intensive even with GPU acceleration, but the 1.63x throughput improvement makes it viable within the latency budget of the sealed tier architecture. The sealed tier exists to close the decrypt-to-attest window — data is never plaintext outside the generation boundary.
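To give a sense of scale, the non-linear step of the Keccak permutation (chi) alone uses one AND per state bit per round. The sketch below implements chi on a single 5-bit row with plaintext bits and counts only those AND gates; it leaves out the XOR-heavy steps, so it is a lower bound on gate count and says nothing about how H33 maps gates to bootstraps.

```python
STATE_BITS = 1600  # Keccak-f[1600] state size
ROUNDS = 24        # rounds in the SHA-3 permutation
ROW_WIDTH = 5      # chi operates on rows of 5 bits

def chi_row(row):
    """Keccak chi on one row: out[x] = row[x] XOR (NOT row[x+1] AND row[x+2])."""
    return [
        row[x] ^ ((1 - row[(x + 1) % ROW_WIDTH]) & row[(x + 2) % ROW_WIDTH])
        for x in range(ROW_WIDTH)
    ]

and_gates_per_round = STATE_BITS            # one AND per state bit in chi
total_and_gates = and_gates_per_round * ROUNDS

print(chi_row([1, 0, 0, 0, 0]))
print(total_and_gates)
```

Even this partial count is in the tens of thousands of gates for a single permutation, which is why throughput per GPU matters so much for this workload.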
Post-quantum secure by construction. TFHE's security is based on the Learning With Errors (LWE) problem over lattices, which belongs to the same family of hardness assumptions as the NIST post-quantum standards (ML-KEM and ML-DSA are both built on module-lattice problems). There is no known quantum algorithm that efficiently solves LWE. This means every TFHE computation is post-quantum secure by construction: no additional post-quantum wrapper is needed, because the encryption scheme itself is already quantum-resistant.
Zero External Dependencies
H33-TFHE is entirely proprietary. We do not use any external TFHE library, FHE framework, or third-party bootstrapping implementation. Every component — gate evaluation, programmable bootstrapping, key switching, noise management, GPU kernel dispatch — was built in-house. This gives us full control over the noise budget, parameter selection, and optimization surface. It also means we are not constrained by the API design or performance limitations of any open-source TFHE library.
Frequently Asked Questions
What is multi-bit TFHE and how does it differ from standard TFHE?
Standard TFHE encodes one plaintext bit per ciphertext and evaluates one bit per gate. Multi-bit TFHE encodes multiple plaintext bits (typically 2–4) into a single ciphertext, which means each gate evaluation processes multiple bits simultaneously. This reduces the total number of bootstrapping operations needed for a given computation. The tradeoff is a tighter noise margin, which requires more careful parameter selection and more precise bootstrapping implementation. H33's GPU multi-bit TFHE achieves the same 1.0% noise floor as the CPU reference implementation, confirming that the tighter margin is managed correctly.
What throughput does H33 GPU TFHE achieve, and on what hardware?
H33 GPU multi-bit TFHE achieves 1,129 transactions per second on a single NVIDIA A10G GPU. The A10G is available on AWS G5 instances and provides 24 GB GDDR6 memory with 600 GB/s bandwidth. This is a 1.63x improvement over the v4 baseline of 691 TPS. The throughput was measured under production noise budgets and multi-bit encoding; these are not relaxed-parameter estimates.
Does the GPU implementation introduce more noise than the CPU version?
No. After fixing three production bugs (all in the instrumentation layer, not in the computation itself), the GPU implementation achieves a 1.0% noise floor that exactly matches the CPU implementation. This was validated using an iteration-count sweep methodology across all supported bit widths (8, 16, 32, 64) and gate types (Greater Than, Equality). The noise parity means that any multi-bit encoding that works correctly on CPU will also work correctly on GPU.
Why did fixing instrumentation improve throughput by 1.63x?
The GPU kernel itself was never slow. The bottleneck was the measurement infrastructure: mutex-protected counters that serialized the hot path, atomic operations that caused cache-line bouncing, and full-capture pattern recording that consumed bandwidth. Replacing these with per-thread accumulators, lock-free metrics, and sampled pattern capture reduced the overhead to a statistically invisible level. The lesson: if your benchmark shows unexpectedly low throughput, profile the profiler before rewriting the computation.
Is H33 TFHE post-quantum secure?
Yes. TFHE is built on the Learning With Errors (LWE) lattice problem, which belongs to the same family of hardness assumptions that underlies the NIST post-quantum standards ML-KEM (Kyber) and ML-DSA (Dilithium). There is no known quantum algorithm that efficiently solves LWE. Every TFHE computation is post-quantum secure by construction, without requiring an additional post-quantum encryption layer.
How does H33 TFHE fit into the broader H33 production pipeline?
H33 uses multiple FHE schemes for different workloads. H33-128 (BFV) handles high-throughput SIMD-batched workloads like biometric authentication at over 2 million operations per second. H33-CKKS handles approximate arithmetic for ML inference. H33-TFHE handles gate-level Boolean computation — encrypted comparisons, conditionals, and hash functions — that cannot be expressed as batched arithmetic. The GPU-accelerated TFHE at 1,129 TPS is specifically part of the sealed tier for SHA3-256 on encrypted bits.
What does "96-channel" mean in the benchmark results?
96-channel refers to 96 independent encrypted computation channels processed in parallel, matching the vCPU count of the Graviton4 metal instances used for CPU baselines. This ensures an apples-to-apples comparison between GPU and CPU throughput. Each channel processes an independent encrypted gate evaluation, and the aggregate throughput across all 96 channels is the reported TPS number.