This document presents production benchmarks for the H33 TFHE (Fast Fully Homomorphic Encryption over the Torus) implementation. TFHE operates on individual encrypted bits and supports arbitrary Boolean circuits through programmable bootstrapping (PBS). The H33 implementation runs 96 parallel channels on AWS Graviton4, one per physical core, to reach the throughput reported below for bit-level encrypted computation.
| Bit Width | Operation | TPS (96-channel) | Per-PBS Latency |
|---|---|---|---|
| 8-bit | Greater-than comparison | 768 | 1.30 ms |
| 16-bit | Greater-than comparison | 372 | 2.69 ms |
| 32-bit | Greater-than comparison | 182 | 5.49 ms |
| 64-bit | Greater-than comparison | 91 | 10.99 ms |
| 16-bit | Equality comparison | 769 | 1.30 ms |
Greater-than throughput roughly halves each time the bit width doubles (768, 372, 182, and 91 TPS) because wider comparisons require more sequential PBS evaluations: approximately 8 for the 8-bit comparison and approximately 64 for the 64-bit comparison.
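
The sequential dependency can be illustrated with a plaintext model. The sketch below is not H33 code: `gt_step` and `greater_than` are hypothetical names, and it assumes one three-input lookup per bit stands in for one PBS. It only shows why an n-bit comparison needs about n dependent steps.

```rust
/// Plaintext stand-in for one programmable bootstrap (assumption): a
/// three-input lookup that folds bit `i` of `a` and `b` into the running
/// "a > b" flag.
fn gt_step(a_bit: bool, b_bit: bool, gt_so_far: bool) -> bool {
    // A differing bit overrides everything decided by less significant bits.
    if a_bit != b_bit { a_bit } else { gt_so_far }
}

/// Compares the low `bits` bits of `a` and `b`, least significant bit first.
/// Returns the result and the number of sequential lookups performed.
fn greater_than(a: u64, b: u64, bits: u32) -> (bool, u32) {
    let mut gt = false;
    let mut lookups = 0;
    for i in 0..bits {
        let a_bit = (a >> i) & 1 == 1;
        let b_bit = (b >> i) & 1 == 1;
        // Each step consumes the previous flag, so the steps cannot overlap.
        gt = gt_step(a_bit, b_bit, gt);
        lookups += 1;
    }
    (gt, lookups)
}

fn main() {
    for bits in [8u32, 16, 32, 64] {
        let (gt, lookups) = greater_than(200, 100, bits);
        println!("{bits}-bit: 200 > 100 = {gt}, sequential lookups = {lookups}");
    }
}
```

In this model the lookup count equals the bit width, which matches the approximate PBS counts quoted above.
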
| Gate | Single-Thread Latency | PBS Required | 96-Channel TPS |
|---|---|---|---|
| NOT | 0.8 µs | No | N/A (trivial) |
| AND | 1.30 ms | Yes (1) | 73,846 |
| OR | 1.30 ms | Yes (1) | 73,846 |
| XOR | 1.30 ms | Yes (1) | 73,846 |
| NAND | 1.30 ms | Yes (1) | 73,846 |
| NOR | 1.30 ms | Yes (1) | 73,846 |
All gates except NOT have identical latency because each requires exactly one programmable bootstrapping invocation. The NOT gate is a trivial negation of the ciphertext with no PBS and completes in sub-microsecond time.
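
To see why NOT needs no bootstrapping, consider a toy LWE ciphertext over the discretized torus. The sketch below is illustrative only and makes several assumptions that are not taken from the H33 code: a tiny dimension `N`, a ±1/8 message encoding, a linear congruential generator for the mask, and the hypothetical names `Lwe`, `encrypt`, `decrypt`, and `not`. It offers no security.

```rust
// Toy LWE over the torus Z / 2^32. Illustrative only: tiny dimension,
// non-cryptographic randomness, no security.
const N: usize = 8;
const ONE_EIGHTH: u32 = 1 << 29; // 1/8 of the torus

struct Lwe {
    a: [u32; N],
    b: u32,
}

/// Assumed encoding: true -> +1/8, false -> -1/8.
fn encode(bit: bool) -> u32 {
    if bit { ONE_EIGHTH } else { ONE_EIGHTH.wrapping_neg() }
}

fn dot(a: &[u32; N], key: &[u32; N]) -> u32 {
    let mut acc = 0u32;
    for i in 0..N {
        acc = acc.wrapping_add(a[i].wrapping_mul(key[i]));
    }
    acc
}

fn encrypt(bit: bool, key: &[u32; N], seed: &mut u32) -> Lwe {
    let mut a = [0u32; N];
    for slot in a.iter_mut() {
        *seed = seed.wrapping_mul(1664525).wrapping_add(1013904223);
        *slot = *seed;
    }
    *seed = seed.wrapping_mul(1664525).wrapping_add(1013904223);
    let noise = *seed >> 28; // small noise in 0..16
    let b = dot(&a, key).wrapping_add(encode(bit)).wrapping_add(noise);
    Lwe { a, b }
}

fn decrypt(ct: &Lwe, key: &[u32; N]) -> bool {
    let phase = ct.b.wrapping_sub(dot(&ct.a, key));
    (phase as i32) > 0
}

/// NOT is a coefficient-wise negation: no key material, no bootstrapping.
fn not(ct: &Lwe) -> Lwe {
    let mut a = [0u32; N];
    for i in 0..N {
        a[i] = ct.a[i].wrapping_neg();
    }
    Lwe { a, b: ct.b.wrapping_neg() }
}

fn main() {
    let key = [1, 0, 1, 1, 0, 0, 1, 0];
    let mut seed = 42u32;
    for bit in [false, true] {
        let ct = encrypt(bit, &key, &mut seed);
        println!("NOT {bit} = {}", decrypt(&not(&ct), &key));
    }
}
```

Negation flips the sign of the phase, and with it the decrypted bit, while leaving the noise magnitude unchanged, so there is nothing for a bootstrap to refresh.
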
The 96-channel architecture assigns one independent PBS computation per physical core. Each channel maintains its own bootstrapping key and operates on independent ciphertexts.
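
A minimal sketch of this channel layout, using only the Rust standard library, might look as follows. `Channel`, `mock_pbs`, and `run_channels` are hypothetical names, the PBS is replaced by a placeholder function, and core pinning is omitted because `std` does not provide it.

```rust
use std::thread;

/// Hypothetical per-channel state: each channel owns its own key material.
struct Channel {
    bootstrap_key: Vec<u64>,
}

/// Placeholder for a real PBS: it reads the key and never writes to it.
fn mock_pbs(key: &[u64], ciphertext: u64) -> u64 {
    key.iter().fold(ciphertext, |acc, k| acc.rotate_left(7) ^ k)
}

/// Runs one worker thread per channel. Each worker sees only its own key and
/// its own batch, so no locks or shared mutable state are involved.
fn run_channels(channels: &[Channel], batches: &[Vec<u64>]) -> Vec<Vec<u64>> {
    thread::scope(|s| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = channels
            .iter()
            .zip(batches)
            .map(|(ch, batch)| {
                s.spawn(move || {
                    batch
                        .iter()
                        .map(|&ct| mock_pbs(&ch.bootstrap_key, ct))
                        .collect::<Vec<u64>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("channel thread panicked"))
            .collect()
    })
}

fn main() {
    // Four channels keep the demo small; the benchmarks use 96.
    let channels: Vec<Channel> = (0..4u64)
        .map(|i| Channel { bootstrap_key: vec![i + 1; 16] })
        .collect();
    let batches: Vec<Vec<u64>> = (0..4u64).map(|i| vec![i; 3]).collect();
    let results = run_channels(&channels, &batches);
    println!("{} channels produced {} results each", results.len(), results[0].len());
}
```
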
| Channels | PBS/sec (aggregate) | Scaling Efficiency |
|---|---|---|
| 1 | 769 | 100% |
| 8 | 6,140 | 99.7% |
| 24 | 18,360 | 99.3% |
| 48 | 36,480 | 98.8% |
| 96 | 73,846 | 99.9% |
Scaling is near-linear because TFHE PBS has no shared mutable state across channels. Each bootstrapping key is approximately 24 MiB, so 96 channels need approximately 2.25 GiB (96 × 24 MiB = 2,304 MiB) of bootstrapping key memory, well within the 371 GiB available.
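
The memory figure can be checked with a few lines of arithmetic. The helper name `total_key_gib` is hypothetical; the inputs are the values quoted above.

```rust
const MIB_PER_GIB: f64 = 1024.0;

/// Total bootstrapping key memory in GiB when every channel holds its own key.
fn total_key_gib(channels: u32, key_mib: f64) -> f64 {
    f64::from(channels) * key_mib / MIB_PER_GIB
}

fn main() {
    let total = total_key_gib(96, 24.0);
    let available_gib = 371.0;
    println!(
        "bootstrapping keys: {total:.2} GiB of {available_gib} GiB ({:.2}% of memory)",
        100.0 * total / available_gib
    );
}
```
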
TFHE PBS resets the noise budget to a fixed level after each gate evaluation. The noise budget before and after PBS is as follows:
| Parameter | Value |
|---|---|
| Initial noise budget (fresh ciphertext) | 24 bits |
| Noise consumed per gate (without PBS) | ~3 bits |
| Noise budget after PBS refresh | 24 bits (reset) |
| Maximum gates without PBS before failure | ~7 |
| Decryption failure probability (per PBS) | < 2^(-40) |
Because each PBS invocation resets the noise budget to 24 bits regardless of the noise level before the operation, circuits of arbitrary depth can be evaluated without noise accumulating across gates. This is a fundamental advantage of TFHE over leveled schemes such as BFV, where the supported multiplicative depth is fixed by the parameters chosen up front.
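
The table can be read as a simple budget model, sketched below. It assumes a gate costs exactly 3 bits and that decryption needs a strictly positive budget; both are simplifications of the approximate values above, and `NoiseBudget` is a hypothetical type rather than part of the H33 API.

```rust
const FRESH_BUDGET_BITS: i32 = 24;
const COST_PER_GATE_BITS: i32 = 3; // assumed exact; the table says ~3

struct NoiseBudget {
    bits: i32,
}

impl NoiseBudget {
    fn fresh() -> Self {
        Self { bits: FRESH_BUDGET_BITS }
    }

    /// Applies one gate without PBS. Returns false, leaving the budget
    /// untouched, if the gate would exhaust it.
    fn gate_without_pbs(&mut self) -> bool {
        if self.bits - COST_PER_GATE_BITS <= 0 {
            return false;
        }
        self.bits -= COST_PER_GATE_BITS;
        true
    }

    /// Models the PBS refresh: the budget returns to its fresh level.
    fn bootstrap(&mut self) {
        self.bits = FRESH_BUDGET_BITS;
    }
}

/// Counts how many gates fit before the budget would be exhausted.
fn max_gates_without_pbs() -> u32 {
    let mut budget = NoiseBudget::fresh();
    let mut gates = 0;
    while budget.gate_without_pbs() {
        gates += 1;
    }
    gates
}

fn main() {
    println!("gates without PBS: {}", max_gates_without_pbs());

    // With a bootstrap after every gate, depth is unbounded in this model.
    let mut budget = NoiseBudget::fresh();
    for _ in 0..1_000 {
        assert!(budget.gate_without_pbs());
        budget.bootstrap();
    }
    println!("budget after 1000 bootstrapped gates: {} bits", budget.bits);
}
```

Under these assumptions the model allows 7 gates without PBS, in line with the "~7" entry in the table.
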
The following comparison uses published Zama TFHE-rs benchmarks on AWS m6g.metal (Graviton2), an older generation of the same processor family. Zama's numbers are taken from their public benchmark repository as of 2026-04.
| Metric | H33 TFHE (Graviton4) | Zama TFHE-rs (Graviton2) | Ratio |
|---|---|---|---|
| 8-bit GT (single thread) | 1.30 ms | 8.4 ms | 6.5x |
| 16-bit GT (single thread) | 2.69 ms | 18.1 ms | 6.7x |
| 8-bit GT (96-channel) | 768 TPS | N/A | -- |
| PBS per gate | 1.30 ms | 8.0 ms | 6.2x |
The comparison is approximate. Graviton4 has architectural advantages over Graviton2 (SVE2 vector support, which Graviton2 lacks; better branch prediction; a larger L2 cache). A same-hardware comparison is estimated to reduce the ratio to approximately 3-4x, attributable to the H33 Montgomery radix-4 NTT and custom polynomial multiplication kernel.
H33 TFHE also supports GPU-accelerated multi-bit PBS on NVIDIA A10G:
| Configuration | TPS | Noise Fidelity |
|---|---|---|
| CPU 96-channel (Graviton4) | 768 | Reference |
| GPU multi-bit (A10G, v4 kernel) | 691 | 1.0% deviation |
| GPU multi-bit (A10G, v5 kernel) | 1,129 | 1.0% deviation |
The v5 GPU kernel achieves 1,129 TPS, a 1.63x improvement over v4 (691 TPS) and a 1.47x improvement over the CPU baseline (768 TPS). Its noise deviates from the CPU reference by 1.0%, which is within measurement tolerance. The v5 kernel fixes three bugs: an accumulator bit-width mismatch, a shared-memory bank conflict in the NTT, and incorrect rounding in the key-switch decomposition.
# Build TFHE benchmarks
$ cargo build --release --features tfhe_bench
# Run 8-bit greater-than comparison (96 channels)
$ ./target/release/examples/tfhe_bench --op gt --bits 8 --channels 96
# Run gate-level benchmarks
$ ./target/release/examples/tfhe_bench --op gates --channels 96
# Run noise budget analysis
$ ./target/release/examples/tfhe_bench --noise-analysis --iterations 10000
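
For readers who want to reproduce the style of measurement rather than the exact tool, the sketch below shows one way to time an operation and convert a latency into aggregate throughput. It is not the `tfhe_bench` source: `mean_latency_ms` and `aggregate_tps` are hypothetical helpers, and the timed workload is a placeholder loop rather than a PBS.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Mean wall-clock latency of `op` in milliseconds over `iterations` runs.
fn mean_latency_ms<F: FnMut()>(mut op: F, iterations: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iterations {
        op();
    }
    start.elapsed().as_secs_f64() * 1e3 / f64::from(iterations)
}

/// Aggregate operations per second if `channels` independent workers each
/// complete one operation every `latency_ms` milliseconds.
fn aggregate_tps(channels: u32, latency_ms: f64) -> f64 {
    f64::from(channels) * 1e3 / latency_ms
}

fn main() {
    // Placeholder workload; a real run would time one PBS here.
    let mut acc = 0u64;
    let latency = mean_latency_ms(
        || {
            for i in 0..10_000u64 {
                acc = black_box(acc.wrapping_mul(31).wrapping_add(i));
            }
        },
        100,
    );
    println!("placeholder workload: {latency:.4} ms per iteration");

    // Latency-to-throughput conversion using the 1.30 ms PBS latency.
    println!("1 channel:   {:.0} PBS/sec", aggregate_tps(1, 1.30));
    println!("96 channels: {:.0} PBS/sec", aggregate_tps(96, 1.30));
}
```

With a 1.30 ms PBS latency, this conversion gives about 769 PBS/sec for one channel and about 73,846 PBS/sec for 96 channels, the same figures that appear in the gate and scaling tables.
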