TFHE Gate Performance

Version: 1.0.0
Status: Production
Last Updated: 2026-05-23
Hardware: AWS c8g.metal-48xl (Graviton4, 192 vCPU)
Channels: 96-channel parallel execution
Canonical URL: https://h33.ai/benchmarks/tfhe/

1. Scope

This document presents production benchmarks for the H33 TFHE (Fully Homomorphic Encryption over the Torus) implementation. TFHE operates on individual encrypted bits and supports arbitrary Boolean circuits through programmable bootstrapping (PBS). The H33 implementation uses 96-channel parallel execution on Graviton4 to achieve throughput suitable for bit-level encrypted computation.

2. Definitions

Programmable Bootstrapping (PBS): The core TFHE operation that evaluates an arbitrary lookup table on an encrypted value while simultaneously refreshing the ciphertext's noise. PBS is the computational bottleneck in TFHE circuits.
Gate Operation: A single Boolean operation (AND, OR, XOR, NOT, NAND, NOR) on encrypted bits. Each gate (except NOT) requires one PBS invocation.
96-Channel Execution: The H33 technique of executing 96 independent PBS operations in parallel, one per physical core on the Graviton4 c8g.metal-48xl (192 vCPU = 96 physical cores). Each channel operates on independent ciphertexts with no shared state.
Noise Budget: The remaining margin for correct decryption. Each PBS operation resets the noise budget to a fixed level (determined by parameters). Gates without PBS (NOT) do not consume noise budget.

3. Programmable Bootstrapping Throughput

Bit Width	Operation	TPS (96-channel)	Single-Thread Latency
8-bit	Greater-than comparison	768	1.30 ms
16-bit	Greater-than comparison	372	2.69 ms
32-bit	Greater-than comparison	182	5.49 ms
64-bit	Greater-than comparison	91	10.99 ms
16-bit	Equality comparison	769	1.30 ms

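Throughput scales inversely with bit width because wider comparisons require more sequential PBS evaluations: each doubling of the bit width roughly doubles latency and halves throughput (1.30 ms at 8 bits, 10.99 ms at 64 bits). The 8-bit greater-than comparison requires approximately 8 PBS invocations; the 64-bit comparison requires approximately 64. The 16-bit equality comparison is faster than the 16-bit greater-than comparison and runs at the same 1.30 ms as the 8-bit greater-than comparison.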
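
The dependence on bit width can be illustrated with a plaintext model of a bit-serial comparison. This is a minimal sketch, not the H33 circuit: the function name is hypothetical, it works on ordinary integers rather than ciphertexts, and it assumes one sequential step per bit, with each step standing in for one lookup-table evaluation.

```python
def greater_than(a: int, b: int, bits: int) -> tuple[bool, int]:
    """Plaintext model of a bit-serial a > b comparison.

    Returns the comparison result and the number of sequential steps.
    Each step stands in for one lookup-table evaluation (an assumption
    made for illustration only).
    """
    gt = False  # running result over the bits processed so far
    steps = 0
    for i in range(bits):  # least significant bit first
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        # A more significant bit overrides the running result
        # whenever the two bits differ.
        if a_i != b_i:
            gt = a_i == 1
        steps += 1
    return gt, steps


for width in (8, 16, 32, 64):
    result, steps = greater_than(200, 100, width)
    print(f"{width}-bit: result={result}, sequential steps={steps}")
```

Because each step depends on the running result of the previous one, the steps cannot be reordered, so the step count grows linearly with the bit width in this model.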

4. Gate Operation Latencies

Gate	Single-Thread Latency	PBS Required	96-Channel TPS
NOT	0.8 us	No	N/A (trivial)
AND	1.30 ms	Yes (1)	73,846
OR	1.30 ms	Yes (1)	73,846
XOR	1.30 ms	Yes (1)	73,846
NAND	1.30 ms	Yes (1)	73,846
NOR	1.30 ms	Yes (1)	73,846

All gates except NOT have identical latency because each requires exactly one programmable bootstrapping invocation. The NOT gate is a trivial negation of the ciphertext with no PBS and completes in sub-microsecond time.
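
The 96-channel gate throughput in the table follows directly from the single-thread latency if the channels are fully independent. A short worked calculation, with a hypothetical helper name:

```python
def aggregate_tps(latency_ms: float, channels: int) -> float:
    """Operations per second when `channels` independent workers
    each complete one operation every `latency_ms` milliseconds."""
    return channels * 1000.0 / latency_ms


print(round(aggregate_tps(1.30, 1)))   # 769
print(round(aggregate_tps(1.30, 96)))  # 73846
```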

5. 96-Channel Parallel Execution

The 96-channel architecture assigns one independent PBS computation per physical core. Each channel maintains its own bootstrapping key and operates on independent ciphertexts.
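
The channel model can be sketched as a static partition of the work across workers that each own their state. The class and function names below are hypothetical, the key is a placeholder string, and the bit flip stands in for a PBS evaluation. Python threads are used only to show the structure; they do not provide the per-core parallelism of a native implementation.

```python
from concurrent.futures import ThreadPoolExecutor


class Channel:
    """One independent worker with its own private key material."""

    def __init__(self, channel_id: int):
        self.channel_id = channel_id
        self.key = f"bsk-{channel_id}"  # placeholder for a per-channel bootstrapping key
        self.processed = 0

    def run(self, items: list[int]) -> list[int]:
        out = []
        for x in items:
            out.append(x ^ 1)  # placeholder for one PBS evaluation
            self.processed += 1
        return out


def dispatch(items: list[int], n_channels: int):
    channels = [Channel(i) for i in range(n_channels)]
    # Static round-robin partition: channel i gets items i, i+n, i+2n, ...
    shards = [items[i::n_channels] for i in range(n_channels)]
    with ThreadPoolExecutor(max_workers=n_channels) as pool:
        # Each channel touches only its own shard and its own counters.
        results = list(pool.map(Channel.run, channels, shards))
    return channels, results


channels, results = dispatch(list(range(192)), 96)
print(sum(c.processed for c in channels))  # 192
```

Since no channel reads or writes another channel's data, no locks are needed in this sketch.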

Channels	PBS/sec (aggregate)	Scaling Efficiency
1	769	100%
8	6,140	99.7%
24	18,360	99.3%
48	36,480	98.8%
96	73,846	99.9%

Near-perfect linear scaling is achieved because TFHE PBS has no shared mutable state. Each bootstrapping key is approximately 24 MiB; at 96 channels, total bootstrapping key memory is approximately 2.25 GiB, well within the 371 GiB available.
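
Scaling efficiency and key memory can be recomputed from the figures above. In this sketch, efficiency is taken to be aggregate throughput divided by the channel count times the single-channel throughput, which is an assumption about how the table was derived. Results computed from the rounded table values may differ from the listed percentages by a few tenths of a point.

```python
def efficiency(pbs_per_sec: float, channels: int, single: float) -> float:
    """Scaling efficiency in percent relative to ideal linear scaling."""
    return 100.0 * pbs_per_sec / (channels * single)


def key_memory_gib(key_mib: float, channels: int) -> float:
    """Total bootstrapping key memory when every channel holds its own key."""
    return key_mib * channels / 1024.0


measured = {1: 769, 8: 6140, 24: 18360, 48: 36480, 96: 73846}
for channels, pbs_per_sec in measured.items():
    print(f"{channels:>2} channels: {efficiency(pbs_per_sec, channels, measured[1]):.1f}%")

print(key_memory_gib(24, 96))  # 2.25
```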

6. Noise Budget Analysis

TFHE PBS resets the noise budget to a fixed level after each gate evaluation. The noise budget before and after PBS is as follows:

Parameter	Value
Initial noise budget (fresh ciphertext)	24 bits
Noise consumed per gate (without PBS)	~3 bits
Noise budget after PBS refresh	24 bits (reset)
Maximum gates without PBS before failure	~7
Decryption failure probability (per PBS)	< 2^(-40)

The PBS refresh mechanism allows arbitrary circuit depth without noise accumulation. Each PBS invocation resets the noise budget to 24 bits regardless of the noise level before the operation. This is a fundamental advantage of TFHE over leveled schemes like BFV.
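
The effect of the refresh can be illustrated with a simplified linear model of the figures in the table. The model treats the cost per gate as exactly 3 bits and requires the budget to stay strictly positive; both are assumptions made for illustration, and real noise growth is statistical rather than a fixed subtraction. The function names are hypothetical.

```python
FRESH_BUDGET_BITS = 24   # budget of a fresh or just-bootstrapped ciphertext
COST_PER_GATE_BITS = 3   # assumed fixed cost of one gate evaluated without PBS


def max_gates_without_pbs(budget: int = FRESH_BUDGET_BITS,
                          cost: int = COST_PER_GATE_BITS) -> int:
    """Number of gates that can run before the budget would reach zero."""
    gates = 0
    while budget - cost > 0:
        budget -= cost
        gates += 1
    return gates


def circuit_survives(depth: int, pbs_every: int) -> bool:
    """Run `depth` gates, refreshing the budget after every `pbs_every` gates."""
    budget = FRESH_BUDGET_BITS
    for gate in range(1, depth + 1):
        budget -= COST_PER_GATE_BITS
        if budget <= 0:
            return False  # budget exhausted: decryption would fail
        if gate % pbs_every == 0:
            budget = FRESH_BUDGET_BITS  # PBS resets the budget
    return True


print(max_gates_without_pbs())      # 7
print(circuit_survives(10_000, 1))  # True
print(circuit_survives(10_000, 8))  # False
```

With a refresh after every gate, as in standard gate bootstrapping, the depth of the circuit does not affect the outcome in this model.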

7. Comparison: H33 TFHE vs Zama TFHE-rs

The following comparison uses published Zama TFHE-rs benchmarks measured on different, older hardware (AWS m6g.metal, Graviton2), so the two columns are not a same-hardware comparison. Numbers for Zama are from their public benchmark repository as of 2026-04.

Metric	H33 TFHE (Graviton4)	Zama TFHE-rs (Graviton2)	Ratio
8-bit GT (single thread)	1.30 ms	8.4 ms	6.5x
16-bit GT (single thread)	2.69 ms	18.1 ms	6.7x
8-bit GT (96-channel)	768 TPS	N/A	--
PBS latency per gate	1.30 ms	8.0 ms	6.2x

The comparison is approximate. Graviton4 has architectural advantages over Graviton2 (SVE2 vector support, which Graviton2 lacks, better branch prediction, larger L2 cache). A same-hardware comparison would be expected to reduce the ratio to approximately 3-4x; that remaining gap is attributed to the H33 Montgomery radix-4 NTT and custom polynomial multiplication kernel.
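
The Ratio column is the Zama latency divided by the H33 latency, rounded to one decimal place. A short check using the values from the table, with hypothetical variable names:

```python
# (H33 latency in ms, Zama latency in ms), copied from the table above
rows = {
    "8-bit GT (single thread)": (1.30, 8.4),
    "16-bit GT (single thread)": (2.69, 18.1),
    "PBS latency per gate": (1.30, 8.0),
}

ratios = {name: round(zama / h33, 1) for name, (h33, zama) in rows.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```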

8. GPU Acceleration

H33 TFHE also supports GPU-accelerated multi-bit PBS on NVIDIA A10G:

Configuration	TPS	Noise Fidelity
CPU 96-channel (Graviton4)	768	Reference
GPU multi-bit (A10G, v4 kernel)	691	1.0% deviation
GPU multi-bit (A10G, v5 kernel)	1,129	1.0% deviation

The v5 GPU kernel achieves 1,129 TPS, a 1.63x improvement over v4 and a 1.47x improvement over the CPU baseline; the v4 kernel, at 691 TPS, is slower than the CPU baseline. Noise deviation from the CPU reference is 1.0% for both kernels, which is within measurement tolerance. Three bugs were fixed in the v5 kernel: an accumulator bit-width mismatch, a shared-memory bank conflict in the NTT, and incorrect rounding in the key-switch decomposition.

9. Reproducibility

# Build TFHE benchmarks
$ cargo build --release --features tfhe_bench --example tfhe_bench

# Run 8-bit greater-than comparison (96 channels)
$ ./target/release/examples/tfhe_bench --op gt --bits 8 --channels 96

# Run gate-level benchmarks
$ ./target/release/examples/tfhe_bench --op gates --channels 96

# Run noise budget analysis
$ ./target/release/examples/tfhe_bench --noise-analysis --iterations 10000
