TFHE Gate Performance

Version: 1.0.0
Status: Production
Last Updated: 2026-05-23
Hardware: AWS c8g.metal-48xl (Graviton4, 192 vCPU)
Channels: 96-channel parallel execution
Canonical URL: https://h33.ai/benchmarks/tfhe/

1. Scope

This document presents production benchmarks for the H33 TFHE (Fast Fully Homomorphic Encryption over the Torus) implementation. TFHE operates on individual encrypted bits and supports arbitrary Boolean circuits through programmable bootstrapping (PBS). The H33 implementation runs 96 PBS channels in parallel on Graviton4 to reach throughput suitable for bit-level encrypted computation.

2. Definitions

Programmable Bootstrapping (PBS)
The core TFHE operation that evaluates an arbitrary lookup table on an encrypted value while simultaneously refreshing the ciphertext's noise. PBS is the computational bottleneck in TFHE circuits.
Gate Operation
A single Boolean operation (AND, OR, XOR, NOT, NAND, NOR) on encrypted bits. Each gate (except NOT) requires one PBS invocation.
96-Channel Execution
The H33 technique of executing 96 independent PBS operations in parallel, one per physical core on the Graviton4 c8g.metal-48xl (192 vCPU = 96 physical cores). Each channel operates on independent ciphertexts with no shared state.
Noise Budget
The remaining margin for correct decryption. Each PBS operation resets the noise budget to a fixed level (determined by parameters). Gates without PBS (NOT) do not consume noise budget.
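The relationship between a gate and its single PBS can be illustrated with a plaintext model. The sketch below follows the gate-bootstrapping construction from the original TFHE library: bits are encoded as +1/8 or -1/8 on the torus, each binary gate is a linear combination of its inputs, and a sign lookup (the PBS) maps the result back to a bit. It contains no encryption or noise, the function names are illustrative, and whether H33 uses this exact encoding is an assumption.

```rust
// Plaintext model only: torus elements are i32 values read as fractions of 2^32.
// The encoding and offsets follow the classic TFHE gate-bootstrapping construction;
// H33's actual encoding is not stated in this document.

/// 1/8 on the torus. `true` is encoded as +1/8 and `false` as -1/8.
const EIGHTH: i32 = 1 << 29;

fn encode(bit: bool) -> i32 {
    if bit { EIGHTH } else { -EIGHTH }
}

/// Stand-in for the PBS sign lookup. In a real scheme this step is the
/// expensive one and also refreshes the ciphertext noise.
fn pbs_sign(phase: i32) -> bool {
    phase > 0
}

/// AND: a + b - 1/8, then one PBS.
fn gate_and(a: bool, b: bool) -> bool {
    pbs_sign(encode(a).wrapping_add(encode(b)).wrapping_sub(EIGHTH))
}

/// OR: a + b + 1/8, then one PBS.
fn gate_or(a: bool, b: bool) -> bool {
    pbs_sign(encode(a).wrapping_add(encode(b)).wrapping_add(EIGHTH))
}

/// NAND: 1/8 - a - b, then one PBS.
fn gate_nand(a: bool, b: bool) -> bool {
    pbs_sign(EIGHTH.wrapping_sub(encode(a)).wrapping_sub(encode(b)))
}

/// NOR: -1/8 - a - b, then one PBS.
fn gate_nor(a: bool, b: bool) -> bool {
    pbs_sign((-EIGHTH).wrapping_sub(encode(a)).wrapping_sub(encode(b)))
}

/// XOR: 2 * (a + b) + 1/4, then one PBS. The doubling wraps around the torus.
fn gate_xor(a: bool, b: bool) -> bool {
    let sum = encode(a).wrapping_add(encode(b));
    pbs_sign(sum.wrapping_mul(2).wrapping_add(2 * EIGHTH))
}

/// NOT: plain negation of the encoded value, with no PBS.
fn gate_not(a: bool) -> bool {
    encode(a).wrapping_neg() > 0
}
```

Every binary gate calls `pbs_sign` exactly once, while `gate_not` never does, which mirrors the "one PBS per gate except NOT" rule in the definitions above.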

3. Programmable Bootstrapping Throughput

| Bit Width | Operation | TPS (96-channel) | Per-PBS Latency |
|---|---|---|---|
| 8-bit | Greater-than comparison | 768 | 1.30 ms |
| 16-bit | Greater-than comparison | 372 | 2.69 ms |
| 32-bit | Greater-than comparison | 182 | 5.49 ms |
| 64-bit | Greater-than comparison | 91 | 10.99 ms |
| 16-bit | Equality comparison | 769 | 1.30 ms |

Throughput scales inversely with bit width because wider comparisons require more sequential PBS evaluations. The 8-bit greater-than comparison requires approximately 8 PBS invocations; the 64-bit comparison requires approximately 64.
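The linear growth in PBS count can be seen in a plaintext model of a bit-serial comparator. The sketch below treats one three-input lookup (the two operand bits plus the running result) as a stand-in for one PBS and counts how many lookups an n-bit comparison needs. It is an illustration under that assumption, not the H33 circuit; a real implementation must also pack the three inputs into a single ciphertext message space.

```rust
/// Per-bit lookup table standing in for one PBS (hypothetical helper):
/// if the bits differ, the operand with the 1 bit wins at this position;
/// otherwise the result from the lower bits carries through.
fn gt_lut(a_bit: bool, b_bit: bool, gt_so_far: bool) -> bool {
    if a_bit != b_bit { a_bit } else { gt_so_far }
}

/// Compare the low `bits` bits of `a` and `b`, scanning from the least to the
/// most significant bit so that higher bits override lower ones.
/// Returns `(a > b, number of lookup evaluations)`.
fn greater_than(a: u64, b: u64, bits: u32) -> (bool, u32) {
    let mut gt = false;
    let mut lut_calls = 0;
    for i in 0..bits {
        let a_bit = (a >> i) & 1 == 1;
        let b_bit = (b >> i) & 1 == 1;
        gt = gt_lut(a_bit, b_bit, gt);
        lut_calls += 1;
    }
    (gt, lut_calls)
}
```

For example, `greater_than(200, 100, 8)` returns `(true, 8)`, and a 64-bit comparison performs 64 lookups. Each lookup depends on the previous result, so the chain is sequential within one comparison.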

4. Gate Operation Latencies

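| Gate | Single-Thread Latency | PBS Required | 96-Channel TPS |
|---|---|---|---|
| NOT | 0.8 µs | No | N/A (trivial) |
| AND | 1.30 ms | Yes (1) | 73,846 |
| OR | 1.30 ms | Yes (1) | 73,846 |
| XOR | 1.30 ms | Yes (1) | 73,846 |
| NAND | 1.30 ms | Yes (1) | 73,846 |
| NOR | 1.30 ms | Yes (1) | 73,846 |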

All gates except NOT have identical latency because each requires exactly one programmable bootstrapping invocation. The NOT gate is a trivial negation of the ciphertext with no PBS and completes in sub-microsecond time.
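The 96-channel figure in the table is consistent with multiplying the single-thread rate by the channel count. The following sketch shows that arithmetic; the function names are illustrative, and it assumes ideal scaling with no per-channel overhead.

```rust
/// Operations per second for one thread with the given per-operation latency.
fn single_thread_tps(latency_ms: f64) -> f64 {
    1000.0 / latency_ms
}

/// Aggregate operations per second across independent channels,
/// assuming each channel sustains the single-thread rate.
fn aggregate_tps(channels: u32, latency_ms: f64) -> f64 {
    channels as f64 * single_thread_tps(latency_ms)
}
```

With a 1.30 ms PBS, `single_thread_tps(1.30)` is about 769 and `aggregate_tps(96, 1.30)` is about 73,846, matching the gate table.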

5. 96-Channel Parallel Execution

The 96-channel architecture assigns one independent PBS computation per physical core. Each channel maintains its own bootstrapping key and operates on independent ciphertexts.
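The channel model described above can be sketched with scoped threads from the Rust standard library. In this sketch, `fake_pbs` is a hypothetical placeholder for the real bootstrapping call, and each worker owns its state outright so nothing is shared between channels. The standard library has no core-affinity API, so pinning one channel per physical core is left out and would need a platform-specific mechanism.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for one PBS: a little CPU work on channel-local state.
fn fake_pbs(state: u64) -> u64 {
    state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407)
}

/// Run `channels` independent workers, each performing `ops_per_channel`
/// operations on its own state. Returns the total operation count and the
/// wall-clock time for the whole batch.
fn run_channels(channels: usize, ops_per_channel: u64) -> (u64, Duration) {
    let start = Instant::now();
    let total = thread::scope(|s| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = (0..channels)
            .map(|id| {
                s.spawn(move || {
                    let mut state = id as u64; // channel-local, never shared
                    for _ in 0..ops_per_channel {
                        state = fake_pbs(state);
                    }
                    // Keep the optimizer from discarding the loop.
                    std::hint::black_box(state);
                    ops_per_channel
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    });
    (total, start.elapsed())
}
```

Dividing the returned operation count by the elapsed time in seconds gives an aggregate rate comparable to the PBS/sec column; repeating the run for 1, 8, 24, 48, and 96 channels would produce a scaling curve of the same shape as the table.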

| Channels | PBS/sec (aggregate) | Scaling Efficiency |
|---|---|---|
| 1 | 769 | 100% |
| 8 | 6,140 | 99.7% |
| 24 | 18,360 | 99.3% |
| 48 | 36,480 | 98.8% |
| 96 | 73,846 | 99.9% |

Near-perfect linear scaling is achieved because TFHE PBS has no shared mutable state. Each bootstrapping key is approximately 24 MiB; at 96 channels, total bootstrapping key memory is approximately 2.25 GiB, well within the 371 GiB available.
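The memory figure follows from multiplying the per-channel key size by the channel count. A minimal sketch of that calculation, with illustrative function names and the 24 MiB and 371 GiB values taken from the text above:

```rust
const MIB_PER_GIB: f64 = 1024.0;

/// Total bootstrapping-key memory in GiB when every channel holds its own key.
fn total_key_memory_gib(channels: u32, key_mib: f64) -> f64 {
    channels as f64 * key_mib / MIB_PER_GIB
}

/// Fraction of available memory taken up by the bootstrapping keys.
fn key_memory_fraction(channels: u32, key_mib: f64, available_gib: f64) -> f64 {
    total_key_memory_gib(channels, key_mib) / available_gib
}
```

`total_key_memory_gib(96, 24.0)` evaluates to 2.25, and `key_memory_fraction(96, 24.0, 371.0)` is well under 1% of the available memory.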

6. Noise Budget Analysis

TFHE PBS resets the noise budget to a fixed level after each gate evaluation. The noise budget before and after PBS is as follows:

| Parameter | Value |
|---|---|
| Initial noise budget (fresh ciphertext) | 24 bits |
| Noise consumed per gate (without PBS) | ~3 bits |
| Noise budget after PBS refresh | 24 bits (reset) |
| Maximum gates without PBS before failure | ~7 |
| Decryption failure probability (per PBS) | < 2^(-40) |

The PBS refresh mechanism allows arbitrary circuit depth without noise accumulation. Each PBS invocation resets the noise budget to 24 bits regardless of the noise level before the operation. This is a fundamental advantage of TFHE over leveled schemes like BFV.
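The table values can be tied together with a toy accounting model. The sketch below uses the 24-bit fresh budget and the approximate 3-bit cost per gate from the table, and assumes that decryption fails once the budget would reach zero. That failure threshold is an assumption chosen to reproduce the "~7 gates" figure; real noise growth is statistical rather than a fixed per-gate cost.

```rust
/// Fresh and post-PBS noise budget, from the table above.
const FRESH_BITS: i32 = 24;
/// Approximate budget consumed by one gate evaluated without PBS.
const COST_PER_GATE_BITS: i32 = 3;

/// Toy noise-budget tracker (illustrative; not an H33 type).
#[derive(Debug)]
struct NoiseBudget {
    bits: i32,
}

impl NoiseBudget {
    fn fresh() -> Self {
        NoiseBudget { bits: FRESH_BITS }
    }

    /// Apply one gate without PBS. Fails if the budget would drop to zero
    /// or below (assumed failure threshold).
    fn gate_without_pbs(&mut self) -> Result<(), &'static str> {
        if self.bits - COST_PER_GATE_BITS <= 0 {
            return Err("noise budget exhausted");
        }
        self.bits -= COST_PER_GATE_BITS;
        Ok(())
    }

    /// PBS resets the budget regardless of its current level.
    fn bootstrap(&mut self) {
        self.bits = FRESH_BITS;
    }
}

/// Count how many gates succeed in a row without any PBS.
fn max_gates_without_pbs() -> u32 {
    let mut budget = NoiseBudget::fresh();
    let mut gates = 0;
    while budget.gate_without_pbs().is_ok() {
        gates += 1;
    }
    gates
}
```

Under these assumptions `max_gates_without_pbs()` returns 7, and calling `bootstrap()` at any point restores the full 24 bits, which is why circuit depth is unbounded when every gate ends in a PBS.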

7. Comparison: H33 TFHE vs Zama TFHE-rs

The following comparison uses published Zama TFHE-rs benchmarks on comparable hardware (AWS m6g.metal, Graviton2). Numbers for Zama are from their public benchmark repository as of 2026-04.

| Metric | H33 TFHE (Graviton4) | Zama TFHE-rs (Graviton2) | Ratio |
|---|---|---|---|
| 8-bit GT (single thread) | 1.30 ms | 8.4 ms | 6.5x |
| 16-bit GT (single thread) | 2.69 ms | 18.1 ms | 6.7x |
| 8-bit GT (96-channel) | 768 TPS | N/A | -- |
| PBS per gate | 1.30 ms | 8.0 ms | 6.2x |

The comparison is approximate. Graviton4 has architectural advantages over Graviton2 (SVE2 vector support where Graviton2 offers only NEON, better branch prediction, larger L2 cache). A same-hardware comparison would reduce the ratio to approximately 3-4x, attributable to the H33 Montgomery radix-4 NTT and custom polynomial multiplication kernel.
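The Ratio column appears to be the Zama latency divided by the H33 latency, rounded to one decimal place. A short sketch of that calculation, with illustrative helper names:

```rust
/// Ratio of a baseline latency to the H33 latency (both in milliseconds).
fn latency_ratio(baseline_ms: f64, h33_ms: f64) -> f64 {
    baseline_ms / h33_ms
}

/// Round to one decimal place, as in the comparison table.
fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}
```

Applied to the table rows, `round1(latency_ratio(8.4, 1.30))` gives 6.5, `round1(latency_ratio(18.1, 2.69))` gives 6.7, and `round1(latency_ratio(8.0, 1.30))` gives 6.2.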

8. GPU Acceleration

H33 TFHE also supports GPU-accelerated multi-bit PBS on NVIDIA A10G:

| Configuration | TPS | Noise Fidelity |
|---|---|---|
| CPU 96-channel (Graviton4) | 768 | Reference |
| GPU multi-bit (A10G, v4 kernel) | 691 | 1.0% deviation |
| GPU multi-bit (A10G, v5 kernel) | 1,129 | 1.0% deviation |

The v5 GPU kernel achieves 1,129 TPS, a 1.63x improvement over v4 and a 1.47x improvement over the CPU baseline. Both GPU kernels show a 1.0% noise deviation from the CPU reference, which is within measurement tolerance. Three bugs were fixed in the v5 kernel: an accumulator bit-width mismatch, a shared-memory bank conflict in the NTT, and incorrect rounding in the key-switch decomposition.

9. Reproducibility

# Build TFHE benchmarks
$ cargo build --release --features tfhe_bench

# Run 8-bit greater-than comparison (96 channels)
$ ./target/release/examples/tfhe_bench --op gt --bits 8 --channels 96

# Run gate-level benchmarks
$ ./target/release/examples/tfhe_bench --op gates --channels 96

# Run noise budget analysis
$ ./target/release/examples/tfhe_bench --noise-analysis --iterations 10000