TFHE Boolean Circuits on 96-Channel ARM: Real Numbers, No GPU Required

Eric Beans, CEO, H33.ai, Inc. — April 19, 2026

Today we're publishing measured TFHE throughput numbers from a production Graviton4 deployment. Not projections. Not single-gate extrapolations. Real encrypted comparison and equality circuits, running across 96 parallel channels, sustained for 30 seconds each, on a single ARM node with no GPU.

These numbers matter because encrypted comparison is the primitive that turns FHE from a research curiosity into a deployable product. "Is the fraud score above the threshold?" "Does the encrypted credit score qualify?" "Is this transaction amount within the approved range?" Every one of those questions is a comparison on encrypted data, and every one of them calls for TFHE, because BFV and CKKS, the arithmetic FHE schemes, have no native comparison. They can add and multiply, so a comparison has to be approximated with a deep polynomial, which is costly. They cannot branch.

The Numbers

Hardware: AWS c8g.metal-48xl (Graviton4, 192 vCPUs, 371 GiB). Single node, no GPU, no cluster. CACHEE_MODE=inprocess for zero-latency cache lookup.

Operation	Bit Width	TPS (96 channels)	Per-Channel Latency
Greater-Than	8-bit	768	125 ms
Greater-Than	16-bit	372	258 ms
Greater-Than	32-bit	182	526 ms
Greater-Than	64-bit	91	1,058 ms
Equality	16-bit	769	125 ms

The fundamental gate throughput is 11,520 AND gates per second across all 96 channels, sustained (120 gates per second per channel, or about 8.3 ms per gate). Every TPS figure in the table above agrees, to within about 1 TPS, with this single constant divided by the circuit's AND gate count. The gate is the atom; everything else is chemistry.
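The relationship between the gate rate and the table can be checked with a few lines of arithmetic. This is a minimal sketch, not H33's benchmark code: the function names are illustrative, and the gate counts (2n - 1 for greater-than, n - 1 for equality) are taken from the Circuit Architecture section of this post.

```python
AND_GATES_PER_SEC = 11_520  # aggregate sustained rate quoted in the text
CHANNELS = 96               # parallel channels quoted in the text


def gt_and_gates(bits: int) -> int:
    """AND gates in an n-bit ripple greater-than circuit (2n - 1)."""
    return 2 * bits - 1


def eq_and_gates(bits: int) -> int:
    """AND gates in an n-bit equality AND tree (n - 1)."""
    return bits - 1


def predicted_tps(and_gates: int) -> float:
    """Aggregate decisions per second across all channels."""
    return AND_GATES_PER_SEC / and_gates


def predicted_latency_ms(and_gates: int) -> float:
    """Time for a single channel to evaluate the circuit gate by gate."""
    gates_per_sec_per_channel = AND_GATES_PER_SEC / CHANNELS
    return 1000.0 * and_gates / gates_per_sec_per_channel


circuits = [
    ("Greater-Than 8-bit", gt_and_gates(8)),
    ("Greater-Than 16-bit", gt_and_gates(16)),
    ("Greater-Than 32-bit", gt_and_gates(32)),
    ("Greater-Than 64-bit", gt_and_gates(64)),
    ("Equality 16-bit", eq_and_gates(16)),
]
for name, gates in circuits:
    print(f"{name}: {gates} AND gates, "
          f"{predicted_tps(gates):.1f} TPS, "
          f"{predicted_latency_ms(gates):.0f} ms per decision")
```

The predictions land within about 1 TPS and 1 ms of every row in the table, which is what you would expect if the bootstrap dominates and scheduling overhead is small.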

What These Numbers Mean for Real Workloads

Fraud Scoring: 768 TPS

A fraud score is typically 0–255 (8-bit). "Is this transaction's encrypted fraud score above the cutoff?" is a single 8-bit comparison: 768 decisions per second on encrypted data, on one node. For a bank processing 50M transactions per day, that's ~579 TPS average, so a single Graviton4 node covers the average load with roughly 30% headroom. Sustained bursts above 768 TPS would queue or need a second node.

Credit Decisioning: 372 TPS

Credit scores range 300–850 (10-bit, but evaluated as 16-bit). "Does the applicant's encrypted score meet the threshold?" is a 16-bit comparison at 372 TPS. A large lender doing 100K applications per day needs ~1.2 TPS average. A single node handles roughly 300x that volume.
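The sizing arithmetic behind the fraud and credit examples is simple enough to reuse for your own volumes. The sketch below assumes traffic is spread evenly over 24 hours, which real workloads rarely are, so treat the result as an average-load estimate; the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def average_tps(events_per_day: float) -> float:
    """Average rate if events were spread evenly over a day."""
    return events_per_day / SECONDS_PER_DAY


def headroom(node_tps: float, events_per_day: float) -> float:
    """How many times the average load a single node can sustain."""
    return node_tps / average_tps(events_per_day)


# Figures from the text: 50M fraud checks/day at 768 TPS,
# 100K credit applications/day at 372 TPS.
print(f"fraud:  {average_tps(50_000_000):.0f} TPS average, "
      f"{headroom(768, 50_000_000):.2f}x headroom")
print(f"credit: {average_tps(100_000):.2f} TPS average, "
      f"{headroom(372, 100_000):.0f}x headroom")
```

If your peak-to-average ratio is known, dividing the headroom figure by it gives a quick estimate of whether one node is enough at peak.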

Transaction Amount Comparison: 182 TPS

"Is the encrypted transaction amount above $10,000?" expressed in cents is a 32-bit comparison. At 182 TPS, a compliance engine can evaluate every transaction from a mid-size bank's daily volume on a single node.

Encrypted Attribute Matching: 769 TPS

"Does the encrypted SSN fragment match the reference?" is a 16-bit equality test. Equality is structurally simpler than comparison — it uses an AND tree instead of a ripple chain — and runs at 769 TPS. This is the primitive for encrypted identity matching, cross-bank fraud detection, and KYC attribute verification without exposing the underlying data.

Why CPU, Not GPU

The TFHE landscape is GPU-dominated. Zama's TFHE-rs achieves sub-millisecond gate latency on NVIDIA H100 hardware. That's impressive, and we respect the engineering. But GPU deployment has costs that don't appear in the benchmark:

H100 instances cost 5–10x more per hour than equivalent CPU instances
GPU memory limits constrain the number of concurrent encrypted operations
Data transfer between CPU and GPU adds latency that doesn't show in gate benchmarks
GPU availability is constrained — try getting H100 spot capacity in us-east-1

Our approach is different: run TFHE on ARM CPUs with massive channel parallelism. Graviton4's 192 vCPUs give us 96 independent TFHE channels, each running its own bootstrap pipeline. The per-gate latency (8.3ms) is higher than GPU, but the per-node cost is lower and the deployment model is simpler. For workloads where you need hundreds of encrypted decisions per second — not millions — the CPU path is the right economic choice.

The real comparison isn't gate latency; it's cost per encrypted decision. An 8-bit encrypted comparison on Graviton4 at ~$2.30/hour and 768 TPS costs approximately $0.00000083 per decision, or about $0.83 per million decisions. The same operation on an H100 at ~$30/hour would need to sustain roughly 10,000 TPS just to match our per-decision cost. Gate speed and economic efficiency are not the same metric.
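The cost-per-decision comparison follows directly from the hourly price and the sustained TPS. The sketch below uses the approximate hourly prices quoted in this post as inputs; they are not current list prices, and the function names are illustrative.

```python
GRAVITON4_USD_PER_HOUR = 2.30   # approximate figure quoted in the text
H100_USD_PER_HOUR = 30.00       # approximate figure quoted in the text
GRAVITON4_TPS = 768             # 8-bit greater-than, from the table
SECONDS_PER_HOUR = 3600


def cost_per_decision(usd_per_hour: float, tps: float) -> float:
    """Dollars spent per encrypted decision at a sustained rate."""
    return usd_per_hour / (tps * SECONDS_PER_HOUR)


def break_even_tps(usd_per_hour: float, target_cost: float) -> float:
    """TPS a pricier instance must sustain to match a target cost."""
    return usd_per_hour / SECONDS_PER_HOUR / target_cost


cpu_cost = cost_per_decision(GRAVITON4_USD_PER_HOUR, GRAVITON4_TPS)
gpu_break_even = break_even_tps(H100_USD_PER_HOUR, cpu_cost)

print(f"Graviton4: ${cpu_cost * 1e6:.2f} per million decisions")
print(f"H100 break-even: {gpu_break_even:,.0f} TPS")
```

The break-even rate is just the CPU throughput scaled by the price ratio, so it moves linearly with whatever hourly prices you plug in.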

How It Works: The IQ Router

H33's FHE stack doesn't force you to choose between BFV and TFHE. The FHE-IQ router makes the decision automatically based on the operation you're performing.

The routing rule is clean:

Is the operation polynomial (add, multiply, inner product)?
  → BFV   (35 µs per auth, 2.2M auth/sec)

Is the operation non-polynomial (compare, branch, match, sort)?
  → TFHE  (125 ms per 8-bit comparison, 768 TPS)

In a typical fraud-detection pipeline, the score computation (weighted sum of features) runs on BFV at microsecond latency. The threshold decision ("is the score above the cutoff?") routes to TFHE. The handoff is invisible to the caller — the IQ router inspects the computation graph and selects the optimal engine for each node.
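As an illustration of the routing rule, here is a minimal sketch of a classifier that maps each operation in a computation graph to an engine. It is not the FHE-IQ router's actual API: the operation names, the `Engine` enum, and the flat list standing in for a computation graph are all assumptions made for the example.

```python
from enum import Enum


class Engine(Enum):
    BFV = "bfv"
    TFHE = "tfhe"


# Hypothetical operation names, grouped per the routing rule in the text.
POLYNOMIAL_OPS = {"add", "multiply", "inner_product"}
NON_POLYNOMIAL_OPS = {"compare", "branch", "match", "sort"}


def route(op: str) -> Engine:
    """Pick an engine for a single operation."""
    if op in POLYNOMIAL_OPS:
        return Engine.BFV
    if op in NON_POLYNOMIAL_OPS:
        return Engine.TFHE
    raise ValueError(f"unknown operation: {op}")


def route_graph(ops: list[str]) -> list[tuple[str, Engine]]:
    """Route every node of a (flattened) computation graph."""
    return [(op, route(op)) for op in ops]


# Fraud pipeline from the text: weighted sum of features, then a threshold.
fraud_pipeline = ["multiply", "add", "compare"]
for op, engine in route_graph(fraud_pipeline):
    print(f"{op:>8} -> {engine.name}")
```

A real router would also need to handle the ciphertext handoff between engines, which this sketch leaves out entirely.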

This is not a theoretical architecture. It's running in production with 142 routing tests covering 100 realistic scenarios across banking, healthcare, legal, cybersecurity, insurance, IoT, and governance workloads. Every non-polynomial operation routes to TFHE. Every polynomial operation stays on BFV. No manual engine selection required.

Circuit Architecture

A quick note on how encrypted comparison actually works, because the gate counts matter for understanding the numbers.

Greater-Than (Ripple Comparator)

An n-bit greater-than comparison uses a ripple comparator that propagates a "greater" flag from LSB to MSB. Each bit position requires 2 AND gates (one for "a > b at this bit" and one for "propagate previous result through equality"), except the least significant bit, which has no previous result to propagate and needs only one. Total: 2n - 1 AND gates. XOR and NOT operations are free in TFHE: they're direct LWE additions and negations with no bootstrap required.
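To make the gate count concrete, the sketch below runs the ripple comparator on plaintext bits and counts AND gates. It is a plaintext model only: in the encrypted circuit each AND would be a bootstrapped gate and each XOR or NOT a free LWE operation. The function name and the choice to combine the two terms with XOR (valid because they can never both be 1) are assumptions of this example, not a description of H33's circuit.

```python
def greater_than(a: int, b: int, bits: int) -> tuple[int, int]:
    """Return (1 if a > b else 0, number of AND gates used)."""
    and_gates = 0
    gt = 0
    for i in range(bits):              # ripple from LSB to MSB
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        here = a_i & (b_i ^ 1)         # AND: a_i = 1 and b_i = 0
        and_gates += 1
        if i == 0:
            gt = here                  # no previous flag to propagate
        else:
            same = a_i ^ b_i ^ 1       # XNOR, free (XOR + NOT)
            carried = same & gt        # AND: keep flag if bits are equal
            and_gates += 1
            gt = here ^ carried        # terms are exclusive, so XOR acts as OR
    return gt, and_gates


result, gates = greater_than(179, 101, 8)
print(f"179 > 101: {bool(result)}, AND gates: {gates}")
```

Because a higher bit overrides the flag whenever the two bits differ, the value left after the MSB is the final answer.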

Equality (AND Tree)

An n-bit equality test first computes per-bit XNOR (free: XOR + NOT), then reduces the n equality bits via a binary AND tree. Total: n - 1 AND gates. This is why 16-bit equality (15 AND gates) runs at the same speed as 8-bit comparison (15 AND gates) — same gate count, same throughput.
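The AND-tree reduction can be modeled on plaintext bits in the same way. This sketch counts AND gates for an n-bit equality test; it assumes nothing about H33's implementation beyond the XNOR-then-AND-tree structure described above, and the function name is illustrative.

```python
def equal(a: int, b: int, bits: int) -> tuple[int, int]:
    """Return (1 if a == b else 0, number of AND gates used)."""
    # Per-bit XNOR: free in TFHE (XOR followed by NOT).
    level = [((a >> i) & 1) ^ ((b >> i) & 1) ^ 1 for i in range(bits)]
    and_gates = 0
    # Reduce pairwise until one bit remains.
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] & level[i + 1])   # one AND gate per pair
            and_gates += 1
        if len(level) % 2 == 1:
            nxt.append(level[-1])                 # odd bit passes through
        level = nxt
    return level[0], and_gates


print(equal(179, 101, 16))
print(equal(179, 179, 16))
```

The tree has only log2(n) levels, so gates within a level are independent of each other; the per-channel latency in the table is consistent with evaluating all 15 gates one after another on a single channel.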

The Bootstrap Bottleneck

Every AND gate requires a programmable bootstrap — the operation that refreshes noise and evaluates the gate function. At 8.3ms per bootstrap on Graviton4, this is the fundamental throughput limiter. XOR and NOT are noise-accumulating but free; AND is the expensive reset. SHA3-256 requires 38,400 AND gates across 24 Keccak rounds, putting it firmly outside TFHE's performance envelope on any current hardware. We measured 0.30 TPS. We are not claiming SHA3-under-TFHE is production-viable. It isn't.

Honest numbers on what TFHE cannot do: Full SHA3-256 evaluation on encrypted data runs at 0.30 TPS on this hardware. That's 640x below our internal target and is a fundamental limitation of TFHE's gate-evaluation model applied to a high-gate-count hash function. We publish this number because the FHE industry has a credibility problem with unpublished limitations, and we'd rather you know what doesn't work than discover it in production.
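The SHA3 figures can be reproduced from the gate rate. The sketch assumes a single Keccak-f[1600] permutation call (an input that fits in one block) and one AND per state bit per round, which is the cost of the chi step; the 192 TPS value is only what a 640x gap implies, not a target stated in this post.

```python
KECCAK_STATE_BITS = 1600    # Keccak-f[1600] state size
KECCAK_ROUNDS = 24          # rounds per permutation
AND_GATES_PER_SEC = 11_520  # aggregate rate quoted in the text

# The chi step applies one AND per state bit in every round.
and_gates_per_hash = KECCAK_STATE_BITS * KECCAK_ROUNDS
sha3_tps = AND_GATES_PER_SEC / and_gates_per_hash
implied_target_tps = sha3_tps * 640  # derived from the "640x below" remark

print(f"AND gates per hash: {and_gates_per_hash}")
print(f"Predicted SHA3-256 rate: {sha3_tps:.2f} TPS")
print(f"Implied internal target: {implied_target_tps:.0f} TPS")
```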

Post-Quantum All the Way Down

Every TFHE operation in H33's stack is post-quantum secure. TFHE is lattice-based (Learning With Errors over the torus), so it rests on the same family of lattice hardness assumptions that protects BFV. The comparison result, the single encrypted bit that says "yes, the score is above the threshold", can be attested via our three-family post-quantum signature bundle and committed to the H33-74 substrate.

The full pipeline for an encrypted threshold decision:

Client encrypts the value under TFHE (LWE ciphertext, ~2KB)
Server evaluates the comparison circuit (125ms for 8-bit)
Result is threshold-decrypted (the decision bit only, not the input)
Decision is attested via ML-DSA + FALCON + SLH-DSA three-family bundle
Attestation is committed to the H33-74 substrate (74 bytes, permanent)

The input value is never exposed. The threshold is never exposed. Only the yes/no decision crosses the encryption boundary, and it's immediately signed under three independent post-quantum assumptions.

Benchmark Methodology

We believe in reproducible benchmarks. Here's exactly how these numbers were produced:

Hardware: AWS c8g.metal-48xl (bare metal, no hypervisor), Graviton4 Neoverse V2, 192 vCPUs, 371 GiB RAM
Software: Rust 1.94.1, release profile with LTO, target-cpu=neoverse-v2
TFHE Parameters: LWE n=481, TRLWE N=512, multi-bit PBS group size 8, gadget 16-bit × 2 levels
Channels: 96 (one per physical core), Rayon work-stealing scheduler
Duration: 30 seconds sustained per test, with warmup phase excluded
Correctness: Verified before each benchmark run (AND(1,0)=0, AND(1,1)=1, 179>101=true, 179==101=false, 179==179=true)
Measurement: Wall-clock elapsed time, atomic counters per channel, no sampling

The benchmark source is available for audit. We do not publish numbers we cannot reproduce.
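For readers who want to see the shape of the measurement loop, here is a small Python analogue. It is not the benchmark source: the real harness is Rust with Rayon and atomic counters, while this sketch uses threads, one counter slot per channel, and a sleep as a stand-in for the bootstrapped gate. All names and durations are illustrative.

```python
import threading
import time


def fake_and_gate() -> None:
    """Stand-in for one bootstrapped AND gate (hypothetical workload)."""
    time.sleep(0.001)


def run_benchmark(channels: int, warmup_s: float, duration_s: float) -> float:
    """Return completed gates per second over the measured window."""
    counts = [0] * channels          # one slot per channel, no sharing
    measuring = threading.Event()
    stop = threading.Event()

    def channel(idx: int) -> None:
        while not stop.is_set():
            fake_and_gate()
            if measuring.is_set():   # warmup iterations are not counted
                counts[idx] += 1

    threads = [threading.Thread(target=channel, args=(i,))
               for i in range(channels)]
    for t in threads:
        t.start()

    time.sleep(warmup_s)             # warmup phase, excluded from results
    start = time.perf_counter()      # wall-clock timing
    measuring.set()
    time.sleep(duration_s)
    measuring.clear()
    elapsed = time.perf_counter() - start
    stop.set()
    for t in threads:
        t.join()
    return sum(counts) / elapsed


if __name__ == "__main__":
    rate = run_benchmark(channels=4, warmup_s=0.1, duration_s=0.5)
    print(f"{rate:.0f} gates/sec across 4 channels")
```

The structure mirrors the methodology list: fixed channel count, warmup excluded, every completed operation counted rather than sampled, and throughput computed from wall-clock time.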

Where Each Width Fits Commercially

Every comparison width in the table maps to a specific class of real-world encrypted decision. The doubling curve (768, 372, 182, 91) is very nearly inverse-linear in width: each width doubling takes the gate count from 2n - 1 to 4n - 1, which roughly doubles it and roughly halves throughput. That means you can predict the cost of any integer-width comparison without running another benchmark. 128-bit would be approximately 45 TPS, 256-bit approximately 23 TPS.
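The extrapolation can be written out explicitly. The symbols G (aggregate AND-gate rate) and TPS(n) (throughput of an n-bit greater-than) are introduced here only for this worked example; the gate count 2n - 1 is the one given in the Circuit Architecture section.

```latex
\[
\mathrm{TPS}(n) = \frac{G}{2n - 1}, \qquad G = 11{,}520 \ \text{AND gates/s}
\]
\[
\frac{\mathrm{TPS}(2n)}{\mathrm{TPS}(n)} = \frac{2n - 1}{4n - 1}
  = \frac{1}{2}\left(1 - \frac{1}{4n - 1}\right)
\]
\[
\mathrm{TPS}(128) = \frac{11{,}520}{255} \approx 45.2, \qquad
\mathrm{TPS}(256) = \frac{11{,}520}{511} \approx 22.5
\]
```

The ratio is slightly below one half at every width (15/31 at n = 8) and approaches one half as n grows, which is why each doubling in the table lands a little under half of the previous row.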

8-bit and 16-bit (768 / 769 / 372 TPS)

The 8-bit/16-bit cluster is the deployable core. Fraud scores (0–255), credit risk bands (below 40, 40–70, above 70), eligibility flags, and categorical thresholds all fit in 8 or 16 bits. Cross-bank fraud matching — "does this encrypted account identifier match any entry in the encrypted watchlist?" — is a 16-bit equality test at 769 TPS per node. Medical eligibility checks, insurance qualification thresholds, and compliance zone transitions are all 8-bit comparisons at 768 TPS.

These are not hypothetical use cases. They are the exact operations that regulated industries need to perform on data they are not allowed to see in plaintext. A bank evaluating fraud risk on an encrypted transaction, a hospital checking encrypted insurance eligibility, an insurer comparing an encrypted claim amount to a coverage threshold — each of these is a single TFHE comparison that returns an encrypted yes/no without ever exposing the underlying value.

32-bit (182 TPS)

Transaction amounts in cents fit in a signed 32-bit integer up to about $21.4 million (about $42.9 million if unsigned). "Is this encrypted wire transfer above the reporting threshold?" is a 32-bit comparison at 182 TPS, more than sufficient for any single institution's compliance pipeline. Encrypted timestamp comparisons for session-age validation, rate limiting on encrypted request counts, and full-precision credit score evaluation also land here.
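Choosing a comparator width comes down to the largest value you need to represent. The helper below is a sketch that treats values as unsigned magnitudes and rounds up to the four widths benchmarked in this post; the function name and the width tuple are assumptions for illustration.

```python
SUPPORTED_WIDTHS = (8, 16, 32, 64)  # widths benchmarked in the table


def comparator_width(max_value: int) -> int:
    """Smallest benchmarked width that can hold max_value (unsigned)."""
    needed = max(1, max_value.bit_length())
    for width in SUPPORTED_WIDTHS:
        if needed <= width:
            return width
    raise ValueError(f"{max_value} needs {needed} bits, above 64")


examples = {
    "fraud score (max 255)": 255,
    "credit score (max 850)": 850,
    "$10,000 in cents": 1_000_000,
    "Unix time in ms (2026-04-19)": 1_776_556_800_000,
}
for label, value in examples.items():
    print(f"{label}: {value.bit_length()} bits -> "
          f"{comparator_width(value)}-bit comparator")

# Largest dollar amounts representable in cents at 32 bits.
print(f"signed 32-bit:   ${(2**31 - 1) / 100:,.2f}")
print(f"unsigned 32-bit: ${(2**32 - 1) / 100:,.2f}")
```

Since throughput roughly halves with each width doubling, it is worth checking whether a value range can be rescaled to fit the next width down before defaulting to a wider comparator.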

64-bit (91 TPS)

Full Unix timestamps (millisecond precision), monetary amounts in micro-units (supporting sub-cent precision for high-frequency settlement), and encrypted indexed lookups on 64-bit keys. At 91 TPS, a single node handles over 7.8 million encrypted timestamp comparisons per day. For temporal access control — "has the encrypted session token expired?" — this is more than enough throughput.

What's Next

These numbers establish TFHE as a production-viable engine for encrypted decision-making on CPU infrastructure. The IQ router makes it invisible — callers submit computations, the router picks the right engine, results come back attested.

Three areas we're investing in next:

Wider comparators: 128-bit and 256-bit comparison for cryptographic-width operands
Compound predicates: "Is A > B AND C == D?" as a single fused circuit with shared bootstrap amortization
Multi-node parallelism: Distributing independent TFHE channels across a Graviton4 cluster for linear horizontal scaling

The BFV pipeline handles arithmetic at 2.2 million operations per second. The TFHE pipeline handles decisions at hundreds per second. Together, through the IQ router, they cover the full spectrum of encrypted computation — from high-throughput batch processing to precise threshold logic — on commodity ARM hardware, post-quantum secured, no GPU required.