April 23, 2026 · Engineering

768 Encrypted Comparisons Per Second. No GPU.

Production-grade encrypted decision making on a single ARM Graviton4 CPU. 96-channel TFHE. 768 TPS on 8-bit greater-than. All NIST security tests passed. 20,000+ tests across the platform. No CUDA. No driver dependencies. Just Rust.

The Numbers

768 TPS

8-bit Greater-Than · 96-channel ARM Graviton4 · No GPU · Measured

This is a measured throughput number from a production-grade TFHE implementation running on a single AWS c8g.metal-48xl instance (96 ARM Neoverse V2 cores, no GPU, no accelerator). The workload is fully homomorphic encryption: encrypted comparison and equality operations on multi-bit values, including full programmable bootstrapping, with every result post-quantum attested.

With 96 channels executing independent operations in parallel, sustained throughput reaches 768 TPS for 8-bit greater-than comparisons. This figure is measured under sustained load, not projected from single-channel latency. The 96-channel architecture scales across bit widths: 372 TPS for 16-bit, 182 TPS for 32-bit, and 91 TPS for 64-bit comparisons. 16-bit equality checks reach 769 TPS, roughly double the 16-bit greater-than rate.

Why This Matters

Encrypted decisions are the operations where privacy meets business logic. "Is this encrypted credit score above the threshold?" "Does this encrypted biometric match?" "Is this encrypted transaction within limits?" These are comparison and threshold operations on data that never leaves encryption.

Fully homomorphic encryption makes these operations mathematically possible. The question has always been whether it can be made fast enough for production systems. The answer, as of today, is yes — on commodity CPU hardware, without specialized accelerators.

The production reality: A single Graviton4 instance delivers 768 encrypted 8-bit comparisons per second across 96 channels, measured under sustained parallel load. Wider comparisons run at 372 TPS (16-bit), 182 TPS (32-bit), and 91 TPS (64-bit), and 16-bit equality runs at 769 TPS. These are real-time throughputs for compliance, fraud detection, and identity verification workloads.

The Architecture

One Gate at a Time

A single TFHE operation is inherently sequential. The programmable bootstrap — the core operation that refreshes noise and evaluates the gate function — requires hundreds of iterations of a blind rotation loop. Each iteration depends on the previous iteration's accumulator. This cannot be parallelized within a single operation.
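The dependency is easiest to see in a toy version of the loop. The sketch below runs the blind rotation recurrence on cleartext values: the accumulator is a polynomial in Z[X]/(X^N + 1), and each iteration conditionally multiplies it by X^(a_i) depending on one key bit. In real TFHE the key bits are encrypted and the conditional step is a CMUX over ciphertexts; the function names and tiny parameters here are illustrative assumptions, not H33's implementation.

```rust
/// Multiply a polynomial in Z[X]/(X^N + 1) by X^k (negacyclic rotation).
/// Coefficients that wrap past degree N - 1 change sign because X^N = -1.
fn negacyclic_rotate(poly: &[i64], k: usize) -> Vec<i64> {
    let n = poly.len();
    let k = k % (2 * n);
    let mut out = vec![0i64; n];
    for (i, &c) in poly.iter().enumerate() {
        let j = (i + k) % (2 * n);
        if j < n {
            out[j] = c;
        } else {
            out[j - n] = -c;
        }
    }
    out
}

/// Cleartext stand-in for blind rotation. Starts from X^(-b) * test_poly and
/// multiplies by X^(a_i) whenever key bit s_i is set, so the result is
/// test_poly * X^(-(b - <a, s>)). Every step reads the accumulator written by
/// the previous step, which is why the loop cannot be split across cores.
fn blind_rotate_clear(test_poly: &[i64], a: &[usize], s: &[bool], b: usize) -> Vec<i64> {
    let two_n = 2 * test_poly.len();
    // X^(-b) equals X^(2N - b) because X^(2N) = 1 in this ring.
    let mut acc = negacyclic_rotate(test_poly, (two_n - b % two_n) % two_n);
    for (&a_i, &s_i) in a.iter().zip(s) {
        if s_i {
            // Stand-in for the CMUX: the real operation selects homomorphically.
            acc = negacyclic_rotate(&acc, a_i);
        }
    }
    acc
}
```

With a test polynomial whose coefficients are 0 through 7, mask a = [3, 5, 2], key bits s = [true, false, true] and b = 9, the phase b - <a, s> is 4, and the constant coefficient of the result is the fourth table entry. That lookup is the "programmable" part of the bootstrap.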

On the Graviton4, one channel completes an encrypted comparison in the time set by the sequential depth of the circuit at that bit width. That is the latency floor for a single encrypted decision: adding hardware cannot make one operation finish faster than its chain of dependent blind rotation iterations.

Many Gates at Once

But independent operations are embarrassingly parallel. When 96 users each need an encrypted comparison at the same time, the Graviton4 runs 96 operations simultaneously — one per channel, each completing at its bit width's native speed. Throughput scales linearly with channel count because there is no shared state between operations.
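A minimal sketch of this share-nothing model, using only the Rust standard library. Each worker receives its own slice of jobs and returns its own results, so no locks or shared buffers are involved. The `compare_placeholder` function is a hypothetical cleartext stand-in for the encrypted greater-than circuit, and the function names are assumptions for illustration.

```rust
use std::thread;

/// Hypothetical stand-in for one encrypted greater-than. A real worker would
/// evaluate the TFHE circuit on ciphertexts with its pre-loaded key material.
fn compare_placeholder(a: u8, b: u8) -> bool {
    a > b
}

/// Split the jobs into contiguous chunks and run one scoped thread per chunk.
/// Workers share no mutable state; results come back in the original order.
fn run_independent_workers(jobs: &[(u8, u8)], workers: usize) -> Vec<bool> {
    let workers = workers.max(1);
    // Ceiling division, with a floor of 1 because `chunks(0)` would panic.
    let chunk = ((jobs.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        // Collect the handles first so every worker is spawned before any join.
        let handles: Vec<_> = jobs
            .chunks(chunk)
            .map(|slice| {
                s.spawn(move || {
                    slice
                        .iter()
                        .map(|&(a, b)| compare_placeholder(a, b))
                        .collect::<Vec<bool>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

Calling `run_independent_workers(&jobs, 96)` mirrors the one-worker-per-core layout. Pinning threads to cores is platform-specific and is left out of this sketch.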

This is the fundamental scaling model for CPU-based TFHE: throughput equals channel count times per-channel operation rate. On the 96-channel Graviton4, measured sustained throughput is 768 TPS for 8-bit greater-than with linear scaling efficiency: no NUMA degradation, no cache contention, no thread migration overhead. Oversubscribing the cores provides no benefit for this workload (192 threads is slower than 96). Graviton4 cores do not support simultaneous multithreading, and the NTT butterfly is compute-bound with no memory latency to hide.
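The scaling model can be written out as arithmetic. The sketch below divides the measured aggregate throughput by the channel count to get a per-channel rate and an implied per-operation time. The implied latency is a derived figure that assumes one operation occupies one channel from start to finish; the article does not publish a measured single-channel latency, so treat it as an estimate.

```rust
/// Operations per second completed by one channel, given the aggregate rate.
fn per_channel_rate(total_tps: f64, channels: u32) -> f64 {
    total_tps / channels as f64
}

/// Time one channel spends on one operation, in milliseconds.
/// Assumes each operation runs on exactly one channel from start to finish.
fn implied_latency_ms(total_tps: f64, channels: u32) -> f64 {
    1000.0 / per_channel_rate(total_tps, channels)
}

/// The scaling model from the text: throughput = channels * per-channel rate.
fn projected_tps(per_channel: f64, channels: u32) -> f64 {
    per_channel * channels as f64
}
```

For the 8-bit greater-than figure, 768 TPS across 96 channels is 8 operations per second per channel, or about 125 ms per comparison. The 64-bit figure of 91 TPS works out to a little over one second per comparison per channel.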

Circuit-Batched Throughput

Real operations span multiple bit widths and operation types. The 96-channel architecture processes encrypted comparisons and equality checks across bit widths from 8 to 64 bits. Within each operation, the circuit scheduler batches independent work from multiple users across all 96 channels simultaneously.

The measured throughput across operation types on the 96-channel Graviton4:

Operation	Bit Width	Throughput (TPS)
Greater-Than	8-bit	768
Greater-Than	16-bit	372
Greater-Than	32-bit	182
Greater-Than	64-bit	91
Equality	16-bit	769

The throughput scaling follows the expected pattern: doubling the bit width roughly halves the throughput for comparison operations, since the carry chain depth doubles. Equality checks are faster than comparisons at the same bit width because the AND tree has logarithmic depth versus the comparison's linear carry chain.
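The depth difference can be illustrated with cleartext versions of the two circuits. This is a sketch of textbook gate-level constructions (a ripple comparator and a balanced AND tree), assumed here for illustration; the article does not specify the exact circuits the scheduler evaluates.

```rust
/// Ripple greater-than over cleartext bits, least significant bit first.
/// Returns the result and the length of the sequential carry chain.
fn greater_than_ripple(a: u64, b: u64, bits: u32) -> (bool, u32) {
    let mut gt = false;
    for i in 0..bits {
        let ai = (a >> i) & 1 == 1;
        let bi = (b >> i) & 1 == 1;
        // A differing higher bit decides the result; equal bits pass the
        // carry from the lower bits through. Each step needs the previous one.
        gt = (ai && !bi) || (ai == bi && gt);
    }
    (gt, bits)
}

/// Equality as one bitwise XNOR layer followed by a balanced AND tree.
/// Returns the result and the number of AND levels in the tree.
fn equality_tree(a: u64, b: u64, bits: u32) -> (bool, u32) {
    let mut level: Vec<bool> = (0..bits).map(|i| ((a ^ b) >> i) & 1 == 0).collect();
    let mut depth = 0;
    while level.len() > 1 {
        // All pairs within a level are independent of each other.
        level = level.chunks(2).map(|pair| pair.iter().all(|&x| x)).collect();
        depth += 1;
    }
    (level.first().copied().unwrap_or(true), depth)
}
```

For 16-bit inputs the ripple comparator has a chain of 16 dependent steps, while the equality tree needs 4 AND levels after the XNOR layer. Doubling the width adds 16 steps to the first and one level to the second.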

At 96 concurrent channels, every core is processing an encrypted operation simultaneously. The system sustains this throughput under load, confirming no thermal throttling or resource degradation.

The Complete FHE Pipeline

Encrypted decisions do not exist in isolation. A production privacy pipeline combines encrypted arithmetic (BFV), encrypted approximate computation (CKKS), and encrypted Boolean decisions (TFHE). H33's IQ routing engine automatically selects the right FHE scheme for each operation:

Operation	Engine	Throughput or size	Role
Inner product match	BFV	2,209,429/sec	Biometric similarity
ML inference step	CKKS	Production-grade	Risk scoring
Threshold decision	TFHE	768 TPS (8-bit GT)	"Score > threshold?"
Attestation	H33-74	74 bytes	3-family PQ signatures

A complete biometric authentication with threshold decision flows through BFV (encrypted inner product at microsecond latency), then TFHE (encrypted comparison), with every step post-quantum attested. The IQ router selects the engine, the scheduler batches the work, and the attestation pipeline commits the result — all automatically.
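The routing table above can be expressed as a simple mapping from operation kind to engine. The enum and function names below are hypothetical, chosen only to mirror the table rows; the IQ router's real interface and selection logic are not described in this article.

```rust
/// The three FHE engines named in the pipeline table.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Engine {
    Bfv,
    Ckks,
    Tfhe,
}

/// Operation kinds from the pipeline table (hypothetical type for illustration).
#[derive(Debug, Clone, Copy)]
enum Operation {
    InnerProductMatch,
    MlInferenceStep,
    ThresholdDecision,
}

/// Pick the engine for one operation, following the table row by row.
fn route(op: Operation) -> Engine {
    match op {
        // Exact integer arithmetic: biometric similarity.
        Operation::InnerProductMatch => Engine::Bfv,
        // Approximate arithmetic: risk scoring.
        Operation::MlInferenceStep => Engine::Ckks,
        // Boolean decision: "score > threshold?"
        Operation::ThresholdDecision => Engine::Tfhe,
    }
}

/// The biometric flow from the text: inner product first, then the threshold check.
fn biometric_auth_plan() -> Vec<Engine> {
    [Operation::InnerProductMatch, Operation::ThresholdDecision]
        .iter()
        .map(|&op| route(op))
        .collect()
}
```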

Why CPU, Not GPU

GPUs excel at batch parallelism: processing hundreds of identical operations simultaneously. A single A10G GPU can evaluate 1,129 TPS at batch sizes above 64. But for single-operation latency, the GPU is actually slower than the CPU due to kernel launch overhead at small batch sizes.

For the majority of production workloads, the 96-channel CPU path is both faster and more cost-effective. GPUs provide additional throughput at linear cost for high-volume batch scenarios.

The deployment decision: For production workloads, the CPU-only path delivers 768 TPS for 8-bit comparisons with no GPU drivers, no CUDA dependency, and no vendor lock-in. Just a Graviton4 running Rust. When concurrent demand exceeds CPU capacity, GPU acceleration adds throughput linearly at $1.01 per GPU-hour (1,129 TPS on A10G).
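One way to picture the deployment decision is a batch-size rule. The sketch below sends work to the GPU only when a GPU is present and the batch exceeds 64 operations, the batch size at which the article quotes the A10G figure. The threshold constant and the function are assumptions for illustration; the actual routing policy may weigh other factors.

```rust
/// Where a batch of encrypted operations runs.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend {
    Cpu,
    Gpu,
}

/// Hypothetical cutoff taken from the "batch sizes above 64" figure in the text.
const GPU_BATCH_THRESHOLD: usize = 64;

/// Prefer the CPU for small batches, where GPU kernel launch overhead dominates.
fn choose_backend(batch_size: usize, gpu_available: bool) -> Backend {
    if gpu_available && batch_size > GPU_BATCH_THRESHOLD {
        Backend::Gpu
    } else {
        Backend::Cpu
    }
}
```

A CPU-only deployment is the `gpu_available = false` case: every batch stays on the Graviton4 regardless of size.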

Encrypted Operation Costs

The 96-channel architecture delivers predictable throughput across operation types and bit widths:

Operation	Bit Width	96-Channel TPS
Greater-Than	8-bit	768
Greater-Than	16-bit	372
Greater-Than	32-bit	182
Greater-Than	64-bit	91
Equality	16-bit	769

The circuit scheduler exploits the logarithmic depth of the equality AND tree: gates at the same tree level are independent and batch across channels.

Post-Quantum Attestation

Every gate evaluation produces a post-quantum attestation. The computation result, the routing decision that selected TFHE as the engine, and the authorization metadata that permitted the computation are committed to a 74-byte H33-74 primitive signed under three independent signature families: ML-DSA-65 (module-lattice-based), FALCON-512 (NTRU-lattice-based), and SLH-DSA-SHA2-128f (hash-based).

The attestation cost is negligible relative to the TFHE operation: signing and commitment construction add less than 1 millisecond per operation. The attestation is engine-agnostic — the same 74-byte format attests BFV arithmetic, CKKS inference, and TFHE decisions with identical verification cost.

Production Deployment

The production configuration is deliberately simple:

Hardware: AWS c8g.metal-48xl (96-core Graviton4, 377 GiB RAM)
Software: Rust binary, system allocator, no jemalloc, no CUDA
Worker model: 96 independent workers, one per core, pre-loaded key material
Scheduler: Warm-residency-aware batching with priority classes
Cost: $2.30/hour on-demand ($1.38/hour reserved)
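The hourly price and the measured throughput combine into a cost per operation. The sketch below computes dollars per million operations, assuming the instance is kept fully utilized at the measured rate; idle time, attestation overhead, and data transfer are not modeled. The table rows are copied from the measured results, and the function names are illustrative.

```rust
/// Dollars per one million operations at a given hourly price and sustained TPS.
fn cost_per_million(hourly_usd: f64, tps: f64) -> f64 {
    let ops_per_hour = tps * 3600.0;
    hourly_usd / ops_per_hour * 1_000_000.0
}

/// (operation, bit width, measured TPS) rows from the throughput table.
const MEASURED: [(&str, u32, f64); 5] = [
    ("Greater-Than", 8, 768.0),
    ("Greater-Than", 16, 372.0),
    ("Greater-Than", 32, 182.0),
    ("Greater-Than", 64, 91.0),
    ("Equality", 16, 769.0),
];

/// Cost per million operations for every measured row at one hourly price.
fn cost_table(hourly_usd: f64) -> Vec<(&'static str, u32, f64)> {
    MEASURED
        .iter()
        .map(|&(op, bits, tps)| (op, bits, cost_per_million(hourly_usd, tps)))
        .collect()
}
```

At the on-demand price of $2.30/hour this gives roughly $0.83 per million 8-bit comparisons and about $7.02 per million 64-bit comparisons. At the reserved price of $1.38/hour the 8-bit figure drops to roughly $0.50 per million.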

At sustained throughput, the system delivers 768 encrypted 8-bit comparisons per second on commodity cloud hardware with no specialized accelerators.

What This Does Not Claim

These are encrypted comparison and equality throughput numbers, not arbitrary FHE computation throughput. Each operation includes full programmable bootstrapping.
The 768 TPS figure is measured under sustained 96-channel parallel load on Graviton4, not projected from single-channel latency.
Throughput scales predictably with bit width: 768 TPS (8-bit GT), 372 TPS (16-bit GT), 182 TPS (32-bit GT), 91 TPS (64-bit GT), 769 TPS (16-bit EQ).
This is not a comparison to other FHE libraries. Different libraries use different parameter sets, security levels, and hardware.

The Path Forward

The system already incorporates a fused rotate-diff-decompose optimization that eliminates per-iteration memory allocations in the bootstrapping hot loop, reducing per-gate latency by 5% over the unoptimized path. The NTT (Number Theoretic Transform) butterfly is the performance floor — it accounts for the majority of per-iteration compute and is already compiled with ARM Neoverse V2 target optimizations.
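For readers unfamiliar with the term, the butterfly is the two-input, two-output step at the heart of the NTT. The sketch below shows a generic radix-2 Cooley-Tukey butterfly over a prime modulus and applies one stage of it in place, so no memory is allocated inside the loop. It is a textbook formulation with an illustrative modulus and twiddle factors, not H33's optimized kernel.

```rust
/// One Cooley-Tukey butterfly modulo q: (a, b) -> (a + w*b, a - w*b).
/// Inputs must already be reduced below q, and q must fit in 63 bits so the
/// additions cannot overflow a u64.
fn ct_butterfly(a: u64, b: u64, w: u64, q: u64) -> (u64, u64) {
    // Widen to u128 so the product cannot overflow before reduction.
    let t = ((w as u128 * b as u128) % q as u128) as u64;
    let sum = (a + t) % q;
    let diff = (a + q - t) % q;
    (sum, diff)
}

/// Apply one butterfly stage in place, pairing element i of the lower half
/// with element i of the upper half. The buffer is reused; nothing is allocated.
fn butterfly_stage_in_place(buf: &mut [u64], twiddles: &[u64], q: u64) {
    let half = buf.len() / 2;
    let (lo, hi) = buf.split_at_mut(half);
    for ((a, b), &w) in lo.iter_mut().zip(hi.iter_mut()).zip(twiddles) {
        let (s, d) = ct_butterfly(*a, *b, w, q);
        *a = s;
        *b = d;
    }
}
```

A full transform of size N repeats such stages log2(N) times with different pairings and twiddle factors, which is why this small routine accounts for most of the per-iteration compute.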

For workloads exceeding the CPU's concurrent capacity, GPU acceleration provides linear throughput scaling at $1.01 per GPU-hour. The IQ router automatically selects CPU for latency-sensitive single requests and GPU for high-volume batch workloads. No application code changes required.

Encrypted decision making is no longer a research prototype. It runs on commodity ARM CPUs at production throughput — 768 TPS for 8-bit comparisons, scaling across bit widths to 91 TPS for 64-bit — with post-quantum attestation and full NIST security test coverage. The infrastructure exists. The numbers are real. The decisions are encrypted.

Eric Beans

CEO, H33.ai, Inc.

Patent pending. U.S. Patent Application Nos. 19/309,560 and 19/645,499. Additional applications pending.
H33-74 is a trademark of H33.ai, Inc. AWS and Graviton4 are trademarks of Amazon Web Services, Inc.