Production-grade encrypted decision making on a single ARM Graviton4 CPU. 96-channel TFHE. 768 TPS on 8-bit greater-than. All NIST security tests passed. 20,000+ tests across the platform. No CUDA. No driver dependencies. Just Rust.
This is a measured throughput number from a production-grade TFHE implementation running on a single AWS c8g.metal-48xl instance — 96 ARM Neoverse V2 cores, no GPU, no accelerator. The workload is fully homomorphic encryption: encrypted comparison and equality operations on multi-bit values, including full programmable bootstrapping, with every result post-quantum attested.
With 96 channels executing independent operations in parallel, sustained throughput reaches 768 TPS for 8-bit greater-than comparisons — measured under sustained load, not projected from single-channel latency. The 96-channel architecture scales across bit widths: 372 TPS for 16-bit, 182 TPS for 32-bit, and 91 TPS for 64-bit comparisons. Equality checks are even faster at 769 TPS for 16-bit.
Encrypted decisions are the operations where privacy meets business logic. "Is this encrypted credit score above the threshold?" "Does this encrypted biometric match?" "Is this encrypted transaction within limits?" These are comparison and threshold operations on data that never leaves encryption.
Fully homomorphic encryption makes these operations mathematically possible. The question has always been whether it can be made fast enough for production systems. The answer, as of today, is yes — on commodity CPU hardware, without specialized accelerators.
The production reality: A single Graviton4 instance delivers 768 encrypted 8-bit comparisons per second across 96 channels — measured under sustained parallel load. 16-bit comparisons at 372 TPS. 32-bit at 182 TPS. 64-bit at 91 TPS. 16-bit equality at 769 TPS. These are real-time throughputs for compliance, fraud detection, and identity verification workloads.
A single TFHE operation is inherently sequential. The programmable bootstrap — the core operation that refreshes noise and evaluates the gate function — requires hundreds of iterations of a blind rotation loop. Each iteration depends on the previous iteration's accumulator. This cannot be parallelized within a single operation.
On the Graviton4, one channel completes an encrypted comparison at its bit width's sequential depth. That is the latency floor for a single encrypted decision. No amount of hardware can make one operation faster than the sequential depth of the blind rotation.
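The dependency structure can be illustrated with a toy loop. The accumulator update below is a stand-in for the real polynomial rotation, not TFHE arithmetic; what it shows is the shape of the chain — each iteration consumes the accumulator produced by the previous one, so no two iterations can run concurrently:

```rust
/// Toy model of the blind-rotation dependency chain. The rotate/XOR
/// update is illustrative only; real TFHE rotates a polynomial
/// accumulator by an encrypted amount. The point is the data
/// dependency: step i reads the accumulator written by step i-1, so
/// the loop is strictly sequential.
fn blind_rotate_depth(iterations: usize, mut acc: u64) -> u64 {
    for i in 0..iterations {
        acc = acc.rotate_left(1) ^ (i as u64); // depends on previous acc
    }
    acc
}

fn main() {
    // Hundreds of sequential steps: the latency floor for one bootstrap
    // is the depth of this chain, regardless of available cores.
    println!("{}", blind_rotate_depth(512, 1));
}
```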
But independent operations are embarrassingly parallel. When 96 users each need an encrypted comparison at the same time, the Graviton4 runs 96 operations simultaneously — one per channel, each completing at its bit width's native speed. Throughput scales linearly with channel count because there is no shared state between operations.
This is the fundamental scaling model for CPU-based TFHE: throughput equals channel count times per-channel operation rate. On the 96-channel Graviton4, measured sustained throughput is 768 TPS for 8-bit greater-than with linear scaling efficiency — no NUMA degradation, no cache contention, no thread migration overhead. Hyperthreading provides no benefit for this workload (192 threads is slower than 96) because the NTT butterfly is purely compute-bound with no memory latency to hide.
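That scaling model is simple enough to state as code. The per-channel rate below is the one implied by the measured numbers (768 TPS / 96 channels = 8 ops/s per channel for 8-bit greater-than); this is a sketch of the model, not the production scheduler:

```rust
/// CPU-based TFHE throughput model: independent operations share no
/// state, so sustained throughput is channel count times the
/// per-channel operation rate.
fn sustained_tps(channels: u32, per_channel_ops_per_sec: f64) -> f64 {
    channels as f64 * per_channel_ops_per_sec
}

fn main() {
    // 96 channels x 8 ops/s per channel (implied by 768 / 96) = 768 TPS.
    println!("{}", sustained_tps(96, 8.0));
}
```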
Real operations span multiple bit widths and operation types. The 96-channel architecture processes encrypted comparisons and equality checks across bit widths from 8 to 64 bits. Within each operation, the circuit scheduler batches independent work from multiple users across all 96 channels simultaneously.
The measured throughput across operation types on the 96-channel Graviton4:
| Operation | Bit Width | Throughput (TPS) |
|---|---|---|
| Greater-Than | 8-bit | 768 |
| Greater-Than | 16-bit | 372 |
| Greater-Than | 32-bit | 182 |
| Greater-Than | 64-bit | 91 |
| Equality | 16-bit | 769 |
The throughput scaling follows the expected pattern: doubling the bit width roughly halves the throughput for comparison operations, since the carry chain depth doubles. Equality checks are faster than comparisons at the same bit width because the AND tree has logarithmic depth versus the comparison's linear carry chain.
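The pattern in the table can be written down as a model: each doubling of bit width doubles the carry-chain depth and therefore roughly halves comparison throughput. Anchored at the measured 8-bit rate, the ideal model predicts 384 / 192 / 96 TPS for 16/32/64 bits, against measured 372 / 182 / 91 — the small shortfall being scheduling and batching overhead. A sketch:

```rust
/// Ideal comparison throughput under the depth-doubling model, anchored
/// at the measured 8-bit rate. Real measurements fall a few percent
/// below this ideal because batching and scheduling are not free.
fn expected_gt_tps(bits: u32, measured_8bit_tps: f64) -> f64 {
    let doublings = (bits / 8).ilog2(); // 8 -> 0, 16 -> 1, 32 -> 2, 64 -> 3
    measured_8bit_tps / (1u64 << doublings) as f64
}

fn main() {
    for bits in [8u32, 16, 32, 64] {
        println!("{}-bit: {} TPS (ideal)", bits, expected_gt_tps(bits, 768.0));
    }
}
```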
At 96 concurrent channels, every core is processing an encrypted operation simultaneously. The system sustains this throughput under load, confirming no thermal throttling or resource degradation.
Encrypted decisions do not exist in isolation. A production privacy pipeline combines encrypted arithmetic (BFV), encrypted approximate computation (CKKS), and encrypted Boolean decisions (TFHE). H33's IQ routing engine automatically selects the right FHE scheme for each operation:
| Operation | Engine | Throughput | Role |
|---|---|---|---|
| Inner product match | BFV | 2,209,429/sec | Biometric similarity |
| ML inference step | CKKS | Production-grade | Risk scoring |
| Threshold decision | TFHE | 768 TPS (8-bit GT) | "Score > threshold?" |
| Attestation | H33-74 | 74 bytes | 3-family PQ signatures |
A complete biometric authentication with threshold decision flows through BFV (encrypted inner product at microsecond latency), then TFHE (encrypted comparison), with every step post-quantum attested. The IQ router selects the engine, the scheduler batches the work, and the attestation pipeline commits the result — all automatically.
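The routing logic reduces to a mapping from operation kind to scheme. The enum and function below use hypothetical names — this is not the H33 API — but they sketch the selection rule the table describes:

```rust
// Hypothetical sketch of IQ-style scheme selection; type and function
// names are illustrative, not the real H33 interface.
#[derive(Debug, PartialEq, Eq)]
enum Engine {
    Bfv,  // exact integer arithmetic (e.g., encrypted inner products)
    Ckks, // approximate arithmetic (e.g., ML inference steps)
    Tfhe, // Boolean decisions (comparisons, equality, thresholds)
}

enum Op {
    InnerProduct,
    MlInferenceStep,
    ThresholdDecision,
    EqualityCheck,
}

fn route(op: &Op) -> Engine {
    match op {
        Op::InnerProduct => Engine::Bfv,
        Op::MlInferenceStep => Engine::Ckks,
        Op::ThresholdDecision | Op::EqualityCheck => Engine::Tfhe,
    }
}

fn main() {
    // The biometric flow from the text: BFV for the match score,
    // then TFHE for the threshold decision.
    println!("{:?} -> {:?}", route(&Op::InnerProduct), route(&Op::ThresholdDecision));
}
```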
GPUs excel at batch parallelism: processing hundreds of identical operations simultaneously. A single A10G GPU sustains 1,129 TPS at batch sizes above 64. For single-operation latency, though, the GPU is slower than the CPU: kernel launch overhead dominates at small batch sizes.
For the majority of production workloads, the 96-channel CPU path is both faster and more cost-effective. GPUs provide additional throughput at linear cost for high-volume batch scenarios.
The deployment decision: For production workloads, the CPU-only path delivers 768 TPS for 8-bit comparisons with no GPU drivers, no CUDA dependency, and no vendor lock-in. Just a Graviton4 running Rust. When concurrent demand exceeds CPU capacity, GPU acceleration adds throughput linearly at $1.01 per GPU-hour (1,129 TPS on A10G).
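The deployment rule can be expressed as a small capacity calculation using the numbers from the text (768 TPS on the CPU path, 1,129 TPS per A10G for large batches); the function name and planner framing are illustrative:

```rust
/// Hypothetical capacity planner: how many A10G GPUs to add once
/// sustained demand exceeds the 96-channel CPU's rate. GPU throughput
/// adds linearly, so this is a ceiling division over the shortfall.
fn gpus_needed(demand_tps: u32, cpu_tps: u32, gpu_tps: u32) -> u32 {
    demand_tps.saturating_sub(cpu_tps).div_ceil(gpu_tps)
}

fn main() {
    // Below CPU capacity: stay CPU-only (no drivers, no CUDA).
    println!("{}", gpus_needed(500, 768, 1129));  // 0
    // 3,000 TPS of demand: 2,232 TPS shortfall -> two A10Gs.
    println!("{}", gpus_needed(3000, 768, 1129)); // 2
}
```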
Equality tests are faster than comparisons at the same bit width because the AND tree has logarithmic depth versus the comparison's linear carry chain. The circuit scheduler exploits this: equality gates at the same tree level are independent and batch across channels.
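The depth difference is concrete: an n-bit equality reduces to n bitwise equalities combined by a balanced AND tree of depth ceil(log2 n), while a greater-than comparison propagates a carry through all n positions. A simplified gate-level depth model (ignoring scheduler effects):

```rust
/// Circuit depth of an n-bit equality: n bit-equalities feed a
/// balanced AND tree, so depth grows logarithmically.
fn equality_depth(bits: u32) -> u32 {
    bits.next_power_of_two().ilog2() // ceil(log2 bits) for bits >= 1
}

/// Circuit depth of an n-bit greater-than: the carry (borrow) chain
/// is strictly sequential, so depth grows linearly.
fn comparison_depth(bits: u32) -> u32 {
    bits
}

fn main() {
    // 16-bit: AND-tree depth 4 versus carry-chain depth 16 -- why
    // 16-bit equality (769 TPS) outruns 16-bit greater-than (372 TPS).
    println!("{} vs {}", equality_depth(16), comparison_depth(16));
}
```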
Every gate evaluation produces a post-quantum attestation. The computation result, the routing decision that selected TFHE as the engine, and the authorization metadata that permitted the computation are committed to a 74-byte H33-74 primitive signed under three independent signature families: ML-DSA-65 (lattice-based), FALCON-512 (NTRU-based), and SLH-DSA-SHA2-128f (hash-based).
The attestation cost is negligible relative to the TFHE operation: signing and commitment construction add less than 1 millisecond per operation. The attestation is engine-agnostic — the same 74-byte format attests BFV arithmetic, CKKS inference, and TFHE decisions with identical verification cost.
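The shape of the attestation record can be sketched as a struct. Only the 74-byte commitment size and the three signature families come from the text; the field names and the variable-length signature encoding are assumptions:

```rust
/// Illustrative layout of the engine-agnostic attestation record.
/// The 74-byte H33-74 commitment and the three signature families are
/// from the text; everything else here is an assumed encoding.
struct Attestation {
    commitment: [u8; 74],  // result + routing decision + authz metadata
    ml_dsa_65: Vec<u8>,    // lattice-based signature
    falcon_512: Vec<u8>,   // NTRU-based signature
    slh_dsa_128f: Vec<u8>, // hash-based signature (SLH-DSA-SHA2-128f)
}

impl Attestation {
    /// Valid only if all three independent families are present
    /// (sketched as non-emptiness rather than real verification).
    fn has_all_families(&self) -> bool {
        !self.ml_dsa_65.is_empty()
            && !self.falcon_512.is_empty()
            && !self.slh_dsa_128f.is_empty()
    }
}

fn main() {
    let att = Attestation {
        commitment: [0u8; 74],
        ml_dsa_65: vec![1],
        falcon_512: vec![2],
        slh_dsa_128f: vec![3],
    };
    println!("{} {}", att.commitment.len(), att.has_all_families());
}
```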
The production configuration is deliberately simple: a single c8g.metal-48xl instance, 96 channels, and a pure-Rust stack with no GPU drivers and no CUDA dependency.
At sustained throughput, the system delivers 768 encrypted 8-bit comparisons per second on commodity cloud hardware with no specialized accelerators.
The system already incorporates a fused rotate-diff-decompose optimization that eliminates per-iteration memory allocations in the bootstrapping hot loop, reducing per-gate latency by 5% over the unoptimized path. The NTT (Number Theoretic Transform) butterfly is the performance floor — it accounts for the majority of per-iteration compute and is already compiled with ARM Neoverse V2 target optimizations.
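The hot loop's floor is the NTT butterfly itself: a modular multiply and two modular additions per coefficient pair, with no pointer chasing to hide behind. A single Cooley-Tukey butterfly over a 64-bit prime field looks like this (the modulus is illustrative, not the production parameter set):

```rust
/// One Cooley-Tukey NTT butterfly over Z_q: given coefficients (a, b)
/// and twiddle factor w, produce (a + w*b, a - w*b) mod q. Pure modular
/// arithmetic -- compute-bound, with no memory latency for SMT to hide.
/// Inputs are assumed reduced (< Q).
const Q: u64 = 0xFFFF_FFFF_0000_0001; // Goldilocks-style prime, illustrative

fn butterfly(a: u64, b: u64, w: u64) -> (u64, u64) {
    let q = Q as u128;
    let t = (b as u128 * w as u128) % q; // w * b mod q
    let hi = (a as u128 + t) % q;        // a + w*b mod q
    let lo = (a as u128 + q - t) % q;    // a - w*b mod q
    (hi as u64, lo as u64)
}

fn main() {
    println!("{:?}", butterfly(1, 1, 1)); // (2, 0)
}
```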
For workloads exceeding the CPU's concurrent capacity, GPU acceleration provides linear throughput scaling at $1.01 per GPU-hour. The IQ router automatically selects CPU for latency-sensitive single requests and GPU for high-volume batch workloads. No application code changes required.
Encrypted decision making is no longer a research prototype. It runs on commodity ARM CPUs at production throughput — 768 TPS for 8-bit comparisons, scaling across bit widths to 91 TPS for 64-bit — with post-quantum attestation and full NIST security test coverage. The infrastructure exists. The numbers are real. The decisions are encrypted.