The FHE industry has a measurement problem. Most CKKS benchmarks report isolated multiply latency — one step of a three-step pipeline. Production code never executes an isolated multiply. Here's what happens when you measure what actually matters: full encrypted workloads on production cloud hardware.
Search for "CKKS benchmark" and you will find numbers like these: 0.03ms multiply. 0.5ms multiply. 3ms multiply. The numbers vary by an order of magnitude, but they all measure the same thing: an isolated tensor product of two encrypted polynomials. The multiply step. One step of three.
In every real CKKS computation, multiply is followed by relinearization (reducing the ciphertext degree from 3 back to 2 via key-switching) and rescale (dropping one modulus to restore the working scale). These are not optional. They are not post-processing. They are the computation. Without relinearization and rescale, the ciphertext cannot be used in the next operation.
Measuring CKKS multiply without relinearization and rescale is like measuring a car's engine RPM and calling it the 0–60 time.
Measured end to end on Graviton4, the full multiply pipeline takes 61ms. That is not a competitive number against isolated multiply benchmarks. It is a different measurement. A more honest one. And it is the number that tells you what encrypted ML inference actually costs in production.
When a CKKS implementation executes a ciphertext multiplication, three operations execute in sequence:

1. **Tensor product**: the raw multiplication of the two ciphertexts, raising the ciphertext degree from 2 to 3.
2. **Relinearization**: a key-switch that reduces the degree-3 result back to degree 2.
3. **Rescale**: dropping one modulus from the chain to restore the working scale.
In our measurements on Graviton4, the breakdown is approximately:
| Step | Approx. Time | % of Pipeline |
|---|---|---|
| Tensor product | ~5ms | ~8% |
| Relinearization (key-switch) | ~50ms | ~82% |
| Rescale | ~6ms | ~10% |
| Total pipeline | 61ms | 100% |
When someone reports "0.5ms CKKS multiply," they are reporting the 8% step. The other 92% — the part that makes the ciphertext usable for the next operation — is excluded from the number.
This is not a criticism of other libraries. SEAL, OpenFHE, and Lattigo are excellent implementations. The issue is that the industry convention for "CKKS multiply benchmark" measures one step of a pipeline, and the number gets cited as though it represents the full cost. It does not.
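In code, the distinction fits in a single function. Here is a minimal sketch; the names (`Evaluator`, `tensor`, `relinearize`, `rescale`) are hypothetical stand-ins, not a specific library's API:

```rust
// Hypothetical API. An "isolated multiply" benchmark times only the first line.
fn multiply(eval: &Evaluator, a: &Ciphertext, b: &Ciphertext) -> Ciphertext {
    let d = eval.tensor(a, b);    // ~5ms: raw product, ciphertext degree 2 -> 3
    let d = eval.relinearize(&d); // ~50ms: key-switch back to degree 2
    eval.rescale(&d)              // ~6ms: drop one modulus, restore the scale
}
```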
The deeper problem is that even a full multiply pipeline is not a useful benchmark for production systems. No application performs a single encrypted multiplication and returns the result. Production encrypted compute involves sequences of operations: multiplications, additions, rotations, accumulations, and reductions. These sequences form workloads, and the cost of a workload is not the sum of its isolated operations.
A 64-dimensional encrypted dot product requires:

1. One full multiply pipeline: the element-wise product of the two vectors across SIMD slots (tensor product, relinearization, rescale).
2. A six-step rotation tree: log₂ 64 = 6 rotations, each a Galois automorphism plus key-switch.
3. An addition after each rotation to fold the rotated slots into the running sum.
The total wall-clock time for this workload is 333ms on Graviton4. Not 64 × 61ms = 3.9 seconds, and not even the 354ms implied by adding the multiply pipeline (61ms) and slot sum (293ms) numbers. The workload is faster than the sum of its parts because operations share NTT context, ciphertexts persist in NTT domain between operations, and the key-switch engine amortizes basis extension setup.
This is why workload benchmarks matter: they capture real-system behavior that primitive benchmarks miss entirely.
Every H33-CKKS benchmark publishes five workload measurements alongside primitive timings. These are the operations that production encrypted compute actually executes.
| Workload | Latency | Max Error | Description |
|---|---|---|---|
| Multiply pipeline | 61ms | <1e-5 | Full mul + relin + rescale |
| Slot sum (64 slots) | 293ms | 1.24e-5 | Reduce 64 values to one sum via rotation tree |
| Dot product (64-dim) | 333ms | 1.06e-5 | Encrypted vector inner product |
| Polynomial eval (x²) | 133ms | 1.17e-7 | Encrypted polynomial activation function |
| Dense layer (64→4) | 1,555ms | 1.1e-5 | Complete encrypted neural network layer |
Every number includes every step that production code must execute. Every number specifies the hardware. Every number includes a correctness measurement — the maximum error relative to plaintext computation. If the number cannot pass correctness verification, it does not get published.
The dot product is the core primitive of encrypted ML inference. Two encrypted 64-dimensional vectors are multiplied element-wise, then the products are reduced to a single sum via a rotation tree.
At 333ms per dot product, a system can evaluate 3 encrypted dot products per second on a single core. On 96 Graviton4 cores: 288 encrypted dot products per second. Each operates on 4,096 SIMD slots simultaneously, so the effective per-element throughput is 288 × 4,096 = 1.18 million encrypted element-operations per second.
This is what encrypted ML inference actually costs. Not the 0.5ms that an isolated multiply benchmark implies.
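In sketch form, the workload composes the two measured primitives. The names below (`mul_relin_rescale`, `slot_sum`) are hypothetical:

```rust
// Hypothetical API: a 64-dim encrypted dot product is one full multiply
// pipeline followed by a rotation-tree reduction over 64 slots.
fn dot_product_64(eval: &Evaluator, a: &Ciphertext, b: &Ciphertext) -> Ciphertext {
    let prod = eval.mul_relin_rescale(a, b); // element-wise product, ~61ms
    eval.slot_sum(&prod, 64)                 // rotation-tree reduction
}
```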
A dense (fully connected) neural network layer with 64 inputs and 4 outputs requires 4 independent dot products plus accumulation. The measured wall-clock time is 1,555ms: the four dot products account for 4 × 333ms = 1,332ms, and the remaining ~220ms goes to accumulating the outputs into the layer result.
A two-layer encrypted neural network (64→4→1) with polynomial activations between layers can execute in approximately 2 seconds on production cloud hardware. This is the real cost of encrypted ML inference — and it is a cost that no isolated primitive benchmark will ever reveal.
The slot sum reduces 64 encrypted values to a single sum via a binary rotation tree. This is the building block for encrypted mean, variance, and any aggregation operation. It takes six rotation steps, each involving a Galois automorphism and key-switch. The rotation tree is bounded to 64 active slots rather than the full 4,096 to avoid accumulating noise from unused positions.
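A sketch of that tree, again with hypothetical `rotate` (Galois automorphism plus key-switch) and `add` operations:

```rust
// Rotate-and-add reduction: after log2(n) steps, each active slot holds
// the sum of the original n slots. For n = 64 that is six rotations.
fn slot_sum(eval: &Evaluator, ct: &Ciphertext, n: usize) -> Ciphertext {
    let mut acc = ct.clone();
    let mut step = 1;
    while step < n {
        let rotated = eval.rotate(&acc, step); // one Galois key-switch per level
        acc = eval.add(&acc, &rotated);        // additions are cheap (~0.68ms)
        step <<= 1;
    }
    acc
}
```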
Evaluating x² on encrypted data costs one multiply pipeline. This is the activation function primitive for encrypted neural networks: Chebyshev polynomial approximations of ReLU, sigmoid, or tanh reduce to a sequence of these evaluations. A degree-4 approximation costs approximately 4 × 133ms = 532ms.
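A Horner-style evaluation shows where those multiplies come from. The API names are hypothetical, and since the first step is a cheaper plaintext-ciphertext multiply, d full pipelines is a slight overestimate for degree d:

```rust
// Horner's rule on encrypted x: coeffs = [c0, c1, ..., cd], degree d >= 1.
// Every ct-ct multiply in the loop is a full mul + relin + rescale pipeline
// and consumes one level of the modulus chain.
fn horner(eval: &Evaluator, x: &Ciphertext, coeffs: &[f64]) -> Ciphertext {
    let d = coeffs.len() - 1;
    let mut acc = eval.mul_plain(x, coeffs[d]);  // c_d * x (no relin needed)
    acc = eval.add_plain(&acc, coeffs[d - 1]);   // c_d * x + c_{d-1}
    for &c in coeffs[..d - 1].iter().rev() {
        acc = eval.mul_relin_rescale(&acc, x);   // full pipeline per step
        acc = eval.add_plain(&acc, c);
    }
    acc
}
```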
The workload numbers above are not from a research prototype. They come from an RNS-native CKKS implementation with specific architectural decisions that matter for production performance:
RNS-native representation. Every polynomial coefficient is stored as a vector of residues modulo multiple 60-bit primes. No BigInt anywhere in the evaluation path. No multi-precision arithmetic. No conversion overhead between polynomial domains. This is the same approach used by SEAL and OpenFHE, implemented from scratch in Rust.
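A minimal illustration of the layout; names and structure are assumptions, not the actual H33 types:

```rust
// Each row is an independent polynomial over one ~60-bit prime; there is
// no BigInt and no carry propagation between rows.
struct RnsPoly {
    residues: Vec<Vec<u64>>, // residues[i][j] = coefficient j mod moduli[i]
    moduli: Vec<u64>,
}

impl RnsPoly {
    fn add_assign(&mut self, other: &RnsPoly) {
        for (i, row) in self.residues.iter_mut().enumerate() {
            let q = self.moduli[i];
            for (a, &b) in row.iter_mut().zip(&other.residues[i]) {
                let s = *a + b; // < 2^61 for 60-bit primes: no overflow
                *a = if s >= q { s - q } else { s };
            }
        }
    }
}
```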
Special-prime key-switching. Relinearization uses the extend-Q-to-QP approach: the ciphertext basis is temporarily extended with special primes P, the key-switch is performed in the extended basis, and then the result is modded down by P to restore CRT consistency. This gives O(1) noise growth per key-switch regardless of the number of moduli in the chain.
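In symbols, for ciphertext modulus $Q = q_1 \cdots q_L$ and special primes $P$, the key-switch of the extra ciphertext component $c_2$ follows the standard GHS-style formulation (notation is ours, not lifted from the H33 source):

$$
\operatorname{KS}(c_2) \;=\; \left\lfloor \frac{1}{P}\,\bigl(c_2 \cdot \mathsf{ksk} \bmod QP\bigr) \right\rceil \bmod Q
$$

Because $\mathsf{ksk}$ encrypts $P \cdot s^2$, the key-switching noise is divided by $P$ during the mod-down, which is what keeps noise growth per key-switch independent of the chain length.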
Montgomery NTT throughout. All polynomial arithmetic operates in Montgomery form. No modular division in the hot path. Harvey lazy reduction keeps intermediate butterfly values in [0, 2q) between NTT stages, eliminating a reduction per butterfly. Radix-4 transforms minimize memory bandwidth.
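One common formulation of that butterfly is sketched below. It uses the Shoup-constant variant of Harvey's lazy reduction; the actual H33 code paths, which the text describes as Montgomery-form, may differ in detail:

```rust
/// Lazy Cooley-Tukey butterfly. Inputs x, y in [0, 4q); outputs in [0, 4q).
/// `w_shoup = floor((w << 64) / q)` is precomputed once per twiddle factor.
#[inline]
fn ct_butterfly_lazy(x: u64, y: u64, w: u64, w_shoup: u64, q: u64) -> (u64, u64) {
    let two_q = q << 1;
    // The single conditional reduction per butterfly: [0, 4q) -> [0, 2q).
    let x = if x >= two_q { x - two_q } else { x };
    // Shoup multiplication: t = w * y mod q, left lazily in [0, 2q).
    let quot = ((w_shoup as u128 * y as u128) >> 64) as u64;
    let t = w.wrapping_mul(y).wrapping_sub(quot.wrapping_mul(q));
    (x.wrapping_add(t), x.wrapping_add(two_q).wrapping_sub(t))
}
```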
NTT-form persistence. Ciphertexts remain in NTT domain between operations. Back-to-back multiply-relin-rescale sequences avoid redundant forward transforms. This reduces NTTs per tensor product from 16 to 9.
CRT-consistent sampling. The RLWE key material is sampled as integers once, then reduced to each modulus. Independent per-modulus sampling breaks global CRT coherence and causes catastrophic noise amplification during mod-down. This is a subtle correctness requirement that many implementations get wrong in initial development.
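A sketch of the correct order of operations, assuming the `rand` crate:

```rust
use rand::Rng;

/// Sample each ternary coefficient ONCE as a small signed integer, then
/// reduce that same integer into every modulus. Sampling independently
/// per modulus would break global CRT coherence.
fn sample_ternary_crt(n: usize, moduli: &[u64], rng: &mut impl Rng) -> Vec<Vec<u64>> {
    let secret: Vec<i64> = (0..n).map(|_| rng.gen_range(-1..=1)).collect();
    moduli
        .iter()
        .map(|&q| {
            secret
                .iter()
                .map(|&s| s.rem_euclid(q as i64) as u64) // -1 becomes q - 1
                .collect()
        })
        .collect()
}
```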
Rayon-parallel moduli. Each RNS modulus is processed independently via Rayon work-stealing. On 96 physical Graviton4 cores, all moduli process in parallel with verified linear scaling.
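The per-modulus independence makes the parallel structure almost trivial. A sketch using the real `rayon` API; the function itself is illustrative:

```rust
use rayon::prelude::*;

// Pointwise multiply of two RNS polynomials: each row (one modulus) is an
// independent unit of work, so rows parallelize with no synchronization.
fn pointwise_mul(a: &mut [Vec<u64>], b: &[Vec<u64>], moduli: &[u64]) {
    a.par_iter_mut()
        .zip(b.par_iter())
        .zip(moduli.par_iter())
        .for_each(|((ra, rb), &q)| {
            for (x, &y) in ra.iter_mut().zip(rb) {
                *x = ((*x as u128 * y as u128) % q as u128) as u64;
            }
        });
}
```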
CKKS is one of the FHE engines in the H33 platform. The FHE-IQ routing engine selects the appropriate engine for every operation:
| Engine | Best For | Headline Throughput |
|---|---|---|
| H33-128 (BFV) | Exact integer arithmetic, biometric auth | 2,209,429 auth/sec |
| H33-CKKS | Real-number ML, statistics, scoring | 1,574 TPS multiply |
| H33-TFHE | Comparisons, thresholds, decisions | 768 TPS (8-bit GT) |
A complete encrypted ML inference pipeline flows through multiple engines: CKKS for the forward pass, BFV for quantization, TFHE for the threshold decision. The router manages transitions automatically. The developer submits a workload; the system handles engine selection, scheme transitions, and post-quantum attestation.
Every computation — regardless of engine — is attested via the H33-74 substrate: 74 bytes containing a three-family post-quantum signature (ML-DSA + FALCON + SLH-DSA) that commits the result, the routing decision, and the authorization context. Permanently.
This is not a claim of "fastest CKKS." Isolated multiply benchmarks from SEAL and OpenFHE are faster — because they measure a different (smaller) thing. This is a claim about measurement methodology and the questions we believe matter:
Most of the FHE industry asks: "How fast is multiply?"
We ask: "How fast is an encrypted ML workload end-to-end?"
That is a category shift, not a benchmark competition.
Complete H33-CKKS throughput breakdown on Graviton4 c8g.metal-48xl (192 vCPUs, 96 physical cores):
| Operation | Latency | Per-Core TPS | 96-Core TPS |
|---|---|---|---|
| Multiply pipeline | 61ms | 16.4 | 1,574 |
| Add | 0.68ms | 1,471 | 141,216 |
| Slot sum (64) | 293ms | 3.4 | 327 |
| Dot product (64-dim) | 333ms | 3.0 | 288 |
| Polynomial eval (x²) | 133ms | 7.5 | 720 |
| Dense layer (64→4) | 1,555ms | 0.64 | 61 |
Every number measured. Every number verified for correctness. Every result post-quantum attested. All NIST security tests passed across every cryptographic library: FIPS 203 (ML-KEM/Kyber), FIPS 204 (ML-DSA/Dilithium), FIPS 205 (SLH-DSA/SPHINCS+). 20,000+ tests across the full platform.
H33 is one of the few platforms measuring full encrypted workloads, not just primitives. If you are evaluating FHE platforms for production deployment, ask every vendor the same question: "What is the wall-clock time for a 64-dimensional encrypted dot product on your production cloud hardware?" If they can answer with a measured number, you have a real comparison. If they can only cite isolated multiply latency, you do not.
Workload benchmarks are the starting point, not the end. The next phase of H33-CKKS development focuses on the operation planner: a lazy evaluation engine that fuses multiplication, relinearization, and rescale across computation graphs. When three multiplies feed into an addition, the planner defers relinearization until after accumulation — eliminating redundant key-switches.
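A sketch of the fusion with hypothetical names (`tensor`, `add_ct3`, `relinearize`, `rescale`):

```rust
// Sum of products with one deferred relinearization: the degree-3 tensor
// products are accumulated first, then a single key-switch covers the group.
fn fused_sum_of_products(eval: &Evaluator, pairs: &[(Ciphertext, Ciphertext)]) -> Ciphertext {
    let mut acc = eval.tensor(&pairs[0].0, &pairs[0].1); // ~5ms each
    for (a, b) in &pairs[1..] {
        acc = eval.add_ct3(&acc, &eval.tensor(a, b)); // additions work at degree 3
    }
    let acc = eval.relinearize(&acc); // one ~50ms key-switch instead of several
    eval.rescale(&acc)
}
```

For three multiplies feeding an addition, this trades three key-switches for one, and the key-switch is most of the 61ms pipeline cost.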
The batch executor already provides modulus-first scheduling for independent operations. The operation planner extends this to dependent operations: given a computation DAG, it determines the minimum number of key-switches required and schedules them optimally.
The goal is not faster isolated primitives. The goal is faster encrypted workloads. That is what matters. That is what we measure.