April 25, 2026 · Engineering

Encrypted Workloads, Not Primitives: Why CKKS Benchmarks Are Measuring the Wrong Thing

The FHE industry has a measurement problem. Most CKKS benchmarks report isolated multiply latency — one step of a three-step pipeline. Production code never executes an isolated multiply. Here's what happens when you measure what actually matters: full encrypted workloads on production cloud hardware.


The Measurement Gap

Search for "CKKS benchmark" and you will find numbers like these: 0.03ms multiply. 0.5ms multiply. 3ms multiply. The numbers vary by two orders of magnitude, but they all measure the same thing: an isolated tensor product of two encrypted polynomials. The multiply step. One step of three.

In every real CKKS computation, multiply is followed by relinearization (reducing the ciphertext degree from 3 back to 2 via key-switching) and rescale (dropping one modulus to restore the working scale). These are not optional. They are not post-processing. They are the computation. Without relinearization and rescale, the ciphertext cannot be used in the next operation.

Measuring CKKS multiply without relinearization and rescale is like measuring a car's engine RPM and calling it the 0–60 time.

61ms
Full CKKS multiply pipeline (multiply + relinearization + rescale) · Graviton4 c8g.metal-48xl · N=8192 · 128-bit security

That is not a competitive number against isolated multiply benchmarks. It is a different measurement. A more honest one. And it is the number that tells you what encrypted ML inference actually costs in production.

What a Full Multiply Pipeline Contains

When a CKKS implementation executes a ciphertext multiplication, three operations run in sequence:

  1. Tensor product. Two degree-1 ciphertexts (each containing 2 polynomials) are multiplied to produce a degree-2 ciphertext (3 polynomials). This involves NTT transforms, pointwise multiplication, and inverse NTT. This is the step that isolated benchmarks measure.
  2. Relinearization. The degree-2 ciphertext is reduced back to degree-1 using evaluation keys generated during setup. This requires extending the polynomial basis from Q to Q∪P (a separate set of special primes), performing a key-switch operation in the extended basis, and modding down by P to restore CRT consistency. This step dominates the latency.
  3. Rescale. One modulus is dropped from the ciphertext's modulus chain, dividing the ciphertext by the scale factor and restoring the working precision for the next operation.
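The three steps can be sketched as a pipeline over a placeholder ciphertext type. This is an illustrative model of the degree and modulus-chain bookkeeping described above, not the library's actual API; `Ciphertext` and all function names are hypothetical.

```rust
// Hypothetical sketch of the full CKKS multiply pipeline: the degree and
// modulus-chain bookkeeping only, with all polynomial arithmetic elided.

#[derive(Clone, Debug, PartialEq)]
struct Ciphertext {
    degree: usize,      // number of polynomials minus one
    moduli_left: usize, // primes remaining in the modulus chain
}

// Step 1: tensor product of two degree-1 ciphertexts -> degree-2 (3 polynomials).
fn tensor_product(a: &Ciphertext, b: &Ciphertext) -> Ciphertext {
    assert_eq!(a.moduli_left, b.moduli_left);
    Ciphertext { degree: a.degree + b.degree, moduli_left: a.moduli_left }
}

// Step 2: relinearization reduces degree 2 -> 1 via key-switching.
fn relinearize(ct: Ciphertext) -> Ciphertext {
    Ciphertext { degree: 1, ..ct }
}

// Step 3: rescale drops one modulus to restore the working scale.
fn rescale(ct: Ciphertext) -> Ciphertext {
    Ciphertext { moduli_left: ct.moduli_left - 1, ..ct }
}

// The full 61ms pipeline: all three steps, in order.
fn multiply_pipeline(a: &Ciphertext, b: &Ciphertext) -> Ciphertext {
    rescale(relinearize(tensor_product(a, b)))
}
```

Note that the output is again degree-1, which is precisely why the ciphertext is usable in the next operation, and why steps 2 and 3 cannot be skipped.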

In our measurements on Graviton4, the breakdown is approximately:

Step                           Approx. Time    % of Pipeline
Tensor product                 ~5ms            ~8%
Relinearization (key-switch)   ~50ms           ~82%
Rescale                        ~6ms            ~10%
Total pipeline                 61ms            100%

When someone reports "0.5ms CKKS multiply," they are reporting the 8% step. The other 92% — the part that makes the ciphertext usable for the next operation — is excluded from the number.

This is not a criticism of other libraries. SEAL, OpenFHE, and Lattigo are excellent implementations. The issue is that the industry convention for "CKKS multiply benchmark" measures one step of a pipeline, and the number gets cited as though it represents the full cost. It does not.

Why Workloads, Not Primitives

The deeper problem is that even a full multiply pipeline is not a useful benchmark for production systems. No application performs a single encrypted multiplication and returns the result. Production encrypted compute involves sequences of operations: multiplications, additions, rotations, accumulations, and reductions. These sequences form workloads, and the cost of a workload is not the sum of its isolated operations.

A 64-dimensional encrypted dot product requires:

  1. One full multiply pipeline (tensor product, relinearization, rescale) to form the element-wise products across the SIMD slots.
  2. A six-step rotation tree, each step a Galois automorphism plus key-switch plus addition, to reduce the 64 products to a single sum.

The total wall-clock time for this workload is 333ms on Graviton4. Not 64 × 61ms = 3.9 seconds. The workload is faster than the sum of its parts because operations share NTT context, ciphertexts persist in NTT domain between operations, and the key-switch engine amortizes basis extension setup.

This is why workload benchmarks matter: they capture real-system behavior that primitive benchmarks miss entirely.

The Five Workloads We Measure

Every H33-CKKS benchmark publishes five workload measurements alongside primitive timings. These are the operations that production encrypted compute actually executes.

Workload               Latency    Correctness    Description
Multiply pipeline      61ms       <1e-5          Full mul + relin + rescale
Slot sum (64 slots)    293ms      1.24e-5        Reduce 64 values to one sum via rotation tree
Dot product (64-dim)   333ms      1.06e-5        Encrypted vector inner product
Polynomial eval (x²)   133ms      1.17e-7        Encrypted polynomial activation function
Dense layer (64→4)     1,555ms    1.1e-5         Complete encrypted neural network layer

Every number includes every step that production code must execute. Every number specifies the hardware. Every number includes a correctness measurement — the maximum error relative to plaintext computation. If the number cannot pass correctness verification, it does not get published.

Encrypted Dot Product: 333ms

The dot product is the core primitive of encrypted ML inference. Two encrypted 64-dimensional vectors are multiplied element-wise, then the products are reduced to a single sum via a rotation tree.

At 333ms per dot product, a system can evaluate 3 encrypted dot products per second on a single core. On 96 Graviton4 cores: 288 encrypted dot products per second. Each operates on 4,096 SIMD slots simultaneously, so the effective per-element throughput is 288 × 4,096 = 1.18 million encrypted element-operations per second.
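The throughput arithmetic above is worth making explicit: latency converts to a per-core rate, scales by core count (assuming the linear core scaling reported later in this post), and then by SIMD slots per ciphertext. A minimal sketch:

```rust
// Back-of-envelope arithmetic behind the dot-product throughput claims.
// Assumes linear scaling across cores, as measured on Graviton4.

fn workloads_per_sec(latency_ms: f64, cores: u32) -> f64 {
    (1000.0 / latency_ms) * cores as f64
}

fn element_ops_per_sec(latency_ms: f64, cores: u32, slots: u32) -> f64 {
    workloads_per_sec(latency_ms, cores) * slots as f64
}
```

With `latency_ms = 333.0`, `cores = 96`, and `slots = 4096`, this reproduces the 288 dot products per second and roughly 1.18 million encrypted element-operations per second quoted above.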

This is what encrypted ML inference actually costs. Not the 0.5ms that an isolated multiply benchmark implies.

Encrypted Dense Layer: 1.56 seconds

A dense (fully connected) neural network layer with 64 inputs and 4 outputs requires 4 independent dot products plus accumulation. The measured wall-clock time is 1,555ms — slightly less than 4 × 333ms because output accumulations overlap with the final rotation steps.

A two-layer encrypted neural network (64→4→1) with polynomial activations between layers can execute in approximately 2 seconds on production cloud hardware. This is the real cost of encrypted ML inference — and it is a cost that no isolated primitive benchmark will ever reveal.

Encrypted Slot Sum: 293ms

Reducing 64 encrypted values to a single sum via a binary rotation tree. This is the building block for encrypted mean, variance, and any aggregation operation. Six rotation steps, each involving a Galois automorphism and key-switch. The rotation tree is bounded to 64 active slots rather than the full 4,096 to avoid accumulating noise from unused positions.
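The rotation-tree access pattern can be simulated on plaintext slots. In the encrypted setting each `rotate` below is a Galois automorphism plus key-switch; this plaintext model (hypothetical helper names, not the library's API) only shows why six rotate-and-add steps suffice for 64 active slots.

```rust
// Plaintext simulation of the binary rotation tree used for slot sum.

fn rotate(slots: &[f64], k: usize) -> Vec<f64> {
    let n = slots.len();
    (0..n).map(|i| slots[(i + k) % n]).collect()
}

// After log2(active) rotate-and-add steps, slot 0 holds the sum of the first
// `active` slots. `active` must be a power of two no larger than the vector.
fn slot_sum(slots: &[f64], active: usize) -> f64 {
    assert!(active.is_power_of_two() && active <= slots.len());
    let mut acc = slots.to_vec();
    let mut step = active / 2; // 32, 16, 8, 4, 2, 1 for active = 64
    while step >= 1 {
        let rotated = rotate(&acc, step);
        for (a, r) in acc.iter_mut().zip(&rotated) {
            *a += r;
        }
        step /= 2;
    }
    acc[0]
}
```

Because every index reaching slot 0 is a sum of a subset of {32, 16, 8, 4, 2, 1}, only slots 0 through 63 ever contribute, which is the bounded-active-slot property described above.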

Encrypted Polynomial Evaluation: 133ms

Evaluating x² on encrypted data: one multiply pipeline. This is the activation function primitive for encrypted neural networks. Chebyshev polynomial approximation of ReLU, sigmoid, or tanh reduces to a sequence of these evaluations. A degree-4 approximation costs approximately 4 × 133ms = 532ms.
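As an illustrative cost model, a Horner-style evaluation counts one multiply pipeline per step; the real library may schedule ciphertext multiplies differently (e.g., via a baby-step giant-step Chebyshev evaluation), but this reproduces the ~4 × 133ms arithmetic for a degree-4 approximation.

```rust
// Illustrative cost model for polynomial activations: plaintext Horner
// evaluation, charging one 133ms multiply pipeline per step.

const X2_PIPELINE_MS: f64 = 133.0; // measured x^2 evaluation latency

/// Coefficients ordered from highest degree down to the constant term.
/// Returns (value, estimated encrypted latency in ms).
fn horner_cost(coeffs: &[f64], x: f64) -> (f64, f64) {
    let mut acc = coeffs[0];
    let mut mults = 0u32;
    for &c in &coeffs[1..] {
        acc = acc * x + c; // one encrypted multiply pipeline per step
        mults += 1;
    }
    (acc, f64::from(mults) * X2_PIPELINE_MS)
}
```

A degree-4 polynomial has five coefficients and therefore four Horner steps, giving the 532ms estimate quoted above.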

The Architecture That Makes This Possible

The workload numbers above are not from a research prototype. They come from an RNS-native CKKS implementation with specific architectural decisions that matter for production performance:

RNS-native representation. Every polynomial coefficient is stored as a vector of residues modulo multiple 60-bit primes. No BigInt anywhere in the evaluation path. No multi-precision arithmetic. No conversion overhead between polynomial domains. This is the same approach used by SEAL and OpenFHE, implemented from scratch in Rust.

Special-prime key-switching. Relinearization uses the extend-Q-to-QP approach: the ciphertext basis is temporarily extended with special primes P, the key-switch is performed in the extended basis, and then the result is modded down by P to restore CRT consistency. This gives O(1) noise growth per key-switch regardless of the number of moduli in the chain.

Montgomery NTT throughout. All polynomial arithmetic operates in Montgomery form. No modular division in the hot path. Harvey lazy reduction keeps intermediate butterfly values in [0, 2q) between NTT stages, eliminating a reduction per butterfly. Radix-4 transforms minimize memory bandwidth.
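The lazy-reduction idea can be illustrated with Shoup-style precomputed quotients, a common companion technique in Harvey's NTT. The implementation described here uses Montgomery form, so treat this as a sketch of division-free lazy reduction, not the actual hot path.

```rust
// Shoup-style lazy modular multiplication: no division in the hot path,
// intermediate results left in [0, 2q). Illustrative sketch only.

/// Precompute floor(w * 2^64 / q) once per twiddle factor w (requires w < q).
fn shoup_precompute(w: u64, q: u64) -> u64 {
    (((w as u128) << 64) / q as u128) as u64
}

/// Returns a value congruent to w*y mod q lying in [0, 2q): one high
/// multiply and two wrapping multiplies. Requires w < q and q < 2^63.
fn mul_lazy(y: u64, w: u64, w_shoup: u64, q: u64) -> u64 {
    let quot = ((w_shoup as u128 * y as u128) >> 64) as u64;
    w.wrapping_mul(y).wrapping_sub(quot.wrapping_mul(q))
}

/// One conditional subtraction brings a lazy value back to [0, q)
/// at stage boundaries, eliminating a full reduction per butterfly.
fn reduce(x: u64, q: u64) -> u64 {
    if x >= q { x - q } else { x }
}
```

Keeping butterfly values in [0, 2q) and deferring the final subtraction is exactly the trick that removes one reduction per butterfly.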

NTT-form persistence. Ciphertexts remain in NTT domain between operations. Back-to-back multiply-relin-rescale sequences avoid redundant forward transforms. This reduces NTTs per tensor product from 16 to 9.

CRT-consistent sampling. The RLWE key material is sampled as integers once, then reduced to each modulus. Independent per-modulus sampling breaks global CRT coherence and causes catastrophic noise amplification during mod-down. This is a subtle correctness requirement that many implementations get wrong in initial development.
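The requirement is easiest to see in code: each coefficient is drawn once as a signed integer and the same value is reduced into every RNS modulus. The helper below is hypothetical; a real implementation samples from a ternary or discrete Gaussian distribution with a CSPRNG.

```rust
// CRT-consistent sampling sketch: one integer draw per coefficient,
// reduced into every modulus, so the residue vector corresponds to a
// single small integer under the CRT.

fn to_residues(coeff: i64, moduli: &[u64]) -> Vec<u64> {
    moduli
        .iter()
        .map(|&q| coeff.rem_euclid(q as i64) as u64) // same integer mod every prime
        .collect()
}
```

Sampling a fresh value per modulus instead produces residue vectors that correspond to no single small integer, which is exactly the CRT-coherence failure that amplifies noise during mod-down.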

Rayon-parallel moduli. Each RNS modulus is processed independently via Rayon work-stealing. On 96 physical Graviton4 cores, all moduli process in parallel with verified linear scaling.
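The shape of per-modulus parallelism can be sketched with standard-library scoped threads; the actual implementation uses Rayon work-stealing, and the per-modulus work here is a placeholder rather than a real NTT.

```rust
use std::thread;

// Per-modulus parallelism sketch: each RNS residue polynomial is
// processed independently, one task per modulus.
fn process_all_moduli(residue_polys: &mut [Vec<u64>], moduli: &[u64]) {
    thread::scope(|s| {
        for (poly, &q) in residue_polys.iter_mut().zip(moduli) {
            s.spawn(move || {
                // Placeholder per-modulus work (a real engine runs NTTs here):
                for c in poly.iter_mut() {
                    *c %= q;
                }
            });
        }
    });
}
```

Because the moduli never interact until basis extension, this decomposition has no cross-task synchronization, which is why linear scaling across physical cores is achievable.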

Where This Fits in the Larger Stack

CKKS is one of four FHE engines in the H33 platform. The FHE-IQ routing engine selects the appropriate engine for every operation:

Engine          Best For                                    Headline
H33-128 (BFV)   Exact integer arithmetic, biometric auth    2,209,429 auth/sec
H33-CKKS        Real-number ML, statistics, scoring         1,574 TPS multiply
H33-TFHE        Comparisons, thresholds, decisions          768 TPS (8-bit GT)

A complete encrypted ML inference pipeline flows through multiple engines: CKKS for the forward pass, BFV for quantization, TFHE for the threshold decision. The router manages transitions automatically. The developer submits a workload; the system handles engine selection, scheme transitions, and post-quantum attestation.

Every computation — regardless of engine — is attested via the H33-74 substrate: 74 bytes containing a three-family post-quantum signature (ML-DSA + FALCON + SLH-DSA) that commits the result, the routing decision, and the authorization context. Permanently.

What We Publish That Others Do Not

This is not a claim of "fastest CKKS." Isolated multiply benchmarks from SEAL and OpenFHE are faster — because they measure a different (smaller) thing. This is a claim about measurement methodology and the questions we believe matter:

Most of the FHE industry asks: "How fast is multiply?"

We ask: "How fast is an encrypted ML workload end-to-end?"

That is a category shift, not a benchmark competition.

The Numbers, All of Them

Complete H33-CKKS throughput breakdown on Graviton4 c8g.metal-48xl (192 vCPUs, 96 physical cores):

Operation              Latency     Per-Core TPS    96-Core TPS
Multiply pipeline      61ms        16.4            1,574
Add                    0.68ms      1,471           141,216
Slot sum (64)          293ms       3.4             327
Dot product (64-dim)   333ms       3.0             288
Polynomial eval (x²)   133ms       7.5             720
Dense layer (64→4)     1,555ms     0.64            61

Every number measured. Every number verified for correctness. Every result post-quantum attested. All NIST security tests passed across every cryptographic library: FIPS 203 (ML-KEM/Kyber), FIPS 204 (ML-DSA/Dilithium), FIPS 205 (SLH-DSA/SPHINCS+). 20,000+ tests across the full platform.

H33 is one of the few platforms measuring full encrypted workloads rather than isolated primitives. If you are evaluating FHE platforms for production deployment, ask every vendor the same question: "What is the wall-clock time for a 64-dimensional encrypted dot product on your production cloud hardware?" If they can answer with a measured number, you have a real comparison. If they can only cite isolated multiply latency, you do not.


What Comes Next

Workload benchmarks are the starting point, not the end. The next phase of H33-CKKS development focuses on the operation planner: a lazy evaluation engine that fuses multiplication, relinearization, and rescale across computation graphs. When three multiplies feed into an addition, the planner defers relinearization until after accumulation — eliminating redundant key-switches.

The batch executor already provides modulus-first scheduling for independent operations. The operation planner extends this to dependent operations: given a computation DAG, it determines the minimum number of key-switches required and schedules them optimally.
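A minimal model of the saving: if k tensor products feed one accumulation, eager evaluation pays k key-switches, while adding the degree-2 ciphertexts first pays exactly one. The types below are placeholders, not the planner's real API.

```rust
// Key-switch counting model for deferred relinearization.

struct TensorProduct; // an unrelinearized degree-2 ciphertext

// Eager: relinearize every product before adding -> one key-switch each.
fn eager_key_switches(products: &[TensorProduct]) -> usize {
    products.len()
}

// Deferred: add the degree-2 ciphertexts, relinearize the sum once.
fn deferred_key_switches(products: &[TensorProduct]) -> usize {
    usize::from(!products.is_empty())
}
```

For the three-multiplies-into-an-addition example above, this is three key-switches eagerly versus one deferred, and key-switching is ~82% of the pipeline cost.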

The goal is not faster isolated primitives. The goal is faster encrypted workloads. That is what matters. That is what we measure.


Eric Beans
CEO, H33.ai, Inc.
Patent pending. U.S. Patent Application Nos. 19/309,560 and 19/645,499. Additional applications pending.
All benchmarks measured on AWS c8g.metal-48xl (Graviton4, 192 vCPUs, Neoverse V2), April 2026. Rust 1.94.0.
All NIST security tests passed: FIPS 203 (ML-KEM), FIPS 204 (ML-DSA), FIPS 205 (SLH-DSA). FIPS 140-3 KATs operational. 20,000+ tests across the platform.
H33-74 is a trademark of H33.ai, Inc. AWS and Graviton4 are trademarks of Amazon Web Services, Inc.
SEAL is a trademark of Microsoft Corporation. OpenFHE is developed by Duality Technologies.