The FHE industry has a measurement problem. Most CKKS benchmarks report isolated multiply latency — one step of a three-step pipeline. Production code never executes an isolated multiply. Here's what happens when you measure what actually matters: full encrypted workloads on production cloud hardware.
Search for "CKKS benchmark" and you will find numbers like these: 0.03ms multiply. 0.5ms multiply. 3ms multiply. The numbers vary by an order of magnitude, but they all measure the same thing: an isolated tensor product of two encrypted polynomials. The multiply step. One step of three.
In every real CKKS computation, multiply is followed by relinearization (reducing the ciphertext degree from 3 back to 2 via key-switching) and rescale (dropping one modulus to restore the working scale). These are not optional. They are not post-processing. They are the computation. Without relinearization and rescale, the ciphertext cannot be used in the next operation.
Measuring CKKS multiply without relinearization and rescale is like measuring a car's engine RPM and calling it the 0–60 time.
Measured end to end on Graviton4, the full multiply pipeline takes 61ms. That is not a competitive number against isolated multiply benchmarks. It is a different measurement. A more honest one. And it is the number that tells you what encrypted ML inference actually costs in production.
When a CKKS implementation executes a ciphertext multiplication, three operations execute in sequence:

1. **Tensor product**: the raw multiplication of the two ciphertexts, raising the ciphertext degree from 2 to 3.
2. **Relinearization**: a key-switch that reduces the degree-3 result back to degree 2.
3. **Rescale**: dropping one modulus from the chain to restore the working scale.
In our measurements on Graviton4, the breakdown is approximately:
| Step | Approx. Time | % of Pipeline |
|---|---|---|
| Tensor product | ~5ms | ~8% |
| Relinearization (key-switch) | ~50ms | ~82% |
| Rescale | ~6ms | ~10% |
| Total pipeline | 61ms | 100% |
When someone reports "0.5ms CKKS multiply," they are reporting the 8% step. The other 92% — the part that makes the ciphertext usable for the next operation — is excluded from the number.
This is not a criticism of other libraries. SEAL, OpenFHE, and Lattigo are excellent implementations. The issue is that the industry convention for "CKKS multiply benchmark" measures one step of a pipeline, and the number gets cited as though it represents the full cost. It does not.
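In code, the distinction fits in a single function. Here is a minimal sketch; the names (`Evaluator`, `tensor`, `relinearize`, `rescale`) are hypothetical stand-ins, not a specific library's API:

```rust
// Hypothetical API. An "isolated multiply" benchmark times only the first line.
fn multiply(eval: &Evaluator, a: &Ciphertext, b: &Ciphertext) -> Ciphertext {
    let d = eval.tensor(a, b);    // ~5ms: raw product, ciphertext degree 2 -> 3
    let d = eval.relinearize(&d); // ~50ms: key-switch back to degree 2
    eval.rescale(&d)              // ~6ms: drop one modulus, restore the scale
}
```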
The deeper problem is that even a full multiply pipeline is not a useful benchmark for production systems. No application performs a single encrypted multiplication and returns the result. Production encrypted compute involves sequences of operations: multiplications, additions, rotations, accumulations, and reductions. These sequences form workloads, and the cost of a workload is not the sum of its isolated operations.
A 64-dimensional encrypted dot product requires:

1. One full multiply pipeline: the element-wise product of the two vectors across SIMD slots (tensor product, relinearization, rescale).
2. A six-step rotation tree: log₂ 64 = 6 rotations, each a Galois automorphism plus key-switch.
3. An addition after each rotation to fold the rotated slots into the running sum.
The total wall-clock time for this workload is 333ms on Graviton4. Not 64 × 61ms = 3.9 seconds, and not even the 354ms implied by adding the multiply pipeline (61ms) and slot sum (293ms) numbers. The workload is faster than the sum of its parts because operations share NTT context, ciphertexts persist in NTT domain between operations, and the key-switch engine amortizes basis extension setup.
This is why workload benchmarks matter: they capture real-system behavior that primitive benchmarks miss entirely.
Every H33-CKKS benchmark publishes five workload measurements alongside primitive timings. These are the operations that production encrypted compute actually executes.
| Workload | Latency | Max Error | Description |
|---|---|---|---|
| Multiply pipeline | 61ms | <1e-5 | Full mul + relin + rescale |
| Slot sum (64 slots) | 293ms | 1.24e-5 | Reduce 64 values to one sum via rotation tree |
| Dot product (64-dim) | 333ms | 1.06e-5 | Encrypted vector inner product |
| Polynomial eval (x²) | 133ms | 1.17e-7 | Encrypted polynomial activation function |
| Dense layer (64→4) | 1,555ms | 1.1e-5 | Complete encrypted neural network layer |
Every number includes every step that production code must execute. Every number specifies the hardware. Every number includes a correctness measurement — the maximum error relative to plaintext computation. If the number cannot pass correctness verification, it does not get published.
The dot product is the core primitive of encrypted ML inference. Two encrypted 64-dimensional vectors are multiplied element-wise, then the products are reduced to a single sum via a rotation tree.
At 333ms per dot product, a system can evaluate 3 encrypted dot products per second on a single core. On 96 Graviton4 cores: 288 encrypted dot products per second. Each operates on 4,096 SIMD slots simultaneously, so the effective per-element throughput is 288 × 4,096 = 1.18 million encrypted element-operations per second.
This is what encrypted ML inference actually costs. Not the 0.5ms that an isolated multiply benchmark implies.
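In sketch form, the workload composes the two measured primitives. The names below (`mul_relin_rescale`, `slot_sum`) are hypothetical:

```rust
// Hypothetical API: a 64-dim encrypted dot product is one full multiply
// pipeline followed by a rotation-tree reduction over 64 slots.
fn dot_product_64(eval: &Evaluator, a: &Ciphertext, b: &Ciphertext) -> Ciphertext {
    let prod = eval.mul_relin_rescale(a, b); // element-wise product, ~61ms
    eval.slot_sum(&prod, 64)                 // rotation-tree reduction
}
```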
A dense (fully connected) neural network layer with 64 inputs and 4 outputs requires 4 independent dot products plus accumulation. The measured wall-clock time is 1,555ms: the four dot products account for 4 × 333ms = 1,332ms, and the remaining ~220ms goes to accumulating the outputs into the layer result.
A two-layer encrypted neural network (64→4→1) with polynomial activations between layers can execute in approximately 2 seconds on production cloud hardware. This is the real cost of encrypted ML inference — and it is a cost that no isolated primitive benchmark will ever reveal.
The slot sum reduces 64 encrypted values to a single sum via a binary rotation tree. This is the building block for encrypted mean, variance, and any aggregation operation. It takes six rotation steps, each involving a Galois automorphism and key-switch. The rotation tree is bounded to 64 active slots rather than the full 4,096 to avoid accumulating noise from unused positions.
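A sketch of that tree, again with hypothetical `rotate` (Galois automorphism plus key-switch) and `add` operations:

```rust
// Rotate-and-add reduction: after log2(n) steps, each active slot holds
// the sum of the original n slots. For n = 64 that is six rotations.
fn slot_sum(eval: &Evaluator, ct: &Ciphertext, n: usize) -> Ciphertext {
    let mut acc = ct.clone();
    let mut step = 1;
    while step < n {
        let rotated = eval.rotate(&acc, step); // one Galois key-switch per level
        acc = eval.add(&acc, &rotated);        // additions are cheap (~0.68ms)
        step <<= 1;
    }
    acc
}
```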
Evaluating x² on encrypted data costs one multiply pipeline. This is the activation function primitive for encrypted neural networks: Chebyshev polynomial approximations of ReLU, sigmoid, or tanh reduce to a sequence of these evaluations. A degree-4 approximation costs approximately 4 × 133ms = 532ms.
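A Horner-style evaluation shows where those multiplies come from. The API names are hypothetical, and since the first step is a cheaper plaintext-ciphertext multiply, d full pipelines is a slight overestimate for degree d:

```rust
// Horner's rule on encrypted x: coeffs = [c0, c1, ..., cd], degree d >= 1.
// Every ct-ct multiply in the loop is a full mul + relin + rescale pipeline
// and consumes one level of the modulus chain.
fn horner(eval: &Evaluator, x: &Ciphertext, coeffs: &[f64]) -> Ciphertext {
    let d = coeffs.len() - 1;
    let mut acc = eval.mul_plain(x, coeffs[d]);  // c_d * x (no relin needed)
    acc = eval.add_plain(&acc, coeffs[d - 1]);   // c_d * x + c_{d-1}
    for &c in coeffs[..d - 1].iter().rev() {
        acc = eval.mul_relin_rescale(&acc, x);   // full pipeline per step
        acc = eval.add_plain(&acc, c);
    }
    acc
}
```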
The workload numbers above are not from a research prototype. They come from an RNS-native CKKS implementation with specific architectural decisions that matter for production performance:
RNS-native representation. Every polynomial coefficient is stored as a vector of residues modulo multiple 60-bit primes. No BigInt anywhere in the evaluation path. No multi-precision arithmetic. No conversion overhead between polynomial domains. This is the same approach used by SEAL and OpenFHE, implemented from scratch in Rust.
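A minimal illustration of the layout; names and structure are assumptions, not the actual H33 types:

```rust
// Each row is an independent polynomial over one ~60-bit prime; there is
// no BigInt and no carry propagation between rows.
struct RnsPoly {
    residues: Vec<Vec<u64>>, // residues[i][j] = coefficient j mod moduli[i]
    moduli: Vec<u64>,
}

impl RnsPoly {
    fn add_assign(&mut self, other: &RnsPoly) {
        for (i, row) in self.residues.iter_mut().enumerate() {
            let q = self.moduli[i];
            for (a, &b) in row.iter_mut().zip(&other.residues[i]) {
                let s = *a + b; // < 2^61 for 60-bit primes: no overflow
                *a = if s >= q { s - q } else { s };
            }
        }
    }
}
```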
Special-prime key-switching. Relinearization uses the extend-Q-to-QP approach: the ciphertext basis is temporarily extended with special primes P, the key-switch is performed in the extended basis, and then the result is modded down by P to restore CRT consistency. This gives O(1) noise growth per key-switch regardless of the number of moduli in the chain.
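In symbols, for ciphertext modulus $Q = q_1 \cdots q_L$ and special primes $P$, the key-switch of the extra ciphertext component $c_2$ follows the standard GHS-style formulation (notation is ours, not lifted from the H33 source):

$$
\operatorname{KS}(c_2) \;=\; \left\lfloor \frac{1}{P}\,\bigl(c_2 \cdot \mathsf{ksk} \bmod QP\bigr) \right\rceil \bmod Q
$$

Because $\mathsf{ksk}$ encrypts $P \cdot s^2$, the key-switching noise is divided by $P$ during the mod-down, which is what keeps noise growth per key-switch independent of the chain length.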
Montgomery NTT throughout. All polynomial arithmetic operates in Montgomery form. No modular division in the hot path. Harvey lazy reduction keeps intermediate butterfly values in [0, 2q) between NTT stages, eliminating a reduction per butterfly. Radix-4 transforms minimize memory bandwidth.
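One common formulation of that butterfly is sketched below. It uses the Shoup-constant variant of Harvey's lazy reduction; the actual H33 code paths, which the text describes as Montgomery-form, may differ in detail:

```rust
/// Lazy Cooley-Tukey butterfly. Inputs x, y in [0, 4q); outputs in [0, 4q).
/// `w_shoup = floor((w << 64) / q)` is precomputed once per twiddle factor.
#[inline]
fn ct_butterfly_lazy(x: u64, y: u64, w: u64, w_shoup: u64, q: u64) -> (u64, u64) {
    let two_q = q << 1;
    // The single conditional reduction per butterfly: [0, 4q) -> [0, 2q).
    let x = if x >= two_q { x - two_q } else { x };
    // Shoup multiplication: t = w * y mod q, left lazily in [0, 2q).
    let quot = ((w_shoup as u128 * y as u128) >> 64) as u64;
    let t = w.wrapping_mul(y).wrapping_sub(quot.wrapping_mul(q));
    (x.wrapping_add(t), x.wrapping_add(two_q).wrapping_sub(t))
}
```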
NTT-form persistence. Ciphertexts remain in NTT domain between operations. Back-to-back multiply-relin-rescale sequences avoid redundant forward transforms. This reduces NTTs per tensor product from 16 to 9.
CRT-consistent sampling. The RLWE key material is sampled as integers once, then reduced to each modulus. Independent per-modulus sampling breaks global CRT coherence and causes catastrophic noise amplification during mod-down. This is a subtle correctness requirement that many implementations get wrong in initial development.
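A sketch of the correct order of operations, assuming the `rand` crate:

```rust
use rand::Rng;

/// Sample each ternary coefficient ONCE as a small signed integer, then
/// reduce that same integer into every modulus. Sampling independently
/// per modulus would break global CRT coherence.
fn sample_ternary_crt(n: usize, moduli: &[u64], rng: &mut impl Rng) -> Vec<Vec<u64>> {
    let secret: Vec<i64> = (0..n).map(|_| rng.gen_range(-1..=1)).collect();
    moduli
        .iter()
        .map(|&q| {
            secret
                .iter()
                .map(|&s| s.rem_euclid(q as i64) as u64) // -1 becomes q - 1
                .collect()
        })
        .collect()
}
```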
Rayon-parallel moduli. Each RNS modulus is processed independently via Rayon work-stealing. On 96 physical Graviton4 cores, all moduli process in parallel with verified linear scaling.
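The per-modulus independence makes the parallel structure almost trivial. A sketch using the real `rayon` API; the function itself is illustrative:

```rust
use rayon::prelude::*;

// Pointwise multiply of two RNS polynomials: each row (one modulus) is an
// independent unit of work, so rows parallelize with no synchronization.
fn pointwise_mul(a: &mut [Vec<u64>], b: &[Vec<u64>], moduli: &[u64]) {
    a.par_iter_mut()
        .zip(b.par_iter())
        .zip(moduli.par_iter())
        .for_each(|((ra, rb), &q)| {
            for (x, &y) in ra.iter_mut().zip(rb) {
                *x = ((*x as u128 * y as u128) % q as u128) as u64;
            }
        });
}
```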
CKKS is one of the FHE engines in the H33 platform. The FHE-IQ routing engine selects the appropriate engine for every operation:
| Engine | Best For | Headline Throughput |
|---|---|---|
| H33-128 (BFV) | Exact integer arithmetic, biometric auth | 2,209,429 auth/sec |
| H33-CKKS | Real-number ML, statistics, scoring | 1,574 TPS multiply |
| H33-TFHE | Comparisons, thresholds, decisions | 768 TPS (8-bit GT) |
A complete encrypted ML inference pipeline flows through multiple engines: CKKS for the forward pass, BFV for quantization, TFHE for the threshold decision. The router manages transitions automatically. The developer submits a workload; the system handles engine selection, scheme transitions, and post-quantum attestation.
Every computation — regardless of engine — is attested via the H33-74 substrate: 74 bytes containing a three-family post-quantum signature (ML-DSA + FALCON + SLH-DSA) that commits the result, the routing decision, and the authorization context. Permanently.
This is not a claim of "fastest CKKS." Isolated multiply benchmarks from SEAL and OpenFHE are faster — because they measure a different (smaller) thing. This is a claim about measurement methodology and the questions we believe matter:
Most of the FHE industry asks: "How fast is multiply?"
We ask: "How fast is an encrypted ML workload end-to-end?"
That is a category shift, not a benchmark competition.
Complete H33-CKKS throughput breakdown on Graviton4 c8g.metal-48xl (192 vCPUs, 96 physical cores):
| Operation | Latency | Per-Core TPS | 96-Core TPS |
|---|---|---|---|
| Multiply pipeline | 61ms | 16.4 | 1,574 |
| Add | 0.68ms | 1,471 | 141,216 |
| Slot sum (64) | 293ms | 3.4 | 327 |
| Dot product (64-dim) | 333ms | 3.0 | 288 |
| Polynomial eval (x²) | 133ms | 7.5 | 720 |
| Dense layer (64→4) | 1,555ms | 0.64 | 61 |
Every number measured. Every number verified for correctness. Every result post-quantum attested. All NIST security tests passed across every cryptographic library: FIPS 203 (ML-KEM/Kyber), FIPS 204 (ML-DSA/Dilithium), FIPS 205 (SLH-DSA/SPHINCS+). 20,000+ tests across the full platform.
H33 is one of the few platforms measuring full encrypted workloads, not just primitives. If you are evaluating FHE platforms for production deployment, ask every vendor the same question: "What is the wall-clock time for a 64-dimensional encrypted dot product on your production cloud hardware?" If they can answer with a measured number, you have a real comparison. If they can only cite isolated multiply latency, you do not.
Workload benchmarks are the starting point, not the end. The next phase of H33-CKKS development focuses on the operation planner: a lazy evaluation engine that fuses multiplication, relinearization, and rescale across computation graphs. When three multiplies feed into an addition, the planner defers relinearization until after accumulation — eliminating redundant key-switches.
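A sketch of the fusion with hypothetical names (`tensor`, `add_ct3`, `relinearize`, `rescale`):

```rust
// Sum of products with one deferred relinearization: the degree-3 tensor
// products are accumulated first, then a single key-switch covers the group.
fn fused_sum_of_products(eval: &Evaluator, pairs: &[(Ciphertext, Ciphertext)]) -> Ciphertext {
    let mut acc = eval.tensor(&pairs[0].0, &pairs[0].1); // ~5ms each
    for (a, b) in &pairs[1..] {
        acc = eval.add_ct3(&acc, &eval.tensor(a, b)); // additions work at degree 3
    }
    let acc = eval.relinearize(&acc); // one ~50ms key-switch instead of several
    eval.rescale(&acc)
}
```

For three multiplies feeding an addition, this trades three key-switches for one, and the key-switch is most of the 61ms pipeline cost.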
The batch executor already provides modulus-first scheduling for independent operations. The operation planner extends this to dependent operations: given a computation DAG, it determines the minimum number of key-switches required and schedules them optimally.
The goal is not faster isolated primitives. The goal is faster encrypted workloads. That is what matters. That is what we measure.