Every number on this page includes multiply, relinearization, and rescale. We benchmark encrypted workloads—dot products, dense layers, slot sums—because isolated multiply times do not predict real performance. This makes H33-CKKS one of the few platforms that measures full encrypted workloads on production cloud hardware.
All timings measured on AWS Graviton4 c8g.metal-48xl (192 vCPUs, 371 GiB). Every operation includes multiply + relinearization + rescale where applicable. Correctness verified before each run.
| Operation | Latency | Pipeline Steps | Correctness Bound | Hardware |
|---|---|---|---|---|
| Encrypted Multiply | 61 ms | Multiply + Relin + Rescale | < 2^-20 relative error | Graviton4 |
| Encrypted Add | 0.68 ms | Add (no relin needed) | < 2^-40 relative error | Graviton4 |
| Encrypted Slot Sum | 293 ms | log2(N) rotations + adds | < 2^-15 relative error | Graviton4 |
| Encrypted Dot Product | 333 ms | Multiply + Relin + Rescale + Slot Sum | < 2^-12 relative error | Graviton4 |
| Polynomial Evaluation | 133 ms | Degree-3 poly via Horner (2 mul + relin + rescale) | < 2^-15 relative error | Graviton4 |
| Encrypted Dense Layer | 1,555 ms | Matrix-vector via diagonal method + rotations | < 2^-10 relative error | Graviton4 |
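The slot-sum pattern in the table above can be sketched on plaintext vectors. This is an illustrative simulation of the rotate-and-add data flow only, not the encrypted operation: in the real pipeline each step is a rotation (key switch) plus an add, and `slot_sum` is a hypothetical name.

```rust
// Plaintext simulation of the CKKS slot-sum pattern: log2(N) rotations + adds.
// N must be a power of two; after log2(N) rotate-and-add steps, every slot
// holds the sum of all original slots.
fn slot_sum(slots: &[f64]) -> Vec<f64> {
    let n = slots.len();
    assert!(n.is_power_of_two());
    let mut acc = slots.to_vec();
    let mut step = 1;
    while step < n {
        // rotate left by `step` and add element-wise
        let rotated: Vec<f64> = (0..n).map(|i| acc[(i + step) % n]).collect();
        for i in 0..n {
            acc[i] += rotated[i];
        }
        step <<= 1;
    }
    acc
}

fn main() {
    let v: Vec<f64> = (1..=8).map(|x| x as f64).collect();
    let summed = slot_sum(&v); // 3 rotations for N = 8; every slot = 36.0
    println!("{:?}", summed);
}
```

For N = 4,096 slots this is 12 rotation + add steps, which is why slot sum dominates the dot-product latency above.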
Why correctness bounds matter. CKKS is approximate arithmetic. Every operation introduces noise. We publish the measured correctness bound for each operation so you can determine whether H33-CKKS meets your application's precision requirements before writing a single line of code.
Throughput is measured per core and scaled to 96 cores on Graviton4. Linear scaling verified: the 96-core aggregate matches per-core throughput multiplied by 96.
| Operation | Per-Core TPS | 96-Core TPS | Scaling |
|---|---|---|---|
| Encrypted Multiply | 16.4 | 1,574 | Linear |
| Encrypted Add | 1,471 | 141,176 | Linear |
| Encrypted Slot Sum | 3.4 | 326 | Linear |
| Encrypted Dot Product | 3.0 | 288 | Linear |
| Polynomial Evaluation | 7.5 | 720 | Linear |
| Encrypted Dense Layer | 0.64 | 62 | Linear |
Linear scaling verified. No lock contention, no shared-state bottleneck. Each core runs an independent CKKS pipeline. Rayon work-stealing distributes batches without synchronization overhead. What you measure on one core, you get on 96.
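The shared-nothing structure described above can be sketched with scoped threads. This is a minimal stand-in, not the library's scheduler: the real system uses Rayon work-stealing, and `fake_pipeline` is a hypothetical placeholder for a full multiply + relin + rescale pipeline.

```rust
use std::thread;

// Each worker owns its slice of the batch and runs an independent pipeline.
// No shared mutable state means nothing to lock and nothing to contend on.
fn fake_pipeline(x: u64) -> u64 {
    x.wrapping_mul(x) // placeholder for per-ciphertext work
}

fn process_batch(batch: &[u64], workers: usize) -> Vec<u64> {
    let chunk = (batch.len() + workers - 1) / workers;
    let mut out = vec![0u64; batch.len()];
    thread::scope(|s| {
        // Each thread gets a disjoint input chunk and a disjoint output chunk.
        for (inp, outp) in batch.chunks(chunk).zip(out.chunks_mut(chunk)) {
            s.spawn(move || {
                for (i, o) in inp.iter().zip(outp.iter_mut()) {
                    *o = fake_pipeline(*i);
                }
            });
        }
    });
    out
}

fn main() {
    let batch: Vec<u64> = (0..16).collect();
    println!("{:?}", process_batch(&batch, 4));
}
```

Because the chunks are disjoint, the borrow checker can verify at compile time that workers never touch each other's data, which is the property that makes the linear scaling claim credible.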
The FHE industry has a benchmarking transparency problem. Most published CKKS numbers measure isolated operations under conditions that do not predict production workload performance.
We do not claim fastest CKKS. We claim measured. A 0.03 ms isolated multiply and a 333 ms encrypted dot product are measuring different things. One tells you how fast the library runs a single function call. The other tells you how fast your encrypted workload will actually execute.
The CKKS parameters used for all benchmarks on this page. These follow NIST/HES recommendations for 128-bit security.
5-level chain means 5 multiplications before bootstrapping. Dot products consume 1 level. Polynomial evaluation of degree d consumes log2(d) levels. Dense layers consume 1 level plus rotations (rotations are free in level consumption). Plan your circuit depth against the chain budget.
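The level accounting above can be turned into a small planner. This sketch encodes the stated costs (dot product = 1 level, degree-d polynomial = log2(d) levels, read here as ceil(log2(d)), dense layer = 1 level with rotations free); the `Op` type and function names are illustrative, not part of the library API.

```rust
// Plan circuit depth against the 5-level modulus chain using the level
// costs stated in the text.
#[derive(Clone, Copy)]
enum Op {
    DotProduct,
    Polynomial { degree: u32 },
    DenseLayer,
}

fn levels_consumed(circuit: &[Op]) -> u32 {
    circuit
        .iter()
        .map(|op| match op {
            Op::DotProduct => 1,
            // degree-d polynomial: ceil(log2(d)) levels
            Op::Polynomial { degree } => (*degree as f64).log2().ceil() as u32,
            Op::DenseLayer => 1, // rotations are free in level consumption
        })
        .sum()
}

fn fits_in_chain(circuit: &[Op], chain_levels: u32) -> bool {
    levels_consumed(circuit) <= chain_levels
}

fn main() {
    // Dense layer -> degree-3 activation -> dot product: 1 + 2 + 1 = 4 levels,
    // which fits the 5-level chain without bootstrapping.
    let circuit = [Op::DenseLayer, Op::Polynomial { degree: 3 }, Op::DotProduct];
    println!("levels: {}", levels_consumed(&circuit));
    assert!(fits_in_chain(&circuit, 5));
}
```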
Sustained throughput under continuous load, measured at batch sizes 1–32 on Graviton4. Throughput is consistent across batch sizes because modulus-first scheduling keeps the NTT domain hot across batch elements, maximizing cache reuse.
Each CKKS ciphertext encodes 4,096 complex values. A single encrypted dot product processes 4,096 element pairs in 333 ms. Effective element throughput: 12,288 elements/sec per core.
Operations are scheduled modulus-first rather than ciphertext-first. This keeps NTT twiddle factors in L1 cache across batch elements, eliminating redundant cache loads.
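The effect of the loop ordering can be sketched by counting twiddle-table loads under a simple cache model. This is an assumption-laden illustration (one cached table, a load whenever the modulus changes), not the library's actual scheduler; all names are hypothetical.

```rust
// Count twiddle-table loads for a visit schedule of (ciphertext, modulus)
// pairs, assuming the cache holds one modulus's table at a time.
fn table_loads(schedule: &[(usize, usize)]) -> usize {
    let mut loads = 0;
    let mut cached: Option<usize> = None;
    for &(_, m) in schedule {
        if cached != Some(m) {
            loads += 1; // twiddle table for modulus m brought into cache
            cached = Some(m);
        }
    }
    loads
}

// Ciphertext-first: finish each ciphertext across all moduli before moving on.
fn ciphertext_first(n_ct: usize, n_mod: usize) -> Vec<(usize, usize)> {
    (0..n_ct).flat_map(|c| (0..n_mod).map(move |m| (c, m))).collect()
}

// Modulus-first: process every ciphertext for one modulus before advancing.
fn modulus_first(n_ct: usize, n_mod: usize) -> Vec<(usize, usize)> {
    (0..n_mod).flat_map(|m| (0..n_ct).map(move |c| (c, m))).collect()
}

fn main() {
    let (n_ct, n_mod) = (32, 6); // e.g. a batch of 32 ciphertexts, 6 RNS moduli
    println!("ciphertext-first loads: {}", table_loads(&ciphertext_first(n_ct, n_mod))); // 192
    println!("modulus-first loads:    {}", table_loads(&modulus_first(n_ct, n_mod))); // 6
}
```

Under this model, modulus-first ordering loads each table once per batch instead of once per (ciphertext, modulus) pair, which is the cache-reuse argument in the text.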
14 ops/sec sustained at batch sizes 1–32. Each operation is a full encrypted dot product (multiply + relin + rescale + slot sum). This is the throughput ceiling for depth-1 workloads on a single Graviton4 core; it scales linearly to ~1,344 ops/sec aggregate across 96 cores.
The techniques behind the numbers. Every optimization is measured in the full pipeline, not in isolation.
All NTT twiddle factors stored in Montgomery form. Modular multiplication via Montgomery reduction eliminates division entirely from the innermost loop. This is the single largest contributor to throughput.
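A minimal sketch of the technique, assuming a small illustrative prime rather than the library's actual parameters: Montgomery reduction (REDC) computes a product modulo q using only multiplies, adds, and shifts, which is what removes division from the inner loop.

```rust
// Illustrative 64-bit Montgomery arithmetic (R = 2^64). The modulus and
// helper names are examples for exposition, not the production parameters.
const Q: u64 = 12289; // small NTT-friendly prime, chosen for illustration

/// q^{-1} mod 2^64 for odd q, via Newton-Hensel iteration (no division).
fn inv_mod_2_64(q: u64) -> u64 {
    let mut x = q; // correct to 3 bits: q * q == 1 (mod 8) for any odd q
    for _ in 0..5 {
        // each step doubles the number of correct low bits
        x = x.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(x)));
    }
    x
}

/// Montgomery reduction: returns t * 2^-64 mod q using no division.
fn redc(t: u128, q: u64, neg_qinv: u64) -> u64 {
    let m = (t as u64).wrapping_mul(neg_qinv);
    let t2 = ((t + m as u128 * q as u128) >> 64) as u64;
    if t2 >= q { t2 - q } else { t2 }
}

/// a * b mod Q via Montgomery form: convert in, multiply + REDC, convert out.
fn mul_mod(a: u64, b: u64) -> u64 {
    let neg_qinv = inv_mod_2_64(Q).wrapping_neg();
    let r2 = ((1u128 << 64) % Q as u128).pow(2) % Q as u128; // R^2 mod Q
    let am = redc(a as u128 * r2, Q, neg_qinv); // a * R mod Q
    let bm = redc(b as u128 * r2, Q, neg_qinv); // b * R mod Q
    let prod_m = redc(am as u128 * bm as u128, Q, neg_qinv); // a * b * R mod Q
    redc(prod_m as u128, Q, neg_qinv) // strip the Montgomery factor
}

fn main() {
    println!("{}", mul_mod(1234, 5678));
}
```

In an NTT inner loop the conversions happen once at the boundary: twiddle factors are stored pre-converted to Montgomery form, so each butterfly pays only one REDC per multiply.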
Ciphertexts remain in NTT domain across operations. A naive multiply-relin-rescale chain requires 16 NTT transforms. NTT persistence reduces this to 9 transforms—a 44% reduction in the most expensive primitive.
Independent ciphertext operations distributed across all 96 Graviton4 cores via Rayon work-stealing. No shared mutable state. No lock contention. Linear scaling verified empirically.
Rotation key switching decomposition is computed once and reused across all rotation steps in slot sum and dense layer operations. 1.35x speedup on rotation-heavy workloads.
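The compute-once-reuse pattern (often called hoisting) can be sketched with toy types. Nothing here is real cryptography: `Decomposition`, `decompose`, and `rotate_with` are hypothetical stand-ins, and the point is the call-count accounting when one ciphertext is rotated by many step amounts.

```rust
// Toy stand-in for the expensive key-switch decomposition of a ciphertext.
struct Decomposition(u64);

fn decompose(ct: u64, counter: &mut usize) -> Decomposition {
    *counter += 1; // track how many expensive decompositions we perform
    Decomposition(ct)
}

fn rotate_with(decomp: &Decomposition, step: u32) -> u64 {
    decomp.0.rotate_left(step) // stand-in for the cheap per-step key switch
}

// Rotate the same ciphertext by every step amount (as in the diagonal-method
// dense layer), either hoisted (decompose once) or naive (once per step).
fn rotate_all_steps(ct: u64, steps: &[u32], hoisted: bool) -> (Vec<u64>, usize) {
    let mut decomps = 0;
    let mut out = Vec::with_capacity(steps.len());
    if hoisted {
        let d = decompose(ct, &mut decomps); // computed once, reused below
        for &s in steps {
            out.push(rotate_with(&d, s));
        }
    } else {
        for &s in steps {
            let d = decompose(ct, &mut decomps); // recomputed every step
            out.push(rotate_with(&d, s));
        }
    }
    (out, decomps)
}

fn main() {
    let steps = [1, 2, 4, 8];
    let (a, n_naive) = rotate_all_steps(0xDEAD_BEEF, &steps, false);
    let (b, n_hoisted) = rotate_all_steps(0xDEAD_BEEF, &steps, true);
    assert_eq!(a, b); // same results, fewer expensive decompositions
    println!("naive: {} decomps, hoisted: {}", n_naive, n_hoisted);
}
```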
Static analysis of the computation graph identifies where relinearization can be deferred or shared. Reduces relinearization count by 75–87% on multi-operation workloads without affecting correctness.
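One instance of this idea is lazy relinearization over an inner product: products yield degree-2 ciphertexts, additions preserve degree, so the sum can accumulate in degree 2 and relinearize once at the end. The sketch below uses toy types (`Ct`, with `val` standing in for the decrypted value) to show the accounting; it is not the library's static analysis.

```rust
// Toy ciphertext: `val` is what it would decrypt to, `degree` its size.
#[derive(Clone, Copy)]
struct Ct { val: i64, degree: u8 }

fn mul(a: Ct, b: Ct) -> Ct {
    // tensor product of two degree-1 ciphertexts is degree-2
    Ct { val: a.val * b.val, degree: 2 }
}

fn add(a: Ct, b: Ct) -> Ct {
    Ct { val: a.val + b.val, degree: a.degree.max(b.degree) }
}

fn relinearize(a: Ct, count: &mut usize) -> Ct {
    *count += 1; // expensive key switch back to degree 1
    Ct { val: a.val, degree: 1 }
}

fn inner_product(xs: &[Ct], ys: &[Ct], lazy: bool) -> (Ct, usize) {
    let mut relins = 0;
    let mut acc = Ct { val: 0, degree: 1 };
    for (&x, &y) in xs.iter().zip(ys) {
        let mut p = mul(x, y);
        if !lazy {
            p = relinearize(p, &mut relins); // eager: relin after every product
        }
        acc = add(acc, p); // additions work fine on degree-2 ciphertexts
    }
    if lazy {
        acc = relinearize(acc, &mut relins); // lazy: one relin for the whole sum
    }
    (acc, relins)
}

fn main() {
    let enc = |v| Ct { val: v, degree: 1 };
    let xs: Vec<Ct> = [1, 2, 3, 4].map(enc).to_vec();
    let ys: Vec<Ct> = [5, 6, 7, 8].map(enc).to_vec();
    let (eager, n_eager) = inner_product(&xs, &ys, false);
    let (lazy, n_lazy) = inner_product(&xs, &ys, true);
    assert_eq!(eager.val, lazy.val); // same result, fewer relinearizations
    println!("eager: {} relins, lazy: {}", n_eager, n_lazy);
}
```

For an 8-term sum this drops 8 relinearizations to 1 (an 87.5% reduction), consistent with the 75–87% range quoted above.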
Batch operations process all ciphertexts for a single modulus before advancing. NTT twiddle tables stay in L1 cache. Eliminates thrashing at 96-worker concurrency.