CKKS Benchmarks — Full Pipeline, Production Hardware

Workloads, Not Primitives

Every number on this page includes multiply, relinearization, and rescale. We benchmark encrypted workloads (dot products, dense layers, slot sums) because isolated multiply times do not predict real performance. H33-CKKS is one of the few platforms that measures full encrypted workloads on production cloud hardware.

333 ms
Encrypted dot product · 4,096 slots · Graviton4

Full Pipeline Benchmarks

All timings measured on AWS Graviton4 c8g.metal-48xl (192 vCPUs, 371 GiB). Every operation includes multiply + relinearization + rescale where applicable. Correctness verified before each run.

| Operation | Latency | Pipeline Steps | Correctness Bound | Hardware |
| --- | --- | --- | --- | --- |
| Encrypted Multiply | 61 ms | Multiply + Relin + Rescale | < 2^-20 relative error | Graviton4 |
| Encrypted Add | 0.68 ms | Add (no relin needed) | < 2^-40 relative error | Graviton4 |
| Encrypted Slot Sum | 293 ms | log2(N) rotations + adds | < 2^-15 relative error | Graviton4 |
| Encrypted Dot Product | 333 ms | Multiply + Relin + Rescale + Slot Sum | < 2^-12 relative error | Graviton4 |
| Polynomial Evaluation | 133 ms | Degree-3 poly via Horner (2 mul + relin + rescale) | < 2^-15 relative error | Graviton4 |
| Encrypted Dense Layer | 1,555 ms | Matrix-vector via diagonal method + rotations | < 2^-10 relative error | Graviton4 |

Why correctness bounds matter. CKKS is approximate arithmetic. Every operation introduces noise. We publish the measured correctness bound for each operation so you can determine whether H33-CKKS meets your application's precision requirements before writing a single line of code.
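
As a concrete sketch of that pre-flight check: the helper below compares decrypted slot values against the expected plaintext result and asserts the published dot-product bound. The function name and the plain `Vec<f64>` stand-ins are ours for illustration; the H33-CKKS client API is not shown on this page.

```rust
/// Maximum relative error between expected plaintext slots and the
/// values recovered from a CKKS decryption. Names here are ours, not
/// part of the H33-CKKS API.
fn max_relative_error(expected: &[f64], decrypted: &[f64]) -> f64 {
    expected
        .iter()
        .zip(decrypted)
        .map(|(e, d)| {
            let denom = e.abs().max(1e-12); // guard against tiny/zero slots
            (e - d).abs() / denom
        })
        .fold(0.0, f64::max)
}

fn main() {
    // Stand-in data: pretend `decrypted` came back from an encrypted
    // dot product over the slots of one ciphertext.
    let expected = vec![1.0_f64, -2.5, 3.25, 0.5];
    let decrypted = vec![1.000_1, -2.500_2, 3.250_1, 0.500_05];

    let err = max_relative_error(&expected, &decrypted);
    let bound = 2f64.powi(-12); // published bound for the dot product
    println!("max relative error: {err:.3e} (bound {bound:.3e})");
    assert!(err < bound, "precision outside the published bound");
}
```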

TPS at Scale

Throughput measured per core and scaled to 96 cores on Graviton4. Linear scaling verified: 96-core aggregate matches per-core throughput multiplied by 96.

| Operation | Per-Core TPS | 96-Core TPS | Scaling |
| --- | --- | --- | --- |
| Encrypted Multiply | 16.4 | 1,574 | Linear |
| Encrypted Add | 1,471 | 141,176 | Linear |
| Encrypted Slot Sum | 3.4 | 326 | Linear |
| Encrypted Dot Product | 3.0 | 288 | Linear |
| Polynomial Evaluation | 7.5 | 720 | Linear |
| Encrypted Dense Layer | 0.64 | 62 | Linear |

Linear scaling verified. No lock contention, no shared-state bottleneck. Each core runs an independent CKKS pipeline. Rayon work-stealing distributes batches without synchronization overhead. What you measure on one core, you get on 96.

What We Measure vs What Others Measure

The FHE industry has a benchmarking transparency problem. Most published CKKS numbers measure isolated operations under conditions that do not predict production workload performance.

Typical Published Benchmarks

  • Isolated ciphertext multiply: 0.03–5 ms
  • Relinearization reported separately or omitted
  • Rescale not included in headline number
  • No workload-level benchmarks (dot product, dense layer)
  • Lab hardware or unspecified instance types
  • Correctness bounds not published
  • Single-core numbers only

H33-CKKS Benchmarks

  • Full pipeline: multiply + relin + rescale = 61 ms
  • Relinearization included in every timing
  • Rescale included in every timing
  • Workload benchmarks: dot product, dense layer, slot sum, poly eval
  • AWS Graviton4 c8g.metal-48xl (production cloud hardware)
  • Correctness bound published for every operation
  • Per-core and 96-core TPS with verified linear scaling

We do not claim fastest CKKS. We claim measured. A 0.03 ms isolated multiply and a 333 ms encrypted dot product are measuring different things. One tells you how fast the library runs a single function call. The other tells you how fast your encrypted workload will actually execute.

Production Parameters

The CKKS parameters used for all benchmarks on this page. These follow NIST/HES recommendations for 128-bit security.

| Parameter | Value |
| --- | --- |
| Polynomial Degree (N) | 8,192 |
| Security Level | 128-bit |
| Usable SIMD Slots | 4,096 |
| Modulus Chain Depth | 5 levels |
| Scale Factor | 2^40 |
| Encoding | Canonical embedding (complex slots) |
| Relin Key Size | ~130 MB (precomputed, cached) |
| Ciphertext Size | ~1 MB per ciphertext |

A 5-level chain allows 5 sequential multiplications before bootstrapping is required. Dot products consume 1 level. Polynomial evaluation of degree d consumes log2(d) levels, rounded up. Dense layers consume 1 level plus rotations (rotations are free in level consumption). Plan your circuit depth against the chain budget.
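
A minimal depth-planning sketch under the accounting rules above (one level per rescale, rotations free, degree-d polynomials at ceil(log2 d) levels). The `Op` enum and the example circuit are hypothetical, not part of H33-CKKS:

```rust
/// Rough level-budget check against a 5-level CKKS modulus chain,
/// using the accounting rules stated above. This is our own planning
/// sketch, not the H33-CKKS API.
#[derive(Clone, Copy)]
enum Op {
    DotProduct,    // multiply + rescale: 1 level (slot sum adds none)
    PolyEval(u32), // degree-d evaluation: ceil(log2(d)) levels
    DenseLayer,    // diagonal method: 1 level (rotations are free)
}

fn ceil_log2(d: u32) -> u32 {
    32 - (d - 1).leading_zeros() // valid for d >= 1; d = 1 gives 0
}

fn levels(op: Op) -> u32 {
    match op {
        Op::DotProduct | Op::DenseLayer => 1,
        Op::PolyEval(d) => ceil_log2(d),
    }
}

fn main() {
    const CHAIN_DEPTH: u32 = 5;
    // Hypothetical circuit: dense layer -> degree-3 activation -> dot product.
    let circuit = [Op::DenseLayer, Op::PolyEval(3), Op::DotProduct];
    let used: u32 = circuit.iter().copied().map(levels).sum();
    println!("levels used: {used} / {CHAIN_DEPTH}");
    assert!(used <= CHAIN_DEPTH, "exceeds chain budget: bootstrap or widen the chain");
}
```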

Batch Throughput

Sustained throughput under continuous load at batch sizes 1–32. Modulus-first scheduling keeps the NTT twiddle tables hot in cache across batch elements.

14 ops/sec

Sustained Batch Throughput

Measured at batch sizes 1–32 on Graviton4. Throughput is consistent across batch sizes due to modulus-first scheduling that maximizes NTT cache reuse.

4,096

Slots per Ciphertext

Each CKKS ciphertext encodes 4,096 complex values. A single encrypted dot product processes 4,096 element pairs in 333 ms. Effective element throughput: 12,288 elements/sec per core.

Modulus-First

Scheduling Strategy

Operations are scheduled modulus-first rather than ciphertext-first. This keeps NTT twiddle factors in L1 cache across batch elements, eliminating redundant cache loads.

14 ops/sec sustained at batch 1–32. Each operation is a full encrypted dot product (multiply + relin + rescale + slot sum). This is the throughput ceiling for depth-1 workloads on a single Graviton4 core. Scale linearly to 96 cores for ~1,344 ops/sec aggregate.
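
The difference is purely loop order. A structural sketch, with a stand-in kernel in place of the real NTT (the `Ciphertext` layout and function names are ours):

```rust
/// Loop-order sketch of modulus-first scheduling. `Ciphertext` and the
/// per-modulus kernel are stand-ins; the point is the iteration order.
struct Ciphertext {
    // One residue polynomial per RNS modulus (coefficients elided).
    residues: Vec<Vec<u64>>,
}

fn ntt_kernel(residue: &mut [u64], modulus_idx: usize) {
    // Real code would run an NTT here using the twiddle table for
    // `modulus_idx`; we just touch the data.
    for c in residue.iter_mut() {
        *c = c.wrapping_add(modulus_idx as u64);
    }
}

fn process_batch(batch: &mut [Ciphertext], num_moduli: usize) {
    // Modulus-first: outer loop over moduli, inner loop over the batch.
    // The twiddle table for modulus `m` is loaded once and stays hot
    // while every ciphertext in the batch is processed.
    for m in 0..num_moduli {
        for ct in batch.iter_mut() {
            ntt_kernel(&mut ct.residues[m], m);
        }
    }
    // Ciphertext-first (outer loop over `batch`) would reload all
    // `num_moduli` twiddle tables per ciphertext, thrashing the cache.
}

fn main() {
    let mut batch: Vec<Ciphertext> = (0..8)
        .map(|_| Ciphertext { residues: vec![vec![0u64; 16]; 5] })
        .collect();
    process_batch(&mut batch, 5);
    println!("processed {} ciphertexts across 5 moduli", batch.len());
}
```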

Optimization Stack

The techniques behind the numbers. Every optimization is measured in the full pipeline, not in isolation.

Montgomery NTT

Zero Division in Hot Path

All NTT twiddle factors stored in Montgomery form. Modular multiplication via Montgomery reduction eliminates division entirely from the innermost loop. This is the single largest contributor to throughput.
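
For reference, textbook Montgomery (REDC) multiplication looks like the sketch below; the reduction uses only multiplies, adds, and shifts. The toy 13-bit NTT prime 7681 is our choice for readability, and the code is a generic illustration, not the H33-CKKS kernel:

```rust
/// Textbook Montgomery multiplication with R = 2^64. Production CKKS
/// moduli are much larger than this toy prime, but the algorithm is
/// identical: no division appears anywhere in the hot path.
const Q: u64 = 7681; // small NTT-friendly prime, for illustration only

/// -Q^{-1} mod 2^64 via Newton's iteration (Q must be odd).
fn qinv_neg(q: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        // Each step doubles the number of correct low bits: 1 -> 64.
        inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// Computes a * b * R^{-1} mod Q: one wide multiply, one low multiply,
/// one wide multiply-add, one shift, one conditional subtract.
fn mont_mul(a: u64, b: u64, q: u64, qinv: u64) -> u64 {
    let t = (a as u128) * (b as u128);
    let m = (t as u64).wrapping_mul(qinv);
    let u = ((t + (m as u128) * (q as u128)) >> 64) as u64;
    if u >= q { u - q } else { u }
}

fn main() {
    let qinv = qinv_neg(Q);
    let r2 = (((1u128 << 64) % Q as u128).pow(2) % Q as u128) as u64; // R^2 mod Q
    // Convert to Montgomery form, multiply, convert back.
    let (a, b) = (1234u64, 5678u64);
    let (am, bm) = (mont_mul(a, r2, Q, qinv), mont_mul(b, r2, Q, qinv));
    let product = mont_mul(mont_mul(am, bm, Q, qinv), 1, Q, qinv);
    assert_eq!(product, (a * b) % Q);
    println!("{a} * {b} mod {Q} = {product}");
}
```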

9 vs 16

NTT Persistence

Ciphertexts remain in NTT domain across operations. A naive multiply-relin-rescale chain requires 16 NTT transforms. NTT persistence reduces this to 9 transforms—a 44% reduction in the most expensive primitive.
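
We cannot reproduce the exact 16-to-9 accounting without the pipeline internals, but the mechanism is plain lazy domain tracking. A sketch with our own stand-in types:

```rust
/// Domain-tracking sketch for NTT persistence: a polynomial records
/// whether it is in coefficient or NTT (evaluation) form, operations
/// transform lazily, and results stay in NTT form so the next op can
/// skip the round trip.
use std::sync::atomic::{AtomicU32, Ordering};

static NTT_COUNT: AtomicU32 = AtomicU32::new(0);

#[derive(Clone, Copy, PartialEq)]
enum Domain { Coeff, Ntt }

struct Poly { domain: Domain }

impl Poly {
    fn to_ntt(&mut self) {
        if self.domain == Domain::Coeff {
            NTT_COUNT.fetch_add(1, Ordering::Relaxed); // one forward transform
            self.domain = Domain::Ntt;
        }
    }
}

/// Pointwise multiply: operands are transformed only if needed, and the
/// product is *left* in NTT form (persistence).
fn multiply(a: &mut Poly, b: &mut Poly) -> Poly {
    a.to_ntt();
    b.to_ntt();
    Poly { domain: Domain::Ntt }
}

fn main() {
    let mut x = Poly { domain: Domain::Coeff };
    let mut y = Poly { domain: Domain::Coeff };
    let mut xy = multiply(&mut x, &mut y); // 2 forward transforms
    let mut z = Poly { domain: Domain::Coeff };
    let _xyz = multiply(&mut xy, &mut z);  // 1 more: xy is already in NTT form
    // A naive pipeline that converted back to coefficient form after each
    // multiply would pay an extra inverse + forward pair here.
    println!("forward NTTs: {}", NTT_COUNT.load(Ordering::Relaxed)); // 3
}
```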

96 Cores

Rayon Parallelism

Independent ciphertext operations distributed across all 96 Graviton4 cores via Rayon work-stealing. No shared mutable state. No lock contention. Linear scaling verified empirically.
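
A minimal sketch of the batch shape, using the real `rayon` crate (add `rayon = "1"` to Cargo.toml) but stand-in `Ciphertext` and kernel types; H33-CKKS's internal types are not public on this page:

```rust
use rayon::prelude::*;

#[derive(Clone)]
struct Ciphertext(Vec<u64>);

fn dot_product(a: &Ciphertext, b: &Ciphertext) -> Ciphertext {
    // Placeholder for the real multiply + relin + rescale + slot-sum
    // pipeline. Each task owns its inputs and output: no shared
    // mutable state, nothing to lock.
    Ciphertext(a.0.iter().zip(&b.0).map(|(x, y)| x.wrapping_mul(*y)).collect())
}

fn main() {
    let batch: Vec<(Ciphertext, Ciphertext)> = (0u64..96)
        .map(|i| (Ciphertext(vec![i; 4096]), Ciphertext(vec![i + 1; 4096])))
        .collect();

    // Rayon's work-stealing scheduler distributes the pairs across all
    // available cores with no synchronization in the hot path.
    let results: Vec<Ciphertext> = batch
        .par_iter()
        .map(|(a, b)| dot_product(a, b))
        .collect();

    println!("processed {} encrypted dot products", results.len());
}
```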

1.35x

Hoisted Rotations

Rotation key switching decomposition is computed once and reused across all rotation steps in slot sum and dense layer operations. 1.35x speedup on rotation-heavy workloads.
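
A structural sketch of the hoisting pattern, with our own stand-ins (`decompose`, `rotate_using`) in place of the key-switching internals: the expensive decomposition runs once, outside the rotation loop. For readability the sketch sums all n rotations of the original ciphertext, which is exactly the shape where hoisting applies, rather than the log2(N) rotate-and-add ladder.

```rust
struct Ciphertext(Vec<u64>);
struct Decomposed(Vec<Vec<u64>>);

fn decompose(ct: &Ciphertext) -> Decomposed {
    // Expensive: gadget/RNS decomposition of the ciphertext. Hoisted
    // out of the loop, it is paid once instead of once per rotation.
    Decomposed(vec![ct.0.clone(); 3])
}

fn rotate_using(decomp: &Decomposed, steps: usize) -> Ciphertext {
    // Cheap per-rotation work: apply the rotation key for `steps` to
    // the precomputed decomposition (stand-in: a plain slot rotation).
    let mut v = decomp.0[0].clone();
    v.rotate_left(steps % v.len());
    Ciphertext(v)
}

fn slot_sum_hoisted(ct: &Ciphertext) -> Ciphertext {
    let n = ct.0.len();
    let decomp = decompose(ct); // hoisted: one decomposition for all steps
    let mut acc = Ciphertext(ct.0.clone());
    for step in 1..n {
        let rotated = rotate_using(&decomp, step);
        for (a, r) in acc.0.iter_mut().zip(&rotated.0) {
            *a = a.wrapping_add(*r);
        }
    }
    acc // every slot now holds the sum of all original slots
}

fn main() {
    let ct = Ciphertext((1u64..=8).collect());
    let summed = slot_sum_hoisted(&ct);
    println!("slot 0 after rotate-and-add: {}", summed.0[0]); // 36
}
```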

75–87%

Operation Planner

Static analysis of the computation graph identifies where relinearization can be deferred or shared. Reduces relinearization count by 75–87% on multi-operation workloads without affecting correctness.
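
One of the planner's rewrites is easy to show in miniature: when many fresh products feed an addition, the degree-2 ciphertexts can be summed first and relinearized once. The types and the 16-product shape below are our illustration, not the H33-CKKS internals:

```rust
struct Ct2; // degree-2 ciphertext (c0, c1, c2): cheap to add
struct Ct1; // degree-1 ciphertext after relinearization

fn multiply_no_relin(_a: &Ct1, _b: &Ct1) -> Ct2 { Ct2 }
fn add2(_a: Ct2, _b: Ct2) -> Ct2 { Ct2 }
fn relinearize(_c: Ct2, relin_count: &mut u32) -> Ct1 {
    *relin_count += 1; // the expensive key switch we want to share
    Ct1
}

fn main() {
    let inputs: Vec<(Ct1, Ct1)> = (0..16).map(|_| (Ct1, Ct1)).collect();
    let mut relins = 0;

    // Eager evaluation would relinearize every product: 16 key switches.
    // Deferred: add the degree-2 products, then relinearize the total once.
    let total = inputs
        .iter()
        .map(|(a, b)| multiply_no_relin(a, b))
        .reduce(add2)
        .unwrap();
    let _result = relinearize(total, &mut relins);

    println!("relinearizations: {relins} (eager would use {})", inputs.len());
    // 1 vs 16 on this shape; the planner's measured range across real
    // workloads is 75-87%.
}
```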

L1 Hot

Modulus-First Scheduling

Batch operations process all ciphertexts for a single modulus before advancing. NTT twiddle tables stay in L1 cache. Eliminates thrashing at 96-worker concurrency.