CKKS Benchmarks — Full Pipeline, Production Hardware

Workloads, Not Primitives

Every number on this page includes multiply, relinearization, and rescale. We benchmark encrypted workloads (dot products, dense layers, slot sums) because isolated multiply times do not predict real performance. H33-CKKS is one of the few platforms that measures full encrypted workloads on production cloud hardware.

333 ms
Encrypted dot product · 4,096 slots · Graviton4

Full Pipeline Benchmarks

All timings measured on AWS Graviton4 c8g.metal-48xl (192 vCPUs, 371 GiB). Every operation includes multiply + relinearization + rescale where applicable. Correctness verified before each run.

| Operation | Latency | Pipeline Steps | Correctness Bound | Hardware |
| --- | --- | --- | --- | --- |
| Encrypted Multiply | 61 ms | Multiply + Relin + Rescale | < 2^-20 relative error | Graviton4 |
| Encrypted Add | 0.68 ms | Add (no relin needed) | < 2^-40 relative error | Graviton4 |
| Encrypted Slot Sum | 293 ms | log2(N) rotations + adds | < 2^-15 relative error | Graviton4 |
| Encrypted Dot Product | 333 ms | Multiply + Relin + Rescale + Slot Sum | < 2^-12 relative error | Graviton4 |
| Polynomial Evaluation | 133 ms | Degree-3 poly via Horner (2 mul + relin + rescale) | < 2^-15 relative error | Graviton4 |
| Encrypted Dense Layer | 1,555 ms | Matrix-vector via diagonal method + rotations | < 2^-10 relative error | Graviton4 |

Why correctness bounds matter. CKKS is approximate arithmetic. Every operation introduces noise. We publish the measured correctness bound for each operation so you can determine whether H33-CKKS meets your application's precision requirements before writing a single line of code.
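
As a concrete sketch of that pre-flight check: the helper below compares decrypted slot values against the expected plaintext result and asserts the published dot-product bound. The function name and the plain `Vec<f64>` stand-ins are ours for illustration; the H33-CKKS client API is not shown on this page.

```rust
/// Maximum relative error between expected plaintext slots and the
/// values recovered from a CKKS decryption. Names here are ours, not
/// part of the H33-CKKS API.
fn max_relative_error(expected: &[f64], decrypted: &[f64]) -> f64 {
    expected
        .iter()
        .zip(decrypted)
        .map(|(e, d)| {
            let denom = e.abs().max(1e-12); // guard against tiny/zero slots
            (e - d).abs() / denom
        })
        .fold(0.0, f64::max)
}

fn main() {
    // Stand-in data: pretend `decrypted` came back from an encrypted
    // dot product over the slots of one ciphertext.
    let expected = vec![1.0_f64, -2.5, 3.25, 0.5];
    let decrypted = vec![1.000_1, -2.500_2, 3.250_1, 0.500_05];

    let err = max_relative_error(&expected, &decrypted);
    let bound = 2f64.powi(-12); // published bound for the dot product
    println!("max relative error: {err:.3e} (bound {bound:.3e})");
    assert!(err < bound, "precision outside the published bound");
}
```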

TPS at Scale

Throughput measured per core and scaled to 96 cores on Graviton4. Linear scaling verified: 96-core aggregate matches per-core throughput multiplied by 96.

| Operation | Per-Core TPS | 96-Core TPS | Scaling |
| --- | --- | --- | --- |
| Encrypted Multiply | 16.4 | 1,574 | Linear |
| Encrypted Add | 1,471 | 141,176 | Linear |
| Encrypted Slot Sum | 3.4 | 326 | Linear |
| Encrypted Dot Product | 3.0 | 288 | Linear |
| Polynomial Evaluation | 7.5 | 720 | Linear |
| Encrypted Dense Layer | 0.64 | 62 | Linear |

Linear scaling verified. No lock contention, no shared-state bottleneck. Each core runs an independent CKKS pipeline. Rayon work-stealing distributes batches without synchronization overhead. What you measure on one core, you get on 96.

What We Measure vs What Others Measure

The FHE industry has a benchmarking transparency problem. Most published CKKS numbers measure isolated operations under conditions that do not predict production workload performance.

Typical Published Benchmarks

  • Isolated ciphertext multiply: 0.03–5 ms
  • Relinearization reported separately or omitted
  • Rescale not included in headline number
  • No workload-level benchmarks (dot product, dense layer)
  • Lab hardware or unspecified instance types
  • Correctness bounds not published
  • Single-core numbers only

H33-CKKS Benchmarks

  • Full pipeline: multiply + relin + rescale = 61 ms
  • Relinearization included in every timing
  • Rescale included in every timing
  • Workload benchmarks: dot product, dense layer, slot sum, poly eval
  • AWS Graviton4 c8g.metal-48xl (production cloud hardware)
  • Correctness bound published for every operation
  • Per-core and 96-core TPS with verified linear scaling

We do not claim fastest CKKS. We claim measured. A 0.03 ms isolated multiply and a 333 ms encrypted dot product are measuring different things. One tells you how fast the library runs a single function call. The other tells you how fast your encrypted workload will actually execute.

Production Parameters

The CKKS parameters used for all benchmarks on this page. These follow NIST/HES recommendations for 128-bit security.

| Parameter | Value |
| --- | --- |
| Polynomial Degree (N) | 8,192 |
| Security Level | 128-bit |
| Usable SIMD Slots | 4,096 |
| Modulus Chain Depth | 5 levels |
| Scale Factor | 2^40 |
| Encoding | Canonical embedding (complex slots) |
| Relin Key Size | ~130 MB (precomputed, cached) |
| Ciphertext Size | ~1 MB per ciphertext |

A 5-level chain allows 5 sequential multiplications before bootstrapping is required. Dot products consume 1 level. Polynomial evaluation of degree d consumes log2(d) levels, rounded up. Dense layers consume 1 level plus rotations (rotations are free in level consumption). Plan your circuit depth against the chain budget.
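
A minimal depth-planning sketch under the accounting rules above (one level per rescale, rotations free, degree-d polynomials at ceil(log2 d) levels). The `Op` enum and the example circuit are hypothetical, not part of H33-CKKS:

```rust
/// Rough level-budget check against a 5-level CKKS modulus chain,
/// using the accounting rules stated above. This is our own planning
/// sketch, not the H33-CKKS API.
#[derive(Clone, Copy)]
enum Op {
    DotProduct,    // multiply + rescale: 1 level (slot sum adds none)
    PolyEval(u32), // degree-d evaluation: ceil(log2(d)) levels
    DenseLayer,    // diagonal method: 1 level (rotations are free)
}

fn ceil_log2(d: u32) -> u32 {
    32 - (d - 1).leading_zeros() // valid for d >= 1; d = 1 gives 0
}

fn levels(op: Op) -> u32 {
    match op {
        Op::DotProduct | Op::DenseLayer => 1,
        Op::PolyEval(d) => ceil_log2(d),
    }
}

fn main() {
    const CHAIN_DEPTH: u32 = 5;
    // Hypothetical circuit: dense layer -> degree-3 activation -> dot product.
    let circuit = [Op::DenseLayer, Op::PolyEval(3), Op::DotProduct];
    let used: u32 = circuit.iter().copied().map(levels).sum();
    println!("levels used: {used} / {CHAIN_DEPTH}");
    assert!(used <= CHAIN_DEPTH, "exceeds chain budget: bootstrap or widen the chain");
}
```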

Batch Throughput

Sustained throughput under continuous load at batch sizes 1–32. Modulus-first scheduling keeps the NTT twiddle tables hot in cache across batch elements.

14 ops/sec

Sustained Batch Throughput

Measured at batch sizes 1–32 on Graviton4. Throughput is consistent across batch sizes due to modulus-first scheduling that maximizes NTT cache reuse.

4,096

Slots per Ciphertext

Each CKKS ciphertext encodes 4,096 complex values. A single encrypted dot product processes 4,096 element pairs in 333 ms. Effective element throughput: 12,288 elements/sec per core.

Modulus-First

Scheduling Strategy

Operations are scheduled modulus-first rather than ciphertext-first. This keeps NTT twiddle factors in L1 cache across batch elements, eliminating redundant cache loads.

14 ops/sec sustained at batch 1–32. Each operation is a full encrypted dot product (multiply + relin + rescale + slot sum). This is the throughput ceiling for depth-1 workloads on a single Graviton4 core. Scale linearly to 96 cores for ~1,344 ops/sec aggregate.
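
The difference is purely loop order. A structural sketch, with a stand-in kernel in place of the real NTT (the `Ciphertext` layout and function names are ours):

```rust
/// Loop-order sketch of modulus-first scheduling. `Ciphertext` and the
/// per-modulus kernel are stand-ins; the point is the iteration order.
struct Ciphertext {
    // One residue polynomial per RNS modulus (coefficients elided).
    residues: Vec<Vec<u64>>,
}

fn ntt_kernel(residue: &mut [u64], modulus_idx: usize) {
    // Real code would run an NTT here using the twiddle table for
    // `modulus_idx`; we just touch the data.
    for c in residue.iter_mut() {
        *c = c.wrapping_add(modulus_idx as u64);
    }
}

fn process_batch(batch: &mut [Ciphertext], num_moduli: usize) {
    // Modulus-first: outer loop over moduli, inner loop over the batch.
    // The twiddle table for modulus `m` is loaded once and stays hot
    // while every ciphertext in the batch is processed.
    for m in 0..num_moduli {
        for ct in batch.iter_mut() {
            ntt_kernel(&mut ct.residues[m], m);
        }
    }
    // Ciphertext-first (outer loop over `batch`) would reload all
    // `num_moduli` twiddle tables per ciphertext, thrashing the cache.
}

fn main() {
    let mut batch: Vec<Ciphertext> = (0..8)
        .map(|_| Ciphertext { residues: vec![vec![0u64; 16]; 5] })
        .collect();
    process_batch(&mut batch, 5);
    println!("processed {} ciphertexts across 5 moduli", batch.len());
}
```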

Optimization Stack

The techniques behind the numbers. Every optimization is measured in the full pipeline, not in isolation.

Montgomery NTT

Zero Division in Hot Path

All NTT twiddle factors stored in Montgomery form. Modular multiplication via Montgomery reduction eliminates division entirely from the innermost loop. This is the single largest contributor to throughput.
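
For reference, textbook Montgomery (REDC) multiplication looks like the sketch below; the reduction uses only multiplies, adds, and shifts. The toy 13-bit NTT prime 7681 is our choice for readability, and the code is a generic illustration, not the H33-CKKS kernel:

```rust
/// Textbook Montgomery multiplication with R = 2^64. Production CKKS
/// moduli are much larger than this toy prime, but the algorithm is
/// identical: no division appears anywhere in the hot path.
const Q: u64 = 7681; // small NTT-friendly prime, for illustration only

/// -Q^{-1} mod 2^64 via Newton's iteration (Q must be odd).
fn qinv_neg(q: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        // Each step doubles the number of correct low bits: 1 -> 64.
        inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// Computes a * b * R^{-1} mod Q: one wide multiply, one low multiply,
/// one wide multiply-add, one shift, one conditional subtract.
fn mont_mul(a: u64, b: u64, q: u64, qinv: u64) -> u64 {
    let t = (a as u128) * (b as u128);
    let m = (t as u64).wrapping_mul(qinv);
    let u = ((t + (m as u128) * (q as u128)) >> 64) as u64;
    if u >= q { u - q } else { u }
}

fn main() {
    let qinv = qinv_neg(Q);
    let r2 = (((1u128 << 64) % Q as u128).pow(2) % Q as u128) as u64; // R^2 mod Q
    // Convert to Montgomery form, multiply, convert back.
    let (a, b) = (1234u64, 5678u64);
    let (am, bm) = (mont_mul(a, r2, Q, qinv), mont_mul(b, r2, Q, qinv));
    let product = mont_mul(mont_mul(am, bm, Q, qinv), 1, Q, qinv);
    assert_eq!(product, (a * b) % Q);
    println!("{a} * {b} mod {Q} = {product}");
}
```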

9 vs 16

NTT Persistence

Ciphertexts remain in NTT domain across operations. A naive multiply-relin-rescale chain requires 16 NTT transforms. NTT persistence reduces this to 9 transforms—a 44% reduction in the most expensive primitive.
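
We cannot reproduce the exact 16-to-9 accounting without the pipeline internals, but the mechanism is plain lazy domain tracking. A sketch with our own stand-in types:

```rust
/// Domain-tracking sketch for NTT persistence: a polynomial records
/// whether it is in coefficient or NTT (evaluation) form, operations
/// transform lazily, and results stay in NTT form so the next op can
/// skip the round trip.
use std::sync::atomic::{AtomicU32, Ordering};

static NTT_COUNT: AtomicU32 = AtomicU32::new(0);

#[derive(Clone, Copy, PartialEq)]
enum Domain { Coeff, Ntt }

struct Poly { domain: Domain }

impl Poly {
    fn to_ntt(&mut self) {
        if self.domain == Domain::Coeff {
            NTT_COUNT.fetch_add(1, Ordering::Relaxed); // one forward transform
            self.domain = Domain::Ntt;
        }
    }
}

/// Pointwise multiply: operands are transformed only if needed, and the
/// product is *left* in NTT form (persistence).
fn multiply(a: &mut Poly, b: &mut Poly) -> Poly {
    a.to_ntt();
    b.to_ntt();
    Poly { domain: Domain::Ntt }
}

fn main() {
    let mut x = Poly { domain: Domain::Coeff };
    let mut y = Poly { domain: Domain::Coeff };
    let mut xy = multiply(&mut x, &mut y); // 2 forward transforms
    let mut z = Poly { domain: Domain::Coeff };
    let _xyz = multiply(&mut xy, &mut z);  // 1 more: xy is already in NTT form
    // A naive pipeline that converted back to coefficient form after each
    // multiply would pay an extra inverse + forward pair here.
    println!("forward NTTs: {}", NTT_COUNT.load(Ordering::Relaxed)); // 3
}
```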

96 Cores

Rayon Parallelism

Independent ciphertext operations distributed across all 96 Graviton4 cores via Rayon work-stealing. No shared mutable state. No lock contention. Linear scaling verified empirically.
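
A minimal sketch of the batch shape, using the real `rayon` crate (add `rayon = "1"` to Cargo.toml) but stand-in `Ciphertext` and kernel types; H33-CKKS's internal types are not public on this page:

```rust
use rayon::prelude::*;

#[derive(Clone)]
struct Ciphertext(Vec<u64>);

fn dot_product(a: &Ciphertext, b: &Ciphertext) -> Ciphertext {
    // Placeholder for the real multiply + relin + rescale + slot-sum
    // pipeline. Each task owns its inputs and output: no shared
    // mutable state, nothing to lock.
    Ciphertext(a.0.iter().zip(&b.0).map(|(x, y)| x.wrapping_mul(*y)).collect())
}

fn main() {
    let batch: Vec<(Ciphertext, Ciphertext)> = (0u64..96)
        .map(|i| (Ciphertext(vec![i; 4096]), Ciphertext(vec![i + 1; 4096])))
        .collect();

    // Rayon's work-stealing scheduler distributes the pairs across all
    // available cores with no synchronization in the hot path.
    let results: Vec<Ciphertext> = batch
        .par_iter()
        .map(|(a, b)| dot_product(a, b))
        .collect();

    println!("processed {} encrypted dot products", results.len());
}
```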

1.35x

Hoisted Rotations

Rotation key switching decomposition is computed once and reused across all rotation steps in slot sum and dense layer operations. 1.35x speedup on rotation-heavy workloads.
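
A structural sketch of the hoisting pattern, with our own stand-ins (`decompose`, `rotate_using`) in place of the key-switching internals: the expensive decomposition runs once, outside the rotation loop. For readability the sketch sums all n rotations of the original ciphertext, which is exactly the shape where hoisting applies, rather than the log2(N) rotate-and-add ladder.

```rust
struct Ciphertext(Vec<u64>);
struct Decomposed(Vec<Vec<u64>>);

fn decompose(ct: &Ciphertext) -> Decomposed {
    // Expensive: gadget/RNS decomposition of the ciphertext. Hoisted
    // out of the loop, it is paid once instead of once per rotation.
    Decomposed(vec![ct.0.clone(); 3])
}

fn rotate_using(decomp: &Decomposed, steps: usize) -> Ciphertext {
    // Cheap per-rotation work: apply the rotation key for `steps` to
    // the precomputed decomposition (stand-in: a plain slot rotation).
    let mut v = decomp.0[0].clone();
    v.rotate_left(steps % v.len());
    Ciphertext(v)
}

fn slot_sum_hoisted(ct: &Ciphertext) -> Ciphertext {
    let n = ct.0.len();
    let decomp = decompose(ct); // hoisted: one decomposition for all steps
    let mut acc = Ciphertext(ct.0.clone());
    for step in 1..n {
        let rotated = rotate_using(&decomp, step);
        for (a, r) in acc.0.iter_mut().zip(&rotated.0) {
            *a = a.wrapping_add(*r);
        }
    }
    acc // every slot now holds the sum of all original slots
}

fn main() {
    let ct = Ciphertext((1u64..=8).collect());
    let summed = slot_sum_hoisted(&ct);
    println!("slot 0 after rotate-and-add: {}", summed.0[0]); // 36
}
```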

75–87%

Operation Planner

Static analysis of the computation graph identifies where relinearization can be deferred or shared. Reduces relinearization count by 75–87% on multi-operation workloads without affecting correctness.
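
One of the planner's rewrites is easy to show in miniature: when many fresh products feed an addition, the degree-2 ciphertexts can be summed first and relinearized once. The types and the 16-product shape below are our illustration, not the H33-CKKS internals:

```rust
struct Ct2; // degree-2 ciphertext (c0, c1, c2): cheap to add
struct Ct1; // degree-1 ciphertext after relinearization

fn multiply_no_relin(_a: &Ct1, _b: &Ct1) -> Ct2 { Ct2 }
fn add2(_a: Ct2, _b: Ct2) -> Ct2 { Ct2 }
fn relinearize(_c: Ct2, relin_count: &mut u32) -> Ct1 {
    *relin_count += 1; // the expensive key switch we want to share
    Ct1
}

fn main() {
    let inputs: Vec<(Ct1, Ct1)> = (0..16).map(|_| (Ct1, Ct1)).collect();
    let mut relins = 0;

    // Eager evaluation would relinearize every product: 16 key switches.
    // Deferred: add the degree-2 products, then relinearize the total once.
    let total = inputs
        .iter()
        .map(|(a, b)| multiply_no_relin(a, b))
        .reduce(add2)
        .unwrap();
    let _result = relinearize(total, &mut relins);

    println!("relinearizations: {relins} (eager would use {})", inputs.len());
    // 1 vs 16 on this shape; the planner's measured range across real
    // workloads is 75-87%.
}
```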

L1 Hot

Modulus-First Scheduling

Batch operations process all ciphertexts for a single modulus before advancing. NTT twiddle tables stay in L1 cache. Eliminates thrashing at 96-worker concurrency.