April 24, 2026 · Engineering · Updated April 27, 2026 with RNS-native benchmark results from c8g.metal-48xl

CKKS That Scales: Encrypted ML Inference in Production

Production CKKS fully homomorphic encryption on Graviton4 metal (c8g.metal-48xl, 192 vCPUs). Encrypted addition in 0.68 milliseconds. Full multiply pipeline (multiply + relinearization + rescale) in 61 milliseconds. 4,096 slots per ciphertext. How approximate FHE fits into a three-engine stack that handles arithmetic, inference, and decisions — all on encrypted data.

The Numbers

141,216 adds/sec

CKKS encrypted addition · N=8192 · 128-bit security · 4,096 slots · c8g.metal-48xl (Graviton4, 192 vCPUs)

These are measured numbers from a production CKKS implementation running on AWS Graviton4 (c8g.metal-48xl, 192 vCPUs, ARM Neoverse V2). The RNS-native pipeline eliminates the BigInt path entirely — all modular arithmetic runs in native 64-bit residue channels. The scheme is Cheon-Kim-Kim-Song — the approximate arithmetic variant of fully homomorphic encryption designed for real-number computation on encrypted data.

Operation	Latency	Ops/sec
Encrypted Addition	0.68ms	141,216
Encrypted Multiply (full pipeline: multiply + relin + rescale)	61ms	1,574
Encrypt	~10ms	~100
Decrypt	<1ms	>1,000

Parameters: N=8192, multiplicative depth 4, 128-bit security per the Homomorphic Encryption Standard v1.1. Each ciphertext packs 4,096 complex slots — one operation on a single ciphertext computes across 4,096 independent values simultaneously.

What CKKS Is For

CKKS is the FHE scheme built for real numbers. Unlike BFV (exact integers) and TFHE (encrypted bits), CKKS performs approximate arithmetic: it computes on encrypted real values held at a fixed scale, with bounded precision loss per operation. This makes it the natural choice for machine learning inference, statistical computation, and any workload where the inputs are continuous values rather than discrete integers or Boolean flags.

The core use cases for CKKS in production privacy systems:

Encrypted ML inference: Neural network forward passes on encrypted feature vectors. Weights are plaintext (the model is public); inputs are encrypted (the data is private). Matrix-vector products and polynomial activations execute entirely on ciphertexts.
Encrypted statistical aggregation: Mean, variance, and covariance computed over encrypted datasets without decrypting individual records. Regulatory analytics where the regulator sees aggregate statistics but not individual entries.
Encrypted risk scoring: Portfolio risk models, credit scoring, and fraud detection features computed on encrypted financial data. The model evaluates; the data stays private.
Encrypted similarity search: Cosine similarity and Euclidean distance between encrypted vectors. The distance is computed without revealing either vector.
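
The aggregation case can be illustrated on plaintext slot vectors. The sketch below is not an H33 API; it only simulates slot rotation to show the common rotate-and-sum pattern, in which log2(n) rotations and additions leave the total in every slot. A mean then needs one further multiplication by the plaintext constant 1/n. The function names and the power-of-two slot count are assumptions made for the example.

```rust
/// Cyclic left rotation of a slot vector by `k` positions
/// (a plaintext stand-in for a homomorphic rotation).
fn rotate(slots: &[f64], k: usize) -> Vec<f64> {
    let n = slots.len();
    (0..n).map(|i| slots[(i + k) % n]).collect()
}

/// Rotate-and-sum: after log2(n) rounds every slot holds the sum of all slots.
/// Assumes the slot count is a power of two.
fn sum_all_slots(slots: &[f64]) -> Vec<f64> {
    let n = slots.len();
    assert!(n.is_power_of_two(), "slot count must be a power of two");
    let mut acc = slots.to_vec();
    let mut shift = 1;
    while shift < n {
        let rotated = rotate(&acc, shift);
        for (a, r) in acc.iter_mut().zip(rotated.iter()) {
            *a += *r;
        }
        shift *= 2;
    }
    acc
}

fn main() {
    let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    let totals = sum_all_slots(&data);
    // On ciphertexts this division would be a multiplication by plaintext 1/n.
    let mean = totals[0] / data.len() as f64;
    println!("total = {}, mean = {}", totals[0], mean);
}
```

With 4,096 slots the same pattern would take 12 rotate-and-add rounds, and none of them consumes multiplicative depth.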

The 4,096-slot advantage: Each CKKS ciphertext packs 4,096 independent values into a single encrypted polynomial. One encrypted addition operates on all 4,096 values simultaneously in 0.68ms, an effective per-element cost of 166 nanoseconds. This SIMD-style batching is what makes CKKS practical for vector and matrix operations.

How CKKS Works (Without the Math)

Approximate Arithmetic

Every CKKS operation introduces a small amount of precision loss. An encrypted addition preserves high precision. An encrypted multiplication preserves working precision but consumes one level of the modulus chain — a finite resource that determines the total multiplicative depth available before the ciphertext must be refreshed via bootstrapping.

With our production parameters, the system supports 4 sequential multiplications before bootstrapping. Rescaling after each multiplication brings the ciphertext back to its working scale at the cost of one modulus from the chain.

In practice, 4 levels of multiplicative depth is sufficient for most ML inference tasks: a two-layer neural network with polynomial activations, or a degree-4 polynomial approximation of a sigmoid or ReLU function. Deeper networks can be evaluated by interleaving bootstrapping operations at the cost of additional latency.
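
To make the polynomial-activation idea concrete, here is a minimal plaintext sketch that evaluates a low-degree polynomial with Horner's rule and compares it to the true sigmoid. The coefficients are the degree-3 Taylor expansion of the sigmoid around zero (1/2 + x/4 - x^3/48), chosen only because they are easy to verify; a production system would more likely use a Chebyshev or least-squares fit over a wider interval. On ciphertexts, each Horner step is one multiplication by x, so a degree-3 polynomial fits within a depth budget of 4. The function names are illustrative.

```rust
/// Evaluate c[0] + c[1]*x + c[2]*x^2 + ... with Horner's rule.
/// On ciphertexts, each step would cost one multiplication by x.
fn horner(coeffs: &[f64], x: f64) -> f64 {
    coeffs.iter().rev().fold(0.0, |acc, &c| acc * x + c)
}

/// Reference sigmoid, computed in the clear for comparison.
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

fn main() {
    // Taylor coefficients of sigmoid about 0: 1/2 + x/4 - x^3/48.
    let coeffs = [0.5, 0.25, 0.0, -1.0 / 48.0];
    for &x in &[-1.0, -0.5, 0.0, 0.5, 1.0] {
        let approx = horner(&coeffs, x);
        let exact = sigmoid(x);
        println!(
            "x = {:5.2}  poly = {:.5}  sigmoid = {:.5}  error = {:.5}",
            x,
            approx,
            exact,
            (approx - exact).abs()
        );
    }
}
```

The approximation is only accurate near zero, which is why inputs to a polynomial activation are normally scaled into the interval the polynomial was fitted on.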

SIMD Slot Packing

CKKS encodes multiple values into a single ciphertext by mapping a vector of complex numbers to one polynomial of degree below N, which yields N/2 slots. With our parameters (N=8192), each ciphertext provides 4,096 independent slots. Each slot holds one value. Operations on the ciphertext (addition, multiplication, rotation) apply to all slots simultaneously.

This means a single encrypted matrix-vector product with a 64-dimensional feature vector can be computed, using the standard diagonal method, with approximately 64 encrypted rotations, 64 multiplications by plaintext weight diagonals, and 64 additions, each operating on all 4,096 slots in parallel. The effective per-element throughput is orders of magnitude higher than evaluating one element at a time.
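
The rotation-based product can be sketched on plaintext vectors. The example below simulates the diagonal method on a 4x4 matrix: for each offset k, the input is rotated by k, multiplied slot-wise by the k-th generalized diagonal of the weight matrix, and accumulated. It is a plaintext model written for illustration, not the post's implementation; `rotate` and `matvec_diagonal` are hypothetical names.

```rust
/// Cyclic left rotation by `k` slots (stand-in for a homomorphic rotation).
fn rotate(slots: &[f64], k: usize) -> Vec<f64> {
    let n = slots.len();
    (0..n).map(|i| slots[(i + k) % n]).collect()
}

/// Diagonal-method matrix-vector product for a square matrix.
/// Uses n rotations, n slot-wise plaintext multiplications and n additions.
fn matvec_diagonal(m: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    let n = x.len();
    let mut acc = vec![0.0; n];
    for k in 0..n {
        let rotated = rotate(x, k);
        for i in 0..n {
            // k-th generalized diagonal: entry (i, (i + k) mod n).
            acc[i] += m[i][(i + k) % n] * rotated[i];
        }
    }
    acc
}

fn main() {
    let m = vec![
        vec![1.0, 2.0, 3.0, 4.0],
        vec![5.0, 6.0, 7.0, 8.0],
        vec![9.0, 10.0, 11.0, 12.0],
        vec![13.0, 14.0, 15.0, 16.0],
    ];
    let x = [1.0, 0.0, 2.0, 0.0];
    println!("{:?}", matvec_diagonal(&m, &x));
}
```

Because the weights are plaintext, the whole product costs a single multiplicative level regardless of the dimension.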

Depth Budget

CKKS uses a leveled approach: each multiplication consumes one level of a finite precision budget. After each multiplication, rescaling restores working precision by consuming the next modulus in the chain. After 4 multiplications, the depth budget is exhausted and bootstrapping is required to continue.
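
The depth budget behaves like a counter attached to each ciphertext. The following sketch models that bookkeeping on plain numbers: addition keeps the lower of the two levels, multiplication spends one, and a multiplication at level zero fails. `SimCiphertext` and `DepthError` are hypothetical types invented for this illustration; a real implementation tracks the remaining moduli rather than a bare integer.

```rust
/// Plaintext stand-in for a ciphertext that remembers its remaining levels.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SimCiphertext {
    value: f64,
    levels_left: u32,
}

#[derive(Debug, PartialEq)]
enum DepthError {
    BudgetExhausted,
}

impl SimCiphertext {
    fn fresh(value: f64, depth: u32) -> Self {
        SimCiphertext { value, levels_left: depth }
    }

    /// Addition consumes no depth; the result sits at the lower level.
    fn add(self, other: Self) -> Self {
        SimCiphertext {
            value: self.value + other.value,
            levels_left: self.levels_left.min(other.levels_left),
        }
    }

    /// Multiplication plus rescale consumes exactly one level.
    fn mul(self, other: Self) -> Result<Self, DepthError> {
        let level = self.levels_left.min(other.levels_left);
        if level == 0 {
            return Err(DepthError::BudgetExhausted);
        }
        Ok(SimCiphertext {
            value: self.value * other.value,
            levels_left: level - 1,
        })
    }
}

fn main() {
    let x = SimCiphertext::fresh(2.0, 4);
    let mut y = x.add(x); // still 4 levels left
    for _ in 0..4 {
        y = y.mul(x).expect("within depth budget");
    }
    println!("value = {}, levels left = {}", y.value, y.levels_left);
    // A fifth multiplication would need bootstrapping first.
    println!("fifth multiply: {:?}", y.mul(x));
}
```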

This is the fundamental tradeoff of leveled CKKS: more depth levels enable deeper computation but require larger polynomial degrees for the same security level. Our parameters sit at the HE Standard v1.1 boundary for 128-bit security.

Where CKKS Fits in the Stack

CKKS is not a standalone system. In H33's architecture, it is one of three FHE engines, each optimized for a different class of computation. The IQ routing engine automatically selects the appropriate engine based on the operation requested:

Engine	Best For	Measured Throughput
BFV (exact integer)	Biometric matching, inner products	2,209,429 auths/sec
CKKS (approximate real)	ML inference, statistics, scoring	1,574 TPS multiply pipeline · 141,216 TPS add
TFHE (Boolean gates)	Comparisons, thresholds, decisions	768 TPS 8-bit GT, 96 channels

These engines are not interchangeable. BFV cannot efficiently compute polynomial activations on real-valued features. CKKS cannot efficiently evaluate comparison operations (greater-than, less-than). TFHE can evaluate arbitrary Boolean circuits but operates on individual encrypted bits, not packed vectors. Each engine does what it does best, and the routing layer selects automatically.
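
As a rough illustration of what operation-based routing means, the sketch below maps a few operation kinds to the engine the table above recommends for them. The `Operation` and `Engine` enums and the `route` function are hypothetical; the post does not describe the IQ routing engine's actual interface or decision logic.

```rust
/// The three engines named in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Engine {
    Bfv,
    Ckks,
    Tfhe,
}

/// Hypothetical operation kinds, one or two per engine.
#[derive(Debug, Clone, Copy)]
enum Operation {
    ExactInnerProduct,
    PolynomialActivation,
    StatisticalAggregate,
    GreaterThan,
}

/// Pick the engine whose arithmetic model matches the operation.
fn route(op: Operation) -> Engine {
    match op {
        Operation::ExactInnerProduct => Engine::Bfv,
        Operation::PolynomialActivation | Operation::StatisticalAggregate => Engine::Ckks,
        Operation::GreaterThan => Engine::Tfhe,
    }
}

fn main() {
    let ops = [
        Operation::ExactInnerProduct,
        Operation::PolynomialActivation,
        Operation::StatisticalAggregate,
        Operation::GreaterThan,
    ];
    for &op in ops.iter() {
        println!("{:?} -> {:?}", op, route(op));
    }
}
```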

The Complete ML Privacy Pipeline

A typical encrypted ML inference pipeline flows through multiple engines:

Stage 1 — CKKS: Encrypted feature vector enters. Matrix-vector product with public model weights. Polynomial activation function (Chebyshev approximation of ReLU or sigmoid). Output: encrypted score vector.
Stage 2 — Scheme transition: CKKS approximate result is quantized to exact integer and re-encrypted under BFV. This happens in a secure conversion boundary — plaintext exists only in volatile memory and is immediately zeroized.
Stage 3 — BFV or TFHE: The quantized score is compared against a threshold. "Is the risk score above 0.7?" This is a non-polynomial operation — it requires a different engine optimized for comparisons.
Stage 4 — H33-74 attestation: The routing decision, the computation result, and the scheme transition are all committed to a 74-byte post-quantum attestation primitive signed under three independent signature families.
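
The quantization in Stage 2 can be pictured as fixed-point rounding. This is a minimal sketch under stated assumptions: the scale factor of 1000 and the function name `quantize` are invented for the example, and the comparison is done in the clear only to show that the threshold test becomes an exact integer comparison after quantization. In the pipeline described above, that comparison runs on encrypted data.

```rust
/// Round an approximate score to an exact integer at a fixed scale.
/// Returns None for NaN or infinite inputs.
fn quantize(score: f64, scale: u32) -> Option<i64> {
    if !score.is_finite() {
        return None;
    }
    Some((score * f64::from(scale)).round() as i64)
}

fn main() {
    let scale = 1000; // hypothetical: three decimal places survive
    let score = quantize(0.7304, scale).expect("finite score");
    let threshold = quantize(0.7, scale).expect("finite threshold");
    println!(
        "quantized score {} vs threshold {}: above = {}",
        score,
        threshold,
        score > threshold
    );
}
```

The scale bounds how much of the CKKS result's precision carries across the boundary, so it should be chosen no finer than the accuracy the approximate computation actually delivers.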

The IQ routing engine manages these transitions automatically. The developer submits an operation; the system determines the engine, manages scheme transitions, and attests the result. No manual engine selection. No scheme-switching code. One API.

Performance Characteristics

Encrypted Addition: 0.68ms

Addition is the cheapest CKKS operation. Two ciphertexts are added directly with no additional cryptographic overhead. No rescaling needed. The result has the same level and precision as the inputs.

At 0.68ms per addition with 4,096 slots, the effective per-element addition cost is 166 nanoseconds. At 96-core scale on Graviton4 metal, this yields 141,216 additions per second. Addition does not consume any multiplicative depth.

Encrypted Multiplication: 61ms (Full Pipeline)

The full multiply pipeline includes three stages: the polynomial multiplication itself, relinearization (key-switching to reduce the ciphertext back to standard form), and rescaling to restore precision. The 61ms figure measures all three stages end-to-end on Graviton4 metal — this is the honest cost of one complete CKKS multiplication.

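Relinearization dominates the pipeline cost. Multiplying two ciphertexts produces a three-component result, and relinearization key-switches it back to the standard two components, which requires additional polynomial products against the evaluation key. This cost profile is characteristic of CKKS implementations in general.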

With 192 vCPUs processing independent multiplications in parallel on the RNS-native pipeline: 1,574 full multiply operations per second. Each multiplication operates on all 4,096 slots simultaneously, giving an effective throughput of over 6.4 million encrypted element-multiplications per second.
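
The latency and throughput figures can be cross-checked with a few lines of arithmetic. The sketch below applies Little's law (operations in flight = throughput x latency) to the reported numbers and recomputes the per-slot figures. The helper names are illustrative. How the resulting concurrency maps onto the instance's vCPUs is not something the figures alone determine.

```rust
/// Little's law: average operations in flight = throughput x latency.
fn implied_concurrency(ops_per_sec: f64, latency_ms: f64) -> f64 {
    ops_per_sec * latency_ms / 1000.0
}

/// Slot-level throughput when every ciphertext operation covers `slots` values.
fn slot_ops_per_sec(ct_ops_per_sec: u64, slots: u64) -> u64 {
    ct_ops_per_sec * slots
}

/// Amortized cost per slot, in nanoseconds, of one ciphertext-wide operation.
fn per_slot_nanos(latency_ms: f64, slots: u64) -> f64 {
    latency_ms * 1_000_000.0 / slots as f64
}

fn main() {
    let slots = 4096;
    println!("add: {:.1} ops in flight", implied_concurrency(141_216.0, 0.68));
    println!("multiply: {:.1} ops in flight", implied_concurrency(1_574.0, 61.0));
    println!("multiply slot throughput: {} per second", slot_ops_per_sec(1_574, slots));
    println!("add cost per slot: {:.0} ns", per_slot_nanos(0.68, slots));
}
```

Both the addition and the multiplication figures correspond to roughly 96 operations in flight at the stated latencies.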

Security Parameters

The implementation uses parameters that satisfy the Homomorphic Encryption Standard v1.1 for 128-bit security:

Parameter	Value	Purpose
Polynomial degree (N)	8,192	Ring dimension for security
Modulus chain depth	5 levels	4 multiplications before bootstrap
Precision	~12 decimal digits	Per operation working precision
Security level	128-bit	HE Standard v1.1 compliant
Multiplicative depth	4	Sequential multiplications before bootstrap
Slots	4,096 complex / 8,192 real	SIMD parallel values
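
Two rows of the table follow directly from two others: the slot count is half the polynomial degree, and the multiplicative depth is one less than the number of chain levels. A minimal sketch of that relationship is shown here. The `CkksParams` struct is hypothetical and holds only the values listed above; it does not reflect the implementation's actual configuration type.

```rust
/// Hypothetical parameter record holding the two independent table values.
#[derive(Debug, Clone, Copy)]
struct CkksParams {
    ring_degree: u32,  // polynomial degree N
    chain_levels: u32, // moduli in the chain
}

impl CkksParams {
    /// CKKS packs N/2 complex values per ciphertext.
    fn complex_slots(&self) -> u32 {
        self.ring_degree / 2
    }

    /// Each multiplication-plus-rescale drops one modulus; the last one stays.
    fn mult_depth(&self) -> u32 {
        self.chain_levels.saturating_sub(1)
    }
}

fn main() {
    let params = CkksParams { ring_degree: 8192, chain_levels: 5 };
    println!(
        "N = {}, slots = {}, depth = {}",
        params.ring_degree,
        params.complex_slots(),
        params.mult_depth()
    );
}
```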

The security guarantee rests on the Ring Learning With Errors (RLWE) problem. No known classical or quantum algorithm can break RLWE at these parameters in polynomial time. The scheme is lattice-based — the same mathematical foundation as the NIST-standardized ML-KEM and ML-DSA post-quantum algorithms.

Post-Quantum Attestation

Every CKKS computation in the H33 stack produces a post-quantum attestation. The computation result, the routing decision that selected CKKS as the engine, and the authorization metadata are committed to a 74-byte H33-74 primitive signed under three independent post-quantum signature families: ML-DSA-65 (lattice-based), FALCON-512 (NTRU-based), and SLH-DSA-SHA2-128f (hash-based).

The attestation cost is negligible: less than 1 millisecond for commitment construction and triple signing. Against a 61ms full multiply pipeline, attestation adds less than 2% overhead. The result is a quantum-resistant proof that the computation was performed correctly, on the claimed data, by an authorized party, using the specified engine and parameters.

What CKKS Cannot Do

CKKS is powerful for continuous computation but has fundamental limitations that the multi-engine architecture addresses:

No exact arithmetic. CKKS is approximate. If you need exact integer matching (biometric template comparison, hash verification), use BFV.
No comparison operations. "Is X greater than Y?" is a non-polynomial operation. CKKS can approximate it with polynomial approximation, but the precision is limited. For reliable encrypted comparisons, use TFHE.
Limited depth without bootstrapping. At depth 4, a Horner-style evaluation, which spends one level per degree, reaches degree-4 polynomials. Deeper computation requires bootstrapping, which adds ~100ms latency per refresh on current hardware.
Precision degrades per multiplication. Each multiplication costs ~10 bits of precision after rescaling. After 4 multiplications, you have ~40 bits remaining — still 12 decimal digits, but not infinite.
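
The conversion between bits and decimal digits used in the last point is a one-line calculation: digits = bits x log10(2). The sketch below only restates that arithmetic; the function name is illustrative.

```rust
use std::f64::consts::LOG10_2;

/// Convert a precision expressed in bits to equivalent decimal digits.
fn decimal_digits(bits: f64) -> f64 {
    bits * LOG10_2
}

fn main() {
    println!("40 bits is about {:.2} decimal digits", decimal_digits(40.0));
    println!("10 bits is about {:.2} decimal digits", decimal_digits(10.0));
}
```

By the same conversion, the ~10 bits spent per multiplication correspond to roughly three decimal digits.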

These limitations are not weaknesses — they are the design boundaries that make CKKS fast. Exact arithmetic, comparison, and unlimited depth would require a different scheme (BFV or TFHE), and those schemes cannot do efficient real-number computation. The three-engine architecture exists precisely because no single FHE scheme does everything well.

Deployment

The CKKS implementation runs on the same Graviton4 infrastructure as the rest of the H33 FHE stack:

Hardware: AWS c8g.metal-48xl (Graviton4, 192 vCPUs, ARM Neoverse V2)
Runtime: Rust, system allocator, no external FHE library dependencies
Transform engine: Shared with BFV pipeline, optimized for ARM
Key material: Pre-generated and resident in memory. No per-operation key generation.
Attestation: Integrated H33-74 pipeline. Every result attested with three PQ signature families.

The CKKS context initialization (key generation and precomputation) takes approximately 4 seconds. This is a one-time startup cost, amortized across all subsequent operations. Key material persists for the session lifetime.

The Three-Engine Advantage

Most FHE deployments use a single scheme and accept its limitations. H33 runs three schemes simultaneously with automatic routing between them. The practical consequence: an application can encrypt a feature vector, run ML inference (CKKS), quantize the result to an integer (scheme transition), compare against a threshold (TFHE), and attest the entire chain (H33-74) — in a single API call.

No other production system provides this. Individual FHE libraries (SEAL, OpenFHE, Lattigo) each implement several schemes, but they leave routing, attestation, and the orchestration of scheme transitions to the application developer. H33 integrates these at the infrastructure level, so the developer writes "compare score against threshold" and the system handles CKKS → BFV → TFHE → attestation automatically.

The complete stack, measured: BFV at 2,209,429 auths/sec for exact matching. CKKS at 141,216 adds/sec and 1,574 TPS full multiply pipeline for approximate computation. TFHE at 768 TPS (8-bit GT, 96 channels) for encrypted decisions. All post-quantum attested. All on the same Graviton4 metal instance. All measured, not projected.

CKKS scales when it runs alongside engines that cover its blind spots. Approximate arithmetic is powerful for ML inference, statistical aggregation, and risk scoring — but a production privacy system also needs exact matching and encrypted decisions. The three-engine architecture delivers all three, with automatic routing and post-quantum attestation on every result.

The numbers are real. The stack is production. The attestation is quantum-resistant.

Eric Beans

CEO, H33.ai, Inc.

Patent pending. U.S. Patent Application Nos. 19/309,560 and 19/645,499. Additional applications pending.
H33-74 is a trademark of H33.ai, Inc. AWS and Graviton4 are trademarks of Amazon Web Services, Inc.