
CKKS That Scales: Approximate FHE in Production

How approximate arithmetic over encrypted data enables machine learning inference at production throughput without ever decrypting inputs.

Machine learning models work with approximations. Every weight in a neural network is a floating-point number. Every activation function produces a real-valued output. Every gradient descent step adjusts parameters by fractional amounts. This fundamental characteristic of modern AI creates a remarkable opportunity for a specific class of fully homomorphic encryption: CKKS, the scheme built from the ground up for approximate arithmetic on encrypted data.

Most organizations exploring encrypted computation start with exact schemes. BFV and BGV operate on integers, producing bit-perfect results. For many applications, exact arithmetic is exactly what you need. But when the workload is machine learning inference, exact arithmetic becomes an impedance mismatch. You are forcing integer precision onto computations that never needed it, paying a performance penalty for guarantees the application does not require.

CKKS takes a different approach. It encodes vectors of real (or complex) numbers into a plaintext polynomial at a fixed scaling factor, treating small rounding errors as acceptable noise rather than catastrophic failure. This design decision aligns with how neural networks already operate. A model that classifies images does not care whether an intermediate activation is 0.73291 or 0.73290. The classification outcome is identical. CKKS exploits this tolerance to deliver encrypted computation that is faster and more memory-efficient than forcing real-valued workloads through integer schemes.

The Mechanics of Approximate Encryption

Understanding why CKKS matters for production ML requires understanding how it handles noise differently from exact schemes. In BFV, the noise budget is a hard constraint. Exceed it, and decryption fails entirely, returning garbage. Every homomorphic operation consumes noise budget, and short of bootstrapping, the only way to get more is to use larger parameters, which means larger ciphertexts and slower operations.

CKKS reframes noise as precision loss rather than correctness failure. When you perform a homomorphic multiplication in CKKS, the result is correct to a certain number of significant digits. The noise manifests as rounding in the least significant bits, exactly the same behavior you get from floating-point arithmetic on plaintext data. This means CKKS can use smaller parameters for equivalent computational depth, because the noise tolerance is built into the scheme's semantics rather than being an external constraint.

The rescaling operation is central to making this work. After each multiplication, CKKS rescales the ciphertext by dividing out a scaling factor, simultaneously reducing the ciphertext modulus and the noise. This is analogous to how floating-point multiplication works: you multiply the significands and add the exponents, then renormalize. The result is that CKKS maintains consistent precision throughout a deep computation chain, rather than accumulating unbounded noise.
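The scale bookkeeping behind rescaling can be sketched on plain integers. This is a minimal illustration with no encryption involved; the 40-bit scaling factor and the helper names are assumptions chosen for the example, not values taken from any particular deployment.

```python
SCALE_BITS = 40            # assumed scaling-factor size, for illustration only
DELTA = 1 << SCALE_BITS    # scaling factor applied when encoding


def encode(x: float) -> int:
    """Fixed-point encode: a real number becomes an integer at scale DELTA."""
    return round(x * DELTA)


def decode(value: int, scale: int) -> float:
    """Undo the encoding at whatever scale the value currently carries."""
    return value / scale


def multiply(a: int, b: int) -> tuple[int, int]:
    """Multiplying two scale-DELTA values yields a value at scale DELTA**2."""
    return a * b, DELTA * DELTA


def rescale(value: int, scale: int) -> tuple[int, int]:
    """Divide by DELTA with rounding, bringing the scale back down to DELTA.

    In real CKKS this step also drops one prime from the ciphertext modulus,
    which is why each multiplication consumes a level.
    """
    return (value + DELTA // 2) // DELTA, scale // DELTA


x, y = encode(1.5), encode(2.25)
product, product_scale = multiply(x, y)
result, result_scale = rescale(product, product_scale)
print(decode(result, result_scale))
```

Without the rescale step the scale would square at every multiplication, and the encoded value would outgrow the modulus after only a few layers.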

SIMD Batching: 4096 Inferences per Ciphertext

Raw operation latency is only half the performance story. The other half is throughput, and this is where CKKS truly shines for production workloads. Through the canonical embedding of the cyclotomic ring (the complex-valued counterpart of the Chinese Remainder Theorem batching used in BFV and BGV), a CKKS ciphertext over a ring of dimension N can encode up to N/2 independent complex values in separate SIMD slots. In the configuration described here, that is 4096 slots. Each homomorphic operation applies simultaneously to all 4096 slots, turning a single encrypted multiply into 4096 parallel multiplications.
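Slot packing can be illustrated on plaintext polynomials alone. The sketch below assumes the standard CKKS encoder, which evaluates a polynomial modulo X^N + 1 at one root from each conjugate pair of roots. It uses a toy ring dimension of 8 (four slots) and performs no encryption. All names and sizes are chosen for illustration.

```python
import cmath

N = 8                      # toy ring dimension; production rings are far larger
SLOTS = N // 2
DELTA = 1 << 30            # assumed scaling factor
ZETA = cmath.exp(1j * cmath.pi / N)                   # primitive 2N-th root of unity
ROOTS = [ZETA ** (2 * k + 1) for k in range(SLOTS)]   # one root per conjugate pair


def encode(values):
    """Map SLOTS complex values to N integer coefficients at scale DELTA."""
    coeffs = []
    for i in range(N):
        # Inverse embedding; conjugate symmetry makes every coefficient real.
        a_i = (2.0 / N) * sum(
            (z * root.conjugate() ** i).real for z, root in zip(values, ROOTS)
        )
        coeffs.append(round(a_i * DELTA))
    return coeffs


def decode(coeffs, scale):
    """Evaluate the polynomial at each slot root and remove the scale."""
    return [sum(c * root ** i for i, c in enumerate(coeffs)) / scale for root in ROOTS]


def negacyclic_mul(a, b):
    """Schoolbook polynomial multiplication modulo X^N + 1."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < N:
                out[i + j] += ai * bj
            else:
                out[i + j - N] -= ai * bj   # X^N wraps around to -1
    return out


z1 = [1 + 2j, 3, -0.5, 0.25j]
z2 = [2, 1j, 4, 1 - 1j]
slot_products = decode(negacyclic_mul(encode(z1), encode(z2)), DELTA * DELTA)
print(slot_products)
```

One polynomial multiplication yields the product of every slot pair at once, up to rounding error. This is the property that homomorphic operations inherit when the polynomials are encrypted.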

For ML inference, this means a single API call can process 4096 independent input samples through the same model simultaneously. A healthcare organization encrypting patient feature vectors can batch thousands of diagnostic predictions into a single ciphertext operation sequence. A financial institution scoring credit applications can evaluate 4096 applicants with the same sequence of homomorphic operations it would use for one, so the per-sample cost falls roughly in proportion to how full the batch is.

H33 exposes this capability through a batching API that automatically handles the encoding, slot management, and result extraction. Developers submit inference requests, and the system packs them into ciphertext slots, executes the encrypted model, and returns individual encrypted results. The entire process happens without the developer needing to understand polynomial ring arithmetic or slot rotation.
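One common packing layout puts each feature in its own slot vector, with slot j holding sample j's value for that feature. The sketch below works on plaintext lists to show the idea. The function names are hypothetical and are not the H33 API, whose interface is not described here.

```python
SLOT_COUNT = 4096


def pack_batch(samples, slot_count=SLOT_COUNT):
    """Return one zero-padded slot vector per feature (hypothetical helper)."""
    if not samples:
        raise ValueError("no samples to pack")
    if len(samples) > slot_count:
        raise ValueError("batch exceeds the number of slots")
    width = len(samples[0])
    if any(len(s) != width for s in samples):
        raise ValueError("all samples must have the same number of features")
    padding = [0.0] * (slot_count - len(samples))
    return [[s[f] for s in samples] + padding for f in range(width)]


def slotwise_linear(packed, weights, bias):
    """Apply one linear model to every slot; stands in for encrypted arithmetic."""
    slot_count = len(packed[0])
    return [
        bias + sum(w * feature[j] for w, feature in zip(weights, packed))
        for j in range(slot_count)
    ]


def unpack_results(slots, count):
    """Keep only the slots that correspond to real samples."""
    return list(slots[:count])


samples = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
packed = pack_batch(samples)
scores = unpack_results(slotwise_linear(packed, [0.5, -1.0], 0.25), len(samples))
print(scores)
```

With this layout a dot product needs no slot rotations, because each weight multiplies a whole slot vector. The cost is one ciphertext per feature.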

Depth Management for Deep Networks

Neural network inference requires sequential layers of computation. Each layer typically involves matrix multiplication (consuming one multiplicative level) followed by an activation function (consuming additional levels depending on the polynomial approximation). A ten-layer network might require fifteen or more multiplicative levels, and each level demands larger encryption parameters.
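The level count can be estimated with a few lines of arithmetic. This sketch assumes one level per linear layer and roughly ceil(log2(d)) levels for a degree-d polynomial activation. Real evaluators may spend an extra level on coefficient multiplication, so treat the result as a lower bound.

```python
def activation_depth(poly_degree: int) -> int:
    """Levels needed to evaluate a degree-d polynomial: ceil(log2(d)), 0 if d <= 1."""
    return max(poly_degree - 1, 0).bit_length()


def network_depth(activation_degrees) -> int:
    """Total levels for a stack of layers.

    Each entry is the degree of the polynomial activation following a linear
    layer; use 0 for a layer with no nonlinear activation.
    """
    return sum(1 + activation_depth(d) for d in activation_degrees)


# Ten layers: nine hidden layers with a degree-2 activation, then a linear output.
ten_layer = [2] * 9 + [0]
print(network_depth(ten_layer))
```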

This is the fundamental tension in encrypted ML: deeper models require larger parameters, which increase latency and memory consumption. CKKS helps by making each level cheaper than the equivalent BFV level, but the depth-parameter tradeoff remains the primary engineering challenge.

H33 addresses this through automatic parameter selection. Given a computation graph representing the ML model, the system determines the minimum multiplicative depth required, selects the smallest CKKS parameters that support that depth with adequate precision, and generates the corresponding encryption keys. This eliminates the most error-prone step in FHE development, where incorrect parameter selection leads to either decryption failure or unnecessary performance overhead.
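A simplified version of this selection step is sketched below. It uses the 128-bit classical security bounds for ternary secrets tabulated in the HomomorphicEncryption.org security standard. The prime sizes (a 60-bit first prime, 40-bit rescaling primes, a 60-bit key-switching prime) are common defaults assumed for illustration. H33's actual selection logic is not described in this article and will account for more than this does.

```python
# Maximum total coefficient-modulus bits per ring dimension at 128-bit
# classical security (ternary secret), per the HomomorphicEncryption.org standard.
MAX_LOG_Q = {1024: 27, 2048: 54, 4096: 109, 8192: 218, 16384: 438, 32768: 881}


def select_parameters(depth, scale_bits=40, first_prime_bits=60, special_prime_bits=60):
    """Pick the smallest ring dimension whose modulus budget covers the depth."""
    total_bits = first_prime_bits + depth * scale_bits + special_prime_bits
    for poly_degree in sorted(MAX_LOG_Q):
        if total_bits <= MAX_LOG_Q[poly_degree]:
            return {"poly_degree": poly_degree, "log_q": total_bits}
    raise ValueError(
        f"depth {depth} needs {total_bits} modulus bits; "
        "reduce the depth or use bootstrapping"
    )


for depth in (2, 8, 15):
    print(depth, select_parameters(depth))
```

The stepwise jumps in ring dimension are why trimming even one or two levels from a model can halve ciphertext size and per-operation cost.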

For activation functions, H33 uses polynomial approximations tailored to the precision requirements of each layer. ReLU, the most common activation function, is not a polynomial, so it must be approximated. Low-degree polynomial approximations are faster but less accurate. Higher-degree approximations consume more multiplicative depth but preserve model accuracy. The system automatically selects the approximation degree based on the remaining multiplicative levels and the model's sensitivity analysis.
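The degree-versus-accuracy trade-off can be measured directly. The sketch below fits Chebyshev approximations to ReLU on the interval [-1, 1] and reports the worst-case error next to a rough depth estimate. Chebyshev fitting is one standard choice, used here as an assumption. The article does not say which approximation method H33 uses.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def chebyshev_coeffs(f, degree, nodes=64):
    """Chebyshev series coefficients of f on [-1, 1], from samples at Chebyshev nodes."""
    coeffs = []
    for j in range(degree + 1):
        total = sum(
            f(math.cos(math.pi * (k + 0.5) / nodes))
            * math.cos(j * math.pi * (k + 0.5) / nodes)
            for k in range(nodes)
        )
        coeffs.append(2.0 * total / nodes)
    coeffs[0] /= 2.0
    return coeffs


def chebyshev_eval(coeffs, x):
    """Evaluate a Chebyshev series with the Clenshaw recurrence."""
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = 2.0 * x * b1 - b2 + c, b1
    return x * b1 - b2 + coeffs[0]


def max_error(f, degree, samples=2001):
    """Worst-case absolute error of the degree-d fit on a uniform grid."""
    coeffs = chebyshev_coeffs(f, degree)
    grid = (-1.0 + 2.0 * i / (samples - 1) for i in range(samples))
    return max(abs(f(x) - chebyshev_eval(coeffs, x)) for x in grid)


def multiplicative_depth(degree: int) -> int:
    """Rough lower bound on levels: ceil(log2(degree))."""
    return max(degree - 1, 0).bit_length()


for degree in (2, 4, 8):
    print(degree, multiplicative_depth(degree), round(max_error(relu, degree), 4))
```

Because ReLU has a kink at zero, the error shrinks only slowly as the degree grows, so each extra level of depth buys a modest accuracy gain.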

From Research to Production: What Changed

Academic CKKS implementations have existed for years. The Microsoft SEAL library, the PALISADE framework (since succeeded by OpenFHE), and various other open-source implementations provide correct CKKS functionality. So what makes production CKKS different from research CKKS?

The answer is everything that surrounds the core arithmetic. Production systems must handle key management, ciphertext serialization, network transport, error recovery, monitoring, and multi-tenancy. They must operate on specific hardware with predictable performance characteristics. They must integrate with existing infrastructure rather than running as standalone benchmarks.

H33 runs CKKS on AWS Graviton4 processors, which are Arm-based, leveraging the wide SIMD units and high memory bandwidth of the platform. The polynomial arithmetic that underlies CKKS is inherently parallelizable, and Graviton4 provides massive parallel execution capacity. Combined with 4096-slot SIMD batching at the CKKS level, the system achieves throughput that makes encrypted ML inference practical for real workloads.

Key management is equally critical. CKKS requires evaluation keys for operations like rotation (needed for matrix-vector products in linear layers) and relinearization (needed after every multiplication to keep ciphertexts compact). These keys are large, often hundreds of megabytes for deep computation chains. H33 manages key lifecycle, storage, and distribution as infrastructure concerns, invisible to the application developer.
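To see why rotation keys matter for linear layers, consider the diagonal method for matrix-vector products, a widely used technique in which each step multiplies one matrix diagonal by a rotated copy of the input. The sketch below runs on plaintext lists. In an encrypted setting every rotate call would need its own rotation key, which is where the key volume comes from. Whether H33 uses this exact method is not stated in the article.

```python
def rotate(v, k):
    """Cyclic left rotation of the slot vector by k positions."""
    k %= len(v)
    return v[k:] + v[:k]


def diagonal(matrix, i):
    """The i-th generalized diagonal: entry j is matrix[j][(j + i) mod n]."""
    n = len(matrix)
    return [matrix[j][(j + i) % n] for j in range(n)]


def matvec_by_rotations(matrix, v):
    """Compute matrix @ v using only slotwise multiply, add, and rotate."""
    n = len(v)
    acc = [0.0] * n
    for i in range(n):
        d = diagonal(matrix, i)     # plaintext diagonal, known to the server
        r = rotate(v, i)            # needs a rotation key for step i when encrypted
        acc = [a + dj * rj for a, dj, rj in zip(acc, d, r)]
    return acc


matrix = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
vector = [1, 0, -1, 2]
print(matvec_by_rotations(matrix, vector))
```

An n-by-n layer evaluated this way uses up to n - 1 distinct rotation steps. Production systems often cut that number with baby-step giant-step variants.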

Attestation: The Missing Piece

Encrypting ML inference solves the confidentiality problem. The cloud never sees the input data, and the model owner never sees the inference results in plaintext. But encryption alone does not answer a critical question: how do you know the computation was performed correctly?

A malicious or compromised server could return random encrypted values instead of actual inference results. It could run a cheaper, less accurate model and claim it ran the expensive one. Without verification, the client has no way to distinguish correct encrypted results from fabricated ones.

This is why H33 pairs CKKS computation with post-quantum attestation. Every encrypted inference generates a cryptographic attestation that the specific computation was performed on the specific ciphertext. The attestation is bound to the input, the computation graph, and the output through a chain of post-quantum signatures. The client can verify this attestation independently, confirming that their encrypted data was processed as expected.
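The binding idea can be illustrated with a plain hash construction. This is a hypothetical sketch, not the H33 attestation format: it only shows how a single digest can commit to the input ciphertext, the computation graph, and the output ciphertext together. The post-quantum signature over that digest is deliberately left out, since its scheme and encoding are not described here.

```python
import hashlib


def digest(data: bytes) -> bytes:
    """SHA3-256 digest of one component."""
    return hashlib.sha3_256(data).digest()


def binding_digest(input_ct: bytes, graph: bytes, output_ct: bytes) -> bytes:
    """Commit to (input, graph, output) as one value.

    Each component is hashed separately and tagged with a fixed label so the
    three roles cannot be swapped. A real attestation would sign this digest.
    """
    h = hashlib.sha3_256()
    for label, blob in ((b"input", input_ct), (b"graph", graph), (b"output", output_ct)):
        h.update(label)
        h.update(digest(blob))
    return h.digest()


# Placeholder byte strings standing in for serialized ciphertexts and a graph.
honest = binding_digest(b"ciphertext-in", b"model-graph-v1", b"ciphertext-out")
tampered = binding_digest(b"ciphertext-in", b"model-graph-v1", b"ciphertext-forged")
print(honest.hex())
print(honest != tampered)
```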

The attestation uses three independent hardness assumptions, ensuring that the verification remains trustworthy even if one cryptographic family is compromised. This is not an academic concern. Lattice problems are believed to resist quantum attacks, but that belief rests on the absence of known algorithms rather than on proof, and the CKKS scheme itself is lattice-based. If lattice cryptography falls, you want your attestation layer to survive independently. H33-74 distills the full attestation into just 74 bytes, making verification lightweight enough to embed in any response.

Real-World CKKS Workloads

The most immediate production use case for CKKS is privacy-preserving inference on sensitive data. Consider a hospital system that wants to use a cloud-hosted diagnostic model without sharing patient data. The hospital encrypts patient feature vectors using CKKS, sends the ciphertexts to the cloud, and receives encrypted predictions. The cloud never sees the patient data. The hospital decrypts the predictions locally.

Financial services present another compelling case. Credit scoring models evaluate applicants based on sensitive financial history. With CKKS, a credit bureau can offer model-as-a-service where lenders submit encrypted applicant features and receive encrypted scores. Neither party sees the other's sensitive data: the lender's customer information stays encrypted, and the bureau's model weights remain private.

Fraud detection benefits from CKKS in consortium settings. Multiple banks can contribute encrypted transaction features to a shared fraud detection model. Each bank encrypts its customer transactions, the model evaluates all encrypted inputs simultaneously using SIMD batching, and each bank receives only its own encrypted results. No bank ever sees another bank's customer data, yet all benefit from the collective intelligence of the shared model.

Advertising and recommendation systems can use CKKS to personalize without profiling. A user's preference vector can be encrypted locally and matched against encrypted item embeddings on the server. The server performs the similarity computation on encrypted vectors and returns encrypted recommendations. The server never learns the user's preferences, and the user never sees the raw item database.

Performance Characteristics in Production

Production CKKS performance depends on three primary factors: polynomial degree, coefficient modulus size, and the specific operations in the computation graph. Larger polynomial degrees support deeper computations but increase per-operation latency. Larger coefficient moduli provide more precision but require more memory and bandwidth.

For typical ML inference workloads with ten to fifteen multiplicative levels, H33's CKKS implementation on Graviton4 delivers throughput that makes encrypted inference practical for batch processing. The 4096 SIMD slots mean that even if a single inference chain takes hundreds of milliseconds, the amortized per-sample cost is a fraction of a millisecond. Batch 4096 medical diagnostic evaluations, and the per-patient cost becomes negligible.
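The amortization is simple division, shown here with an illustrative chain latency of 400 ms. That figure is an assumption picked from the "hundreds of milliseconds" range, not a measured H33 result.

```python
def amortized_ms(chain_latency_ms: float, slots: int = 4096) -> float:
    """Per-sample cost when every slot carries a distinct input."""
    return chain_latency_ms / slots


def samples_per_second(chain_latency_ms: float, slots: int = 4096) -> float:
    """Throughput of one fully packed ciphertext pipeline."""
    return slots / (chain_latency_ms / 1000.0)


latency_ms = 400.0  # assumed for illustration
print(amortized_ms(latency_ms), samples_per_second(latency_ms))
```

The benefit depends on filling the batch: a ciphertext carrying a single sample still pays the full chain latency.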

Memory management is a significant engineering challenge at scale. A deep CKKS computation chain requires holding multiple ciphertexts in memory simultaneously, plus evaluation keys that can reach hundreds of megabytes. H33 uses a streaming computation model that minimizes peak memory usage by processing the computation graph layer by layer, releasing intermediate ciphertexts as soon as they are no longer needed.
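The effect of releasing intermediates early can be sketched with reference counting over a computation graph. The example below only counts live values instead of holding real ciphertexts. The graph representation is a hypothetical one chosen for the illustration.

```python
def peak_live_values(graph, order):
    """Peak number of values held at once when each is freed after its last use.

    graph maps a node name to the list of nodes it consumes; order is a
    topological execution order over the same nodes.
    """
    remaining_uses = {node: 0 for node in graph}
    for inputs in graph.values():
        for source in inputs:
            remaining_uses[source] += 1

    live = set()
    peak = 0
    for node in order:
        live.add(node)                      # the node's output now occupies memory
        peak = max(peak, len(live))
        for source in graph[node]:
            remaining_uses[source] -= 1
            if remaining_uses[source] == 0:
                live.discard(source)        # last consumer done: release it
    return peak


# A four-stage chain, then the same chain with a skip connection from the input.
chain = {"x": [], "l1": ["x"], "l2": ["l1"], "l3": ["l2"]}
skip = {"x": [], "l1": ["x"], "l2": ["l1"], "l3": ["l2", "x"]}
order = ["x", "l1", "l2", "l3"]
print(peak_live_values(chain, order), peak_live_values(skip, order))
```

For a plain chain the peak stays at two values regardless of depth. Skip connections raise it because the earlier value must be kept until its last consumer runs.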

CKKS Security Considerations

The approximate nature of CKKS introduces a security subtlety that does not exist in exact schemes. In 2020, Li and Micciancio demonstrated that the approximate decryption itself can leak information about the secret key if the decrypted values are shared carelessly. The noise in decrypted CKKS results is correlated with the secret key, and an adversary who sees ciphertexts alongside their noisy decryptions can potentially recover the key.
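The core of the leak is linear algebra, which a scalar toy makes visible. The sketch below replaces the polynomial ring with integers modulo a prime and uses made-up values throughout. It is a drastic simplification of real CKKS, meant only to show why publishing the exact noisy decryption next to the ciphertext is dangerous.

```python
Q = (1 << 61) - 1  # toy prime modulus; real CKKS works over polynomial rings


def encrypt(message: int, secret: int, mask: int, noise: int):
    """Toy LWE-style encryption: c0 + c1 * secret = message + noise (mod Q)."""
    c1 = mask % Q
    c0 = (message + noise - mask * secret) % Q
    return c0, c1


def decrypt(ciphertext, secret: int) -> int:
    """Approximate decryption: returns message + noise, not the exact message."""
    c0, c1 = ciphertext
    return (c0 + c1 * secret) % Q


def recover_key(ciphertext, approx_plain: int) -> int:
    """Solve c0 + c1 * s = approx_plain for s, using only public values."""
    c0, c1 = ciphertext
    return (approx_plain - c0) * pow(c1, -1, Q) % Q


secret = 123456789012345               # illustrative secret
ct = encrypt(message=3373256577, secret=secret, mask=987654321098765432, noise=12345)
published = decrypt(ct, secret)        # the noisy result, shared as-is
print(recover_key(ct, published) == secret)
```

The noise that protects the ciphertext stops protecting it once the same noise is disclosed inside the decrypted value.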

H33 mitigates this through several measures. First, decrypted results are rounded to the application's required precision, eliminating the low-order noise bits that carry key-correlated information. Second, the system enforces minimum noise flooding during encryption, ensuring that the noise added during encryption dominates any key-correlated signal. Third, the system tracks how many decryptions have been performed under each key and enforces rotation before the information leakage threshold is reached.
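The first and third measures can be pictured as a small gate in front of every released value. The class below is a hypothetical sketch: the names, the precision, and the decryption limit are invented for illustration. It makes no claim about which thresholds are sufficient for security.

```python
class KeyRotationRequired(Exception):
    """Raised when a key has served its allotted number of decryptions."""


class DecryptionGuard:
    """Rounds released values and counts decryptions per key (illustrative)."""

    def __init__(self, max_decryptions: int, precision_bits: int):
        self.max_decryptions = max_decryptions
        self.step = 2.0 ** -precision_bits   # coarsest difference the caller may see
        self.count = 0

    def release(self, decrypted_value: float) -> float:
        """Return the value rounded to the allowed precision, or demand rotation."""
        if self.count >= self.max_decryptions:
            raise KeyRotationRequired("decryption limit reached for this key")
        self.count += 1
        return round(decrypted_value / self.step) * self.step


guard = DecryptionGuard(max_decryptions=2, precision_bits=8)
print(guard.release(0.732914))
print(guard.release(0.732903))
```

Both sample values round to the same released number, so the low-order digits that differ between them never leave the system.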

These mitigations are invisible to the application developer but critical for production security. An implementation that ignores the Li-Micciancio attack is not suitable for deployment, regardless of how fast it computes.

The Road Ahead for Encrypted AI

CKKS for ML inference is moving from batch processing toward interactive latency. Current systems excel at processing thousands of inputs simultaneously but struggle with single-input latency-sensitive applications. The path forward involves hardware acceleration, algorithmic improvements in bootstrapping efficiency, and compiler optimizations that minimize the multiplicative depth of neural network representations.

H33 is investing in all three directions. Arm's Scalable Vector Extension provides wider SIMD lanes for polynomial arithmetic. Bootstrapping innovations reduce the cost of refreshing exhausted ciphertext levels mid-computation, enabling arbitrarily deep networks. And the H33-Compile system automatically transforms neural network architectures into FHE-optimized computation graphs, finding equivalent representations that require fewer multiplicative levels.

The goal is straightforward: make encrypted ML inference as accessible as plaintext inference. Not as fast, necessarily, since encryption has inherent overhead, but as accessible. A developer should be able to take a trained model, submit it to H33, and receive an encrypted inference endpoint that works identically to the plaintext version except that it never sees the input data. We are not there yet, but CKKS is the scheme that makes it possible, and we are getting closer every month.

The convergence of approximate encryption with approximate computation is not a coincidence. It is the reason CKKS will become the dominant FHE scheme for AI workloads. When the encryption scheme matches the computation paradigm, everything gets simpler, faster, and more practical. That is what scaling looks like.

Run Encrypted ML Inference Today

Deploy CKKS-powered inference on H33 with 4096 SIMD slots and post-quantum attestation.
