There is a question that every CTO building with third-party AI models must eventually confront: who else can see the data you send for inference? The honest answer, in nearly every production deployment today, is uncomfortable. The cloud provider can see it. The GPU memory can be dumped. The model operator can log it. Side-channel attacks can extract it. And once your data leaves your encryption boundary, you are trusting policy, contracts, and good intentions rather than mathematics.
This post is not an introduction to Fully Homomorphic Encryption. H33 has published those. This is the architecture document. This is for the CTO or principal engineer who has decided that "trust the cloud provider" is not a security posture and wants to understand, concretely, how inference works when the model never sees plaintext at any stage of the pipeline.
The Threat Model You Are Ignoring
Before we discuss the architecture, we need to be precise about what we are defending against. Most AI security conversations focus on model theft or prompt injection. Those matter, but they are not the existential risk. The existential risk is data exposure during inference.
Cloud Provider Access
When you send patient records, financial data, or biometric templates to a cloud-hosted model, that data exists in plaintext in the provider's memory. Every major cloud provider has internal access controls, yes. But those controls are organizational, not mathematical. A compromised employee, a government subpoena, or a misconfigured access policy can expose everything. The shared responsibility model means the provider is responsible for the infrastructure, and you are responsible for the data. But if the data must be decrypted for the infrastructure to process it, that boundary collapses.
GPU Memory Dumps
Modern GPU inference loads input tensors into device memory in plaintext. A memory dump of the GPU at any point during inference captures your data. This is not theoretical. Researchers have demonstrated extraction of training data from GPU memory on shared cloud instances. The same technique applies to inference inputs. If your model runs on a shared GPU instance, and most cloud AI does, your data coexists in the same physical memory space as other tenants' data, separated only by software isolation.
Model Inversion and Training Data Extraction
Model inversion attacks reconstruct input data from model outputs. A classification model that returns confidence scores leaks information about the input with every prediction. Training data extraction attacks recover verbatim training examples from model weights. If your data was used to fine-tune a model, that data can, in principle, be extracted by anyone with access to the model. These attacks improve with every academic publication. They do not require access to the model's infrastructure. They require only API access to the model's outputs.
The fundamental problem: conventional AI inference requires decrypting data before processing. Every defense built on top of that requirement is a mitigation, not a solution. Zero-access architecture eliminates the requirement entirely.
The Zero-Access Pipeline
Zero-access AI inference means that at no point in the pipeline does any party other than the data owner hold plaintext data. The model operator processes ciphertext. The cloud provider stores ciphertext. The GPU computes on ciphertext. The output is ciphertext, decryptable only by the data owner. Here is the pipeline, stage by stage.
Stage 1: Client-Side Encryption
The data owner encrypts their input using a Fully Homomorphic Encryption scheme. The choice of scheme depends on the model type, which we will cover in detail below. The encryption key never leaves the client. The encrypted input, a ciphertext, is sent to the inference server. The ciphertext is mathematically indistinguishable from random noise to anyone without the secret key. There is no partial decryption, no secure enclave handshake, no key escrow. The server receives noise that happens to encode a computation.
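To make Stage 1 concrete, here is a minimal sketch of the client-side flow, written against a hypothetical `h33_client` Python binding. The module, function, and field names are illustrative assumptions, not the SDK's actual API.

```python
# Minimal sketch of Stage 1 against a hypothetical `h33_client` binding.
# All names here are illustrative; the real SDK API may differ.
import h33_client as h33

# One-time key generation. The secret key never leaves this machine;
# only public and evaluation key material is ever sent to the server.
keys = h33.generate_keys(scheme="CKKS", security_bits=128)

# Encrypt the input feature vector locally.
features = [72.0, 1.0, 0.38, 120.0]   # plaintext input, e.g. clinical features
ciphertext = h33.encrypt(features, keys.public_key)

# What crosses the network is ciphertext plus evaluation keys -- neither
# reveals anything about the plaintext without the secret key.
request = {
    "ciphertext": ciphertext.serialize(),
    "eval_keys": keys.evaluation_keys.serialize(),
}
```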
Stage 2: Homomorphic Inference
The inference server executes the model on the ciphertext. This is where the mathematics of FHE become load-bearing. Homomorphic encryption permits computation on encrypted data such that the result, when decrypted, matches the result of performing the same computation on the plaintext. The server performs additions, multiplications, rotations, and comparisons on ciphertexts. It never decrypts. It cannot decrypt. It does not have the key. The model weights can be in plaintext (for standard inference) or themselves encrypted (for model-private inference). We support both configurations.
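As a sketch of what the server executes in Stage 2, here is a pseudocode view of one encrypted dense layer. The `homomorphic_*` helpers are illustrative placeholders for the engine's primitives, not real function names.

```python
# Pseudocode view of Stage 2. The homomorphic_* helpers are illustrative
# placeholders for the engine's primitives, not real function names.
def encrypted_dense_layer(ct_activations, weights, bias, activation_poly):
    """Evaluate one dense layer entirely on ciphertext.

    `weights` and `bias` are plaintext model parameters in the standard
    configuration; the activations stay encrypted throughout. Note the
    absence of any decrypt call -- the server holds no secret key.
    """
    ct = homomorphic_matvec(weights, ct_activations)    # plaintext-by-ciphertext product
    ct = homomorphic_add_plain(ct, bias)                # add plaintext bias
    ct = homomorphic_polyeval(activation_poly, ct)      # polynomial activation (see CKKS below)
    return ct

def encrypted_inference(ct_input, layers):
    ct = ct_input
    for weights, bias, activation_poly in layers:
        ct = encrypted_dense_layer(ct, weights, bias, activation_poly)
    return ct   # still a ciphertext; only the data owner can decrypt it
```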
Stage 3: Encrypted Output and Attestation
The server produces an encrypted result. Before returning it, the H33 pipeline generates an H33-74 attestation: a 74-byte cryptographic proof that the computation was performed correctly on the encrypted input, that the model version is the one claimed, and that no intermediate decryption occurred. This attestation is signed using three independent post-quantum signature families. The client receives the encrypted result and the attestation. The client decrypts locally using their secret key. The attestation can be verified by any third party without revealing the data or the result.
Choosing the Right FHE Scheme for Your Model
Not all FHE schemes are equal, and not all models require the same operations. The choice of scheme determines the computational profile, the precision guarantees, and the performance characteristics of the encrypted inference. Here is how we map schemes to model architectures.
CKKS for Neural Networks
CKKS (Cheon-Kim-Kim-Song) is an approximate arithmetic FHE scheme. It operates on vectors of complex numbers, which maps directly to the tensor operations that neural networks require. Matrix-vector multiplications, the core operation of every neural network layer, are performed using CKKS's SIMD (Single Instruction, Multiple Data) slots. A single ciphertext can encode thousands of values, and a single homomorphic operation processes all of them in parallel.
The "approximate" in CKKS means that each operation introduces a small, bounded error. For neural network inference, this is not a limitation but a feature. Neural networks are inherently approximate: they operate on floating-point values, they use stochastic training, and their outputs are probabilistic. The error introduced by CKKS is orders of magnitude smaller than the noise already present in the model's weights. We have measured CKKS inference accuracy within 0.01% of plaintext inference across standard benchmarks.
CKKS supports the operations neural networks need: addition (bias terms, residual connections), multiplication (weight application, attention scores), and rotation (data rearrangement across SIMD slots). Activation functions like ReLU, which are not natively polynomial, are replaced with low-degree polynomial approximations. A degree-7 Chebyshev approximation of ReLU introduces negligible accuracy loss while remaining efficiently computable in CKKS.
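To make the polynomial substitution concrete, here is a runnable sketch that fits a degree-7 Chebyshev approximation to ReLU with NumPy. The interval and degree used in production depend on each layer's activation range, so treat the specifics as illustrative.

```python
import numpy as np

# Fit a degree-7 Chebyshev polynomial to ReLU on a normalized interval.
# The production interval and degree depend on the layer's activation
# statistics; this is an illustrative fit, not the deployed coefficients.
x = np.linspace(-1.0, 1.0, 2001)
relu = np.maximum(x, 0.0)
coeffs = np.polynomial.chebyshev.chebfit(x, relu, deg=7)

# The FHE engine evaluates this polynomial on ciphertexts; here we only
# check its plaintext error against the true ReLU.
approx = np.polynomial.chebyshev.chebval(x, coeffs)
print("max abs error on [-1, 1]:", float(np.max(np.abs(approx - relu))))
```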
BFV for Decision Trees and Exact Computation
BFV (Brakerski/Fan-Vercauteren) is an exact arithmetic FHE scheme operating on integers modulo a plaintext modulus. Unlike CKKS, BFV introduces zero approximation error. The decrypted result is bit-for-bit identical to the plaintext computation. This makes BFV the correct choice for decision trees, random forests, gradient-boosted models, and any classifier where the output must be deterministic.
Decision tree inference is a sequence of comparisons: is feature X greater than threshold T? Each comparison is an integer operation. BFV handles these natively. A random forest of 500 trees can be evaluated on encrypted input by encoding each comparison as an arithmetic circuit over BFV ciphertexts. The result is an encrypted class label or regression value, exact to the last bit.
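The plaintext sketch below shows the arithmetic-circuit view of a depth-2 tree: each comparison becomes a 0/1 indicator, and the output is a sum of path-indicator-times-leaf products. Under BFV the same structure is evaluated homomorphically on encrypted features; the thresholds and labels here are invented for illustration.

```python
# Plaintext view of a depth-2 decision tree as an arithmetic circuit.
# Thresholds and leaf labels are made up for illustration. Under BFV,
# each comparison and product is evaluated homomorphically.
def tree_as_circuit(x0: int, x1: int) -> int:
    b0 = int(x0 > 40)     # root split:        feature 0 > 40 ?
    b1 = int(x1 > 7)      # left child split:  feature 1 > 7 ?
    b2 = int(x1 > 12)     # right child split: feature 1 > 12 ?

    # Exactly one path indicator evaluates to 1; the output is the
    # matching leaf label, computed as a sum of products.
    leaves = [0, 1, 1, 0]
    paths = [
        (1 - b0) * (1 - b1),   # left, left
        (1 - b0) * b1,         # left, right
        b0 * (1 - b2),         # right, left
        b0 * b2,               # right, right
    ]
    return sum(p * leaf for p, leaf in zip(paths, leaves))

print(tree_as_circuit(35, 9))   # -> 1
```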
H33's BFV implementation uses a single 56-bit modulus with polynomial degree N=4096, providing 128 bits of security while keeping ciphertext sizes manageable. Our batch processing encodes 32 inputs per ciphertext using the CRT (Chinese Remainder Theorem) batching slots, achieving 943 microseconds per batch on Graviton4 hardware.
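Rendered as a plain configuration, that parameter set looks roughly like the dictionary below; the field names are illustrative, not the engine's actual configuration schema.

```python
# Illustrative rendering of the BFV parameter set described above; the
# field names are not the engine's actual configuration schema.
bfv_params = {
    "poly_modulus_degree": 4096,     # ring dimension N
    "ciphertext_modulus_bits": 56,   # single 56-bit modulus
    "security_level_bits": 128,
    "batched_inputs_per_ct": 32,     # packed via CRT batching slots
}
```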
TFHE for Boolean Classification
TFHE (Torus Fully Homomorphic Encryption) operates on individual bits rather than vectors of integers or complex numbers. Its distinguishing feature is programmable bootstrapping: the ability to evaluate arbitrary lookup tables on encrypted bits in constant time. This makes TFHE uniquely suited for binary classifiers, comparators, and logic circuits that process encrypted boolean flags.
Where CKKS and BFV amortize cost across many values packed into a single ciphertext, TFHE processes individual encrypted bits with low latency per operation. For applications like encrypted fraud flags (is this transaction suspicious: yes or no?), encrypted access control decisions (does this user have permission: yes or no?), or encrypted diagnostic thresholds (is this biomarker above critical: yes or no?), TFHE provides the minimal-overhead path.
Our TFHE implementation achieves 768 TPS for 16-bit equality comparisons and 372 TPS for 16-bit greater-than operations on 96-channel hardware. For boolean classification tasks that reduce to a small number of bit-level comparisons, TFHE provides the lowest end-to-end latency of the three schemes.
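The lookup-table framing is easiest to see in plaintext. The sketch below shows the table a threshold classifier reduces to; in the encrypted version, `reading` is a TFHE ciphertext and the whole lookup is folded into a single programmable bootstrapping operation. The threshold and 8-bit domain are illustrative.

```python
# Plaintext view of what programmable bootstrapping computes: an arbitrary
# lookup table applied to a small-domain encrypted value. The threshold
# and 8-bit domain are illustrative.
CRITICAL_THRESHOLD = 180
lut = [1 if v > CRITICAL_THRESHOLD else 0 for v in range(256)]

def is_above_critical(reading: int) -> int:
    # In TFHE, this lookup is one bootstrapping over an encrypted `reading`.
    return lut[reading]

print(is_above_critical(192))   # -> 1
print(is_above_critical(150))   # -> 0
```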
| FHE Scheme | Data Type | Best For | Error Model |
|---|---|---|---|
| CKKS | Approximate real/complex | Neural networks, embeddings | Bounded approximation |
| BFV | Exact integers | Decision trees, exact classifiers | Zero error |
| TFHE | Individual bits | Boolean classifiers, comparators | Zero error (per bootstrap) |
Encrypted Model Weights: When the Model Is Also Private
In the standard zero-access configuration, the model weights are in plaintext and the input is encrypted. The model operator can see the model but not the data. For many deployments, this is sufficient. But there is a second configuration: the model weights themselves are encrypted.
This matters when the model is proprietary and the model owner does not want to reveal the weights to the inference operator. A pharmaceutical company's drug-interaction model, a defense contractor's threat classifier, a hedge fund's trading signal model. In these cases, the model owner encrypts the weights under their own key, and inference is performed as a two-party computation where neither party sees the other's private input.
CKKS supports this configuration natively. The encrypted model weights and encrypted input data are multiplied homomorphically. The result is encrypted under a combination of both keys. A lightweight key-switching protocol produces a result decryptable by the data owner alone. The model owner learns nothing about the data. The data owner learns nothing about the model beyond what the output reveals. The inference operator learns nothing about either.
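As a role sketch, the dual-private flow looks like the following. It reuses the hypothetical `h33_client` names from Stage 1, every identifier is a placeholder, and the single `key_switch()` call collapses a protocol that in reality involves more key material exchange than shown.

```python
# Role sketch of the dual-private configuration, reusing the hypothetical
# `h33_client` names from Stage 1. The single key_switch() call below is
# a simplification of the actual key-switching protocol.
import h33_client as h33

# Model owner, offline, once: encrypt the weights under their own key.
ct_weights = h33.encrypt(model_weights, model_owner_keys.public_key)

# Data owner, per request: encrypt the input under their own key.
ct_input = h33.encrypt(features, data_owner_keys.public_key)

# Inference operator: ciphertext-by-ciphertext evaluation. The result is
# encrypted under a combination of both keys; the operator can read neither.
ct_joint = homomorphic_matvec(ct_weights, ct_input)

# Key switching re-targets the result so only the data owner can decrypt it.
ct_result = h33.key_switch(ct_joint, to_key=data_owner_keys.public_key)
```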
Why the Alternatives Fall Short
Zero-access architecture did not emerge in a vacuum. It exists because the alternatives have fundamental limitations that cannot be patched.
Trusted Execution Environments (TEEs)
Intel SGX, AMD SEV, and ARM TrustZone create hardware-isolated enclaves where code and data are protected from the operating system and hypervisor. In theory, data is encrypted in memory and only decrypted inside the enclave. In practice, SGX has been broken repeatedly. The Foreshadow attack extracted SGX enclave secrets via speculative execution. The Plundervolt attack corrupted SGX computations via voltage manipulation. The SmashEx attack broke SGX's exception handling. AMD SEV has its own catalog of side-channel vulnerabilities.
The fundamental issue is that TEEs rely on hardware for security guarantees. Hardware has side channels. Microarchitectural state leaks information through timing, power consumption, and electromagnetic emanation. Every new CPU generation introduces new microarchitectural features and new side channels. TEEs provide defense in depth, and we support them as an additional layer, but they are not a substitute for mathematical encryption guarantees.
Differential Privacy
Differential privacy adds calibrated noise to model outputs so that no individual input can be inferred from the output. It provides a rigorous mathematical privacy guarantee, parameterized by epsilon. The problem is accuracy. Useful differential privacy (small epsilon) requires adding significant noise, which degrades model accuracy. A diagnostic model operating under differential privacy may become clinically useless. A financial model under differential privacy may produce predictions too noisy to trade on.
More fundamentally, differential privacy protects against output-based inference. It does not protect the input during processing. The model still sees the plaintext input. The cloud provider still sees the plaintext input. Differential privacy is an output defense, not an input defense. Zero-access architecture protects the input, the output, and every intermediate state.
Federated Learning
Federated learning keeps raw data on the client device and sends only model gradients to a central server. The server aggregates gradients to update the model without seeing individual data points. But gradient inversion attacks have demonstrated that raw training data can be reconstructed from gradients with surprising fidelity. A single gradient update can leak individual training examples. Federated learning also does not apply to inference at all. It is a training technique, not an inference architecture. Once the model is trained, inference still requires sending data to the model, which returns us to the original problem.
| Approach | Input Protected | During Processing | Output Protected | Cryptographic Guarantee |
|---|---|---|---|---|
| TEE (SGX/SEV) | Partial | Hardware-dependent | No | No (hardware trust) |
| Differential Privacy | No | No | Partial | Statistical only |
| Federated Learning | Training only | No (gradients leak) | No | No |
| FHE (Zero-Access) | Yes | Yes | Yes | Yes (lattice hardness) |
H33-74 Attestation: Proof That It Actually Happened
Running inference on encrypted data is necessary but not sufficient. You also need proof. Proof that the computation was performed on the ciphertext you submitted. Proof that the model version is the one you agreed to. Proof that no intermediate decryption occurred. Proof that the result corresponds to the input. Without proof, you are back to trusting the operator.
Every inference in the H33 pipeline produces an H33-74 attestation. This is a 74-byte cryptographic artifact, 32 bytes stored on-chain and 42 bytes in our cache layer, that encodes the following commitments: a SHA3-256 hash of the input ciphertext, a SHA3-256 hash of the output ciphertext, a commitment to the model version and configuration, a timestamp, and a signature bundle using three independent post-quantum signature families. The three families are ML-DSA (lattice-based), FALCON (NTRU lattice-based), and SLH-DSA (hash-based). An attacker must break all three independent mathematical hardness assumptions simultaneously to forge an attestation.
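As a field-level sketch, the commitments listed above map onto something like the structure below. The names and layout are illustrative, not the H33-74 wire format.

```python
from dataclasses import dataclass

# Field-level sketch of what an H33-74 attestation commits to, as listed
# above. Names and layout are illustrative, not the actual wire format.
@dataclass(frozen=True)
class Attestation:
    input_ct_hash: bytes      # SHA3-256 of the submitted input ciphertext
    output_ct_hash: bytes     # SHA3-256 of the returned output ciphertext
    model_commitment: bytes   # commitment to model version and configuration
    timestamp: int            # when the inference was performed
    sig_ml_dsa: bytes         # ML-DSA signature  (lattice-based)
    sig_falcon: bytes         # FALCON signature  (NTRU lattice-based)
    sig_slh_dsa: bytes        # SLH-DSA signature (hash-based)
```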
The attestation is verifiable by any third party. An auditor, a regulator, a counterparty, or an automated compliance system can verify that a specific inference was performed correctly without seeing the data or the result. The verification operation takes 71 microseconds. This is not a compliance report generated after the fact. It is a cryptographic proof generated as a byproduct of every inference, in real time, at line speed.
What the Attestation Chain Proves
Each attestation links to the previous attestation for the same data owner, forming a tamper-evident chain. Any modification to a historical attestation invalidates all subsequent attestations in the chain. This gives you an append-only, cryptographically linked audit log of every inference ever performed on your data. If a regulator asks for evidence that patient data was processed correctly six months ago, you produce the attestation. If a counterparty disputes a model output, you produce the attestation. If an internal audit needs to verify that a specific model version was used for a specific decision, you produce the attestation.
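The tamper-evidence property is simple to see in a minimal sketch: each link digests the previous link together with the current attestation, so changing any historical entry changes every digest after it. This is a simplification of the real chain structure.

```python
import hashlib

# Minimal sketch of the chaining property. Each link digests the previous
# digest together with the current attestation bytes, so editing any
# historical attestation invalidates everything after it.
def chain_digest(prev_digest: bytes, attestation_bytes: bytes) -> bytes:
    return hashlib.sha3_256(prev_digest + attestation_bytes).digest()

genesis = b"\x00" * 32
d1 = chain_digest(genesis, b"attestation-1")
d2 = chain_digest(d1, b"attestation-2")

# Tampering with the first attestation changes d1, which changes d2, ...
d1_tampered = chain_digest(genesis, b"attestation-1-modified")
assert chain_digest(d1_tampered, b"attestation-2") != d2
```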
Performance: The Overhead Is Real, but Manageable
FHE inference is not free. Encrypted computation is approximately three to five orders of magnitude slower than plaintext computation for equivalent operations. A plaintext neural network inference that takes 10 milliseconds may take 10 seconds under CKKS encryption. This is the honest answer, and anyone who claims otherwise is either using trivial models or misleading you.
But "three to five orders of magnitude" needs context. First, the overhead is constant factor, not algorithmic. It does not change the complexity class of the computation. A model that is O(n) in plaintext is O(n) in ciphertext, with a larger constant. Second, the overhead amortizes across batch processing. Our BFV implementation encodes 32 inputs per ciphertext, processing all 32 in a single homomorphic operation. The per-input overhead is 1/32 of the per-ciphertext overhead. Third, for many high-value use cases, the alternative to encrypted inference is not plaintext inference. It is no inference at all, because the data cannot leave the security boundary. A hospital that cannot send patient data to a cloud model because of HIPAA constraints gets zero value from a fast model it cannot use. A 10-second encrypted inference is infinitely better than no inference.
On current hardware (Graviton4 with 192 vCPUs), our full pipeline processes a 32-input BFV batch in 1,345 microseconds total: 943 microseconds for the FHE computation, 391 microseconds for batch attestation, and under 1 microsecond for the cached ZKP verification. That is 42 microseconds per authenticated inference. For CKKS neural network inference, latency depends on model depth, but a 4-layer network with 512-dimensional hidden layers completes encrypted inference in under 2 seconds per input.
Deployment Architecture
The zero-access pipeline is designed to integrate with existing ML infrastructure, not replace it. The deployment pattern is a sidecar: your model runs in its existing framework (PyTorch, TensorFlow, ONNX Runtime), and the H33 FHE layer wraps the input and output boundaries.
The client SDK handles key generation, encryption, and decryption. It is available in Rust, with bindings for Python, TypeScript, and Go. Key generation is a one-time operation per client. The public key is sent to the server; the secret key never leaves the client. Encryption and decryption are local operations, typically completing in single-digit milliseconds for standard input sizes.
The server component intercepts encrypted inputs, routes them through the homomorphic inference engine, generates the attestation, and returns the encrypted result. It does not require modifications to the model itself. The model's weight matrices are loaded into the FHE scheme's plaintext space (or encrypted, for the dual-private configuration), and inference proceeds through the model's existing layer structure, with each layer's operations translated to their homomorphic equivalents.
Scaling is horizontal. Each inference is independent, with no shared state between requests. A load balancer distributes encrypted inference requests across a pool of FHE compute workers. Each worker processes one batch at a time. Autoscaling is based on queue depth, not CPU utilization, because FHE operations saturate the CPU by design.
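A minimal sketch of that scaling rule: derive the desired worker count from queue depth, with target batches per worker as the tunable knob. The constants are illustrative, not H33 defaults.

```python
# Sketch of queue-depth-based autoscaling for FHE workers. CPU utilization
# is a useless signal here (a busy FHE worker is always near 100%), so the
# desired pool size is derived from queue depth instead. The constants are
# illustrative, not H33 defaults.
TARGET_BATCHES_PER_WORKER = 4

def desired_workers(queue_depth: int, min_workers: int = 2, max_workers: int = 64) -> int:
    target = -(-queue_depth // TARGET_BATCHES_PER_WORKER)   # ceiling division
    return max(min_workers, min(max_workers, target))

print(desired_workers(queue_depth=37))   # -> 10 workers
```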
Who Needs This Now
Zero-access architecture is not for every AI deployment. It is for deployments where the cost of data exposure exceeds the cost of encrypted computation. That threshold is lower than most people think. Healthcare organizations processing diagnostic models on patient data. Financial institutions running risk models on portfolio positions. Defense and intelligence organizations processing classified inputs through unclassified models. Pharmaceutical companies evaluating drug-interaction models without revealing compound structures. Any organization operating under the EU AI Act's high-risk data segregation requirements (covered in depth in our companion post on EU AI Act compliance).
If your data is valuable enough that you encrypt it at rest and in transit, but you decrypt it for inference, you have a gap. Zero-access architecture closes it.
See Zero-Access Inference in Production
We will walk through the full pipeline on your data, on your model, with your security team in the room. Encrypted input to encrypted output, with attestation at every step.
Schedule a Demo