How to Run AI on Encrypted Data
Encrypted ML inference with CKKS for neural networks, BFV for decision trees, TFHE for boolean circuits, and the H33 Agent-Zero architecture
The promise of AI depends on access to data. The promise of privacy depends on restricting access to data. These two promises are in direct conflict, and every organization that deploys AI on sensitive data -- healthcare, finance, government, legal -- lives in the tension between them. Anonymization fails. Synthetic data introduces distribution shift. Federated learning leaks gradients. Differential privacy degrades model quality. Every approach that tries to preserve privacy by limiting or transforming the data makes the AI worse.
Fully homomorphic encryption resolves this tension by eliminating it. FHE allows you to run AI models directly on encrypted data. The model operator never sees the plaintext input. The model operator never sees the plaintext output. The computation happens entirely on ciphertexts, and the result, when decrypted by the data owner, is mathematically identical to the result of running the same model on plaintext. Privacy and AI quality are no longer tradeoffs. They are simultaneous guarantees.
This article covers the three FHE schemes used for encrypted AI inference, the specific architectural patterns that make each one practical, the production considerations that separate research prototypes from deployed systems, and the H33 Agent-Zero framework that brings encrypted AI to production at 2,293,766 authentications per second.
CKKS for Neural Network Inference
Neural networks operate on real-valued numbers. Weights, biases, activations, and outputs are all floating-point quantities. The CKKS (Cheon-Kim-Kim-Song) FHE scheme is designed specifically for approximate arithmetic on real numbers, making it the natural choice for encrypted neural network inference.
A CKKS-encrypted neural network layer works as follows. The input vector is encrypted into a CKKS ciphertext, with each element of the vector occupying a slot in the polynomial encoding. The weight matrix is encoded (but not encrypted -- the model weights are not secret in this scenario, only the data is). The matrix-vector multiplication is performed by a series of ciphertext-plaintext multiplications and rotations. The bias addition is a ciphertext-plaintext addition. The activation function is approximated as a polynomial (since FHE only supports additions and multiplications, non-polynomial functions like ReLU must be approximated).
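The matrix-vector step above can be sketched in plaintext. This is a minimal model of the rotation-based "diagonal method" commonly used for encrypted linear layers: the matrix is split into generalized diagonals, and the product is built from rotations, elementwise multiplications, and additions -- the only operations a ciphertext supports. Here Python lists stand in for ciphertext slots; in a real CKKS library, `rotate` and the elementwise product would be homomorphic operations.

```python
def rotate(v, k):
    """Cyclic left rotation, modeling a CKKS slot rotation."""
    return v[k:] + v[:k]

def diag_matvec(matrix, v):
    """Compute matrix @ v using only rotations, elementwise
    products, and additions -- the operations available on
    ciphertexts. Assumes a square matrix matching len(v)."""
    n = len(v)
    acc = [0.0] * n
    for k in range(n):
        # k-th generalized diagonal of the matrix
        diag = [matrix[i][(i + k) % n] for i in range(n)]
        rot = rotate(v, k)
        acc = [a + d * r for a, d, r in zip(acc, diag, rot)]
    return acc

# Check against a direct matrix-vector product.
M = [[1.0, 2.0], [3.0, 4.0]]
x = [5.0, 6.0]
print(diag_matvec(M, x))  # [1*5 + 2*6, 3*5 + 4*6] = [17.0, 39.0]
```

The bias addition and polynomial activation then apply slot-wise to the accumulated result, keeping the whole layer inside the add/multiply/rotate instruction set.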
The polynomial approximation of activation functions is the most significant engineering challenge in encrypted neural network inference. ReLU, the most common activation in modern networks, is a piecewise-linear function that cannot be directly computed in FHE. The standard approach is to replace ReLU with a polynomial approximation -- typically a low-degree Chebyshev polynomial -- that is close enough over the expected input range. For networks trained specifically for encrypted inference, you can use polynomial activations (like square or higher-degree polynomials) during training, eliminating the approximation gap entirely.
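A minimal sketch of the Chebyshev approach, in plaintext: interpolate ReLU at Chebyshev nodes on [-1, 1] and evaluate the resulting series with additions and multiplications only, which is exactly what CKKS supports. The degree and interval here are illustrative; production systems tune both to the network's actual activation range.

```python
import math

def cheb_coeffs(f, degree):
    """Chebyshev interpolation coefficients for f on [-1, 1]."""
    n = degree + 1
    nodes = [math.cos(math.pi * (k + 0.5) / n) for k in range(n)]
    coeffs = []
    for j in range(n):
        # T_j(x) = cos(j * arccos(x)) at the Chebyshev nodes
        c = sum(f(x) * math.cos(j * math.acos(x)) for x in nodes)
        coeffs.append(2.0 * c / n)
    coeffs[0] /= 2.0
    return coeffs

def cheb_eval(coeffs, x):
    """Evaluate the Chebyshev series via the T_j recurrence --
    additions and multiplications only."""
    t_prev, t_cur = 1.0, x
    total = coeffs[0] + coeffs[1] * x
    for c in coeffs[2:]:
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
        total += c * t_cur
    return total

relu = lambda x: max(x, 0.0)
poly = cheb_coeffs(relu, 4)  # degree 4 is illustrative
err = max(abs(cheb_eval(poly, x / 100.0) - relu(x / 100.0))
          for x in range(-100, 101))
print(f"max error of degree-4 approximation on [-1, 1]: {err:.3f}")
```

Raising the degree shrinks the error but consumes more multiplicative depth, so the degree is a direct tradeoff between accuracy and noise budget.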
SIMD Batching for Throughput
CKKS supports SIMD (Single Instruction Multiple Data) encoding, where a single ciphertext holds multiple independent values in its polynomial slots. H33's implementation uses 4,096 SIMD slots per ciphertext. For inference, this means you can process 4,096 independent inputs through the same network layer in a single encrypted operation. The computation cost is the same whether you are processing 1 input or 4,096 inputs, so batching is essential for practical throughput.
In H33's production pipeline, SIMD batching is the primary throughput lever. The BFV pipeline processes 32 users per ciphertext for the authentication workload, utilizing a subset of the available 4,096 slots. For pure CKKS inference workloads where every slot is filled, throughput scales linearly with the slot count. This is why the per-authentication cost in H33's pipeline is 38 microseconds -- the batch amortization reduces the per-input cost by orders of magnitude compared to single-input processing.
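The amortization arithmetic is worth making explicit. The sketch below uses the 4,096-slot figure from the text; the per-layer cost is an illustrative placeholder, not a measured number. One homomorphic evaluation costs the same regardless of how many slots are occupied, so filling every slot divides the per-input cost by the slot count.

```python
slots = 4096              # CKKS/BFV SIMD slots per ciphertext
layer_cost_ms = 5.0       # illustrative cost of one encrypted layer

per_input_single = layer_cost_ms / 1        # one input per ciphertext
per_input_batched = layer_cost_ms / slots   # every slot filled
speedup = per_input_single / per_input_batched

print(f"unbatched: {per_input_single:.3f} ms/input")
print(f"batched:   {per_input_batched * 1000:.3f} us/input")
print(f"speedup:   {speedup:.0f}x")   # equals the slot count
```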
BFV for Decision Trees and Classification
Not all AI is neural networks. Decision trees, random forests, gradient-boosted classifiers, and rule-based systems operate on discrete comparisons and integer arithmetic. These models are better served by BFV (Brakerski-Fan-Vercauteren), which provides exact integer arithmetic without the approximation overhead of CKKS.
An encrypted decision tree evaluation works by converting the tree traversal into a polynomial circuit. At each node, the feature value is compared to the threshold. In BFV, this comparison is implemented as a subtraction followed by a sign extraction circuit (which is more complex than it sounds in FHE, since you cannot branch on encrypted values). The result of each comparison is an encrypted bit: 0 or 1. The final classification is a weighted sum of leaf values, where the weights are products of the encrypted comparison bits along each path from root to leaf.
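A plaintext sketch of that compilation, for a depth-2 tree: every comparison yields a 0/1 value, and the output is a sum of leaf labels weighted by products of those bits along each root-to-leaf path. In BFV the comparison bits would come from the encrypted sign-extraction circuit; here a direct comparison stands in for it, and the thresholds and labels are invented for illustration.

```python
def ge_bit(x, threshold):
    """Stand-in for the encrypted comparison: 1 if x >= threshold."""
    return 1 if x >= threshold else 0

def eval_tree(features):
    b0 = ge_bit(features[0], 30)   # root: feature 0 vs 30
    b1 = ge_bit(features[1], 10)   # left child: feature 1 vs 10
    b2 = ge_bit(features[1], 50)   # right child: feature 1 vs 50
    # Leaves hold class labels; exactly one path product equals 1,
    # so the sum selects a single leaf without any branching.
    return ((1 - b0) * (1 - b1) * 0 +
            (1 - b0) * b1       * 1 +
            b0       * (1 - b2) * 2 +
            b0       * b2       * 3)

print(eval_tree([25, 12]))  # left subtree, feature 1 >= 10 -> class 1
print(eval_tree([40, 60]))  # right subtree, feature 1 >= 50 -> class 3
```

Note that all four path products are always computed -- the circuit never branches, which is precisely what makes it evaluable on ciphertexts.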
The advantage of this approach is that the result is exact. If the plaintext decision tree would classify an input as class 3, the encrypted decision tree will classify it as class 3, with no approximation error. For applications where the classification must be precisely reproducible -- regulatory compliance, legal discovery, medical diagnosis -- exact arithmetic is not optional.
Encrypted Random Forests
Random forests present an interesting optimization opportunity in FHE. Each tree in the forest is independent, which means you can evaluate all trees in parallel using separate ciphertexts or SIMD slots. A forest of 100 trees with 4,096 SIMD slots per ciphertext can evaluate all 100 trees on 40 different inputs simultaneously, producing 4,000 per-tree results in a single batch. The majority vote across trees then reduces those to 40 final classifications, one per input, computed as an encrypted summation and threshold comparison.
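The vote-aggregation step, sketched for binary classification: each tree's 0/1 vote is summed, then compared against half the forest size -- one addition chain and one threshold comparison, both expressible in FHE. Plaintext bits stand in for encrypted ones here.

```python
def majority_vote(votes):
    """Majority vote over per-tree 0/1 votes. The sum would be an
    encrypted summation in BFV; the comparison a threshold circuit."""
    total = sum(votes)
    return 1 if 2 * total > len(votes) else 0

votes = [1, 0, 1, 1, 0, 1, 1]   # 7 trees, 5 vote for class 1
print(majority_vote(votes))     # 1
```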
This level of parallelism is what makes encrypted random forests practical. The per-tree evaluation cost is high relative to plaintext, but the per-forest-per-input cost, amortized across SIMD batching, brings the overhead into a range that production systems can tolerate.
TFHE for Boolean Circuits and Arbitrary Logic
Some AI workloads require operations that neither CKKS nor BFV handles efficiently: encrypted string matching, arbitrary comparisons with branching, lookup tables, and control flow. TFHE (Torus Fully Homomorphic Encryption) operates on individual encrypted bits and supports arbitrary boolean gates, making it the universal fallback for any computation expressible as a circuit.
TFHE's programmable bootstrapping is its key advantage for AI workloads. Every gate evaluation automatically resets the noise, enabling unlimited computation depth. This means you can evaluate arbitrarily deep circuits without the noise budget constraints that limit BFV and CKKS. For AI applications that require complex conditional logic -- if the patient's age is over 65 AND their medication list contains more than 5 items AND their allergy profile matches pattern X, THEN flag for pharmacist review -- TFHE can express the entire logic tree in encrypted boolean gates.
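The pharmacist-review rule above can be sketched as a boolean circuit. In TFHE each gate would be a homomorphic gate refreshed by programmable bootstrapping, and the greater-than tests would be bit-serial comparator circuits; here plaintext bits and direct comparisons stand in for both.

```python
def AND(a, b):
    """Stand-in for a homomorphic AND gate."""
    return a & b

def gt_bit(x, threshold):
    """Stand-in for TFHE's encrypted greater-than comparator."""
    return 1 if x > threshold else 0

def flag_for_review(age, num_medications, allergy_match_bit):
    over_65 = gt_bit(age, 65)
    many_meds = gt_bit(num_medications, 5)
    # The whole rule is a fixed gate tree; no plaintext branching.
    return AND(AND(over_65, many_meds), allergy_match_bit)

print(flag_for_review(age=72, num_medications=7, allergy_match_bit=1))  # 1
print(flag_for_review(age=72, num_medications=3, allergy_match_bit=1))  # 0
```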
The tradeoff is throughput. H33's TFHE implementation achieves 768 operations per second for 8-bit greater-than comparisons and 372 for 16-bit comparisons on Graviton4. These numbers are real and measured. For batch classification tasks, TFHE is orders of magnitude slower than BFV. For complex conditional logic that cannot be expressed as integer arithmetic, TFHE is the only option that maintains full encryption throughout the computation.
Multi-Scheme Pipelines: Combining Strengths
Production encrypted AI rarely uses a single FHE scheme. The most effective architectures combine schemes, using each one in its optimal regime. H33's pipeline demonstrates this principle in its authentication workload: BFV for the encrypted inner product (exact integer arithmetic), with the downstream pipeline handling STARK verification and post-quantum signing.
For a more complex AI scenario -- say, an encrypted loan approval system -- the multi-scheme pipeline might look like this. CKKS evaluates the credit scoring neural network on the applicant's encrypted financial profile, producing an encrypted credit score (approximate real number). The encrypted score is then converted to a BFV integer representation (by scaling and rounding within the encrypted domain) and compared against the approval threshold using BFV exact arithmetic. If additional compliance checks are needed -- does the applicant match any sanctioned-entity patterns? -- TFHE boolean circuits handle the encrypted string matching against the sanctions list. The final approval decision, still encrypted, combines the outputs of all three stages.
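A plaintext analog of that pipeline makes the scheme boundaries concrete. Each function below mirrors the scheme that would run it: real-valued scoring (CKKS), scale-and-round to an integer plus a threshold comparison (BFV), and a boolean sanctions check (TFHE). All weights, thresholds, the scaling factor, and the string-match stand-in are illustrative inventions, not H33's actual logic.

```python
SCALE = 1000   # fixed-point scaling for the CKKS -> BFV handoff

def ckks_stage(profile, weights):
    """Approximate real arithmetic: the credit-scoring dot product."""
    return sum(p * w for p, w in zip(profile, weights))

def bfv_stage(score, threshold):
    """Exact integer arithmetic: scale, round, compare."""
    return 1 if round(score * SCALE) >= round(threshold * SCALE) else 0

def tfhe_stage(name, sanctioned_patterns):
    """Boolean circuit: 1 if any sanctions pattern matches. A real
    system would run encrypted string matching bit by bit."""
    return 1 if name in sanctioned_patterns else 0

def approve(profile, weights, threshold, name, sanctions):
    score_ok = bfv_stage(ckks_stage(profile, weights), threshold)
    sanctioned = tfhe_stage(name, sanctions)
    return score_ok & (1 - sanctioned)   # approve iff score ok AND clear

print(approve([0.8, 0.6], [0.5, 0.5], 0.65, "acme", {"badco"}))  # 1
```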
The scheme transitions (CKKS to BFV, BFV to TFHE) require careful parameter alignment. The polynomial degree and modulus chain must be compatible across schemes, or explicit conversion steps must be inserted. H33's FHE-IQ (Intelligent Query) system handles these transitions automatically, selecting the optimal scheme for each sub-computation and managing the parameter compatibility constraints.
H33 Agent-Zero: Encrypted AI Agents
H33 Agent-Zero extends encrypted inference from individual model calls to full agent workflows. An AI agent is a system that makes decisions, takes actions, and iterates based on results. Agents call multiple models, access external tools, and maintain state across interactions. Running an agent on encrypted data means every model call, every tool invocation, and every state update must happen on ciphertexts.
Agent-Zero achieves this by maintaining all agent state as FHE ciphertexts. The agent's memory is encrypted. The agent's intermediate reasoning steps are encrypted. The agent's tool calls pass encrypted parameters and receive encrypted results. The only plaintext in the system is the agent's code itself (the logic of what to compute), not the data it operates on.
The architectural pattern is straightforward: the agent's control flow is a fixed circuit (known at design time), and the data flowing through that circuit is encrypted. This works because FHE is data-oblivious -- the same sequence of operations executes regardless of the encrypted values. The agent cannot branch on encrypted values (it cannot see them), so the control flow must be predefined. This is a meaningful constraint that rules out free-form conversational agents but enables structured decision-making agents with predefined workflows -- exactly the type of agent used in healthcare, finance, and compliance automation.
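The no-branching constraint has a standard workaround worth showing: "if b then x else y" on an encrypted bit becomes the arithmetic multiplexer b*x + (1-b)*y, evaluated unconditionally. The bit here is a plaintext stand-in for an encrypted one.

```python
def oblivious_select(b, x, y):
    """Return x if b == 1 else y, with no data-dependent branch.
    On ciphertexts this is two multiplications and an addition."""
    return b * x + (1 - b) * y

# Both arms are always "executed": the same operation sequence runs
# regardless of b, which is what keeps the circuit fixed at design time.
print(oblivious_select(1, 10, 20))  # 10
print(oblivious_select(0, 10, 20))  # 20
```

Chaining these selects is how a predefined agent workflow encodes its decision points without ever observing a plaintext value.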
Agent-Zero Architecture
The Agent-Zero pipeline has four stages. Stage one: the data owner encrypts the input under their FHE key and sends the ciphertext to the agent. Stage two: the agent executes its predefined workflow on the encrypted data, making encrypted model calls (CKKS inference, BFV classification, TFHE logic) and encrypted tool calls (encrypted database queries, encrypted API calls with H33-74 attestation). Stage three: a STARK proof is generated covering the entire agent execution, proving the agent ran the correct workflow on the encrypted data. Stage four: the encrypted output is signed with three-family post-quantum signatures and returned to the data owner.
The data owner decrypts the result to see the agent's decision. The STARK proof lets the data owner (or a third-party auditor) verify that the agent executed the correct workflow without substituting a different model, skipping a compliance check, or tampering with intermediate results. The post-quantum signatures ensure the attestation is valid for decades, surviving any future cryptographic breaks.
Production Considerations
Model Training vs Model Inference
Encrypted inference and encrypted training are fundamentally different challenges. Encrypted inference runs a fixed model on encrypted inputs -- the computation depth is bounded and known at compile time. Encrypted training iterates over data for hundreds of epochs, with each iteration requiring deep circuits (backpropagation through the entire network). Training on encrypted data is possible with TFHE's unlimited depth bootstrapping, but the performance overhead makes it impractical for large models.
The practical approach is to train models on plaintext data (in a secure environment) and deploy the trained model for encrypted inference. This is the pattern H33 uses: the model weights are public (or encrypted under a separate key if they are proprietary), and only the inference inputs and outputs are encrypted. This dramatically reduces the FHE computation depth and makes encrypted inference practical at scale.
Activation Function Engineering
The choice of activation functions during model training determines the efficiency of encrypted inference. Models trained with ReLU activations require polynomial approximation at inference time, which adds computation depth and reduces accuracy. Models trained with polynomial activations (square function, GELU approximated as a low-degree polynomial) can be evaluated directly in CKKS without approximation overhead. If you know your model will be deployed for encrypted inference, train it with FHE-friendly activations from the start.
H33 Compile, the FHE compiler tool, analyzes a trained model and identifies activation functions that need replacement, estimates the computation depth and noise budget requirements, and suggests parameter configurations. This analysis happens before deployment, not at runtime, ensuring that production encrypted inference runs at the optimal parameter point.
Memory and Bandwidth
FHE ciphertexts are large. A BFV ciphertext with N=4096 and a 56-bit modulus is approximately 64KB. A CKKS ciphertext with multiple moduli in the chain can be several hundred KB. For a neural network with many layers, the total ciphertext traffic between layers can reach gigabytes. Memory management and ciphertext serialization must be engineered carefully to avoid becoming the bottleneck.
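The size figures follow from simple arithmetic: a fresh BFV or CKKS ciphertext is two degree-N polynomials, and in practice each coefficient occupies a 64-bit word per RNS modulus. The CKKS parameters below are illustrative, chosen only to land in the "several hundred KB" range the text describes.

```python
def ciphertext_bytes(n, num_moduli, word_bytes=8):
    """Two polynomials of degree n, one 64-bit word per
    coefficient per modulus in the chain."""
    return 2 * n * num_moduli * word_bytes

bfv = ciphertext_bytes(n=4096, num_moduli=1)    # one 56-bit modulus
ckks = ciphertext_bytes(n=8192, num_moduli=4)   # illustrative chain
print(f"BFV  (N=4096, 1 modulus): {bfv // 1024} KB")   # 64 KB
print(f"CKKS (N=8192, 4 moduli):  {ckks // 1024} KB")  # 512 KB
```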
H33's pipeline keeps all ciphertexts in shared memory within a single process, eliminating serialization overhead between pipeline stages. The 192 vCPUs on the Graviton4 metal instance share 371 GiB of memory, which is sufficient for the in-flight ciphertexts of the batched authentication pipeline. For larger models with deeper circuits and more intermediate ciphertexts, memory becomes the binding constraint, and careful ciphertext lifecycle management (freeing intermediate ciphertexts as soon as they are consumed) is essential.
Benchmarks: What Is Actually Achievable
H33's production pipeline on AWS Graviton4 (c8g.metal-48xl, 192 vCPUs) processes 2,293,766 authentications per second through the complete pipeline: BFV encryption, homomorphic inner product, STARK verification, and ML-DSA-65 signing. Per-authentication latency is 38 microseconds. The FHE stage (BFV inner product for 32 users) takes 943 microseconds, representing 70% of total pipeline time.
For CKKS inference workloads specifically, the numbers depend heavily on the model architecture. A single-layer encrypted matrix multiplication on a 128x128 matrix with 4,096 SIMD slots completes in low single-digit milliseconds. A three-layer encrypted neural network with polynomial activations runs in the tens of milliseconds per batch. These numbers are meaningful benchmarks, not marketing claims -- they reflect the actual computation times on production hardware with production security parameters.
TFHE numbers are lower but still production-viable for targeted workloads: 768 TPS for 8-bit comparisons, 372 TPS for 16-bit comparisons. These are not competitive with BFV or CKKS for bulk throughput, but they serve the specific use case of encrypted boolean logic where the other schemes cannot operate.
The Security Model: What FHE Guarantees
FHE provides IND-CPA security: an adversary who observes the ciphertext cannot distinguish between encryptions of different plaintexts. The security rests on the hardness of lattice problems (RLWE for BFV and CKKS, TLWE for TFHE), which are believed to be resistant to both classical and quantum computers. This means encrypted AI inference is post-quantum secure by default -- the data remains protected even if a quantum computer attacks the ciphertexts.
What FHE does not guarantee is protection against side-channel attacks on the computation itself. If the model operator can observe timing variations, memory access patterns, or power consumption during the encrypted computation, they may be able to infer information about the encrypted inputs. H33 mitigates this by running all FHE operations in constant time (the computation takes the same number of cycles regardless of the encrypted values) and by deploying on dedicated hardware rather than shared cloud instances where cross-tenant side channels are a concern.
The combination of FHE for data protection, STARK proofs for computation verification, and three-family post-quantum signatures for attestation creates a security model that protects the data (FHE), proves the computation (STARK), and secures the proof (PQ signatures) -- all against quantum adversaries. This layered approach is what makes encrypted AI practical for regulated industries where the security model must satisfy auditors, not just engineers.
Getting Started
The fastest path to encrypted AI inference is H33's API. Send your model definition and encrypted inputs to the inference endpoint, and receive encrypted outputs with STARK proofs and PQ attestation. The API handles FHE parameter selection, scheme routing, noise management, and proof generation. For organizations that need on-premises deployment, H33 ships as a Rust binary that runs on any ARM or x86 server with sufficient memory.
Contact support@h33.ai for architecture guidance on your specific AI workload. The FHE scheme selection, parameter configuration, and model preparation depend on your model architecture, throughput requirements, and security constraints. We have deployed encrypted AI inference for biometric authentication, fraud detection, medical diagnosis support, and compliance automation, and each deployment has a different optimal configuration.