The Problem: AI Needs Data, Privacy Needs Encryption
Machine learning is transforming healthcare, finance, and government. But there is a fundamental tension at the center of every AI deployment: models need access to raw data to produce predictions, and privacy regulations demand that sensitive data stay encrypted.
Today, organizations face two bad options. They can decrypt data for AI processing and accept the regulatory risk, breach liability, and compliance overhead. Or they can refuse to use AI on their most sensitive datasets and lose the competitive advantage entirely.
The question "is AI safe for sensitive data?" has a straightforward answer: not when the data must be decrypted first. Every decryption event is a potential exposure event. Every exposure event is a breach vector. The real question is whether AI can work on encrypted data without ever decrypting it.
In a conventional ML pipeline, data is decrypted in memory during inference. Even with secure enclaves and encrypted storage at rest, the plaintext exists in RAM for the duration of the forward pass. A memory dump, cold boot attack, or compromised hypervisor exposes everything. FHE eliminates this entire attack surface by computing directly on ciphertext.
HIPAA, GDPR, CCPA, and BIPA all impose strict requirements on when and where personal data can be processed in plaintext. For healthcare imaging, financial credit models, and biometric verification, the compliance cost of decryption-based ML is measured in millions of dollars per year in audit, insurance, and breach notification expenses.
What Is FHE Inference?
Fully Homomorphic Encryption (FHE) allows arbitrary computation on encrypted data without decrypting it. The mathematical foundation is simple: given ciphertexts Enc(a) and Enc(b), FHE lets you compute Enc(a + b) and Enc(a × b) directly. Since additions and multiplications suffice to build arbitrary circuits, you can evaluate any function — including neural network layers.
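The slotwise homomorphism that schemes like CKKS build on can be seen without any cryptography at all. The toy NumPy sketch below (plaintext encoding only, no encryption, no security) shows that when values are placed in the "slots" of a polynomial mod X^N + 1, adding or multiplying the polynomials adds or multiplies the slot values elementwise:

```python
import numpy as np

N = 8  # toy ring dimension (H33 uses N = 32768)

# The N roots of X^N + 1; evaluating a polynomial at them gives its "slots"
roots = np.exp(1j * np.pi * (2 * np.arange(N) + 1) / N)
V = np.vander(roots, N, increasing=True)

def encode(vals):
    # Interpolate the polynomial whose evaluations at the roots are `vals`
    return np.linalg.solve(V, vals)

def decode(coeffs):
    # Evaluate the polynomial back at the roots
    return V @ coeffs

def mul_mod(a, b):
    # Polynomial product reduced mod X^N + 1 (X^N wraps around to -1)
    full = np.convolve(a, b)
    res = full[:N].copy()
    res[: len(full) - N] -= full[N:]
    return res

a = np.array([1.0, -2.0, 3.0, 0.5, 2.0, -1.0, 0.25, 4.0], dtype=complex)
b = np.array([2.0, 1.0, -0.5, 3.0, 0.5, -1.0, 2.0, 1.5], dtype=complex)

sum_slots = decode(encode(a) + encode(b))            # slotwise a + b
prod_slots = decode(mul_mod(encode(a), encode(b)))   # slotwise a * b
```

Real CKKS adds noise and a secret key on top of this encoding, but the add/multiply structure is exactly what makes encrypted linear algebra possible.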
FHE inference means running a trained neural network's forward pass entirely on encrypted inputs. The encrypted input enters the pipeline, passes through encrypted matrix multiplications, encrypted activations, encrypted pooling, and encrypted fully-connected layers, and produces an encrypted output. Only the data owner can decrypt the result.
Encrypted Layer Operations
A neural network forward pass consists of three fundamental operations: linear transforms (matrix multiplication + bias), activation functions (ReLU, Sigmoid, Tanh), and pooling (average, max). Each must be adapted for FHE:
- Linear layers (Conv2D, Dense) — These are natively supported by CKKS. Matrix multiplication is just additions and multiplications, which CKKS handles directly. Convolutions decompose into slot rotations and multiply-accumulate operations.
- Activation functions — ReLU, Sigmoid, and Tanh are non-polynomial and cannot be evaluated directly on ciphertext. H33 replaces them with high-fidelity polynomial approximations: degree-7 Chebyshev polynomials for Sigmoid/Tanh (R² > 0.9998), and a degree-4 polynomial approximation of the smooth ReLU surrogate x·(1 + erf(x/√2))/2 (i.e., GELU) that preserves gradient characteristics.
- Pooling layers — Average pooling is a linear operation (sum + divide) and works natively. Max pooling is replaced with average pooling during encrypted inference, which is standard practice in FHE-ML literature and introduces negligible accuracy loss (<0.3%).
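The polynomial-activation idea is straightforward to reproduce. The sketch below fits a degree-7 Chebyshev approximation to Sigmoid with NumPy, over an assumed input range of [−6, 6] (H33's actual fitting interval and coefficients are not published here; production systems pick the range from observed pre-activation statistics):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed pre-activation range for the fit
lo, hi = -6.0, 6.0
xs = np.linspace(lo, hi, 2001)
t = (2 * xs - (lo + hi)) / (hi - lo)       # rescale to Chebyshev domain [-1, 1]

# Degree-7 least-squares Chebyshev fit: a polynomial FHE can evaluate
coeffs = C.chebfit(t, sigmoid(xs), deg=7)

# Goodness of fit (R^2) over the interval
approx = C.chebval(t, coeffs)
resid = sigmoid(xs) - approx
r2 = 1.0 - np.sum(resid**2) / np.sum((sigmoid(xs) - sigmoid(xs).mean()) ** 2)
```

The resulting 8 coefficients are all the server needs to evaluate the activation on ciphertext: a degree-7 polynomial costs a handful of ciphertext multiplications instead of an impossible non-polynomial operation.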
Neural networks operate on floating-point values. BFV operates on integers; CKKS operates on approximate real numbers. CKKS encodes floats directly into polynomial coefficients with configurable precision (typically 30–40 bits), making it the natural choice for ML inference. H33-CKKS uses N=32768, log(Q)=880 bits, and 17 moduli levels — enough depth for a full ResNet-18 forward pass without bootstrapping.
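The approximate-arithmetic idea is easy to see with plain integers. Below is a minimal sketch of CKKS-style fixed-point scaling and rescaling (pure Python, no encryption; the 40-bit scale is an assumption drawn from the typical range above):

```python
SCALE_BITS = 40          # assumed CKKS scaling precision
SCALE = 1 << SCALE_BITS

def encode(x: float) -> int:
    # Scale a real value up to an integer, as CKKS does per slot
    return round(x * SCALE)

a = encode(3.14159)
b = encode(2.71828)

# Addition: both operands share the same scale, result stays at 2^40
s = a + b

# Multiplication: the scale doubles to 2^80, so a "rescale" step divides
# by 2^40 to restore it; in CKKS this is what consumes one modulus level
p = (a * b) >> SCALE_BITS
```

Each multiplication burns one of the 17 levels this way, which is why multiplicative depth, not raw operation count, is the budget that model compilation has to respect.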
Supported Models
H33's FHE inference engine currently supports four standard architectures. Each model has been re-trained with polynomial activation functions and validated for accuracy preservation against the original plaintext model.
| Model | Parameters | Encrypted Inference | Plaintext Accuracy | Encrypted Accuracy | Accuracy Loss |
|---|---|---|---|---|---|
| ResNet-18 | 11.7M | 9.8 ms | 76.1% | 75.4% | −0.7% |
| VGG-16 | 138M | 24.3 ms | 73.4% | 72.6% | −0.8% |
| MobileNet-V2 | 3.4M | 4.1 ms | 72.0% | 71.5% | −0.5% |
| EfficientNet-B0 | 5.3M | 2.9 ms | 77.1% | 76.5% | −0.6% |
Accuracy is measured on ImageNet-1K validation set (Top-1). All models were fine-tuned for 10 epochs with polynomial activations before FHE deployment. EfficientNet-B0 delivers the best performance-to-accuracy ratio due to its depthwise separable convolutions, which require fewer multiplicative depth levels in CKKS.
How It Works: Architecture
The H33 FHE inference pipeline has four stages: encryption, layer evaluation, result decryption, and batch orchestration. The entire flow uses CKKS for approximate arithmetic with encrypted model weights.
1. CKKS Encryption
Input data (e.g., a 224×224×3 image tensor) is encoded into CKKS plaintext slots using H33's slot-packing strategy. With N=32768, each ciphertext holds 16,384 complex slots. A single 224×224×3 image (150,528 values) requires 10 ciphertexts. Encryption uses the client's public key — the server never possesses the secret key.
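The ciphertext count follows directly from the parameters:

```python
import math

N = 32768
slots = N // 2                    # CKKS packs N/2 complex slots per ciphertext
image_values = 224 * 224 * 3      # one input tensor
ciphertexts = math.ceil(image_values / slots)

print(slots, image_values, ciphertexts)   # 16384 150528 10
```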
2. Encrypted Layer Evaluation
Each neural network layer is evaluated on the encrypted input. Convolutions use the rotate-and-sum approach: the kernel weights (stored as encrypted constants or plaintext constants, depending on the threat model) are multiplied with rotated versions of the input ciphertext. Batch normalization is fused into the preceding convolution's weights at compile time, eliminating an entire layer of FHE operations. Polynomial activations consume one multiplicative level each.
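The rotate-and-sum access pattern can be illustrated on an ordinary NumPy vector standing in for a ciphertext's slot vector, with `np.roll` playing the role of an FHE slot rotation. This is a sketch of the pattern only, not real encrypted arithmetic:

```python
import numpy as np

def rotate_and_sum_conv1d(slots: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Circular 1-D cross-correlation via rotations, as in FHE convolution."""
    acc = np.zeros_like(slots)
    half = len(kernel) // 2
    for j, w in enumerate(kernel):
        # In FHE: rotate the ciphertext by (j - half) slots, multiply by the
        # kernel weight (plaintext or encrypted), and accumulate.
        acc = acc + np.roll(slots, -(j - half)) * w
    return acc

x = np.arange(8, dtype=float)
k = np.array([1.0, 2.0, 1.0])
out = rotate_and_sum_conv1d(x, k)
```

A 3-tap kernel costs 3 rotations and 3 multiply-accumulates regardless of the vector length, which is why rotation count, amortized by batching, dominates convolution cost in CKKS.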
3. Encrypted Weights vs. Plaintext Weights
H33 supports two modes. In plaintext-weight mode, the model weights are known to the server and only the input data is encrypted. This is faster (plaintext-ciphertext multiplication is ~4x cheaper than ciphertext-ciphertext) and suitable for standard inference-as-a-service where the model is not secret. In encrypted-weight mode, both the model and the data are encrypted, enabling scenarios where the model itself is proprietary IP that must not be exposed to the infrastructure operator.
4. Batch Processing
H33 batches multiple inference requests into a single CKKS ciphertext using SIMD slot packing. With 16,384 slots and careful tensor layout, up to 8 images can be packed into a single ciphertext batch, amortizing the cost of rotations and key-switching across all 8 inferences simultaneously. This is the primary mechanism behind the 100,000 inferences/sec throughput figure.
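One simple batch-interleaved layout (an illustrative choice; H33's exact tensor layout is not specified here) places element e of image b at slot e·8 + b, so that every slotwise operation acts on all 8 images at once:

```python
import numpy as np

SLOTS = 16384
BATCH = 8
per_image = SLOTS // BATCH   # 2048 tensor elements per image in this ciphertext

images = np.random.randn(BATCH, per_image)

# Interleave: slot e*BATCH + b holds element e of image b
packed = images.T.reshape(-1)

# A single slotwise op (e.g. squaring) now processes all 8 images at once
squared = packed ** 2
unpacked = squared.reshape(per_image, BATCH).T
```

Because rotations and key-switchings operate on the whole slot vector, their cost is paid once per batch rather than once per image, which is exactly the amortization the text describes.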
API Usage
```python
import h33

# Initialize the FHE inference client
client = h33.FHEInferenceClient(
    api_key="h33_pk_...",
    model="resnet18",         # resnet18 | vgg16 | mobilenetv2 | efficientnetb0
    weight_mode="plaintext",  # plaintext | encrypted
)

# Encrypt the input locally (never leaves your environment in plaintext)
image_tensor = load_image("patient_scan.dcm")  # 224x224x3 float32
encrypted_input = client.encrypt(image_tensor)

# Run inference on encrypted data — server never sees plaintext
encrypted_output = client.infer(encrypted_input)

# Decrypt the result locally
prediction = client.decrypt(encrypted_output)
print(prediction.top_k(5))
# [('pneumonia', 0.94), ('bronchitis', 0.03), ('normal', 0.02), ...]
```
```rust
use h33::fhe::ckks::{CKKSEngine, InferenceConfig};
use h33::models::ResNet18Encrypted;

// Load the pre-compiled encrypted model
let config = InferenceConfig {
    n: 32768,
    log_q: 880,
    levels: 17,
    batch_size: 8,
};
let engine = CKKSEngine::new(config);
let model = ResNet18Encrypted::load("models/resnet18_fhe.bin")?;

// Process a batch of 8 encrypted inputs in parallel
let encrypted_inputs: Vec<CKKSCiphertext> = receive_batch().await?;
let packed = engine.pack_batch(&encrypted_inputs)?;

// Full forward pass on encrypted data — ~9.8ms for 8 inferences
let encrypted_outputs = model.forward(&engine, &packed)?;

// Return encrypted results — server never saw plaintext
send_encrypted_results(encrypted_outputs).await?;
```
Performance
FHE inference has historically been orders of magnitude slower than plaintext inference. The core bottleneck is multiplicative depth: each non-linear layer consumes a CKKS level, and bootstrapping (to refresh levels) is expensive. H33 addresses this with three techniques: level-aware model compilation (fusing batch norm, minimizing activation depth), SIMD batch packing (amortizing rotation cost across multiple inputs), and NTT-accelerated polynomial evaluation (leveraging the same Montgomery NTT pipeline used in H33's authentication stack).
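Of these three techniques, batch-norm fusion is the simplest to show concretely. The sketch below folds BatchNorm statistics into the preceding linear layer's weights and bias (the standard fold; names are illustrative, not H33's API):

```python
import numpy as np

def fuse_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm applied after y = W @ x + b into a single W', b'."""
    scale = gamma / np.sqrt(var + eps)     # per-output-channel scale
    w_fused = w * scale[:, None]
    b_fused = (b - mean) * scale + beta
    return w_fused, b_fused

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)); b = rng.standard_normal(4)
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mean, var = rng.standard_normal(4), rng.random(4) + 0.5
x = rng.standard_normal(16)

# Sequential: linear layer followed by batch norm (two FHE stages)
y_bn = (w @ x + b - mean) / np.sqrt(var + 1e-5) * gamma + beta

# Fused: one linear layer, zero extra multiplicative depth for the BN
wf, bf = fuse_bn(w, b, gamma, beta, mean, var)
y_fused = wf @ x + bf
```

Because the fold happens at compile time on plaintext weights, it is free at inference time and removes a whole layer's worth of ciphertext operations per block.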
| Framework | Scheme | ResNet-18 Latency | Throughput (inf/sec) | Batching | Accuracy Loss |
|---|---|---|---|---|---|
| H33 | CKKS | 9.8 ms | 100,000 | 8/CT | −0.7% |
| Zama TFHE-rs | TFHE | 4,900 ms | ~200 | 1/CT | −1.2% |
| Microsoft SEAL | CKKS | 380 ms | ~2,600 | 1/CT | −0.8% |
| IBM HELib | CKKS | 520 ms | ~1,900 | 1/CT | −0.9% |
All benchmarks use ResNet-18 on ImageNet-1K, measured on c8g.metal-48xl (96 vCPUs, Graviton4). TFHE-rs uses Boolean circuit evaluation, which is exact but extremely slow for deep networks. SEAL and HELib use CKKS but lack H33's SIMD batch packing and NTT acceleration.
Zama's TFHE-rs evaluates neural networks using Boolean gates over encrypted bits. This provides exact computation but requires millions of bootstrapping operations per layer. A single ResNet-18 inference takes ~4.9 seconds. H33's CKKS-based approach with SIMD batching achieves the same inference in 9.8ms — a 500x speedup. The tradeoff is approximate (not exact) arithmetic, but with <1% accuracy loss, this is a non-issue for production ML workloads.
Throughput Scaling
The 100,000 inferences/sec figure is achieved through parallel batch processing across 96 Graviton4 cores. Each core processes one CKKS batch (8 inferences) independently. With ~130 batches/sec per core across 96 cores (~12,500 batches/sec total), the system sustains ~100,000 individual inferences per second under continuous load. This scales linearly — a 2-instance deployment doubles throughput to 200K inf/sec.
| Workers | Batch/sec | Inferences/sec | Per-Inference Cost |
|---|---|---|---|
| 1 | 102 | 816 | $0.00078 |
| 32 | 3,264 | 26,112 | $0.000024 |
| 96 | 12,500 | 100,000 | $0.0000063 |
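A quick check of the 96-worker row's arithmetic:

```python
BATCH = 8        # inferences packed per CKKS ciphertext
WORKERS = 96
batches_per_sec = 12_500                       # total across all workers

inferences_per_sec = batches_per_sec * BATCH   # 100,000
batches_per_core = batches_per_sec / WORKERS   # ~130 batches/sec per core
```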
Use Cases
Getting Started
H33's FHE inference API is available today. The workflow is three steps: encrypt locally, infer remotely, decrypt locally. The server never touches plaintext.
Step 1: Install the SDK
```shell
# Python
pip install h33

# Rust
cargo add h33 --features fhe-inference
```
Step 2: Encrypt and Infer
```python
import h33
import numpy as np

# 1. Create a client with your API key
client = h33.FHEInferenceClient(
    api_key="h33_pk_your_api_key",
    model="efficientnetb0",
)

# 2. Load your data (image, tabular, or any tensor)
image = np.random.randn(224, 224, 3).astype(np.float32)

# 3. Encrypt locally — only you hold the secret key
enc_input = client.encrypt(image)

# 4. Send encrypted data for inference — server sees only ciphertext
enc_result = client.infer(enc_input)

# 5. Decrypt the prediction locally
result = client.decrypt(enc_result)
print(f"Prediction: {result.label}, confidence: {result.confidence:.4f}")

# Batch inference — up to 8 images per call
batch = [client.encrypt(img) for img in images[:8]]
enc_results = client.infer_batch(batch)
predictions = [client.decrypt(r) for r in enc_results]
```
Step 3: Choose Your Model
| Model | Best For | Latency | Accuracy |
|---|---|---|---|
| efficientnetb0 | Best speed/accuracy ratio | 2.9 ms | 76.5% |
| mobilenetv2 | Lowest latency | 4.1 ms | 71.5% |
| resnet18 | Standard benchmark | 9.8 ms | 75.4% |
| vgg16 | Maximum accuracy (large models) | 24.3 ms | 72.6% |
Full API documentation, authentication setup, and language-specific guides are available in the API Reference.