The Problem: AI Needs Data, Privacy Needs Encryption
Machine learning is transforming healthcare, finance, and government. But there is a fundamental tension at the center of every AI deployment: models need access to raw data to produce predictions, and privacy regulations demand that sensitive data stay encrypted.
Today, organizations face two bad options. They can decrypt data for AI processing and accept the regulatory risk, breach liability, and compliance overhead. Or they can refuse to use AI on their most sensitive datasets and lose the competitive advantage entirely.
The question "is AI safe for sensitive data?" has a straightforward answer: not when the data must be decrypted first. Every decryption event is a potential exposure event. Every exposure event is a breach vector. The real question is whether AI can work on encrypted data without ever decrypting it.
In a conventional ML pipeline, data is decrypted in memory during inference. Even with secure enclaves and encrypted storage at rest, the plaintext exists in RAM for the duration of the forward pass. A memory dump, cold boot attack, or compromised hypervisor exposes everything. FHE eliminates this entire attack surface by computing directly on ciphertext.
HIPAA, GDPR, CCPA, and BIPA all impose strict requirements on when and where personal data can be processed in plaintext. For healthcare imaging, financial credit models, and biometric verification, the compliance cost of decryption-based ML is measured in millions of dollars per year in audit, insurance, and breach notification expenses.
What Is FHE Inference?
Fully Homomorphic Encryption (FHE) allows arbitrary computation on encrypted data without decrypting it. The mathematical foundation is simple: given ciphertexts Enc(a) and Enc(b), FHE lets you compute Enc(a + b) and Enc(a × b) directly. Since additions and multiplications suffice to build arbitrary circuits, you can evaluate any function — including neural network layers.
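The slotwise homomorphism that schemes like CKKS build on can be seen without any cryptography at all. The toy NumPy sketch below (plaintext encoding only, no encryption, no security) shows that when values are placed in the "slots" of a polynomial mod X^N + 1, adding or multiplying the polynomials adds or multiplies the slot values elementwise:

```python
import numpy as np

N = 8  # toy ring dimension (H33 uses N = 32768)

# The N roots of X^N + 1; evaluating a polynomial at them gives its "slots"
roots = np.exp(1j * np.pi * (2 * np.arange(N) + 1) / N)
V = np.vander(roots, N, increasing=True)

def encode(vals):
    # Interpolate the polynomial whose evaluations at the roots are `vals`
    return np.linalg.solve(V, vals)

def decode(coeffs):
    # Evaluate the polynomial back at the roots
    return V @ coeffs

def mul_mod(a, b):
    # Polynomial product reduced mod X^N + 1 (X^N wraps around to -1)
    full = np.convolve(a, b)
    res = full[:N].copy()
    res[: len(full) - N] -= full[N:]
    return res

a = np.array([1.0, -2.0, 3.0, 0.5, 2.0, -1.0, 0.25, 4.0], dtype=complex)
b = np.array([2.0, 1.0, -0.5, 3.0, 0.5, -1.0, 2.0, 1.5], dtype=complex)

sum_slots = decode(encode(a) + encode(b))            # slotwise a + b
prod_slots = decode(mul_mod(encode(a), encode(b)))   # slotwise a * b
```

Real CKKS adds noise and a secret key on top of this encoding, but the add/multiply structure is exactly what makes encrypted linear algebra possible.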
FHE inference means running a trained neural network's forward pass entirely on encrypted inputs. The encrypted input enters the pipeline, passes through encrypted matrix multiplications, encrypted activations, encrypted pooling, and encrypted fully-connected layers, and produces an encrypted output. Only the data owner can decrypt the result.
Encrypted Layer Operations
A neural network forward pass consists of three fundamental operations: linear transforms (matrix multiplication + bias), activation functions (ReLU, Sigmoid, Tanh), and pooling (average, max). Each must be adapted for FHE:
- Linear layers (Conv2D, Dense) — These are natively supported by CKKS. Matrix multiplication is just additions and multiplications, which CKKS handles directly. Convolutions decompose into slot rotations and multiply-accumulate operations.
- Activation functions — ReLU, Sigmoid, and Tanh are non-polynomial and cannot be evaluated directly on ciphertext. H33 replaces them with high-fidelity polynomial approximations: degree-7 Chebyshev polynomials for Sigmoid/Tanh (R² > 0.9998), and a degree-4 polynomial approximation of the smooth ReLU surrogate x·(1 + erf(x/√2))/2 (i.e., GELU) that preserves gradient characteristics.
- Pooling layers — Average pooling is a linear operation (sum + divide) and works natively. Max pooling is replaced with average pooling during encrypted inference, which is standard practice in FHE-ML literature and introduces negligible accuracy loss (<0.3%).
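The polynomial-activation idea is straightforward to reproduce. The sketch below fits a degree-7 Chebyshev approximation to Sigmoid with NumPy, over an assumed input range of [−6, 6] (H33's actual fitting interval and coefficients are not published here; production systems pick the range from observed pre-activation statistics):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed pre-activation range for the fit
lo, hi = -6.0, 6.0
xs = np.linspace(lo, hi, 2001)
t = (2 * xs - (lo + hi)) / (hi - lo)       # rescale to Chebyshev domain [-1, 1]

# Degree-7 least-squares Chebyshev fit: a polynomial FHE can evaluate
coeffs = C.chebfit(t, sigmoid(xs), deg=7)

# Goodness of fit (R^2) over the interval
approx = C.chebval(t, coeffs)
resid = sigmoid(xs) - approx
r2 = 1.0 - np.sum(resid**2) / np.sum((sigmoid(xs) - sigmoid(xs).mean()) ** 2)
```

The resulting 8 coefficients are all the server needs to evaluate the activation on ciphertext: a degree-7 polynomial costs a handful of ciphertext multiplications instead of an impossible non-polynomial operation.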
Neural networks operate on floating-point values. BFV operates on integers; CKKS operates on approximate real numbers. CKKS encodes floats directly into polynomial coefficients with configurable precision (typically 30–40 bits), making it the natural choice for ML inference. H33-CKKS uses N=32768, log(Q)=880 bits, and 17 moduli levels — enough depth for a full ResNet-18 forward pass without bootstrapping.
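The approximate-arithmetic idea is easy to see with plain integers. Below is a minimal sketch of CKKS-style fixed-point scaling and rescaling (pure Python, no encryption; the 40-bit scale is an assumption drawn from the typical range above):

```python
SCALE_BITS = 40          # assumed CKKS scaling precision
SCALE = 1 << SCALE_BITS

def encode(x: float) -> int:
    # Scale a real value up to an integer, as CKKS does per slot
    return round(x * SCALE)

a = encode(3.14159)
b = encode(2.71828)

# Addition: both operands share the same scale, result stays at 2^40
s = a + b

# Multiplication: the scale doubles to 2^80, so a "rescale" step divides
# by 2^40 to restore it; in CKKS this is what consumes one modulus level
p = (a * b) >> SCALE_BITS
```

Each multiplication burns one of the 17 levels this way, which is why multiplicative depth, not raw operation count, is the budget that model compilation has to respect.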
Supported Models
H33's FHE inference engine currently supports four standard architectures. Each model has been re-trained with polynomial activation functions and validated for accuracy preservation against the original plaintext model.
| Model | Parameters | Encrypted Inference | Plaintext Accuracy | Encrypted Accuracy | Accuracy Loss |
|---|---|---|---|---|---|
| ResNet-18 | 11.7M | 9.8 ms | 76.1% | 75.4% | −0.7% |
| VGG-16 | 138M | 24.3 ms | 73.4% | 72.6% | −0.8% |
| MobileNet-V2 | 3.4M | 4.1 ms | 72.0% | 71.5% | −0.5% |
| EfficientNet-B0 | 5.3M | 2.9 ms | 77.1% | 76.5% | −0.6% |
Accuracy is measured on ImageNet-1K validation set (Top-1). All models were fine-tuned for 10 epochs with polynomial activations before FHE deployment. EfficientNet-B0 delivers the best performance-to-accuracy ratio due to its depthwise separable convolutions, which require fewer multiplicative depth levels in CKKS.
How It Works: Architecture
The H33 FHE inference pipeline has four stages: encryption, layer evaluation, result decryption, and batch orchestration. The entire flow uses CKKS for approximate arithmetic with encrypted model weights.
1. CKKS Encryption
Input data (e.g., a 224×224×3 image tensor) is encoded into CKKS plaintext slots using H33's slot-packing strategy. With N=32768, each ciphertext holds 16,384 complex slots. A single 224×224×3 image (150,528 values) requires 10 ciphertexts. Encryption uses the client's public key — the server never possesses the secret key.
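The ciphertext count follows directly from the parameters:

```python
import math

N = 32768
slots = N // 2                    # CKKS packs N/2 complex slots per ciphertext
image_values = 224 * 224 * 3      # one input tensor
ciphertexts = math.ceil(image_values / slots)

print(slots, image_values, ciphertexts)   # 16384 150528 10
```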
2. Encrypted Layer Evaluation
Each neural network layer is evaluated on the encrypted input. Convolutions use the rotate-and-sum approach: the kernel weights (stored as encrypted constants or plaintext constants, depending on the threat model) are multiplied with rotated versions of the input ciphertext. Batch normalization is fused into the preceding convolution's weights at compile time, eliminating an entire layer of FHE operations. Polynomial activations consume one multiplicative level each.
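The rotate-and-sum access pattern can be illustrated on an ordinary NumPy vector standing in for a ciphertext's slot vector, with `np.roll` playing the role of an FHE slot rotation. This is a sketch of the pattern only, not real encrypted arithmetic:

```python
import numpy as np

def rotate_and_sum_conv1d(slots: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Circular 1-D cross-correlation via rotations, as in FHE convolution."""
    acc = np.zeros_like(slots)
    half = len(kernel) // 2
    for j, w in enumerate(kernel):
        # In FHE: rotate the ciphertext by (j - half) slots, multiply by the
        # kernel weight (plaintext or encrypted), and accumulate.
        acc = acc + np.roll(slots, -(j - half)) * w
    return acc

x = np.arange(8, dtype=float)
k = np.array([1.0, 2.0, 1.0])
out = rotate_and_sum_conv1d(x, k)
```

A 3-tap kernel costs 3 rotations and 3 multiply-accumulates regardless of the vector length, which is why rotation count, amortized by batching, dominates convolution cost in CKKS.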
3. Encrypted Weights vs. Plaintext Weights
H33 supports two modes. In plaintext-weight mode, the model weights are known to the server and only the input data is encrypted. This is faster (plaintext-ciphertext multiplication is ~4x cheaper than ciphertext-ciphertext) and suitable for standard inference-as-a-service where the model is not secret. In encrypted-weight mode, both the model and the data are encrypted, enabling scenarios where the model itself is proprietary IP that must not be exposed to the infrastructure operator.
4. Batch Processing
H33 batches multiple inference requests into a single CKKS ciphertext using SIMD slot packing. With 16,384 slots and careful tensor layout, up to 8 images can be packed into a single ciphertext batch, amortizing the cost of rotations and key-switching across all 8 inferences simultaneously. This is the primary mechanism behind the 100,000 inferences/sec throughput figure.
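One simple batch-interleaved layout (an illustrative choice; H33's exact tensor layout is not specified here) places element e of image b at slot e·8 + b, so that every slotwise operation acts on all 8 images at once:

```python
import numpy as np

SLOTS = 16384
BATCH = 8
per_image = SLOTS // BATCH   # 2048 tensor elements per image in this ciphertext

images = np.random.randn(BATCH, per_image)

# Interleave: slot e*BATCH + b holds element e of image b
packed = images.T.reshape(-1)

# A single slotwise op (e.g. squaring) now processes all 8 images at once
squared = packed ** 2
unpacked = squared.reshape(per_image, BATCH).T
```

Because rotations and key-switchings operate on the whole slot vector, their cost is paid once per batch rather than once per image, which is exactly the amortization the text describes.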
API Usage
```python
import h33

# Initialize the FHE inference client
client = h33.FHEInferenceClient(
    api_key="h33_pk_...",
    model="resnet18",         # resnet18 | vgg16 | mobilenetv2 | efficientnetb0
    weight_mode="plaintext",  # plaintext | encrypted
)

# Encrypt the input locally (never leaves your environment in plaintext)
image_tensor = load_image("patient_scan.dcm")  # 224x224x3 float32
encrypted_input = client.encrypt(image_tensor)

# Run inference on encrypted data — server never sees plaintext
encrypted_output = client.infer(encrypted_input)

# Decrypt the result locally
prediction = client.decrypt(encrypted_output)
print(prediction.top_k(5))
# [('pneumonia', 0.94), ('bronchitis', 0.03), ('normal', 0.02), ...]
```
```rust
use h33::fhe::ckks::{CKKSEngine, InferenceConfig};
use h33::models::ResNet18Encrypted;

// Load the pre-compiled encrypted model
let config = InferenceConfig {
    n: 32768,
    log_q: 880,
    levels: 17,
    batch_size: 8,
};
let engine = CKKSEngine::new(config);
let model = ResNet18Encrypted::load("models/resnet18_fhe.bin")?;

// Process a batch of 8 encrypted inputs in parallel
let encrypted_inputs: Vec<CKKSCiphertext> = receive_batch().await?;
let packed = engine.pack_batch(&encrypted_inputs)?;

// Full forward pass on encrypted data — ~9.8ms for 8 inferences
let encrypted_outputs = model.forward(&engine, &packed)?;

// Return encrypted results — server never saw plaintext
send_encrypted_results(encrypted_outputs).await?;
```
Performance
FHE inference has historically been orders of magnitude slower than plaintext inference. The core bottleneck is multiplicative depth: each non-linear layer consumes a CKKS level, and bootstrapping (to refresh levels) is expensive. H33 addresses this with three techniques: level-aware model compilation (fusing batch norm, minimizing activation depth), SIMD batch packing (amortizing rotation cost across multiple inputs), and NTT-accelerated polynomial evaluation (leveraging the same Montgomery NTT pipeline used in H33's authentication stack).
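Of these three techniques, batch-norm fusion is the simplest to show concretely. The sketch below folds BatchNorm statistics into the preceding linear layer's weights and bias (the standard fold; names are illustrative, not H33's API):

```python
import numpy as np

def fuse_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm applied after y = W @ x + b into a single W', b'."""
    scale = gamma / np.sqrt(var + eps)     # per-output-channel scale
    w_fused = w * scale[:, None]
    b_fused = (b - mean) * scale + beta
    return w_fused, b_fused

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)); b = rng.standard_normal(4)
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mean, var = rng.standard_normal(4), rng.random(4) + 0.5
x = rng.standard_normal(16)

# Sequential: linear layer followed by batch norm (two FHE stages)
y_bn = (w @ x + b - mean) / np.sqrt(var + 1e-5) * gamma + beta

# Fused: one linear layer, zero extra multiplicative depth for the BN
wf, bf = fuse_bn(w, b, gamma, beta, mean, var)
y_fused = wf @ x + bf
```

Because the fold happens at compile time on plaintext weights, it is free at inference time and removes a whole layer's worth of ciphertext operations per block.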
| Framework | Scheme | ResNet-18 Latency | Throughput (inf/sec) | Batching | Accuracy Loss |
|---|---|---|---|---|---|
| H33 | CKKS | 9.8 ms | 100,000 | 8/CT | −0.7% |
| Zama TFHE-rs | TFHE | 4,900 ms | ~200 | 1/CT | −1.2% |
| Microsoft SEAL | CKKS | 380 ms | ~2,600 | 1/CT | −0.8% |
| IBM HELib | CKKS | 520 ms | ~1,900 | 1/CT | −0.9% |
All benchmarks use ResNet-18 on ImageNet-1K, measured on c8g.metal-48xl (96 vCPUs, Graviton4). TFHE-rs uses Boolean circuit evaluation, which is exact but extremely slow for deep networks. SEAL and HELib use CKKS but lack H33's SIMD batch packing and NTT acceleration.
Zama's TFHE-rs evaluates neural networks using Boolean gates over encrypted bits. This provides exact computation but requires millions of bootstrapping operations per layer. A single ResNet-18 inference takes ~4.9 seconds. H33's CKKS-based approach with SIMD batching achieves the same inference in 9.8ms — a 500x speedup. The tradeoff is approximate (not exact) arithmetic, but with <1% accuracy loss, this is a non-issue for production ML workloads.
Throughput Scaling
The 100,000 inferences/sec figure is achieved through parallel batch processing across 96 Graviton4 cores. Each core processes one CKKS batch (8 inferences) independently. With ~130 batches/sec per core across 96 cores (~12,500 batches/sec total), the system sustains ~100,000 individual inferences per second under continuous load. This scales linearly — a 2-instance deployment doubles throughput to 200K inf/sec.
| Workers | Batch/sec | Inferences/sec | Per-Inference Cost |
|---|---|---|---|
| 1 | 102 | 816 | $0.00078 |
| 32 | 3,264 | 26,112 | $0.000024 |
| 96 | 12,500 | 100,000 | $0.0000063 |
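A quick check of the 96-worker row's arithmetic:

```python
BATCH = 8        # inferences packed per CKKS ciphertext
WORKERS = 96
batches_per_sec = 12_500                       # total across all workers

inferences_per_sec = batches_per_sec * BATCH   # 100,000
batches_per_core = batches_per_sec / WORKERS   # ~130 batches/sec per core
```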
Use Cases
Getting Started
H33's FHE inference API is available today. The workflow is three steps: encrypt locally, infer remotely, decrypt locally. The server never touches plaintext.
Step 1: Install the SDK
```shell
# Python
pip install h33

# Rust
cargo add h33 --features fhe-inference
```
Step 2: Encrypt and Infer
```python
import h33
import numpy as np

# 1. Create a client with your API key
client = h33.FHEInferenceClient(
    api_key="h33_pk_your_api_key",
    model="efficientnetb0",
)

# 2. Load your data (image, tabular, or any tensor)
image = np.random.randn(224, 224, 3).astype(np.float32)

# 3. Encrypt locally — only you hold the secret key
enc_input = client.encrypt(image)

# 4. Send encrypted data for inference — server sees only ciphertext
enc_result = client.infer(enc_input)

# 5. Decrypt the prediction locally
result = client.decrypt(enc_result)
print(f"Prediction: {result.label}, confidence: {result.confidence:.4f}")

# Batch inference — up to 8 images per call
batch = [client.encrypt(img) for img in images[:8]]
enc_results = client.infer_batch(batch)
predictions = [client.decrypt(r) for r in enc_results]
```
Step 3: Choose Your Model
| Model | Best For | Latency | Accuracy |
|---|---|---|---|
| efficientnetb0 | Best speed/accuracy ratio | 2.9 ms | 76.5% |
| mobilenetv2 | Lowest latency | 4.1 ms | 71.5% |
| resnet18 | Standard benchmark | 9.8 ms | 75.4% |
| vgg16 | Maximum accuracy (large models) | 24.3 ms | 72.6% |
Full API documentation, authentication setup, and language-specific guides are available in the API Reference.