Our January 2026 benchmark suite represents the most comprehensive performance analysis we've published (see the live benchmarks page for the latest numbers). Testing was conducted on AWS c8g.metal-48xl instances with AWS Graviton4 (Neoverse V2) processors, measuring production-representative workloads across all authentication modes. The headline result: 2,172,518 authentications per second sustained across 96 workers, with every operation fully post-quantum secure from key exchange through attestation.
Test Infrastructure
Benchmark Environment
Instance: AWS c8g.metal-48xl
CPU: AWS Graviton4 (Neoverse V2, 96 cores)
Memory: 377 GiB DDR5
OS: Amazon Linux 2023
H33 Version: 2.4.0
All benchmarks were run with warm caches where applicable, representing typical production conditions. Cold-start measurements are noted separately. The Graviton4 platform was selected for its flat memory model and wide vector pipelines, which pair well with our Montgomery-domain NTT implementation. We use the system allocator rather than jemalloc—on this architecture, glibc malloc is heavily optimized for ARM's memory hierarchy, and jemalloc's arena bookkeeping introduces measurable overhead under tight 96-worker FHE loops.
Full Stack Authentication
Full Stack Auth combines biometric verification, FHE-encrypted matching, ZK proof generation, and Dilithium attestation into a single API call. Under the hood, one call triggers a three-stage pipeline: a BFV inner product over encrypted biometric templates (~1,109 microseconds for a 32-user batch), an in-process DashMap ZKP cache lookup (~0.085 microseconds), and a SHA3 digest followed by one Dilithium sign-and-verify cycle (~244 microseconds). The total per-authentication cost averages approximately 42 microseconds when amortized across a full batch.
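The ~42 µs figure follows from the stage timings above. A quick back-of-the-envelope check (numbers taken from this section, not re-measured; treating the FHE and attestation stages as per-batch costs and the cache hit as per-user is our reading of the pipeline):

```python
# Per-batch stage costs from the pipeline description (microseconds).
BATCH_SIZE = 32          # users packed per ciphertext
fhe_inner_product = 1109 # BFV inner product over the 32-user batch
zkp_cache_lookup = 0.085 # in-process DashMap hit, per authentication
sign_and_verify = 244    # one Dilithium sign+verify cycle for the batch

# FHE and attestation are paid once per batch; the cache hit is per user.
per_user = (fhe_inner_product + sign_and_verify) / BATCH_SIZE + zkp_cache_lookup
print(f"{per_user:.1f} µs per authentication")  # ≈ 42.4 µs
```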
| Mode | Latency | Description |
|---|---|---|
| Turbo | 1.28ms | Optimized for speed, full security |
| Standard | 633µs | Balanced performance and features |
| Precision | 2.1ms | Maximum accuracy, extended checks |
Every mode is fully post-quantum secure. The latency difference between Turbo and Precision comes from the number of NTT polynomial multiplications and the depth of the biometric comparison circuit—not from any reduction in cryptographic strength.
The BFV Pipeline: Why N=4096 Matters
H33 uses the BFV (Brakerski/Fan-Vercauteren) fully homomorphic encryption scheme with a polynomial degree of N=4096, a single 56-bit modulus Q, and a plaintext modulus t=65537. This parameter set was deliberately chosen for authentication workloads: it satisfies the CRT batching condition (t is congruent to 1 mod 2N), which enables SIMD-style packing of 4,096 plaintext slots. Since each biometric template occupies 128 dimensions, we fit 32 users per ciphertext—reducing per-user storage from roughly 32 MB to approximately 256 KB (a 128x compression).
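The slot arithmetic is easy to sanity-check. This sketch verifies the batching condition and the users-per-ciphertext figure using only the parameters quoted above:

```python
N = 4096            # polynomial degree
t = 65537           # plaintext modulus (the Fermat prime F4)
TEMPLATE_DIM = 128  # dimensions per biometric template

# CRT batching requires t ≡ 1 (mod 2N) so that X^N + 1 splits into N
# linear factors mod t, yielding N independent plaintext slots.
assert t % (2 * N) == 1

users_per_ciphertext = N // TEMPLATE_DIM
print(users_per_ciphertext)  # 32
```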
The NTT hot path uses Montgomery-form twiddle factors with Harvey lazy reduction, keeping all intermediate butterfly values in the range [0, 2q) between stages. This eliminates division from the inner loop entirely. Enrolled templates are stored in NTT form at enrollment time so that multiply_plain_accumulate never needs a forward NTT during verification. The result: FHE batch processing for 32 users completes in roughly 1,109 microseconds, down 19.3% from the previous baseline of 1,375 microseconds.
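Why storing templates in NTT form pays off: in the transform domain, polynomial multiplication collapses to a coefficient-wise product, so a forward NTT paid once at enrollment is amortized over every later verification. The sketch below uses toy parameters (N=8, q=257) and a naive O(N²) transform rather than the production Montgomery/Harvey butterfly path; only the structure carries over:

```python
# Toy negacyclic NTT: N=8, q=257 (q ≡ 1 mod 2N); psi is a primitive 2N-th
# root of unity mod q. Illustrative parameters, not the production set.
N, q = 8, 257
psi = 249                # psi^16 ≡ 1 and psi^8 ≡ -1 (mod 257)
omega = psi * psi % q    # primitive N-th root of unity

def ntt(a):
    # Naive O(N^2) forward transform; the psi^i twist folds the reduction
    # mod X^N + 1 (negacyclic wraparound) into the transform itself.
    return [sum(a[i] * pow(psi, i, q) * pow(omega, i * j, q) for i in range(N)) % q
            for j in range(N)]

def intt(A):
    n_inv = pow(N, -1, q)
    a = [sum(A[j] * pow(omega, -i * j, q) for j in range(N)) % q for i in range(N)]
    return [x * n_inv * pow(psi, -i, q) % q for i, x in enumerate(a)]

def negacyclic_mul(a, b):
    # Schoolbook product reduced mod X^N + 1, used here as a cross-check.
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k, sign = (i + j) % N, -1 if i + j >= N else 1
            c[k] = (c[k] + sign * a[i] * b[j]) % q
    return c

template = [3, 1, 4, 1, 5, 9, 2, 6]   # "enrolled template"
probe = [2, 7, 1, 8, 2, 8, 1, 8]
stored = ntt(template)                 # forward NTT paid once, at enrollment
pointwise = [x * y % q for x, y in zip(stored, ntt(probe))]
assert intt(pointwise) == negacyclic_mul(template, probe)
```

At verification time only the probe needs a forward transform; the enrolled side is a pure pointwise multiply, which is the shape multiply_plain_accumulate exploits.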
Session Management
Session operations show the benefit of our context caching architecture:
| Operation | Latency | Speedup |
|---|---|---|
| Session Resume | 42µs | 4.4x vs Full Auth |
| Incremental Auth (5% delta) | <50µs | 4.4x vs Full Auth |
| Session Validation | 12µs | 18x vs Full Auth |
Session resume avoids re-running the full FHE pipeline by retaining the ZKP cache entry from the original authentication. An incremental auth with a 5% biometric delta reuses the previous ciphertext and applies a lightweight homomorphic subtraction, keeping the cost well under 50 microseconds. Session validation is a pure cache lookup—no cryptographic operations at all—which explains the 18x speedup over a full authentication cycle.
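The "lightweight homomorphic subtraction" has a simple shape: BFV ciphertext subtraction is coefficient-wise mod Q, with no NTTs involved. A minimal sketch, modeling ciphertexts as lists of coefficient vectors (the modulus here is an illustrative stand-in, not the real 56-bit Q):

```python
Q = 2**56 - 5  # stand-in ciphertext modulus, for illustration only

def ct_sub(ct_a, ct_b):
    # BFV ciphertexts are pairs of polynomials; subtraction is
    # coefficient-wise mod Q. No transforms, no relinearization --
    # which is why a 5% delta update is far cheaper than a fresh auth.
    return [[(x - y) % Q for x, y in zip(pa, pb)] for pa, pb in zip(ct_a, ct_b)]

prev = [[5, 7, 9], [2, 4, 6]]   # cached ciphertext from the original auth
delta = [[1, 1, 1], [0, 1, 2]]  # encrypted biometric delta
print(ct_sub(prev, delta))      # [[4, 6, 8], [2, 3, 4]]
```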
Proof Operations
ZK proof performance across generation, verification, and caching:
| Operation | Latency | Notes |
|---|---|---|
| Proof Generation | 1.28ms | Dilithium3 signatures |
| Proof Verification (cold) | 2.14ms | First verification |
| Proof Verification (cached) | 32µs | 67x speedup |
| Biometric ZK Proof | 260µs | FHE + ZK combined |
The 67x cache speedup is achieved via an in-process DashMap that replaces our previous TCP-based cache proxy. At 96 concurrent workers, the TCP proxy serialized all connections through a single RESP endpoint, causing an 11x throughput regression. The in-process DashMap delivers 0.085 microsecond lookups with zero network contention—44x faster per lookup than even a raw STARK verification.
Batch Processing
Batch operations demonstrate sub-linear scaling for high-throughput scenarios. The key optimization here is batch attestation: instead of signing and verifying each user individually, H33 computes a single Dilithium sign+verify pair for the entire batch of 32 users. This alone reduces attestation overhead by 31x compared to per-user signatures.
| Batch Size | Total Latency | Per-User |
|---|---|---|
| 10 users | 12µs | 1.2µs |
| 100 users | 45µs | 0.45µs |
| 1,000 users | 116µs | 0.116µs |
| 10,000 users | 890µs | 0.089µs |
At 1,000 users in 116µs, a single node sustains roughly 8.6 million authentications per second. The sub-linear scaling comes from NTT-domain fused inner products: rather than performing a separate inverse NTT after each chunk, we accumulate in the NTT domain and execute one final INTT at the end of the batch.
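The batch-attestation idea above can be sketched as follows. `fake_sign` is a hypothetical stand-in (the Python standard library has no Dilithium); only the structure matters: one digest and one signature per 32-user batch instead of 32 of each.

```python
import hashlib
import hmac

KEY = b"demo-attestation-key"  # stand-in; the real system uses Dilithium keys

def fake_sign(msg: bytes) -> bytes:
    # HMAC-SHA3 stands in for a Dilithium signature here. This is a shape
    # sketch, NOT a post-quantum construction.
    return hmac.new(KEY, msg, hashlib.sha3_256).digest()

def attest_batch(results: list[bytes]) -> bytes:
    # One SHA3 digest over all batch results, then ONE signature --
    # instead of signing each user's result separately.
    digest = hashlib.sha3_256(b"".join(results)).digest()
    return fake_sign(digest)

batch = [f"user-{i}:ok".encode() for i in range(32)]
sig = attest_batch(batch)
assert len(sig) == 32  # one 32-byte tag covers the whole batch
```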
Batch ZKP
| Batch Size | Total Latency | vs Sequential |
|---|---|---|
| 10 proofs | 4.2ms | 64% faster |
| 100 proofs | 35ms | 73% faster |
| 1,000 proofs | 310ms | 77% faster |
FHE Operations
Fully Homomorphic Encryption performance for biometric matching:
| Operation | Latency |
|---|---|
| Template Encryption | 85µs |
| Encrypted Matching | 260µs |
| Result Decryption | 45µs |
| End-to-End FHE Auth | 260µs |
Template encryption uses parallel NTT across all moduli via Rayon, with batch CBD sampling (one RNG call per 10 coefficients) cutting noise generation time by 5x. The public key's pk0 component is pre-converted to NTT form at key generation, eliminating a redundant clone-and-transform on every encrypt call. Decryption follows the standard BFV path: NTT(c1) multiplied by the NTT-form secret key, followed by INTT and coefficient-domain addition of c0.
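Batched CBD sampling can be sketched as follows. The CBD parameter, word packing, and bits-per-coefficient here are illustrative assumptions (the source states only "one RNG call per 10 coefficients"); the idea is to draw one block of randomness and slice bits for many coefficients rather than calling the RNG per coefficient:

```python
import secrets

ETA = 2  # illustrative CBD parameter; each coefficient consumes 2*ETA bits

def cbd_batch(n_coeffs: int) -> list[int]:
    # One bulk draw of random bytes, sliced into per-coefficient bit groups.
    nbits = n_coeffs * 2 * ETA
    word = int.from_bytes(secrets.token_bytes((nbits + 7) // 8), "little")
    out = []
    for _ in range(n_coeffs):
        # Centered binomial: (sum of ETA bits) - (sum of ETA bits),
        # giving values in [-ETA, ETA] centered at zero.
        a = sum((word >> i) & 1 for i in range(ETA))
        b = sum((word >> (ETA + i)) & 1 for i in range(ETA))
        word >>= 2 * ETA
        out.append(a - b)
    return out

noise = cbd_batch(4096)
assert all(-ETA <= c <= ETA for c in noise)
```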
The multiply_plain_ntt() optimization skips 2xM inverse NTT transforms per call (where M is the number of CRT moduli). Since ciphertexts remain in NTT form, partial_decrypt_crt skips the forward NTT on c1, and combine_crt applies the INTT to c0 only once at the very end. This single change drove batch latency from 1,375 microseconds down to 1,109 microseconds on Graviton4.
Memory and CPU Utilization
Resource consumption under sustained load:
- Peak memory: 2.8GB for 10,000 concurrent sessions
- CPU utilization: 65% at 1M auth/sec sustained
- Cache hit rate: 94% for returning users
- GC pause: n/a (Rust core, no garbage collector)
The low memory footprint is a direct consequence of SIMD batching. Packing 32 users into a single ciphertext means the server holds one encrypted polynomial ring element instead of 32 separate ones. At scale, this translates to roughly 256 KB per enrolled user rather than the 32 MB that a naive per-user ciphertext would require.
Methodology
All benchmarks follow these principles:
- Warm cache: Tests run after cache warmup unless measuring cold performance
- P50 latency: Reported numbers are median (50th percentile)
- Production workloads: Test data represents real authentication patterns
- Isolated measurement: Benchmarks run in isolation to avoid interference
- Repeated trials: Each measurement is the median of 1,000+ runs
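The reported statistic is reproducible with nothing beyond the standard library; a minimal sketch, using simulated latency samples rather than real measurements:

```python
import random
import statistics

random.seed(0)
# Simulated latency samples in microseconds (illustrative, not measured data).
samples = [random.gauss(42, 3) for _ in range(1000)]
p50 = statistics.median(samples)  # the P50 figure reported in the tables
assert 40 < p50 < 44
```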
Reproducing These Results
You can reproduce these benchmarks with your own H33 API key:
```shell
npm install @h33/benchmark-suite
h33-benchmark --api-key YOUR_KEY --suite full
```
The benchmark suite includes all tests documented here and outputs comparable metrics for your infrastructure. For Graviton4-specific profiling, pass --platform graviton4 to enable ARM-native NTT paths and NEON-accelerated Galois operations.
Run Your Own Benchmarks
Get an API key and see these performance numbers on your infrastructure.
Get Free API Key