Our January 2026 benchmark suite represents the most comprehensive performance analysis we've published (see the live benchmarks page for the latest numbers). Testing was conducted on AWS c8g.metal-48xl instances with AWS Graviton4 (Neoverse V2) processors, measuring production-representative workloads across all authentication modes. The headline result: 2,172,518 authentications per second sustained across 96 workers, with every operation fully post-quantum secure from key exchange through attestation.
Test Infrastructure
Benchmark Environment
Instance: AWS c8g.metal-48xl
CPU: AWS Graviton4 (Neoverse V2, 96 cores)
Memory: 377 GiB DDR5
OS: Amazon Linux 2023
H33 Version: 2.4.0
All benchmarks were run with warm caches where applicable, representing typical production conditions. Cold-start measurements are noted separately. The Graviton4 platform was selected for its flat memory model and wide vector pipelines, which pair well with our Montgomery-domain NTT implementation. We use the system allocator rather than jemalloc—on this architecture, glibc malloc is heavily optimized for ARM's memory hierarchy, and jemalloc's arena bookkeeping introduces measurable overhead under tight 96-worker FHE loops.
Full Stack Authentication
Full Stack Auth combines biometric verification, FHE-encrypted matching, ZK proof generation, and Dilithium attestation into a single API call. Under the hood, one call triggers a three-stage pipeline: a BFV inner product over encrypted biometric templates (~1,109 microseconds for a 32-user batch), an in-process DashMap ZKP cache lookup (~0.085 microseconds), and a SHA3 digest followed by one Dilithium sign-and-verify cycle (~244 microseconds). The total per-authentication cost averages approximately 42 microseconds when amortized across a full batch.
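The ~42 µs figure follows from the stage timings above. A quick back-of-the-envelope check (numbers taken from this section, not re-measured; treating the FHE and attestation stages as per-batch costs and the cache hit as per-user is our reading of the pipeline):

```python
# Per-batch stage costs from the pipeline description (microseconds).
BATCH_SIZE = 32          # users packed per ciphertext
fhe_inner_product = 1109 # BFV inner product over the 32-user batch
zkp_cache_lookup = 0.085 # in-process DashMap hit, per authentication
sign_and_verify = 244    # one Dilithium sign+verify cycle for the batch

# FHE and attestation are paid once per batch; the cache hit is per user.
per_user = (fhe_inner_product + sign_and_verify) / BATCH_SIZE + zkp_cache_lookup
print(f"{per_user:.1f} µs per authentication")  # ≈ 42.4 µs
```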
| Mode | Latency | Description |
|---|---|---|
| Turbo | 1.28ms | Optimized for speed, full security |
| Standard | 633µs | Balanced performance and features |
| Precision | 2.1ms | Maximum accuracy, extended checks |
Every mode is fully post-quantum secure. The latency difference between Turbo and Precision comes from the number of NTT polynomial multiplications and the depth of the biometric comparison circuit—not from any reduction in cryptographic strength.
The BFV Pipeline: Why N=4096 Matters
H33 uses the BFV (Brakerski/Fan-Vercauteren) fully homomorphic encryption scheme with a polynomial degree of N=4096, a single 56-bit modulus Q, and a plaintext modulus t=65537. This parameter set was deliberately chosen for authentication workloads: it satisfies the CRT batching condition (t is congruent to 1 mod 2N), which enables SIMD-style packing of 4,096 plaintext slots. Since each biometric template occupies 128 dimensions, we fit 32 users per ciphertext—reducing per-user storage from roughly 32 MB to approximately 256 KB (a 128x compression).
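The slot arithmetic is easy to sanity-check. This sketch verifies the batching condition and the users-per-ciphertext figure using only the parameters quoted above:

```python
N = 4096            # polynomial degree
t = 65537           # plaintext modulus (the Fermat prime F4)
TEMPLATE_DIM = 128  # dimensions per biometric template

# CRT batching requires t ≡ 1 (mod 2N) so that X^N + 1 splits into N
# linear factors mod t, yielding N independent plaintext slots.
assert t % (2 * N) == 1

users_per_ciphertext = N // TEMPLATE_DIM
print(users_per_ciphertext)  # 32
```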
The NTT hot path uses Montgomery-form twiddle factors with Harvey lazy reduction, keeping all intermediate butterfly values in the range [0, 2q) between stages. This eliminates division from the inner loop entirely. Enrolled templates are stored in NTT form at enrollment time so that multiply_plain_accumulate never needs a forward NTT during verification. The result: FHE batch processing for 32 users completes in roughly 1,109 microseconds, down 19.3% from the previous baseline of 1,375 microseconds.
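Why storing templates in NTT form pays off: in the transform domain, polynomial multiplication collapses to a coefficient-wise product, so a forward NTT paid once at enrollment is amortized over every later verification. The sketch below uses toy parameters (N=8, q=257) and a naive O(N²) transform rather than the production Montgomery/Harvey butterfly path; only the structure carries over:

```python
# Toy negacyclic NTT: N=8, q=257 (q ≡ 1 mod 2N); psi is a primitive 2N-th
# root of unity mod q. Illustrative parameters, not the production set.
N, q = 8, 257
psi = 249                # psi^16 ≡ 1 and psi^8 ≡ -1 (mod 257)
omega = psi * psi % q    # primitive N-th root of unity

def ntt(a):
    # Naive O(N^2) forward transform; the psi^i twist folds the reduction
    # mod X^N + 1 (negacyclic wraparound) into the transform itself.
    return [sum(a[i] * pow(psi, i, q) * pow(omega, i * j, q) for i in range(N)) % q
            for j in range(N)]

def intt(A):
    n_inv = pow(N, -1, q)
    a = [sum(A[j] * pow(omega, -i * j, q) for j in range(N)) % q for i in range(N)]
    return [x * n_inv * pow(psi, -i, q) % q for i, x in enumerate(a)]

def negacyclic_mul(a, b):
    # Schoolbook product reduced mod X^N + 1, used here as a cross-check.
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k, sign = (i + j) % N, -1 if i + j >= N else 1
            c[k] = (c[k] + sign * a[i] * b[j]) % q
    return c

template = [3, 1, 4, 1, 5, 9, 2, 6]   # "enrolled template"
probe = [2, 7, 1, 8, 2, 8, 1, 8]
stored = ntt(template)                 # forward NTT paid once, at enrollment
pointwise = [x * y % q for x, y in zip(stored, ntt(probe))]
assert intt(pointwise) == negacyclic_mul(template, probe)
```

At verification time only the probe needs a forward transform; the enrolled side is a pure pointwise multiply, which is the shape multiply_plain_accumulate exploits.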
Session Management
Session operations show the benefit of our context caching architecture:
| Operation | Latency | Speedup |
|---|---|---|
| Session Resume | 42µs | 4.4x vs Full Auth |
| Incremental Auth (5% delta) | <50µs | 4.4x vs Full Auth |
| Session Validation | 12µs | 18x vs Full Auth |
Session resume avoids re-running the full FHE pipeline by retaining the ZKP cache entry from the original authentication. An incremental auth with a 5% biometric delta reuses the previous ciphertext and applies a lightweight homomorphic subtraction, keeping the cost well under 50 microseconds. Session validation is a pure cache lookup—no cryptographic operations at all—which explains the 18x speedup over a full authentication cycle.
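The "lightweight homomorphic subtraction" has a simple shape: BFV ciphertext subtraction is coefficient-wise mod Q, with no NTTs involved. A minimal sketch, modeling ciphertexts as lists of coefficient vectors (the modulus here is an illustrative stand-in, not the real 56-bit Q):

```python
Q = 2**56 - 5  # stand-in ciphertext modulus, for illustration only

def ct_sub(ct_a, ct_b):
    # BFV ciphertexts are pairs of polynomials; subtraction is
    # coefficient-wise mod Q. No transforms, no relinearization --
    # which is why a 5% delta update is far cheaper than a fresh auth.
    return [[(x - y) % Q for x, y in zip(pa, pb)] for pa, pb in zip(ct_a, ct_b)]

prev = [[5, 7, 9], [2, 4, 6]]   # cached ciphertext from the original auth
delta = [[1, 1, 1], [0, 1, 2]]  # encrypted biometric delta
print(ct_sub(prev, delta))      # [[4, 6, 8], [2, 3, 4]]
```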
Proof Operations
ZK proof performance across generation, verification, and caching:
| Operation | Latency | Notes |
|---|---|---|
| Proof Generation | 1.28ms | Dilithium3 signatures |
| Proof Verification (cold) | 2.14ms | First verification |
| Proof Verification (cached) | 32µs | 67x speedup |
| Biometric ZK Proof | 260µs | FHE + ZK combined |
The 67x cache speedup is achieved via an in-process DashMap that replaces our previous TCP-based cache proxy. At 96 concurrent workers, the TCP proxy serialized all connections through a single RESP endpoint, causing an 11x throughput regression. The in-process DashMap delivers 0.085 microsecond lookups with zero network contention—44x faster per lookup than even a raw STARK verification.
Batch Processing
Batch operations demonstrate sub-linear scaling for high-throughput scenarios. The key optimization here is batch attestation: instead of signing and verifying each user individually, H33 computes a single Dilithium sign+verify pair for the entire batch of 32 users. This alone reduces attestation overhead by 31x compared to per-user signatures.
| Batch Size | Total Latency | Per-User |
|---|---|---|
| 10 users | 12µs | 1.2µs |
| 100 users | 45µs | 0.45µs |
| 1,000 users | 116µs | 0.116µs |
| 10,000 users | 890µs | 0.089µs |
At 1,000 users in 116µs, a single node sustains roughly 8.6 million authentications per second. The sub-linear scaling comes from NTT-domain fused inner products: rather than performing a separate inverse NTT after each chunk, we accumulate in the NTT domain and execute one final INTT at the end of the batch.
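The batch-attestation idea above can be sketched as follows. `fake_sign` is a hypothetical stand-in (the Python standard library has no Dilithium); only the structure matters: one digest and one signature per 32-user batch instead of 32 of each.

```python
import hashlib
import hmac

KEY = b"demo-attestation-key"  # stand-in; the real system uses Dilithium keys

def fake_sign(msg: bytes) -> bytes:
    # HMAC-SHA3 stands in for a Dilithium signature here. This is a shape
    # sketch, NOT a post-quantum construction.
    return hmac.new(KEY, msg, hashlib.sha3_256).digest()

def attest_batch(results: list[bytes]) -> bytes:
    # One SHA3 digest over all batch results, then ONE signature --
    # instead of signing each user's result separately.
    digest = hashlib.sha3_256(b"".join(results)).digest()
    return fake_sign(digest)

batch = [f"user-{i}:ok".encode() for i in range(32)]
sig = attest_batch(batch)
assert len(sig) == 32  # one 32-byte tag covers the whole batch
```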
Batch ZKP
| Batch Size | Total Latency | vs Sequential |
|---|---|---|
| 10 proofs | 4.2ms | 64% faster |
| 100 proofs | 35ms | 73% faster |
| 1,000 proofs | 310ms | 77% faster |
FHE Operations
Fully Homomorphic Encryption performance for biometric matching:
| Operation | Latency |
|---|---|
| Template Encryption | 85µs |
| Encrypted Matching | 260µs |
| Result Decryption | 45µs |
| End-to-End FHE Auth | 260µs |
Template encryption uses parallel NTT across all moduli via Rayon, with batch CBD sampling (one RNG call per 10 coefficients) cutting noise generation time by 5x. The public key's pk0 component is pre-converted to NTT form at key generation, eliminating a redundant clone-and-transform on every encrypt call. Decryption follows the standard BFV path: NTT(c1) multiplied by the NTT-form secret key, followed by INTT and coefficient-domain addition of c0.
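Batched CBD sampling can be sketched as follows. The CBD parameter, word packing, and bits-per-coefficient here are illustrative assumptions (the source states only "one RNG call per 10 coefficients"); the idea is to draw one block of randomness and slice bits for many coefficients rather than calling the RNG per coefficient:

```python
import secrets

ETA = 2  # illustrative CBD parameter; each coefficient consumes 2*ETA bits

def cbd_batch(n_coeffs: int) -> list[int]:
    # One bulk draw of random bytes, sliced into per-coefficient bit groups.
    nbits = n_coeffs * 2 * ETA
    word = int.from_bytes(secrets.token_bytes((nbits + 7) // 8), "little")
    out = []
    for _ in range(n_coeffs):
        # Centered binomial: (sum of ETA bits) - (sum of ETA bits),
        # giving values in [-ETA, ETA] centered at zero.
        a = sum((word >> i) & 1 for i in range(ETA))
        b = sum((word >> (ETA + i)) & 1 for i in range(ETA))
        word >>= 2 * ETA
        out.append(a - b)
    return out

noise = cbd_batch(4096)
assert all(-ETA <= c <= ETA for c in noise)
```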
The multiply_plain_ntt() optimization skips 2xM inverse NTT transforms per call (where M is the number of CRT moduli). Since ciphertexts remain in NTT form, partial_decrypt_crt skips the forward NTT on c1, and combine_crt applies the INTT to c0 only once at the very end. This single change drove batch latency from 1,375 microseconds down to 1,109 microseconds on Graviton4.
Memory and CPU Utilization
Resource consumption under sustained load:
- Peak memory: 2.8GB for 10,000 concurrent sessions
- CPU utilization: 65% at 1M auth/sec sustained
- Cache hit rate: 94% for returning users
- GC pause: n/a (Rust core, no garbage collector)
The low memory footprint is a direct consequence of SIMD batching. Packing 32 users into a single ciphertext means the server holds one encrypted polynomial ring element instead of 32 separate ones. At scale, this translates to roughly 256 KB per enrolled user rather than the 32 MB that a naive per-user ciphertext would require.
Methodology
All benchmarks follow these principles:
- Warm cache: Tests run after cache warmup unless measuring cold performance
- P50 latency: Reported numbers are median (50th percentile)
- Production workloads: Test data represents real authentication patterns
- Isolated measurement: Benchmarks run in isolation to avoid interference
- Repeated trials: Each measurement is the median of 1,000+ runs
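The reported statistic is reproducible with nothing beyond the standard library; a minimal sketch, using simulated latency samples rather than real measurements:

```python
import random
import statistics

random.seed(0)
# Simulated latency samples in microseconds (illustrative, not measured data).
samples = [random.gauss(42, 3) for _ in range(1000)]
p50 = statistics.median(samples)  # the P50 figure reported in the tables
assert 40 < p50 < 44
```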
Reproducing These Results
You can reproduce these benchmarks with your own H33 API key:
```shell
npm install @h33/benchmark-suite
h33-benchmark --api-key YOUR_KEY --suite full
```
The benchmark suite includes all tests documented here and outputs comparable metrics for your infrastructure. For Graviton4-specific profiling, pass --platform graviton4 to enable ARM-native NTT paths and NEON-accelerated Galois operations.
Run Your Own Benchmarks
Get an API key and see these performance numbers on your infrastructure.
Get Free API Key