When your authentication system needs to handle billions of users, every microsecond of per-user overhead adds up. H33's batch architecture achieves sub-microsecond per-user costs by packing 32 biometric templates into a single BFV ciphertext and amortizing every expensive cryptographic operation across the batch. The result: 2,172,518 authentications per second on a single node, with each auth completing in approximately 42 microseconds—fully post-quantum secure.
The Scale Math
- 1,000 users authenticated in 116µs = 0.116µs per user
- At this efficiency: 8.6 million auth/second per node
- 10-node cluster: ~86 million auth/second at linear scaling
This is how you authenticate Earth's population.
The Batch Efficiency Curve
H33's batch processing exhibits sub-linear cost scaling: adding more users to a batch increases total time but decreases per-user cost.
This curve is not accidental. It is a direct consequence of the BFV fully homomorphic encryption scheme H33 uses. With a polynomial degree N=4096 and plaintext modulus t=65537, the CRT batching condition (t ≡ 1 mod 2N) is satisfied, which means the plaintext space decomposes into 4,096 independent SIMD slots. Each biometric template occupies 128 dimensions, so exactly 32 users fit into one ciphertext (4096 ÷ 128 = 32). Every FHE operation—encrypt, inner product, decrypt—processes all 32 users in lockstep with zero marginal cost per additional user within the batch.
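The slot arithmetic above is easy to check directly. A minimal sketch using the parameters quoted in the text (variable names are illustrative, not the H33 API):

```javascript
// BFV batching arithmetic with the parameters stated above:
// N = 4096, t = 65537, 128-dimensional biometric templates.
const N = 4096;   // polynomial degree
const t = 65537;  // plaintext modulus (prime)

// CRT batching requires t ≡ 1 (mod 2N), so that x^N + 1 splits into
// N linear factors mod t, yielding N independent SIMD slots.
const batchingOk = t % (2 * N) === 1;

const slots = N;                                  // 4096 SIMD slots
const templateDim = 128;                          // dimensions per template
const usersPerCiphertext = slots / templateDim;   // 32 users per ciphertext

console.log({ batchingOk, usersPerCiphertext });
```

Any `(N, t)` pair that violates the congruence loses full slot decomposition, which is why these two parameters are chosen together.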
Why Batching Works
Authentication operations share significant common computation:
- Key loading: Cryptographic keys loaded once, used for entire batch
- Circuit initialization: ZK circuit setup amortized across all proofs
- Memory allocation: Single allocation serves entire batch
- SIMD parallelism: Vector instructions process multiple users simultaneously
Sequential processing pays these costs for every user. Batching pays them once.
The BFV inner product for a 32-user batch completes in approximately 1,109 microseconds. But the per-auth cost is only ~42µs because ZKP verification (0.085µs via in-process DashMap lookup) and Dilithium attestation (~244µs for one sign+verify per batch) are fully amortized. Signing once per batch instead of 32 times eliminates 31 of every 32 attestation operations.
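The amortization arithmetic, spelled out with the figures quoted above (all values in microseconds, for one 32-user batch):

```javascript
// Amortized cost model for one 32-user batch, using the stage latencies
// quoted in the text. Illustrative arithmetic only.
const BATCH = 32;
const fheInnerProductUs = 1109;  // Stage 1: BFV inner product, whole batch
const zkpLookupsUs = 2.7;        // Stage 2: 32 DashMap lookups
const attestationUs = 244;       // Stage 3: one Dilithium sign+verify

const batchTotalUs = fheInnerProductUs + zkpLookupsUs + attestationUs; // ~1356
const perAuthUs = batchTotalUs / BATCH;              // ~42.4µs per auth
const attestationPerAuthUs = attestationUs / BATCH;  // ~7.6µs, not 244µs

console.log({ batchTotalUs, perAuthUs, attestationPerAuthUs });
```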
Inside the Pipeline: Three Stages, One API Call
Every H33 authentication request passes through three cryptographic stages within a single API call. Understanding how each stage batches is critical to understanding the throughput numbers.
| Stage | Operation | Batch Latency (32 users) | Per-Auth Cost | PQ-Secure |
|---|---|---|---|---|
| 1. FHE Batch | BFV inner product (SIMD slots) | ~1,109 µs | ~34.7 µs | Yes (lattice) |
| 2. ZKP Verify | In-process DashMap lookup | ~2.7 µs (32 lookups) | ~0.085 µs | Yes (SHA3-256) |
| 3. Attestation | SHA3 digest + Dilithium sign+verify | ~244 µs (1 per batch) | ~7.6 µs | Yes (ML-DSA) |
| Total | — | ~1,356 µs | ~42 µs | — |
Stage 1 dominates wall-clock time, but it is also where SIMD batching provides the largest win. The BFV encrypt operation runs a forward NTT on each polynomial modulus in parallel via Rayon, and the inner product is computed entirely in the NTT domain—one final inverse NTT instead of per-chunk transforms. Montgomery multiplication with Harvey lazy reduction keeps butterfly values in [0, 2q) between NTT stages, eliminating all division from the hot path.
Stage 2 was originally a full STARK proof verification at ~3.7µs per user. Moving to an in-process DashMap cache dropped this to 0.085µs—a 44x improvement—and eliminated TCP serialization that had throttled the system to 136K auth/sec when routing through a RESP proxy at 96 workers.
Stage 3 uses a single CRYSTALS-Dilithium (ML-DSA-65) signature over the SHA3-256 digest of the entire batch result. One sign and one verify per 32 users, not 32 of each. Compared to individual signing, this single optimization eliminates 31 of every 32 attestation operations.
Architecture Overview
┌─────────────────────────────────────────────────────┐
│ Request Collector │
│ Accumulates requests until batch threshold/timeout │
└──────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Batch Scheduler │
│ Groups requests by type, allocates to processors │
└──────────────────────┬──────────────────────────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Processor │ │ Processor │ │ Processor │
│ (Core 1) │ │ (Core 2) │ │ (Core N) │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
└─────────────┼─────────────┘
▼
┌─────────────────────────────────────────────────────┐
│ Result Aggregator │
│ Collects results, dispatches to waiting requests │
└─────────────────────────────────────────────────────┘
On the production Graviton4 c8g.metal-48xl instance (192 vCPUs), 96 Rayon workers run in parallel, each processing its own 32-user batch. The batch scheduler fills workers using a work-stealing queue, which keeps all cores saturated even when request arrival is bursty.
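The throughput number can be sanity-checked against the stage latencies: 96 workers, each clearing a 32-user batch in ~1,356µs, set a theoretical ceiling.

```javascript
// Back-of-envelope ceiling for the 96-worker configuration described above.
const workers = 96;
const batchSize = 32;
const batchLatencyUs = 1356; // total three-stage latency per batch

const batchesPerSecPerWorker = 1e6 / batchLatencyUs;      // ~737
const theoreticalAuthPerSec =
  workers * batchSize * batchesPerSecPerWorker;           // ~2.27M

console.log({ theoreticalAuthPerSec });
```

The measured 2.17M auth/sec lands at roughly 96% of this ceiling, with the remainder presumably lost to scheduling and queueing.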
Batch Collection Strategies
H33 supports multiple batching strategies:
Time-windowed: Collect requests for up to N milliseconds, then process.
```javascript
const batcher = h33.createBatcher({
  maxWaitMs: 10,       // Max 10ms collection window
  maxBatchSize: 1000   // Or until 1000 requests
});
```

Size-triggered: Process immediately when batch reaches target size.

```javascript
const batcher = h33.createBatcher({
  maxBatchSize: 100,   // Process at 100 requests
  maxWaitMs: 50        // Or 50ms, whichever first
});
```

Adaptive: Adjust batch parameters based on load.

```javascript
const batcher = h33.createBatcher({
  mode: 'adaptive',
  targetLatency: 5     // Target 5ms response time
});
```

Handling Heterogeneous Requests
Not all authentication requests are identical. H33 batches intelligently:
- Same-type batching: Group full auth with full auth, session resume with session resume
- Priority lanes: High-priority requests get dedicated batch slots
- Mixed batches: When necessary, different types share computation where possible
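The collection strategies above share one shape: accumulate callers, trigger on size or timeout, and settle every caller when the batch completes. A minimal, hypothetical sketch of that logic — illustrative, not the H33 implementation:

```javascript
// Size-or-timeout batcher: requests accumulate until maxBatchSize or
// maxWaitMs, whichever fires first; all callers in a flush settle together.
function createBatcher({ maxBatchSize, maxWaitMs, processBatch }) {
  let pending = [];
  let timer = null;

  function flush() {
    if (timer) { clearTimeout(timer); timer = null; }
    const batch = pending;
    pending = [];
    if (batch.length === 0) return;
    processBatch(batch.map((p) => p.request)).then(
      (results) => batch.forEach((p, i) => p.resolve(results[i])),
      (err) => batch.forEach((p) => p.reject(err)) // see Failure Isolation
    );
  }

  return {
    submit(request) {
      return new Promise((resolve, reject) => {
        pending.push({ request, resolve, reject });
        if (pending.length >= maxBatchSize) flush();           // size trigger
        else if (!timer) timer = setTimeout(flush, maxWaitMs); // time window
      });
    },
  };
}
```

Each caller simply awaits `submit(request)`; batching stays invisible at the API surface, which is the property the single-API-call pipeline depends on.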
NTT Domain Optimization: Where the Microseconds Went
The single largest performance win in the batch pipeline is keeping data in the Number Theoretic Transform (NTT) domain as long as possible. A naive implementation would transform into NTT form for each polynomial multiplication and immediately transform back. H33 avoids this entirely.
Enrolled biometric templates are stored pre-transformed in NTT form. The public key component pk0 is converted to NTT form at keygen time and cached. The multiply_plain_ntt() function returns ciphertexts that remain in NTT form, skipping two inverse NTTs per modulus. The inner product across all 128 dimensions accumulates in the NTT domain, and only one final INTT is performed at the end. For a system with M moduli, this eliminates 2×M×127 redundant transforms per batch—the difference between a 1,375µs batch and a 1,109µs batch (a 19.3% improvement).
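The transform-count savings can be tallied from the text. The RNS modulus count M is not stated, so the value below is an assumption used only to make the formula concrete:

```javascript
// Transforms eliminated by staying in the NTT domain, per the text's
// 2 × M × 127 formula (one forward/inverse pair per modulus, for 127 of
// the 128 dimension chunks). M = 3 is an ASSUMED modulus count.
const M = 3;                      // assumed number of RNS moduli
const eliminated = 2 * M * 127;   // redundant transforms removed per batch

// Measured effect quoted in the text:
const naiveBatchUs = 1375;
const nttBatchUs = 1109;
const improvementPct = (1 - nttBatchUs / naiveBatchUs) * 100; // ~19.3%

console.log({ eliminated, improvementPct });
```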
Failure Isolation
One failing request must not sink the batch:
- Individual error handling: Each request in batch gets its own success/failure status
- Partial results: Batch returns all completed results even if some fail
- Circuit breakers: Persistent failures trigger fallback to sequential processing
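One way to sketch the per-request isolation above is with `Promise.allSettled`, which lets every authentication in a batch settle independently. Illustrative shape only, not the H33 API:

```javascript
// Each request gets its own fulfilled/rejected outcome; one bad template
// cannot sink the batch, and all completed results are still returned.
async function processBatchIsolated(requests, authenticate) {
  const settled = await Promise.allSettled(requests.map(authenticate));
  return settled.map((s, i) =>
    s.status === "fulfilled"
      ? { index: i, ok: true, result: s.value }
      : { index: i, ok: false, error: String(s.reason) });
}
```

A circuit breaker then only needs to watch the ratio of `ok: false` entries per batch to decide when to fall back to sequential processing.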
Multi-Node Scaling
A single node already delivers high-throughput auth on its own. Scaling horizontally adds capacity through:
- Stateless design: Any node can handle any request
- Shared session store: Redis cluster for cross-node session state (the ZKP proof cache stays in-process, per Stage 2)
- Load balancing: Consistent hashing for cache affinity
- Auto-scaling: Scale based on batch queue depth
With 10 nodes: ~21.7M auth/sec. With 100 nodes: ~217M auth/sec. The architecture scales linearly because there is no shared mutable state between batch processors—each node holds its own copy of the public parameters, and the DashMap ZKP cache is per-process.
A single c8g.metal-48xl node sustains 2.17M auth/sec (see full benchmarks). At linear scaling, a 10-node cluster handles over 21 million authentications per second—enough to authenticate every human on Earth in under seven minutes, with full FHE encryption, ZKP verification, and post-quantum Dilithium attestation on every single request.
Real-World Deployment
For a social platform with 1 billion daily active users:
- Peak load: ~50M authentications/minute (morning login wave)
- Required capacity: ~830K auth/second
- H33 nodes needed: 1 (with 10x headroom)
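The sizing arithmetic above, made explicit. The "10x headroom" figure corresponds to the 8.6M auth/sec theoretical peak from The Scale Math; against the sustained benchmark rate the margin is smaller but still comfortable:

```javascript
// Capacity sizing for the billion-user example, using figures from the text.
const peakAuthPerMinute = 50e6;                 // morning login wave
const requiredPerSec = peakAuthPerMinute / 60;  // ~833K auth/sec

const sustainedPerNode = 2.17e6;                // measured, one c8g.metal-48xl
const theoreticalPerNode = 8.6e6;               // per-user-cost ceiling

const nodesNeeded = Math.ceil(requiredPerSec / sustainedPerNode);   // 1
const sustainedHeadroom = sustainedPerNode / requiredPerSec;        // ~2.6x
const theoreticalHeadroom = theoreticalPerNode / requiredPerSec;    // ~10.3x

console.log({ nodesNeeded, sustainedHeadroom, theoreticalHeadroom });
```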
The math works because batch efficiency transforms impossible scale into trivial infrastructure. A single Graviton4 metal instance at approximately $2/hour spot pricing delivers more than 2.5x the required peak capacity at its sustained benchmark rate. The entire authentication infrastructure for a billion-user platform costs less than a cup of coffee per hour—and every authentication is encrypted end-to-end with lattice-based FHE, verified with zero-knowledge proofs, and attested with post-quantum signatures.