Single-modality biometric systems are inherently fragile. A high-resolution photograph can fool a face scanner. A gelatin mold can replicate a fingerprint. A recorded audio clip can bypass a voice gate. Each modality, taken alone, carries a measurable false-accept rate (FAR) that attackers can target with commodity hardware. Multimodal biometric fusion solves this by combining independent biometric signals into a single authentication decision, making spoofing exponentially harder because an attacker must defeat every modality simultaneously.
At H33, we push this concept further. Every biometric template is encrypted under BFV fully homomorphic encryption before it ever leaves the client device, and the entire fusion pipeline executes on ciphertext. This is particularly critical in healthcare settings where patient biometric data carries heightened regulatory obligations. The server never sees raw biometric data. The result: 1.595 million authentications per second on production hardware, with each individual auth completing in approximately 42 microseconds -- all while remaining fully post-quantum secure.
The Three Levels of Biometric Fusion
Fusion can happen at different stages in the biometric pipeline. The level you choose determines the balance between accuracy, latency, and architectural complexity.
1. Feature-Level Fusion
Feature-level fusion concatenates or interleaves raw feature vectors from multiple modalities before any matching takes place. A 128-dimensional face embedding and a 96-dimensional voiceprint become a single 224-dimensional vector that is matched against a fused enrollment template. This approach preserves the most discriminative information, but it requires that all modalities share a compatible feature space and that extraction happens synchronously.
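The concatenation step can be sketched in a few lines. This is a minimal illustration using the dimensions from the text (128-d face, 96-d voice); the per-modality L2 normalization is a common convention to keep one modality from dominating the fused vector, not something the text prescribes.

```rust
// Feature-level fusion: L2-normalize each modality, then concatenate.
// Dimensions (128-d face, 96-d voice) follow the example in the text.
fn l2_normalize(v: &[f64]) -> Vec<f64> {
    let norm = v.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm == 0.0 { return v.to_vec(); }
    v.iter().map(|x| x / norm).collect()
}

fn fuse_features(face: &[f64], voice: &[f64]) -> Vec<f64> {
    let mut fused = l2_normalize(face);
    fused.extend(l2_normalize(voice));
    fused
}

fn main() {
    let face = vec![0.5; 128];  // stand-in 128-d face embedding
    let voice = vec![0.25; 96]; // stand-in 96-d voiceprint
    let fused = fuse_features(&face, &voice);
    println!("fused dimension: {}", fused.len()); // 224
}
```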
2. Score-Level Fusion
Score-level fusion is the most widely deployed approach in production systems. Each modality produces an independent match score, and a fusion function combines them into a single decision metric. The common strategies are:
- Weighted sum -- Assign a reliability weight to each modality. Face might carry 0.45, fingerprint 0.35, and voice 0.20, with the lowest-EER (most reliable) modality receiving the largest weight.
- Min/max/product rules -- Simple combiners that require no training data. The product rule works well when scores are approximately statistically independent.
- Trained classifiers -- A logistic regression or lightweight neural network learns optimal fusion weights from labeled enrollment data.
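The weighted-sum strategy reduces to a dot product and a threshold check. A minimal sketch, using the illustrative weights above; the scores are assumed normalized to [0, 1], and the 0.70 acceptance threshold is an example value, not an H33 default:

```rust
// Score-level fusion via weighted sum: 0.45 face, 0.35 fingerprint, 0.20 voice.
const W_FACE: f64 = 0.45;
const W_FINGER: f64 = 0.35;
const W_VOICE: f64 = 0.20;

fn fused_score(face: f64, finger: f64, voice: f64) -> f64 {
    W_FACE * face + W_FINGER * finger + W_VOICE * voice
}

fn accept(face: f64, finger: f64, voice: f64, threshold: f64) -> bool {
    fused_score(face, finger, voice) >= threshold
}

fn main() {
    // A strong face match can carry a weaker voice match past the threshold...
    println!("{}", accept(0.95, 0.80, 0.40, 0.70)); // fused score 0.7875
    // ...while uniformly mediocre scores are rejected.
    println!("{}", accept(0.50, 0.50, 0.50, 0.70)); // fused score 0.50
}
```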
3. Decision-Level Fusion
Decision-level fusion is the simplest: each modality independently returns accept or reject, and a voting scheme (majority vote, AND-rule, OR-rule) produces the final verdict. AND-rule maximizes security (all modalities must agree), while OR-rule maximizes convenience (any single match suffices). In practice, a weighted majority vote with modality-specific confidence thresholds offers the best tradeoff.
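The three voting schemes are small enough to write out directly. The AND/OR semantics follow the text; the confidence weights in the weighted-majority example are illustrative values, not H33 defaults:

```rust
// Decision-level fusion: each modality votes accept (true) or reject (false).
fn and_rule(decisions: &[bool]) -> bool { decisions.iter().all(|&d| d) }
fn or_rule(decisions: &[bool]) -> bool { decisions.iter().any(|&d| d) }

// Weighted majority: accept if the accepting modalities carry more than
// half of the total confidence weight.
fn weighted_majority(decisions: &[(bool, f64)]) -> bool {
    let total: f64 = decisions.iter().map(|&(_, w)| w).sum();
    let yes: f64 = decisions.iter().filter(|&&(d, _)| d).map(|&(_, w)| w).sum();
    yes > total / 2.0
}

fn main() {
    let votes = [(true, 0.45), (true, 0.35), (false, 0.20)]; // face, finger, voice
    println!("AND: {}", and_rule(&[true, true, false]));  // false: all must agree
    println!("OR:  {}", or_rule(&[true, true, false]));   // true: any match suffices
    println!("maj: {}", weighted_majority(&votes));       // true: 0.80 > 0.50
}
```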
Practical Modality Combinations
| Combination | Use Case | FAR (Single) | FAR (Fused) | Spoof Difficulty |
|---|---|---|---|---|
| Face + Fingerprint | Physical access control | 0.1% | 0.001% | Requires photo + mold |
| Face + Voice | Remote verification | 0.1% | 0.003% | Requires deepfake + voice clone |
| Face + Behavioral | Continuous auth | 0.5% | 0.01% | Requires sustained impersonation |
| Face + Voice + Fingerprint | High-security facilities | 0.1% | 0.0001% | Near impossible simultaneously |
When modalities are statistically independent, the fused FAR approximates the product of the individual FARs. Face + Fingerprint with individual FARs of 0.1% yields a theoretical fused FAR of 0.0001%. Real-world correlation between modalities raises this slightly, but the improvement remains dramatic -- often two or three orders of magnitude.
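The independence arithmetic is worth checking directly. Two modalities at 0.1% FAR (0.001) multiply out to 1e-6, i.e. 0.0001%; three multiply out to 1e-9:

```rust
// Fused FAR under statistical independence: the product of per-modality FARs.
fn fused_far(fars: &[f64]) -> f64 {
    fars.iter().product()
}

fn main() {
    let face_finger = fused_far(&[0.001, 0.001]);      // 1e-6, i.e. 0.0001%
    let three_way = fused_far(&[0.001, 0.001, 0.001]); // 1e-9
    println!("face + fingerprint: {:e}", face_finger);
    println!("three-modality:     {:e}", three_way);
}
```

Real-world fused FARs (as in the table above) sit above these theoretical floors because modality scores are never perfectly independent.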
Encrypted Fusion with BFV FHE
The critical challenge with multimodal fusion is the expanded attack surface. More biometric templates stored on the server means more data at risk. H33 eliminates this risk entirely by performing all fusion computations on encrypted data using BFV (Brakerski/Fan-Vercauteren) fully homomorphic encryption.
```rust
// Encrypted score-level fusion (simplified).
// All operations happen on BFV ciphertexts -- the server never sees plaintext.
// BFV works over integers, so the fractional weights (0.45 / 0.35 / 0.20) are
// assumed to be encoded as scaled integer plaintexts (e.g. 45 / 35 / 20),
// with the acceptance threshold scaled to match.
let face_score_ct = bfv.encrypt(&face_score_plain, &pk);
let voice_score_ct = bfv.encrypt(&voice_score_plain, &pk);
let finger_score_ct = bfv.encrypt(&finger_score_plain, &pk);

// Weighted sum: 0.45*face + 0.35*finger + 0.20*voice
let fused_ct = bfv.add(
    bfv.add(
        bfv.multiply_plain(&face_score_ct, &weight_face),
        bfv.multiply_plain(&finger_score_ct, &weight_finger),
    ),
    bfv.multiply_plain(&voice_score_ct, &weight_voice),
);

// Result is still encrypted -- only the client can decrypt.
```

H33 uses SIMD batching to pack 32 users into a single BFV ciphertext. Each user occupies 128 dimensions across 4,096 polynomial slots (4,096 / 128 = 32 users). A single encrypted inner-product operation therefore verifies an entire batch of 32 users in approximately 1,109 microseconds -- roughly 35 microseconds per user for the FHE stage, and approximately 42 microseconds per authentication once the full three-stage pipeline is amortized across the batch.
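The slot-packing arithmetic is simple enough to verify: 4,096 polynomial slots divided into 128-dimension blocks yields 32 users per ciphertext, and dividing the ~1,109 µs batched inner product by 32 gives the per-user cost of the FHE stage.

```rust
// SIMD batching arithmetic from the text.
fn users_per_ciphertext(slots: usize, dims_per_user: usize) -> usize {
    slots / dims_per_user
}

fn main() {
    let users = users_per_ciphertext(4096, 128);
    let per_user_us = 1109.0 / users as f64;
    println!("{} users per ciphertext, ~{:.1} µs per user (FHE stage)", users, per_user_us);
}
```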
The Full Post-Quantum Pipeline
Encrypted biometric fusion is only one component of a secure authentication call. H33 chains three cryptographic stages into a single API request:
FHE biometric match, then ZKP cache verification, then Dilithium attestation -- all in one call, all post-quantum secure, all under 1.4 milliseconds for a 32-user batch.
| Stage | Component | Latency | PQ-Secure |
|---|---|---|---|
| 1. FHE Batch | BFV inner product (32 users/CT) | ~1,109 µs | Yes (lattice) |
| 2. ZKP Verify | In-process DashMap lookup | 0.085 µs | Yes (SHA3-256) |
| 3. Attestation | Dilithium sign + verify | ~244 µs | Yes (ML-DSA) |
The ZKP cache layer deserves special attention. Rather than recomputing a full STARK proof on every request, H33 caches verified proofs in an in-process DashMap that delivers lookups in 0.085 microseconds -- 44 times faster than a raw STARK computation, with zero TCP overhead. This is what enables 1.595 million authentications per second on a single Graviton4 instance.
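Summing the table's stage latencies shows where the headline per-auth figure comes from: one batch costs roughly 1,109 + 0.085 + 244 ≈ 1,353 µs, and dividing by the 32 users packed into the ciphertext amortizes to about 42 µs per authentication.

```rust
// Latency budget for the three-stage pipeline, using the table's figures.
fn amortized_us(fhe_us: f64, zkp_us: f64, attest_us: f64, users: u32) -> f64 {
    (fhe_us + zkp_us + attest_us) / users as f64
}

fn main() {
    let batch_us = 1109.0 + 0.085 + 244.0; // ~1,353 µs, under the 1.4 ms budget
    let per_user = amortized_us(1109.0, 0.085, 244.0, 32);
    println!("batch: {:.1} µs, per user: {:.1} µs", batch_us, per_user);
}
```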
Liveness Detection and Anti-Spoofing
Multimodal fusion provides a natural defense against spoofing because an attacker must simultaneously defeat multiple independent liveness checks. A presentation attack against fused face + voice authentication requires both a photorealistic 3D face rendering and a real-time voice synthesis model that responds to challenge prompts -- a dramatically harder problem than defeating either modality alone.
H33 strengthens this further by incorporating challenge-response liveness into the FHE pipeline. The server generates an encrypted random challenge (a specific phrase to speak, a head movement to perform), and the client's response is verified entirely on ciphertext. An attacker who intercepts the encrypted challenge gains no information about what action is required.
Choosing the Right Fusion Strategy
The optimal fusion strategy depends on your threat model and deployment constraints:
- Maximum security (AND-rule, three modalities) -- Use for high-value targets: financial institutions, government facilities, cryptocurrency custody. FAR drops below 0.0001%, but user friction increases.
- Balanced (weighted score fusion, two modalities) -- Use for consumer applications: mobile banking, enterprise SSO, healthcare portals. Face + voice provides strong security with minimal hardware requirements.
- Maximum convenience (OR-rule with fallback) -- Use for continuous authentication: session persistence, ambient verification. Any single modality can re-verify a user, with escalation to multimodal on anomaly detection.
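The mapping from deployment profile to strategy can be made explicit in configuration. The type and variant names below are illustrative, not part of any H33 API; the three profiles mirror the list above.

```rust
// Illustrative mapping from deployment profile to fusion strategy.
#[derive(Debug, PartialEq)]
enum FusionStrategy {
    AndRuleThreeModality,     // maximum security
    WeightedScoreTwoModality, // balanced
    OrRuleWithFallback,       // maximum convenience
}

enum Profile { HighSecurity, Consumer, Continuous }

fn choose(profile: Profile) -> FusionStrategy {
    match profile {
        Profile::HighSecurity => FusionStrategy::AndRuleThreeModality,
        Profile::Consumer => FusionStrategy::WeightedScoreTwoModality,
        Profile::Continuous => FusionStrategy::OrRuleWithFallback,
    }
}

fn main() {
    println!("{:?}", choose(Profile::Consumer)); // WeightedScoreTwoModality
}
```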
Ready to Go Quantum-Secure?
Start protecting your users with post-quantum authentication today. 1,000 free auths, no credit card required.
Get Free API Key →