This document presents production benchmark data for ML-DSA-65 (formerly CRYSTALS-Dilithium Level 3) as implemented in the H33 cryptographic pipeline. All measurements were recorded on AWS Graviton4 hardware under sustained load conditions (minimum 30 seconds per measurement). Numbers represent median latency with p99 bounds.
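As a rough sketch of how the median, p99, and minimum figures reported here can be derived from raw per-operation timings, the snippet below summarizes a list of latency samples. It is not the H33 harness: `summarize`, `time_operation`, and the stand-in workload are hypothetical names, and the nearest-rank percentile definition is an assumption, since the document does not say which definition was used.

```python
import time
from statistics import median


def summarize(samples_us):
    """Return (median, p99, min) using the nearest-rank percentile definition."""
    if not samples_us:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_us)
    # Nearest-rank p99: ceil(0.99 * n), computed with integers to avoid float error.
    rank = -(-99 * len(ordered) // 100)
    return median(ordered), ordered[rank - 1], ordered[0]


def time_operation(op, iterations):
    """Time `op` once per iteration and return latencies in microseconds."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        op()
        latencies.append((time.perf_counter_ns() - start) / 1_000)
    return latencies


if __name__ == "__main__":
    # Hypothetical stand-in workload; a real run would call the signer here.
    med, p99, low = summarize(time_operation(lambda: sum(range(100)), 10_000))
    print(f"median={med:.2f} us  p99={p99:.2f} us  min={low:.2f} us")
```

Reporting the minimum next to the median and p99 helps separate the cost of the operation itself from scheduling and cache noise.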
| Property | Value |
|---|---|
| Instance | AWS c8g.metal-48xl |
| Processor | Graviton4 (Arm Neoverse V2) |
| vCPUs | 192 |
| Memory | 371 GiB |
| OS | Amazon Linux 2023 (kernel 6.1) |
| Rust | 1.78.0 (stable) |
| Allocator | System (glibc) |
| Build flags | target-cpu=native |
| SIMD | NEON + SVE2 (128-bit) |

Single-operation latency:

| Operation | Median | p99 | Min | Samples |
|---|---|---|---|---|
| Key generation | 28 us | 34 us | 26 us | 100,000 |
| Sign (32B message) | 72 us | 91 us | 68 us | 1,000,000 |
| Verify (32B message) | 24 us | 29 us | 22 us | 1,000,000 |
| Sign + Verify | 96 us | 118 us | 92 us | 1,000,000 |

Key and signature sizes:

| Artifact | Size (bytes) |
|---|---|
| Public key | 1,952 |
| Secret key | 4,032 |
| Signature | 3,309 |
Batch signing amortizes key loading overhead across multiple sign operations. The following table shows throughput at different batch sizes on a single core.
| Batch Size | Total Time | Per-Sign | Throughput (signs/sec) |
|---|---|---|---|
| 1 | 72 us | 72 us | 13,889 |
| 32 | 2,210 us | 69.1 us | 14,480 |
| 100 | 6,840 us | 68.4 us | 14,620 |
| 1,000 | 68,100 us | 68.1 us | 14,684 |
At batch sizes above 32, per-sign latency converges to approximately 68 us (amortized). Single-core throughput plateaus at approximately 14,700 signs/sec.
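The per-sign and throughput columns follow directly from batch size and total time. The sketch below recomputes them from the batch sizes and total times in the table above; the function names are illustrative and not part of the H33 codebase.

```python
# (batch size, total time in microseconds), taken from the batch table above.
BATCH_RUNS = [(1, 72), (32, 2_210), (100, 6_840), (1_000, 68_100)]


def per_sign_us(batch_size, total_us):
    """Amortized latency of one sign operation within a batch."""
    return total_us / batch_size


def signs_per_sec(batch_size, total_us):
    """Single-core throughput implied by one batch run."""
    return batch_size * 1_000_000 / total_us


for size, total in BATCH_RUNS:
    print(f"{size:>5}  {per_sign_us(size, total):6.1f} us  "
          f"{signs_per_sec(size, total):8.0f} signs/sec")
```

Going from a batch of 1 to a batch of 1,000 saves about 3.9 us per signature (72 us down to 68.1 us), which is the amortized key loading overhead described above.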
With independent key pairs per thread (no shared mutable state), ML-DSA-65 signing scales near-linearly across cores:
| Cores | Throughput (signs/sec) | Scaling Factor |
|---|---|---|
| 1 | 14,684 | 1.00x |
| 16 | 234,200 | 15.95x |
| 48 | 698,400 | 47.56x |
| 96 | 1,389,000 | 94.59x |
| 192 | 2,752,000 | 187.41x |
Near-linear scaling (97.6% efficiency at 192 cores) is achieved because ML-DSA-65 signing has no shared state and its working set fits entirely within L1 cache. The slight sub-linearity, which grows to about 2.4% at 192 cores, is attributed to memory bandwidth saturation, not lock contention.
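Scaling factor and parallel efficiency are simple ratios against the single-core baseline. The following sketch recomputes both from the throughput column of the scaling table; the helper names are illustrative only.

```python
BASELINE = 14_684  # single-core signs/sec from the scaling table
MEASURED = {16: 234_200, 48: 698_400, 96: 1_389_000, 192: 2_752_000}


def scaling_factor(throughput, baseline=BASELINE):
    """Speedup relative to one core."""
    return throughput / baseline


def efficiency(cores, throughput, baseline=BASELINE):
    """Fraction of ideal linear scaling actually achieved."""
    return scaling_factor(throughput, baseline) / cores


for cores, tput in MEASURED.items():
    print(f"{cores:>3} cores: {scaling_factor(tput):7.2f}x  "
          f"{efficiency(cores, tput):6.1%}")
```

Efficiency is the more useful number for capacity planning, because it shows how much of each added core translates into signing throughput.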
The following table compares ML-DSA-65 against Ed25519 and RSA-2048, all measured on the same Graviton4 hardware under identical conditions.
| Scheme | Security Level | Sign | Verify | Signature Size | PQ-Secure |
|---|---|---|---|---|---|
| ML-DSA-65 | NIST Level 3 | 72 us | 24 us | 3,309 B | Yes |
| Ed25519 | 128-bit classical | 18 us | 52 us | 64 B | No |
| RSA-2048 | 112-bit classical | 1,420 us | 38 us | 256 B | No |
ML-DSA-65 sign latency is 4x slower than Ed25519 but 19.7x faster than RSA-2048. ML-DSA-65 verify latency is 2.2x faster than Ed25519 verify and 1.6x faster than RSA-2048 verify. The signature size tradeoff (3,309 bytes vs 64 bytes for Ed25519) is the cost of post-quantum security.
Ed25519 and RSA-2048 are not quantum-resistant. The comparison is provided for migration planning purposes. Organizations transitioning from Ed25519 will observe a 4x sign latency increase and a 52x signature size increase, offset by a 2.2x verify latency improvement.
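For migration planning, the ratios quoted above can be recomputed from the comparison table. This is a small sketch with hypothetical names (`SCHEMES`, `migration_ratios`); it only restates the table's figures as new/old ratios.

```python
SCHEMES = {
    # name: (sign_us, verify_us, signature_bytes), from the comparison table
    "ML-DSA-65": (72, 24, 3_309),
    "Ed25519": (18, 52, 64),
    "RSA-2048": (1_420, 38, 256),
}


def migration_ratios(old, new="ML-DSA-65"):
    """Return new/old ratios; values above 1 mean the new scheme costs more."""
    o_sign, o_verify, o_size = SCHEMES[old]
    n_sign, n_verify, n_size = SCHEMES[new]
    return {
        "sign": n_sign / o_sign,
        "verify": n_verify / o_verify,
        "size": n_size / o_size,
    }


for old in ("Ed25519", "RSA-2048"):
    r = migration_ratios(old)
    print(f"{old} -> ML-DSA-65: sign x{r['sign']:.2f}, "
          f"verify x{r['verify']:.2f}, size x{r['size']:.1f}")
```

A ratio below 1 is an improvement. Moving from RSA-2048, for instance, reduces both sign and verify latency, while the signature grows by roughly 13x.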
The H33 ML-DSA-65 implementation conforms to NIST FIPS 204 (Module-Lattice-Based Digital Signature Standard). Compliance is verified through:
In the H33 production authentication pipeline, ML-DSA-65 is one of three signature families applied to each governance node. The batch attestation stage (Stage 2 in the pipeline) includes SHA3-256 hashing, ML-DSA-65 sign, FALCON-512 sign, SLH-DSA-SHA2-128f sign, and triple verify. ML-DSA-65 sign and verify together contribute approximately 25% (96 us of 391 us) of the Stage 2 latency itemized below.
| Pipeline Stage | Component | Latency | ML-DSA-65 Contribution |
|---|---|---|---|
| Stage 2: Batch attest | SHA3-256 hash | 2 us | -- |
| Stage 2: Batch attest | ML-DSA-65 sign | 72 us | 72 us |
| Stage 2: Batch attest | FALCON-512 sign | 89 us | -- |
| Stage 2: Batch attest | SLH-DSA sign | 204 us | -- |
| Stage 2: Batch attest | ML-DSA-65 verify | 24 us | 24 us |
| Stage 2: Batch attest | Total | 391 us | 96 us (24.6%) |
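The ML-DSA-65 share of Stage 2 can be checked by summing the component latencies in the table above. The sketch below does that arithmetic; the list and function names are illustrative.

```python
# Component latencies in microseconds, as listed in the Stage 2 table.
STAGE2_US = [
    ("SHA3-256 hash", 2),
    ("ML-DSA-65 sign", 72),
    ("FALCON-512 sign", 89),
    ("SLH-DSA sign", 204),
    ("ML-DSA-65 verify", 24),
]


def share(components, prefix):
    """Return (latency of matching components, total latency, fraction)."""
    total = sum(us for _, us in components)
    part = sum(us for name, us in components if name.startswith(prefix))
    return part, total, part / total


part, total, frac = share(STAGE2_US, "ML-DSA-65")
print(f"ML-DSA-65: {part} us of {total} us ({frac:.1%})")
# prints "ML-DSA-65: 96 us of 391 us (24.6%)"
```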
See Agent Infrastructure Benchmarks for the full pipeline breakdown.
To reproduce these benchmarks:

    # Clone and build (requires Rust 1.78+)
    $ cargo build --release --features dilithium_bench --example dilithium_bench
    # Run single-operation benchmarks
    $ ./target/release/examples/dilithium_bench --iterations 1000000
    # Run batch benchmarks
    $ ./target/release/examples/dilithium_bench --batch 32 --iterations 100000
    # Run multi-core scaling test
    $ ./target/release/examples/dilithium_bench --threads 192 --iterations 100000

Benchmarks MUST be run with target-cpu=native in .cargo/config.toml (for example, rustflags = ["-C", "target-cpu=native"] under [build]). Without this flag, the compiler will not emit SVE2 instructions, and results will be approximately 35% slower.