
ML-DSA-65 (Dilithium) Performance

Version: 1.0.0
Status: Production
Last Updated: 2026-05-23
Hardware: AWS c8g.metal-48xl (Graviton4, 192 vCPU, 371 GiB)
NIST Standard: FIPS 204 (ML-DSA)
Canonical URL: https://h33.ai/benchmarks/dilithium/

1. Scope

This document presents production benchmark data for ML-DSA-65 (formerly CRYSTALS-Dilithium Level 3) as implemented in the H33 cryptographic pipeline. All measurements were recorded on AWS Graviton4 hardware under sustained load conditions (minimum 30 seconds per measurement). Numbers represent median latency with p99 bounds.

2. Definitions

ML-DSA-65
Module-Lattice-Based Digital Signature Algorithm at NIST security level 3 (security category 3, comparable in strength to exhaustive key search on AES-192). Standardized in NIST FIPS 204. Based on the hardness of Module-Learning With Errors (MLWE) and Module-Short Integer Solution (MSIS).
Sign Latency
The wall-clock time to generate a signature over a 32-byte message digest, excluding key generation and I/O.
Verify Latency
The wall-clock time to verify a signature against a 32-byte message digest and public key.
Batch Signing
The process of signing multiple message digests sequentially using the same key pair, with amortized key loading overhead.

3. Hardware and Environment

| Property | Value |
|---|---|
| Instance | AWS c8g.metal-48xl |
| Processor | Graviton4 (Arm Neoverse V2) |
| vCPUs | 192 |
| Memory | 371 GiB |
| OS | Amazon Linux 2023 (kernel 6.1) |
| Rust | 1.78.0 (stable) |
| Allocator | System (glibc) |
| Build flags | target-cpu=native |
| SIMD | NEON + SVE2 (256-bit) |

4. Core Operations

4.1. Single-Operation Latencies

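| Operation | Median | p99 | Min | Samples |
|---|---|---|---|---|
| Key generation | 28 us | 34 us | 26 us | 100,000 |
| Sign (32B message) | 72 us | 91 us | 68 us | 1,000,000 |
| Verify (32B message) | 24 us | 29 us | 22 us | 1,000,000 |
| Sign + Verify | 96 us | 118 us | 92 us | 1,000,000 |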
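The median, p99, and minimum columns can be derived from raw timings along the following lines. This is a minimal sketch, not the H33 harness: `measure`, `summarize`, and `LatencySummary` are hypothetical names, `op` stands in for one sign or verify call, and the nearest-rank percentile method is an assumption, since the source does not say which percentile definition was used.

```rust
use std::time::{Duration, Instant};

/// Summary statistics for a set of latency samples.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LatencySummary {
    pub min: Duration,
    pub median: Duration,
    pub p99: Duration,
}

/// Times `op` once per sample and summarizes the results.
/// `op` is a placeholder for the operation under test (e.g. one sign call).
pub fn measure<F: FnMut()>(samples: usize, mut op: F) -> LatencySummary {
    assert!(samples > 0, "need at least one sample");
    let mut timings = Vec::with_capacity(samples);
    for _ in 0..samples {
        let start = Instant::now();
        op();
        timings.push(start.elapsed());
    }
    summarize(&mut timings)
}

/// Sorts the samples in place and picks min, median, and p99 by nearest rank.
pub fn summarize(timings: &mut [Duration]) -> LatencySummary {
    assert!(!timings.is_empty(), "need at least one sample");
    timings.sort_unstable();
    LatencySummary {
        min: timings[0],
        median: nearest_rank(timings, 50),
        p99: nearest_rank(timings, 99),
    }
}

/// Nearest-rank percentile: the smallest sample with at least `pct` percent
/// of all samples less than or equal to it.
fn nearest_rank(sorted: &[Duration], pct: usize) -> Duration {
    let rank = (sorted.len() * pct + 99) / 100; // ceil(n * pct / 100)
    sorted[rank.max(1) - 1]
}
```

Timing each call individually adds the cost of two clock reads to every sample, which matters at the tens-of-microseconds scale reported here; a harness may instead time small groups of calls and divide.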

4.2. Key and Signature Sizes

| Artifact | Size (bytes) |
|---|---|
| Public key | 1,952 |
| Secret key | 4,032 |
| Signature | 3,309 |

5. Batch Signing Throughput

Batch signing amortizes key loading overhead across multiple sign operations. The following table shows throughput at different batch sizes on a single core.

| Batch Size | Total Time | Per-Sign | Throughput (signs/sec) |
|---|---|---|---|
| 1 | 72 us | 72 us | 13,889 |
| 32 | 2,210 us | 69.1 us | 14,480 |
| 100 | 6,840 us | 68.4 us | 14,620 |
| 1,000 | 68,100 us | 68.1 us | 14,684 |

At batch sizes above 32, per-sign latency converges to approximately 68 us (amortized). Single-core throughput plateaus at approximately 14,700 signs/sec.
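The per-sign and throughput columns follow directly from batch size and total time. A small sketch of that arithmetic, with hypothetical helper names:

```rust
/// Per-sign latency in microseconds for a batch that took `total_us` overall.
pub fn per_sign_us(total_us: f64, batch_size: u32) -> f64 {
    total_us / f64::from(batch_size)
}

/// Single-core throughput in signs per second for the same batch.
pub fn signs_per_sec(total_us: f64, batch_size: u32) -> f64 {
    f64::from(batch_size) * 1_000_000.0 / total_us
}
```

For the batch of 32, `per_sign_us(2210.0, 32)` gives 69.0625 us and `signs_per_sec(2210.0, 32)` rounds to 14,480, matching the table row.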

5.1. Multi-Core Scaling

With independent key pairs per thread (no shared mutable state), ML-DSA-65 signing scales near-linearly across cores:

| Cores | Throughput (signs/sec) | Scaling Factor |
|---|---|---|
| 1 | 14,684 | 1.00x |
| 16 | 234,200 | 15.95x |
| 48 | 698,400 | 47.56x |
| 96 | 1,389,000 | 94.60x |
| 192 | 2,752,000 | 187.43x |

Near-linear scaling (97.6% efficiency at 192 cores) is achieved because ML-DSA-65 signing has no shared state and fits entirely within L1 cache. The slight sub-linearity at 192 cores is attributable to memory bandwidth saturation, not lock contention.
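A scaling test of this shape can be sketched with scoped threads, where each thread builds its own state and shares nothing mutable. The code below is an illustration under assumptions, not the H33 benchmark: `parallel_throughput`, `make_worker`, and `parallel_efficiency` are hypothetical names, and the worker closure stands in for "generate a key pair, then sign repeatedly".

```rust
use std::thread;
use std::time::Instant;

/// Runs `iterations` calls of a per-thread workload on `threads` threads and
/// returns aggregate operations per second. `make_worker` is called inside
/// each thread to build thread-local state (standing in for an independent
/// key pair), so no mutable state is shared between threads.
pub fn parallel_throughput<W, F>(threads: usize, iterations: u64, make_worker: F) -> f64
where
    F: Fn(usize) -> W + Sync,
    W: FnMut(),
{
    let start = Instant::now();
    thread::scope(|s| {
        for id in 0..threads {
            let make_worker = &make_worker;
            s.spawn(move || {
                let mut work = make_worker(id);
                for _ in 0..iterations {
                    work();
                }
            });
        }
    }); // the scope joins every spawned thread before returning
    let elapsed = start.elapsed().as_secs_f64();
    (threads as u64 * iterations) as f64 / elapsed
}

/// Scaling factor divided by core count: 1.0 means perfectly linear scaling.
pub fn parallel_efficiency(throughput: f64, single_core: f64, cores: u32) -> f64 {
    (throughput / single_core) / f64::from(cores)
}
```

With the table values, `parallel_efficiency(2_752_000.0, 14_684.0, 192)` comes out at about 0.976, the 97.6% figure quoted above. Note that this sketch includes thread start-up and worker setup in the timed region, so short runs will understate throughput.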

6. Comparison: ML-DSA-65 vs Classical Schemes

The following table compares ML-DSA-65 against Ed25519 and RSA-2048, all measured on the same Graviton4 hardware under identical conditions.

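| Scheme | Security Level | Sign | Verify | Signature Size | PQ-Secure |
|---|---|---|---|---|---|
| ML-DSA-65 | NIST Level 3 | 72 us | 24 us | 3,309 B | Yes |
| Ed25519 | 128-bit classical | 18 us | 52 us | 64 B | No |
| RSA-2048 | 112-bit classical | 1,420 us | 38 us | 256 B | No |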

ML-DSA-65 sign latency is 4x slower than Ed25519 but 19.7x faster than RSA-2048. ML-DSA-65 verify latency is 2.2x faster than Ed25519 verify and 1.6x faster than RSA-2048 verify. The signature size tradeoff (3,309 bytes vs 64 bytes for Ed25519) is the cost of post-quantum security.

Ed25519 and RSA-2048 are not quantum-resistant. The comparison is provided for migration planning purposes. Organizations transitioning from Ed25519 will observe a 4x sign latency increase and a 52x signature size increase, offset by a 2.2x verify latency improvement.

7. NIST FIPS 204 Compliance

The H33 ML-DSA-65 implementation conforms to NIST FIPS 204 (Module-Lattice-Based Digital Signature Standard). Compliance is verified through:

  1. KAT vectors. All 100 Known Answer Test vectors from the NIST reference implementation pass. KAT validation runs as part of the CI pipeline on every commit.
  2. Parameter set. ML-DSA-65 uses (k=6, l=5, eta=4, gamma1=2^19, gamma2=(q-1)/32, tau=49, beta=196, omega=55). These match the FIPS 204 specification exactly.
  3. Deterministic signing. The implementation uses the deterministic variant (no hedging with additional randomness). Given identical inputs, identical signatures are produced.
  4. Side-channel mitigation. Constant-time NTT, constant-time rejection sampling, and constant-time polynomial arithmetic. No data-dependent branches in the hot path.
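The parameter set in item 2 is internally consistent and also determines the key and signature sizes reported in Section 4.2. The sketch below checks both, using the encoding-size formulas from FIPS 204. The values q = 8380417, d = 13, and lambda = 192 are not listed in this document; they are taken from the FIPS 204 parameter table for ML-DSA-65. The function names are illustrative only.

```rust
/// ML-DSA-65 parameters as listed in item 2, plus q, d, and lambda,
/// which come from the FIPS 204 parameter table rather than this document.
pub const Q: u32 = 8_380_417;
pub const D: u32 = 13;
pub const LAMBDA: u32 = 192;
pub const K: u32 = 6;
pub const L: u32 = 5;
pub const ETA: u32 = 4;
pub const TAU: u32 = 49;
pub const GAMMA1: u32 = 1 << 19;
pub const OMEGA: u32 = 55;

/// Number of bits needed to represent `x` (0 for x == 0).
pub const fn bitlen(x: u32) -> u32 {
    32 - x.leading_zeros()
}

/// beta = tau * eta.
pub const fn beta() -> u32 {
    TAU * ETA
}

/// gamma2 = (q - 1) / 32.
pub const fn gamma2() -> u32 {
    (Q - 1) / 32
}

/// Encoded public key size in bytes.
pub const fn public_key_bytes() -> u32 {
    32 + 32 * K * (bitlen(Q - 1) - D)
}

/// Encoded secret key size in bytes.
pub const fn secret_key_bytes() -> u32 {
    32 + 32 + 64 + 32 * ((K + L) * bitlen(2 * ETA) + D * K)
}

/// Encoded signature size in bytes.
pub const fn signature_bytes() -> u32 {
    LAMBDA / 4 + L * 32 * (1 + bitlen(GAMMA1 - 1)) + OMEGA + K
}
```

Evaluating these gives beta = 196 and gamma2 = 261,888, and sizes of 1,952, 4,032, and 3,309 bytes, which agree with the values stated in this document.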

8. Integration in H33 Pipeline

In the H33 production authentication pipeline, ML-DSA-65 is one of three signature families applied to each governance node. The batch attestation stage (Stage 2 in the pipeline) includes SHA3-256 hashing, ML-DSA-65 sign, FALCON-512 sign, SLH-DSA-SHA2-128f sign, and triple verify. ML-DSA-65 sign and verify together contribute approximately 25% of the Stage 2 latency listed below (96 us of 391 us).

| Pipeline Stage | Component | Latency | ML-DSA-65 Contribution |
|---|---|---|---|
| Stage 2: Batch attest | SHA3-256 hash | 2 us | -- |
| Stage 2: Batch attest | ML-DSA-65 sign | 72 us | 72 us |
| Stage 2: Batch attest | FALCON-512 sign | 89 us | -- |
| Stage 2: Batch attest | SLH-DSA sign | 204 us | -- |
| Stage 2: Batch attest | ML-DSA-65 verify | 24 us | 24 us |
| Stage 2: Batch attest | Total | 391 us | 96 us (24.6%) |

See Agent Infrastructure Benchmarks for the full pipeline breakdown.
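The Total row and the ML-DSA-65 share can be recomputed from the component latencies in the Stage 2 table. A short sketch, with a hypothetical `stage_share` helper and the table rows copied in as constants:

```rust
/// Stage 2 component latencies in microseconds, copied from the table above.
pub const STAGE2_US: [(&str, u32); 5] = [
    ("SHA3-256 hash", 2),
    ("ML-DSA-65 sign", 72),
    ("FALCON-512 sign", 89),
    ("SLH-DSA sign", 204),
    ("ML-DSA-65 verify", 24),
];

/// Returns (total latency, latency of components whose name starts with `prefix`).
pub fn stage_share(components: &[(&str, u32)], prefix: &str) -> (u32, u32) {
    let total = components.iter().map(|c| c.1).sum::<u32>();
    let matched = components
        .iter()
        .filter(|c| c.0.starts_with(prefix))
        .map(|c| c.1)
        .sum::<u32>();
    (total, matched)
}
```

Calling `stage_share(&STAGE2_US, "ML-DSA-65")` returns `(391, 96)`, and 96 / 391 is roughly 24.6%.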

9. Reproducibility

To reproduce these benchmarks:

    # Clone and build (requires Rust 1.78+)
    # --example is needed because a plain `cargo build` does not build examples
    $ cargo build --release --example dilithium_bench --features dilithium_bench

    # Run single-operation benchmarks
    $ ./target/release/examples/dilithium_bench --iterations 1000000

    # Run batch benchmarks
    $ ./target/release/examples/dilithium_bench --batch 32 --iterations 100000

    # Run multi-core scaling test
    $ ./target/release/examples/dilithium_bench --threads 192 --iterations 100000

Benchmarks MUST be run with target-cpu=native in .cargo/config.toml. Without this flag, the compiler will not emit SVE2 instructions, and results will be approximately 35% slower.
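In Cargo, this flag is normally passed through the `rustflags` key of the `[build]` section, as `["-C", "target-cpu=native"]`. One way to confirm that it took effect is to have the benchmark binary report which SIMD target features were enabled at compile time. The following is a sketch under that assumption; `simd_build_report` and `describe_build` are hypothetical helpers, not part of the H33 code.

```rust
/// Reports whether this build was compiled with SVE2 and NEON enabled.
/// `cfg!` is resolved at compile time from the enabled target features, so
/// it reflects the build flags rather than the CPU the binary happens to run on.
pub fn simd_build_report() -> (bool, bool) {
    let sve2 = cfg!(all(target_arch = "aarch64", target_feature = "sve2"));
    let neon = cfg!(all(target_arch = "aarch64", target_feature = "neon"));
    (sve2, neon)
}

/// Human-readable form of the report, suitable for a benchmark log header.
pub fn describe_build() -> String {
    let (sve2, neon) = simd_build_report();
    format!(
        "SVE2 enabled at compile time: {}, NEON enabled at compile time: {}",
        sve2, neon
    )
}
```

If the SVE2 field reads `false` on a Graviton4 host, the build most likely did not pick up `target-cpu=native`, and the results should not be compared with the numbers in this document.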