Two Ways to Sign a Message
Method one: install a crate and call a function.
let sig = pqcrypto_dilithium::dilithium3::detached_sign(msg, &sk);
Method two: implement ML-DSA-65 from the NIST FIPS 204 specification. Write the Number Theoretic Transform. Implement Montgomery multiplication without division. Build branchless polynomial arithmetic that doesn't leak secrets through timing. Optimize for ARM NEON on Graviton4. Test against Known Answer Test vectors. Measure constant-time execution with statistical timing analysis.
Method one takes 10 seconds. Method two took H33 three months and 3,127 lines of Rust across three files.
Why would anyone choose method two?
What's Inside a Dilithium Signature
CRYSTALS-Dilithium (ML-DSA, NIST FIPS 204) is a lattice-based signature scheme. The core operation is polynomial arithmetic in a quotient ring R_q = Z_q[X]/(X^n + 1), where q = 8,380,417 and n = 256.
Every signature requires:
Polynomial multiplication. Multiplying two degree-255 polynomials modulo X^256 + 1. Naive multiplication is O(n^2). The Number Theoretic Transform (NTT) reduces this to O(n log n) by converting to evaluation form, pointwise multiplying, and converting back.
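As a rough illustration of the idea (not H33's code), the sketch below multiplies two polynomials in Z_q[X]/(X^256 + 1) two ways: a schoolbook O(n^2) loop and a textbook radix-2 NTT. It assumes the FIPS 204 constant 1753 as a primitive 512th root of unity mod q. The function names are made up for this example. It uses plain % for readability, so it is neither constant-time nor fast. Production code would use precomputed twiddles and Montgomery arithmetic.

```rust
const Q: u64 = 8_380_417; // ML-DSA modulus
const N: usize = 256;     // number of coefficients
const PSI: u64 = 1753;    // primitive 512th root of unity mod Q (zeta in FIPS 204)

fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    base %= Q;
    while exp > 0 {
        if exp & 1 == 1 { acc = acc * base % Q; }
        base = base * base % Q;
        exp >>= 1;
    }
    acc
}

/// In-place iterative radix-2 NTT. `omega` must be a primitive a.len()-th root of unity.
fn ntt(a: &mut [u64], omega: u64) {
    let n = a.len();
    // Bit-reversal permutation so the butterflies can run in place.
    let mut j = 0usize;
    for i in 1..n {
        let mut bit = n >> 1;
        while j & bit != 0 { j ^= bit; bit >>= 1; }
        j ^= bit;
        if i < j { a.swap(i, j); }
    }
    let mut len = 2;
    while len <= n {
        let w_len = pow_mod(omega, (n / len) as u64); // primitive len-th root
        for start in (0..n).step_by(len) {
            let mut w = 1u64;
            for k in 0..len / 2 {
                let u = a[start + k];
                let v = a[start + k + len / 2] * w % Q;
                a[start + k] = (u + v) % Q;
                a[start + k + len / 2] = (u + Q - v) % Q;
                w = w * w_len % Q;
            }
        }
        len <<= 1;
    }
}

/// Reference O(n^2) product: X^N wraps around to -1.
fn mul_schoolbook(a: &[u64; N], b: &[u64; N]) -> [u64; N] {
    let mut c = [0u64; N];
    for i in 0..N {
        for j in 0..N {
            let p = a[i] * b[j] % Q;
            if i + j < N { c[i + j] = (c[i + j] + p) % Q; }
            else { c[i + j - N] = (c[i + j - N] + Q - p) % Q; }
        }
    }
    c
}

/// O(n log n) product: twist by powers of psi, transform, multiply pointwise, invert.
fn mul_ntt(a: &[u64; N], b: &[u64; N]) -> [u64; N] {
    let omega = PSI * PSI % Q; // primitive N-th root of unity
    let psi_inv = pow_mod(PSI, Q - 2);
    let (mut fa, mut fb) = (*a, *b);
    let mut p = 1u64;
    for i in 0..N { fa[i] = fa[i] * p % Q; fb[i] = fb[i] * p % Q; p = p * PSI % Q; }
    ntt(&mut fa, omega);
    ntt(&mut fb, omega);
    for i in 0..N { fa[i] = fa[i] * fb[i] % Q; }
    ntt(&mut fa, pow_mod(omega, Q - 2)); // inverse transform uses omega^-1
    let mut p = pow_mod(N as u64, Q - 2); // fold in 1/N, then undo the twist
    for i in 0..N { fa[i] = fa[i] * p % Q; p = p * psi_inv % Q; }
    fa
}
```

Calling mul_ntt and mul_schoolbook on the same inputs should give identical coefficient arrays. The twist by powers of psi is what turns the cyclic convolution an ordinary NTT computes into the negacyclic one that X^256 + 1 requires.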
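Modular arithmetic. Every coefficient must stay in [0, q). The naive approach uses division: a % q. Integer division is among the slowest arithmetic instructions on most CPUs. Montgomery reduction replaces it with multiplications and a shift: compute t = a * q_inv keeping only the low 32 bits, then return (a - t * q) >> 32, where q_inv = q^-1 mod 2^32. The result is a * 2^-32 mod q, which is why operands are kept in Montgomery form. H33's NTT twiddle factors are pre-stored in Montgomery form. Zero divisions in the hot path.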
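A minimal sketch of signed 32-bit Montgomery reduction for this modulus, in the style of the public Dilithium reference code rather than H33's source. QINV = 58,728,449 is q^-1 mod 2^32. The helper names (mont_mul, to_mont, from_mont, r2_mod_q) are chosen for this example.

```rust
const Q: i32 = 8_380_417;
const QINV: i32 = 58_728_449; // q^-1 mod 2^32

/// For |a| < 2^31 * Q, returns r congruent to a * 2^-32 (mod Q) with -Q < r < Q.
fn montgomery_reduce(a: i64) -> i32 {
    // Low 32 bits of a * q^-1; wrapping arithmetic is intended here.
    let t = (a as i32).wrapping_mul(QINV);
    // a - t*Q is an exact multiple of 2^32, so the shift loses nothing.
    ((a - (t as i64) * (Q as i64)) >> 32) as i32
}

/// Product of two values that are already in Montgomery form.
fn mont_mul(a: i32, b: i32) -> i32 {
    montgomery_reduce(a as i64 * b as i64)
}

/// 2^64 mod Q, computed once at setup; using % outside the hot path is fine.
fn r2_mod_q() -> i64 {
    let r = (1i64 << 32) % Q as i64;
    r * r % Q as i64
}

/// Enter Montgomery form: x -> x * 2^32 mod Q.
fn to_mont(x: i32) -> i32 {
    montgomery_reduce(x as i64 * r2_mod_q())
}

/// Leave Montgomery form and normalize the result to [0, Q) without branching.
fn from_mont(x: i32) -> i32 {
    let r = montgomery_reduce(x as i64);
    r + ((r >> 31) & Q)
}
```

For example, from_mont(mont_mul(to_mont(x), to_mont(y))) equals x * y mod q for x and y in [0, q). Storing twiddle factors in Montgomery form means every butterfly multiplication needs only the one reduction shown in mont_mul.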
Sampling. Key generation and signing require sampling polynomials from specific distributions. In ML-DSA the secret coefficients are drawn uniformly from the small range [-η, η] by rejection sampling, with η = 4 for ML-DSA-65. The sampler in the pqcrypto crate generates one coefficient at a time. H33's batch sampler generates 10 coefficients per RNG call, which H33 reports as 5x fewer random number generator invocations.
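For reference, FIPS 204 draws each secret coefficient of ML-DSA-65 from [-4, 4] by rejection on half-bytes of XOF output: a half-byte b below 9 maps to 4 - b, and anything else is discarded. The sketch below shows the batching idea under that rule: take a whole block of bytes from one generator call and extract every coefficient it yields. The function names and the block-based interface are assumptions for this example, not H33's API.

```rust
const ETA: i32 = 4; // ML-DSA-65

/// FIPS 204 half-byte rule for eta = 4: accept b < 9 and map it to eta - b.
fn coeff_from_half_byte(b: u8) -> Option<i32> {
    if b < 9 { Some(ETA - b as i32) } else { None }
}

/// Consumes one block of generator output, appending accepted coefficients
/// to `out` until it holds `want` of them or the block is exhausted.
fn fill_from_block(block: &[u8], out: &mut Vec<i32>, want: usize) {
    for &z in block {
        // Low half-byte first, then high half-byte.
        for nibble in [z & 0x0F, z >> 4] {
            if out.len() == want {
                return;
            }
            if let Some(c) = coeff_from_half_byte(nibble) {
                out.push(c);
            }
        }
    }
}
```

A caller would keep requesting blocks until out.len() reaches 256. Larger blocks mean fewer generator calls per polynomial, which is the effect the batch figure above describes.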
Rejection sampling. Dilithium signing follows the Fiat-Shamir with Aborts paradigm. If a candidate signature would leak information about the secret key, the algorithm discards it and restarts with a fresh mask. The number of attempts varies from message to message, so signing time is not fixed. H33's implementation pre-allocates the rejection buffer and reuses NTT-form intermediate values across retries, eliminating redundant transforms.
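The retry structure can be sketched as below. This is a heavy simplification: it models only the bound check on z = y + c*s1 (infinity norm below γ1 - β, with γ1 = 2^19 and β = 196 for ML-DSA-65). It omits the other checks real signing performs on the low bits and the hint. The attempt closure is a hypothetical stand-in for "derive a fresh mask y and the challenge product c*s1 for this retry counter".

```rust
const GAMMA1: i32 = 1 << 19; // ML-DSA-65 mask range
const BETA: i32 = 196;       // tau * eta = 49 * 4

/// A candidate z is kept only if every coefficient is strictly below gamma1 - beta.
fn z_is_safe(z: &[i32]) -> bool {
    z.iter().all(|&c| c.abs() < GAMMA1 - BETA)
}

/// Retries until a candidate passes the bound check.
/// `attempt(kappa)` is a hypothetical callback returning (y, c*s1) for that retry.
/// Returns the accepted z and the number of attempts used.
fn sign_with_aborts<F>(mut attempt: F) -> (Vec<i32>, u32)
where
    F: FnMut(u32) -> (Vec<i32>, Vec<i32>),
{
    let mut kappa = 0u32;
    loop {
        let (y, cs1) = attempt(kappa);
        let z: Vec<i32> = y.iter().zip(&cs1).map(|(a, b)| a + b).collect();
        kappa += 1;
        if z_is_safe(&z) {
            return (z, kappa);
        }
        // Rejected: z is dropped and never leaves this function.
    }
}
```

Anything that does not depend on the mask, such as the NTT form of the secret vectors, can be computed once before the loop and captured by the closure, so a retry only pays for the work that actually changes.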
Why Native Rust Matters
The pqcrypto crate is a Rust wrapper around PQClean, which is C code. The call chain is: Rust → FFI boundary → C function → C standard library. Every FFI call has overhead: argument marshaling, stack alignment, no inlining across the boundary.
H33's implementation is pure Rust. The compiler sees the entire call chain. It inlines the NTT butterfly into the signing function. It auto-vectorizes the polynomial addition. It eliminates bounds checks through slice pattern matching. The result: approximately 2x faster than the C wrapper on the same hardware.
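One way to give the compiler that visibility, shown here as an assumption rather than H33's actual code, is to type polynomials as fixed-size arrays and walk them with zipped iterators. The lengths are then known at compile time and there is no indexing left to bounds-check. Whether LLVM also vectorizes the loop depends on the target and flags, so it is worth inspecting the generated assembly.

```rust
const N: usize = 256;

/// Coefficient-wise addition of two polynomials.
/// Fixed-size array types carry the length in the type, and the zipped
/// iterators never index out of range, so no bounds checks are needed.
#[inline]
fn poly_add(a: &[i32; N], b: &[i32; N]) -> [i32; N] {
    let mut out = [0i32; N];
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
    out
}
```

Because poly_add is an ordinary Rust function, the optimizer is free to inline it into its caller, which is not possible across a C FFI boundary without cross-language LTO.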
More importantly: Rust's ownership system prevents the memory safety bugs that are the number one source of vulnerabilities in C cryptographic implementations. Use-after-free, buffer overflow, double-free — structurally impossible in safe Rust. The pqcrypto C code doesn't have this guarantee.
Branchless is Non-Negotiable
A timing side-channel attack measures how long an operation takes and infers secret key bits from the variance. Suppose the branch if (secret_bit) { expensive_operation(); } takes 3ns when the bit is 1 and 1ns when it's 0. An attacker with a high-resolution timer and enough samples can recover the key bit by bit.
H33's NTT butterfly is branchless on ARM NEON: 4-wide uint32x4_t vectorized operations with conditional subtract via bit masking, not branching. The conditional reduction caddq maps (-q, q) to [0, q) without any data-dependent branches. Montgomery reduction uses multiply-and-shift, never divide-and-branch.
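The masking trick is easiest to see in scalar form. The sketch below is plain Rust, not NEON, and is not taken from H33's source. It relies on the arithmetic right shift of an i32 producing all ones for negative values and zero otherwise. The names csubq and ct_select are chosen for this example.

```rust
const Q: i32 = 8_380_417;

/// Maps (-Q, Q) to [0, Q): adds Q only when `a` is negative, with no branch.
fn caddq(a: i32) -> i32 {
    // a >> 31 is 0xFFFF_FFFF for negative a and 0 otherwise.
    a + ((a >> 31) & Q)
}

/// Maps [0, 2Q) to [0, Q): subtracts Q, then adds it back if that went negative.
fn csubq(a: i32) -> i32 {
    let t = a - Q;
    t + ((t >> 31) & Q)
}

/// Returns `a` when mask is all ones (-1) and `b` when mask is zero.
fn ct_select(mask: i32, a: i32, b: i32) -> i32 {
    (a & mask) | (b & !mask)
}
```

A vectorized version applies the same shift, AND, and add to four lanes at once. Writing the source without branches does not guarantee the compiled code has none, so the output still needs to be checked.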
The pqcrypto C implementation is also designed to be constant-time. But C compilers can and do introduce branches during optimization. Source code that compiles to a conditional move (cmov) at one optimization level can become a conditional branch at -O3 if the compiler decides it's faster. Rust's LLVM backend has the same risk, so H33 validates constant-time execution with statistical timing analysis (26 tests, including a dudect-equivalent timing consistency check).
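The core of a dudect-style check is Welch's t-test on two sets of timings, for example one measured with a fixed input and one with random inputs. The sketch below shows only that statistic. How the timings are collected, and the cutoff used (values around 4.5 are common in leakage assessment), are assumptions here rather than details of H33's test suite.

```rust
/// Sample mean and unbiased sample variance.
fn mean_var(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var)
}

/// Welch's t statistic for two timing classes with possibly unequal variances.
/// A large |t| suggests the two classes take measurably different time.
fn welch_t(a: &[f64], b: &[f64]) -> f64 {
    let (ma, va) = mean_var(a);
    let (mb, vb) = mean_var(b);
    (ma - mb) / (va / a.len() as f64 + vb / b.len() as f64).sqrt()
}
```

In practice each class needs many thousands of measurements, and outliers from interrupts or frequency scaling are usually cropped before the statistic is computed.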
The Numbers
H33's native Dilithium ML-DSA-65 on Graviton4 (aarch64):
Batch attestation: 291 microseconds per 32-user batch (1 sign + 1 verify). That's 9.1 microseconds per user for the Dilithium portion of the pipeline.
Key sizes: PK 1,952 bytes, SK 4,032 bytes, Signature 3,309 bytes. These match the ML-DSA-65 sizes in FIPS 204 exactly.
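These sizes follow directly from the ML-DSA-65 parameters. As a quick check, the sketch below recomputes them from the encoding formulas in FIPS 204. The constant names are chosen for this example.

```rust
// ML-DSA-65 parameters from FIPS 204.
const K: usize = 6;            // rows of the matrix A
const L: usize = 5;            // columns of the matrix A
const D: usize = 13;           // bits dropped from t
const OMEGA: usize = 55;       // maximum number of ones in the hint
const LAMBDA: usize = 192;     // collision strength of the challenge hash, in bits
const Q_BITS: usize = 23;      // bit length of q - 1
const ETA_BITS: usize = 4;     // bit length of 2 * eta, with eta = 4
const GAMMA1_BITS: usize = 19; // bit length of gamma1 - 1, with gamma1 = 2^19

// Public key: 32-byte seed plus k polynomials of (Q_BITS - D)-bit coefficients.
const PK_BYTES: usize = 32 + 32 * K * (Q_BITS - D);

// Secret key: two 32-byte seeds, a 64-byte hash, then s1, s2, and t0.
const SK_BYTES: usize = 32 + 32 + 64 + 32 * ((K + L) * ETA_BITS + D * K);

// Signature: challenge hash, l polynomials of z, and the encoded hint.
const SIG_BYTES: usize = LAMBDA / 4 + L * 32 * (1 + GAMMA1_BITS) + OMEGA + K;
```

Evaluating the three constants gives 1,952, 4,032, and 3,309 bytes.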
Tests: 49 tests including 8 FIPS 204 KAT tests, parameter validation, tamper resistance (single-byte flip detection), cross-key rejection, round-trip over 100 random messages, edge cases (empty message, 64KB message, all-zero), and timing consistency analysis.
Calling pqcrypto_dilithium::dilithium3::detached_sign() gets you a correct signature. Building the implementation gets you a correct, fast, constant-time, KAT-tested, NEON-optimized, batch-capable, Montgomery-form, rejection-resilient signature engine that runs inside a pipeline doing FHE + STARK + Dilithium at 2.17 million authentications per second.
That's the difference.
H33's Dilithium implementation carries a publicly verifiable HICS-PQ attestation. 100/100. STARK-proven. The algorithm is the authority.