Audit Findings to Frozen Protocol — Replay-Grade Crypto

There is a difference between shipping code that passes tests and shipping a protocol that cannot regress. Tests verify behavior at a point in time. A frozen protocol guarantees behavior for all time. This is the story of how we went from a set of audit findings to a verification protocol that produces replay-grade cryptographic evidence — identical outputs regardless of who runs the verifier, when they run it, or what implementation they use.

The journey starts with an audit. It ends with a specification that anyone can implement independently and get byte-identical results. Everything in between is engineering discipline applied at the protocol level.

The Audit

We ran a comprehensive security audit of our STARK verification infrastructure. Not a box-checking exercise. A systematic attempt to break our own system before anyone else could. The audit covered the full verification pipeline: AIR constraint evaluation, FRI commitment scheme, Fiat-Shamir transcript generation, biometric threshold verification, proof serialization, and domain separation.

The audit produced four findings: one critical and three high-severity. Each one was a real vulnerability that could have been exploited in production.

Finding 1: Biometric Threshold Always-True Bypass (CRITICAL)

The biometric threshold verification function contained a code path where out-of-bounds similarity scores returned true instead of false. This meant that any input, regardless of how different it was from the enrolled template, could satisfy the biometric match by providing a score outside the expected range. An attacker could bypass biometric authentication entirely by submitting a crafted similarity score that triggered the out-of-bounds branch. Every biometric would match. Every authentication would succeed. The gate was open.

Finding 2: FRI Domain Separation Replay (HIGH)

The FRI (Fast Reed-Solomon Interactive Oracle Proof of Proximity) commitment scheme did not include sufficient domain separation between folding rounds. A proof generated for one FRI layer could be replayed as a valid commitment in a different layer. An attacker could take a legitimate proof, extract intermediate commitments, and present them at different positions in the verification tree. The verifier would accept the replayed commitments because it could not distinguish which layer they originated from. This would allow forged proofs to pass verification for computations that were never actually performed.

Finding 3: Fiat-Shamir Configuration Binding Missing (HIGH)

The Fiat-Shamir transcript, which converts the interactive STARK protocol into a non-interactive one, did not bind the proof configuration parameters (field size, expansion factor, number of queries, grinding bits) into the initial transcript state. This meant that a proof generated under one security parameter set could be accepted by a verifier configured for a different one. An attacker could generate a proof with weak security parameters (few queries, small expansion factor) and then present it to a verifier configured with strong ones. The verifier would derive challenges from a transcript that did not commit to the expected configuration, and would accept a proof that did not meet the claimed security level.

Finding 4: FRI Folding Factor Validation Missing (HIGH)

The FRI verification code did not validate that the folding factor was consistent across all layers of the proof. A malformed proof could specify different folding factors at different layers, causing the verifier to evaluate polynomials at incorrect degree bounds. This would allow a prover to submit a proof for a polynomial of degree D while the verifier believes it is checking degree D/2 or D/4. The soundness guarantee of the FRI protocol depends on consistent folding — without validation, the protocol provides no meaningful security bound.

Four findings. Each one a different class of vulnerability. Together they represented a comprehensive failure to enforce the invariants that the STARK verification protocol depends on for its security guarantees.

The Fixes

Each finding was fixed individually, tested in isolation, and then frozen as a permanent security regression test. These are not ordinary tests that verify correct behavior. They are adversarial tests that attempt the exact exploit the original vulnerability allowed. If the exploit ever succeeds, the regression test fails and the build fails with it.

The biometric threshold fix is instructive because it demonstrates how much can hinge on a single word. The original code:

// Out-of-bounds score handling
if score > MAX_THRESHOLD || score < MIN_THRESHOLD {
    return true;  // BUG: should reject out-of-bounds
}

The fix:

// Out-of-bounds score handling
if score > MAX_THRESHOLD || score < MIN_THRESHOLD {
    return false;  // FIXED: reject out-of-bounds scores
}

One word changed. true became false. But the implications were existential. With return true, any biometric input — a random vector, a blank template, noise — would authenticate successfully if the similarity computation produced an out-of-bounds score. The biometric gate was decorative. It existed syntactically but provided zero security. Any system relying on this threshold for access control was operating without access control.

The regression test (SEC-REG-001) specifically crafts an out-of-bounds score and asserts that verification fails. If a future code change accidentally reintroduces the always-true path, this test catches it immediately.
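A regression test of this kind might look like the minimal sketch below. The bounds, the `f64` score type, the `required` parameter, and the function names are assumptions made for illustration; the article does not show the real verifier signature.

```rust
// Hypothetical bounds; the real values are not given in the article.
const MIN_THRESHOLD: f64 = 0.0;
const MAX_THRESHOLD: f64 = 1.0;

/// Returns true only when `score` is in bounds and meets `required`.
fn verify_biometric(score: f64, required: f64) -> bool {
    // NaN compares false against everything, so it is rejected explicitly.
    if score.is_nan() || score > MAX_THRESHOLD || score < MIN_THRESHOLD {
        return false; // out-of-bounds scores never authenticate
    }
    score >= required
}

/// Adversarial check in the spirit of SEC-REG-001: every crafted
/// out-of-bounds score must be rejected.
fn out_of_bounds_scores_rejected() -> bool {
    let crafted = [1.5, -0.1, f64::INFINITY, f64::NEG_INFINITY, f64::NAN];
    crafted.iter().all(|&s| !verify_biometric(s, 0.9))
}
```

The check deliberately includes NaN and the infinities, since a bounds test written only with `<` and `>` lets NaN slip through.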

The remaining fixes followed the same pattern:

SEC-REG-002: FRI layer commitments now include a domain separator encoding the layer index. Replaying a commitment from layer N at layer M produces a verification failure because the domain separators do not match.
SEC-REG-003: The Fiat-Shamir transcript initialization now absorbs all configuration parameters before deriving any challenges. A proof generated under config A cannot verify under config B because the transcript states diverge from the first byte.
SEC-REG-004: FRI folding factor is validated at every layer. Inconsistent folding factors produce an immediate rejection with a structured error code.
SEC-REG-005: Polynomial degree bounds are checked after FRI folding to ensure the claimed degree matches the actual polynomial structure.
SEC-REG-006: OOD (out-of-domain) evaluation points are validated against the trace domain to prevent evaluation at in-domain points that would trivially satisfy constraints.
SEC-REG-007: Query positions are validated against proof structure to ensure all indices reference valid positions in the committed Merkle tree.
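As an illustration of the SEC-REG-004 check, the sketch below validates that every FRI layer declares the configured folding factor and reports the first offending layer. The `FriError` type and its variants are hypothetical stand-ins; the actual error codes are not listed in this article.

```rust
/// Hypothetical structured error for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum FriError {
    EmptyProof,
    InconsistentFoldingFactor { layer: usize, expected: u32, actual: u32 },
}

/// Checks that every layer declares the same folding factor as the
/// verifier configuration, rejecting at the first mismatch.
fn validate_folding_factors(expected: u32, layers: &[u32]) -> Result<(), FriError> {
    if layers.is_empty() {
        return Err(FriError::EmptyProof);
    }
    for (layer, &actual) in layers.iter().enumerate() {
        if actual != expected {
            return Err(FriError::InconsistentFoldingFactor { layer, expected, actual });
        }
    }
    Ok(())
}
```

Rejecting at the first mismatch, rather than collecting all of them, keeps the rejection identical across implementations regardless of how far each one would otherwise have continued.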

Twelve regression tests total (SEC-REG-001 through SEC-REG-012, covering variants and edge cases of the four primary findings). These tests run on every build, in every environment, forever. They are not optional. They cannot be skipped. They are as permanent as the protocol itself.

From Fixes to Protocol Hardening

Fixed bugs are table stakes. Every competent engineering team fixes bugs when they find them. What matters is what you build after the fix. We used the audit findings as the foundation for a systematic protocol hardening effort that went far beyond patching individual vulnerabilities.

The question we asked was: how do we make it impossible for this class of bug to exist without detection? Not just these four bugs — this entire category of protocol-level failure.

We built five hardening layers:

Canonical Transcript Generation. Every step of the STARK verification process now produces a deterministic transcript. The transcript is not a log — it is a commitment chain. Each step absorbs its inputs, produces a challenge, and the challenge derivation is fully specified. Two implementations that follow the specification produce identical transcripts for the same proof. This makes divergence between implementations immediately detectable.

Deterministic Challenge Derivation. Every challenge (random coin) in the protocol is derived from a specified transcript state using SHA3-256. The derivation rule is frozen: domain_separator || transcript_state || counter. No implementation freedom exists. This eliminates the class of bugs where challenge derivation differs between prover and verifier or between two verifier implementations.
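The derivation rule can be pictured with the sketch below, which only builds the byte string that would be fed to SHA3-256. The hash call itself is left out because the Rust standard library does not provide SHA3. The 8-byte big-endian counter encoding is an assumption of this sketch, and it also assumes the separator and transcript state have fixed lengths so that plain concatenation is unambiguous.

```rust
/// Builds the hash input: domain_separator || transcript_state || counter.
/// Counter width and endianness are assumptions, not taken from the spec.
fn challenge_preimage(domain_separator: &[u8], transcript_state: &[u8], counter: u64) -> Vec<u8> {
    let mut out = Vec::with_capacity(domain_separator.len() + transcript_state.len() + 8);
    out.extend_from_slice(domain_separator);
    out.extend_from_slice(transcript_state);
    out.extend_from_slice(&counter.to_be_bytes());
    out
}
```

Because every byte of the input is fixed by the rule, two implementations that agree on the transcript state necessarily hash the same bytes and derive the same challenge.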

Serialization Freeze Vectors. Every data structure in the protocol has a canonical serialization. Field elements, polynomial evaluations, Merkle paths, FRI layers, constraint evaluations — each has exactly one valid byte representation. The serialization rules are frozen and tested against canonical vectors. A proof serialized by any conformant implementation produces identical bytes.

Structured Rejection Semantics. When verification fails, the verifier produces a structured rejection with a specific error code, the failing step, the expected value, the actual value, and the constraint that was violated. Rejections are not boolean — they carry forensic detail. Two verifiers rejecting the same invalid proof produce identical rejection structures.
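A structured rejection could be modeled roughly as follows. The field names, the error-code string, and the step label are hypothetical; only the list of fields (error code, failing step, expected value, actual value, violated constraint) comes from the description above.

```rust
/// Hypothetical shape of a structured rejection.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Rejection {
    error_code: String,
    failing_step: String,
    expected: String, // lowercase hex
    actual: String,   // lowercase hex
    constraint: String,
}

/// Compares a recomputed commitment against the one carried by the proof
/// and returns forensic detail instead of a bare boolean on mismatch.
fn check_layer_commitment(layer: usize, expected_hex: &str, actual_hex: &str) -> Result<(), Rejection> {
    if expected_hex == actual_hex {
        return Ok(());
    }
    Err(Rejection {
        error_code: "E_COMMITMENT_MISMATCH".to_string(),
        failing_step: format!("fri_layer_{}", layer),
        expected: expected_hex.to_string(),
        actual: actual_hex.to_string(),
        constraint: "layer commitment must equal recomputed Merkle root".to_string(),
    })
}
```

Since the rejection is built only from the inputs, two runs on the same invalid proof compare equal field by field.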

Replay Integrity Classification. Every verified proof is classified into one of six replay integrity levels: Level 0 (transcript-only, no proof data), Level 1 (proof present but not verified), Level 2 (partially verified), Level 3 (fully verified, single implementation), Level 4 (cross-implementation verified), Level 5 (independently reproduced by third party). The classification is part of the verification output, not an afterthought.
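The six levels map naturally onto an enum with fixed discriminants. The variant names below are illustrative choices; only the numbering and the meaning of each level come from the text.

```rust
/// Replay integrity levels 0 through 5, with explicit discriminants so
/// the numeric value cannot shift if variants are reordered in source.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
#[repr(u8)]
enum ReplayIntegrity {
    TranscriptOnly = 0,       // transcript-only, no proof data
    ProofUnverified = 1,      // proof present but not verified
    PartiallyVerified = 2,    // partially verified
    FullyVerified = 3,        // fully verified, single implementation
    CrossImplementation = 4,  // cross-implementation verified
    ThirdPartyReproduced = 5, // independently reproduced by third party
}

impl ReplayIntegrity {
    /// Parses a numeric level, rejecting anything outside 0..=5.
    fn from_level(level: u8) -> Option<Self> {
        match level {
            0 => Some(Self::TranscriptOnly),
            1 => Some(Self::ProofUnverified),
            2 => Some(Self::PartiallyVerified),
            3 => Some(Self::FullyVerified),
            4 => Some(Self::CrossImplementation),
            5 => Some(Self::ThirdPartyReproduced),
            _ => None,
        }
    }

    /// Returns the numeric level for inclusion in verification output.
    fn level(self) -> u8 {
        self as u8
    }
}
```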

The Adversarial Corpus

We did not stop at fixing known bugs. We built a comprehensive corpus of deliberately malformed proofs designed to exercise every rejection path in the verifier. The corpus is not a fuzzer output — it is a curated collection of adversarial inputs with known-correct expected outputs.

The corpus contains 26 canonical test vectors organized into seven categories:

Category 1: Tampered Commitments. Valid proofs with one or more Merkle commitments replaced by arbitrary hashes. Tests that the verifier detects commitment/decommitment inconsistencies at every layer.
Category 2: Swapped FRI Layers. Valid proofs with FRI layer ordering permuted. Tests that the verifier enforces layer sequencing and catches inconsistent polynomial degree reduction.
Category 3: Invalid Query Positions. Valid proofs with query indices set to out-of-range values, duplicate indices, or indices that reference non-existent tree leaves. Tests bounds validation and deduplication.
Category 4: Truncated Proofs. Valid proofs with trailing bytes removed at different cut points. Tests that the verifier handles incomplete data gracefully and produces a specific truncation error rather than panicking or producing undefined behavior.
Category 5: Malformed OOD Evaluations. Valid proofs with out-of-domain evaluation values replaced by incorrect field elements. Tests that the verifier catches constraint violations at the OOD sampling point.
Category 6: Configuration Mismatches. Valid proofs paired with incorrect verification configurations (wrong field, wrong expansion factor, wrong query count). Tests that configuration binding prevents cross-parameter verification.
Category 7: Replay Attempts. Components from one valid proof inserted into a different valid proof. Tests domain separation between independent proof instances.
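The Category 4 requirement, a specific truncation error instead of a panic, can be sketched with a bounds-checked reader. The length-prefixed layout (4-byte big-endian length followed by the field body) and the `ParseError` type are assumptions for illustration; the real proof encoding is not described here.

```rust
/// Hypothetical parse error carrying enough detail to locate the cut point.
#[derive(Debug, PartialEq, Eq)]
enum ParseError {
    Truncated { offset: usize, needed: usize, available: usize },
}

/// Returns `needed` bytes starting at `offset`, or a truncation error.
fn take(buf: &[u8], offset: usize, needed: usize) -> Result<&[u8], ParseError> {
    let available = buf.len().saturating_sub(offset);
    if offset > buf.len() || needed > available {
        return Err(ParseError::Truncated { offset, needed, available });
    }
    Ok(&buf[offset..offset + needed])
}

/// Reads one length-prefixed field and returns it with the next offset.
fn read_field(buf: &[u8], offset: usize) -> Result<(&[u8], usize), ParseError> {
    let header = take(buf, offset, 4)?;
    let len = u32::from_be_bytes([header[0], header[1], header[2], header[3]]) as usize;
    let body = take(buf, offset + 4, len)?;
    Ok((body, offset + 4 + len))
}
```

Every slice access goes through `take`, so a proof cut at any byte yields a `Truncated` error naming where the data ran out.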

Every expected output in the corpus was generated by running the actual verifier — not hand-typed, not approximated, not derived from the specification alone. The verifier produced the rejection, we recorded it, and that recording became the canonical expected output. This eliminates a class of testing error where the expected output is what the developer thinks should happen rather than what actually happens.

The 26 vectors are published. They are the conformance test suite. Any implementation that produces identical outputs for all 26 vectors is conformant. Any implementation that diverges on even one vector is not.

Freezing the Protocol

With the regression tests, hardening layers, and adversarial corpus in place, we froze the protocol. "Frozen" has a specific meaning in our engineering process. It does not mean "we stopped changing it." It means: any change to the frozen surface requires a major version increment and a 12-month migration window during which both the old and new behaviors must be supported simultaneously.

What we froze:

Verifier Output Schema. The JSON structure of verification results is frozen. Fields cannot be renamed, removed, or reordered. New fields may only be added at the end of objects with default values that preserve backward compatibility.
Error Code Namespace. 26 error codes, each with a severity level (Critical, High, Medium, Low, Info), a machine-readable identifier, and a human-readable description. No error code can be reassigned to a different meaning. New codes may only be added, never removed or redefined.
Replay Integrity Semantics. The six replay integrity levels (0 through 5) have frozen definitions. What constitutes each level cannot change. The boundary between "Level 3: fully verified, single implementation" and "Level 4: cross-implementation verified" is permanent.
Deterministic Ordering Rules. When the verification output contains multiple items (multiple constraint violations, multiple FRI layer results), the ordering is deterministic and frozen. Constraint violations are ordered by constraint index. FRI results are ordered by layer. Query results are ordered by position.
Domain Separator Registry. 27 domain separators used throughout the protocol. Each is a fixed byte string with a frozen assignment. The separator for "FRI layer 0" is permanently distinct from "FRI layer 1" and from "OOD evaluation" and from "constraint composition." No separator can be reassigned.
Proof Profiles. Seven proof profiles (Minimal, Standard, Extended, Full, Archival, Forensic, Compliance) define which verification steps are performed and what is included in the output. Each profile's requirements are frozen.
Hex Encoding Rules. All byte arrays in the output use lowercase hex encoding without a "0x" prefix. This is frozen. A verifier that outputs uppercase hex or includes a prefix is non-conformant.
Canonical Whitespace Behavior. JSON outputs use no unnecessary whitespace (compact encoding). When pretty-printed for human consumption, indentation is 2 spaces. This is frozen to enable byte-level comparison of outputs.
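The hex encoding rule is simple enough to state in code. The sketch below shows one way to produce and check the canonical form; the function names are illustrative, and the even-length check assumes the string always encodes whole bytes.

```rust
/// Encodes bytes as lowercase hex with no "0x" prefix.
fn to_canonical_hex(bytes: &[u8]) -> String {
    const DIGITS: &[u8; 16] = b"0123456789abcdef";
    let mut out = String::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(DIGITS[(b >> 4) as usize] as char);
        out.push(DIGITS[(b & 0x0f) as usize] as char);
    }
    out
}

/// Returns true if `s` contains only 0-9 and a-f and has even length.
/// Uppercase digits and a "0x" prefix both fail this check.
fn is_canonical_hex(s: &str) -> bool {
    s.len() % 2 == 0 && s.bytes().all(|c| matches!(c, b'0'..=b'9' | b'a'..=b'f'))
}
```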

Freezing is not conservatism. It is the recognition that protocol stability is a feature. When a regulator replays a proof five years from now, the verification output must be identical to what was produced today. When an insurer validates a claim, the evidence format cannot have shifted. When an independent auditor implements the specification, they must be able to produce byte-identical results without contacting us.

Replay-Grade Evidence

We use the term "replay-grade" to describe a specific standard of cryptographic evidence. It is not a marketing phrase. It is a technical requirement with a precise definition.

Replay-grade means: the same proof, given to any conformant verifier implementation, produces:

Identical transcript derivation (same challenges at same steps)
Identical replay integrity classification (same level assignment)
Identical rejection semantics (same error code, same failing step, same expected/actual values)
Identical verification output (same JSON structure, same field values, same ordering)

This is a stronger guarantee than "the same proof verifies correctly in multiple implementations." Correct verification is necessary but not sufficient. Replay-grade requires that the entire verification process — including failures, partial results, and diagnostic information — is deterministic and reproducible.

Replay-grade evidence enables capabilities that non-replay-grade systems cannot provide:

Insurer-grade auditability. A cyber insurance claim can be validated by replaying the original proof through a conformant verifier. The insurer does not need to trust the claimant's infrastructure. The insurer does not need to access the claimant's systems. The proof and the specification are sufficient.

Regulator replay. A financial regulator can independently verify that a specific computation was performed correctly at a specific time. The replay produces the same result today that it produced when the computation was first verified. No degradation. No ambiguity.

Forensic reconstruction. In a security incident, the verification history can be replayed to identify exactly when and how a compromise occurred. The replay is deterministic — it cannot be disputed or reinterpreted.

Long-term archival verification. A proof verified today must still produce the same verification output in 10, 20, or 50 years. The frozen protocol guarantees this. The serialization format, the challenge derivation, the error codes — all are immutable.

Independent third-party validation. A third party can implement the specification from scratch, run the conformance vectors to validate their implementation, and then verify any proof in the ecosystem. No dependency on H33 code. No dependency on H33 infrastructure. The specification is the authority.

The Implementation Boundary

The canonical STARK engine, verifier, and security regression suite are implemented in Rust. This is not incidental. The implementation language was chosen for deterministic execution, absence of garbage collection pauses, control over memory layout, and compile-time enforcement of invariants that other languages check at runtime or not at all.

No JavaScript or browser runtime exists in the cryptographic hot path. The verification pipeline runs as native compiled code. There is no interpreter overhead, no JIT variability, no runtime that could introduce non-determinism through optimization decisions.

JSON is used in exactly one context: portable conformance vectors. The 26 test vectors are distributed as JSON files that any implementation in any language can parse. The JSON contains hex-encoded field elements, Merkle paths, and expected verification outputs. This is the interoperability layer — the minimum portable format for cross-implementation testing.

The test suite totals 524+ tests across the STARK verification infrastructure:

252 STARK verification tests (constraint evaluation, FRI, transcript, composition)
26 canonical conformance vectors (adversarial corpus)
12 security regression tests (SEC-REG-001 through SEC-REG-012)
234+ additional unit, integration, and property tests

Every test is deterministic. No randomness, no timing dependencies, no network calls. A test that passes on one machine passes on every machine. A test that fails on one machine fails on every machine. This is not an aspiration; it is enforced by the test infrastructure itself.

Independent Verification

We published the conformance vectors publicly. We did not publish them with caveats, restrictions, or requirements to use H33 code. The vectors are standalone. They contain everything needed to validate a conformant implementation: input proofs, verification configurations, and byte-exact expected outputs.

We issued an open challenge: reproduce the 26 vector outputs in any language. Rust, Go, Python, C, Haskell, JavaScript — any language that can perform SHA3-256 hashing and modular arithmetic can implement a conformant verifier. The success criterion is simple: byte-identical output hashes for all 26 vectors. No interpretation required. No judgment calls. Either the hashes match or they do not.

No H33 code is required for this challenge. The specification defines the transcript generation rules, the challenge derivation algorithm, the serialization format, the error code assignments, and the verification steps. An implementer reads the specification, builds a verifier, runs the vectors, and compares outputs. If all 26 match, the implementation is conformant.
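A conformance run reduces to a byte comparison per vector. The sketch below compares raw output bytes directly and reports where each non-conformant vector first diverges. Hashing the outputs with SHA3-256, as the challenge describes, would sit on top of this and is omitted because it needs an external crate. The `VectorResult` type and the vector identifiers are hypothetical.

```rust
/// One conformance vector's expected and actual verifier output.
struct VectorResult {
    id: String,
    expected: Vec<u8>,
    actual: Vec<u8>,
}

/// Returns the offset of the first differing byte, or None if identical.
/// If one output is a prefix of the other, the offset is the shorter length.
fn first_mismatch(expected: &[u8], actual: &[u8]) -> Option<usize> {
    if expected == actual {
        return None;
    }
    let differing = expected.iter().zip(actual).position(|(e, a)| e != a);
    Some(differing.unwrap_or(expected.len().min(actual.len())))
}

/// Lists every vector whose output is not byte-identical to the expected one.
fn non_conformant(results: &[VectorResult]) -> Vec<(&str, usize)> {
    results
        .iter()
        .filter_map(|r| first_mismatch(&r.expected, &r.actual).map(|off| (r.id.as_str(), off)))
        .collect()
}
```

An empty list from `non_conformant` corresponds to the success criterion; a single stray space in a JSON output is enough to put a vector on the list.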

We also published our cryptographic guarantees explicitly. What we prove:

That a specific computation was performed (computational integrity)
That the computation satisfies specific constraints (constraint satisfaction)
That the proof was generated for a specific configuration (configuration binding)
That the verification transcript is deterministic (replay integrity)

What we do NOT prove:

That the input data was correct (we verify computation, not truth)
That the prover was honest about their identity (we verify proofs, not identities)
That the constraints are appropriate for the use case (we verify against constraints, not against intent)
Zero-knowledge (our proofs are transparent STARKs; the prover's inputs are committed but the proof itself reveals the computation structure)

Every assumption is documented. Every boundary is explicit. This is how you build systems that can be trusted — not by claiming perfection, but by specifying exactly what is and is not guaranteed.

What This Means

For different stakeholders, replay-grade cryptographic evidence means different things. But the underlying capability is the same: deterministic, reproducible, independently verifiable proofs of computation.

For enterprises: Replay-grade audit trails mean that your compliance evidence is not trapped in a vendor's dashboard. It is portable. It is independently verifiable. It survives vendor transitions, platform migrations, and organizational changes. Your audit trail is a mathematical object, not a database record.

For regulators: Reconstructable evidence means you do not need to trust the regulated entity's self-reporting. You can replay. You can verify independently. You can compare outputs across institutions using the same protocol. The evidence format is standardized, frozen, and deterministic.

For insurers: Verifiable claims mean that when a policyholder claims they were running cryptographic verification at the time of a breach, you can verify that claim independently. The proof either replays correctly or it does not. The claim is mathematical, not testimonial.

For engineers: Deterministic, reproducible, independently verifiable proofs mean that you can build systems with confidence that the verification layer will not introduce non-determinism, ambiguity, or implementation-dependent behavior. The protocol is a fixed target. You implement against the specification. You validate against the vectors. You ship with confidence.

The Operational Principle

Operational systems should become replayable by default. Not as an afterthought. Not as a compliance add-on. As a fundamental architectural property. If you cannot replay the verification, you cannot prove the verification happened. And if you cannot prove it happened, you cannot claim it happened.

252 STARK tests. 26 canonical conformance vectors. 12 security regressions frozen forever. A protocol that produces identical outputs regardless of implementation. Evidence that survives time, disputes, and independent scrutiny.

This is not a higher standard of testing. It is a different category of assurance. The protocol is frozen. The evidence is replay-grade. The specification is public. Anyone can verify.
