How to Prove an AI Decision Was Made Correctly: Audit Trails for Automated Systems

Eric Beans, CEO, H33.ai, Inc.

An algorithm denied your mortgage application this morning. Another algorithm flagged a wire transfer as potentially suspicious and blocked it for 72 hours. A third algorithm triaged an insurance claim and routed it to the fast-track denial queue. A fourth algorithm scored a job applicant and filtered them out before a human ever saw the resume.

Each of these decisions has material consequences for a real person. Each is subject to regulatory requirements for auditability, fairness, and explainability. And none of them can be proven. Not proven in the sense that regulators, litigants, and affected individuals increasingly demand: independently verifiable, tamper-evident proof that the correct model, with the correct authority, processed the correct input, and produced the specific output that determined the outcome.

The AI accountability gap is not a theoretical problem. It is a regulatory, legal, and operational crisis that is already producing enforcement actions, litigation losses, and consumer harm at scale. And it is widening every quarter as organizations deploy more automated decision systems with the same fundamentally broken approach to auditability.

Where AI Makes Material Decisions Today

To understand the scope of this problem, consider the categories of decisions that AI systems now make or materially influence across regulated industries. These are not experimental deployments. These are production systems processing millions of decisions per day.

Lending and Credit Decisions (ECOA / Fair Lending)

The Equal Credit Opportunity Act and the Fair Housing Act prohibit discrimination in lending. Regulation B requires creditors to provide specific reasons for adverse actions. When an AI model contributes to a lending decision, the creditor must be able to demonstrate that the model does not produce disparate impact across protected classes, and that the adverse action reasons provided to the applicant accurately reflect the factors that drove the decision.

Today, most lenders log the model's input features and output score. They retain the model version identifier and the decision timestamp. But the log is a database record that can be modified. The model version identifier points to a model registry entry that can be updated. The input preprocessing pipeline that transformed raw applicant data into model features is not captured in the audit record at all. And the authority chain, meaning the governance process that approved this specific model version for production use in lending decisions, exists only in meeting minutes and email threads that are not cryptographically linked to the model's deployment.

When a fair lending examiner asks "prove that the model that made this decision was the model that was approved through your model risk management process," the lender cannot provide mathematical proof. They can provide narrative: "We have a model governance committee that approved model version 3.2.1 on this date, and our deployment records show version 3.2.1 was in production on the date of the decision." But the narrative depends on the integrity of multiple systems, none of which produce tamper-evident records.

Sanctions Screening (BSA/AML)

The Bank Secrecy Act requires financial institutions to screen transactions and customer relationships against sanctions lists maintained by OFAC and other authorities. AI-powered sanctions screening systems process millions of transactions per day, making real-time decisions about whether a transaction should be blocked, held for review, or allowed to proceed.

The regulatory expectation is clear: for every transaction that was screened, the institution must be able to demonstrate which screening model was used, which sanctions lists were loaded at the time of screening, what the model's confidence score was, and how the disposition decision was reached. For transactions that were allowed to proceed, the institution must be able to prove that the screening was performed and that the transaction did not match any sanctioned party at the time of processing.

Today, the audit trail for AI-driven sanctions screening consists largely of API call logs. The screening platform records that a transaction was submitted, a score was returned, and a disposition was assigned. But the sanctions list version is typically a reference to an external data feed that may have been updated since the screening occurred. The model version is a pointer, not a commitment. And the input normalization, the process of standardizing names, addresses, and entity identifiers before screening, is performed in a preprocessing layer that produces no audit record at all.

Insurance Claims Triage (State DOI)

State departments of insurance regulate claims handling practices, including the use of automated systems in claims triage and adjudication. AI systems that route claims, estimate reserves, or make initial coverage determinations are subject to unfair claims settlement practices acts and market conduct examinations.

When a state examiner reviews an insurer's claims handling practices and discovers that AI systems are making or influencing coverage decisions, the examiner will ask for evidence of the decision-making process. The insurer must demonstrate that the AI system was properly validated, that it does not produce unfair outcomes across demographic groups, and that individual claims were handled consistently with the policy terms and applicable law.

The audit trail for most AI-powered claims systems consists of a claims management system record showing the AI's recommendation and the final disposition. The model's internal reasoning, the features that drove the recommendation, the confidence level, the version of the model, and the authorization chain that approved the model for use in claims handling are either not recorded or recorded in systems that do not produce independently verifiable evidence.

Clinical Decision Support (FDA / HIPAA)

AI-powered clinical decision support systems assist clinicians in diagnosis, treatment planning, and risk assessment. The FDA regulates certain clinical decision support tools as medical devices, and HIPAA requires audit controls for systems that access protected health information. When an AI system contributes to a clinical decision that results in patient harm, the healthcare organization must be able to reconstruct exactly what the AI system recommended, based on what inputs, using what model version, and under what clinical authority.

The stakes in clinical AI are measured in patient outcomes, not just dollars. When a clinical decision support system recommends against further diagnostic testing and that recommendation contributes to a delayed diagnosis, the organization faces both malpractice liability and regulatory exposure. The ability to prove exactly what the AI system said, and that the recommendation was based on the correct patient data processed by the correct model version, is essential to both legal defense and clinical quality improvement.

Hiring Recommendations (EEOC / Local Bias Laws)

AI-powered hiring tools screen resumes, score candidates, and make recommendations about which applicants should advance in the hiring process. The EEOC has made clear that employers are liable for discriminatory outcomes produced by AI tools, regardless of whether the employer developed the tool or purchased it from a vendor. Local laws, including New York City's Local Law 144, impose specific audit requirements on automated employment decision tools.

The audit requirements for hiring AI are particularly demanding because the affected population is large (every applicant who was screened), the consequences of discrimination are severe (both for individuals and for the employer), and the regulatory scrutiny is intensifying. An employer must be able to demonstrate that the AI tool was audited for bias, that the version in production at the time of each decision was the audited version, and that the tool's recommendations were based on job-related criteria.

The common thread: In every one of these domains, AI systems are making decisions with material consequences for real people. In every one of them, regulators require auditability. And in every one of them, the current state of AI audit trails is a log file that proves nothing.

Why Current AI Audit Trails Prove Nothing

The fundamental problem with AI audit trails today is that they record claims, not proofs. A log entry that says "Model v3.2.1 processed input X and returned output Y at time T" is a claim. It claims that a specific model version was used. It claims that the input was what it says it was. It claims that the output was what it says it was. It claims that the timestamp is accurate.

But claims are only as trustworthy as the system that produces them. And the system that produces AI audit logs is the same system that runs the AI model. If that system is compromised, or if an administrator modifies the logs, or if the model registry is updated after the fact, the claims are worthless.
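A minimal sketch, in Python with SHA-256 as the hash, of the difference between a claim and a commitment; the record contents are hypothetical:

```python
import hashlib
import json

# A conventional audit log entry is just data. Anyone with write access can
# change it, and nothing in the record itself reveals that a change happened.
log_entry = {
    "model": "v3.2.1",
    "input_id": "APP-10021",
    "output": {"score": 612, "decision": "deny"},
    "timestamp": "2025-03-04T14:02:11Z",
}

def commitment(record: dict) -> str:
    """Hash a canonical serialization of the record. The digest, anchored
    somewhere the producing system cannot rewrite, is what can be verified."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

anchored = commitment(log_entry)

# A later "correction" to the log is detectable: the recomputed digest no
# longer matches the anchored commitment.
log_entry["output"]["decision"] = "approve"
assert commitment(log_entry) != anchored
```

The log entry on its own proves nothing; the anchored digest is what turns it into evidence.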

The Five Gaps in AI Audit Trails

There are five specific gaps that render current AI audit trails inadequate for regulatory, legal, and governance purposes.

Gap 1: The log is mutable. AI audit logs are stored in databases, file systems, or log aggregation platforms. All of these storage systems allow modification by authorized users. A database administrator can update a log record. A system administrator can delete a log file. A log aggregation platform can be configured to overwrite older entries. There is no cryptographic binding between the log entry and the event it describes. The log entry is a copy that can diverge from reality without detection.

Gap 2: The model version is not committed. When an AI audit log records that "model v3.2.1" was used, it records a string identifier that points to an entry in a model registry. But the model registry itself is mutable. The binary artifact associated with "v3.2.1" can be replaced. The metadata can be updated. The training data lineage can be altered. There is no cryptographic commitment that binds the model identifier in the audit log to a specific, immutable model artifact.

Gap 3: Input preprocessing is not captured. Between the raw input (a loan application, a transaction record, a medical image) and the model's input tensor, there is a preprocessing pipeline that normalizes, transforms, and encodes the data. This preprocessing can materially affect the model's output. A change in how names are normalized in a sanctions screening system can change whether a transaction is flagged. A change in how features are encoded in a lending model can change the credit score. But the preprocessing pipeline is rarely versioned, and its state at the time of each decision is almost never captured in the audit trail.

Gap 4: The authority chain is absent. Who authorized this model to make this type of decision? What governance process approved it? When was the approval granted, and was it still valid at the time of the decision? These questions have answers, but the answers exist in meeting minutes, email threads, Jira tickets, and policy documents that are not cryptographically linked to the model's deployment or its individual decisions. The authority chain is narrative, not proof.

Gap 5: The timestamp is self-reported. The timestamp on an AI audit log entry is generated by the system clock of the machine that produced the entry. System clocks can be changed. Timestamps can be backdated. There is no independent attestation of when the decision actually occurred. For decisions with regulatory deadlines (sanctions screening must occur before transaction settlement, adverse action notices must be sent within specific timeframes), the inability to independently verify timing is a critical gap.

| Audit Trail Component | Current State | Regulatory Expectation |
| --- | --- | --- |
| Log integrity | Mutable database record | Tamper-evident, independently verifiable |
| Model version | String pointer to mutable registry | Cryptographic commitment to specific artifact |
| Input data | Raw input logged (maybe) | Full preprocessing pipeline captured |
| Authority chain | Meeting minutes and emails | Cryptographic proof of authorization |
| Timestamp | Self-reported system clock | Independently attested time |
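As a concrete illustration of Gaps 2 and 3, here is a minimal sketch (Python, SHA-256, with hypothetical file names and version labels) of committing to the model artifact bytes rather than a registry label, and to the preprocessed features plus the pipeline version rather than the raw input alone:

```python
import hashlib
import json

def model_commitment(artifact_path: str) -> str:
    """SHA-256 over the raw artifact bytes (weights, architecture, config),
    streamed in chunks so large files are handled. A registry label such as
    'v3.2.1' can be repointed; this digest matches only the exact bytes."""
    h = hashlib.sha256()
    with open(artifact_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def input_commitment(raw_record: dict, features: list, pipeline_version: str) -> str:
    """Commit to what the model actually saw: the preprocessed feature vector
    and the preprocessing version, alongside a hash of the raw record."""
    payload = {
        "raw_sha256": hashlib.sha256(
            json.dumps(raw_record, sort_keys=True).encode()
        ).hexdigest(),
        "features": features,
        "pipeline_version": pipeline_version,
    }
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()

# Stand-in artifact so the example runs end to end; in production the path
# would point at the serialized model the serving process actually loads.
with open("model_stub.bin", "wb") as f:
    f.write(b"\x00" * 1024)

m = model_commitment("model_stub.bin")
x = input_commitment({"applicant": "Jane Q. Doe", "income": 84000},
                     [0.42, 1.0, 0.0, 0.37], pipeline_version="prep-2024.11")
```

If the registry artifact is replaced or the name-normalization logic changes, the recomputed commitments diverge from the ones recorded at decision time, and the divergence is detectable.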

What Regulators Actually Want

The regulatory landscape for AI auditability is converging across jurisdictions and frameworks. Despite differences in terminology and enforcement mechanisms, regulators are asking for the same fundamental capability: proof that the correct model, with the correct authority, processed the correct input and produced the specific output, and that this chain of facts is independently verifiable and tamper-evident.

EU AI Act: Articles 13 and 14

The EU AI Act, which entered into force in 2024 and whose obligations for high-risk AI systems phase in through 2026 and 2027, imposes specific transparency and human oversight requirements. Article 13 requires that high-risk AI systems be designed and developed in such a way that their operation is sufficiently transparent to enable deployers to interpret the system's output and use it appropriately. Article 14 requires that high-risk AI systems be designed to allow effective human oversight, including the ability to understand the system's capacities and limitations and to properly monitor its operation.

These requirements are not satisfied by narrative documentation alone. The deployer must be able to demonstrate, for any specific decision, that the system operated within its intended purpose, that the appropriate human oversight was in place, and that the system's output was traceable to its inputs. This requires an audit trail that is more than a log. It requires proof.

Federal Reserve SR 11-7: Model Risk Management

The Federal Reserve's SR 11-7 guidance on model risk management has been the foundational framework for AI governance in US financial services since 2011. It requires financial institutions to maintain effective model governance, including model validation, ongoing monitoring, and comprehensive documentation of model development, implementation, and use.

SR 11-7 explicitly requires that institutions be able to demonstrate that the model in production is the model that was validated. This requires a chain of evidence linking the validation results to the specific model artifact that is processing live data. Today, this chain is maintained through manual processes: change management tickets, deployment records, and periodic model inventory reviews. None of these produce independently verifiable proof that the production model matches the validated model at any given moment.

OCC Bulletin 2023-17: Third-Party AI Governance

The OCC's 2023 guidance on third-party risk management applies to AI systems provided by third-party vendors. Banks that use third-party AI for decision-making must apply the same governance standards to vendor-provided models as they apply to internally developed models. This includes the ability to audit the model's behavior, validate its outputs, and maintain comprehensive records of its use.

For third-party AI, the audit trail challenge is compounded by the fact that the bank does not control the model's development or deployment. The bank relies on the vendor's representations about model versioning, training data, and performance. When an examiner asks the bank to prove that the vendor's model was performing as expected at the time of a specific decision, the bank must rely on evidence provided by the vendor, which the bank has no ability to independently verify.

Closing the Gap: Cryptographic Attestation for AI Decisions

The solution to the AI audit trail problem is not more logging. More logging produces more claims. What is needed is a mechanism that produces proofs: independently verifiable, tamper-evident, cryptographic commitments that bind together all of the elements of an AI decision into a single, unforgeable record.

What an Attested AI Decision Looks Like

When an AI system produces a decision, the attestation process captures five elements and commits them to a single cryptographic proof:

Input hash: A cryptographic hash of the complete input to the model, including the output of the preprocessing pipeline. This does not store the input data itself (which may be subject to privacy restrictions). It stores a commitment that can later be verified against the original input if the input is available. If anyone modifies the input data after the fact, the hash will not match.

Model version commitment: A cryptographic hash of the model artifact (weights, architecture, configuration) that processed the input. This is not a string identifier like "v3.2.1." It is a hash of the actual binary artifact. If the model registry is updated after the fact, the commitment will not match the new artifact. The proof binds the decision to the specific model that made it.

Output hash: A cryptographic hash of the model's complete output, including confidence scores, feature attributions, and any intermediate results that are relevant to the decision. This commitment proves that the output recorded in the audit trail is the output the model actually produced.

Authority chain: A cryptographic proof that the model was authorized to make this type of decision at the time the decision was made. This links the decision to the governance process that approved the model's deployment, using a chain of digital signatures from the approving authorities. If the model's authorization was revoked or expired at the time of the decision, the authority chain will reflect that.

Attested timestamp: A timestamp that is independently attested, not self-reported by the system making the decision. The attestation is hash-chained to the previous attestation, creating a temporal ordering that is tamper-evident. Inserting, deleting, or reordering decisions breaks the chain.

The result: a single 74-byte attestation that cryptographically commits all five elements, hash-chained to the attestation before it and, through the chain, to every attestation that came before. Secured by three independent post-quantum cryptographic families. Independently verifiable by any party with the verification key, without access to the AI system that made the decision.
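A minimal sketch of how the five elements might be bound and chained, with SHA-256 and the local clock standing in for H33's three post-quantum families and attested time; the field names and layout are illustrative, not the 74-byte wire format:

```python
import hashlib
import json
import time
from dataclasses import dataclass, asdict

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def canonical(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()

@dataclass
class Attestation:
    input_hash: str      # commitment to the preprocessed model input
    model_hash: str      # commitment to the exact model artifact bytes
    output_hash: str     # commitment to the full output, scores and attributions included
    authority_hash: str  # commitment to the signed approval that authorized this model
    timestamp: float     # a real deployment would use independently attested time
    prev_hash: str       # link to the previous attestation in the chain

    def digest(self) -> str:
        return sha256_hex(canonical(asdict(self)))

def attest(features, model_bytes, output, authority_record, prev_hash) -> Attestation:
    return Attestation(
        input_hash=sha256_hex(canonical(features)),
        model_hash=sha256_hex(model_bytes),
        output_hash=sha256_hex(canonical(output)),
        authority_hash=sha256_hex(canonical(authority_record)),
        timestamp=time.time(),
        prev_hash=prev_hash,
    )

# Two chained decisions: deleting, inserting, or reordering either one breaks
# the chain, because each digest depends on the digest before it.
approval = {"approved_by": "model-risk-committee", "model": "credit-v3.2.1",
            "valid_until": "2026-01-01"}
a1 = attest([0.42, 1.0, 0.0], b"model artifact bytes",
            {"score": 612, "decision": "deny"}, approval, prev_hash="0" * 64)
a2 = attest([0.11, 0.0, 1.0], b"model artifact bytes",
            {"score": 701, "decision": "approve"}, approval, prev_hash=a1.digest())
```

The sketch shows the binding, not the production construction: H33's actual attestation compresses these commitments into the 74-byte format described below.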

H33-74: The Attestation Footprint

Each AI decision attestation produces an H33-74 proof totaling 74 bytes. Thirty-two bytes are committed on-chain as a permanent, immutable anchor. Forty-two bytes are stored in Cachee, the post-quantum caching layer, for high-performance retrieval during audits and examinations. The three post-quantum cryptographic families ensure the proof remains valid for the full regulatory retention period, even as quantum computing advances make classical cryptographic methods vulnerable.

| Attestation Component | What It Proves | Gap It Closes |
| --- | --- | --- |
| Input hash | The exact input the model received | Input preprocessing capture |
| Model version commitment | The exact model artifact used | Model version binding |
| Output hash | The exact output the model produced | Log integrity |
| Authority chain | Who authorized the model for this decision type | Governance linkage |
| Attested timestamp | When the decision was actually made | Timestamp independence |

THAT vs. WHY: The Two Halves of AI Accountability

There is an important distinction that is frequently lost in discussions about AI auditability. There are two fundamentally different questions that regulators, litigants, and affected individuals ask about AI decisions:

"THAT" questions: Did this specific model process this specific input and produce this specific output at this specific time under this specific authorization? These are questions about what happened. They require proof of computation, not explanation of reasoning.

"WHY" questions: Why did the model produce this output? What features drove the decision? Would a different input have produced a different outcome? These are questions about the model's reasoning. They require explainability tools: SHAP values, LIME explanations, counterfactual analysis, and other techniques from the field of explainable AI (XAI).

Both halves are necessary. Neither is sufficient alone. An XAI explanation that says "the model denied the loan because of debt-to-income ratio" is meaningless if you cannot prove that the model that produced the explanation is the same model that made the decision, or that the input data used for the explanation is the same data that was processed in real time. Conversely, a cryptographic proof that a specific model processed a specific input is meaningless if you cannot explain why the model made the decision it made.

H33 provides the THAT. We do not replace XAI tools. We make them trustworthy by providing the cryptographic foundation that proves the XAI explanation is about the right model, the right input, and the right output. Without that foundation, XAI explanations are just more claims. With it, they become evidence.
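To make the division of labor concrete: an explanation artifact can carry the same commitments as the attestation, so a verifier can check that the WHY refers to the attested THAT. The sketch below is illustrative; the field names and attribution values are hypothetical.

```python
import hashlib, json

def sha256_hex(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

def canon(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()

# The THAT: commitments recorded at inference time.
attested = {
    "input_hash": sha256_hex(canon([0.42, 1.0, 0.0])),
    "model_hash": sha256_hex(b"model artifact bytes"),
    "output_hash": sha256_hex(canon({"score": 612, "decision": "deny"})),
}

# The WHY: an explanation generated later by an XAI tool, carrying the same
# commitments so it can be tied to the attested decision rather than to a
# re-run against a different model or a modified input.
explanation = {
    "method": "feature attribution",
    "attributions": {"debt_to_income": -0.31, "credit_history_length": 0.12},
    "explains_input_hash": sha256_hex(canon([0.42, 1.0, 0.0])),
    "explains_model_hash": sha256_hex(b"model artifact bytes"),
}

def explanation_matches(attested: dict, explanation: dict) -> bool:
    return (explanation["explains_input_hash"] == attested["input_hash"]
            and explanation["explains_model_hash"] == attested["model_hash"])

assert explanation_matches(attested, explanation)
```

An explanation that fails this check may still be interesting, but it is not evidence about the decision under examination.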

Decision Replay: Proving Correctness After the Fact

One of the most powerful capabilities enabled by cryptographic AI decision attestation is decision replay. When a regulator, litigant, or internal auditor questions a specific AI decision, the organization needs to reproduce the decision: load the exact model version, provide the exact input, and demonstrate that the model produces the exact same output.

Today, decision replay is unreliable because there is no guarantee that the model version stored in the model registry is the version that was actually used, or that the input data in the audit log is the data that was actually processed. The replay might produce a different result, and the organization has no way to determine whether the discrepancy is because the model changed, the input changed, or both.

With H33-74 attestation, decision replay becomes deterministic. The model version commitment identifies the exact model artifact. The input hash verifies that the replay input matches the original input. The output hash verifies that the replay output matches the original output. If any element does not match, the discrepancy is immediately identified and localized. The organization can determine whether the model artifact changed, the input was modified, or both.
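A sketch of what that verification could look like, assuming attestation records with the illustrative fields used above and SHA-256 as the stand-in hash:

```python
import hashlib, json

def sha256_hex(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

def canon(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()

def verify_replay(attestation: dict, model_bytes: bytes, replay_input, replay_output) -> dict:
    """Compare a replayed decision against the original attestation and
    localize any discrepancy to the model artifact, the input, or the output."""
    return {
        "model_matches": sha256_hex(model_bytes) == attestation["model_hash"],
        "input_matches": sha256_hex(canon(replay_input)) == attestation["input_hash"],
        "output_matches": sha256_hex(canon(replay_output)) == attestation["output_hash"],
    }

# Hypothetical attestation and replay of the same decision.
attestation = {
    "model_hash": sha256_hex(b"model artifact bytes"),
    "input_hash": sha256_hex(canon([0.42, 1.0, 0.0])),
    "output_hash": sha256_hex(canon({"score": 612, "decision": "deny"})),
}
print(verify_replay(attestation, b"model artifact bytes",
                    [0.42, 1.0, 0.0], {"score": 612, "decision": "deny"}))
# All three True here. If model and input match but the output does not, the
# replay environment (library versions, nondeterminism) is the place to look;
# if the model hash differs, the registry no longer holds the artifact that decided.
```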

Cachee, the post-quantum caching layer, enables this replay at operational speed. The attestation proofs, model version commitments, and input hashes are cached for rapid retrieval, allowing an organization to replay any attested AI decision within milliseconds rather than the hours or days typically required to reconstruct a historical model environment. This is not theoretical. This is the same caching infrastructure that delivers sub-microsecond lookup latency in production.

The Regulatory Timeline Is Accelerating

Organizations that view AI audit trail requirements as a future concern are miscalculating the timeline. The EU AI Act's obligations are already phasing in, with high-risk requirements following on a fixed statutory schedule. Federal banking regulators are actively examining AI governance practices in large financial institutions. The CFPB has issued guidance on adverse action notice requirements for AI-driven credit decisions. State insurance regulators are conducting market conduct examinations specifically focused on AI-driven claims practices. The EEOC has issued guidance on AI and employment discrimination.

The enforcement actions are not hypothetical. In the past 18 months, financial regulators have issued consent orders specifically citing inadequate AI model governance and audit trail deficiencies. Insurance regulators have imposed market conduct remediation requirements on carriers whose AI claims systems could not produce adequate audit records. The CFPB has taken enforcement action against lenders whose AI-driven adverse action notices did not accurately reflect the reasons for denial.

Each of these enforcement actions could have been prevented or significantly mitigated by a cryptographic AI audit trail. Not because the audit trail would have changed the AI's behavior, but because it would have provided the independently verifiable evidence that regulators require. An organization that can prove, with mathematical certainty, that the correct model processed the correct input and produced the specific output is in a fundamentally different position than an organization that can only produce mutable log files and narrative assertions.

Implementation: Where to Start

For organizations that recognize the AI audit trail gap and want to close it, the implementation path is straightforward. It does not require replacing existing AI infrastructure. It requires adding an attestation layer that captures the five elements described above at the point of each AI decision.

The highest-priority deployment targets are the decision categories with the most significant regulatory exposure: lending decisions subject to ECOA and fair lending requirements, sanctions screening decisions subject to BSA/AML requirements, insurance claims decisions subject to state unfair claims practices requirements, and any AI system that will be classified as high-risk under the EU AI Act.

The attestation layer integrates at the inference point: after the preprocessing pipeline has prepared the input and before the model's output is acted upon. The integration adds microseconds of latency, not milliseconds. For a lending decision that takes hundreds of milliseconds to produce, the 42-microsecond attestation overhead is immaterial. For a sanctions screening decision that must be made in real time, the overhead is below the noise floor of network latency.
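As a placement sketch only (hypothetical helper names, not the actual integration API), the attestation step wraps the existing inference call: it takes the already-preprocessed features, runs the model, and commits the result before anything downstream acts on it.

```python
import hashlib, json, time

def sha256_hex(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

def canon(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()

def attested_predict(predict, model_hash, features, prev_hash):
    """Run inference on the already-preprocessed features, then commit input,
    model, output, and time before the result is handed downstream."""
    output = predict(features)  # existing model call, unchanged
    record = {
        "input_hash": sha256_hex(canon(features)),
        "model_hash": model_hash,
        "output_hash": sha256_hex(canon(output)),
        "timestamp": time.time(),  # attested time in a real deployment
        "prev_hash": prev_hash,
    }
    return output, sha256_hex(canon(record))

# Stand-in model: downstream systems receive the output together with its proof.
score_loan = lambda feats: {"score": 640, "decision": "refer"}
output, proof = attested_predict(score_loan, "a" * 64, [0.2, 0.9, 0.1], "0" * 64)
```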

The return on investment is measured not in the cost of the attestation infrastructure but in the cost of the enforcement actions, litigation losses, and regulatory remediation that it prevents. As we detailed in our companion post on the cost of failed audits, a single enforcement action triggered by audit trail deficiencies routinely costs $5 million to $15 million. The attestation infrastructure costs a fraction of that, and it produces value on every decision, every day, across every regulated AI system in the organization.

The AI accountability gap is real, it is widening, and it is producing consequences today. The organizations that close it with cryptographic proof will be the ones that can answer the question every regulator is now asking: prove it.

Make Every AI Decision Provable

H33 provides cryptographic attestation for automated decision systems. 74 bytes per decision. Three post-quantum families. Independently verifiable. Close your AI audit trail gap before regulators find it.

Schedule a Demo