AI action verification

Verify what AI agents actually did.

Veridatum compares AI-agent claims with your systems of record, so teams can separate vendor-reported success from business reality.

Read-only, export-first · One workflow · No production writes · No live customer replies
Claim → evidence

An agent's claim is not a business outcome.

Your system of record is wherever the truth actually lives: Stripe, your order database, your billing system. Veridatum checks each reported action against that record and shows whether it held, meaning the change actually took effect and stayed true, not just that the agent reported it done.

What the agent reports
What your system of record shows

Vendor dashboards report their own status, counts, and completions. Veridatum's job is the independent comparison: does that reported status match the systems you already trust?
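As a rough illustration of what "held" means here, the sketch below treats a claim as held only if every read of the system of record inside a chosen window shows the claimed state. The `Observation` shape, the `held` function, and the 7-day window are assumptions made for this example, not Veridatum's actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Observation:
    """One read of the system of record at a point in time (hypothetical shape)."""
    at: datetime
    state: str


def held(claimed_state, claimed_at, observations, window):
    """Return True if the record shows the claimed state throughout the window."""
    end = claimed_at + window
    in_window = [o for o in observations if claimed_at <= o.at <= end]
    # No observations means no evidence, so the claim cannot be called held.
    return bool(in_window) and all(o.state == claimed_state for o in in_window)


t0 = datetime(2025, 1, 1, 9, 0)
stable = [
    Observation(t0 + timedelta(hours=1), "refunded"),
    Observation(t0 + timedelta(days=7), "refunded"),
]
reversed_later = stable[:1] + [Observation(t0 + timedelta(days=2), "reversed")]

refund_held = held("refunded", t0, stable, timedelta(days=7))
refund_reversed = held("refunded", t0, reversed_later, timedelta(days=7))
no_evidence = held("refunded", t0, [], timedelta(days=7))
```

The empty case is deliberate: with no evidence from the record, the agent's report alone does not count as held.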

Where reports break

Questions a vendor dashboard can't settle on its own.

01 · payment state

Refund issued?

An agent says it refunded the customer. Did the payment record post, at the right amount, without reversal or duplication?

02 · account state

Cancellation completed?

An agent says it cancelled an account. Did billing actually stop, and did the entitlement change?

03 · ticket state

Ticket resolved?

An agent says the ticket was resolved. Did the backend state change, or did a human quietly reopen and fix it later?

Verified action ledger

Reported success, checked against business reality.

The output is an evidence-backed ledger: the claimed action, the proof from your system of record, whether it held, our verification status, and the decision. Refunds, credits, cancellations, and billing-state changes are useful first workflows. The artifact covers any backend state an agent can change.

SAMPLE DELIVERABLE PREVIEW · support actions
We only ever read your data, via an export you share or read-only access
Claimed action | System-of-record proof | Hold status | Veridatum status | Decision
Refund issued | Stripe refund posted, £42.50 | Held 7 days, not reversed | Verified | Accept
Cancellation completed | No cancellation in billing record | Still billing at 24h | Disputed | Escalate
Duplicate refund requested | Already refunded once | No change | Excluded | Good restraint
Order address changed | Shipping address changed to new recipient before fulfilment | Held | Verified | Monitor
Credit applied | Wallet balance updated, £15.00 | Held 3 days | Verified | Accept
Ticket closed | Ticket marked resolved | Reopened in 2h | Disputed | Dispute
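One way to picture how a status column like the one above could be derived is the minimal sketch below. The three boolean inputs, the `classify` function, and the `LedgerRow` shape are hypothetical names chosen for illustration. The decision column is left out because it stays a human judgment in the sample.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    VERIFIED = "Verified"
    DISPUTED = "Disputed"
    EXCLUDED = "Excluded"


@dataclass(frozen=True)
class LedgerRow:
    claimed_action: str
    proof: str
    hold_status: str
    status: Status


def classify(mutation_expected: bool, mutation_found: bool, still_true: bool) -> Status:
    """Assumed rule set for this sketch, not a definitive method."""
    if not mutation_expected and not mutation_found:
        # Correct restraint, e.g. a duplicate refund that was rightly not issued.
        return Status.EXCLUDED
    if mutation_expected and mutation_found and still_true:
        return Status.VERIFIED
    return Status.DISPUTED


rows = [
    LedgerRow("Refund issued", "Stripe refund posted, £42.50",
              "Held 7 days, not reversed", classify(True, True, True)),
    LedgerRow("Cancellation completed", "No cancellation in billing record",
              "Still billing at 24h", classify(True, False, False)),
    LedgerRow("Duplicate refund requested", "Already refunded once",
              "No change", classify(False, False, True)),
    LedgerRow("Ticket closed", "Ticket marked resolved",
              "Reopened in 2h", classify(True, True, False)),
]
statuses = [row.status.value for row in rows]
```

Under these assumed rules, the four rows come out as Verified, Disputed, Excluded, and Disputed, matching the sample ledger.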
Design-partner program

Three design-partner slots.

Veridatum is early. We're taking three design partners to run a Verified Action Sprint on one real workflow, and limiting the program so each sprint can directly shape the method. You bring a redacted sample and a short debrief. You get the kind of buyer-readable ledger previewed above, direct input into the method, and design-partner terms.

How a sprint runs

  1. Describe a workflow. No data or credentials to start: just one workflow where an agent changes backend state, and the decision attached to it.
  2. Sample & source-of-truth mapping. A redacted export or scoped read-only sample, mapped to the records that actually hold the truth, with clean joins and messy ones noted.
  3. Verified ledger & recommendation. An evidence-backed ledger of what held, what didn't, and what to accept, dispute, escalate, or exclude.
See what a sprint involves
1–2 WEEKS · ONE WORKFLOW · 3 SLOTS

The Verified Action Sprint.

Start from a redacted export or a scoped read-only sample, as small as a few dozen actions. Leave with a verified action ledger showing what the agent claimed, what your system of record shows, whether it held, and what should be accepted, disputed, escalated, or excluded.

What you bring
  • One workflow where an AI agent changes backend state: refunds, cancellations, credits, order or account changes, entitlement changes, or ticket resolutions.
  • A redacted sample of agent/vendor claims and the matching system-of-record records. No production access or credentials.
  • A named decision the result would inform, and roughly the exposure attached to it.
What you get
  • A verified action ledger and an evidence-backed register of disputed, failed, and corrected actions.
  • Source-of-truth mapping notes and sample evidence packets you can inspect.
  • A recommendation (trust, gate, dispute, monitor, or stop), plus design-partner terms and direct input into the method.
Public method demos

On public benchmarks, Veridatum caught what a dashboard would miss.

We ran the same claim-vs-record verification over public, MIT-licensed agent benchmarks: their task specs, sandbox databases, and result traces. We have no production customers yet; the findings below are our own analysis of that public data.

Reported vs. verified · τ-bench retail
If every reported action is trusted: 100%
Actually held in the system of record: 74%

26% didn't match. Across 456 GPT-4.1 retail simulations, 118 disagreed with the database on final state. That's the gap a vendor dashboard wouldn't surface on its own.

Veridatum analysis · public MIT-licensed benchmark · not real customer data
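The two percentages follow directly from the counts quoted above. A quick check of the arithmetic, with variable names chosen only for this example:

```python
total_runs = 456   # GPT-4.1 retail simulations
mismatched = 118   # runs that disagreed with the database on final state

matched = total_runs - mismatched          # 338 runs matched the database
mismatch_rate = mismatched / total_runs    # about 0.2588
held_rate = matched / total_runs           # about 0.7412

summary = f"{mismatch_rate:.0%} mismatched, {held_rate:.0%} held"
print(summary)  # prints "26% mismatched, 74% held"
```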
METHOD DEMO · public benchmarks
Public benchmark data, not real customer data. Each row cites its source task
Claimed / attempted action | Backend (benchmark) proof | Hold status | Veridatum status | Source
Cancel order after price drop | Order + item status = cancelled; refund £249 to card | Held | Verified | STATE-Bench · 40
Cancel customer's order, refund promised | Wrong order cancelled; db_match = false | Mismatch | Disputed | τ-bench · 51
Second courtesy refund requested | Already refunded; no mutation performed | No change | Excluded | STATE-Bench · 134
τ-bench retail · Sierra Research

A quarter of runs didn't match the business state.

Across 456 GPT-4.1 retail simulations, 118 failed the final-state check: the agent's transcript and the database disagreed. In one, the agent cancelled the wrong order and told the customer a refund was on its way; in another it acted past an explicit "don't cancel anything else" instruction, inflating the refund. 104 of 114 tasks involve state-changing actions.

Microsoft STATE-Bench · enterprise support

Verifying the math, and the safe "no".

Across 150 support tasks with 952 explicit state assertions, the same method checks more than "resolved": refund amounts after clawbacks, waived fees, and multi-leg compound actions. It also verifies restraint: an agent correctly refusing a duplicate refund is a control worth confirming, just as a successful action is.
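To make the two kinds of check concrete, here is a small sketch. The refund rule (amount paid, minus any clawback, minus a fee unless it is waived) is an assumption invented for this example and is not taken from STATE-Bench. The function names and refund identifiers are likewise hypothetical.

```python
from decimal import Decimal


def expected_refund(paid, clawback, fee, fee_waived):
    """Illustrative rule only: paid - clawback - fee, with the fee dropped when waived."""
    charged_fee = Decimal("0") if fee_waived else fee
    return paid - clawback - charged_fee


def amount_verified(claimed, paid, clawback, fee, fee_waived):
    """Check the amount the agent reported against the recomputed amount."""
    return claimed == expected_refund(paid, clawback, fee, fee_waived)


def restraint_verified(refunds_before, refunds_after):
    """A correct refusal leaves the refund records unchanged."""
    return refunds_before == refunds_after


paid, clawback, fee = Decimal("100.00"), Decimal("20.00"), Decimal("5.00")

waived_ok = amount_verified(Decimal("80.00"), paid, clawback, fee, fee_waived=True)
charged_ok = amount_verified(Decimal("75.00"), paid, clawback, fee, fee_waived=False)
overpaid = amount_verified(Decimal("100.00"), paid, clawback, fee, fee_waived=True)
refused_ok = restraint_verified(["re_1"], ["re_1"])
duplicate_issued = restraint_verified(["re_1"], ["re_1", "re_2"])
```

`Decimal` is used instead of floats so that currency comparisons are exact.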

Public synthetic benchmarks demonstrate the verification method and the shape of the artifact. They do not, by themselves, prove buyer demand, real-world error rates, or that production data is this clean. Buyer deployments run on buyer-authorized systems of record.

Who it's for

For teams letting AI agents change business state.

When an agent can move money, close accounts, or change entitlements, someone has to confirm it actually happened, before you trust it, pay for it, or take it into a vendor renewal.

State changes verified

  • Customer state
  • Order state
  • Payment state
  • Account state
  • Entitlement state
  • Support-ticket state

By team

  • Support · Ops · silent failures and cleanup
  • Finance · what to trust, pay, or dispute
  • Procurement · buyer-owned evidence for renewals
  • Risk · Compliance · reconstruct what agents changed
  • BizOps · Data · reliable joins and definitions
A good moment to start
  • An AI vendor renewal, true-up, or review coming up
  • A new agent action type about to go live
  • An incident that needs reconstructing
  • A workflow that moves real money
Trust & data handling

What we touch, and what we don't.

Verification means looking at sensitive records, so the handling matters as much as the method. Here is how a sprint is scoped.

Read-only, export-first

We start from a redacted export or scoped read-only access. A sprint never needs write access.

One workflow at a time

A sprint covers a single workflow and a small sample, often a few dozen actions, not your whole database.

Minimised and redacted

We ask for the smallest sample that proves the method, with identifiers redacted wherever the verification doesn't need them.

Not used to train anything

Your data is used to produce your ledger and nothing else. It is never used to train models.

Deleted on request

Sample data is kept only for the sprint and deleted afterwards on request. An NDA before anything is shared is fine.

No production path

Veridatum sits entirely outside your live systems, with no production writes and no live replies to your customers.

FAQ

Common questions.

Isn't this just our vendor's dashboard, or something we could build ourselves?

Neither is independent, and independence is the point. A vendor's dashboard reports what the vendor counted; your own team can reconcile it for internal use, but in a dispute the vendor contests your numbers first, and risk or compliance sign-offs are built to distrust a check run by the same side it's checking. A neutral, buyer-owned ledger both sides can reference is harder to wave away. The hard part isn't the reconciliation. It's knowing which record holds the truth, and what "actually held" means when two systems disagree, per action type and across messy systems.

Is this invoice reconciliation or AI-spend management?

No. Veridatum checks whether a business action actually became true in your systems, not whether usage arithmetic adds up. It can inform what's payable or what to take into a renewal, but it is not invoice reconciliation or spend management.

Do you need access to our production systems?

No. A sprint runs on a redacted export or scoped read-only access. The first step needs no data at all, just a description of the workflow.

What if our logs and records are messy?

Expected. Part of the sprint is mapping agent/vendor claims to the right source-of-truth records and noting where the join is clean and where it isn't. You get those mapping notes either way.
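A minimal sketch of what such mapping notes could capture, assuming claims and records are plain dictionaries that share a join key. The field names, the `map_claims` function, and the three join labels are hypothetical.

```python
from collections import defaultdict


def map_claims(claims, records, key):
    """Join claims to records on `key` and label each join's quality."""
    by_key = defaultdict(list)
    for record in records:
        if record.get(key) is not None:
            by_key[record[key]].append(record)

    mapped = []
    for claim in claims:
        matches = by_key.get(claim.get(key), [])
        if len(matches) == 1:
            quality = "clean"       # exactly one matching record
        elif matches:
            quality = "ambiguous"   # several candidates, needs a tie-break rule
        else:
            quality = "missing"     # nothing to verify against
        mapped.append({"claim": claim, "matches": matches, "join": quality})
    return mapped


claims = [
    {"order_id": "A1", "action": "refund"},
    {"order_id": "B2", "action": "refund"},
    {"order_id": "C3", "action": "cancel"},
]
records = [
    {"order_id": "A1", "refund_id": "re_1"},
    {"order_id": "B2", "refund_id": "re_2"},
    {"order_id": "B2", "refund_id": "re_3"},
]
join_notes = [m["join"] for m in map_claims(claims, records, "order_id")]
```

Here the three claims come out as clean, ambiguous, and missing. In practice the ambiguous and missing joins are the ones that need a written note.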

How much does a design-partner sprint cost?

Design-partner terms are deliberately light, and scoped to the workflow and the decision attached to it. We'll quote it on a short call once we've confirmed the workflow is a fit. There are three slots.

What do you need from us to start?

One workflow where an agent changes backend state, a redacted sample of the agent's claims and the matching records, and a named decision the result would inform. That's it to begin.

Three slots open

Start with one workflow.

Pick one AI-agent workflow where reported success needs checking against business reality. No data or credentials to start: just a description, and we'll tell you what evidence would verify it.

or email hello@veridatum.io