Refund issued?
An agent says it refunded the customer. Did the payment record post, at the right amount, without reversal or duplication?
Veridatum compares AI-agent claims with your systems of record, so teams can separate vendor-reported success from business reality.
Your system of record is wherever the truth actually lives: Stripe, your order database, your billing system. Veridatum checks each reported action against that record and shows whether it held, meaning the change actually took effect and stayed true, not just that the agent reported it done.
Vendor dashboards report their own status, counts, and completions. Veridatum's job is the independent comparison: does that reported status match the systems you already trust?
An agent says it refunded the customer. Did the payment record post, at the right amount, without reversal or duplication?
An agent says it cancelled an account. Did billing actually stop, and did the entitlement change?
An agent says the ticket was resolved. Did the backend state change, or did a human quietly reopen and fix it later?
The output is an evidence-backed ledger: the claimed action, the proof from your system of record, whether it held, our verification status, and the decision. Refunds, credits, cancellations, and billing-state changes are useful first workflows, but the ledger can cover any backend state an agent can change.
| Claimed action | System-of-record proof | Hold status | Veridatum status | Decision |
|---|---|---|---|---|
| Refund issued | Stripe refund posted, £42.50 | Held 7 days, not reversed | Verified | Accept |
| Cancellation completed | No cancellation in billing record | Still billing at 24h | Disputed | Escalate |
| Duplicate refund requested | Already refunded once | No change | Excluded | Good restraint |
| Order address changed | Shipping address changed to new recipient before fulfilment | Held | Verified | Monitor |
| Credit applied | Wallet balance updated, £15.00 | Held 3 days | Verified | Accept |
| Ticket closed | Ticket marked resolved | Reopened in 2h | Disputed | Dispute |
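The checks behind a row like "Refund issued" can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Veridatum's implementation: the `Claim` and `Record` shapes, the `reference` join key, and the `verify` function are all hypothetical, and amounts are assumed to be stored in pence.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Claim:
    """What the agent reported (hypothetical shape)."""
    reference: str      # assumed join key shared with the system of record
    action: str         # e.g. "refund"
    amount_pence: int


@dataclass
class Record:
    """One row from the system of record (hypothetical shape)."""
    reference: str
    action: str
    amount_pence: int
    is_reversed: bool = False


def verify(claim: Claim, records: List[Record]) -> Tuple[str, str]:
    """Return (status, reason) for a single claimed action."""
    matches = [r for r in records
               if r.reference == claim.reference and r.action == claim.action]
    if not matches:
        return "Disputed", "no matching record posted"
    if len(matches) > 1:
        return "Disputed", "duplicate records posted"
    record = matches[0]
    if record.amount_pence != claim.amount_pence:
        return "Disputed", "amount differs from claim"
    if record.is_reversed:
        return "Disputed", "record was later reversed"
    return "Verified", "posted once, right amount, not reversed"


records = [Record("ord_1", "refund", 4250)]
print(verify(Claim("ord_1", "refund", 4250), records))
print(verify(Claim("ord_2", "refund", 1500), records))
```

The four checks mirror the opening question: did the record post, at the right amount, without reversal or duplication. A real workflow would also need a hold window and a rule for which record wins when two systems disagree.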
Veridatum is early. We're taking three design partners to run a Verified Action Sprint on one real workflow, and limiting the program so each sprint can directly shape the method. You bring a redacted sample and a short debrief. You get the kind of buyer-readable ledger previewed above, direct input into the method, and design-partner terms.
How a sprint runs
Start from a redacted export or a scoped read-only sample, as small as a few dozen actions. Leave with a verified action ledger showing what the agent claimed, what your system of record shows, whether it held, and what should be accepted, disputed, escalated, or excluded.
We ran the same claim-vs-record verification over public, MIT-licensed agent benchmarks: their task specs, sandbox databases, and result traces. We have no production customers yet; the findings below are our own analysis of that public data.
26% didn't match. Across 456 GPT-4.1 retail simulations, 118 disagreed with the database on final state. That's the gap a vendor dashboard wouldn't surface on its own.
| Claimed / attempted action | Backend (benchmark) proof | Hold status | Veridatum status | Source |
|---|---|---|---|---|
| Cancel order after price drop | Order + item status = cancelled; refund £249 to card | Held | Verified | STATE-Bench · 40 |
| Cancel customer's order, refund promised | Wrong order cancelled; db_match = false | Mismatch | Disputed | τ-bench · 51 |
| Second courtesy refund requested | Already refunded; no mutation performed | No change | Excluded | STATE-Bench · 134 |
Across 456 GPT-4.1 retail simulations, 118 failed the final-state check: the agent's transcript and the database disagreed. In one, the agent cancelled the wrong order and told the customer a refund was on its way; in another, it acted past an explicit "don't cancel anything else" instruction, inflating the refund. Of the 114 benchmark tasks, 104 involve state-changing actions.
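For readers who want the arithmetic, here is a small sketch of a final-state check and the mismatch rate it produces. The function names and the example order states are hypothetical; only the counts 118 and 456 come from the analysis above.

```python
def final_state_matches(expected: dict, actual: dict) -> bool:
    """True when the database ended in the state the task specified."""
    return expected == actual


def mismatch_rate(mismatches: int, total: int) -> float:
    """Share of simulations whose final state disagreed with the database."""
    return mismatches / total


# Hypothetical example: the agent cancelled a different order than the one requested.
expected = {"order_A": "cancelled", "order_B": "pending"}
actual = {"order_A": "pending", "order_B": "cancelled"}
print(final_state_matches(expected, actual))  # False

# Counts from the text: 118 of 456 simulations failed the check.
print(f"{mismatch_rate(118, 456):.1%}")  # 25.9%
```

The 25.9% figure is what the headline rounds to 26%.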
Across 150 support tasks with 952 explicit state assertions, the same method checks more than "resolved": refund amounts after clawbacks, waived fees, and multi-leg compound actions. It also verifies restraint, not just successful actions: an agent correctly refusing a duplicate refund is a control worth confirming.
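A restraint check can be as simple as confirming that nothing changed when nothing should have. The sketch below assumes refunds are available as lists of dictionaries with an `order_id` field, taken before and after the agent acted; those shapes and the function names are illustrative only.

```python
from typing import Dict, List


def refund_count(refunds: List[Dict], order_id: str) -> int:
    """Number of refund rows recorded against one order."""
    return sum(1 for r in refunds if r["order_id"] == order_id)


def restraint_held(before: List[Dict], after: List[Dict], order_id: str) -> bool:
    """True when an already-refunded order gained no further refunds."""
    already_refunded = refund_count(before, order_id) >= 1
    unchanged = refund_count(after, order_id) == refund_count(before, order_id)
    return already_refunded and unchanged


before = [{"order_id": "ord_9", "amount_pence": 2000}]
after_refusal = list(before)                                            # agent declined
after_duplicate = before + [{"order_id": "ord_9", "amount_pence": 2000}]  # agent refunded again

print(restraint_held(before, after_refusal, "ord_9"))    # True
print(restraint_held(before, after_duplicate, "ord_9"))  # False
```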
Public synthetic benchmarks demonstrate the verification method and the shape of the artifact. They do not, by themselves, prove buyer demand, real-world error rates, or that production data is this clean. Buyer deployments run on buyer-authorized systems of record.
When an agent can move money, close accounts, or change entitlements, someone has to confirm it actually happened, before you trust it, pay for it, or take it into a vendor renewal.
Verification means looking at sensitive records, so the handling matters as much as the method. Here is how a sprint is scoped.
We start from a redacted export or scoped read-only access. We never ask for write access.
A sprint covers a single workflow and a small sample, often a few dozen actions, not your whole database.
We ask for the smallest sample that proves the method, with identifiers redacted wherever the verification doesn't need them.
Your data is used to produce your ledger and nothing else. It is never used to train models.
Sample data is kept only for the sprint and deleted afterwards on request. An NDA before anything is shared is fine.
Veridatum sits entirely outside your live systems, with no production writes and no live replies to your customers.
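One way to redact identifiers while keeping rows joinable is to replace them with keyed hashes before export. This is a sketch of that general technique, not a description of Veridatum's tooling: the `redact` function, the field names, and the 16-character truncation are assumptions, and the secret would stay with the buyer.

```python
import hashlib
import hmac
from typing import Dict, Iterable


def redact(row: Dict[str, str], secret: bytes,
           identifier_fields: Iterable[str]) -> Dict[str, str]:
    """Replace identifier values with keyed hashes; other fields pass through."""
    redacted = dict(row)
    for field in identifier_fields:
        if field in redacted:
            digest = hmac.new(secret, redacted[field].encode("utf-8"), hashlib.sha256)
            redacted[field] = digest.hexdigest()[:16]
    return redacted


secret = b"kept-by-the-buyer"  # hypothetical key, never shared with the verifier
row = {"email": "jo@example.com", "order_id": "ord_1", "amount_pence": "4250"}
safe = redact(row, secret, ["email"])
print(safe["order_id"], safe["amount_pence"])  # unchanged: ord_1 4250
print(safe["email"] != row["email"])           # True
```

Because the same input always maps to the same token, claims and records redacted with the same secret can still be matched. This is pseudonymisation rather than anonymisation, so fields the verification does not need are better dropped entirely.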
Neither is independent, and independence is the point. A vendor's dashboard reports what the vendor counted; your own team can reconcile it for internal use, but in a dispute the vendor contests your numbers first, and risk or compliance sign-offs are built to distrust a check run by the same side it's checking. A neutral, buyer-owned ledger both sides can reference is harder to wave away. The hard part isn't the reconciliation. It's knowing which record holds the truth, and what "actually held" means when two systems disagree, per action type and across messy systems.
No. Veridatum checks whether a business action actually became true in your systems, not whether usage arithmetic adds up. It can inform what's payable or what to take into a renewal, but it is not invoice reconciliation or spend management.
No. A sprint runs on a redacted export or scoped read-only access. The first step needs no data at all, just a description of the workflow.
Expected. Part of the sprint is mapping agent/vendor claims to the right source-of-truth records and noting where the join is clean and where it isn't. You get those mapping notes either way.
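As a rough illustration of what mapping notes might record, the sketch below classifies each claim by how many source-of-truth rows it joins to. The `claim_id` and `order_id` fields and the three labels are hypothetical; a real mapping would likely need fuzzier keys and per-action rules.

```python
from collections import defaultdict
from typing import Dict, List


def mapping_notes(claims: List[Dict], records: List[Dict],
                  key: str = "order_id") -> Dict[str, str]:
    """Label each claim 'clean', 'missing', or 'ambiguous' by join cardinality."""
    index = defaultdict(list)
    for record in records:
        index[record[key]].append(record)

    notes = {}
    for claim in claims:
        hits = len(index.get(claim[key], []))
        if hits == 1:
            notes[claim["claim_id"]] = "clean"
        elif hits == 0:
            notes[claim["claim_id"]] = "missing"
        else:
            notes[claim["claim_id"]] = "ambiguous"
    return notes


claims = [
    {"claim_id": "c1", "order_id": "ord_1"},
    {"claim_id": "c2", "order_id": "ord_2"},
    {"claim_id": "c3", "order_id": "ord_3"},
]
records = [{"order_id": "ord_1"}, {"order_id": "ord_3"}, {"order_id": "ord_3"}]
print(mapping_notes(claims, records))
# {'c1': 'clean', 'c2': 'missing', 'c3': 'ambiguous'}
```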
Design-partner terms are deliberately light, and scoped to the workflow and the decision attached to it. We'll quote them on a short call once we've confirmed the workflow is a fit. There are three slots.
One workflow where an agent changes backend state, a redacted sample of the agent's claims and the matching records, and a named decision the result would inform. That's it to begin.
For a buyer being asked to share records, who is behind the work is one of the strongest trust signals there is. This section is a placeholder. Fill it in before sharing the site.
[ One short paragraph: relevant background (verification, payments, audit, ML evaluation, support operations) and why you started Veridatum. Two or three concrete sentences is plenty. ]
[ Optional second profile. Delete this card if you're solo for now. ]
⚠ Placeholder: replace with real names and bios before sharing with prospective design partners.
Pick one AI-agent workflow where reported success needs checking against business reality. No data or credentials to start: just a description, and we'll tell you what evidence would verify it.
Describe a workflow → or email hello@veridatum.io