Veridatum: Verify what AI agents actually did

Claim → evidence

An agent's claim is not a business outcome.

Your system of record is wherever the truth actually lives: Stripe, your order database, your billing system. Veridatum checks each reported action against that record and shows whether it held, meaning the change actually took effect and stayed true, not just that the agent reported it done.

What the agent reports

What your system of record shows

Vendor dashboards report their own status, counts, and completions. Veridatum's job is the independent comparison: does that reported status match the systems you already trust?

Where reports break

Questions a vendor dashboard can't settle on its own.

01 · payment state

Refund issued?

An agent says it refunded the customer. Did the payment record post, at the right amount, without reversal or duplication?

02 · account state

Cancellation completed?

An agent says it cancelled an account. Did billing actually stop, and did the entitlement change?

03 · ticket state

Ticket resolved?

An agent says the ticket was resolved. Did the backend state change, or did a human quietly reopen and fix it later?

Verified action ledger

Reported success, checked against business reality.

The output is an evidence-backed ledger: the claimed action, the proof from your system of record, whether it held, our verification status, and the decision. Refunds, credits, cancellations, and billing-state changes are useful first workflows. The artifact covers any backend state an agent can change.

SAMPLE DELIVERABLE PREVIEW · support actions

We only ever read your data, via an export you share or read-only access

Claimed action	System-of-record proof	Hold status	Veridatum status	Decision
Refund issued	Stripe refund posted, £42.50	Held 7 days, not reversed	Verified	Accept
Cancellation completed	No cancellation in billing record	Still billing at 24h	Disputed	Escalate
Duplicate refund requested	Already refunded once	No change	Excluded	Good restraint
Order address changed	Shipping address changed to new recipient before fulfilment	Held	Verified	Monitor
Credit applied	Wallet balance updated, £15.00	Held 3 days	Verified	Accept
Ticket closed	Ticket marked resolved	Reopened in 2h	Disputed	Dispute
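
A minimal sketch of how one row in a ledger like this could be derived, assuming a claim and its system-of-record entry share a join key. The names here (`Claim`, `Record`, `verify`, `reference`, `reversed_later`) and the status-to-decision mapping are illustrative assumptions, not Veridatum's actual rules. The "Excluded" case (correct restraint) is left out for brevity.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Claim:
    action: str                       # e.g. "refund_issued"
    reference: str                    # hypothetical join key shared with the record
    amount: Optional[Decimal] = None  # None for actions without a monetary amount


@dataclass(frozen=True)
class Record:
    reference: str
    amount: Optional[Decimal]
    reversed_later: bool              # True if the change was undone within the hold window


def verify(claim: Claim, record: Optional[Record]) -> tuple[str, str]:
    """Return (status, decision) for one claimed action (illustrative mapping)."""
    if record is None:
        # Nothing in the system of record backs the claim.
        return ("Disputed", "Escalate")
    if claim.amount is not None and record.amount != claim.amount:
        # The record exists but the amount differs from what was claimed.
        return ("Disputed", "Dispute")
    if record.reversed_later:
        # The change posted but did not hold.
        return ("Disputed", "Dispute")
    return ("Verified", "Accept")


refund = Claim("refund_issued", "re_1", Decimal("42.50"))
print(verify(refund, Record("re_1", Decimal("42.50"), reversed_later=False)))
print(verify(Claim("cancellation_completed", "sub_9"), None))
```

Amounts use `Decimal` rather than floats so that a claimed £42.50 compares exactly against the recorded value.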

Design-partner program

Three design-partner slots.

Veridatum is early. We're taking three design partners to run a Verified Action Sprint on one real workflow, and limiting the program so each sprint can directly shape the method. You bring a redacted sample and a short debrief. You get the kind of buyer-readable ledger previewed above, direct input into the method, and design-partner terms.

How a sprint runs

1. Describe a workflow. No data or credentials to start: just one workflow where an agent changes backend state, and the decision attached to it.
2. Sample & source-of-truth mapping. A redacted export or scoped read-only sample, mapped to the records that actually hold the truth, with clean joins and messy ones noted.
3. Verified ledger & recommendation. An evidence-backed ledger of what held, what didn't, and what to accept, dispute, escalate, or exclude.

See what a sprint involves →

1–2 WEEKS · ONE WORKFLOW · 3 SLOTS

The Verified Action Sprint.

Start from a redacted export or a scoped read-only sample, as small as a few dozen actions. Leave with a verified action ledger showing what the agent claimed, what your system of record shows, whether it held, and what should be accepted, disputed, escalated, or excluded.

What you bring

→ One workflow where an AI agent changes backend state: refunds, cancellations, credits, order or account changes, entitlement changes, or ticket resolutions.
→ A redacted sample of agent/vendor claims and the matching system-of-record records. No production access or credentials.
→ A named decision the result would inform, and roughly the exposure attached to it.

What you get

✓ A verified action ledger and an evidence-backed register of disputed, failed, and corrected actions.
✓ Source-of-truth mapping notes and sample evidence packets you can inspect.
✓ A recommendation (trust, gate, dispute, monitor, or stop), plus design-partner terms and direct input into the method.

Public method demos

On public benchmarks, Veridatum caught what a dashboard would miss.

We ran the same claim-vs-record verification over public, MIT-licensed agent benchmarks: their task specs, sandbox databases, and result traces. We have no production customers yet; the findings below are our own analysis of that public data.

Reported vs. verified · τ-bench retail

If every reported action is trusted: 100%

Actually held in the system of record: 74%

26% didn't match. Across 456 GPT-4.1 retail simulations, 118 disagreed with the database on final state. That's the gap a vendor dashboard wouldn't surface on its own.

Veridatum analysis · public MIT-licensed benchmark · not real customer data
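
The percentages above follow directly from the two counts quoted. A short sketch of the arithmetic, using only those figures (the variable names are illustrative):

```python
total_runs = 456   # GPT-4.1 retail simulations
mismatched = 118   # runs whose final state disagreed with the database

held = total_runs - mismatched
mismatch_pct = round(100 * mismatched / total_runs)
held_pct = round(100 * held / total_runs)

print(held, mismatch_pct, held_pct)  # 338 26 74
```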

METHOD DEMO · public benchmarks

Public benchmark data, not real customer data. Each row cites its source task

Claimed / attempted action	Backend (benchmark) proof	Hold status	Veridatum status	Source
Cancel order after price drop	Order + item status = cancelled; refund £249 to card	Held	Verified	STATE-Bench · 40
Cancel customer's order, refund promised	Wrong order cancelled; db_match = false	Mismatch	Disputed	τ-bench · 51
Second courtesy refund requested	Already refunded; no mutation performed	No change	Excluded	STATE-Bench · 134

τ-bench retail · Sierra Research

A quarter of runs didn't match the business state.

Across 456 GPT-4.1 retail simulations, 118 failed the final-state check: the agent's transcript and the database disagreed. In one, the agent cancelled the wrong order and told the customer a refund was on its way; in another it acted past an explicit "don't cancel anything else" instruction, inflating the refund. 104 of 114 tasks involve state-changing actions.

Microsoft STATE-Bench · enterprise support

Verifying the math, and the safe "no".

Across 150 support tasks with 952 explicit state assertions, the same method checks more than "resolved": refund amounts after clawbacks, waived fees, and multi-leg compound actions. It also verifies restraint: an agent correctly refusing a duplicate refund is a control worth confirming, just as much as a successful action.
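
As a rough illustration of both checks, the sketch below compares refund totals across two snapshots of backend state. The snapshot layout (a dict with a `refunds` list) and the helper names are assumptions made for this example, not the STATE-Bench schema or Veridatum's implementation.

```python
from decimal import Decimal


def refund_total(state: dict, order_id: str) -> Decimal:
    """Sum all refunds recorded against one order in a state snapshot."""
    return sum(
        (r["amount"] for r in state["refunds"] if r["order_id"] == order_id),
        Decimal("0"),
    )


def restraint_held(before: dict, after: dict, order_id: str) -> bool:
    """True if no further refund was posted between the two snapshots."""
    return refund_total(before, order_id) == refund_total(after, order_id)


def net_refund_matches(state: dict, order_id: str,
                       gross: Decimal, clawback: Decimal) -> bool:
    """True if the recorded refund equals the gross amount minus the clawback."""
    return refund_total(state, order_id) == gross - clawback


before = {"refunds": [{"order_id": "o1", "amount": Decimal("30.00")}]}
after = {"refunds": [{"order_id": "o1", "amount": Decimal("30.00")}]}

print(restraint_held(before, after, "o1"))  # True
print(net_refund_matches(after, "o1", Decimal("35.00"), Decimal("5.00")))  # True
```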

Public synthetic benchmarks demonstrate the verification method and the shape of the artifact. They do not, by themselves, prove buyer demand, real-world error rates, or that production data is this clean. Buyer deployments run on buyer-authorized systems of record.

Who it's for

For teams letting AI agents change business state.

When an agent can move money, close accounts, or change entitlements, someone has to confirm it actually happened, before you trust it, pay for it, or take it into a vendor renewal.

State changes verified

Customer state
Order state
Payment state
Account state
Entitlement state
Support-ticket state

By team

Support · Ops · silent failures and cleanup
Finance · what to trust, pay, or dispute
Procurement · buyer-owned evidence for renewals
Risk · Compliance · reconstruct what agents changed
BizOps · Data · reliable joins and definitions

A good moment to start

An AI vendor renewal, true-up, or review coming up
A new agent action type about to go live
An incident that needs reconstructing
A workflow that moves real money

Trust & data handling

What we touch, and what we don't.

Verification means looking at sensitive records, so the handling matters as much as the method. Here is how a sprint is scoped.

Read-only, export-first

We start from a redacted export or scoped read-only access. No write access, ever.

One workflow at a time

A sprint covers a single workflow and a small sample, often a few dozen actions, not your whole database.

Minimised and redacted

We ask for the smallest sample that proves the method, with identifiers redacted wherever the verification doesn't need them.

Not used to train anything

Your data is used to produce your ledger and nothing else. It is never used to train models.

Deleted on request

Sample data is kept only for the sprint and deleted afterwards on request. An NDA before anything is shared is fine.

No production path

Veridatum sits entirely outside your live systems, with no production writes and no live replies to your customers.

FAQ

Common questions.

Isn't this just our vendor's dashboard, or something we could build ourselves?

Neither is independent, and independence is the point. A vendor's dashboard reports what the vendor counted; your own team can reconcile it for internal use, but in a dispute the vendor contests your numbers first, and risk or compliance sign-offs are built to distrust a check run by the same side it's checking. A neutral, buyer-owned ledger both sides can reference is harder to wave away. The hard part isn't the reconciliation. It's knowing which record holds the truth, and what "actually held" means when two systems disagree, per action type and across messy systems.

Is this invoice reconciliation or AI-spend management?

No. Veridatum checks whether a business action actually became true in your systems, not whether usage arithmetic adds up. It can inform what's payable or what to take into a renewal, but it is not invoice reconciliation or spend management.

Do you need access to our production systems?

No. A sprint runs on a redacted export or scoped read-only access. The first step needs no data at all, just a description of the workflow.

What if our logs and records are messy?

Expected. Part of the sprint is mapping agent/vendor claims to the right source-of-truth records and noting where the join is clean and where it isn't. You get those mapping notes either way.
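
One way to picture the mapping step: group records by a shared key, then sort each claim into a clean match, an ambiguous match, or no match. This is a sketch under the assumption that claims and records carry a common `reference` field. Real exports often need fuzzier joins, and the function name and bucket labels are hypothetical.

```python
def map_claims(claims: list[dict], records: list[dict],
               key: str = "reference") -> dict:
    """Bucket claims by how cleanly they join to system-of-record rows."""
    by_key: dict = {}
    for rec in records:
        by_key.setdefault(rec[key], []).append(rec)

    clean, ambiguous, unmatched = [], [], []
    for claim in claims:
        matches = by_key.get(claim.get(key), [])
        if len(matches) == 1:
            clean.append((claim, matches[0]))    # exactly one record: clean join
        elif matches:
            ambiguous.append((claim, matches))   # several records: needs a rule
        else:
            unmatched.append(claim)              # no record found for the claim
    return {"clean": clean, "ambiguous": ambiguous, "unmatched": unmatched}


claims = [{"reference": "a"}, {"reference": "b"}, {"reference": "c"}]
records = [{"reference": "a"}, {"reference": "b"}, {"reference": "b"}]
result = map_claims(claims, records)
print({bucket: len(items) for bucket, items in result.items()})
```

The ambiguous and unmatched buckets are where mapping notes matter most, since they record which joins needed a judgment call.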

How much does a design-partner sprint cost?

Design-partner terms are deliberately light, and scoped to the workflow and the decision attached to it. We'll quote it on a short call once we've confirmed the workflow is a fit. There are three slots.

What do you need from us to start?

One workflow where an agent changes backend state, a redacted sample of the agent's claims and the matching records, and a named decision the result would inform. That's it to begin.

Who's behind Veridatum

The people behind Veridatum.

For a buyer being asked to share records, who is behind the work is one of the strongest trust signals there is. This section is a placeholder. Fill it in before sharing the site.

[ Founder name ]

Founder

[ One short paragraph: relevant background (verification, payments, audit, ML evaluation, support operations) and why you started Veridatum. Two or three concrete sentences is plenty. ]

[ LinkedIn · email ]

[ Co-founder name (optional) ]

Co-founder

[ Optional second profile. Delete this card if you're solo for now. ]