AgentDiligence — Pre-procurement AI agent evaluation

The gap

A demo is not evidence. A benchmark is not your business.

Evaluating an AI agent can answer three different questions. Only one of them tells you whether the agent is safe to trust with your work.

Vendor demo

Can the vendor show a convincing happy path?

Public benchmark

Which model wins on shared, generic tasks?

AgentDiligence

Can this agent safely do your work, to your standards, with evidence you can inspect?

How it works

Your support history is the exam.

We don't hand you a generic test set. We replay the work your team already did — the tickets, timelines, and actions behind it — and turn it into proof.

Ingest your work traces

Tickets, timelines, status and assignee changes, internal notes, backend actions, and outcomes — exported, read-only, and handled privately. No prior eval set required.

Build Task Specs (primitive)

Each trace becomes a replayable AgentDiligence Task Spec: the situation, the hidden business state, your policies, the allowed and forbidden actions, the expected final state, and the evidence bar it has to clear.
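
As a rough illustration of the parts listed above, a Task Spec could be modelled as a small record. This is a sketch only: the class name, field names, action names, and sample values are all hypothetical and are not AgentDiligence's actual schema.

```python
from dataclasses import dataclass


@dataclass
class TaskSpec:
    """Hypothetical shape of a replayable Task Spec (all names are assumptions)."""
    spec_id: str
    situation: str                    # what the customer asked, in context
    hidden_state: dict                # business state the agent has to discover
    policies: tuple[str, ...]         # policy clauses that apply to this task
    allowed_actions: frozenset[str]
    forbidden_actions: frozenset[str]
    expected_final_state: dict        # what the backend should look like afterwards
    evidence_bar: int = 4             # minimum evidence level required to pass

    def violations(self, actions: list[str]) -> list[str]:
        """Return the actions that are forbidden or not explicitly allowed."""
        return [
            a for a in actions
            if a in self.forbidden_actions or a not in self.allowed_actions
        ]


# Illustrative values only; not taken from a real engagement.
spec = TaskSpec(
    spec_id="refund-001",
    situation="Customer requests a refund for a duplicate charge",
    hidden_state={"order_total_gbp": 120, "already_refunded": False},
    policies=("Refunds over £250 need human approval",),
    allowed_actions=frozenset({"lookup_order", "issue_refund", "escalate"}),
    forbidden_actions=frozenset({"close_without_reply"}),
    expected_final_state={"refund_issued": True},
)
```

Keeping allowed and forbidden actions as explicit sets means a run can be checked mechanically: anything the agent did outside the allowed set shows up in `violations`.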

Run the candidate field

Packaged vendors, a cheaper challenger, a reference agent, and your current process all face the same specs under identical constraints, with repeated runs where consistency matters. We test the agents — we never sit between you and your vendors.
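
The idea of running every arm on the same specs with repeated runs can be sketched as a small harness. Everything here is an assumption made for illustration: the `Agent` type, the `run_field` function, and the definition of consistency as the share of runs that agree with the most common outcome.

```python
from collections import Counter
from typing import Callable

# Hypothetical agent interface: takes a spec id, returns a final-state label.
Agent = Callable[[str], str]


def run_field(
    arms: dict[str, Agent], spec_ids: list[str], runs: int = 3
) -> dict[str, dict[str, float]]:
    """Run every arm on every spec `runs` times and report per-spec consistency."""
    results: dict[str, dict[str, float]] = {}
    for name, agent in arms.items():
        per_spec: dict[str, float] = {}
        for spec_id in spec_ids:
            outcomes = Counter(agent(spec_id) for _ in range(runs))
            # Consistency: share of runs that agree with the most common outcome.
            per_spec[spec_id] = outcomes.most_common(1)[0][1] / runs
        results[name] = per_spec
    return results
```

An arm that returns the same final state on every run scores 1.0 for that spec; an arm that flips between outcomes scores lower, which is the signal that repeated runs are meant to surface.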

Capture Evidence Packets (primitive)

Every run produces an Evidence Packet — transcript, vendor source and handoff records, and independent backend final-state proof — scored from Level 2 to Level 4.

Decide

A plain-English recommendation: which categories are safe to automate, assist, human-gate, or avoid — and which agent, if any, is worth buying or piloting.
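
To make the four routes concrete, here is a minimal sketch of how a ticket category might be mapped to automate, assist, human-gate, or avoid. The function name and every threshold in it are illustrative assumptions, not AgentDiligence's actual decision rule; a real recommendation would weigh the evidence behind each figure.

```python
def route_category(pass_rate: float, serious_incidents: int, evidence_level: int) -> str:
    """Map one ticket category to a route. All thresholds are hypothetical."""
    if serious_incidents >= 3 or pass_rate < 0.5:
        return "avoid"          # too risky or too unreliable to deploy
    if serious_incidents > 0:
        return "human-gate"     # agent proposes, a person approves
    if pass_rate >= 0.9 and evidence_level >= 4:
        return "automate"       # high pass rate, verified against the backend
    return "assist"             # agent drafts, a person sends
```

Ordering the checks from most to least restrictive means a category with any serious incident can never fall through to full automation, whatever its pass rate.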

Evidence

Decision output, not dashboard noise.

Every run becomes a decision-ready report: what passed, what failed, what's safe to automate, and how to route the rest. Every figure links to the evidence behind it.

Sample report · Refunds & billing · 60 task specs · 3 runs each
Recommendation: Pilot, routed

Arm | Type | Resolved | Incidents | Cost / resolution | Verdict
Premium vendor agent | packaged | 47 / 60 | 0 | £0.79 | Pilot
Challenger agent | low-cost | 41 / 60 | 3 | £0.18 | Avoid · refunds
Reference agent | frontier model | 44 / 60 | 1 | £0.31 | Assist
Your current process | human baseline | 52 / 60 | 0 | £5.20 | Baseline

3 serious-incident clusters · Human-gate refunds > £250 · Evidence verified to L4

View transcripts → Final-state proof →

Illustrative — sample figures, anonymised arms. In a real engagement, arms are named, every figure links to its transcript, tool calls, policy match, and final-state proof, and cost is per safe resolution.

Level 1

Claim or score only

A pass-rate or a slick demo — the vendor's word, nothing behind it. Not evidence; packets start at Level 2.

Level 2

Transcript & screenshots

What the agent said and showed the customer, captured verbatim.

Level 3

Source, ticket & handoff records

Vendor-side evidence: where answers came from, what was logged, how it escalated.

Level 4

Independent final-state proof

We check the backend itself: was the refund actually issued, the case actually resolved?
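
The four levels can be read as a ladder, sketched below. The `EvidencePacket` fields and the `evidence_level` function are hypothetical names, and treating the levels as strictly cumulative (each level requires the ones below it) is an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass
class EvidencePacket:
    """Hypothetical packet: which kinds of evidence were captured for one run."""
    transcript: bool = False           # Level 2: what the agent said and showed
    vendor_records: bool = False       # Level 3: source, ticket, and handoff records
    backend_final_state: bool = False  # Level 4: independent final-state proof


def evidence_level(packet: EvidencePacket) -> int:
    """Return the highest level reached, assuming levels are cumulative."""
    if not packet.transcript:
        return 1  # claim or score only
    if not packet.vendor_records:
        return 2
    if not packet.backend_final_state:
        return 3
    return 4
```

Under this reading, a run with backend proof but no transcript would still score Level 1, because the lower rungs are missing.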

Engagements

Start with a snapshot. Scale to assurance.

Productized evaluations, scoped to the decision in front of you. Most teams start with a Vendor Decision Report.

Work-Trace Snapshot

What's safe, risky, or worth evaluating?

A sample of your support history, normalized into Task Specs
An automation-readiness map across ticket categories
Where the evidence is strong — and where it's missing

Start here →

Most teams start here

Vendor Decision Report

Which vendor or route should we buy or pilot?

2–4 vendor, reference, and baseline arms, head to head
50–150 Task Specs built from your own work
Serious-incident table and quality-adjusted cost
Buy · pilot · constrain · avoid — with the evidence

Run a comparison →

Enterprise Evaluation

What's safe to automate across the org?

Multiple workflows, markets, and languages
Deeper Level 4 evidence and blind review
Board- and procurement-ready synthesis

Talk to us →

Regression & Assurance (ongoing)

Does the deployed agent still clear the bar?

Quarterly retests as models and policies change
Renewal and expansion evidence on demand
Incident monitoring against your Task Spec library

Talk to us →

Engagements are priced to the decision — typically a small fraction of the first-year agent spend. Ask for a scope.

Who it's for

Built for the team that owns the support decision.

Support and service leaders at scaleups choosing — or second-guessing — an AI agent on customer-facing work.

You're a fit if

You handle 500–5,000+ tickets a month, with history worth replaying
You run a mainstream helpdesk — Zendesk, Intercom, Gorgias, Freshdesk, Front, Jira Service Management, ServiceNow
You're considering, piloting, renewing, or replacing an AI support agent
Service quality is tied to revenue, retention, and trust

Right when you're

Weighing Intercom Fin, Zendesk AI, Gorgias AI, Ada, or Freddy
Mid-migration between helpdesks
Heading into a renewal negotiation or building a vendor shortlist
Still recovering from a bot rollout that overpromised

Support is the first proving ground. The same Task Spec → Evidence Packet engine applies anywhere humans leave a work trail — finance ops, claims handling, contract review.

Don't buy the demo. Prove the agent.

Prove it on your own work — before you sign.