
CS Ops Scorecard for AI Support Agent Vendor Claims

Status: ranked [TRIANGULATED]. Filter score: 8.5/15 (spread ±2.0). Signals: 3 independent.
What is this?
A monthly subscription tool for Customer Success Operations leads at 50-500-person SaaS companies that have deployed (or are piloting) third-party AI support agent vendors (Decagon, Ada, Forethought, Sierra, Cresta, etc.). CS Ops manually enters the vendor's claimed monthly performance figures from the standard QBR PDF (deflection rate, CSAT, escalation accuracy, hallucination incidents, contained-conversation rate) alongside aggregate ground-truth pulls from Zendesk/Intercom (no ticket bodies, just bucketed counts).

AE's adversarial multi-model debate runs each claim against the resolved aggregates using the 6-pattern autopsy taxonomy, flagging patterns such as Cosmetic Confidence (CSAT cherry-picked from the contained subset), Premise-Conclusion Severing (claimed deflection ignores the escalation backlog), and Concession Laundering (the vendor concedes minor misses while inflating the headline number). The output is a quarter-by-quarter ledger that CS Ops walks into renewal and QBR conversations.

Today these reviews happen on faith: the vendor sends a deck, CS Ops has no calibration tooling, and the Zendesk aggregates tell a different story that CS Ops can't structure into pushback. AE's portable autopsy taxonomy plus 24-hour grading turn the buyer's existing aggregates into renewal leverage.
Why did we consider it?
AE's autopsy taxonomy and 24-hour adversarial grading turn buyer-side Zendesk aggregates into a renewal-leverage instrument that the AI support market (per Opus, Twig, and DigitalApplied 2026) has not yet produced.
What breaks?
  • Data-resolution mismatch: You cannot run a 6-pattern LLM autopsy on bucketed aggregate counts; detecting metric manipulation requires raw ticket bodies.
  • Over-engineered solution: Comparing vendor PDF claims against Zendesk aggregate counts requires basic arithmetic, not an adversarial multi-model debate.
  • Subscription frequency mismatch: QBRs and renewals are quarterly/annual events, making a monthly SaaS highly susceptible to churn.
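The "basic arithmetic" critique can be made concrete: the core deflection check reduces to a ratio and a delta against the claimed figure. A minimal sketch, assuming hypothetical field names and an illustrative 5% tolerance (neither appears in the product description):

```python
def deflection_discrepancy(claimed_rate: float,
                           contained: int,
                           escalated: int,
                           tolerance: float = 0.05) -> dict:
    """Compare a vendor's claimed deflection rate against bucketed
    ground-truth counts pulled from the helpdesk (no ticket bodies needed)."""
    total = contained + escalated
    actual_rate = contained / total if total else 0.0
    delta = claimed_rate - actual_rate
    return {
        "actual_rate": round(actual_rate, 3),
        "delta": round(delta, 3),
        "flag": abs(delta) > tolerance,  # worth raising at the QBR
    }

# Vendor claims 72% deflection; aggregates show 5,400 contained vs 2,600 escalated.
result = deflection_discrepancy(0.72, contained=5400, escalated=2600)
# actual_rate = 5400 / 8000 = 0.675; delta = 0.045, inside a 5% tolerance
```

Everything past this ratio-and-delta step (pattern attribution, adversarial debate) is where the product's claimed value would have to live, which is exactly what the bullet above questions.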
What did we learn?
Still in evaluation (phase: ranked). No verdict yet.

Filter scores

Five axes, each scored 0-3. Three independent runs by different model perspectives. Median shown.

Axis                        What it measures
Data moat                   Does this product accumulate proprietary data that compounds?
10x model test              Does a better model make this more valuable, or redundant?
Fast feedback loops         Can outputs be graded against reality in <30 days?
Solo founder feasible       Can a solo operator build and run this without a team?
AI providers can't eat it   Do hyperscalers have structural reasons NOT to build this?
Composite median: 8.5 / 15. Graduation threshold: 9.0. IQR across runs: 2.0.
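As a rough sketch of how such a composite could be assembled (the report does not show the scoring pipeline; the per-run scores, the median-of-sums composite, and the range-based spread below are all assumptions):

```python
from statistics import median

# Hypothetical per-axis scores (0-3 each, five axes) from three model perspectives.
runs = {
    "run_1": [2, 1, 2, 2, 1],   # sums to 8
    "run_2": [2, 2, 2, 2, 1],   # sums to 9
    "run_3": [1, 2, 2, 1, 1],   # sums to 7
}

# Per-axis median across runs (the "Median shown" column of the table).
axis_medians = [median(scores) for scores in zip(*runs.values())]

# Composite: median of the per-run totals, graded against the threshold.
totals = sorted(sum(scores) for scores in runs.values())
composite = median(totals)
spread = totals[-1] - totals[0]   # crude range as a spread proxy for 3 runs
graduated = composite >= 9.0
```

With these made-up scores the composite lands at 8, below the 9.0 graduation threshold, mirroring the borderline 8.5 reported above.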

Evidence

Signal A — Primary source

To assess whether AI agents can execute highly complex professional services work, we present APEX–Agents, a new benchmark for frontier AI.

Signal B — Competitor with documented gap

Swept.ai offers an AI agent evaluation framework measuring five dimensions, but the snippet states 'Most vendor scorecards cover one or two' — indicating the market focuses on pre-purchase selection criteria, not ongoing adversarial auditing of vendor QBR performance claims against ground-truth helpdesk aggregates (deflection vs. escalation backlog, CSAT cherry-picking, etc.).

Signal D — Demand proxy

Active demand for vendor claim verification in CS/support: a Reddit user in r/CustomerSuccess built a free tool that checks CS platform vendor claims by interrogating their AI agents; LinkedIn posts circulate AI vendor evaluation scorecards as essential procurement artifacts; and multiple blog posts (notch.cx, regal.ai) indicate the market is actively wrestling with which AI support metrics to trust and how to measure them independently.

Sources:
  • https://www.reddit.com/r/CustomerSuccess/comments/1st9wog/built_a_free_tool_that_evaluates_cs_platforms/
  • https://www.linke…

Evaluation history

When              Stage                 Phase
2026-05-09 12:42  filter_score          scored
2026-05-09 12:36  filter_score          scored
2026-05-09 12:24  filter_score          scored
2026-05-09 12:19  evidence_search       argument
2026-05-09 12:12  audience_simulation   argument
2026-05-09 12:06  red_team_kill         argument
2026-05-09 12:00  steelman              argument
2026-05-09 11:56  genesis               argument