
CS Ops Scorecard for AI Support Agent Vendor Claims

Status: ranked [TRIANGULATED]. Filter score: 8.5/15 (spread ±2.0). Signals: 3 independent.
What is this?
A monthly subscription tool for Customer Success Operations leads at 50-500-person SaaS companies that have deployed (or are piloting) third-party AI support agent vendors (Decagon, Ada, Forethought, Sierra, Cresta, etc.). CS Ops manually enters the vendor's claimed monthly performance figures from the standard QBR PDF (deflection rate, CSAT, escalation accuracy, hallucination incidents, contained-conversation rate) alongside aggregate ground-truth pulls from Zendesk/Intercom (no ticket bodies, just bucketed counts).

AE's adversarial multi-model debate runs each claim against the resolved aggregates using the 6-pattern autopsy taxonomy, flagging patterns such as Cosmetic Confidence (CSAT cherry-picked from the contained subset), Premise-Conclusion Severing (claimed deflection ignores the escalation backlog), and Concession Laundering (the vendor concedes minor misses while inflating the headline number). The output is a quarter-by-quarter ledger that CS Ops walks into renewal and QBR conversations.

Today these reviews happen on faith: the vendor sends a deck, CS Ops has no calibration tooling, and the Zendesk aggregates tell a different story that CS Ops can't structure into pushback. AE's portable autopsy taxonomy plus 24-hour grading turn the buyer's existing aggregates into renewal leverage.
Why did we consider it?
AE's autopsy taxonomy and 24-hour adversarial grading turn buyer-side Zendesk aggregates into a renewal-leverage instrument that the AI support market (per Opus, Twig, and DigitalApplied 2026) has not yet produced.
What breaks?
  • Data-resolution mismatch: You cannot run a 6-pattern LLM autopsy on bucketed aggregate counts; detecting metric manipulation requires raw ticket bodies.
  • Over-engineered solution: Comparing vendor PDF claims against Zendesk aggregate counts requires basic arithmetic, not an adversarial multi-model debate.
  • Subscription frequency mismatch: QBRs and renewals are quarterly/annual events, making a monthly SaaS highly susceptible to churn.
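The "basic arithmetic" critique can be made concrete: the core deflection check reduces to a ratio and a delta against the claimed figure. A minimal sketch, assuming hypothetical field names and an illustrative 5% tolerance (neither appears in the product description):

```python
def deflection_discrepancy(claimed_rate: float,
                           contained: int,
                           escalated: int,
                           tolerance: float = 0.05) -> dict:
    """Compare a vendor's claimed deflection rate against bucketed
    ground-truth counts pulled from the helpdesk (no ticket bodies needed)."""
    total = contained + escalated
    actual_rate = contained / total if total else 0.0
    delta = claimed_rate - actual_rate
    return {
        "actual_rate": round(actual_rate, 3),
        "delta": round(delta, 3),
        "flag": abs(delta) > tolerance,  # worth raising at the QBR
    }

# Vendor claims 72% deflection; aggregates show 5,400 contained vs 2,600 escalated.
result = deflection_discrepancy(0.72, contained=5400, escalated=2600)
# actual_rate = 5400 / 8000 = 0.675; delta = 0.045, inside a 5% tolerance
```

Everything past this ratio-and-delta step (pattern attribution, adversarial debate) is where the product's claimed value would have to live, which is exactly what the bullet above questions.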
What did we learn?
Still in evaluation (phase: ranked). No verdict yet.

Filter scores

Five axes, each scored 0-3. Three independent runs by different model perspectives. Median shown.

Axis                        What it measures
Data moat                   Does this product accumulate proprietary data that compounds?
10x model test              Does a better model make this more valuable, or redundant?
Fast feedback loops         Can outputs be graded against reality in <30 days?
Solo founder feasible       Can a solo operator build and run this without a team?
AI providers can't eat it   Do hyperscalers have structural reasons NOT to build this?
Composite median: 8.5 / 15. Graduation threshold: 9.0. IQR across runs: 2.0.
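As a rough sketch of how such a composite could be assembled (the report does not show the scoring pipeline; the per-run scores, the median-of-sums composite, and the range-based spread below are all assumptions):

```python
from statistics import median

# Hypothetical per-axis scores (0-3 each, five axes) from three model perspectives.
runs = {
    "run_1": [2, 1, 2, 2, 1],   # sums to 8
    "run_2": [2, 2, 2, 2, 1],   # sums to 9
    "run_3": [1, 2, 2, 1, 1],   # sums to 7
}

# Per-axis median across runs (the "Median shown" column of the table).
axis_medians = [median(scores) for scores in zip(*runs.values())]

# Composite: median of the per-run totals, graded against the threshold.
totals = sorted(sum(scores) for scores in runs.values())
composite = median(totals)
spread = totals[-1] - totals[0]   # crude range as a spread proxy for 3 runs
graduated = composite >= 9.0
```

With these made-up scores the composite lands at 8, below the 9.0 graduation threshold, mirroring the borderline 8.5 reported above.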

Evidence

Signal A — Primary source

To assess whether AI agents can execute highly complex professional services work, we present APEX–Agents, a new benchmark for frontier AI.

Signal B — Competitor with documented gap

Swept.ai offers an AI agent evaluation framework measuring five dimensions, but the snippet states 'Most vendor scorecards cover one or two' — indicating the market focuses on pre-purchase selection criteria, not ongoing adversarial auditing of vendor QBR performance claims against ground-truth helpdesk aggregates (deflection vs. escalation backlog, CSAT cherry-picking, etc.).

Signal D — Demand proxy

Active demand for vendor claim verification in CS/support: a Reddit user in r/CustomerSuccess built a free tool that checks CS platform vendor claims by interrogating their AI agents; LinkedIn posts circulate AI vendor evaluation scorecards as essential procurement artifacts; and multiple blog posts (notch.cx, regal.ai) indicate the market is actively wrestling with which AI support metrics to trust and how to measure them independently.

Sources:
  • https://www.reddit.com/r/CustomerSuccess/comments/1st9wog/built_a_free_tool_that_evaluates_cs_platforms/
  • https://www.linke…

Evaluation history

When              Stage                 Phase
2026-05-09 12:42  filter_score          scored
2026-05-09 12:36  filter_score          scored
2026-05-09 12:24  filter_score          scored
2026-05-09 12:19  evidence_search       argument
2026-05-09 12:12  audience_simulation   argument
2026-05-09 12:06  red_team_kill         argument
2026-05-09 12:00  steelman              argument
2026-05-09 11:56  genesis               argument