AI Tool Claim Verification for Agencies

Status: graduated (Corpus S) · Filter: 10.5/15 · Spread: ±2.0 · Signals: 2 independent
What is this?
A written decision service for boutique agencies evaluating a specific AI tool, narrowed to the one thing AE can genuinely do well: verifying whether the vendor's concrete performance claims hold on a small set of agency-relevant benchmark tasks, with fast, objective scoring. Instead of forecasting team-wide rollout success, the deliverable is a claim-verification dossier: which promises are supported, which fail under adversarial testing, what constraints matter, and whether the tool merits a limited pilot.

Inputs are public vendor claims, sample non-sensitive task specs, and the agency's evaluation criteria. AE runs an adversarial multi-model debate over expected outcomes, executes a code-enforced grading loop on benchmark results (sketched below), and, where appropriate, applies its six-pattern taxonomy of reasoning failures to the tool's outputs, to vendor positioning, and to evaluator overreach.

This is not change-management consulting and not a broad software audit. It is a bounded pre-purchase evidence service for agencies deciding whether an AI writing, research, summarization, or workflow assistant actually performs on the tasks it claims to handle, before the team wastes time piloting it.
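The grading step is the part that must stay objective. As a rough illustration only (every name below is hypothetical; the actual AE harness is not documented in this dossier), a code-enforced grading loop reduces each vendor claim to per-task pass/fail checks and aggregates them into a pass rate:

    # Minimal sketch in Python. All names are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class BenchmarkResult:
        claim: str      # vendor claim under test, e.g. "summarizes briefs accurately"
        output: str     # tool output captured by the Commander
        expected: str   # agency-supplied reference answer

    def grade(result: BenchmarkResult) -> bool:
        # Objective pass/fail check; a real harness would use per-task checks
        # (exact match, regex, latency threshold), not one string compare.
        return result.output.strip().lower() == result.expected.strip().lower()

    def verify_claims(results: list[BenchmarkResult]) -> dict[str, float]:
        # Aggregate a pass rate per vendor claim across benchmark tasks.
        per_claim: dict[str, list[bool]] = {}
        for r in results:
            per_claim.setdefault(r.claim, []).append(grade(r))
        return {claim: sum(p) / len(p) for claim, p in per_claim.items()}

Pinning grades to code keeps pass/fail reproducible across reruns, which is what makes the dossier's verdicts checkable rather than impressionistic.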
Why did we consider it?
The best case is that this is a narrow, credible, high-value evidence service for agencies: verify vendor AI claims on real tasks fast, objectively, and repeatably before they waste time and money on weak pilots.
What breaks?
  • The 'Free Trial' Squeeze: Agencies won't pay premium fees to audit cheap SaaS, and enterprise vendors offer free POCs for expensive tools.
  • Manual Labor Bottleneck: AE cannot autonomously operate third-party UIs/APIs; the Commander must manually execute the benchmarks before AE can grade them.
  • Misaligned Target Market: Boutique agencies are budget-constrained and lack the capital to support a £100-300K/year recurring revenue target for pre-purchase audits.
What did we learn?
Commander override: KILL. The manual-labor bottleneck destroys solo feasibility: AE cannot autonomously operate third-party AI tool UIs/APIs, so the Commander must manually acquire access, learn each interface, input the agency benchmarks, and extract outputs before AE can grade anything. That inverts the value proposition into "AE as grading assistant to Commander consulting," which is exactly the operating mode the Commander is trying to escape. The idea also suffers a missing-middle economic squeeze (£30/mo tools do not justify £2k audits; enterprise tools come with free vendor POCs) and buyer-budget constraints in the boutique-agency ICP. This was the first Corpus S graduation, so the kill is diagnostic for Corpus S quality.

Filter scores

Five axes, each scored 0-3, for a composite out of 15. Three independent runs by different model perspectives; the median is shown.

  • Data moat: Does this product accumulate proprietary data that compounds?
  • 10x model test: Does a better model make this more valuable, or redundant?
  • Fast feedback loops: Can outputs be graded against reality in <30 days?
  • Solo founder feasible: Can a solo operator build and run this without a team?
  • AI providers can't eat it: Do hyperscalers have structural reasons NOT to build this?
Composite median: 10.5 / 15. Graduation threshold: 9.0. IQR across runs: 2.0.
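For concreteness, one plausible reading of the aggregation (the exact method is not specified above, and a half-point composite like 10.5 implies the real pipeline averages runs or permits fractional axis scores) is: each run totals its five axis scores, the composite is the median of run totals, and the spread is the interquartile range across runs. A minimal Python sketch under those assumptions, with made-up scores:

    import statistics

    # Hypothetical axis scores: three independent runs x five axes, each 0-3.
    runs = [
        [2, 3, 3, 1, 2],   # run 1 -> total 11
        [2, 2, 3, 2, 2],   # run 2 -> total 11
        [3, 2, 3, 1, 3],   # run 3 -> total 12
    ]

    totals = [sum(run) for run in runs]            # one composite per run
    composite = statistics.median(totals)          # median across runs
    q1, _, q3 = statistics.quantiles(totals, n=4)  # quartile cut points
    iqr = q3 - q1                                  # run-to-run spread
    graduated = composite >= 9.0                   # graduation threshold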

Evidence

Signal B — Competitor with documented gap

Competitors shown are software platforms for ongoing LLM evaluation/testing, but the hypothesis is a bounded, done-for-you pre-purchase decision service for agencies. The evidence establishes competitor existence but only a partial gap: these tools appear to be productized for internal AI teams and enterprise applications, not packaged as a fast written verification dossier for boutique agencies deciding whether to run a limited pilot.

Signal D — Demand proxy

{"summary":"Indirect evidence suggests active market interest in comparing AI tool performance and benchmarking models, plus strong ecosystem activity around evaluation frameworks.","sources":["https://github.com/openai/evals","https://github.com/eleutherai/lm-evaluation-harness","https://www.reddit.com/r/ClaudeAI/comments/1r9jf2j/i_benchmarked_opus_46_vs_sonnet_46_on_agentic_pr/","https://www.reddit.com/r/GithubCopilot/comments/1r7yxqa/your_experience_with_new_claude_sonnet_46_vs_45/","https://www.reddit.com/r/ChatGPT/comments/1r3kkl8/after_3_years_with_chatgpt_i_tried_claude_and/"]}

Evaluation history

When              Stage                 Phase
2026-04-19 10:27  deep_council_verdict  graduated
2026-04-19 10:17  deep_claude_take      graduated
2026-04-19 10:14  deep_90day_plan       graduated
2026-04-19 09:36  deep_risk             graduated
2026-04-19 09:28  deep_distribution     graduated
2026-04-19 09:15  deep_pricing          graduated
2026-04-19 09:06  deep_moat             graduated
2026-04-19 08:52  deep_buyer_sim        graduated
2026-04-19 08:46  deep_icp              graduated
2026-04-19 08:36  deep_competitor       graduated
2026-04-19 08:27  deep_market_reality   graduated
2026-04-19 08:10  filter_score          scored
2026-04-19 08:00  filter_score          scored
2026-04-19 07:50  filter_score          scored
2026-04-19 07:40  evidence_search       argument
2026-04-19 07:30  audience_simulation   argument
2026-04-19 07:20  red_team_kill         argument
2026-04-19 07:10  steelman              argument
2026-04-19 07:00  genesis               argument