AI Tool Claim Verification for Agencies

Status: graduated (Corpus S) · Filter: 10.5/15 · Spread: ±2.0 · Signals: 2 independent
What is this?
A written decision service for boutique agencies evaluating a specific AI tool, narrowed to the one thing AE can genuinely do well: verifying whether the vendor's concrete performance claims hold on a small set of agency-relevant benchmark tasks, with fast, objective scoring. Instead of forecasting team-wide rollout success, the deliverable is a claim-verification dossier: which promises are supported, which fail under adversarial testing, what constraints matter, and whether the tool merits a limited pilot.

Inputs are public vendor claims, sample non-sensitive task specs, and the agency's evaluation criteria. AE runs an adversarial multi-model debate over expected outcomes, executes a code-enforced grading loop on benchmark results (sketched below), and, where appropriate, applies its six-pattern taxonomy of reasoning failures to the tool's outputs, to vendor positioning, and to evaluator overreach.

This is not change-management consulting and not a broad software audit. It is a bounded pre-purchase evidence service for agencies deciding whether an AI writing, research, summarization, or workflow assistant actually performs on the tasks it claims to handle, before the team wastes time piloting it.
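The grading step is the part that must stay objective. As a rough illustration only (every name below is hypothetical; the actual AE harness is not documented in this dossier), a code-enforced grading loop reduces each vendor claim to per-task pass/fail checks and aggregates them into a pass rate:

    # Minimal sketch in Python. All names are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class BenchmarkResult:
        claim: str      # vendor claim under test, e.g. "summarizes briefs accurately"
        output: str     # tool output captured by the Commander
        expected: str   # agency-supplied reference answer

    def grade(result: BenchmarkResult) -> bool:
        # Objective pass/fail check; a real harness would use per-task checks
        # (exact match, regex, latency threshold), not one string compare.
        return result.output.strip().lower() == result.expected.strip().lower()

    def verify_claims(results: list[BenchmarkResult]) -> dict[str, float]:
        # Aggregate a pass rate per vendor claim across benchmark tasks.
        per_claim: dict[str, list[bool]] = {}
        for r in results:
            per_claim.setdefault(r.claim, []).append(grade(r))
        return {claim: sum(p) / len(p) for claim, p in per_claim.items()}

Pinning grades to code keeps pass/fail reproducible across reruns, which is what makes the dossier's verdicts checkable rather than impressionistic.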
Why did we consider it?
The best case is that this is a narrow, credible, high-value evidence service for agencies: verify vendor AI claims on real tasks fast, objectively, and repeatably before they waste time and money on weak pilots.
What breaks?
  • The 'Free Trial' Squeeze: Agencies won't pay premium fees to audit cheap SaaS, and enterprise vendors offer free POCs for expensive tools.
  • Manual Labor Bottleneck: AE cannot autonomously operate third-party UIs/APIs; the Commander must manually execute the benchmarks before AE can grade them.
  • Misaligned Target Market: Boutique agencies are budget-constrained and lack the capital to support a £100-300K/year recurring revenue target for pre-purchase audits.
What did we learn?
Commander override: KILL. The manual-labor bottleneck destroys solo feasibility: AE cannot autonomously operate third-party AI tool UIs/APIs, so the Commander must manually acquire access, learn each interface, input the agency benchmarks, and extract outputs before AE can grade anything. That inverts the value proposition into "AE as grading assistant to Commander consulting," which is exactly the operating mode the Commander is trying to escape. The idea also suffers a missing-middle economic squeeze (£30/mo tools do not justify £2k audits; enterprise tools come with free vendor POCs) and buyer-budget constraints in the boutique-agency ICP. This was the first Corpus S graduation, so the kill is diagnostic for Corpus S quality.

Filter scores

Five axes, each scored 0-3, for a composite out of 15. Three independent runs by different model perspectives; the median is shown.

  • Data moat: Does this product accumulate proprietary data that compounds?
  • 10x model test: Does a better model make this more valuable, or redundant?
  • Fast feedback loops: Can outputs be graded against reality in <30 days?
  • Solo founder feasible: Can a solo operator build and run this without a team?
  • AI providers can't eat it: Do hyperscalers have structural reasons NOT to build this?
Composite median: 10.5 / 15. Graduation threshold: 9.0. IQR across runs: 2.0.
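For concreteness, one plausible reading of the aggregation (the exact method is not specified above, and a half-point composite like 10.5 implies the real pipeline averages runs or permits fractional axis scores) is: each run totals its five axis scores, the composite is the median of run totals, and the spread is the interquartile range across runs. A minimal Python sketch under those assumptions, with made-up scores:

    import statistics

    # Hypothetical axis scores: three independent runs x five axes, each 0-3.
    runs = [
        [2, 3, 3, 1, 2],   # run 1 -> total 11
        [2, 2, 3, 2, 2],   # run 2 -> total 11
        [3, 2, 3, 1, 3],   # run 3 -> total 12
    ]

    totals = [sum(run) for run in runs]            # one composite per run
    composite = statistics.median(totals)          # median across runs
    q1, _, q3 = statistics.quantiles(totals, n=4)  # quartile cut points
    iqr = q3 - q1                                  # run-to-run spread
    graduated = composite >= 9.0                   # graduation threshold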

Evidence

Signal B — Competitor with documented gap

Competitors shown are software platforms for ongoing LLM evaluation/testing, but the hypothesis is a bounded, done-for-you pre-purchase decision service for agencies. The evidence establishes competitor existence but only a partial gap: these tools appear to be productized for internal AI teams and enterprise applications, not packaged as a fast written verification dossier for boutique agencies deciding whether to run a limited pilot.

Signal D — Demand proxy

{"summary":"Indirect evidence suggests active market interest in comparing AI tool performance and benchmarking models, plus strong ecosystem activity around evaluation frameworks.","sources":["https://github.com/openai/evals","https://github.com/eleutherai/lm-evaluation-harness","https://www.reddit.com/r/ClaudeAI/comments/1r9jf2j/i_benchmarked_opus_46_vs_sonnet_46_on_agentic_pr/","https://www.reddit.com/r/GithubCopilot/comments/1r7yxqa/your_experience_with_new_claude_sonnet_46_vs_45/","https://www.reddit.com/r/ChatGPT/comments/1r3kkl8/after_3_years_with_chatgpt_i_tried_claude_and/"]}

Evaluation history

When              Stage                 Phase
2026-04-19 10:27  deep_council_verdict  graduated
2026-04-19 10:17  deep_claude_take      graduated
2026-04-19 10:14  deep_90day_plan       graduated
2026-04-19 09:36  deep_risk             graduated
2026-04-19 09:28  deep_distribution     graduated
2026-04-19 09:15  deep_pricing          graduated
2026-04-19 09:06  deep_moat             graduated
2026-04-19 08:52  deep_buyer_sim        graduated
2026-04-19 08:46  deep_icp              graduated
2026-04-19 08:36  deep_competitor       graduated
2026-04-19 08:27  deep_market_reality   graduated
2026-04-19 08:10  filter_score          scored
2026-04-19 08:00  filter_score          scored
2026-04-19 07:50  filter_score          scored
2026-04-19 07:40  evidence_search       argument
2026-04-19 07:30  audience_simulation   argument
2026-04-19 07:20  red_team_kill         argument
2026-04-19 07:10  steelman              argument
2026-04-19 07:00  genesis               argument