Calibration
The engine's track record, scored against itself. These are the numbers a customer should review before trusting any verdict.
Headline metrics
| Metric | Value |
|---|---|
| Graduation rate (decided) | 49% |
| Commander override rate | 10% |
| Avg cost per hypothesis | $3.61 |
The override rate is the percentage of graduated-or-overridden cases where the human disagreed with the engine. A high rate means the engine is missing something the human catches; a low rate means the engine is well calibrated. The current rate of 10% sits within the target band of 10-25%.
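As a sketch, the rate can be computed from (engine verdict, human verdict) pairs. The pair shape and the verdict labels here are assumptions for illustration, not the engine's actual schema:

```python
def override_rate(cases):
    """cases: iterable of (engine_verdict, human_verdict) string pairs.

    Denominator: cases that graduated or were overridden by the commander.
    Numerator: cases where the human verdict disagreed with the engine's.
    Verdict labels ("graduate"/"kill") are illustrative assumptions.
    """
    pool = [(e, h) for e, h in cases if e == "graduate" or h != e]
    overrides = [pair for pair in pool if pair[0] != pair[1]]
    return len(overrides) / len(pool) if pool else 0.0
```

For example, nine agreed graduations plus one engine kill overturned by the commander give an override rate of 0.10.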
Filter score distribution (graduated)
Distribution of the 35 graduated hypotheses by composite filter score (out of 15).
| Score band | Count |
|---|---|
| 9.0-9.9 | 10 |
| 10.0-10.9 | 21 |
| 11.0-11.9 | 3 |
| 12.0+ | 1 |
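The 1-point banding above can be reproduced with a small helper. The band-label format is an assumption, and the table's open-ended `12.0+` band is treated here as just another 1-point bucket:

```python
from collections import Counter

def band(score, width=1.0):
    """Label the band a composite filter score falls in, e.g. 10.3 -> '10.0-10.9'."""
    lo = int(score // width) * width
    return f"{lo:.1f}-{lo + width - 0.1:.1f}"

def distribution(scores):
    """Count hypotheses per score band, mirroring the table above."""
    return Counter(band(s) for s in scores)
```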
Commander overrides
Why hypotheses get killed
| Reason | Count |
|---|---|
| evidence_search_exhausted | 15 |
| move_cap_reached | 3 |
| council_verdict_unanimous_kill | 1 |
Cost transparency
Total engine spend: $267.25 across 1,549 logged operations, covering all moves. Average cost per hypothesis, from admission to current state: $3.61.
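The headline figures imply two derived quantities worth sanity-checking: cost per logged operation, and the approximate number of hypotheses the average is taken over. The hypothesis count below is inferred from the other two numbers, not a reported figure:

```python
total_spend = 267.25       # USD, all moves (reported)
logged_ops = 1_549         # reported
avg_per_hypothesis = 3.61  # USD, admission to current state (reported)

per_op = total_spend / logged_ops                      # roughly $0.17 per logged operation
implied_hypotheses = total_spend / avg_per_hypothesis  # roughly 74 hypotheses (inferred)
```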
Known limitations
- The graduation bar is not a buy signal. A graduated hypothesis has passed structural filters; it has not been validated against real customer demand.
- Filter scoring uses LLM advocates. Two perspectives, three runs, median taken, but still subject to LLM bias. The triple-run IQR is the engine's measure of its own consistency, not its accuracy.
- Signals come from agentic web search. Quality depends on what is findable; absence of a primary source does not mean none exists.
- The engine has no track record on commercial outcomes yet. No graduated hypothesis has been built into a product. Until one is, the calibration is methodology-only, not outcome-validated.
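The median-of-three scoring with IQR as a consistency measure (second bullet above) can be sketched with the standard library. The scores below are hypothetical; note that with only three runs, Python's default "exclusive" quantile method makes Q1 the minimum and Q3 the maximum, so the IQR collapses to max minus min:

```python
import statistics

def median_and_iqr(runs):
    """Median and interquartile range of repeated advocate scores for one hypothesis.

    With three runs and the default 'exclusive' quantile method,
    Q1 is the minimum and Q3 the maximum, so the IQR is max - min.
    """
    med = statistics.median(runs)
    q1, _, q3 = statistics.quantiles(runs, n=4)
    return med, q3 - q1
```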