
Third-judge tiebreaker harness on high-IQR filter_score runs

council rejected · HARNESS · reversible: simple · 4h · proposed 18 Jun 2026
What is the proposed change?
After the third filter_score run completes and the median and IQR are computed, check whether iqrTotal ≥ 1.5, which indicates high disagreement across the 3 GPT-vs-Gemini midpoint runs. If so, make a single tiebreaker call to a third model (Codex via llm.callCodex, already used by meta_engine judging). The tiebreaker scores the same 5 filters once, and its run total is averaged with the median to produce tiebroken_median, which replaces median in the composite formula. The tiebreaker is logged as a separate move row (move_type='filter_score_tiebreak') so its cost can be audited. The threshold and the on/off switch are controlled by config flags.
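The flow described above could be sketched roughly as follows. This is a minimal illustration, not the code in filter_score.js: the config shape, the helper names (`shouldTiebreak`, `tiebrokenMedian`, `applyTiebreak`), the row fields other than `move_type`, and the `callTiebreaker` callback (a stand-in for llm.callCodex, whose real signature is not shown here) are all assumptions.

```javascript
// Hypothetical sketch of the proposed tiebreak step. Only iqrTotal,
// tiebroken_median and move_type='filter_score_tiebreak' come from the
// proposal; every other name here is an assumption.
const DEFAULT_TIEBREAK_CONFIG = { enabled: true, iqrThreshold: 1.5 };

// Median of a list of run totals (three in the proposal).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// The tiebreaker fires only when the flag is on and IQR meets the threshold.
function shouldTiebreak(iqrTotal, config) {
  return config.enabled && iqrTotal >= config.iqrThreshold;
}

// The tiebreaker's run total is averaged with the median of the runs.
function tiebrokenMedian(medianTotal, tiebreakTotal) {
  return (medianTotal + tiebreakTotal) / 2;
}

// callTiebreaker: async function resolving to an array of per-filter
// scores from the third model (assumed shape).
async function applyTiebreak(
  { runTotals, iqrTotal },
  callTiebreaker,
  config = DEFAULT_TIEBREAK_CONFIG
) {
  const med = median(runTotals);
  if (!shouldTiebreak(iqrTotal, config)) {
    // Low disagreement or feature disabled: the plain median is used.
    return { scoreForComposite: med, tiebreakRow: null };
  }
  const filterScores = await callTiebreaker();
  const tiebreakTotal = filterScores.reduce((sum, s) => sum + s, 0);
  const tiebroken = tiebrokenMedian(med, tiebreakTotal);
  return {
    scoreForComposite: tiebroken,
    // Logged as its own move row so the extra call can be cost-audited.
    tiebreakRow: {
      move_type: 'filter_score_tiebreak',
      tiebreak_total: tiebreakTotal,
      tiebroken_median: tiebroken,
    },
  };
}
```

Under these assumptions, run totals of 8, 9 and 12 with a tiebreak total of 12 would give a composite input of (9 + 12) / 2 = 10.5 instead of 9. Keeping the decision and the averaging in small pure functions means the threshold logic can be tested without making a model call.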
Target files
hypothesis_engine/moves/filter_score.js
Expected effect
Currently ~10-20% of completed scoring sequences end with IQR ≥ 1.5 (visible in the `filter_score_iqr` histogram). For these high-disagreement candidates, tiebroken_median is expected to shift the composite by ≥0.5 points in ≥50% of cases, changing the top-of-queue ordering. Promotion/kill decisions on candidates currently sitting near the score floor should flip on roughly 1 in 8 high-IQR runs.
Falsifier — what would prove this wrong?
Run on the next 15 candidates that hit the IQR ≥ 1.5 condition. If the Codex tiebreak total is within ±0.5 of the median in more than 12 of the 15 cases, the disagreement was symmetric noise and Codex adds no signal, so the harness is wasteful. Disable it and refund the cost budget.
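The falsifier check above could be evaluated with something like the sketch below. The function name, the record shape (`median`, `tiebreakTotal`) and the option names are hypothetical. Only the numbers (15 cases, ±0.5, more than 12) come from the proposal.

```javascript
// Hypothetical falsifier check: counts how often the tiebreak total lands
// within the tolerance of the median over the first `sampleSize` cases.
function evaluateFalsifier(
  cases,
  { sampleSize = 15, tolerance = 0.5, maxAgreeing = 12 } = {}
) {
  if (cases.length < sampleSize) {
    // Not enough high-IQR candidates yet to decide either way.
    return { decided: false, agreeing: null, falsified: null };
  }
  const sample = cases.slice(0, sampleSize);
  const agreeing = sample.filter(
    (c) => Math.abs(c.tiebreakTotal - c.median) <= tolerance
  ).length;
  // More than maxAgreeing near-identical results means the tiebreaker
  // added no signal and the harness should be disabled.
  return { decided: true, agreeing, falsified: agreeing > maxAgreeing };
}
```

With the defaults, 13 or more agreeing cases out of 15 marks the proposal as falsified, while 12 or fewer leaves it standing.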
Evidence that triggered the proposal
  • E — filter_score.js lines 154-166 already computes spread per-filter and IQR across runs but does nothing with high-IQR signal
  • D — brain/META_ENGINE_S158_RED_TEAM_BRIEF.md — cross-vendor judging (Sonnet+Codex+Gemini) reduces correlated errors; pattern proven in sibling engine
  • D — brain/V2_FILTER_DESIGN_v2.3.md — 'shadow-mode calibration' contemplated third-judge but not implemented in v3.1

Proposer self-score

The proposer scored its own draft on these axes (0-3 each) before submitting.

Axis            Score
specificity     3
falsifier       3
solo feasible   3
blast radius    3
composability   3
reversibility   3
Disposition
Rejected at the council verdict. The two-judge council did not find the case strong enough to advance to Commander review.

Evaluation history

When               Move
2026-06-18 04:19   meta_council_verdict
2026-06-18 04:11   meta_argument
2026-06-18 04:07   meta_filter_score
2026-06-18 04:04   meta_genesis