Third-judge tiebreaker harness on high-IQR filter_score runs

council rejected · HARNESS · reversible: simple · 4h · proposed 18 Jun 2026

What is the proposed change?

After the third filter_score run completes and the median and IQR are computed, check whether iqrTotal ≥ 1.5, which indicates high disagreement across the 3 GPT-vs-Gemini midpoint runs. If so, make a single tiebreaker call to a third model (Codex via llm.callCodex, already used for meta_engine judging). The tiebreaker scores the same 5 filters once; its run total is averaged with the median to produce tiebroken_median, which replaces median in the composite formula. Each tiebreaker call is logged as a separate move row (move_type='filter_score_tiebreak') so its cost can be audited. The threshold and the on/off switch are controlled by a config flag.
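The gating and averaging described above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of filter_score.js: TIEBREAK_CONFIG, needsTiebreak, buildTiebreak and the row field names other than move_type are hypothetical, and the third-judge call itself (llm.callCodex in the proposal) is left out because its signature is not given here.

```javascript
// Sketch only: TIEBREAK_CONFIG, needsTiebreak and buildTiebreak are hypothetical
// names; iqrTotal, tiebroken_median and move_type come from the proposal text.
const TIEBREAK_CONFIG = { enabled: true, iqrThreshold: 1.5 };

// True when the harness is switched on and disagreement across the runs is high.
function needsTiebreak(iqrTotal, config = TIEBREAK_CONFIG) {
  return config.enabled && iqrTotal >= config.iqrThreshold;
}

// Average the third judge's run total with the median and build the audit row.
function buildTiebreak(median, iqrTotal, tiebreakTotal) {
  const tiebrokenMedian = (median + tiebreakTotal) / 2;
  return {
    tiebroken_median: tiebrokenMedian,
    moveRow: {
      move_type: 'filter_score_tiebreak', // separate row for cost auditing
      median,
      iqr_total: iqrTotal,
      tiebreak_total: tiebreakTotal,
      tiebroken_median: tiebrokenMedian,
    },
  };
}

// Usage: the value 8 stands in for the total returned by the third judge.
const median = 6;
const iqrTotal = 2;
const scoreForComposite = needsTiebreak(iqrTotal)
  ? buildTiebreak(median, iqrTotal, 8).tiebroken_median
  : median;
console.log(scoreForComposite); // 7
```

Keeping the gate and the averaging as pure functions means the threshold can be changed or the harness disabled from config without touching the scoring path, and the third-judge call is only made when needsTiebreak returns true.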

Target files

hypothesis_engine/moves/filter_score.js

Expected effect

Currently ~10-20% of completed scoring sequences end with IQR ≥ 1.5 (visible in the `filter_score_iqr` histogram). For these high-disagreement candidates, tiebroken_median is expected to shift the composite by ≥0.5 points in ≥50% of cases, changing the top-of-queue ordering. Promotion/kill decisions on candidates currently sitting near the score floor should flip on roughly 1 in 8 high-IQR runs.

Falsifier — what would prove this wrong?

Run the harness on the next 15 candidates that hit the IQR ≥ 1.5 condition. If the Codex tiebreak total is within ±0.5 of the median in more than 12 of the 15 cases, the disagreement was symmetric noise and Codex adds no signal, so the harness is wasteful: disable it and refund the cost budget.
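As a rough sketch of how that check could be tallied, assuming each high-IQR candidate is recorded as a pair of median and tiebreak total (the function name evaluateFalsifier and the field names are hypothetical):

```javascript
// Sketch only: evaluateFalsifier and the { median, tiebreakTotal } shape are
// hypothetical. Defaults mirror the falsifier text: 15 cases, ±0.5, >12 within.
function evaluateFalsifier(
  cases,
  { tolerance = 0.5, sampleSize = 15, maxWithin = 12 } = {}
) {
  // Only the first sampleSize high-IQR candidates count toward the test.
  const sample = cases.slice(0, sampleSize);
  const within = sample.filter(
    ({ median, tiebreakTotal }) => Math.abs(tiebreakTotal - median) <= tolerance
  ).length;
  const complete = sample.length === sampleSize;
  return {
    complete, // false until enough candidates have been collected
    within, // cases where the tiebreak landed within tolerance of the median
    falsified: complete && within > maxWithin, // true means disable the harness
  };
}
```

This version treats a difference of exactly 0.5 as within tolerance and withholds a verdict until all 15 cases are in; both are choices made for the sketch rather than details stated in the proposal.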

Evidence that triggered the proposal

E — filter_score.js lines 154-166 already compute per-filter spread and IQR across runs, but nothing acts on the high-IQR signal
D — brain/META_ENGINE_S158_RED_TEAM_BRIEF.md — cross-vendor judging (Sonnet+Codex+Gemini) reduces correlated errors; pattern proven in sibling engine
D — brain/V2_FILTER_DESIGN_v2.3.md — 'shadow-mode calibration' contemplated third-judge but not implemented in v3.1

Proposer self-score

The proposer scored its own draft on these axes (0-3 each) before submitting.

Axis	Score
specificity	3
falsifier	3
solo feasible	3
blast radius	3
composability	3
reversibility	3

Disposition

Rejected at the council verdict. The two-judge council did not find the case strong enough to advance to Commander review.

Evaluation history

When	Move
2026-06-18 04:19	meta_council_verdict
2026-06-18 04:11	meta_argument
2026-06-18 04:07	meta_filter_score
2026-06-18 04:04	meta_genesis