Rotate judge model pair across the 3 filter_score runs

council rejected · SKILL · reversible: medium · 4h · proposed 12 Jun 2026
What is the proposed change?
filter_score.js lines 118-124 hardcode the judge pair as (Opus or Sonnet) + Gemini for all 3 runs per hypothesis. As a result, the IQR measure conflates 'judges agree' with 'identical sampling noise across identical prompts'. Replace the hardcoded pair with a 3-run rotation table: run 1 = (Opus, Gemini), run 2 = (Sonnet, Gemini), run 3 = (Opus, Sonnet). Select the pair using runsAlreadyCompleted, which is already tracked at line 56. The high caller is the first model in the pair and the low caller is the second. Persist judge_pair in the move output JSON so meta_sweep can audit per-pair score distributions.
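The rotation described above might be sketched as follows. This is a minimal illustration, not the code in filter_score.js: the names JUDGE_ROTATION, selectJudgePair, and withJudgePair are hypothetical, the model labels are the ones used in this proposal rather than real API identifiers, and storing judge_pair as a two-element array is an assumption (the proposal does not fix a format).

```javascript
// Hypothetical rotation table: one [highCaller, lowCaller] pair per run.
// Labels are the proposal's shorthand, not real API model identifiers.
const JUDGE_ROTATION = [
  ["Opus", "Gemini"],   // run 1
  ["Sonnet", "Gemini"], // run 2
  ["Opus", "Sonnet"],   // run 3
];

// Pick the pair for the next run from the number of runs already completed
// (0, 1, or 2). Anything else is rejected rather than silently wrapped.
function selectJudgePair(runsAlreadyCompleted) {
  if (
    !Number.isInteger(runsAlreadyCompleted) ||
    runsAlreadyCompleted < 0 ||
    runsAlreadyCompleted >= JUDGE_ROTATION.length
  ) {
    throw new RangeError(
      `runsAlreadyCompleted must be 0..${JUDGE_ROTATION.length - 1}`
    );
  }
  const [highCaller, lowCaller] = JUDGE_ROTATION[runsAlreadyCompleted];
  return { highCaller, lowCaller };
}

// Attach the pair to a move output object so a later audit can group by it.
// Returns a new object; the input is not mutated.
function withJudgePair(moveOutput, runsAlreadyCompleted) {
  const { highCaller, lowCaller } = selectJudgePair(runsAlreadyCompleted);
  return { ...moveOutput, judge_pair: [highCaller, lowCaller] };
}
```

Rejecting out-of-range counters instead of wrapping with a modulo is a design choice made here for the sketch: it keeps a fourth run from quietly reusing the run 1 pair.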
Target files
hypothesis_engine/moves/filter_score.js
Expected effect
Cross-vendor doctrine S160 says the generator and judge must come from different model families to decorrelate errors. filter_score already decorrelates within a run, but not across runs. Expect the median IQR across the 5 axes to rise by 0.4-0.8; the 3 runs are currently too consistent because they all use the identical judge pair. Pairs that consistently disagree on a single axis surface model-family bias for that axis.
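One way the per-pair audit could look is sketched below. It reads "pairs that disagree on an axis" as a gap between per-pair mean scores, which is an interpretation rather than something the proposal states. The record shape ({ judge_pair: [high, low], scores: { axis: number } }), the function names, and the axis names in the usage note are all hypothetical.

```javascript
// Mean score for each (judge pair, axis) combination.
// records: [{ judge_pair: ["Opus", "Gemini"], scores: { someAxis: 2, ... } }, ...]
function meanScoreByPairAndAxis(records) {
  const acc = new Map(); // "pair|axis" -> running sum and count
  for (const { judge_pair, scores } of records) {
    const pair = judge_pair.join("+");
    for (const [axis, score] of Object.entries(scores)) {
      const key = `${pair}|${axis}`;
      const entry = acc.get(key) ?? { pair, axis, sum: 0, n: 0 };
      entry.sum += score;
      entry.n += 1;
      acc.set(key, entry);
    }
  }
  return [...acc.values()].map(({ pair, axis, sum, n }) => ({
    pair,
    axis,
    mean: sum / n,
    n,
  }));
}

// Gap between the highest and lowest per-pair mean on one axis.
// A large gap on a single axis hints at pair-specific bias for that axis.
function axisSpread(means, axis) {
  const values = means.filter((m) => m.axis === axis).map((m) => m.mean);
  if (values.length === 0) return 0;
  return Math.max(...values) - Math.min(...values);
}
```

For example, calling axisSpread(meanScoreByPairAndAxis(records), "novelty") on a set of persisted move outputs would give the spread for a hypothetical "novelty" axis; what counts as a large spread is left to the auditor.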
Falsifier — what would prove this wrong?
Rerun with rotation enabled on 20 hypotheses, using full 3-run scoring. If the mean IQR (currently ~0.5 from existing data) does not increase by at least 0.3, then judge-pair rotation is not surfacing additional disagreement: the current single-pair setup is already capturing real signal, and rotation is wasted compute.
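The falsifier check can be sketched as a comparison of mean IQR before and after rotation. The proposal does not say which quartile convention filter_score.js uses, so the linear-interpolation quantile below is an assumption, as are the function names and the input shape (one array of run scores per hypothesis).

```javascript
// Quantile of an already-sorted array using linear interpolation between
// neighbouring order statistics (an assumed convention, see lead-in).
function quantile(sorted, p) {
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Interquartile range of one hypothesis's run scores.
function iqr(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return quantile(sorted, 0.75) - quantile(sorted, 0.25);
}

// Mean IQR over a set of hypotheses; each entry is an array of run scores.
function meanIqr(runsPerHypothesis) {
  const total = runsPerHypothesis.reduce((sum, runs) => sum + iqr(runs), 0);
  return total / runsPerHypothesis.length;
}

// True when rotation raised the mean IQR by at least minIncrease, that is,
// when the proposal survives its own falsifier.
function rotationSurvivesFalsifier(baseline, rotated, minIncrease = 0.3) {
  return meanIqr(rotated) - meanIqr(baseline) >= minIncrease;
}
```

With three runs per hypothesis, this convention reduces the IQR to half the gap between the highest and lowest score, so a different quartile rule would shift the numbers being compared against the 0.3 threshold.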
Evidence that triggered the proposal
  • D — hypothesis_engine/moves/filter_score.js:118-124 — same (Opus|Sonnet)+Gemini pair on every run; composite uses median - 0.5*IQR but IQR is sampling-only
  • D — ARCHITECT_MEMORY S160 — cross-vendor judging principle: generator and judge must be different model families to decorrelate errors

Proposer self-score

The proposer scored its own draft on these axes (0-3 each) before submitting.

Axis            Score
specificity     3
falsifier       3
solo feasible   3
blast radius    2
composability   2
reversibility   2
Disposition
Rejected at the council verdict. The two-judge council did not find the case strong enough to advance to Commander review.

Evaluation history

When              Move
2026-06-12 05:28  meta_council_verdict
2026-06-12 05:09  meta_argument
2026-06-12 04:45  meta_filter_score
2026-06-12 04:05  meta_genesis