Rotate judge model pair across the 3 filter_score runs

council rejected | SKILL | reversible: medium | 4h | proposed 12 Jun 2026

What is the proposed change?

filter_score.js lines 118-124 hardcode the judge pair as (Opus or Sonnet) + Gemini for all 3 runs per hypothesis. Because every run uses the same pair on identical prompts, the IQR across runs cannot distinguish 'the judges agree' from 'the same pair produced the same sampling noise'. Replace the hardcoded pair with a 3-run rotation table: run 1 = (Opus, Gemini), run 2 = (Sonnet, Gemini), run 3 = (Opus, Sonnet). Select the pair with runsAlreadyCompleted, which is already tracked at line 56. The high caller is the first model in the pair and the low caller is the second. Persist judge_pair in the move output JSON so that meta_sweep can audit per-pair score distributions.
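The rotation described above could look roughly like the sketch below. Only runsAlreadyCompleted and judge_pair come from the proposal. The names JUDGE_ROTATION, selectJudgePair, highCaller and lowCaller, the lowercase model identifiers, the string form of judge_pair, and the wrap-around for a fourth or later run are all assumptions made for illustration, not the contents of filter_score.js.

```javascript
// Hypothetical rotation table: index = number of runs already completed.
// Model identifiers are placeholders, not the engine's real model IDs.
const JUDGE_ROTATION = [
  ["opus", "gemini"],   // run 1
  ["sonnet", "gemini"], // run 2
  ["opus", "sonnet"],   // run 3
];

// Pick the judge pair for the next run.
// Assumption: runsAlreadyCompleted is a non-negative integer (0 before run 1).
// Assumption: runs beyond the third wrap around the table.
function selectJudgePair(runsAlreadyCompleted) {
  if (!Number.isInteger(runsAlreadyCompleted) || runsAlreadyCompleted < 0) {
    throw new RangeError("runsAlreadyCompleted must be a non-negative integer");
  }
  const [highCaller, lowCaller] =
    JUDGE_ROTATION[runsAlreadyCompleted % JUDGE_ROTATION.length];
  return {
    highCaller, // first model in the pair
    lowCaller,  // second model in the pair
    judge_pair: `${highCaller}+${lowCaller}`, // value to persist in the output JSON
  };
}
```

Calling selectJudgePair(1) would return the (Sonnet, Gemini) pair for run 2. Storing judge_pair as a single string keeps the audit key simple, but an array would work equally well.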

Target files

hypothesis_engine/moves/filter_score.js

Expected effect

Cross-vendor doctrine S160 says the generator and judge must be different model families to decorrelate errors. filter_score already decorrelates within a run, but not across runs: all 3 runs use the identical judge pair, so they are currently too consistent. Expect the median IQR across the 5 axes to rise by 0.4-0.8. A pair that consistently disagrees on a single axis indicates model-family bias on that axis.
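One way to look for per-axis disagreement is to group persisted scores by judge pair and axis and compare the mean gap between the two callers. The sketch below is illustrative only: the record shape (judge_pair, axis, highScore, lowScore) and the function name are assumptions, since the proposal does not specify the output JSON layout or how meta_sweep reads it.

```javascript
// Hypothetical audit helper. Each record is assumed to hold one axis score
// from each caller in a single run:
//   { judge_pair: "opus+gemini", axis: "specificity", highScore: 3, lowScore: 1 }
function disagreementByPairAndAxis(records) {
  const groups = new Map();
  for (const { judge_pair, axis, highScore, lowScore } of records) {
    const key = `${judge_pair}|${axis}`;
    const entry = groups.get(key) ?? { judge_pair, axis, total: 0, count: 0 };
    entry.total += Math.abs(highScore - lowScore); // absolute gap between callers
    entry.count += 1;
    groups.set(key, entry);
  }
  // Mean gap per (pair, axis); a persistently large value suggests family bias.
  return [...groups.values()].map(({ judge_pair, axis, total, count }) => ({
    judge_pair,
    axis,
    count,
    meanGap: total / count,
  }));
}
```

A pair whose meanGap stays high on one axis while the other pairs stay low on it would be a candidate for the model-family bias described above.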

Falsifier — what would prove this wrong?

Run the rotation on 20 hypotheses with full 3-run scoring. If the mean IQR (currently ~0.5 from existing data) does not increase by at least 0.3, judge-pair rotation is not surfacing additional disagreement: the current single-pair setup is already capturing real signal, and rotation is wasted compute.
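The falsifier reduces to a comparison of two means, which can be sketched as follows. The 0.3 threshold comes from the text. The function names and the assumption that per-hypothesis IQR values are already available as plain number arrays are illustrative.

```javascript
// Threshold from the falsifier: the mean IQR must rise by at least this much.
const MIN_IQR_INCREASE = 0.3;

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// baselineIqrs: per-hypothesis IQR under the current single-pair setup.
// rotatedIqrs:  per-hypothesis IQR under the rotation.
// Both inputs are assumed to be non-empty arrays of numbers.
function evaluateFalsifier(baselineIqrs, rotatedIqrs) {
  if (baselineIqrs.length === 0 || rotatedIqrs.length === 0) {
    throw new Error("both IQR samples must be non-empty");
  }
  const baselineMean = mean(baselineIqrs);
  const rotatedMean = mean(rotatedIqrs);
  const delta = rotatedMean - baselineMean;
  return {
    baselineMean,
    rotatedMean,
    delta,
    // false means the falsifier fired and rotation is not worth the compute
    rotationJustified: delta >= MIN_IQR_INCREASE,
  };
}
```

With 20 hypotheses, each array would hold 20 values. A delta that lands very close to 0.3 is sensitive to floating-point rounding and deserves a manual look.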

Evidence that triggered the proposal

D — hypothesis_engine/moves/filter_score.js:118-124 — same (Opus|Sonnet)+Gemini pair on every run; composite uses median - 0.5*IQR but IQR is sampling-only
D — ARCHITECT_MEMORY S160 — cross-vendor judging principle: generator and judge must be different model families to decorrelate errors
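For reference, the composite named in the evidence (median - 0.5*IQR) can be sketched as below. The formula is from the evidence line. The linear-interpolation quantile and the function names are assumptions, since the source does not say which quartile method filter_score.js uses, and different methods give different IQRs for only 3 runs.

```javascript
// Quantile by linear interpolation between closest ranks (an assumed method).
// Expects an ascending-sorted, non-empty array and 0 <= q <= 1.
function quantile(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Composite for one axis across runs: median - 0.5 * IQR.
function compositeScore(runScores) {
  if (runScores.length === 0) {
    throw new Error("at least one run score is required");
  }
  const sorted = [...runScores].sort((a, b) => a - b); // numeric sort on a copy
  const median = quantile(sorted, 0.5);
  const iqr = quantile(sorted, 0.75) - quantile(sorted, 0.25);
  return { median, iqr, composite: median - 0.5 * iqr };
}
```

Under this quantile method, three identical run scores give an IQR of 0, so the composite equals the median and the penalty term never fires. That is the situation the proposal argues is over-represented when all runs share one judge pair.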

Proposer self-score

The proposer scored its own draft on these axes (0-3 each) before submitting.

Axis	Score
specificity	3
falsifier	3
solo feasible	3
blast radius	2
composability	2
reversibility	2

Disposition

Rejected at the council verdict. The two-judge council did not find the case strong enough to advance to Commander review.

Evaluation history

When	Move
2026-06-12 05:28	meta_council_verdict
2026-06-12 05:09	meta_argument
2026-06-12 04:45	meta_filter_score
2026-06-12 04:05	meta_genesis