
Swap council R1 Sonnet judge for gpt-5.5-codex (de-overlap with argument)

council rejected | HARNESS | reversible: simple | 5h | proposed 13 Jun 2026
What is the proposed change?
In meta_engine/moves/council_verdict.js, change JUDGES = ['sonnet','gemini'] (line 101) to JUDGES = ['codex','gemini']. Add a 'codex' branch in callJudge() (line 103) that calls llm.callCodexGpt55 with the same system/user contract and a 3000-token cap. Update the model string in moveFields (line 246) from 'sonnet-4.6+gemini-3.1' to 'codex-gpt55+gemini-3.1'.
Rationale: in the current pipeline Sonnet writes the case_for and then judges that same case in council R1, which amounts to self-confirmation. Codex is fully blind to both argument writers (Sonnet wrote case_for, Gemini wrote attack). The Gemini judge still partially sees its own attack, so the overlap is reduced rather than eliminated, but at least 1 of 2 council judges is now cross-vendor-clean.
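The change described above might look roughly like the following. This is a minimal sketch, not the contents of council_verdict.js: the `llm` object here is a stub, `callGemini`, `JUDGE_CLIENTS`, `JUDGE_MAX_TOKENS`, the argument shape passed to the client, and the `model` key on `moveFields` are all assumptions made for illustration. Only `JUDGES`, `callJudge`, `llm.callCodexGpt55`, the 3000-token cap, and the two model strings come from the proposal.

```javascript
// Stub standing in for the project's llm client. Only callCodexGpt55 is named
// in the proposal; its real signature is not shown there (assumed here).
const llm = {
  callCodexGpt55: async ({ system, user, maxTokens }) => ({ judge: 'codex', maxTokens }),
  callGemini: async ({ system, user, maxTokens }) => ({ judge: 'gemini', maxTokens }), // hypothetical name
};

const JUDGE_MAX_TOKENS = 3000; // cap stated in the proposal

// Proposed judge list; the current value is ['sonnet', 'gemini'].
const JUDGES = ['codex', 'gemini'];

// Hypothetical dispatch table: one client call per judge id.
const JUDGE_CLIENTS = {
  codex: llm.callCodexGpt55,
  gemini: llm.callGemini,
};

// Same system/user contract for every judge; throws on an unknown judge id.
function callJudge(judge, system, user) {
  const client = JUDGE_CLIENTS[judge];
  if (!client) throw new Error(`unknown judge: ${judge}`);
  return client({ system, user, maxTokens: JUDGE_MAX_TOKENS });
}

// Updated model string; the key name `model` is an assumption.
const moveFields = { model: 'codex-gpt55+gemini-3.1' };
```

A dispatch table is used here instead of an if/else branch only to keep the sketch short; the proposal itself asks for a 'codex' branch inside the existing callJudge().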
Target files
meta_engine/moves/council_verdict.js
Expected effect
The council R1 consensus rate (resolved_at='round_1') drops on the same input set: the replay rate should fall at least 10 percentage points below the current historical rate. The R2 escalation rate may rise but should not triple. Net effect: more proposals reach round 2 debate instead of being rubber-stamped at round 1.
Falsifier — what would prove this wrong?
Replay the last 19 council runs in dry-run mode with the new judge config. If the R1 consensus rate is unchanged within 5pp, vendor overlap was not driving consensus, so revert (the swap costs Codex calls for no behavior change). If the R2 escalation rate exceeds 3× baseline, we over-corrected and the council can't converge, so revert.
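The two revert rules above can be expressed as a small check over baseline and replay runs. This is a sketch under stated assumptions: each run is assumed to be an object with a `resolved_at` field, every run not resolved at round_1 is counted as an R2 escalation, and the function names and decision labels are hypothetical. The synthetic runs built by `makeRuns` are illustrative and are not real council data.

```javascript
// Thresholds taken from the falsifier text.
const R1_UNCHANGED_PP = 0.05; // "unchanged within 5pp"
const R2_MAX_MULTIPLIER = 3;  // "R2 escalation rate exceeds 3x baseline"

// Fraction of runs matching a predicate; 0 for an empty list.
function rate(runs, predicate) {
  if (runs.length === 0) return 0;
  return runs.filter(predicate).length / runs.length;
}

// Assumed run shape: { resolved_at: 'round_1' | 'round_2' }.
// Counting every non-round_1 run as an R2 escalation is an assumption.
function evaluateReplay(baselineRuns, replayRuns) {
  const isR1 = (run) => run.resolved_at === 'round_1';
  const r1Baseline = rate(baselineRuns, isR1);
  const r1Replay = rate(replayRuns, isR1);
  const r2Baseline = rate(baselineRuns, (run) => !isR1(run));
  const r2Replay = rate(replayRuns, (run) => !isR1(run));

  let decision = 'keep';
  if (Math.abs(r1Baseline - r1Replay) <= R1_UNCHANGED_PP) {
    decision = 'revert_no_effect'; // vendor overlap was not driving consensus
  } else if (r2Baseline > 0 && r2Replay > R2_MAX_MULTIPLIER * r2Baseline) {
    // The ratio test is skipped when the baseline escalation rate is zero,
    // because "3x baseline" is not meaningful in that case.
    decision = 'revert_over_corrected';
  }
  return { r1Baseline, r1Replay, r2Baseline, r2Replay, decision };
}

// Build synthetic runs for illustration only.
function makeRuns(round1Count, total) {
  return Array.from({ length: total }, (_, i) => ({
    resolved_at: i < round1Count ? 'round_1' : 'round_2',
  }));
}
```

For example, `evaluateReplay(makeRuns(15, 19), makeRuns(12, 19)).decision` evaluates to `'keep'` under these rules, since the R1 rate moves by more than 5pp and the escalation rate does not exceed three times its baseline.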
Evidence that triggered the proposal
  • D: S160 cross-vendor judging principle (different vendors decorrelate errors)
  • E: Engine traces: meta_argument uses Sonnet (case_for) + Gemini (attack); meta_council_verdict R1 uses Sonnet + Gemini → vendor overlap on both judges
  • E: Cost rollup: meta_council_verdict ~$0.12 over 19 runs; Codex calls via the headless CLI are free at point of use, per S159
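The overlap described in the engine-trace evidence can be checked mechanically. The sketch below is illustrative only: the function name and the use of short model ids as vendor labels are assumptions, while the writer and judge lists come from the proposal.

```javascript
// Models that write the arguments the council later judges:
// Sonnet writes case_for, Gemini writes attack.
const ARGUMENT_WRITERS = ['sonnet', 'gemini'];

// Return the judges that also appear among the argument writers.
function overlappingJudges(judges, writers) {
  return judges.filter((judge) => writers.includes(judge));
}

// Current config: both judges overlap with an argument writer.
const overlapBefore = overlappingJudges(['sonnet', 'gemini'], ARGUMENT_WRITERS);

// Proposed config: only the Gemini judge overlaps; Codex is clean.
const overlapAfter = overlappingJudges(['codex', 'gemini'], ARGUMENT_WRITERS);

console.log(overlapBefore.length, overlapAfter.length);
```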

Proposer self-score

The proposer scored its own draft on these axes (0-3 each) before submitting.

Axis            Score
specificity     3
falsifier       3
solo feasible   3
blast radius    2
composability   2
reversibility   3
Disposition
Rejected at the council verdict. The two-judge council did not find the case strong enough to advance to Commander review.

Evaluation history

When               Move
2026-06-13 04:20   meta_council_verdict
2026-06-13 04:13   meta_argument
2026-06-13 04:07   meta_filter_score
2026-06-13 04:04   meta_genesis