Swap council R1 Sonnet judge for gpt-5.5-codex (de-overlap with argument)

council rejected HARNESS reversible: simple 5h proposed 13 Jun 2026

What is the proposed change?

In meta_engine/moves/council_verdict.js, change JUDGES = ['sonnet','gemini'] (line 101) to JUDGES = ['codex','gemini']. Add a 'codex' branch to callJudge() (line 103) that calls llm.callCodexGpt55 with the same system/user contract and a 3000-token cap. Update the model string in moveFields (line 246) from 'sonnet-4.6+gemini-3.1' to 'codex-gpt55+gemini-3.1'.

Rationale: in the current pipeline Sonnet writes the case_for and then judges that same case in council R1, which is self-confirmation. Codex wrote neither argument (Sonnet wrote case_for, Gemini wrote attack), so it judges both blind. The Gemini judge still evaluates its own attack, but at least one of the two council judges is now cross-vendor-clean.
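A minimal sketch of how the edited judge selection might look. Only JUDGES, callJudge(), llm.callCodexGpt55, and the 3000-token cap are named in the proposal. The llm stub, the callSonnet and callGemini method names, the { system, user, maxTokens } request shape, the runRoundOne helper, and applying the cap to every judge are assumptions made so the example runs on its own. The real file may differ.

```javascript
// Stand-in for the real llm module so the sketch is self-contained.
// callSonnet/callGemini and the request shape are hypothetical; only
// callCodexGpt55 is named in the proposal.
const llm = {
  callSonnet: async (req) => ({ judge: 'sonnet', ...req }),
  callGemini: async (req) => ({ judge: 'gemini', ...req }),
  callCodexGpt55: async (req) => ({ judge: 'codex', ...req }),
};

// Token cap from the proposal; applying it to every judge is an assumption.
const JUDGE_MAX_TOKENS = 3000;

// Proposed value. The current value is ['sonnet', 'gemini'].
const JUDGES = ['codex', 'gemini'];

// Dispatch one judge call using the shared system/user contract.
async function callJudge(judge, system, user) {
  const req = { system, user, maxTokens: JUDGE_MAX_TOKENS };
  switch (judge) {
    case 'codex':
      return llm.callCodexGpt55(req); // new branch
    case 'gemini':
      return llm.callGemini(req);
    case 'sonnet':
      return llm.callSonnet(req); // kept so a revert is a one-line change
    default:
      throw new Error(`Unknown judge: ${judge}`);
  }
}

// Hypothetical helper: run every configured judge on the same prompt pair.
function runRoundOne(system, user) {
  return Promise.all(JUDGES.map((judge) => callJudge(judge, system, user)));
}
```

Keeping the 'sonnet' branch in place means the revert described in the falsifier only has to restore the JUDGES array and the moveFields model string.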

Target files

meta_engine/moves/council_verdict.js

Expected effect

The council R1 consensus rate (runs with resolved_at='round_1') drops measurably on the same input set: the replay rate should sit at least 10 percentage points below the current historical rate. The R2 escalation rate may rise but should not triple. Net effect: more proposals reach round 2 debate instead of being rubber-stamped at round 1.
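The R1 consensus rate and its shift in percentage points could be computed along these lines. This is a sketch: only the resolved_at field and its 'round_1' value come from the proposal, while the 'round_2' value, the function names, and the sample records are invented for illustration.

```javascript
// Fraction of runs that reached consensus in round 1.
function roundOneConsensusRate(runs) {
  if (runs.length === 0) return 0;
  const atRoundOne = runs.filter((run) => run.resolved_at === 'round_1').length;
  return atRoundOne / runs.length;
}

// Change between two rates in [0, 1], expressed in percentage points.
// Negative means the replay rate is lower than the baseline.
function percentagePointDelta(baselineRate, replayRate) {
  return (replayRate - baselineRate) * 100;
}

// Illustrative data only, not real council output.
const historicalRuns = [
  { resolved_at: 'round_1' },
  { resolved_at: 'round_1' },
  { resolved_at: 'round_1' },
  { resolved_at: 'round_2' },
];
const replayRuns = [
  { resolved_at: 'round_1' },
  { resolved_at: 'round_1' },
  { resolved_at: 'round_2' },
  { resolved_at: 'round_2' },
];

const deltaPp = percentagePointDelta(
  roundOneConsensusRate(historicalRuns),
  roundOneConsensusRate(replayRuns),
);
console.log(deltaPp); // -25
```

With these sample records the rate falls from 75% to 50%, a 25-point drop, which would clear the 10-point bar.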

Falsifier — what would prove this wrong?

Replay the last 19 council runs in dry-run mode with the new judge config. If the R1 consensus rate is unchanged (within 5pp of baseline), vendor overlap was not driving consensus: revert, since the swap spends Codex calls for no behavior change. If the R2 escalation rate exceeds 3× baseline, we over-corrected and the council can't converge: revert.
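The two revert conditions can be written as a small decision function. This is a sketch under stated assumptions: the function name, the input shape, and the result shape are hypothetical, "within 5pp" is read as an absolute shift of at most 5 points, and the 3× check is applied literally, so a zero baseline escalation rate makes any replay escalation trip it.

```javascript
// Rates are fractions in [0, 1]. Returns a keep/revert decision for the swap.
function evaluateReplay({ baselineR1, replayR1, baselineR2, replayR2 }) {
  // Assumption: "unchanged within 5pp" means |shift| <= 5 percentage points.
  const r1ShiftPp = Math.abs(replayR1 - baselineR1) * 100;
  if (r1ShiftPp <= 5) {
    return { action: 'revert', reason: 'R1 consensus rate unchanged within 5pp' };
  }
  // Literal reading of ">3x baseline".
  if (replayR2 > 3 * baselineR2) {
    return { action: 'revert', reason: 'R2 escalation rate above 3x baseline' };
  }
  return { action: 'keep', reason: 'R1 rate shifted and R2 escalation stayed bounded' };
}

// Illustrative numbers only.
const decision = evaluateReplay({
  baselineR1: 0.75,
  replayR1: 0.5,
  baselineR2: 0.25,
  replayR2: 0.5,
});
console.log(decision.action); // keep
```

The R1 check runs first because an unchanged consensus rate makes the escalation question moot.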

Evidence that triggered the proposal

D — S160 cross-vendor judging principle: judges from different vendors make less correlated errors
E — Engine traces: meta_argument uses Sonnet (case_for) + Gemini (attack); meta_council_verdict R1 uses Sonnet + Gemini → vendor overlap on both judges
E — Cost rollup: meta_council_verdict ~$0.12 over 19 runs; Codex calls are headless-CLI free at point of use per S159
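The vendor overlap noted in the engine-trace evidence can be checked mechanically. The sketch below takes the writer assignments (Sonnet for case_for, Gemini for attack) from that evidence. The constant and function names are hypothetical.

```javascript
// Which model writes each side of the argument, per the engine traces.
const ARGUMENT_WRITERS = { case_for: 'sonnet', attack: 'gemini' };

// Judges that also authored part of the argument they are judging.
function overlappingJudges(judges, writers = ARGUMENT_WRITERS) {
  const writerModels = new Set(Object.values(writers));
  return judges.filter((judge) => writerModels.has(judge));
}

console.log(overlappingJudges(['sonnet', 'gemini'])); // [ 'sonnet', 'gemini' ]
console.log(overlappingJudges(['codex', 'gemini']));  // [ 'gemini' ]
```

Under the current config both judges overlap with an argument writer. Under the proposed config only the Gemini judge does.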

Proposer self-score

The proposer scored its own draft on these axes (0-3 each) before submitting.

Axis	Score
specificity	3
falsifier	3
solo feasible	3
blast radius	2
composability	2
reversibility	3

Disposition

Rejected at the council verdict. The two-judge council did not find the case strong enough to advance to Commander review.

Evaluation history

When	Move
2026-06-13 04:20	meta_council_verdict
2026-06-13 04:13	meta_argument
2026-06-13 04:07	meta_filter_score
2026-06-13 04:04	meta_genesis