
Wrap council judges with per-judge consistency re-draw probe

council rejected · HARNESS · reversible: medium · 6h · proposed 12 Jun 2026
What is the proposed change?
In callJudge() in council_verdict.js (around lines 68-82), add an optional second invocation of the same judge. The second draw uses the same system and user prompts, with a paraphrased one-line instruction appended to the user prompt (e.g. 'Read carefully; produce your verdict JSON only.' vs 'Be honest; produce the JSON only.'). Parse both responses and compare verdict_action across the two draws. Persist a per-judge consistency flag on the round1 output blob: { stable: true | false, draw_a_action, draw_b_action }. The Round-1 tally counts an unstable judge at 0.5 weight; this does not break unanimity unless one of the two draws disagrees with the majority. Gate the second draw behind opts.probe=true so that existing runs are unchanged unless the flag is set. Add a --probe CLI flag to the run_tick.sh wrapper around council_verdict.
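The change described above could be sketched roughly as follows. This is a minimal illustration, not the actual council_verdict.js code: `invoke`, `parseAction`, `consistencyFlag`, `callJudgeWithProbe` and `PROBE_LINE` are hypothetical names, and the sketch assumes that draw A sends the user prompt unmodified (so non-probe runs stay unchanged) and that an unparseable response counts as unstable.

```javascript
// Hypothetical sketch. Only verdict_action, opts.probe, stable, draw_a_action
// and draw_b_action come from the proposal; every other name is illustrative.

// Assumed paraphrase line appended to the user prompt for the second draw.
const PROBE_LINE = 'Be honest; produce the JSON only.';

// Extract verdict_action from a raw judge response; null if it cannot be parsed.
function parseAction(raw) {
  try {
    const parsed = JSON.parse(raw);
    return typeof parsed.verdict_action === 'string' ? parsed.verdict_action : null;
  } catch {
    return null;
  }
}

// Build the per-judge consistency flag from two raw responses.
// Assumption: a draw that fails to parse makes the judge unstable.
function consistencyFlag(rawA, rawB) {
  const a = parseAction(rawA);
  const b = parseAction(rawB);
  return { stable: a !== null && a === b, draw_a_action: a, draw_b_action: b };
}

// `invoke(system, user)` stands in for the real model call and must return
// a Promise resolving to the raw response text.
async function callJudgeWithProbe(invoke, system, user, opts = {}) {
  const rawA = await invoke(system, user);
  const result = { raw: rawA, action: parseAction(rawA) };
  if (!opts.probe) {
    // Probe disabled: single draw, no consistency flag attached.
    return result;
  }
  const rawB = await invoke(system, `${user}\n${PROBE_LINE}`);
  result.consistency = consistencyFlag(rawA, rawB);
  return result;
}
```

With opts.probe unset, the wrapper makes exactly one call and returns no consistency field, which is one way to keep existing runs unaffected.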
Target files
hypothesis_engine/moves/council_verdict.js
Expected effect
On 20 historical hypotheses re-run with the probe, we expect 25-40% of Round-1 judges to produce a different verdict_action on the paraphrased second draw. This shows which 'unanimous 3-0 KILL' or 'unanimous STRONG_BUILD' verdicts are actually fragile. Probe cost: roughly +33% of council_verdict spend (R1 only). Net effect: fewer false-confidence escalations to Commander, because unstable unanimities are downgraded to GATHER_MORE_SIGNAL.
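One possible reading of the weighted tally and the downgrade rule is sketched below. The function name `tallyRound1`, the input shape, and the `fragile` field are assumptions made for illustration; in particular, the sketch assumes that a unanimous first-draw result is downgraded whenever any judge is flagged unstable.

```javascript
// Hypothetical sketch of a Round-1 tally with consistency weighting.
// Each judge is { action, consistency? }, where consistency.stable === false
// marks a judge whose two draws disagreed. The input shape is an assumption.
function tallyRound1(judges) {
  if (judges.length === 0) throw new Error('no judges to tally');

  const isUnstable = (j) => Boolean(j.consistency) && j.consistency.stable === false;

  // Unstable judges count at 0.5 weight, stable (or unprobed) judges at 1.
  const weights = {};
  for (const j of judges) {
    weights[j.action] = (weights[j.action] || 0) + (isUnstable(j) ? 0.5 : 1);
  }

  // Pick the action with the highest total weight.
  const actions = Object.keys(weights);
  let top = actions[0];
  for (const a of actions) {
    if (weights[a] > weights[top]) top = a;
  }

  const unanimous = actions.length === 1;
  // Assumed rule: unanimity that rests on an unstable judge is fragile.
  const fragile = unanimous && judges.some(isUnstable);

  return {
    action: fragile ? 'GATHER_MORE_SIGNAL' : top,
    unanimous,
    fragile,
    weights,
  };
}
```

Under these assumptions, a 3-0 KILL in which one judge flipped on the second draw would come out as GATHER_MORE_SIGNAL with a KILL weight of 2.5, while a 3-0 KILL from three stable judges is left as KILL.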
Falsifier — what would prove this wrong?
Re-run the probe on a sample of 20 council verdicts from the last 60 days. If fewer than 5% of judges flip verdict_action between draw A and draw B, the harness is finding no real instability and the cost is not justified. In that case, remove the harness.
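The falsifier check reduces to a flip-rate computation over the persisted consistency flags. A minimal sketch, assuming the flags have already been collected into an array (the names `flipRate` and `falsifierCheck` are hypothetical):

```javascript
// Hypothetical sketch: compute the share of judges whose verdict_action
// differed between draw A and draw B, then apply the 5% falsifier threshold.
function flipRate(flags) {
  if (flags.length === 0) throw new Error('no judge draws to evaluate');
  const flips = flags.filter((f) => f.draw_a_action !== f.draw_b_action).length;
  return flips / flags.length;
}

// keepHarness is false when fewer than `threshold` of judges flipped.
function falsifierCheck(flags, threshold = 0.05) {
  const rate = flipRate(flags);
  return { rate, keepHarness: rate >= threshold };
}
```

With 20 verdicts and three Round-1 judges each, the sample holds 60 judge pairs, so two or fewer flips (about 3.3%) would trigger removal and three or more would not.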
Evidence that triggered the proposal
  • T — brain/proposals/digest-2026-06-11-001.json — LongJudgeBench: LLM judges unstable on long-form across scenarios; rubrics/references helpful but not sufficient (arxiv 2606.01629)
  • D — hypothesis_engine/moves/council_verdict.js — three single-shot judges read ~9 deep reports each (long-form input), no per-judge variance check

Proposer self-score

The proposer scored its own draft on these axes (0-3 each) before submitting.

Axis            Score
specificity     3
falsifier       3
solo feasible   3
blast radius    2
composability   2
reversibility   2

Disposition
Rejected at the council verdict. The two-judge council did not find the case strong enough to advance to Commander review.

Evaluation history

When               Move
2026-06-12 05:25   meta_council_verdict
2026-06-12 05:04   meta_argument
2026-06-12 04:44   meta_filter_score
2026-06-12 04:05   meta_genesis