
Add cross-vendor judge disagreement harness around council_verdict

filter rejected · HARNESS · reversible: simple · 4h · proposed 7 Jun 2026
What is the proposed change?
Wrap council_verdict in a harness that compares the Sonnet 4.6 judge's per-axis scores and verdict against those of the Codex gpt-5.5 judge. When (a) any axis score differs by more than 1 point OR (b) the two council votes diverge, append a row to data/judge_disagreements.jsonl with {hypothesis_id, axis, sonnet_score, sonnet_reasoning, codex_score, codex_reasoning, final_verdict, ground_truth_after_30d}. The harness does NOT change the verdict; it is purely observational.
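A minimal sketch of the trigger logic described above. It is not the actual judge_disagreement.js: the judge-output shape ({ verdict, axes }) and the function names are assumptions made for illustration. It also assumes that a vote divergence logs one row per axis, and that ground_truth_after_30d is written as null and backfilled later, since the proposal does not specify either.

```javascript
// Sketch only. Assumed (hypothetical) judge output shape:
//   { verdict: "APPROVE", axes: { a3_reachability: { score: 2, reasoning: "..." } } }
const AXIS_THRESHOLD = 1; // trigger (a): |sonnet - codex| strictly greater than 1 point

function findDisagreements(hypothesisId, sonnet, codex, finalVerdict) {
  // Trigger (b): the two judges cast different votes.
  const votesDiverge = sonnet.verdict !== codex.verdict;
  const rows = [];
  for (const axis of Object.keys(sonnet.axes)) {
    const s = sonnet.axes[axis];
    const c = codex.axes[axis];
    if (!c) continue; // axis missing from one judge: skip rather than guess
    const axisDiffers = Math.abs(s.score - c.score) > AXIS_THRESHOLD;
    if (axisDiffers || votesDiverge) {
      rows.push({
        hypothesis_id: hypothesisId,
        axis,
        sonnet_score: s.score,
        sonnet_reasoning: s.reasoning,
        codex_score: c.score,
        codex_reasoning: c.reasoning,
        final_verdict: finalVerdict, // passed through unchanged: observational only
        ground_truth_after_30d: null, // assumption: unknown at log time, backfilled later
      });
    }
  }
  return rows;
}

// Serialize rows as JSON Lines: one JSON object per line, newline-terminated.
function toJsonl(rows) {
  return rows.map((row) => JSON.stringify(row) + "\n").join("");
}
```

The resulting string could be appended to data/judge_disagreements.jsonl with Node's fs.appendFileSync. Keeping detection as a pure function means the harness can be called after council_verdict returns without touching the verdict itself.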
Target files
hypothesis_engine/harnesses/judge_disagreement.js hypothesis_engine/moves/council_verdict.js
Expected effect
Within 50 disagreement rows, a calibration pattern is visible: e.g. 'Codex scores a3_reachability systematically 1 point lower than Sonnet on B2B-shape candidates'. This pattern enables targeted judge-prompt tuning.
Falsifier — what would prove this wrong?
After 50 logged disagreements, compute the sign and magnitude of the bias axis by axis. If no axis shows |mean Δ| > 0.5 AND no source-corpus subgroup shows systematic skew, the disagreements are random noise: the judges are independently calibrated and the harness yields no calibration signal worth keeping.
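The axis-by-axis part of this check could be sketched as follows. It assumes rows shaped like the data/judge_disagreements.jsonl entries described in the proposal and defines Δ as codex_score minus sonnet_score. The function names are hypothetical, and the subgroup test is left open because the proposal does not define "systematic skew".

```javascript
// Sketch only: function names and the sign convention for delta are assumptions.
const BIAS_THRESHOLD = 0.5; // from the falsifier: |mean delta| > 0.5 counts as bias

// Mean of (codex_score - sonnet_score), grouped by whatever key keyFn returns.
function meanDeltaBy(rows, keyFn) {
  const acc = new Map();
  for (const row of rows) {
    const key = keyFn(row);
    const entry = acc.get(key) || { sum: 0, n: 0 };
    entry.sum += row.codex_score - row.sonnet_score;
    entry.n += 1;
    acc.set(key, entry);
  }
  const means = {};
  for (const [key, { sum, n }] of acc) means[key] = sum / n;
  return means;
}

// Per-axis mean delta plus the list of axes whose bias exceeds the threshold.
function axisBias(rows) {
  const byAxis = meanDeltaBy(rows, (row) => row.axis);
  const biasedAxes = Object.keys(byAxis).filter(
    (axis) => Math.abs(byAxis[axis]) > BIAS_THRESHOLD
  );
  return { byAxis, biasedAxes };
}
```

An empty biasedAxes satisfies only the first half of the falsifier. The subgroup half could reuse meanDeltaBy with a different key function once rows carry a subgroup label, which the proposed row schema does not yet include. Because rows are logged only when the judges already disagree, these means are conditional on disagreement and may overstate the bias across all verdicts.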
Evidence that triggered the proposal
  • D — cross-vendor judging principle (Sonnet proposer + Codex judge)
  • D — S183 forecaster code-review archive: Codex SUGGEST_REVISION vs Sonnet APPROVE divergence
  • D — standing rule #30 (Codex code review for non-trivial changes)

Proposer self-score

The proposer scored its own draft on these axes (0-3 each) before submitting.

Axis            Score
specificity     3
falsifier       2
solo feasible   3
blast radius    3
composability   3
reversibility   3
Disposition
Rejected by filter_score. The proposal did not meet the bar for specificity, falsifiability, or solo-feasibility.

Evaluation history

When              Move
2026-06-12 04:24  meta_filter_score
2026-06-07 04:04  meta_genesis