Add cross-vendor judge disagreement harness around council_verdict

filter rejected | HARNESS | reversible: simple | 4h | proposed 7 Jun 2026

What is the proposed change?

Wrap council_verdict in a harness that compares the Sonnet 4.6 judge's per-axis scores and verdict against those of the Codex gpt-5.5 judge. When (a) any axis differs by more than 1 point OR (b) the council votes diverge, append a row to data/judge_disagreements.jsonl with {hypothesis_id, axis, sonnet_score, sonnet_reasoning, codex_score, codex_reasoning, final_verdict, ground_truth_after_30d}. The harness does NOT change the verdict; it is purely observational.
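
A minimal sketch of the logging rule described above, not the actual harness. The judge result shape ({ verdict, axes: { name: { score, reasoning } } }) and the function names findDisagreements and appendRows are hypothetical. The sketch also assumes that "council votes diverge" means the two judges' verdicts differ, that a verdict divergence logs one row per axis, and that ground_truth_after_30d is written as null and filled in later.

```javascript
const fs = require("node:fs");

// Log an axis when the two scores differ by MORE than this many points.
const AXIS_DELTA_THRESHOLD = 1;

// Hypothetical input shape (an assumption, not taken from council_verdict.js):
//   { verdict: "APPROVE", axes: { a3_reachability: { score: 2, reasoning: "..." } } }
function findDisagreements(hypothesisId, sonnet, codex, finalVerdict) {
  const verdictsDiverge = sonnet.verdict !== codex.verdict;
  const rows = [];
  for (const axis of Object.keys(sonnet.axes)) {
    const s = sonnet.axes[axis];
    const c = codex.axes[axis];
    if (!c) continue; // axis scored by only one judge: skipped in this sketch
    const axisDiverges = Math.abs(s.score - c.score) > AXIS_DELTA_THRESHOLD;
    if (axisDiverges || verdictsDiverge) {
      rows.push({
        hypothesis_id: hypothesisId,
        axis,
        sonnet_score: s.score,
        sonnet_reasoning: s.reasoning,
        codex_score: c.score,
        codex_reasoning: c.reasoning,
        final_verdict: finalVerdict,
        ground_truth_after_30d: null, // assumption: unknown at logging time
      });
    }
  }
  return rows;
}

// Append rows as JSON Lines. Never touches the verdict itself.
function appendRows(path, rows) {
  if (rows.length === 0) return;
  fs.appendFileSync(path, rows.map((r) => JSON.stringify(r) + "\n").join(""));
}
```

Keeping detection (findDisagreements) separate from I/O (appendRows) lets council_verdict call the harness after its verdict is final, so the harness stays observational and can be removed without touching the verdict path.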

Target files

hypothesis_engine/harnesses/judge_disagreement.js
hypothesis_engine/moves/council_verdict.js

Expected effect

Within 50 disagreement rows, a calibration pattern should become visible, e.g. 'Codex scores a3_reachability systematically 1 point lower than Sonnet on B2B-shape candidates'. Such a pattern enables targeted judge-prompt tuning.

Falsifier — what would prove this wrong?

After 50 logged disagreements, compute the sign and magnitude of the bias for each axis. If no axis shows |mean Δ| > 0.5 AND no source-corpus subgroup shows systematic skew, the disagreements are random noise: the judges are independently calibrated and the harness yields no calibration signal worth keeping.
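
The per-axis half of this check could be computed roughly as follows. This is a sketch under stated assumptions: Δ is taken as codex_score minus sonnet_score (so "Codex scores lower" gives a negative mean), and the function names axisBias and hasCalibrationSignal are hypothetical. The source-corpus subgroup check is omitted because the row schema above carries no corpus field.

```javascript
// Mean signed delta per axis over logged rows.
// Sign convention (an assumption): delta = codex_score - sonnet_score.
function axisBias(rows) {
  const sums = new Map();
  for (const r of rows) {
    const acc = sums.get(r.axis) ?? { n: 0, total: 0 };
    acc.n += 1;
    acc.total += r.codex_score - r.sonnet_score;
    sums.set(r.axis, acc);
  }
  const out = {};
  for (const [axis, { n, total }] of sums) {
    out[axis] = { n, meanDelta: total / n };
  }
  return out;
}

// True if any axis exceeds the falsifier's |mean delta| threshold.
function hasCalibrationSignal(bias, threshold = 0.5) {
  return Object.values(bias).some((b) => Math.abs(b.meanDelta) > threshold);
}
```

For example, feeding the parsed lines of data/judge_disagreements.jsonl to axisBias and then to hasCalibrationSignal returns false only when every axis stays within the 0.5 band, which is the per-axis condition of the falsifier.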

Evidence that triggered the proposal

D — cross-vendor judging principle (Sonnet proposer + Codex judge)
D — S183 forecaster code-review archive: Codex SUGGEST_REVISION vs Sonnet APPROVE divergence
D — standing rule #30 (Codex code review for non-trivial changes)

Proposer self-score

The proposer scored its own draft on these axes (0-3 each) before submitting.

Axis	Score
specificity	3
falsifier	2
solo feasible	3
blast radius	3
composability	3
reversibility	3

Disposition

Rejected by filter_score. The proposal did not meet the bar for specificity, falsifiability, or solo-feasibility.

Evaluation history

When	Move
2026-06-12 04:24	meta_filter_score
2026-06-07 04:04	meta_genesis