
Add retry-with-vendor-fallback harness around filter_score judge call

filter rejected · HARNESS · reversible: simple · 4h · proposed 11 Jun 2026
What is the proposed change?
Wrap the filter_score judge call (currently callCodexGpt55) in a 3-attempt harness. Attempt 1 uses the primary judge; on timeout, 5xx, or parse failure, attempt 2 falls back to Gemini 3.1 Pro, and attempt 3 to Grok-4. Each attempt has a 45s timeout. Record which vendor produced the final score in the engine.db column `judge_vendor_used`. If all three attempts fail, kill with reason='judge_unavailable'. The harness is a single async function of ~40 lines.
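A minimal sketch of the harness described above, under stated assumptions. The names `withTimeout`, `parseScore`, and `judgeWithFallback` are hypothetical, the judge is assumed to return a JSON string with a numeric `score` field, and the vendor callers are injected because the real signatures in hypothesis_engine/llm.js are not shown here. As a simplification, any thrown error advances to the next vendor, not only timeout/5xx/parse-fail.

```javascript
// Hypothetical sketch, not the engine's actual code.
const ATTEMPT_TIMEOUT_MS = 45_000; // 45s per attempt, as in the proposal

// Reject if `promise` has not settled within `ms` milliseconds.
// Note: this does not cancel the underlying request, it only stops waiting.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timeout after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Assumed judge output shape: a JSON string such as {"score": 2}.
function parseScore(raw) {
  const parsed = JSON.parse(raw); // throws on malformed output (parse-fail)
  if (typeof parsed.score !== 'number') {
    throw new Error('missing numeric score');
  }
  return parsed.score;
}

// `judges` is an ordered list of { vendor, call }, primary first.
async function judgeWithFallback(prompt, judges, timeoutMs = ATTEMPT_TIMEOUT_MS) {
  const errors = [];
  for (const { vendor, call } of judges) {
    try {
      const raw = await withTimeout(call(prompt), timeoutMs);
      return { ok: true, score: parseScore(raw), judge_vendor_used: vendor };
    } catch (err) {
      // Simplification: every failure type falls through to the next vendor.
      errors.push(`${vendor}: ${err.message}`);
    }
  }
  // All attempts failed: the caller kills with this reason.
  return { ok: false, reason: 'judge_unavailable', errors };
}
```

In filter_score.js the `judges` list would presumably hold callCodexGpt55 first, followed by the Gemini and Grok callers from llm.js, and the returned `judge_vendor_used` value would be written to engine.db alongside the score.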
Target files
hypothesis_engine/moves/filter_score.js hypothesis_engine/llm.js
Expected effect
The share of cycles aborted by transient judge failures drops from the current ~5-8% (per the E corpus) to <1%. The distribution of `judge_vendor_used` across 100 cycles reveals which vendor is least reliable.
Falsifier — what would prove this wrong?
After 100 cycles, if `judge_vendor_used` shows that the primary (Codex) produced >95% of final scores, the harness adds no signal: the primary is reliable and the fallback is dead code. Acceptable failure mode: keep the harness only if the fallback fires ≥3% of the time.
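The falsifier check could be computed roughly as follows. This is a sketch under assumptions: `evaluateHarness` is a hypothetical name, `rows` stands in for the result of a query over engine.db, the vendor label 'codex' is illustrative, and cycles killed with judge_unavailable are assumed to have a null `judge_vendor_used`. The two thresholds from the proposal are reported as separate flags rather than merged into one verdict.

```javascript
// Hypothetical sketch: summarize judge_vendor_used over a batch of cycles.
function evaluateHarness(rows, primary = 'codex') {
  const counts = {};
  let aborted = 0;
  for (const row of rows) {
    const vendor = row.judge_vendor_used;
    if (vendor == null) {
      aborted += 1; // assumed to mean all vendors failed (judge_unavailable)
    } else {
      counts[vendor] = (counts[vendor] ?? 0) + 1;
    }
  }
  const total = rows.length;
  const primaryCount = counts[primary] ?? 0;
  const fallbackCount = total - aborted - primaryCount;
  // Shares are taken over all cycles in the batch, including aborted ones.
  const primaryShare = total ? primaryCount / total : 0;
  const fallbackRate = total ? fallbackCount / total : 0;
  return {
    counts,
    aborted,
    primaryShare,
    fallbackRate,
    falsified: primaryShare > 0.95,    // ">95% primary" condition
    keepHarness: fallbackRate >= 0.03, // "fallback fires >=3%" condition
  };
}
```

Per-vendor entries in `counts` give the distribution the expected-effect section refers to, and `aborted / rows.length` gives the abort rate to compare against the <1% target.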
Evidence that triggered the proposal
  • E — engine kill-reason distribution shows judge-call failures in top 5
  • D — hypothesis_engine/llm.js (existing multi-vendor callers ready to compose)

Proposer self-score

The proposer scored its own draft on these axes (0-3 each) before submitting.

Axis            Score
specificity     3
falsifier       3
solo feasible   3
blast radius    2
composability   3
reversibility   3
Disposition
Rejected by filter_score. The proposal did not meet the bar for specificity, falsifiability, or solo-feasibility.

Evaluation history

When               Move
2026-06-12 04:41   meta_filter_score
2026-06-11 04:03   meta_genesis