
Add A/B experiment runner and results tracker (#94/#59)#227

Open
neoneye wants to merge 1 commit into feature/94-task-output-scorer from feature/94-experiment-runner

Conversation


@neoneye neoneye commented Mar 9, 2026

Summary

  • Adds scoring/experiment_config.py — configuration dataclass for A/B experiments
  • Adds scoring/experiment_runner.py — runs baseline vs candidate system prompts on task functions, scores both, compares with configurable threshold
  • Adds scoring/results_tracker.py — JSONL append-only results log with summary/filtering
  • Adds scoring/run_experiment.py — CLI entry point for running experiments
  • Depends on PR #226, "Add LLM-as-judge task output scorer (#94/#59 foundation)" (the task output scorer)

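The configuration and comparison pieces described above could look roughly like the following sketch. All field and class names here are assumptions for illustration, not the actual contents of scoring/experiment_config.py:

```python
from dataclasses import dataclass

# Hypothetical configuration dataclass for an A/B prompt experiment.
# Field names are guesses at what such a config might hold.
@dataclass
class ExperimentConfig:
    experiment_name: str            # identifier recorded in the results log
    baseline_prompt: str            # current production system prompt
    candidate_prompt: str           # prompt variant under test
    task_name: str                  # key into the runner's task registry
    score_threshold: float = 0.05   # minimum score delta to declare a winner
    num_runs: int = 1               # repetitions per prompt variant

config = ExperimentConfig(
    experiment_name="summary-tone-v2",
    baseline_prompt="You are a concise assistant.",
    candidate_prompt="You are a concise, friendly assistant.",
    task_name="summarize",
)
print(config.score_threshold)  # 0.05
```

A dataclass keeps the config serializable and easy to log alongside each result row.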
Test plan

  • Run ast.parse() on all new files to confirm they parse
  • Verify imports succeed in project venv
  • Verify _compute_summary() correctly classifies keep/discard/inconclusive
  • ResultsTracker self-test passes (append, load, recent, summary)
  • Manual end-to-end test with a real LLM

🤖 Generated with Claude Code

Minimal experiment infrastructure for prompt optimization (#94) and
A/B testing promotion (#59). Runs baseline vs candidate system prompts
on a task function, scores both outputs with the task output scorer,
and logs results to a JSONL tracker.

Includes experiment config, runner with task registry, results tracker,
and CLI entry point.
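An append-only JSONL tracker like the one described is typically one JSON object per line, written in append mode. A minimal sketch, assuming record and method names (this is not the actual scoring/results_tracker.py):

```python
import json
from pathlib import Path

class ResultsTracker:
    """Hypothetical append-only JSONL log of experiment results."""

    def __init__(self, path: str = "results.jsonl"):
        self.path = Path(path)

    def append(self, record: dict) -> None:
        # Append-only: each result becomes one JSON line; never rewritten.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def load(self) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def recent(self, n: int = 10) -> list[dict]:
        return self.load()[-n:]

    def summary(self) -> dict[str, int]:
        # Tally verdicts across all logged results.
        counts: dict[str, int] = {}
        for rec in self.load():
            verdict = rec.get("verdict", "unknown")
            counts[verdict] = counts.get(verdict, 0) + 1
        return counts
```

JSONL suits this use because appends are atomic per line and partial reads never require parsing the whole file as one document.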

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>