Add dual-segment progress bar for in-progress vs completed rollouts #901

Open
mikasenghaas wants to merge 11 commits into main from feat/dual-progress-bar

@mikasenghaas commented Feb 12, 2026

Summary

  • Adds a dual-segment progress bar that visually distinguishes between completed rollouts (green) and actively executing rollouts (amber), with remaining rollouts shown in dark gray
  • Adds an on_acquire callback to with_sem() that fires when a rollout acquires the semaphore and starts executing, enabling accurate tracking of in-flight work
  • Wires a new RolloutStartCallback through the generate() → evaluate() → run_evaluation() callback chain so the display knows how many rollouts are currently active
  • Shows an "active" count in the progress text (e.g., (5/100 rollouts, 8 active)) when rollouts are in flight

How it works

The bar has three visual segments:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  completed (green)   active (amber)   remaining (gray)

in_progress = started - completed, where started is incremented each time a rollout acquires the concurrency semaphore and completed comes from the existing on_progress callback.
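
A minimal sketch of the mechanism described above, assuming with_sem() takes a semaphore, an awaitable, and an optional zero-argument on_acquire callback. The real signature in verifiers/utils/async_utils.py may differ, and demo() and its counters are hypothetical names used only for illustration:

```python
from __future__ import annotations

import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def with_sem(
    sem: asyncio.Semaphore,
    coro: Awaitable[T],
    on_acquire: Callable[[], None] | None = None,  # assumed signature
) -> T:
    """Await `coro` under `sem`, calling `on_acquire` once the slot is held."""
    async with sem:
        if on_acquire is not None:
            on_acquire()
        return await coro


async def demo() -> tuple[int, int, int]:
    """Run 5 fake rollouts with concurrency 2; return (started, completed, peak)."""
    started = completed = peak = 0
    sem = asyncio.Semaphore(2)

    def on_acquire() -> None:
        nonlocal started, peak
        started += 1
        # in_progress = started - completed, sampled at each acquisition
        peak = max(peak, started - completed)

    async def rollout() -> None:
        nonlocal completed
        await asyncio.sleep(0.01)  # stand-in for real rollout work
        completed += 1

    await asyncio.gather(*(with_sem(sem, rollout(), on_acquire) for _ in range(5)))
    return started, completed, peak


print(asyncio.run(demo()))
```

Because the callback fires only after the semaphore is held, queued rollouts that are still waiting for a slot are not counted as active.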

Files changed

| File | Change |
| --- | --- |
| verifiers/utils/async_utils.py | Add optional on_acquire callback to with_sem() |
| verifiers/types.py | Add RolloutStartCallback type alias |
| verifiers/envs/environment.py | Wire on_rollout_start through generate() and evaluate() |
| verifiers/utils/eval_utils.py | Wire on_rollout_start through run_evaluation() and run_with_progress() |
| verifiers/utils/eval_display.py | Add DualBarColumn / _DualBar renderable, track started count in EnvEvalState |

Test plan

  • All existing tests pass (uv run pytest tests/ - 608 tests)
  • uv run pre-commit run --all-files passes (ruff check + format)
  • Manual verification: run prime eval run with a real environment and observe the dual-segment bar during execution
  • Verify the bar correctly shows active rollouts ramping up when the evaluation starts, and draining to zero as it finishes

Generated with Claude Code


Note

Medium Risk
Changes the public callback surface (on_progress -> on_task_done plus new on_task_start) and adds semaphore-acquire hooks, which could affect integrations and progress accounting under concurrency/resume scenarios.

Overview
Adds rollout in-flight tracking and a new Rich UI to visualize it: the evaluation display now renders a three-segment progress bar (completed vs running vs remaining) and maintains a started counter to compute in_progress.

Refactors the generation/evaluation callback API by replacing on_progress with on_task_done and introducing on_task_start; Environment.generate() now triggers on_task_start when a task acquires the concurrency semaphore via a new with_sem(..., on_acquire=...) hook, and this signal is threaded through evaluate(), run_evaluation(), and the GEPA adapter.

Written by Cursor Bugbot for commit ad1f673. This will update automatically on new commits.

The progress bar now visually distinguishes between completed rollouts (green)
and actively executing rollouts (amber), making it easy to see concurrency at
a glance. The text also shows an "active" count when rollouts are in flight.

Changes:
- Add on_acquire callback to with_sem for tracking when rollouts start executing
- Add RolloutStartCallback type and wire it through generate/evaluate/run_evaluation
- Replace BarColumn with custom DualBarColumn showing three segments:
  completed (green), in-progress (amber), remaining (gray)
- Track started count in EnvEvalState to derive in-progress = started - completed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mikasenghaas and others added 2 commits February 12, 2026 16:30
In grouped scoring, each task covers multiple rollouts (rollouts_per_example).
The on_acquire callback was incrementing started_count by 1 per group, but
progress counts individual rollouts, causing in_progress = started - progress
to be wrong. Now uses _make_on_acquire(len(group_input)) so started_count
tracks rollouts consistently. Also removes "active" text from progress bar.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
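
One way to picture the grouped-scoring fix, as a hedged sketch: the commit names _make_on_acquire(len(group_input)) but does not show its body, so the factory below (make_on_acquire, StartedCounter) is a hypothetical reconstruction rather than the PR's code:

```python
from collections.abc import Callable


class StartedCounter:
    """Hypothetical holder for the running started_count."""

    def __init__(self) -> None:
        self.started_count = 0


def make_on_acquire(counter: StartedCounter, num_rollouts: int) -> Callable[[], None]:
    """Return a zero-argument callback that credits `num_rollouts` at once."""

    def on_acquire() -> None:
        counter.started_count += num_rollouts

    return on_acquire


counter = StartedCounter()
# Two groups with rollouts_per_example = 4 (illustrative data).
groups = [["r0", "r1", "r2", "r3"], ["r4", "r5", "r6", "r7"]]
for group_input in groups:
    # Calling the callback here simulates the group acquiring the semaphore.
    make_on_acquire(counter, len(group_input))()

print(counter.started_count)
```

Counting by group size keeps started_count in the same unit (individual rollouts) as the progress count it is subtracted from.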
Initialize started_count to len(builder.outputs) so it matches the
resumed progress count. Without this, in_progress = started - progress
is clamped to 0 until started_count catches up to the resumed count,
making the amber bar segment invisible for resumed evals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
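
The effect of seeding started_count on resume can be checked with plain arithmetic. This is an illustrative sketch only; the numbers and the helper name in_progress() are assumptions, not taken from the PR:

```python
def in_progress(started: int, completed: int) -> int:
    """Rollouts currently executing, clamped so the segment is never negative."""
    return max(0, started - completed)


resumed = 60        # rollouts restored from a previous run (hypothetical number)
newly_started = 8   # rollouts that acquired the semaphore after resuming

# started_count seeded with 0: the difference is negative, so it clamps to 0.
unseeded = in_progress(0 + newly_started, resumed)

# started_count seeded with the resumed count: the 8 active rollouts show up.
seeded = in_progress(resumed + newly_started, resumed)

print(unseeded, seeded)
```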
mikasenghaas and others added 6 commits February 12, 2026 16:54
- _DualBar now subclasses ProgressBar, inheriting __rich_measure__
  and pulse animation instead of reimplementing them
- Render via Segment (like ProgressBar) instead of Text with appends
- Half-char precision using half-bar chars for smoother transitions
- Use Rich theme styles ("bar.complete", "bar.back") as defaults
  instead of hardcoded RGB for completed and remaining segments
- DualBarColumn follows BarColumn's constructor pattern with
  configurable styles
- Handle ASCII fallback and console.no_color like ProgressBar does
- Fix text truncation by not overriding default table_column

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Define COLOR_PENDING/RUNNING/COMPLETED/FAILED as module-level constants
and use them for both panel border styles and progress bar segments.
Completed bar is now green, in-progress is yellow — matching the
border colors for completed and running states respectively.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pass Column(no_wrap=True, ratio=2) as the default table_column,
matching what Rich's BarColumn uses when bar_width=None. Without
ratio, the bar column wouldn't claim proportional space in the
Progress table, resulting in a narrow bar instead of spanning the
terminal width.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
generate() now just fires on_rollout_start(num_rollouts) with the
batch size when rollouts acquire the semaphore — no cumulative state.
The TUI layer (eval_utils.py) owns the started_count accumulator,
initializing it from the resumed count in on_start. This keeps
generate() free of display-specific bookkeeping.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
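
A rough sketch of the split in responsibilities this commit describes, assuming on_rollout_start receives the batch size as a single int. fake_generate() and DisplayState are hypothetical stand-ins, not the actual generate() or EnvEvalState:

```python
from __future__ import annotations

from collections.abc import Callable, Sequence
from dataclasses import dataclass


def fake_generate(
    groups: Sequence[Sequence[str]],
    on_rollout_start: Callable[[int], None] | None = None,
) -> None:
    """Stand-in for generate(): reports each batch size, keeps no running total."""
    for group in groups:
        if on_rollout_start is not None:
            on_rollout_start(len(group))


@dataclass
class DisplayState:
    """Hypothetical display-side state that owns the accumulator."""

    started_count: int = 0

    def on_start(self, resumed: int) -> None:
        # Seed with the rollouts already completed in a previous run.
        self.started_count = resumed

    def on_rollout_start(self, num_rollouts: int) -> None:
        self.started_count += num_rollouts


state = DisplayState()
state.on_start(resumed=10)
fake_generate([["a", "b"], ["c", "d"], ["e", "f"]], state.on_rollout_start)
print(state.started_count)
```

With this shape, a caller that does not render a progress bar can pass no callback at all and the generation side carries no display bookkeeping.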

# Remaining segment
used = c_full + c_half + p_full + p_half
remaining = width - used

Dual-segment bar overflows width by one character

Medium Severity

The _DualBar rendering can overflow the declared width by one character when both the completed and in-progress segments have an odd number of half-chars that sum to total_halves. Each half-char occupies a full terminal cell, so two trailing halves (one per segment) consume 2 cells, but the halves-based clamping only budgets for 1 cell. For example, with width=40, total=100, completed=99, in_progress=1: c_halves=79, p_halves=1 yields used = 39+1+0+1 = 41 > 40. The character-position total needs to be checked and adjusted after divmod, not just the halves sum.
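
The overflow can be reproduced outside Rich with a small sketch. How the PR derives c_halves and p_halves is not shown, so the integer arithmetic below is an assumption chosen to match the numbers in this comment, and segment_cells() and clamp_cells are hypothetical names. It assumes total > 0 and completed + in_progress <= total. Trimming the completed half-cell is only one possible adjustment:

```python
from __future__ import annotations


def segment_cells(
    width: int,
    total: int,
    completed: int,
    in_progress: int,
    clamp_cells: bool = True,
) -> tuple[int, int, int, int, int]:
    """Return (c_full, c_half, p_full, p_half, remaining) in terminal cells."""
    total_halves = width * 2
    c_halves = total_halves * completed // total
    end_halves = total_halves * (completed + in_progress) // total
    p_halves = min(total_halves, end_halves) - c_halves

    c_full, c_half = divmod(c_halves, 2)
    p_full, p_half = divmod(p_halves, 2)

    # Each trailing half-char still occupies a whole cell, so two of them can
    # push the cell count to width + 1. Check cells, not just halves.
    if clamp_cells and c_full + c_half + p_full + p_half > width:
        c_half = 0  # drop the completed half-cell; the active half stays visible

    remaining = width - (c_full + c_half + p_full + p_half)
    return c_full, c_half, p_full, p_half, remaining


print(segment_cells(40, 100, 99, 1, clamp_cells=False))  # remaining goes negative
print(segment_cells(40, 100, 99, 1))
```

Since each segment rounds up by at most half a cell, the cell total can exceed the width by at most one, so a single adjustment after divmod is enough.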

mikasenghaas and others added 2 commits February 12, 2026 17:27
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

- on_progress → on_task_done (fires when a task completes)
- on_rollout_start → on_task_start (fires when a task begins executing)
- ProgressCallback → TaskDoneCallback
- RolloutStartCallback → TaskStartCallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

- on_progress: ProgressCallback | None = None,
+ on_task_done: TaskDoneCallback | None = None,
  on_log: LogCallback | None = None,
+ on_task_start: TaskStartCallback | None = None,

Renamed parameter breaks existing test caller

High Severity

Renaming the on_progress parameter to on_task_done in generate() breaks an existing test in tests/test_environment_extra.py (line 327) that still calls generate(on_progress=no_op). This will raise a TypeError at runtime since on_progress is no longer a valid keyword argument. The refactoring is incomplete — all callers need to be updated to use the new name.
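
The failure mode is ordinary Python keyword handling and can be shown in isolation. The generate() below is a toy function with the renamed parameters, not the real Environment.generate(), and no_op is a placeholder callback:

```python
from __future__ import annotations

from collections.abc import Callable


def generate(
    on_task_done: Callable[[], None] | None = None,
    on_task_start: Callable[[], None] | None = None,
) -> str:
    """Toy stand-in that only mirrors the renamed keyword parameters."""
    return "ok"


def no_op() -> None:
    pass


error = ""
try:
    generate(on_progress=no_op)  # old keyword, as in the stale test call
except TypeError as exc:
    error = str(exc)

result = generate(on_task_done=no_op)  # updated call site
print(error)
print(result)
```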
