
Add prefill-stage concurrency limiter for PD mode#2

Open
james0zan wants to merge 1 commit into main from codex/add-max-prefill-concurrent-requests-parameter

Motivation

  • Introduce a separate concurrency limiter for the prefill stage of PD (prefill+decode) routing, so that prefill overload does not spill over into the decode stage and so that prefill requests can be queued with timeouts.
  • Expose the limiter via configuration and CLI so deployments can tune or disable prefill limiting independently from overall concurrency limiting.

Description

  • Add a max_prefill_concurrent_requests field to RouterConfig (default -1), along with a builder setter, a --max-prefill-concurrent-requests CLI flag, and a check in config/validation.rs that the value is either -1 or greater than 0.
  • Add prefill_rate_limiter: Option<Arc<TokenBucket>> to AppContext, plus a maybe_prefill_rate_limiter method on its builder that initializes the limiter from the config.
  • Extend PDRouter with prefill_rate_limiter and queue_timeout_secs fields, implement a PrefillLimiterGuard helper that safely returns its token either on drop or on explicit release, and acquire/release the limiter around the dual-dispatch logic.
  • Change the dual-dispatch send logic to use pinned futures and tokio::select!, so the prefill token can be released as soon as one of the requests completes (avoiding deadlocks and over-retention of the token).
  • Add unit tests for the limiter guard: test_prefill_limiter_guard_releases_on_drop, test_prefill_limiter_guard_release_now_idempotent, and test_prefill_limiter_guard_drop_unblocks_waiter; update tests/fixtures to account for the new prefill_rate_limiter field on AppContext.
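
A minimal sketch of the configuration rule described above, assuming the limit is stored as an i32 and that the default -1 means no prefill limiter is built. TokenBucket here is a hypothetical placeholder that only records its capacity, and the two functions are illustrative stand-ins for the validation check and the maybe_prefill_rate_limiter builder step, not the gateway's actual code.

```rust
use std::sync::Arc;

/// Hypothetical placeholder for the gateway's TokenBucket; only capacity is modeled.
#[derive(Debug)]
pub struct TokenBucket {
    capacity: usize,
}

impl TokenBucket {
    pub fn new(capacity: usize) -> Self {
        TokenBucket { capacity }
    }

    pub fn capacity(&self) -> usize {
        self.capacity
    }
}

/// Mirrors the rule stated in the PR: the value must be -1 or > 0.
/// (Assumption: the field is an i32.)
pub fn validate_max_prefill_concurrent_requests(value: i32) -> Result<(), String> {
    if value == -1 || value > 0 {
        Ok(())
    } else {
        Err(format!(
            "max_prefill_concurrent_requests must be -1 or > 0, got {value}"
        ))
    }
}

/// Hypothetical analogue of maybe_prefill_rate_limiter: build a limiter only
/// for a positive limit (assumption: -1 means "no prefill limiter").
pub fn maybe_prefill_rate_limiter(value: i32) -> Option<Arc<TokenBucket>> {
    if value > 0 {
        Some(Arc::new(TokenBucket::new(value as usize)))
    } else {
        None
    }
}

fn main() {
    for v in [-1, 0, 8] {
        // Validate first, then build the optional limiter from the accepted value.
        let limiter = validate_max_prefill_concurrent_requests(v)
            .map(|()| maybe_prefill_rate_limiter(v));
        println!("{v}: {limiter:?}");
    }
}
```

Keeping the limiter as an Option lets the dispatch path skip limiting entirely when it is disabled, rather than consulting a bucket with an effectively unbounded capacity.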
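
The guard behavior exercised by the new tests (the token comes back exactly once, whether through an explicit early release or on drop, and dropping the guard wakes a queued waiter) can be sketched with the standard library alone. This assumes a synchronous, Condvar-based counting semaphore as a stand-in for the gateway's async TokenBucket, with acquire_timeout playing the role of queuing bounded by queue_timeout_secs. Apart from PrefillLimiterGuard and release_now (the latter inferred from the test names), all names are hypothetical, and the real helper's fields and signatures may differ.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

/// Hypothetical synchronous stand-in for the async TokenBucket:
/// a counting semaphore with a time-limited, blocking acquire.
pub struct TokenBucket {
    available: Mutex<usize>,
    cond: Condvar,
}

impl TokenBucket {
    pub fn new(capacity: usize) -> Self {
        TokenBucket { available: Mutex::new(capacity), cond: Condvar::new() }
    }

    /// Queue for up to `timeout` waiting for a token; false means the wait timed out.
    pub fn acquire_timeout(&self, timeout: Duration) -> bool {
        let guard = self.available.lock().unwrap();
        let (mut available, result) = self
            .cond
            .wait_timeout_while(guard, timeout, |n| *n == 0)
            .unwrap();
        if result.timed_out() {
            return false;
        }
        *available -= 1;
        true
    }

    /// Return one token and wake a single queued waiter.
    pub fn release(&self) {
        *self.available.lock().unwrap() += 1;
        self.cond.notify_one();
    }

    pub fn available(&self) -> usize {
        *self.available.lock().unwrap()
    }
}

/// Sketch of a guard that returns its token exactly once.
pub struct PrefillLimiterGuard {
    bucket: Option<Arc<TokenBucket>>,
}

impl PrefillLimiterGuard {
    pub fn acquire(bucket: &Arc<TokenBucket>, timeout: Duration) -> Option<Self> {
        if bucket.acquire_timeout(timeout) {
            Some(PrefillLimiterGuard { bucket: Some(Arc::clone(bucket)) })
        } else {
            None
        }
    }

    /// Release early; later calls (and the eventual drop) are no-ops.
    pub fn release_now(&mut self) {
        if let Some(bucket) = self.bucket.take() {
            bucket.release();
        }
    }
}

impl Drop for PrefillLimiterGuard {
    fn drop(&mut self) {
        self.release_now();
    }
}

fn main() {
    let bucket = Arc::new(TokenBucket::new(1));
    let mut guard = PrefillLimiterGuard::acquire(&bucket, Duration::from_millis(10)).unwrap();
    // The only token is held, so a second request times out while queued.
    assert!(PrefillLimiterGuard::acquire(&bucket, Duration::from_millis(10)).is_none());
    guard.release_now();
    guard.release_now(); // idempotent: the token is not returned twice
    drop(guard); // drop after an explicit release does nothing
    assert_eq!(bucket.available(), 1);
}
```

Taking the bucket out of an Option is what makes the release idempotent: whichever path runs first (release_now or Drop) consumes it, and every later call finds None.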
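
The early-release idea behind the tokio::select! change can be illustrated without an async runtime. This sketch swaps in std threads and an mpsc channel as a stand-in for pinned futures and tokio::select!. Both legs run concurrently, the token is returned as soon as the first leg reports completion, and only then does the code wait for the remaining leg. Leg, dual_dispatch, the atomic token counter, and the simulated latencies are all hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Which leg of the (hypothetical) dual dispatch finished.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Leg {
    Prefill,
    Decode,
}

/// Runs both legs concurrently. Each leg reports how many prefill tokens were
/// free when it finished. Results are returned in completion order.
fn dual_dispatch(tokens: &Arc<AtomicUsize>, prefill_ms: u64, decode_ms: u64) -> Vec<(Leg, usize)> {
    tokens.fetch_sub(1, Ordering::SeqCst); // acquire (assumes a token is free)

    let (tx, rx) = mpsc::channel();
    for (leg, ms) in [(Leg::Prefill, prefill_ms), (Leg::Decode, decode_ms)] {
        let (tx, tokens) = (tx.clone(), Arc::clone(tokens));
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(ms)); // simulated request latency
            tx.send((leg, tokens.load(Ordering::SeqCst))).unwrap();
        });
    }
    drop(tx); // only the worker threads hold senders now

    // Analogue of the first select! branch firing: release as soon as either leg is done.
    let first = rx.recv().unwrap();
    tokens.fetch_add(1, Ordering::SeqCst);
    // Then wait for the remaining leg without holding the token.
    let second = rx.recv().unwrap();
    vec![first, second]
}

fn main() {
    let tokens = Arc::new(AtomicUsize::new(1));
    let order = dual_dispatch(&tokens, 10, 300);
    println!("{order:?}");
    assert_eq!(tokens.load(Ordering::SeqCst), 1);
}
```

With the prefill leg finishing well before the decode leg, the decode leg should observe the token already returned. Holding the token until both legs finish would keep it busy for the full decode latency, which is the over-retention the PR describes.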

Testing

  • Ran the gateway package's unit tests, including the new pd_router tests, with cargo test -p sgl-model-gateway; all tests passed.
  • Existing PD router tests were verified to still pass after integrating the limiter and guard changes.

Codex Task