
[WIP] Add perf comparison on non-main branches with run-sweep workflow #880

Open
cquil11 wants to merge 6 commits into main from feat/run-sweep-compare

Conversation

@cquil11
Collaborator

@cquil11 cquil11 commented Mar 6, 2026

Summary

This PR adds automated performance comparison for PR sweep runs and includes a minor test config change.

What it accomplishes

  1. New compare-results job in run-sweep.yml: When a sweep runs on a PR (non-main branch), this job automatically compares the benchmark results from the PR against the most recent baseline results from main. The comparison is rendered as a throughput table in the GitHub Actions Step Summary, showing current vs. baseline tok/s/gpu with delta and percentage change.

  2. New utils/compare_results.py script: Implements the comparison logic:

    • Reads benchmark result JSON artifacts produced by the sweep
    • Queries the Neon PostgreSQL database for the most recent matching baseline result on main (matching by hardware, model, framework, precision, parallelism, ISL/OSL, and concurrency)
    • Computes throughput deltas and formats a markdown table using tabulate
    • Supports both single-node and multi-node (disaggregated) configurations
  3. Test perf-changelog.yaml entry: Adds a test entry for gptoss-fp4-b200-vllm to trigger the sweep workflow on this PR branch.

  4. Config change in nvidia-master.yaml: Comments out the 1k8k and 8k1k search-space entries for gptoss-fp4-b200-vllm (keeping only 1k1k) to reduce the test sweep scope.
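The baseline-lookup and delta logic described in item 2 could be sketched as follows. This is an illustrative sketch only: the real `utils/compare_results.py` queries the Neon PostgreSQL database, whereas here an in-memory sqlite3 table with a hypothetical schema (table `results`, columns for each config dimension, `tokens_per_sec_per_gpu`, `created_at`) stands in for it.

```python
# Sketch of the baseline lookup in utils/compare_results.py.
# The real script talks to Neon PostgreSQL; sqlite3 stands in here,
# and the table/column names are hypothetical.
import sqlite3

# Config dimensions used to match a PR result to a main baseline
# (hardware, model, framework, precision, TP/EP, ISL/OSL, concurrency).
DIMENSIONS = ("hardware", "model", "framework", "precision",
              "tp", "ep", "isl", "osl", "concurrency")

def latest_main_baseline(conn, result):
    """Fetch the most recent main-branch throughput matching every dimension."""
    where = " AND ".join(f"{d} = ?" for d in DIMENSIONS)
    row = conn.execute(
        f"SELECT tokens_per_sec_per_gpu FROM results "
        f"WHERE branch = 'main' AND {where} "
        f"ORDER BY created_at DESC LIMIT 1",
        [result[d] for d in DIMENSIONS],
    ).fetchone()
    return row[0] if row else None

def compare(current_tps, baseline_tps):
    """Return (delta, pct_change) of current vs. baseline tok/s/gpu."""
    delta = current_tps - baseline_tps
    return delta, 100.0 * delta / baseline_tps

# Tiny demo with one seeded main-branch baseline row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (branch TEXT, hardware TEXT, model TEXT, "
    "framework TEXT, precision TEXT, tp INT, ep INT, isl INT, osl INT, "
    "concurrency INT, tokens_per_sec_per_gpu REAL, created_at TEXT)"
)
cfg = dict(hardware="b200", model="gptoss", framework="vllm",
           precision="fp4", tp=1, ep=1, isl=1024, osl=1024, concurrency=8)
conn.execute(
    "INSERT INTO results VALUES ('main', :hardware, :model, :framework, "
    ":precision, :tp, :ep, :isl, :osl, :concurrency, 100.0, '2026-03-01')",
    cfg)
baseline = latest_main_baseline(conn, cfg)
delta, pct = compare(110.0, baseline)
print(f"{delta:+.1f} tok/s/gpu ({pct:+.1f}%)")  # +10.0 tok/s/gpu (+10.0%)
```

In the actual script each PR result JSON would play the role of `cfg`, and the formatted deltas feed the markdown table.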

How it works

  • The compare-results job runs only on pull_request events after collect-results succeeds
  • It downloads the aggregated benchmark artifacts, then runs compare_results.py against the results database
  • For each benchmark result, it looks up the latest matching main branch result by config dimensions (hardware, model, framework, precision, TP/EP, ISL, OSL, concurrency)
  • Outputs a formatted comparison table to $GITHUB_STEP_SUMMARY so reviewers can see performance impact at a glance
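The gating described in the first bullet could look roughly like the fragment below. This is a hypothetical sketch, not the actual run-sweep.yml: the artifact name (`aggregated-benchmark-results`), secret name (`NEON_DATABASE_URL`), and script invocation are all assumptions.

```yaml
# Hypothetical sketch; real job/step/artifact names in run-sweep.yml may differ.
compare-results:
  if: github.event_name == 'pull_request'   # PR sweeps only, not main
  needs: collect-results                    # run after aggregation succeeds
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/download-artifact@v4
      with:
        name: aggregated-benchmark-results  # assumed artifact name
        path: results/
    - name: Compare against main baseline
      env:
        DATABASE_URL: ${{ secrets.NEON_DATABASE_URL }}  # assumed secret name
      run: |
        # Append the markdown comparison table to the Step Summary
        python utils/compare_results.py results/ >> "$GITHUB_STEP_SUMMARY"
```

Writing to `$GITHUB_STEP_SUMMARY` is what surfaces the table in the Actions run page without needing a PR comment.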

@github-actions
Contributor

github-actions bot commented Mar 6, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

2 similar comments

@jgangani
Collaborator

jgangani commented Mar 6, 2026

ACK. @claude can you add a short MR description? What does this MR accomplish, and how?

@Klaud-Cold
Collaborator

Klaud-Cold commented Mar 6, 2026

Claude finished @jgangani's task in 54s.


Adding PR Description

  • Read changed files to understand the PR
  • Write PR description
  • Update PR with description

Done — added a description to the PR covering:

  • What: New compare-results job in run-sweep.yml and utils/compare_results.py script for automated throughput comparison on PR sweeps
  • How: Downloads benchmark artifacts, queries the Neon DB for the latest matching main baseline, computes deltas, and renders a markdown table in the GitHub Actions Step Summary
  • Also: Test perf-changelog entry and scoped-down gptoss-fp4-b200-vllm config for testing

@functionstackx
Contributor

@cquil11 can you add baseline interactivity too, in addition to baseline throughput?

similar to what we have on the dashboard

[attached image: dashboard screenshot]
