
[NVIDIA] Update NVIDIA GPT-OSS vLLM image from v0.15.1 to v0.16.0#800

Open
cquil11 wants to merge 16 commits into main from claude/issue-798-20260226-0534

Conversation


@cquil11 cquil11 commented Feb 26, 2026

Bump vllm/vllm-openai image tag for all 3 NVIDIA GPT-OSS configs (B200, H100, H200). All existing BKC flags preserved — no config changes beyond the image tag.

v0.16.0 notable changes for GPT-OSS/MXFP4:

  • Async scheduling + pipeline parallelism (30.8% throughput improvement)
  • New MXFP4 backends: SM90 FlashInfer BF16, SM100 CUTLASS
  • MoE cold start optimization
  • Triton backend now default non-FlashInfer fallback on SM90/SM100

Closes #798
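
In config terms, the change described above is a one-line image tag bump per SKU. A sketch in diff form (the `image:` key and file layout are illustrative; the actual InferenceX config paths and schema are not shown in this PR):

```diff
-image: vllm/vllm-openai:v0.15.1
+image: vllm/vllm-openai:v0.16.0
```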


cquil11 commented Feb 26, 2026

Completed sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/22429605694

Normal variance +/- 2%

[Screenshots: four CleanShot captures of sweep results, 2026-02-26]
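
The "normal variance +/- 2%" band above can be checked with a trivial helper (hypothetical, not part of InferenceX):

```python
def within_variance(baseline: float, measured: float, band: float = 0.02) -> bool:
    """Return True if `measured` deviates from `baseline` by no more than ±band."""
    return abs(measured - baseline) / baseline <= band

# A run 1.5% below baseline is inside the normal ±2% band
print(within_variance(100.0, 98.5))   # True
# A run 3.1% below baseline is outside it
print(within_variance(100.0, 96.9))   # False
```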

@functionstackx functionstackx left a comment


LGTM

@functionstackx

gonna merge this soon

@kedarpotdar-nv

Looks like a small perf regression on B200 1k/1k; @ankursingh-nv is investigating


functionstackx commented Mar 1, 2026

v0.17 is coming out Wednesday; probably going to merge this v0.16 in soon before then since we're doing best effort on GPT-OSS


jgangani commented Mar 2, 2026

@functionstackx @ankursingh-nv, Should we then just wait for 0.17 to land and update this PR before merging?

@ankursingh-nv

In general, we should have the version that results in the best performance today.
We are investigating it, but in the meantime, if v0.17 is released and the out-of-the-box performance is good, we can skip v0.16


cquil11 commented Mar 5, 2026

@ankursingh-nv in general though, we think it's useful to update images as they are released (even if perf is not improved) for posterity and to track perf across all images publicly

fwiw, it appears the "regression" in this PR is just natural variance

github-actions bot and others added 11 commits March 6, 2026 16:59
Bump vllm/vllm-openai image tag for all 3 NVIDIA GPT-OSS configs
(B200, H100, H200). All existing BKC flags preserved — no config
changes beyond the image tag.

v0.16.0 notable changes for GPT-OSS/MXFP4:
- Async scheduling + pipeline parallelism (30.8% throughput improvement)
- New MXFP4 backends: SM90 FlashInfer BF16, SM100 CUTLASS
- MoE cold start optimization
- Triton backend now default non-FlashInfer fallback on SM90/SM100

Closes #798

Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
Removed outdated configuration entries and added new vLLM image update details for NVIDIA GPT-OSS. Updated pull request links for changes.
@cquil11 cquil11 force-pushed the claude/issue-798-20260226-0534 branch from e7264f5 to 4ca13fc Compare March 6, 2026 22:59


Development

Successfully merging this pull request may close these issues.

[NVIDIA] update H100, H200, B200 GPT OSS vLLM image to latest 0.16.0

5 participants