
Add MiniMax-M2.5 FP8 vLLM benchmark for B200 #757

Open
functionstackx wants to merge 19 commits into main from claude/issue-756-20260219-0235

Conversation

@functionstackx
Contributor

@functionstackx functionstackx commented Feb 19, 2026

Add MiniMax-M2.5 FP8 vLLM benchmark configuration for B200 GPUs.

  • New benchmark script benchmarks/single_node/minimaxm2.5_fp8_b200.sh
  • Config entry minimaxm2.5-fp8-b200-vllm in nvidia-master.yaml
  • Image: vllm/vllm-openai:v0.16.0-cu130
  • TP=2 and TP=4, concurrency 4-64
  • Sequence lengths: 1k1k, 1k8k, 8k1k

Closes #756

Generated with Claude Code

github-actions bot and others added 2 commits February 19, 2026 13:46
- Add benchmark script benchmarks/minimaxm2.5_fp8_b200.sh
- Add config entry minimaxm2.5-fp8-b200-vllm to nvidia-master.yaml
  - Image: vllm/vllm-openai:v0.15.1-cu130
  - Model: MiniMaxAI/MiniMax-M2.5
  - TP=2 and TP=4, concurrency 4-64
  - Sequence lengths: 1k1k, 1k8k, 8k1k
- Update perf-changelog.yaml

Closes #756

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
@functionstackx functionstackx force-pushed the claude/issue-756-20260219-0235 branch from 34f7756 to cbc5d0b on February 19, 2026 18:46
@kedarpotdar-nv kedarpotdar-nv self-assigned this Feb 19, 2026
@kedarpotdar-nv
Collaborator

@claude please review. I made a few changes.

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 20, 2026

Claude finished @kedarpotdar-nv's task in 2m 40s


Review of PR #757

  • Gather context and read changed files
  • Review benchmark script
  • Review nvidia-master.yaml config
  • Review perf-changelog.yaml
  • Post review feedback

Bug: Wrong benchmark_lib.sh source path

benchmarks/single_node/minimaxm2.5_fp8_b200.sh:3 sources benchmark_lib.sh with:

source "$(dirname "$0")/benchmark_lib.sh"

This resolves to benchmarks/single_node/benchmark_lib.sh, which does not exist. The file is at benchmarks/benchmark_lib.sh. Every other script in single_node/ uses /../:

source "$(dirname "$0")/../benchmark_lib.sh"

This will cause the script to fail immediately at runtime.

Fix this →
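
The relative-path behavior is easy to verify in isolation. This throwaway sketch (the temp-dir layout is illustrative, not the real repo) mimics the `benchmarks/` tree and confirms that the `/../` form finds the lib one level above `single_node/`:

```shell
#!/usr/bin/env bash
# Demo: recreate the benchmarks/ layout in a temp dir and confirm that
# sourcing via "$(dirname "$0")/../benchmark_lib.sh" resolves to the
# lib one directory above single_node/.
set -euo pipefail
tmp=$(mktemp -d)
mkdir -p "$tmp/benchmarks/single_node"
echo 'lib_loaded=yes' > "$tmp/benchmarks/benchmark_lib.sh"
cat > "$tmp/benchmarks/single_node/run.sh" <<'EOF'
#!/usr/bin/env bash
source "$(dirname "$0")/../benchmark_lib.sh"
echo "$lib_loaded"
EOF
bash "$tmp/benchmarks/single_node/run.sh"   # prints: yes
rm -rf "$tmp"
```

Without the `../`, the same demo fails with "No such file or directory" at the `source` line, which is exactly the runtime failure flagged above.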


Benchmark Script (minimaxm2.5_fp8_b200.sh)

Good changes:

  • --enable-expert-parallel (line 32) is appropriate — MiniMax-M2.5 is a MoE model (456B total, ~47B active params). The MI355X version (minimaxm2.5_fp8_mi355x.sh) does NOT have this flag, so this is a deliberate addition for B200.
  • --trust-remote-code on both the vLLM serve and benchmark client side is correct for this model.
  • --block-size=32, --gpu-memory-utilization 0.95, and the overall script structure match existing patterns well.

No explicit FP8 quantization flags: The script doesn't pass --quantization fp8 or --kv-cache-dtype fp8. This is fine since MiniMax-M2.5 appears to ship with native FP8 weights, matching the approach in the MI355X version.


Config (nvidia-master.yaml)

The config entry at line 3018 looks correct:

  • runner: b200-nvs — matches the appropriate B200 runner
  • image: vllm/vllm-openai:v0.15.1-cu130 — new vLLM version, first config to use this tag
  • Search space (TP=2 and TP=4, conc 4-64) across all three seq-len configs is reasonable

No issues found here.


Changelog (perf-changelog.yaml)

Content is correct. Minor: trailing whitespace on line 690 (pr-link: line) and trailing blank line at 692. Not a blocker.


Summary

One blocking issue: the benchmark_lib.sh source path bug will prevent the script from running. Everything else looks good — the --enable-expert-parallel addition and overall config are solid.

@functionstackx
Contributor Author

@claude implement this fix in this PR's runners/launch_b200-dgxc-slurm.sh too

[screenshot]

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 24, 2026

Claude finished @functionstackx's task in 1m 5s


Implementing fix in runners/launch_b200-dgxc-slurm.sh

  • Gather context and understand the fix from the screenshot
  • Implement the fix — change HF_HUB_CACHE_MOUNT path from /home/sa-shared/gharunners/hf-hub-cache/ to /scratch/fsw/gharunners/hf-hub-cache
  • Commit and push changes

Updated HF_HUB_CACHE_MOUNT path in runners/launch_b200-dgxc-slurm.sh:216 from /home/sa-shared/gharunners/hf-hub-cache/ to /scratch/fsw/gharunners/hf-hub-cache to match the correct mount point.
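
For reference, the resulting assignment (variable name taken from the comment above; the surrounding launcher logic is omitted) is simply:

```shell
# Mount point fixed in this PR: the old /home/sa-shared/... path no
# longer matches where the HF hub cache actually lives on the runner.
export HF_HUB_CACHE_MOUNT=/scratch/fsw/gharunners/hf-hub-cache
```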


Change HF_HUB_CACHE_MOUNT from /home/sa-shared/gharunners/hf-hub-cache/
to /scratch/fsw/gharunners/hf-hub-cache

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
@cquil11 cquil11 requested a review from a team February 25, 2026 19:06
@cquil11
Collaborator

cquil11 commented Feb 25, 2026

@functionstackx
Contributor Author

@ankursingh-nv I cancelled the workflow, as it seems to be hard-erroring out.

@ankursingh-nv ankursingh-nv force-pushed the claude/issue-756-20260219-0235 branch from 66a55e0 to b56773c on March 4, 2026 23:06
@ankursingh-nv ankursingh-nv requested a review from jgangani as a code owner March 4, 2026 23:06
--gpu-memory-utilization 0.95 \
--max-model-len $MAX_MODEL_LEN \
--block-size=32 \
--enable-expert-parallel \
Contributor Author


@ankursingh-nv can you set the EP setting in nvidia-master.yaml, and then in this script pass --enable-expert-parallel only if EP is set?

Contributor Author


@claude create a new PR from remote main that documents situations where vllm serve doesn't use --expert-parallel-size but instead --enable-expert-parallel, in CLAUDE.md, claude.yml, claude-review.yml

Collaborator

@Klaud-Cold Klaud-Cold Mar 5, 2026


Claude finished @functionstackx's task in 2m 7s


Creating PR for --enable-expert-parallel documentation

  • Gather context: read CLAUDE.md, claude.yml, claude-pr-review.yml
  • Create new branch from remote main
  • Add documentation about --enable-expert-parallel vs --expert-parallel-size usage
  • Commit and push changes
  • Create PR

Created PR #864: Document expert parallelism flag differences in Claude configs

Changes made to claude.yml and claude-pr-review.yml:

  • vLLM uses --enable-expert-parallel (boolean flag) — does NOT accept --expert-parallel-size
  • SGLang uses --expert-parallel-size N (explicit integer)
  • ATOM (AMD vLLM fork) uses --enable-expert-parallel (same as vLLM)
  • Scripts must conditionally enable --enable-expert-parallel based on EP_SIZE env var, not hardcode it
  • Added PR review validation rule to flag hardcoded --enable-expert-parallel

Note: There is no CLAUDE.md file in this repo — the Claude instructions live in the workflow prompt: fields in claude.yml and claude-pr-review.yml.
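
A minimal sketch of the conditional pattern described above, assuming an `EP_SIZE` variable exported from the config (the function and variable names here are illustrative, not taken from this repo):

```shell
#!/usr/bin/env bash
# Build the vllm serve argument list, adding --enable-expert-parallel
# only when the config sets EP_SIZE above 1. EP_SIZE, TP, and
# build_serve_args are assumed names for this sketch.
build_serve_args() {
    local args=(--tensor-parallel-size "${TP:-2}")
    if [ "${EP_SIZE:-1}" -gt 1 ]; then
        args+=(--enable-expert-parallel)
    fi
    printf '%s\n' "${args[@]}"
}
```

With `EP_SIZE=4` the emitted arguments include `--enable-expert-parallel`; with `EP_SIZE` unset they do not, so the same script serves both EP and non-EP configs without hardcoding the flag.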

Contributor Author


@ankursingh-nv @kedarpotdar-nv can you also link the vLLM recipes for MiniMax and Kimi K2.5 in the official NVIDIA docs too?

https://docs.nvidia.com/llm-inference-quick-start-recipes/index.html

Collaborator


Sure, let me address the ep setting.

@functionstackx
Copy link
Contributor Author

@claude update the PR description; it seems the image is now 0.16.0 and not 0.15.1

@Klaud-Cold
Copy link
Collaborator

Klaud-Cold commented Mar 5, 2026

Claude finished @functionstackx's task in 32s


Updating PR Description

  • Check current image version in config
  • Update PR description

Updated the PR description to reflect the correct image version: vllm/vllm-openai:v0.16.0-cu130 (was v0.15.1-cu130).


Comment on lines +26 to +27
export VLLM_USE_FLASHINFER_MOE_FP8=0
export VLLM_MOE_USE_DEEP_GEMM=0
Contributor Author


Please document this in the vLLM recipes too.
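
If these exports land in the recipe docs, a commented version might look like the following; the rationale in the comments is my assumption about intent, so confirm it with the script author before publishing:

```shell
# Opt out of the FlashInfer FP8 MoE path for this model
# (assumed reason: kernel availability/behavior on this vLLM build).
export VLLM_USE_FLASHINFER_MOE_FP8=0
# Likewise opt out of the DeepGEMM-backed MoE kernels.
export VLLM_MOE_USE_DEEP_GEMM=0
```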


Development

Successfully merging this pull request may close these issues.

b200 MiniMax M2.5 FP8
