Add MiniMax-M2.5 FP8 vLLM benchmark for B200 #757
functionstackx wants to merge 19 commits into main from
Conversation
- Add benchmark script benchmarks/minimaxm2.5_fp8_b200.sh
- Add config entry minimaxm2.5-fp8-b200-vllm to nvidia-master.yaml
- Image: vllm/vllm-openai:v0.15.1-cu130
- Model: MiniMaxAI/MiniMax-M2.5
- TP=2 and TP=4, concurrency 4-64
- Sequence lengths: 1k1k, 1k8k, 8k1k
- Update perf-changelog.yaml

Closes #756

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
@claude please review. I made a few changes.
Claude finished @kedarpotdar-nv's task in 2m 40s. Review of PR #757
Bug: Wrong
@claude implement this fix in this PR's runners/launch_b200-dgxc-slurm.sh too
Claude finished @functionstackx's task in 1m 5s. Implementing fix in runners/launch_b200-dgxc-slurm.sh
Change HF_HUB_CACHE_MOUNT from /home/sa-shared/gharunners/hf-hub-cache/ to /scratch/fsw/gharunners/hf-hub-cache

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
@ankursingh-nv I cancelled the workflow as it seems to be hard-erroring out.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    --gpu-memory-utilization 0.95 \
    --max-model-len $MAX_MODEL_LEN \
    --block-size=32 \
    --enable-expert-parallel \
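The flags under review can be assembled into a single launch command. A minimal sketch, assuming the model and TP size from the PR description; the MAX_MODEL_LEN value (16384) is an assumption for illustration, and the command is echoed rather than executed:

```shell
# Hypothetical assembly of the reviewed vllm serve flags into one command.
# MODEL and TP come from the PR description; MAX_MODEL_LEN is an assumption.
MODEL="MiniMaxAI/MiniMax-M2.5"
TP=2
MAX_MODEL_LEN=16384
CMD="vllm serve $MODEL \
  --tensor-parallel-size $TP \
  --gpu-memory-utilization 0.95 \
  --max-model-len $MAX_MODEL_LEN \
  --block-size=32 \
  --enable-expert-parallel"
# Print instead of launching, so the sketch is safe to run anywhere.
echo "$CMD"
```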
@ankursingh-nv can you set the EP setting in nvidia-master.yaml, and then in this script, if EP is set, enable --enable-expert-parallel
@claude create a new PR from remote main that documents situations where vllm serve doesn't use --expert-parallel-size but uses --enable-expert-parallel directly, in CLAUDE.md, claude.yml, claude-review.yml
Claude finished @functionstackx's task in 2m 7s. Creating PR for --enable-expert-parallel documentation
- Gather context: read CLAUDE.md, claude.yml, claude-pr-review.yml
- Create new branch from remote main
- Add documentation about --enable-expert-parallel vs --expert-parallel-size usage
- Commit and push changes
- Create PR
Created PR #864: Document expert parallelism flag differences in Claude configs
Changes made to claude.yml and claude-pr-review.yml:
- vLLM uses --enable-expert-parallel (a boolean flag); it does NOT accept --expert-parallel-size
- SGLang uses --expert-parallel-size N (an explicit integer)
- ATOM (AMD vLLM fork) uses --enable-expert-parallel (same as vLLM)
- Scripts must conditionally enable --enable-expert-parallel based on the EP_SIZE env var, not hardcode it
- Added a PR review validation rule to flag hardcoded --enable-expert-parallel
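The conditional pattern described above can be sketched in shell. This is a minimal sketch, assuming EP_SIZE is the env var name set from nvidia-master.yaml; the variable names and gating logic here are illustrative, not the PR's actual implementation:

```shell
# Sketch: gate the boolean --enable-expert-parallel flag on EP_SIZE
# instead of hardcoding it in the benchmark script.
EP_SIZE="${EP_SIZE:-}"   # assumed to be exported by the launcher; empty = EP off
EXTRA_ARGS=""
if [ -n "$EP_SIZE" ] && [ "$EP_SIZE" -gt 1 ]; then
  # vLLM only takes the boolean flag; EP_SIZE here just gates whether
  # the flag is passed at all.
  EXTRA_ARGS="--enable-expert-parallel"
fi
echo "extra args: $EXTRA_ARGS"
```

The serve command would then append `$EXTRA_ARGS`, so the same script works for both EP and non-EP configs.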
Note: There is no CLAUDE.md file in this repo — the Claude instructions live in the workflow prompt: fields in claude.yml and claude-pr-review.yml.
@ankursingh-nv @kedarpotdar-nv can you also link the vLLM recipes for MiniMax and Kimi K2.5 in the official NVIDIA docs too:
https://docs.nvidia.com/llm-inference-quick-start-recipes/index.html
Sure, let me address the ep setting.
@claude update the PR description; it seems the image is now 0.16 and not 0.15.1
Claude finished @functionstackx's task in 32s. Updating PR Description
Updated the PR description to reflect the correct image version.
    export VLLM_USE_FLASHINFER_MOE_FP8=0
    export VLLM_MOE_USE_DEEP_GEMM=0
Please document this too in the vLLM recipes.

Add MiniMax-M2.5 FP8 vLLM benchmark configuration for B200 GPUs.
- Add benchmark script benchmarks/single_node/minimaxm2.5_fp8_b200.sh
- Add config entry minimaxm2.5-fp8-b200-vllm in nvidia-master.yaml
- Image: vllm/vllm-openai:v0.16.0-cu130

Closes #756
Generated with Claude Code