
[WIP] B200 MTP configs for Qwen3.5 Bf16, FP4, FP8 SGLang#897

Closed
ankursingh-nv wants to merge 2 commits into main from qwen-sglang-b200-mtp

Conversation

@ankursingh-nv
Collaborator

No description provided.

@github-actions
Contributor

github-actions bot commented Mar 9, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

add newline
@ankursingh-nv ankursingh-nv force-pushed the qwen-sglang-b200-mtp branch from e52c707 to ff781e6 Compare March 9, 2026 21:56
@ankursingh-nv ankursingh-nv deleted the qwen-sglang-b200-mtp branch March 9, 2026 22:06
Comment on lines +44 to +45

# MTP (Multi-Token Prediction) Config - EAGLE speculative decoding
Contributor


🔴 MAX_RUNNING_REQUESTS=128 is hardcoded in all three MTP scripts but the YAML configs set conc-end: 512, so at concurrency >128 the server queues excess requests instead of running them concurrently -- benchmarks will measure queuing behavior, not true high-concurrency performance. Change MAX_RUNNING_REQUESTS to 512 (matching dsr1_fp8_b200_mtp.sh which already does this correctly), or set it to $CONC to stay consistent with CUDA_GRAPH_MAX_BATCH_SIZE=$CONC.

Extended reasoning...

What the bug is: All three new Qwen3.5 MTP benchmark scripts (qwen3.5_bf16_b200_mtp.sh, qwen3.5_fp4_b200_mtp.sh, qwen3.5_fp8_b200_mtp.sh) hardcode MAX_RUNNING_REQUESTS=128 on line 42, while their corresponding YAML configs in nvidia-master.yaml specify conc-end: 512. The --max-running-requests flag limits how many requests the SGLang server will process simultaneously.

How it manifests: When the benchmark client sends requests at concurrency levels above 128 (e.g., 256 or 512), the server will only run 128 concurrently and queue the rest. This means the benchmark is measuring server queuing + processing behavior rather than actual 512-concurrent-request throughput, which defeats the purpose of testing at higher concurrency levels.

Step-by-step proof: The YAML config sets conc-end: 512, so the benchmark harness will sweep concurrency from 4 up to 512. At CONC=512, the script sets CUDA_GRAPH_MAX_BATCH_SIZE=512 (line 41) and passes --max-concurrency 512 to the benchmark client (line 81). However, the server is started with --max-running-requests 128 (line 56). So out of 512 concurrent requests, only 128 run simultaneously -- the remaining 384 sit in the server queue. The CUDA graphs compiled for batch sizes 129-512 are never used, wasting GPU memory.

Why existing code does not prevent it: The non-MTP Qwen3.5 scripts also use MAX_RUNNING_REQUESTS=128, but their YAML configs only go up to conc-end: 64, so there is no mismatch. The MTP scripts were likely copy-pasted from the non-MTP versions without updating this value to match the expanded concurrency range.

Comparison with correct implementation: The existing dsr1_fp8_b200_mtp.sh script (line 43) correctly sets MAX_RUNNING_REQUESTS=512 to match its conc-end: 512 config, confirming this is an oversight in the new scripts.

Fix: Change MAX_RUNNING_REQUESTS=128 to MAX_RUNNING_REQUESTS=512 (or MAX_RUNNING_REQUESTS=$CONC) in all three files: qwen3.5_bf16_b200_mtp.sh:42, qwen3.5_fp4_b200_mtp.sh:42, and qwen3.5_fp8_b200_mtp.sh:42.
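The suggested fix can be sketched as a small shell fragment. The variable names (`CONC`, `MAX_RUNNING_REQUESTS`, `CUDA_GRAPH_MAX_BATCH_SIZE`) are taken from the review's description of the scripts, not from the scripts themselves, so treat this as an illustrative sketch:

```shell
# Sketch of the suggested fix, assuming the variable names quoted in the
# review. Tying the cap to $CONC keeps --max-running-requests consistent
# with CUDA_GRAPH_MAX_BATCH_SIZE at every swept concurrency level, so
# requests are executed concurrently rather than queued.
CONC=${CONC:-512}
MAX_RUNNING_REQUESTS=$CONC
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC

echo "max-running-requests=$MAX_RUNNING_REQUESTS"
```

Using `$CONC` (rather than the literal `512`) also keeps the cap correct if the YAML's `conc-end` changes again later.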

Comment on lines +1866 to +1870
- { tp: 4, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
- isl: 1024
osl: 8192
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
Contributor


🔴 The qwen3.5-fp8-b200-sglang-mtp and qwen3.5-fp4-b200-sglang-mtp configs only specify tp:4, ep:1 in their search space, which means only 4 of the 8 available B200 GPUs are utilized (total GPUs = TP * DP = 4 * 1 = 4). Every comparable config includes tp:8 entries — the non-MTP FP8 config has tp:8, ep:1 entries, the BF16 MTP config uses tp:8, ep:1, and the DSR1 MTP config uses tp:8, ep:1. Consider adding tp:8, ep:1 search-space entries so the benchmark can explore full-node utilization.

Extended reasoning...

What the bug is

The newly added qwen3.5-fp8-b200-sglang-mtp and qwen3.5-fp4-b200-sglang-mtp configs exclusively use tp: 4, ep: 1 across all three sequence-length configurations (1k1k, 1k8k, 8k1k). The benchmark scripts hardcode --data-parallel-size=1, so the total GPU count is TP * DP = 4 * 1 = 4. This leaves 4 of the 8 B200 GPUs on the node completely idle.

Concrete comparison with other configs

Walking through the comparable configs in this file:

  1. Non-MTP FP8 (qwen3.5-fp8-b200-sglang, lines 1807-1830): Uses both tp:8, ep:1 (8 GPUs) and tp:4, ep:4 (4 GPUs) — the search space explores both parallelism strategies.
  2. BF16 MTP (qwen3.5-bf16-b200-sglang-mtp, lines 1832-1852): Uses tp:8, ep:1 for all entries — all 8 GPUs.
  3. DSR1 FP8 MTP (dsr1-fp8-b200-sglang-mtp, lines 1942-1962): Uses tp:8, ep:1 for all entries — all 8 GPUs.
  4. FP8 MTP (qwen3.5-fp8-b200-sglang-mtp, lines 1854-1874): Uses only tp:4, ep:1 — 4 GPUs.
  5. FP4 MTP (qwen3.5-fp4-b200-sglang-mtp, lines 1876-1896): Uses only tp:4, ep:1 — 4 GPUs.

The FP8/FP4 MTP configs are the only ones in this family without any tp:8 search-space entries.

Addressing the intentionality argument

One could argue that TP=4 is intentional for quantized models because they fit on fewer GPUs and lower TP reduces communication overhead. However, this argument is weakened by several facts: (a) the non-MTP FP8 config already includes tp:8 entries alongside tp:4, so the project clearly considers TP=8 worth benchmarking for this quantized model; (b) the BF16 MTP config uses TP=8 — if BF16 (which is larger than FP8/FP4) works with MTP at TP=8, then FP8/FP4 with MTP certainly should too; (c) the DSR1 MTP config uses TP=8; (d) the purpose of the search space is to explore configurations, so omitting TP=8 entirely means the benchmark never even tests whether full-node utilization would be faster.

Note: EP (expert parallelism) distributes MoE experts within the TP group — it does not add additional GPUs beyond TP. So tp:4, ep:4 still uses 4 GPUs, not 8. The key comparison is that the non-MTP FP8 config has tp:8 entries (using all 8 GPUs) while the MTP variants do not.
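The GPU-count arithmetic above can be checked with a trivial calculation (values taken from the review; the `TP`/`DP` names are illustrative):

```shell
# Total GPUs = tensor-parallel size * data-parallel size.
# Expert parallelism partitions MoE experts within the TP group
# and does not add GPUs beyond TP.
TP=4
DP=1   # the scripts hardcode --data-parallel-size=1 per the review
TOTAL_GPUS=$((TP * DP))
echo "$TOTAL_GPUS"   # 4 of the node's 8 B200 GPUs
```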

Impact

Running these benchmarks wastes half the B200 hardware on the node. The resulting throughput numbers will not reflect what the model can achieve when utilizing the full 8-GPU node, making the MTP benchmark results non-comparable with the non-MTP FP8 results (which do use 8 GPUs at TP=8).

Suggested fix

Add tp:8, ep:1 search-space entries to both qwen3.5-fp8-b200-sglang-mtp and qwen3.5-fp4-b200-sglang-mtp, matching the pattern used by the BF16 MTP config. The existing tp:4, ep:1 entries could be kept as additional search points (similar to how the non-MTP FP8 config has both), or changed to tp:4, ep:4 to at least distribute experts optimally when using 4 GPUs.
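As a sketch, the suggested fix might look like the following search space. This is a hypothetical fragment modeled on the `tp: 4` entries quoted above; whether to keep the `tp: 4, ep: 1` entry or change it to `tp: 4, ep: 4` is the judgment call described in the comment:

```yaml
# Hypothetical search-space for qwen3.5-fp8-b200-sglang-mtp (and the FP4
# twin), adding a tp:8 entry for full-node utilization while keeping the
# existing tp:4 entry as an additional search point.
search-space:
  - { tp: 8, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
  - { tp: 4, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
```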

- "EAGLE speculative decoding: num-steps 3, draft-tokens 4, topk 1"
- "New scripts: benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh, qwen3.5_fp8_b200_mtp.sh, qwen3.5_fp4_b200_mtp.sh"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/897

Contributor


🟡 The new perf-changelog entry has pr-link set to /pull/XXX (placeholder) instead of /pull/897. Please update to the actual PR number before merging.

Extended reasoning...

What the bug is

The new perf-changelog.yaml entry added by this PR at line 944 sets pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, which is a placeholder URL. Since this is PR #897, the link should be https://github.com/SemiAnalysisAI/InferenceX/pull/897.

How it manifests

Any tooling or documentation that consumes perf-changelog.yaml to generate changelogs, dashboards, or PR cross-references will produce a broken link pointing to a non-existent /pull/XXX page. Anyone clicking the link will get a 404.

Step-by-step proof

  1. Look at the diff for perf-changelog.yaml — the new entry starting at line 935 adds three config keys: qwen3.5-bf16-b200-sglang-mtp, qwen3.5-fp8-b200-sglang-mtp, qwen3.5-fp4-b200-sglang-mtp.
  2. Line 944 reads: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
  3. The PR metadata confirms this is PR [WIP] B200 MTP configs for Qwen3.5 Bf16, FP4, FP8 SGLang #897 (title: "[WIP] B200 MTP configs for Qwen3.5 Bf16, FP4, FP8 SGLang").
  4. Therefore XXX should be replaced with 897.

Note on pre-existing instances

There are a handful of other entries in perf-changelog.yaml that also have /pull/XXX (e.g., the GLM-5, MiniMax H200, Qwen3.5 MI325X/MI300x FP8 entries). Those are pre-existing placeholders from other PRs and are not introduced by this PR. Only the new entry added here needs to be fixed.

Fix

Change line 944 from:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

to:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/897
