
[WIP] B200 MTP configs for Qwen3.5 Bf16, FP4, FP8 SGLang#897

Closed
ankursingh-nv wants to merge 2 commits into main from qwen-sglang-b200-mtp

Conversation

@ankursingh-nv
Collaborator

No description provided.

@github-actions
Contributor

github-actions bot commented Mar 9, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

add newline
@ankursingh-nv ankursingh-nv force-pushed the qwen-sglang-b200-mtp branch from e52c707 to ff781e6 Compare March 9, 2026 21:56
@ankursingh-nv ankursingh-nv deleted the qwen-sglang-b200-mtp branch March 9, 2026 22:06
Comment on lines +44 to +45

# MTP (Multi-Token Prediction) Config - EAGLE speculative decoding
Contributor


🔴 MAX_RUNNING_REQUESTS=128 is hardcoded in all three MTP scripts but the YAML configs set conc-end: 512, so at concurrency >128 the server queues excess requests instead of running them concurrently -- benchmarks will measure queuing behavior, not true high-concurrency performance. Change MAX_RUNNING_REQUESTS to 512 (matching dsr1_fp8_b200_mtp.sh which already does this correctly), or set it to $CONC to stay consistent with CUDA_GRAPH_MAX_BATCH_SIZE=$CONC.

Extended reasoning...

What the bug is: All three new Qwen3.5 MTP benchmark scripts (qwen3.5_bf16_b200_mtp.sh, qwen3.5_fp4_b200_mtp.sh, qwen3.5_fp8_b200_mtp.sh) hardcode MAX_RUNNING_REQUESTS=128 on line 42, while their corresponding YAML configs in nvidia-master.yaml specify conc-end: 512. The --max-running-requests flag limits how many requests the SGLang server will process simultaneously.

How it manifests: When the benchmark client sends requests at concurrency levels above 128 (e.g., 256 or 512), the server will only run 128 concurrently and queue the rest. This means the benchmark is measuring server queuing + processing behavior rather than actual 512-concurrent-request throughput, which defeats the purpose of testing at higher concurrency levels.

Step-by-step proof: The YAML config sets conc-end: 512, so the benchmark harness will sweep concurrency from 4 up to 512. At CONC=512, the script sets CUDA_GRAPH_MAX_BATCH_SIZE=512 (line 41) and passes --max-concurrency 512 to the benchmark client (line 81). However, the server is started with --max-running-requests 128 (line 56). So out of 512 concurrent requests, only 128 run simultaneously -- the remaining 384 sit in the server queue. The CUDA graphs compiled for batch sizes 129-512 are never used, wasting GPU memory.

Why existing code does not prevent it: The non-MTP Qwen3.5 scripts also use MAX_RUNNING_REQUESTS=128, but their YAML configs only go up to conc-end: 64, so there is no mismatch. The MTP scripts were likely copy-pasted from the non-MTP versions without updating this value to match the expanded concurrency range.

Comparison with correct implementation: The existing dsr1_fp8_b200_mtp.sh script (line 43) correctly sets MAX_RUNNING_REQUESTS=512 to match its conc-end: 512 config, confirming this is an oversight in the new scripts.

Fix: Change MAX_RUNNING_REQUESTS=128 to MAX_RUNNING_REQUESTS=512 (or MAX_RUNNING_REQUESTS=$CONC) in all three files: qwen3.5_bf16_b200_mtp.sh:42, qwen3.5_fp4_b200_mtp.sh:42, and qwen3.5_fp8_b200_mtp.sh:42.
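The suggested fix can be sketched as a small shell fragment. The variable names (`CONC`, `MAX_RUNNING_REQUESTS`, `CUDA_GRAPH_MAX_BATCH_SIZE`) are taken from the review's description of the scripts, not from the scripts themselves, so treat this as an illustrative sketch:

```shell
# Sketch of the suggested fix, assuming the variable names quoted in the
# review. Tying the cap to $CONC keeps --max-running-requests consistent
# with CUDA_GRAPH_MAX_BATCH_SIZE at every swept concurrency level, so
# requests are executed concurrently rather than queued.
CONC=${CONC:-512}
MAX_RUNNING_REQUESTS=$CONC
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC

echo "max-running-requests=$MAX_RUNNING_REQUESTS"
```

Using `$CONC` (rather than the literal `512`) also keeps the cap correct if the YAML's `conc-end` changes again later.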

Comment on lines +1866 to +1870
- { tp: 4, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
- isl: 1024
osl: 8192
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
Contributor


🔴 The qwen3.5-fp8-b200-sglang-mtp and qwen3.5-fp4-b200-sglang-mtp configs only specify tp:4, ep:1 in their search space, which means only 4 of the 8 available B200 GPUs are utilized (total GPUs = TP * DP = 4 * 1 = 4). Every comparable config includes tp:8 entries — the non-MTP FP8 config has tp:8, ep:1 entries, the BF16 MTP config uses tp:8, ep:1, and the DSR1 MTP config uses tp:8, ep:1. Consider adding tp:8, ep:1 search-space entries so the benchmark can explore full-node utilization.

Extended reasoning...

What the bug is

The newly added qwen3.5-fp8-b200-sglang-mtp and qwen3.5-fp4-b200-sglang-mtp configs exclusively use tp: 4, ep: 1 across all three sequence-length configurations (1k1k, 1k8k, 8k1k). The benchmark scripts hardcode --data-parallel-size=1, so the total GPU count is TP * DP = 4 * 1 = 4. This leaves 4 of the 8 B200 GPUs on the node completely idle.

Concrete comparison with other configs

Walking through the comparable configs in this file:

  1. Non-MTP FP8 (qwen3.5-fp8-b200-sglang, lines 1807-1830): Uses both tp:8, ep:1 (8 GPUs) and tp:4, ep:4 (4 GPUs) — the search space explores both parallelism strategies.
  2. BF16 MTP (qwen3.5-bf16-b200-sglang-mtp, lines 1832-1852): Uses tp:8, ep:1 for all entries — all 8 GPUs.
  3. DSR1 FP8 MTP (dsr1-fp8-b200-sglang-mtp, lines 1942-1962): Uses tp:8, ep:1 for all entries — all 8 GPUs.
  4. FP8 MTP (qwen3.5-fp8-b200-sglang-mtp, lines 1854-1874): Uses only tp:4, ep:1 — 4 GPUs.
  5. FP4 MTP (qwen3.5-fp4-b200-sglang-mtp, lines 1876-1896): Uses only tp:4, ep:1 — 4 GPUs.

The FP8/FP4 MTP configs are the only ones in this family without any tp:8 search-space entries.

Addressing the intentionality argument

One could argue that TP=4 is intentional for quantized models because they fit on fewer GPUs and lower TP reduces communication overhead. However, this argument is weakened by several facts: (a) the non-MTP FP8 config already includes tp:8 entries alongside tp:4, so the project clearly considers TP=8 worth benchmarking for this quantized model; (b) the BF16 MTP config uses TP=8 — if BF16 (which is larger than FP8/FP4) works with MTP at TP=8, then FP8/FP4 with MTP certainly should too; (c) the DSR1 MTP config uses TP=8; (d) the purpose of the search space is to explore configurations, so omitting TP=8 entirely means the benchmark never even tests whether full-node utilization would be faster.

Note: EP (expert parallelism) distributes MoE experts within the TP group — it does not add additional GPUs beyond TP. So tp:4, ep:4 still uses 4 GPUs, not 8. The key comparison is that the non-MTP FP8 config has tp:8 entries (using all 8 GPUs) while the MTP variants do not.
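The GPU-count arithmetic above can be checked with a trivial calculation (values taken from the review; the `TP`/`DP` names are illustrative):

```shell
# Total GPUs = tensor-parallel size * data-parallel size.
# Expert parallelism partitions MoE experts within the TP group
# and does not add GPUs beyond TP.
TP=4
DP=1   # the scripts hardcode --data-parallel-size=1 per the review
TOTAL_GPUS=$((TP * DP))
echo "$TOTAL_GPUS"   # 4 of the node's 8 B200 GPUs
```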

Impact

Running these benchmarks wastes half the B200 hardware on the node. The resulting throughput numbers will not reflect what the model can achieve when utilizing the full 8-GPU node, making the MTP benchmark results non-comparable with the non-MTP FP8 results (which do use 8 GPUs at TP=8).

Suggested fix

Add tp:8, ep:1 search-space entries to both qwen3.5-fp8-b200-sglang-mtp and qwen3.5-fp4-b200-sglang-mtp, matching the pattern used by the BF16 MTP config. The existing tp:4, ep:1 entries could be kept as additional search points (similar to how the non-MTP FP8 config has both), or changed to tp:4, ep:4 to at least distribute experts optimally when using 4 GPUs.
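As a sketch, the suggested fix might look like the following search space. This is a hypothetical fragment modeled on the `tp: 4` entries quoted above; whether to keep the `tp: 4, ep: 1` entry or change it to `tp: 4, ep: 4` is the judgment call described in the comment:

```yaml
# Hypothetical search-space for qwen3.5-fp8-b200-sglang-mtp (and the FP4
# twin), adding a tp:8 entry for full-node utilization while keeping the
# existing tp:4 entry as an additional search point.
search-space:
  - { tp: 8, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
  - { tp: 4, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
```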

- "EAGLE speculative decoding: num-steps 3, draft-tokens 4, topk 1"
- "New scripts: benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh, qwen3.5_fp8_b200_mtp.sh, qwen3.5_fp4_b200_mtp.sh"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/897

Contributor


🟡 The new perf-changelog entry has pr-link set to /pull/XXX (placeholder) instead of /pull/897. Please update to the actual PR number before merging.

Extended reasoning...

What the bug is

The new perf-changelog.yaml entry added by this PR at line 944 sets pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, which is a placeholder URL. Since this is PR #897, the link should be https://github.com/SemiAnalysisAI/InferenceX/pull/897.

How it manifests

Any tooling or documentation that consumes perf-changelog.yaml to generate changelogs, dashboards, or PR cross-references will produce a broken link pointing to a non-existent /pull/XXX page. Anyone clicking the link will get a 404.

Step-by-step proof

  1. Look at the diff for perf-changelog.yaml — the new entry starting at line 935 adds three config keys: qwen3.5-bf16-b200-sglang-mtp, qwen3.5-fp8-b200-sglang-mtp, qwen3.5-fp4-b200-sglang-mtp.
  2. Line 944 reads: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
  3. The PR metadata confirms this is PR [WIP] B200 MTP configs for Qwen3.5 Bf16, FP4, FP8 SGLang #897 (title: "[WIP] B200 MTP configs for Qwen3.5 Bf16, FP4, FP8 SGLang").
  4. Therefore XXX should be replaced with 897.

Note on pre-existing instances

There are a handful of other entries in perf-changelog.yaml that also have /pull/XXX (e.g., the GLM-5, MiniMax H200, Qwen3.5 MI325X/MI300x FP8 entries). Those are pre-existing placeholders from other PRs and are not introduced by this PR. Only the new entry added here needs to be fixed.

Fix

Change line 944 from:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

to:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/897
