[WIP] B200 MTP configs for Qwen3.5 Bf16, FP4, FP8 SGLang #897
`benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh` (new file, 90 lines added):

```bash
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC >= 16, relax to ~30 requests between scheduler recv polls.
if [[ $CONC -ge 16 ]]; then
    SCHEDULER_RECV_INTERVAL=30
else
    SCHEDULER_RECV_INTERVAL=10
fi

MEM_FRAC_STATIC=0.8
CHUNKED_PREFILL_SIZE=32768
MAX_PREFILL_TOKENS=32768
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC
MAX_RUNNING_REQUESTS=128
CONTEXT_LENGTH=$((ISL + OSL + 20))

# MTP (Multi-Token Prediction) Config - EAGLE speculative decoding
SPECULATIVE_NUM_STEPS=3
SPECULATIVE_DRAFT_TOKENS=4
SPECULATIVE_EAGLE_TOPK=1

echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
    --trust-remote-code \
    --tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
    --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
    --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
    --context-length $CONTEXT_LENGTH --disable-radix-cache \
    --attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
    --enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
    --tokenizer-worker-num 6 --stream-interval 30 \
    --speculative-algorithm EAGLE --speculative-num-steps $SPECULATIVE_NUM_STEPS --speculative-eagle-topk $SPECULATIVE_EAGLE_TOPK --speculative-num-draft-tokens $SPECULATIVE_DRAFT_TOKENS \
    > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
    append_lm_eval_summary
fi
set +x
```

Contributor comment (on the `MAX_RUNNING_REQUESTS` line):

🔴 `MAX_RUNNING_REQUESTS=128` is hardcoded in all three MTP scripts, but the YAML configs set `conc-end: 512`, so at concurrency above 128 the server queues excess requests instead of running them concurrently: the benchmarks will measure queuing behavior, not true high-concurrency performance. Change `MAX_RUNNING_REQUESTS` to 512 (matching `dsr1_fp8_b200_mtp.sh`, which already does this correctly), or set it to `$CONC` to stay consistent with `CUDA_GRAPH_MAX_BATCH_SIZE=$CONC`.

Extended reasoning:

What the bug is: All three new Qwen3.5 MTP benchmark scripts (`qwen3.5_bf16_b200_mtp.sh`, `qwen3.5_fp4_b200_mtp.sh`, `qwen3.5_fp8_b200_mtp.sh`) hardcode `MAX_RUNNING_REQUESTS=128` on line 42, while their corresponding YAML configs in `nvidia-master.yaml` specify `conc-end: 512`. The `--max-running-requests` flag limits how many requests the SGLang server will process simultaneously.

How it manifests: When the benchmark client sends requests at concurrency levels above 128 (e.g. 256 or 512), the server runs only 128 concurrently and queues the rest. The benchmark therefore measures server queuing plus processing behavior rather than actual 512-concurrent-request throughput, which defeats the purpose of testing at higher concurrency levels.

Step-by-step proof: The YAML config sets `conc-end: 512`, so the benchmark harness sweeps concurrency from 4 up to 512. At `CONC=512`, the script sets `CUDA_GRAPH_MAX_BATCH_SIZE=512` and passes `--max-concurrency 512` to the benchmark client, yet the server is started with `--max-running-requests 128`. Of 512 concurrent requests, only 128 run simultaneously; the remaining 384 sit in the server queue, and the CUDA graphs compiled for batch sizes 129-512 are never used, wasting GPU memory.

Why existing code does not prevent it: The non-MTP Qwen3.5 scripts also use `MAX_RUNNING_REQUESTS=128`, but their YAML configs only go up to `conc-end: 64`, so there is no mismatch. The MTP scripts were likely copy-pasted from the non-MTP versions without updating this value for the expanded concurrency range.

Comparison with correct implementation: The existing `dsr1_fp8_b200_mtp.sh` correctly sets `MAX_RUNNING_REQUESTS=512` to match its `conc-end: 512` config, confirming this is an oversight in the new scripts.

Fix: Change `MAX_RUNNING_REQUESTS=128` to `MAX_RUNNING_REQUESTS=512` (or `MAX_RUNNING_REQUESTS=$CONC`) in all three files: `qwen3.5_bf16_b200_mtp.sh:42`, `qwen3.5_fp4_b200_mtp.sh:42`, and `qwen3.5_fp8_b200_mtp.sh:42`.
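The reviewer's queueing arithmetic can be sketched as a quick standalone check. This is illustrative only; the 512 and 128 values come from the review comment, not from running the benchmark:

```shell
# Illustrative check of the reviewer's claim: with a concurrency sweep up to
# conc-end: 512 but --max-running-requests 128, only 128 requests run at once
# and the rest wait in the server queue.
CONC=512                  # top of the YAML concurrency sweep
MAX_RUNNING_REQUESTS=128  # hardcoded cap in the three MTP scripts
running=$(( CONC < MAX_RUNNING_REQUESTS ? CONC : MAX_RUNNING_REQUESTS ))
queued=$(( CONC - running ))
echo "running=$running queued=$queued"   # running=128 queued=384
```

The 384 queued requests are exactly the gap the comment describes: CUDA graphs are compiled for batch sizes up to 512, but batches above 128 never form.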
`benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh` (new file, 90 lines added); identical to the BF16 script except for the FP4 GEMM backend and FP8 KV cache flags on the attention/MoE backend line:

```bash
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC >= 16, relax to ~30 requests between scheduler recv polls.
if [[ $CONC -ge 16 ]]; then
    SCHEDULER_RECV_INTERVAL=30
else
    SCHEDULER_RECV_INTERVAL=10
fi

MEM_FRAC_STATIC=0.8
CHUNKED_PREFILL_SIZE=32768
MAX_PREFILL_TOKENS=32768
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC
MAX_RUNNING_REQUESTS=128
CONTEXT_LENGTH=$((ISL + OSL + 20))

# MTP (Multi-Token Prediction) Config - EAGLE speculative decoding
SPECULATIVE_NUM_STEPS=3
SPECULATIVE_DRAFT_TOKENS=4
SPECULATIVE_EAGLE_TOPK=1

echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
    --trust-remote-code \
    --tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
    --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
    --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
    --context-length $CONTEXT_LENGTH --disable-radix-cache \
    --attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm --fp4-gemm-backend flashinfer_cutlass --kv-cache-dtype fp8_e4m3 \
    --enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
    --tokenizer-worker-num 6 --stream-interval 30 \
    --speculative-algorithm EAGLE --speculative-num-steps $SPECULATIVE_NUM_STEPS --speculative-eagle-topk $SPECULATIVE_EAGLE_TOPK --speculative-num-draft-tokens $SPECULATIVE_DRAFT_TOKENS \
    > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
    append_lm_eval_summary
fi
set +x
```
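The sizing logic shared by these scripts can be exercised standalone as a sanity check. The `CONC`/`ISL`/`OSL` values below are illustrative placeholders, not taken from the PR's YAML configs:

```shell
# Reproduce the scripts' derived settings for one hypothetical benchmark
# point, without launching a server.
CONC=64 ISL=1024 OSL=1024

# Same derivations as the benchmark scripts above:
if [[ $CONC -ge 16 ]]; then SCHEDULER_RECV_INTERVAL=30; else SCHEDULER_RECV_INTERVAL=10; fi
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC          # graphs sized to the swept concurrency
CONTEXT_LENGTH=$((ISL + OSL + 20))       # input + output + small margin
NUM_PROMPTS=$((CONC * 10))               # benchmark client prompt count

echo "recv=$SCHEDULER_RECV_INTERVAL graph_bs=$CUDA_GRAPH_MAX_BATCH_SIZE ctx=$CONTEXT_LENGTH prompts=$NUM_PROMPTS"
# recv=30 graph_bs=64 ctx=2068 prompts=640
```

Note that `MAX_RUNNING_REQUESTS` is the one setting that does not scale with `CONC`, which is what the review comment above flags.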
`benchmarks/single_node/qwen3.5_fp8_b200_mtp.sh` (new file, 91 lines added); identical to the BF16 script except for an additional quantization line in the server launch:

```bash
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC >= 16, relax to ~30 requests between scheduler recv polls.
if [[ $CONC -ge 16 ]]; then
    SCHEDULER_RECV_INTERVAL=30
else
    SCHEDULER_RECV_INTERVAL=10
fi

MEM_FRAC_STATIC=0.8
CHUNKED_PREFILL_SIZE=32768
MAX_PREFILL_TOKENS=32768
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC
MAX_RUNNING_REQUESTS=128
CONTEXT_LENGTH=$((ISL + OSL + 20))

# MTP (Multi-Token Prediction) Config - EAGLE speculative decoding
SPECULATIVE_NUM_STEPS=3
SPECULATIVE_DRAFT_TOKENS=4
SPECULATIVE_EAGLE_TOPK=1

echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
    --trust-remote-code \
    --tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
    --quantization fp8 --kv-cache-dtype fp8_e4m3 --mamba-ssm-dtype bfloat16 \
    --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
    --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
    --context-length $CONTEXT_LENGTH --disable-radix-cache \
    --attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
    --enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
    --tokenizer-worker-num 6 --stream-interval 30 \
    --speculative-algorithm EAGLE --speculative-num-steps $SPECULATIVE_NUM_STEPS --speculative-eagle-topk $SPECULATIVE_EAGLE_TOPK --speculative-num-draft-tokens $SPECULATIVE_DRAFT_TOKENS \
    > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
    append_lm_eval_summary
fi
set +x
```
Perf-changelog entry added in `nvidia-master.yaml` (hunk `@@ -931,4 +931,14 @@`; the first three lines are unchanged context from the preceding entry):

```yaml
    - "Switch to --attention-backend ROCM_AITER_UNIFIED_ATTN and add fuse_rope_kvcache compilation pass"
    - "Remove deprecated VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION/VLLM_ROCM_USE_AITER_MHA env vars and compilation-config cudagraph_mode"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/867

- config-keys:
    - qwen3.5-bf16-b200-sglang-mtp
    - qwen3.5-fp8-b200-sglang-mtp
    - qwen3.5-fp4-b200-sglang-mtp
  description:
    - "Add Single Node Agg MTP configs for Qwen3.5 B200 SGLang (bf16, fp8, fp4)"
    - "EAGLE speculative decoding: num-steps 3, draft-tokens 4, topk 1"
    - "New scripts: benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh, qwen3.5_fp8_b200_mtp.sh, qwen3.5_fp4_b200_mtp.sh"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/897
```

Contributor comment:

🟡 The new perf-changelog entry originally left `pr-link` as the placeholder `https://github.com/SemiAnalysisAI/InferenceX/pull/XXX` instead of the actual PR number.

Extended reasoning:

What the bug is: The entry's `pr-link` field pointed at the placeholder `pull/XXX` rather than this PR.

How it manifests: Any tooling or documentation that consumes the changelog's `pr-link` field receives a link that does not resolve to the PR that made the change.

Note on pre-existing instances: There are a handful of other entries in the file with the same kind of issue, but those are pre-existing and out of scope for this PR.

Fix: Change line 944 from `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX` to `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/897`.
Contributor comment:

🔴 The `qwen3.5-fp8-b200-sglang-mtp` and `qwen3.5-fp4-b200-sglang-mtp` configs only specify `tp: 4, ep: 1` in their search space, which means only 4 of the 8 available B200 GPUs are utilized (total GPUs = TP * DP = 4 * 1 = 4). Every comparable config includes `tp: 8` entries: the non-MTP FP8 config has `tp: 8, ep: 1` entries, the BF16 MTP config uses `tp: 8, ep: 1`, and the DSR1 MTP config uses `tp: 8, ep: 1`. Consider adding `tp: 8, ep: 1` search-space entries so the benchmark can explore full-node utilization.

Extended reasoning:

What the bug is: The newly added `qwen3.5-fp8-b200-sglang-mtp` and `qwen3.5-fp4-b200-sglang-mtp` configs exclusively use `tp: 4, ep: 1` across all three sequence-length configurations (1k1k, 1k8k, 8k1k). The benchmark scripts hardcode `--data-parallel-size=1`, so the total GPU count is `TP * DP = 4 * 1 = 4`. This leaves 4 of the 8 B200 GPUs on the node completely idle.

Concrete comparison with other configs: Walking through the comparable configs in this file:

- Non-MTP FP8 (`qwen3.5-fp8-b200-sglang`, lines 1807-1830): uses both `tp: 8, ep: 1` (8 GPUs) and `tp: 4, ep: 4` (4 GPUs), so its search space explores both parallelism strategies.
- BF16 MTP (`qwen3.5-bf16-b200-sglang-mtp`, lines 1832-1852): uses `tp: 8, ep: 1` for all entries, i.e. all 8 GPUs.
- DSR1 MTP (`dsr1-fp8-b200-sglang-mtp`, lines 1942-1962): uses `tp: 8, ep: 1` for all entries, i.e. all 8 GPUs.
- FP8 MTP (`qwen3.5-fp8-b200-sglang-mtp`, lines 1854-1874): uses only `tp: 4, ep: 1`, i.e. 4 GPUs.
- FP4 MTP (`qwen3.5-fp4-b200-sglang-mtp`, lines 1876-1896): uses only `tp: 4, ep: 1`, i.e. 4 GPUs.

The FP8/FP4 MTP configs are the only ones in this family without any `tp: 8` search-space entries.

Addressing the intentionality argument: One could argue that TP=4 is intentional for quantized models because they fit on fewer GPUs and lower TP reduces communication overhead. However, this argument is weakened by several facts: (a) the non-MTP FP8 config already includes `tp: 8` entries alongside `tp: 4`, so the project clearly considers TP=8 worth benchmarking for this quantized model; (b) the BF16 MTP config uses TP=8, and if BF16 (which is larger than FP8/FP4) works with MTP at TP=8, then FP8/FP4 with MTP certainly should too; (c) the DSR1 MTP config uses TP=8; (d) the purpose of the search space is to explore configurations, so omitting TP=8 entirely means the benchmark never even tests whether full-node utilization would be faster.

Note: EP (expert parallelism) distributes MoE experts within the TP group; it does not add GPUs beyond TP. So `tp: 4, ep: 4` still uses 4 GPUs, not 8. The key comparison is that the non-MTP FP8 config has `tp: 8` entries (using all 8 GPUs) while the MTP variants do not.

Impact: Running these benchmarks wastes half the B200 hardware on the node. The resulting throughput numbers will not reflect what the model can achieve on the full 8-GPU node, making the MTP benchmark results non-comparable with the non-MTP FP8 results (which do use 8 GPUs at TP=8).

Suggested fix: Add `tp: 8, ep: 1` search-space entries to both `qwen3.5-fp8-b200-sglang-mtp` and `qwen3.5-fp4-b200-sglang-mtp`, matching the pattern used by the BF16 MTP config. The existing `tp: 4, ep: 1` entries could be kept as additional search points (similar to how the non-MTP FP8 config has both), or changed to `tp: 4, ep: 4` to at least distribute experts optimally when using 4 GPUs.
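The reviewer's GPU-count arithmetic can be sketched directly. This assumes DP fixed at 1 (the scripts pass `--data-parallel-size=1`) and, per the note above, that EP does not occupy GPUs beyond the TP group:

```shell
# Illustrative node-utilization arithmetic for the two TP settings discussed.
NODE_GPUS=8   # B200 GPUs per node
DP=1          # --data-parallel-size=1 in all three scripts
for TP in 4 8; do
  USED=$((TP * DP))
  echo "tp=$TP ep=1 -> uses $USED of $NODE_GPUS GPUs ($((NODE_GPUS - USED)) idle)"
done
# tp=4 ep=1 -> uses 4 of 8 GPUs (4 idle)
# tp=8 ep=1 -> uses 8 of 8 GPUs (0 idle)
```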