66 changes: 66 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -1829,6 +1829,72 @@ qwen3.5-fp8-b200-sglang:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 4}
- { tp: 4, ep: 4, conc-start: 8, conc-end: 64 }

qwen3.5-bf16-b200-sglang-mtp:
image: lmsysorg/sglang:v0.5.9-cu130
model: Qwen/Qwen3.5-397B-A17B
model-prefix: qwen3.5
runner: b200
precision: bf16
framework: sglang
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
- isl: 1024
osl: 8192
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }

qwen3.5-fp8-b200-sglang-mtp:
image: lmsysorg/sglang:v0.5.9-cu130
model: Qwen/Qwen3.5-397B-A17B-FP8
model-prefix: qwen3.5
runner: b200
precision: fp8
framework: sglang
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
- isl: 1024
osl: 8192
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
Comment on lines +1866 to +1870
🔴 The qwen3.5-fp8-b200-sglang-mtp and qwen3.5-fp4-b200-sglang-mtp configs only specify tp:4, ep:1 in their search space, which means only 4 of the 8 available B200 GPUs are utilized (total GPUs = TP * DP = 4 * 1 = 4). Every comparable config includes tp:8 entries — the non-MTP FP8 config has tp:8, ep:1 entries, the BF16 MTP config uses tp:8, ep:1, and the DSR1 MTP config uses tp:8, ep:1. Consider adding tp:8, ep:1 search-space entries so the benchmark can explore full-node utilization.

Extended reasoning...

What the bug is

The newly added qwen3.5-fp8-b200-sglang-mtp and qwen3.5-fp4-b200-sglang-mtp configs exclusively use tp: 4, ep: 1 across all three sequence-length configurations (1k1k, 1k8k, 8k1k). The benchmark scripts hardcode --data-parallel-size=1, so the total GPU count is TP * DP = 4 * 1 = 4. This leaves 4 of the 8 B200 GPUs on the node completely idle.

Concrete comparison with other configs

Walking through the comparable configs in this file:

  1. Non-MTP FP8 (qwen3.5-fp8-b200-sglang, lines 1807-1830): Uses both tp:8, ep:1 (8 GPUs) and tp:4, ep:4 (4 GPUs) — the search space explores both parallelism strategies.
  2. BF16 MTP (qwen3.5-bf16-b200-sglang-mtp, lines 1832-1852): Uses tp:8, ep:1 for all entries — all 8 GPUs.
  3. DSR1 FP8 MTP (dsr1-fp8-b200-sglang-mtp, lines 1942-1962): Uses tp:8, ep:1 for all entries — all 8 GPUs.
  4. FP8 MTP (qwen3.5-fp8-b200-sglang-mtp, lines 1854-1874): Uses only tp:4, ep:1 — 4 GPUs.
  5. FP4 MTP (qwen3.5-fp4-b200-sglang-mtp, lines 1876-1896): Uses only tp:4, ep:1 — 4 GPUs.

The FP8/FP4 MTP configs are the only ones in this family without any tp:8 search-space entries.

Addressing the intentionality argument

One could argue that TP=4 is intentional for quantized models because they fit on fewer GPUs and lower TP reduces communication overhead. However, this argument is weakened by several facts: (a) the non-MTP FP8 config already includes tp:8 entries alongside tp:4, so the project clearly considers TP=8 worth benchmarking for this quantized model; (b) the BF16 MTP config uses TP=8 — if BF16 (which is larger than FP8/FP4) works with MTP at TP=8, then FP8/FP4 with MTP certainly should too; (c) the DSR1 MTP config uses TP=8; (d) the purpose of the search space is to explore configurations, so omitting TP=8 entirely means the benchmark never even tests whether full-node utilization would be faster.

Note: EP (expert parallelism) distributes MoE experts within the TP group — it does not add additional GPUs beyond TP. So tp:4, ep:4 still uses 4 GPUs, not 8. The key comparison is that the non-MTP FP8 config has tp:8 entries (using all 8 GPUs) while the MTP variants do not.
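The GPU-count arithmetic above can be checked directly (a shell sketch; the values are taken from the configs discussed in this comment):

```shell
# Total GPUs used by a config = TP * DP; EP partitions MoE experts *within*
# the TP group, so it does not add GPUs beyond TP.
tp=4; dp=1
echo "$(( tp * dp )) of 8 B200 GPUs"   # the FP8/FP4 MTP configs: half the node
tp=8
echo "$(( tp * dp )) of 8 B200 GPUs"   # the tp:8 entries: the full node
```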

Impact

Running these benchmarks wastes half the B200 hardware on the node. The resulting throughput numbers will not reflect what the model can achieve when utilizing the full 8-GPU node, making the MTP benchmark results non-comparable with the non-MTP FP8 results (which do use 8 GPUs at TP=8).

Suggested fix

Add tp:8, ep:1 search-space entries to both qwen3.5-fp8-b200-sglang-mtp and qwen3.5-fp4-b200-sglang-mtp, matching the pattern used by the BF16 MTP config. The existing tp:4, ep:1 entries could be kept as additional search points (similar to how the non-MTP FP8 config has both), or changed to tp:4, ep:4 to at least distribute experts optimally when using 4 GPUs.
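A sketch of the suggested search-space change for qwen3.5-fp8-b200-sglang-mtp (the same pattern would apply to the fp4 config); keeping the half-node entry, and switching it to ep:4, is one of the options the comment mentions, not the merged code:

```yaml
search-space:
  # Full-node entry, matching the BF16 MTP config's pattern
  - { tp: 8, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
  # Optional half-node search point with experts distributed across the TP group
  - { tp: 4, ep: 4, conc-start: 4, conc-end: 512, spec-decoding: mtp }
```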

- isl: 8192
osl: 1024
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }

qwen3.5-fp4-b200-sglang-mtp:
image: lmsysorg/sglang:v0.5.9-cu130
model: nvidia/Qwen3.5-397B-A17B-NVFP4
model-prefix: qwen3.5
runner: b200
precision: fp4
framework: sglang
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
- isl: 1024
osl: 8192
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
- isl: 8192
osl: 1024
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }

kimik2.5-int4-b200-vllm:
image: vllm/vllm-openai:v0.15.1
model: moonshotai/Kimi-K2.5
90 changes: 90 additions & 0 deletions benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh
@@ -0,0 +1,90 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME \
EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC >= 16, relax to ~30 requests between scheduler recv polls.
if [[ $CONC -ge 16 ]]; then
SCHEDULER_RECV_INTERVAL=30
else
SCHEDULER_RECV_INTERVAL=10
fi

MEM_FRAC_STATIC=0.8
CHUNKED_PREFILL_SIZE=32768
MAX_PREFILL_TOKENS=32768
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC
MAX_RUNNING_REQUESTS=128
CONTEXT_LENGTH=$((ISL + OSL + 20))

# MTP (Multi-Token Prediction) Config - EAGLE speculative decoding
Comment on lines +44 to +45
🔴 MAX_RUNNING_REQUESTS=128 is hardcoded in all three MTP scripts, but the YAML configs set conc-end: 512, so at concurrency above 128 the server queues the excess requests instead of running them concurrently; the benchmarks will measure queuing behavior, not true high-concurrency performance. Change MAX_RUNNING_REQUESTS to 512 (matching dsr1_fp8_b200_mtp.sh, which already does this correctly), or set it to $CONC to stay consistent with CUDA_GRAPH_MAX_BATCH_SIZE=$CONC.

Extended reasoning...

What the bug is: All three new Qwen3.5 MTP benchmark scripts (qwen3.5_bf16_b200_mtp.sh, qwen3.5_fp4_b200_mtp.sh, qwen3.5_fp8_b200_mtp.sh) hardcode MAX_RUNNING_REQUESTS=128 on line 42, while their corresponding YAML configs in nvidia-master.yaml specify conc-end: 512. The --max-running-requests flag limits how many requests the SGLang server will process simultaneously.

How it manifests: When the benchmark client sends requests at concurrency levels above 128 (e.g., 256 or 512), the server will only run 128 concurrently and queue the rest. This means the benchmark is measuring server queuing + processing behavior rather than actual 512-concurrent-request throughput, which defeats the purpose of testing at higher concurrency levels.

Step-by-step proof: The YAML config sets conc-end: 512, so the benchmark harness will sweep concurrency from 4 up to 512. At CONC=512, the script sets CUDA_GRAPH_MAX_BATCH_SIZE=512 (line 41) and passes --max-concurrency 512 to the benchmark client (line 81). However, the server is started with --max-running-requests 128 (line 56). So out of 512 concurrent requests, only 128 run simultaneously -- the remaining 384 sit in the server queue. The CUDA graphs compiled for batch sizes 129-512 are never used, wasting GPU memory.

Why existing code does not prevent it: The non-MTP Qwen3.5 scripts also use MAX_RUNNING_REQUESTS=128, but their YAML configs only go up to conc-end: 64, so there is no mismatch. The MTP scripts were likely copy-pasted from the non-MTP versions without updating this value to match the expanded concurrency range.

Comparison with correct implementation: The existing dsr1_fp8_b200_mtp.sh script (line 43) correctly sets MAX_RUNNING_REQUESTS=512 to match its conc-end: 512 config, confirming this is an oversight in the new scripts.

Fix: Change MAX_RUNNING_REQUESTS=128 to MAX_RUNNING_REQUESTS=512 (or MAX_RUNNING_REQUESTS=$CONC) in all three files: qwen3.5_bf16_b200_mtp.sh:42, qwen3.5_fp4_b200_mtp.sh:42, and qwen3.5_fp8_b200_mtp.sh:42.
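A minimal sketch of the $CONC variant of the fix (variable names match the scripts; deriving the cap from the sweep value is the reviewer's suggestion, not the code as merged):

```shell
# Derive the server's concurrency cap from the benchmark's sweep value instead
# of hardcoding 128, keeping it consistent with the CUDA graph batch size.
CONC=${CONC:-512}
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC
MAX_RUNNING_REQUESTS=$CONC
echo "max-running-requests=$MAX_RUNNING_REQUESTS cuda-graph-max-bs=$CUDA_GRAPH_MAX_BATCH_SIZE"
```

With this change, every concurrency level the harness sweeps can actually run concurrently on the server, and no CUDA graphs are compiled for batch sizes that never occur.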

SPECULATIVE_NUM_STEPS=3
SPECULATIVE_DRAFT_TOKENS=4
SPECULATIVE_EAGLE_TOPK=1

echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
--trust-remote-code \
--tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
--cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
--mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
--context-length $CONTEXT_LENGTH --disable-radix-cache \
--attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
--enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
--tokenizer-worker-num 6 --stream-interval 30 \
--speculative-algorithm EAGLE --speculative-num-steps $SPECULATIVE_NUM_STEPS --speculative-eagle-topk $SPECULATIVE_EAGLE_TOPK --speculative-num-draft-tokens $SPECULATIVE_DRAFT_TOKENS \
> $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/ \
--use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
append_lm_eval_summary
fi
set +x
90 changes: 90 additions & 0 deletions benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh
@@ -0,0 +1,90 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME \
EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC >= 16, relax to ~30 requests between scheduler recv polls.
if [[ $CONC -ge 16 ]]; then
SCHEDULER_RECV_INTERVAL=30
else
SCHEDULER_RECV_INTERVAL=10
fi

MEM_FRAC_STATIC=0.8
CHUNKED_PREFILL_SIZE=32768
MAX_PREFILL_TOKENS=32768
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC
MAX_RUNNING_REQUESTS=128
CONTEXT_LENGTH=$((ISL + OSL + 20))

# MTP (Multi-Token Prediction) Config - EAGLE speculative decoding
SPECULATIVE_NUM_STEPS=3
SPECULATIVE_DRAFT_TOKENS=4
SPECULATIVE_EAGLE_TOPK=1

echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
--trust-remote-code \
--tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
--cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
--mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
--context-length $CONTEXT_LENGTH --disable-radix-cache \
--attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm --fp4-gemm-backend flashinfer_cutlass --kv-cache-dtype fp8_e4m3 \
--enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
--tokenizer-worker-num 6 --stream-interval 30 \
--speculative-algorithm EAGLE --speculative-num-steps $SPECULATIVE_NUM_STEPS --speculative-eagle-topk $SPECULATIVE_EAGLE_TOPK --speculative-num-draft-tokens $SPECULATIVE_DRAFT_TOKENS \
> $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/ \
--use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
append_lm_eval_summary
fi
set +x
91 changes: 91 additions & 0 deletions benchmarks/single_node/qwen3.5_fp8_b200_mtp.sh
@@ -0,0 +1,91 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME \
EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC >= 16, relax to ~30 requests between scheduler recv polls.
if [[ $CONC -ge 16 ]]; then
SCHEDULER_RECV_INTERVAL=30
else
SCHEDULER_RECV_INTERVAL=10
fi

MEM_FRAC_STATIC=0.8
CHUNKED_PREFILL_SIZE=32768
MAX_PREFILL_TOKENS=32768
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC
MAX_RUNNING_REQUESTS=128
CONTEXT_LENGTH=$((ISL + OSL + 20))

# MTP (Multi-Token Prediction) Config - EAGLE speculative decoding
SPECULATIVE_NUM_STEPS=3
SPECULATIVE_DRAFT_TOKENS=4
SPECULATIVE_EAGLE_TOPK=1

echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
--trust-remote-code \
--tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
--quantization fp8 --kv-cache-dtype fp8_e4m3 --mamba-ssm-dtype bfloat16 \
--cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
--mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
--context-length $CONTEXT_LENGTH --disable-radix-cache \
--attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
--enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
--tokenizer-worker-num 6 --stream-interval 30 \
--speculative-algorithm EAGLE --speculative-num-steps $SPECULATIVE_NUM_STEPS --speculative-eagle-topk $SPECULATIVE_EAGLE_TOPK --speculative-num-draft-tokens $SPECULATIVE_DRAFT_TOKENS \
> $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/ \
--use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
append_lm_eval_summary
fi
set +x
12 changes: 11 additions & 1 deletion perf-changelog.yaml
@@ -931,4 +931,14 @@
- "Switch to --attention-backend ROCM_AITER_UNIFIED_ATTN and add fuse_rope_kvcache compilation pass"
- "Remove deprecated VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION/VLLM_ROCM_USE_AITER_MHA env vars and compilation-config cudagraph_mode"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/867


- config-keys:
- qwen3.5-bf16-b200-sglang-mtp
- qwen3.5-fp8-b200-sglang-mtp
- qwen3.5-fp4-b200-sglang-mtp
description:
- "Add Single Node Agg MTP configs for Qwen3.5 B200 SGLang (bf16, fp8, fp4)"
- "EAGLE speculative decoding: num-steps 3, draft-tokens 4, topk 1"
- "New scripts: benchmarks/single_node/qwen3.5_bf16_b200_mtp.sh, qwen3.5_fp8_b200_mtp.sh, qwen3.5_fp4_b200_mtp.sh"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/897

🟡 The new perf-changelog entry has pr-link set to /pull/XXX (placeholder) instead of /pull/897. Please update to the actual PR number before merging.

Extended reasoning...

What the bug is

The new perf-changelog.yaml entry added by this PR at line 944 sets pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, which is a placeholder URL. Since this is PR #897, the link should be https://github.com/SemiAnalysisAI/InferenceX/pull/897.

How it manifests

Any tooling or documentation that consumes perf-changelog.yaml to generate changelogs, dashboards, or PR cross-references will produce a broken link pointing to a non-existent /pull/XXX page. Anyone clicking the link will get a 404.

Step-by-step proof

  1. Look at the diff for perf-changelog.yaml — the new entry starting at line 935 adds three config keys: qwen3.5-bf16-b200-sglang-mtp, qwen3.5-fp8-b200-sglang-mtp, qwen3.5-fp4-b200-sglang-mtp.
  2. Line 944 reads: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
  3. The PR metadata confirms this is PR [WIP] B200 MTP configs for Qwen3.5 Bf16, FP4, FP8 SGLang #897 (title: "[WIP] B200 MTP configs for Qwen3.5 Bf16, FP4, FP8 SGLang").
  4. Therefore XXX should be replaced with 897.

Note on pre-existing instances

There are a handful of other entries in perf-changelog.yaml that also have /pull/XXX (e.g., the GLM-5, MiniMax H200, Qwen3.5 MI325X/MI300x FP8 entries). Those are pre-existing placeholders from other PRs and are not introduced by this PR. Only the new entry added here needs to be fixed.

Fix

Change line 944 from:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

to:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/897
