
Mtp cudagraph #7761

Open
iosmers wants to merge 3 commits into PaddlePaddle:develop from iosmers:mtp_cudagraph

Conversation

@iosmers
Collaborator

@iosmers iosmers commented May 9, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings May 9, 2026 08:08
@paddle-bot

paddle-bot Bot commented May 9, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

This PR adapts MTP (speculative decoding) to work with CUDA Graph on the XPU path: it sets and passes `step_use_cudagraph` between the XPU model runner and the MTP proposer, and adjusts the conditions under which cudagraph is used during capture warmup and execution, so that graph capture/replay is enabled for suitable batch shapes.

Changes:

  • Simplify the logic that sets `forward_meta.step_use_cudagraph` in the XPU runner, and add a `max_capture_size`-based restriction at execution time.
  • Add warmup logic for the MTP/SUFFIX speculative scenarios to the XPU runner's `capture_model()`, and pass `step_use_cudagraph`/`is_dummy_run` through when calling the proposer.
  • Fill in the `step_use_cudagraph` forward_meta initialization, padding, and output-trimming logic in the MTP proposer (XPU path); a hedged sketch of this plumbing follows this list.
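
To make the data flow above concrete, here is a minimal runnable sketch of the runner-to-proposer plumbing. `ForwardMetaSketch` and `proposer_run` are illustrative stand-ins, not the actual FastDeploy API; only the pad-then-trim pattern mirrors the PR.

```python
# Hedged sketch of the step_use_cudagraph plumbing: the runner decides once
# per step whether graph replay applies, the proposer pads its inputs to the
# captured static shape and trims the padded tail off the output afterwards.
from dataclasses import dataclass

@dataclass
class ForwardMetaSketch:
    num_real_tokens: int       # token count before cudagraph padding
    step_use_cudagraph: bool   # decision passed down from the model runner

def proposer_run(meta: ForwardMetaSketch, hidden_states: list, capture_size: int) -> list:
    """Illustrative stand-in for MTPProposer.run on the XPU path."""
    if meta.step_use_cudagraph:
        # Pad to the captured graph's static shape
        # (stand-in for padding_cudagraph_inputs() in the PR).
        hidden_states = hidden_states + [0.0] * (capture_size - len(hidden_states))
    model_output = hidden_states  # stand-in for the draft-model forward pass
    # Trim the padding, mirroring model_output[:real_token_num] in the PR.
    return model_output[: meta.num_real_tokens]

meta = ForwardMetaSketch(num_real_tokens=3, step_use_cudagraph=True)
print(proposer_run(meta, [0.1, 0.2, 0.3], capture_size=8))  # [0.1, 0.2, 0.3]
```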

Process-level issues to note (not inline code comments):

  • The PR title does not carry a tag as the repository requires (example: `[Graph Optimization] ...`), and the Motivation/Modifications/Tests sections of the description template are essentially unfilled; please complete them to ease review and regression testing.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| fastdeploy/worker/xpu_model_runner.py | Adjusts the XPU runner's cudagraph enablement checks and capture warmup, and passes cudagraph-related state through to the MTP proposer |
| fastdeploy/spec_decode/mtp.py | Adds cudagraph metadata initialization, padding, and output trimming for the MTP proposer on the XPU path, and adjusts the xpu_pre_process call arguments |

Comment on lines 1141 to +1144

```diff
 if self.use_cudagraph:
     # Update Batch type for cuda graph for only_decode_batch
     if_only_decode = self.only_decode()

     only_decode_use_cudagraph = self.use_cudagraph and if_only_decode
     # Update config about moe for better performance
     # TODO(wanglongzhi): Modifying the config at runtime is not appropriate; it needs to be moved to forward_meta. It will be used in MoEMethodBase.apply()
     if self.fd_config.parallel_config.use_ep and self.fd_config.scheduler_config.splitwise_role == "mixed":
         self.fd_config.model_config.moe_phase.phase = "decode" if if_only_decode else "prefill"
         if self.speculative_decoding:
             self.proposer.fd_config.parallel_config.moe_phase.phase = "decode" if if_only_decode else "prefill"

     # Update Batch type for cuda graph for only_prefill_batch
     only_prefill_use_cudagraph = self.use_cudagraph and self.cudagraph_only_prefill and self.only_prefill()

-    self.forward_meta.step_use_cudagraph = (
-        only_prefill_use_cudagraph
-        if self.cudagraph_only_prefill
-        else only_decode_use_cudagraph and self.forward_meta.ids_remove_padding.shape[0] > 0
-    )
+    self.forward_meta.step_use_cudagraph = self.use_cudagraph and if_only_decode and self.forward_meta.ids_remove_padding.shape[0] > 0
```
Comment on lines +1157 to 1159

```python
):
    if_only_decode = self.only_decode()
    self.fd_config.model_config.moe_phase.phase = "decode" if if_only_decode else "prefill"
```

Comment on lines +1156 to +1157

```python
    self.fd_config.parallel_config.use_ep and self.fd_config.scheduler_config.splitwise_role == "mixed"
):
```
Comment on lines 1492 to +1507

```diff
 try:
-    for batch_size in sorted(capture_sizes, reverse=True):
-        self._dummy_run(
-            num_tokens=self.scheduler_config.max_num_batched_tokens,
-            batch_size=batch_size,
-            expected_decode_len=expected_decode_len,
-            in_capturing=True,
-        )
-        logger.info(f"Warm up the model with the batch size:{batch_size}, num tokens:{expected_decode_len}")
+    if self.speculative_decoding and self.spec_method in [SpecMethod.MTP, SpecMethod.SUFFIX]:
+        for capture_size in sorted(capture_sizes, reverse=True):
+            expected_decode_len = (self.speculative_config.num_speculative_tokens + 1) * 2
+            self._dummy_run(
+                num_tokens=self.fd_config.get_max_chunk_tokens(),
+                batch_size=int(capture_size / (self.speculative_config.num_speculative_tokens + 1)),
+                in_capturing=True,
+                expected_decode_len=expected_decode_len,
+                # accept_all_drafts=True,
+            )
+            logger.info(
+                f"Warm up the model with the num_tokens:{capture_size}, expected_decode_len:{expected_decode_len}"
+            )
+    else:
+        for batch_size in sorted(capture_sizes, reverse=True):
```
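
For concreteness, the new warmup sizing can be traced with assumed values (`num_speculative_tokens=1` and `capture_sizes=[32, 16, 8]` are illustrative, not taken from the PR):

```python
# Worked example of the MTP/SUFFIX warmup arithmetic in capture_model().
# The concrete values below are assumptions for illustration only.
num_speculative_tokens = 1
capture_sizes = [32, 16, 8]

# Both formulas follow the diff above: each request contributes
# (num_speculative_tokens + 1) tokens per step.
expected_decode_len = (num_speculative_tokens + 1) * 2  # = 4
for capture_size in sorted(capture_sizes, reverse=True):
    batch_size = int(capture_size / (num_speculative_tokens + 1))
    print(f"capture_size={capture_size} -> batch_size={batch_size}, "
          f"expected_decode_len={expected_decode_len}")
# capture_size=32 -> batch_size=16, expected_decode_len=4
# capture_size=16 -> batch_size=8,  expected_decode_len=4
# capture_size=8  -> batch_size=4,  expected_decode_len=4
```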
Comment on lines +1496 to +1504

```python
self._dummy_run(
    num_tokens=self.fd_config.get_max_chunk_tokens(),
    batch_size=int(capture_size / (self.speculative_config.num_speculative_tokens + 1)),
    in_capturing=True,
    expected_decode_len=expected_decode_len,
    # accept_all_drafts=True,
)
logger.info(
    f"Warm up the model with the num_tokens:{capture_size}, expected_decode_len:{expected_decode_len}"
```

```python
self.proposer.run(
    full_hidden_states=model_output,
    step_use_cudagraph=self.forward_meta.step_use_cudagraph,
    # tep_use_cudagraph=False,
```
Comment on lines 698 to +723

```diff
@@ -719,6 +719,14 @@ def _initialize_forward_meta_xpu(self):
     # Initialize attention meta data
     for attn_backend in self.attn_backends:
         attn_backend.init_attention_metadata(self.forward_meta)

     # Notes(liuzichang):
```

Comment on lines +1098 to +1099

```python
# num_speculative_tokens=self.speculative_config.num_speculative_tokens,
num_speculative_tokens=0,
```

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-09 16:18:16

📋 Review summary

PR overview: adds CUDAGraph support for MTP speculative decoding on the XPU platform
Changed files: fastdeploy/spec_decode/mtp.py, fastdeploy/worker/xpu_model_runner.py
Suggested tags: [Speculative Decoding] [XPU]

📝 PR compliance check

The title lacks an official tag, all sections of the PR description (Motivation/Modifications/Usage or Command/Accuracy Tests) are empty, and no Checklist item is checked; please complete them per the template.

Title suggestion (copy-ready):

  • [Speculative Decoding][XPU] Support CUDAGraph for MTP on XPU

PR description suggestion (copy-ready; it must reproduce the full structure of the checklist §D2 template):

## Motivation
Add CUDAGraph support for MTP (Multi-Token Prediction) speculative decoding on the XPU platform, improving inference performance on XPU by reducing Python runtime overhead.

## Modifications
- `fastdeploy/spec_decode/mtp.py`: add `step_use_cudagraph`, `is_dummy_run`, and `substep` parameters to `_initialize_forward_meta_xpu`, aligning it with the CUDA-path logic; add a `padding_cudagraph_inputs()` call and cudagraph output truncation (`model_output[:real_token_num]`) in `_propose_xpu`
- `fastdeploy/worker/xpu_model_runner.py`: simplify the `step_use_cudagraph` assignment in `_prepare_inputs`; add an MTP/SUFFIX-specific warmup path in `capture_model` (batch_size computed as `capture_size / (num_speculative_tokens+1)`); fix the cudagraph padding order in `execute_model` (moved after `_prepare_inputs`); pass `step_use_cudagraph` and `is_dummy_run` to `proposer.run()`

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | fastdeploy/spec_decode/mtp.py:1099 | num_speculative_tokens is hardcoded to 0 and the original dynamic value is commented out, which may break speculative-token counting |
| ❓ Question | fastdeploy/worker/xpu_model_runner.py:1144 | The cudagraph_only_prefill path is silently removed; deployments with cudagraph_only_prefill=True will stop using cudagraph |
| ❓ Question | fastdeploy/worker/xpu_model_runner.py:1154 | The FD_XPU_ENABLE_MIXED_EP_MODE environment-variable guard is removed, changing behavior in MoE EP mixed mode |
| ❓ Question | fastdeploy/worker/xpu_model_runner.py:1697 | A leftover debug comment contains a typo (tep_use_cudagraph should be step_use_cudagraph) and should be cleaned up before merging |

Overall assessment

This PR adds CUDAGraph support for XPU MTP speculative decoding, with a core approach aligned with the CUDA path. However, the hardcoded num_speculative_tokens=0, the silently removed cudagraph_only_prefill path, and the leftover debug code should be confirmed and cleaned up by the author before merging.

```diff
     self.model_inputs["seq_lens_decoder"],
-    num_speculative_tokens=self.speculative_config.num_speculative_tokens,
+    # num_speculative_tokens=self.speculative_config.num_speculative_tokens,
+    num_speculative_tokens=0,
```

🟡 Suggestion: num_speculative_tokens is hardcoded to 0, with the original dynamic value commented out.

Once update_model_inputs receives num_speculative_tokens=0, the attention-mask construction and token-counting logic that depend on this field may produce wrong results, effectively disabling speculative-token awareness at inference time as well. If this is a deliberate requirement of XPU speculative decoding (e.g., a different attention-mask generation scheme), please add a comment explaining why; if it is debug leftover, please restore self.speculative_config.num_speculative_tokens. The toy example below illustrates what the hardcoded zero changes.
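
A toy illustration (not FastDeploy's actual accounting; the function and values are assumptions) of why the zero matters: per-step token budgets collapse to the non-speculative case, so draft tokens are no longer accounted for.

```python
# Hypothetical per-step token budget: each request verifies its draft tokens
# plus one target token. This mirrors the (num_speculative_tokens + 1) factor
# in the PR's own warmup arithmetic, not FastDeploy internals.
def tokens_per_step(batch_size: int, num_speculative_tokens: int) -> int:
    return batch_size * (num_speculative_tokens + 1)

print(tokens_per_step(batch_size=8, num_speculative_tokens=1))  # 16 with MTP drafts
print(tokens_per_step(batch_size=8, num_speculative_tokens=0))  # 8, as if MTP were off
```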

```diff
-    if self.cudagraph_only_prefill
-    else only_decode_use_cudagraph and self.forward_meta.ids_remove_padding.shape[0] > 0
-)
+self.forward_meta.step_use_cudagraph = self.use_cudagraph and if_only_decode and self.forward_meta.ids_remove_padding.shape[0] > 0
```

❓ Question: the cudagraph_only_prefill path is silently removed.

The original code supported:

```python
self.forward_meta.step_use_cudagraph = (
    only_prefill_use_cudagraph
    if self.cudagraph_only_prefill
    else only_decode_use_cudagraph and ...
)
```

The new code keeps only the decode-only path, so a cudagraph_only_prefill=True configuration can no longer trigger cudagraph. Please confirm whether the XPU platform no longer uses this configuration; if so, update the documentation or configuration comments accordingly. A sketch of a condition that keeps both behaviors follows this comment.
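
If the prefill-only path is still needed, the two behaviors can coexist behind one predicate. A minimal standalone sketch (a plain function with assumed parameters, not the author's code):

```python
# Hedged sketch of a step_use_cudagraph condition that keeps both the old
# prefill-only path and the new decode-only path; all parameters are
# illustrative stand-ins for the runner's state.
def step_use_cudagraph(use_cudagraph: bool, cudagraph_only_prefill: bool,
                       only_prefill: bool, only_decode: bool,
                       num_tokens: int) -> bool:
    if cudagraph_only_prefill:
        # Original prefill-only behavior from the snippet above.
        return use_cudagraph and only_prefill
    # Decode-only behavior introduced by this PR.
    return use_cudagraph and only_decode and num_tokens > 0

assert step_use_cudagraph(True, False, False, True, 128) is True   # decode path
assert step_use_cudagraph(True, True, True, False, 128) is True    # prefill path kept
assert step_use_cudagraph(True, False, False, True, 0) is False    # empty batch
```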

```diff
@@ -1169,8 +1153,8 @@ def _prepare_inputs(self, is_dummy_run=False) -> None:
     self.forward_meta.kv_signal_sender = self.share_inputs["kv_signal_sender"]
```


❓ Question: the envs.FD_XPU_ENABLE_MIXED_EP_MODE environment-variable guard is removed.

Old condition: splitwise_role == "mixed" and envs.FD_XPU_ENABLE_MIXED_EP_MODE
New condition: use_ep and splitwise_role == "mixed"

This means that with use_ep=True and splitwise_role=mixed, the MoE phase-switching logic now takes effect even when FD_XPU_ENABLE_MIXED_EP_MODE=False, which may change the behavior of existing deployments. Is this change intentional? Please explain in the PR description. A sketch of the guarded condition follows this comment.
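
For reference, the old gating can be expressed as a standalone predicate; a minimal sketch with assumed parameters, not FastDeploy code:

```python
# Hedged sketch of restoring the FD_XPU_ENABLE_MIXED_EP_MODE guard described
# above; parameters stand in for the fd_config fields and the env flag.
def should_switch_moe_phase(use_ep: bool, splitwise_role: str,
                            mixed_ep_mode_enabled: bool) -> bool:
    # Old behavior: the env flag gates the MoE phase switch in mixed mode.
    return use_ep and splitwise_role == "mixed" and mixed_ep_mode_enabled

assert should_switch_moe_phase(True, "mixed", False) is False  # guard blocks the switch
assert should_switch_moe_phase(True, "mixed", True) is True
```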

```diff
 if self.spec_method == SpecMethod.MTP:
-    self.proposer.run(full_hidden_states=model_output)
+    self.proposer.run(
+        full_hidden_states=model_output,
```

❓ Question: a leftover debug comment contains a typo; please clean it up before merging.

In `# tep_use_cudagraph=False,`, `tep_use_cudagraph` should be `step_use_cudagraph` (the leading s is missing). Likewise, `# accept_all_drafts=True,` in capture_model is also a leftover debug comment; please remove it as well if it is not needed.

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-09 16:24:22

The CI report is generated from the following code (updated every 30 minutes):


1 Task overview

✅ All required tasks passed; merging is recommended.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 5 (0) | 5 | 4 | 1 | 0 | 0 | 0 |

2 Task status summary

2.1 Required tasks: passed

Required tasks block merging; failures must be addressed first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ✅ | The other 4 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 0/1 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Trigger Jenkins for PR | 55s | Job | - |

3 Failure details (required only)

No required tasks failed.

