Mtp cudagraph #7761
Conversation
Thanks for your contribution!
Pull request overview
This PR adapts MTP (speculative decoding) to work with CUDA Graph on the XPU path: it sets and passes `step_use_cudagraph` between the XPU model runner and the MTP proposer, and adjusts the cudagraph conditions for capture warmup and the execution phase, with the aim of enabling graph capture/replay for suitable batch shapes.
Changes:
- Simplify the logic that sets `forward_meta.step_use_cudagraph` in the XPU runner, and add a `max_capture_size`-based limit in the execution phase.
- Add warmup logic for MTP/SUFFIX speculative scenarios to the XPU runner's `capture_model()`, and pass `step_use_cudagraph`/`is_dummy_run` through when calling the proposer.
- Complete the MTP proposer (XPU path): `step_use_cudagraph` initialization in forward_meta, plus padding and output-truncation logic.

Process issues to address (not inline code comments):
- The PR title does not carry a tag as the repo requires (example: `[Graph Optimization] ...`), and the Motivation/Modifications/Tests sections of the description template are essentially unfilled; please complete them to ease review and regression testing.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| fastdeploy/worker/xpu_model_runner.py | Adjusts the XPU runner's cudagraph enablement checks and capture warmup, and passes cudagraph-related state through to the MTP proposer |
| fastdeploy/spec_decode/mtp.py | Adds cudagraph metadata initialization, padding, and output truncation for the MTP proposer on the XPU path, and adjusts the `xpu_pre_process` call arguments |
fastdeploy/worker/xpu_model_runner.py (`_prepare_inputs`):

```diff
-        if self.use_cudagraph:
-            # Update Batch type for cuda graph for only_decode_batch
-            if_only_decode = self.only_decode()
-
-            only_decode_use_cudagraph = self.use_cudagraph and if_only_decode
-            # Update config about moe for better performance
-            # TODO(wanglongzhi):Modifying the config at runtime is not appropriate; it needs to be moved to forward_meta. It will be used in MoEMethodBase.apply()
-            if self.fd_config.parallel_config.use_ep and self.fd_config.scheduler_config.splitwise_role == "mixed":
-                self.fd_config.model_config.moe_phase.phase = "decode" if if_only_decode else "prefill"
-                if self.speculative_decoding:
-                    self.proposer.fd_config.parallel_config.moe_phase.phase = "decode" if if_only_decode else "prefill"
-
-            # Update Batch type for cuda graph for only_prefill_batch
-            only_prefill_use_cudagraph = self.use_cudagraph and self.cudagraph_only_prefill and self.only_prefill()
-
-            self.forward_meta.step_use_cudagraph = (
-                only_prefill_use_cudagraph
-                if self.cudagraph_only_prefill
-                else only_decode_use_cudagraph and self.forward_meta.ids_remove_padding.shape[0] > 0
-            )
+        if_only_decode = self.only_decode()
+        if (
+            self.fd_config.parallel_config.use_ep and self.fd_config.scheduler_config.splitwise_role == "mixed"
+        ):
+            self.fd_config.model_config.moe_phase.phase = "decode" if if_only_decode else "prefill"
+        self.forward_meta.step_use_cudagraph = self.use_cudagraph and if_only_decode and self.forward_meta.ids_remove_padding.shape[0] > 0
```
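Read on its own, the replacement line gates graph replay on three conditions. A minimal standalone sketch of that rule (a hypothetical helper for illustration, not the runner's actual code):

```python
# Hypothetical helper mirroring the simplified gating rule: replay a captured
# graph only when cudagraph is enabled, the batch is decode-only, and at
# least one token remains after padding removal.
def step_uses_cudagraph(use_cudagraph: bool, if_only_decode: bool, num_tokens: int) -> bool:
    return use_cudagraph and if_only_decode and num_tokens > 0

assert step_uses_cudagraph(True, True, 128)        # decode-only batch: replay graph
assert not step_uses_cudagraph(True, True, 0)      # empty batch: run eagerly
assert not step_uses_cudagraph(True, False, 128)   # mixed prefill/decode batch: run eagerly
assert not step_uses_cudagraph(False, True, 128)   # cudagraph disabled: run eagerly
```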
fastdeploy/worker/xpu_model_runner.py (`capture_model()`):

```diff
         try:
-            for batch_size in sorted(capture_sizes, reverse=True):
-                self._dummy_run(
-                    num_tokens=self.scheduler_config.max_num_batched_tokens,
-                    batch_size=batch_size,
-                    expected_decode_len=expected_decode_len,
-                    in_capturing=True,
-                )
-                logger.info(f"Warm up the model with the batch size:{batch_size}, num tokens:{expected_decode_len}")
+            if self.speculative_decoding and self.spec_method in [SpecMethod.MTP, SpecMethod.SUFFIX]:
+                for capture_size in sorted(capture_sizes, reverse=True):
+                    expected_decode_len = (self.speculative_config.num_speculative_tokens + 1) * 2
+                    self._dummy_run(
+                        num_tokens=self.fd_config.get_max_chunk_tokens(),
+                        batch_size=int(capture_size / (self.speculative_config.num_speculative_tokens + 1)),
+                        in_capturing=True,
+                        expected_decode_len=expected_decode_len,
+                        # accept_all_drafts=True,
+                    )
+                    logger.info(
+                        f"Warm up the model with the num_tokens:{capture_size}, expected_decode_len:{expected_decode_len}"
+                    )
+            else:
+                for batch_size in sorted(capture_sizes, reverse=True):
```
fastdeploy/worker/xpu_model_runner.py (call into the proposer):

```python
                self.proposer.run(
                    full_hidden_states=model_output,
                    step_use_cudagraph=self.forward_meta.step_use_cudagraph,
                    # tep_use_cudagraph=False,
```
| @@ -719,6 +719,14 @@ def _initialize_forward_meta_xpu(self): | |||
| # Initialize attention meta data | |||
| for attn_backend in self.attn_backends: | |||
| attn_backend.init_attention_metadata(self.forward_meta) | |||
|
|
|||
| # Notes(liuzichang): | |||
| # num_speculative_tokens=self.speculative_config.num_speculative_tokens, | ||
| num_speculative_tokens=0, |
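The reviews below also describe a pad-then-truncate flow in `_propose_xpu`: `padding_cudagraph_inputs()` pads the batch before graph replay, and the output is sliced back with `model_output[:real_token_num]`. A rough sketch of that general pattern, where the helper name, shapes, and call signature are assumptions rather than FastDeploy's actual API:

```python
import paddle

# Sketch only: run_with_cudagraph_padding and its arguments are assumed names.
# A captured graph replays at a fixed input size, so real inputs are padded up
# to the captured size and the extra outputs are discarded afterwards.
def run_with_cudagraph_padding(model, ids_remove_padding, padded_size):
    real_token_num = ids_remove_padding.shape[0]
    pad_len = padded_size - real_token_num
    if pad_len > 0:
        pad = paddle.zeros([pad_len], dtype=ids_remove_padding.dtype)
        ids_remove_padding = paddle.concat([ids_remove_padding, pad])
    model_output = model(ids_remove_padding)  # replayed at the captured size
    return model_output[:real_token_num]      # drop outputs for the pad tokens
```

Padding keeps every replay at exactly the token count the graph was captured with, which is what lets one captured graph serve smaller real batches.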
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-09 16:18:16
📋 Review summary
PR overview: adds CUDAGraph support for MTP speculative decoding on the XPU platform
Scope of changes: fastdeploy/spec_decode/mtp.py, fastdeploy/worker/xpu_model_runner.py
Impact tags: [Speculative Decoding] [XPU]
📝 PR compliance check
The title lacks an official tag, the Motivation/Modifications/Usage or Command/Accuracy Tests sections of the PR description are all empty, and no Checklist item is checked; please complete them per the template.
Suggested title (ready to copy):
[Speculative Decoding][XPU] Support CUDAGraph for MTP on XPU
Suggested PR description (ready to copy; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Add CUDAGraph support for MTP (Multi-Token Prediction) speculative decoding on the XPU platform, improving XPU inference performance by reducing Python runtime overhead.
## Modifications
- `fastdeploy/spec_decode/mtp.py`: add `step_use_cudagraph`, `is_dummy_run`, and `substep` parameters to `_initialize_forward_meta_xpu`, aligning with the CUDA-path logic; add a `padding_cudagraph_inputs()` call and cudagraph output truncation (`model_output[:real_token_num]`) in `_propose_xpu`
- `fastdeploy/worker/xpu_model_runner.py`: simplify the `step_use_cudagraph` assignment in `_prepare_inputs`; add a dedicated MTP/SUFFIX warmup path in `capture_model` (batch size computed as `capture_size / (num_speculative_tokens+1)`); fix the cudagraph padding order in `execute_model` (moved after `_prepare_inputs`); pass `step_use_cudagraph` and `is_dummy_run` to `proposer.run()`
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/spec_decode/mtp.py:1099 | `num_speculative_tokens` is hardcoded to 0 with the original dynamic value commented out, which may break speculative-token counting logic |
| ❓ Question | fastdeploy/worker/xpu_model_runner.py:1144 | The `cudagraph_only_prefill` path was silently removed; deployments with `cudagraph_only_prefill=True` will stop taking effect |
| ❓ Question | fastdeploy/worker/xpu_model_runner.py:1154 | The `FD_XPU_ENABLE_MIXED_EP_MODE` environment-variable guard was removed, changing behavior in MoE EP mixed mode |
| ❓ Question | fastdeploy/worker/xpu_model_runner.py:1697 | A leftover debug comment contains a typo (`tep_use_cudagraph` should be `step_use_cudagraph`) and should be cleaned up before merge |
Overall assessment
The PR adds CUDAGraph support for XPU MTP speculative decoding, with the core approach aligned to the CUDA path. However, it has issues such as the hardcoded `num_speculative_tokens=0`, the silent removal of the `cudagraph_only_prefill` path, and leftover debug code; the author should confirm and clean these up before merging.
fastdeploy/spec_decode/mtp.py:

```diff
                 self.model_inputs["seq_lens_decoder"],
-                num_speculative_tokens=self.speculative_config.num_speculative_tokens,
+                # num_speculative_tokens=self.speculative_config.num_speculative_tokens,
+                num_speculative_tokens=0,
```

🟡 Suggestion: `num_speculative_tokens` is hardcoded to 0 and the original dynamic value is commented out.
Once `update_model_inputs` receives `num_speculative_tokens=0`, the attention-mask construction and token-counting logic that depend on this field may produce wrong results; in effect, awareness of the speculative token count is disabled during inference as well. If this is a deliberate XPU speculative-decoding requirement (for example, a different attention-mask generation scheme), please add a comment explaining why; if it is debug leftover, restore `self.speculative_config.num_speculative_tokens`.
fastdeploy/worker/xpu_model_runner.py (`_prepare_inputs`):

```diff
-                if self.cudagraph_only_prefill
-                else only_decode_use_cudagraph and self.forward_meta.ids_remove_padding.shape[0] > 0
-            )
+        self.forward_meta.step_use_cudagraph = self.use_cudagraph and if_only_decode and self.forward_meta.ids_remove_padding.shape[0] > 0
```

❓ Question: the `cudagraph_only_prefill` path was silently removed.
The original code supported:

```python
self.forward_meta.step_use_cudagraph = (
    only_prefill_use_cudagraph
    if self.cudagraph_only_prefill
    else only_decode_use_cudagraph and ...
)
```

The new code keeps only the decode-only path, so a configuration with `cudagraph_only_prefill=True` can no longer trigger cudagraph. Please confirm whether the XPU platform no longer uses this configuration; if so, update the documentation or configuration comments accordingly.
fastdeploy/worker/xpu_model_runner.py (`_prepare_inputs`):

```diff
@@ -1169,8 +1153,8 @@ def _prepare_inputs(self, is_dummy_run=False) -> None:
         self.forward_meta.kv_signal_sender = self.share_inputs["kv_signal_sender"]
```

❓ Question: the `envs.FD_XPU_ENABLE_MIXED_EP_MODE` environment-variable guard was removed.
Original condition: `splitwise_role == "mixed" and envs.FD_XPU_ENABLE_MIXED_EP_MODE`
New condition: `use_ep and splitwise_role == "mixed"`
This means that with `use_ep=True` and `splitwise_role=mixed`, the MoE phase-switching logic now takes effect even when `FD_XPU_ENABLE_MIXED_EP_MODE=False`, which may change the behavior of existing deployments. Is this change intentional? Please state it in the PR description.
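A tiny sketch of the divergence, with assumed deployment values:

```python
# Assumed values for illustration: EP enabled, mixed splitwise role, env toggle off.
use_ep, splitwise_role, fd_xpu_enable_mixed_ep_mode = True, "mixed", False

old_guard = splitwise_role == "mixed" and fd_xpu_enable_mixed_ep_mode  # False: no phase switch
new_guard = use_ep and splitwise_role == "mixed"                       # True: phase switch now runs

print(old_guard, new_guard)  # False True
```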
fastdeploy/worker/xpu_model_runner.py:

```diff
             if self.spec_method == SpecMethod.MTP:
-                self.proposer.run(full_hidden_states=model_output)
+                self.proposer.run(
+                    full_hidden_states=model_output,
```

❓ Question: a leftover debug comment contains a typo; please clean it up before merging.
In `# tep_use_cudagraph=False,`, `tep_use_cudagraph` should be `step_use_cudagraph` (the leading `s` is missing). Likewise, `# accept_all_drafts=True,` in `capture_model` is also a leftover debug comment; remove it too if it is not needed.
CI report (generated from the code below; refreshed every 30 minutes):
1. Task overview: ✅ All required tasks passed; merge is recommended.
2. Task status summary
   - 2.1 Required tasks: passed
   - 2.2 Optional tasks: 0/1 passed
3. Failure details (required only): no required task failed.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.