Mtp cudagraph #7761
Conversation
Thanks for your contribution!
Pull request overview
This PR adapts MTP (speculative decoding) to work with CUDA Graph on the XPU path: it sets and passes `step_use_cudagraph` between the XPU model runner and the MTP proposer, and adjusts the cudagraph conditions for capture warmup and the execution phase, with the aim of enabling graph capture/replay for suitable batch shapes.
Changes:
- Simplify the logic that sets `forward_meta.step_use_cudagraph` in the XPU runner, and add a `max_capture_size`-based limit in the execution phase.
- Add warmup logic for MTP/SUFFIX speculative scenarios to the XPU runner's `capture_model()`, and pass `step_use_cudagraph`/`is_dummy_run` through when calling the proposer.
- Complete the MTP proposer (XPU path): `step_use_cudagraph` initialization in forward_meta, plus padding and output-truncation logic.

Process issues to address (not inline code comments):
- The PR title does not carry a tag as the repo requires (example: `[Graph Optimization] ...`), and the Motivation/Modifications/Tests sections of the description template are essentially unfilled; please complete them to ease review and regression testing.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| fastdeploy/worker/xpu_model_runner.py | Adjusts the XPU runner's cudagraph enablement checks and capture warmup, and passes cudagraph-related state through to the MTP proposer |
| fastdeploy/spec_decode/mtp.py | Adds cudagraph metadata initialization, padding, and output truncation for the MTP proposer on the XPU path, and adjusts the `xpu_pre_process` call arguments |
fastdeploy/worker/xpu_model_runner.py (`_prepare_inputs`):

```diff
-        if self.use_cudagraph:
-            # Update Batch type for cuda graph for only_decode_batch
-            if_only_decode = self.only_decode()
-
-            only_decode_use_cudagraph = self.use_cudagraph and if_only_decode
-            # Update config about moe for better performance
-            # TODO(wanglongzhi):Modifying the config at runtime is not appropriate; it needs to be moved to forward_meta. It will be used in MoEMethodBase.apply()
-            if self.fd_config.parallel_config.use_ep and self.fd_config.scheduler_config.splitwise_role == "mixed":
-                self.fd_config.model_config.moe_phase.phase = "decode" if if_only_decode else "prefill"
-                if self.speculative_decoding:
-                    self.proposer.fd_config.parallel_config.moe_phase.phase = "decode" if if_only_decode else "prefill"
-
-            # Update Batch type for cuda graph for only_prefill_batch
-            only_prefill_use_cudagraph = self.use_cudagraph and self.cudagraph_only_prefill and self.only_prefill()
-
-            self.forward_meta.step_use_cudagraph = (
-                only_prefill_use_cudagraph
-                if self.cudagraph_only_prefill
-                else only_decode_use_cudagraph and self.forward_meta.ids_remove_padding.shape[0] > 0
-            )
+        if_only_decode = self.only_decode()
+        if (
+            self.fd_config.parallel_config.use_ep and self.fd_config.scheduler_config.splitwise_role == "mixed"
+        ):
+            self.fd_config.model_config.moe_phase.phase = "decode" if if_only_decode else "prefill"
+        self.forward_meta.step_use_cudagraph = self.use_cudagraph and if_only_decode and self.forward_meta.ids_remove_padding.shape[0] > 0
```
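Read on its own, the replacement line gates graph replay on three conditions. A minimal standalone sketch of that rule (a hypothetical helper for illustration, not the runner's actual code):

```python
# Hypothetical helper mirroring the simplified gating rule: replay a captured
# graph only when cudagraph is enabled, the batch is decode-only, and at
# least one token remains after padding removal.
def step_uses_cudagraph(use_cudagraph: bool, if_only_decode: bool, num_tokens: int) -> bool:
    return use_cudagraph and if_only_decode and num_tokens > 0

assert step_uses_cudagraph(True, True, 128)        # decode-only batch: replay graph
assert not step_uses_cudagraph(True, True, 0)      # empty batch: run eagerly
assert not step_uses_cudagraph(True, False, 128)   # mixed prefill/decode batch: run eagerly
assert not step_uses_cudagraph(False, True, 128)   # cudagraph disabled: run eagerly
```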
fastdeploy/worker/xpu_model_runner.py (`capture_model()`):

```diff
         try:
-            for batch_size in sorted(capture_sizes, reverse=True):
-                self._dummy_run(
-                    num_tokens=self.scheduler_config.max_num_batched_tokens,
-                    batch_size=batch_size,
-                    expected_decode_len=expected_decode_len,
-                    in_capturing=True,
-                )
-                logger.info(f"Warm up the model with the batch size:{batch_size}, num tokens:{expected_decode_len}")
+            if self.speculative_decoding and self.spec_method in [SpecMethod.MTP, SpecMethod.SUFFIX]:
+                for capture_size in sorted(capture_sizes, reverse=True):
+                    expected_decode_len = (self.speculative_config.num_speculative_tokens + 1) * 2
+                    self._dummy_run(
+                        num_tokens=self.fd_config.get_max_chunk_tokens(),
+                        batch_size=int(capture_size / (self.speculative_config.num_speculative_tokens + 1)),
+                        in_capturing=True,
+                        expected_decode_len=expected_decode_len,
+                        # accept_all_drafts=True,
+                    )
+                    logger.info(
+                        f"Warm up the model with the num_tokens:{capture_size}, expected_decode_len:{expected_decode_len}"
+                    )
+            else:
+                for batch_size in sorted(capture_sizes, reverse=True):
```
fastdeploy/worker/xpu_model_runner.py (call into the proposer):

```python
                self.proposer.run(
                    full_hidden_states=model_output,
                    step_use_cudagraph=self.forward_meta.step_use_cudagraph,
                    # tep_use_cudagraph=False,
```
| @@ -719,6 +719,14 @@ def _initialize_forward_meta_xpu(self): | |||
| # Initialize attention meta data | |||
| for attn_backend in self.attn_backends: | |||
| attn_backend.init_attention_metadata(self.forward_meta) | |||
|
|
|||
| # Notes(liuzichang): | |||
| # num_speculative_tokens=self.speculative_config.num_speculative_tokens, | ||
| num_speculative_tokens=0, |
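The reviews below also describe a pad-then-truncate flow in `_propose_xpu`: `padding_cudagraph_inputs()` pads the batch before graph replay, and the output is sliced back with `model_output[:real_token_num]`. A rough sketch of that general pattern, where the helper name, shapes, and call signature are assumptions rather than FastDeploy's actual API:

```python
import paddle

# Sketch only: run_with_cudagraph_padding and its arguments are assumed names.
# A captured graph replays at a fixed input size, so real inputs are padded up
# to the captured size and the extra outputs are discarded afterwards.
def run_with_cudagraph_padding(model, ids_remove_padding, padded_size):
    real_token_num = ids_remove_padding.shape[0]
    pad_len = padded_size - real_token_num
    if pad_len > 0:
        pad = paddle.zeros([pad_len], dtype=ids_remove_padding.dtype)
        ids_remove_padding = paddle.concat([ids_remove_padding, pad])
    model_output = model(ids_remove_padding)  # replayed at the captured size
    return model_output[:real_token_num]      # drop outputs for the pad tokens
```

Padding keeps every replay at exactly the token count the graph was captured with, which is what lets one captured graph serve smaller real batches.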
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-09 16:18:16
📋 Review summary
PR overview: adds CUDAGraph support for MTP speculative decoding on the XPU platform
Scope of changes: fastdeploy/spec_decode/mtp.py, fastdeploy/worker/xpu_model_runner.py
Impact tags: [Speculative Decoding] [XPU]
📝 PR compliance check
The title lacks an official tag, the Motivation/Modifications/Usage or Command/Accuracy Tests sections of the PR description are all empty, and no Checklist item is checked; please complete them per the template.
Suggested title (ready to copy):
[Speculative Decoding][XPU] Support CUDAGraph for MTP on XPU
Suggested PR description (ready to copy; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Add CUDAGraph support for MTP (Multi-Token Prediction) speculative decoding on the XPU platform, improving XPU inference performance by reducing Python runtime overhead.
## Modifications
- `fastdeploy/spec_decode/mtp.py`: add `step_use_cudagraph`, `is_dummy_run`, and `substep` parameters to `_initialize_forward_meta_xpu`, aligning with the CUDA-path logic; add a `padding_cudagraph_inputs()` call and cudagraph output truncation (`model_output[:real_token_num]`) in `_propose_xpu`
- `fastdeploy/worker/xpu_model_runner.py`: simplify the `step_use_cudagraph` assignment in `_prepare_inputs`; add a dedicated MTP/SUFFIX warmup path in `capture_model` (batch size computed as `capture_size / (num_speculative_tokens+1)`); fix the cudagraph padding order in `execute_model` (moved after `_prepare_inputs`); pass `step_use_cudagraph` and `is_dummy_run` to `proposer.run()`
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/spec_decode/mtp.py:1099 | `num_speculative_tokens` is hardcoded to 0 with the original dynamic value commented out, which may break speculative-token counting logic |
| ❓ Question | fastdeploy/worker/xpu_model_runner.py:1144 | The `cudagraph_only_prefill` path was silently removed; deployments with `cudagraph_only_prefill=True` will stop taking effect |
| ❓ Question | fastdeploy/worker/xpu_model_runner.py:1154 | The `FD_XPU_ENABLE_MIXED_EP_MODE` environment-variable guard was removed, changing behavior in MoE EP mixed mode |
| ❓ Question | fastdeploy/worker/xpu_model_runner.py:1697 | A leftover debug comment contains a typo (`tep_use_cudagraph` should be `step_use_cudagraph`) and should be cleaned up before merge |
Overall assessment
The PR adds CUDAGraph support for XPU MTP speculative decoding, with the core approach aligned to the CUDA path. However, it has issues such as the hardcoded `num_speculative_tokens=0`, the silent removal of the `cudagraph_only_prefill` path, and leftover debug code; the author should confirm and clean these up before merging.
fastdeploy/spec_decode/mtp.py:

```diff
                 self.model_inputs["seq_lens_decoder"],
-                num_speculative_tokens=self.speculative_config.num_speculative_tokens,
+                # num_speculative_tokens=self.speculative_config.num_speculative_tokens,
+                num_speculative_tokens=0,
```

🟡 Suggestion: `num_speculative_tokens` is hardcoded to 0 and the original dynamic value is commented out.
Once `update_model_inputs` receives `num_speculative_tokens=0`, the attention-mask construction and token-counting logic that depend on this field may produce wrong results; in effect, awareness of the speculative token count is disabled during inference as well. If this is a deliberate XPU speculative-decoding requirement (for example, a different attention-mask generation scheme), please add a comment explaining why; if it is debug leftover, restore `self.speculative_config.num_speculative_tokens`.
fastdeploy/worker/xpu_model_runner.py (`_prepare_inputs`):

```diff
-                if self.cudagraph_only_prefill
-                else only_decode_use_cudagraph and self.forward_meta.ids_remove_padding.shape[0] > 0
-            )
+        self.forward_meta.step_use_cudagraph = self.use_cudagraph and if_only_decode and self.forward_meta.ids_remove_padding.shape[0] > 0
```

❓ Question: the `cudagraph_only_prefill` path was silently removed.
The original code supported:

```python
self.forward_meta.step_use_cudagraph = (
    only_prefill_use_cudagraph
    if self.cudagraph_only_prefill
    else only_decode_use_cudagraph and ...
)
```

The new code keeps only the decode-only path, so a configuration with `cudagraph_only_prefill=True` can no longer trigger cudagraph. Please confirm whether the XPU platform no longer uses this configuration; if so, update the documentation or configuration comments accordingly.
fastdeploy/worker/xpu_model_runner.py (`_prepare_inputs`):

```diff
@@ -1169,8 +1153,8 @@ def _prepare_inputs(self, is_dummy_run=False) -> None:
         self.forward_meta.kv_signal_sender = self.share_inputs["kv_signal_sender"]
```

❓ Question: the `envs.FD_XPU_ENABLE_MIXED_EP_MODE` environment-variable guard was removed.
Original condition: `splitwise_role == "mixed" and envs.FD_XPU_ENABLE_MIXED_EP_MODE`
New condition: `use_ep and splitwise_role == "mixed"`
This means that with `use_ep=True` and `splitwise_role=mixed`, the MoE phase-switching logic now takes effect even when `FD_XPU_ENABLE_MIXED_EP_MODE=False`, which may change the behavior of existing deployments. Is this change intentional? Please state it in the PR description.
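A tiny sketch of the divergence, with assumed deployment values:

```python
# Assumed values for illustration: EP enabled, mixed splitwise role, env toggle off.
use_ep, splitwise_role, fd_xpu_enable_mixed_ep_mode = True, "mixed", False

old_guard = splitwise_role == "mixed" and fd_xpu_enable_mixed_ep_mode  # False: no phase switch
new_guard = use_ep and splitwise_role == "mixed"                       # True: phase switch now runs

print(old_guard, new_guard)  # False True
```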
fastdeploy/worker/xpu_model_runner.py:

```diff
             if self.spec_method == SpecMethod.MTP:
-                self.proposer.run(full_hidden_states=model_output)
+                self.proposer.run(
+                    full_hidden_states=model_output,
```

❓ Question: a leftover debug comment contains a typo; please clean it up before merging.
In `# tep_use_cudagraph=False,`, `tep_use_cudagraph` should be `step_use_cudagraph` (the leading `s` is missing). Likewise, `# accept_all_drafts=True,` in `capture_model` is also a leftover debug comment; remove it too if it is not needed.
CI report (generated from the code below; refreshed every 30 minutes):
1. Task overview: ✅ All required tasks passed; merge is recommended.
2. Task status summary
   - 2.1 Required tasks: passed
   - 2.2 Optional tasks: 0/1 passed
3. Failure details (required only): no required task failed.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.