[Cherry-Pick][Feature] support decode attention for mix(#7688) #7729
lizhenyun01 wants to merge 14 commits into
Conversation
Thanks for your contribution!
CI report generated from the code below (updated every 30 minutes):
1 Task overview
2 Task status summary
2.1 Required tasks: 3/8 passed
2.2 Optional tasks: 16/20 passed
3 Failure details (required only)
Approval — code standards (approval process) (confidence: high)
Root cause details: Key logs: Fix suggestions:
Fix suggestion summary: please assign an RD to review and approve this PR (FastDeploy RD + PaddlePaddle RD). Link: view log
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@           Coverage Diff            @@
##       release/2.6   #7729   +/-   ##
=========================================
  Coverage         ?   72.45%
=========================================
  Files            ?      381
  Lines            ?    54016
  Branches         ?     8445
=========================================
  Hits             ?    39139
  Misses           ?    12084
  Partials         ?     2793
```
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-11 21:30:39
📋 Review Summary
PR overview: adds C16 (fp16/bf16 KV cache) and static C8 (int8 KV cache) decode attention CUDA kernels for the decode stage, enabled via the environment variable USE_DECODE_ATTENTION=1; only SM ≥ 90 (Hopper+) with NVCC ≥ 12.0 is supported.
Scope of changes: custom_ops/gpu_ops/append_attention/, layers/attention/, worker/gpu_model_runner.py, spec_decode/mtp.py
Impact tags: [OP] [Feature]
📝 PR Spec Check
The title follows the [Cherry-Pick][Feature] description(#original-PR-number) convention ✅. However, the ## Modifications, ## Usage or Command, and ## Accuracy Tests sections are all empty (only the comment placeholders remain) and the Checklist items are unchecked; please complete them.
Suggested title (copy-ready):
[Cherry-Pick][Feature] support decode attention for C16/C8 mix(#7688)
Suggested PR description (copy-ready):
## Motivation
Add C16 / static C8 decode attention support. Usage: with flash_attn enabled, export USE_DECODE_ATTENTION=1.
## Modifications
- Add C16 (fp16/bf16 KV cache) and static C8 (int8 KV cache) decode attention CUDA implementations (under `custom_ops/gpu_ops/append_attention/`, including core kernel files such as attention_func.cuh, decode_append_attention_c16_impl.cuh, and decode_append_attention_c8_impl.cuh)
- Add `decode_append_attention.cu` and `decoder_write_cache_with_rope.cu` as operator entry points, compiled only when SM ≥ 90 and NVCC ≥ 12.0
- Add the Python wrapper layer: `layers/attention/ops/decode_append_attention.py`, `decoder_write_cache_with_rope.py`, `config_for_attention.py`
- In the `flash_attn_backend.py` decode stage, switch to the new decode attention path (replacing the original `append_attention`) when `USE_DECODE_ATTENTION=1` (see the sketch after this list)
- Add the `USE_DECODE_ATTENTION` environment variable switch to `fastdeploy/envs.py`
- Update decode attention buffer allocation and initialization in `spec_decode/mtp.py`, `worker/gpu_model_runner.py`, and `worker/metax_model_runner.py` accordingly
- Add tests: `tests/operators/attention/test_decode_append_attention.py`, `test_decode_append_attention_c16.py`
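As a reviewer-side illustration (not code from this PR), the sketch below shows how the environment-variable gate described above might select the decode path. The helper `use_decode_attention` and the `backend.sm_version`, `backend.decode_append_attention`, and `backend.append_attention` attributes are assumptions for illustration, not the actual FastDeploy API.

```python
# Hypothetical sketch of the USE_DECODE_ATTENTION gate; names are illustrative,
# not the actual FastDeploy wrapper API.
import os


def use_decode_attention() -> bool:
    # Mirrors the style of the switch registered in fastdeploy/envs.py.
    return bool(int(os.getenv("USE_DECODE_ATTENTION", "0")))


def run_decode(backend, query, kv_cache, metadata):
    # Assumed attributes: backend.sm_version, backend.decode_append_attention,
    # backend.append_attention.
    if use_decode_attention() and backend.sm_version >= 90:
        # New C16 / static C8 decode attention kernel path.
        return backend.decode_append_attention(query, kv_cache, metadata)
    # Original append_attention decode path.
    return backend.append_attention(query, kv_cache, metadata)
```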
## Usage or Command
```bash
export USE_DECODE_ATTENTION=1
# Takes effect when flash_attn is enabled; requires SM >= 90 (Hopper+) and NVCC >= 12.0
```
## Accuracy Tests
N/A (this PR replaces the original append_attention decode path with the new decode attention kernels; accuracy comparison data, or a rationale for why accuracy verification is unnecessary, should be added)
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 📝 PR spec | — | Modifications / Usage / Accuracy Tests are empty; Checklist items are unchecked |
| 🟡 Suggestion | fastdeploy/model_executor/layers/attention/flash_attn_backend.py:281 | Leftover debug print statement; remove it or use a logger |
| 🟡 Suggestion | custom_ops/setup_ops.py:547 | os.system() return value is ignored, so the build does not abort when code generation fails |
| ❓ Question | fastdeploy/envs.py:281 | ENABLE_V1_KVCACHE_MANAGER is not referenced in the changed files; please confirm whether a usage is missing or it is registered ahead of time |
Overall Assessment
The overall implementation is complete: the new CUDA kernels, Python wrappers, and tests form a well-structured loop, and the SM gating and environment-variable switch are reasonably designed. The main issues are the leftover debug print, the build script not aborting on error, and the unclear purpose of ENABLE_V1_KVCACHE_MANAGER; the PR description also needs the change summary and accuracy verification data filled in.
```python
self.max_tokens_per_batch: int = self.speculate_max_draft_token_num + 1
if FLASH_ATTN_VERSION is None:
    init_flash_attn_version()
print(f"num_heads: {self.num_heads}, kv_num_heads: {self.kv_num_heads}")
```
🟡 Suggestion: leftover debug print statement; it prints to stdout every time FlashAttnBackend is initialized. Remove it or switch to logger.debug().

```python
# Remove this line, or change it to:
logger.debug("num_heads: %d, kv_num_heads: %d", self.num_heads, self.kv_num_heads)
```
```python
if cc >= 90 and nvcc_version >= 12.0:
    # decode attention
    os.system(
```
🟡 Suggestion: the os.system() return value is discarded. If auto_gen_template_attention.py fails (missing dependency, template error, etc.), the subsequent find_end_files("gpu_ops/append_attention", ".cu") may not find the auto-generated .cu files and the build fails silently. Check the exit code:

```python
ret = os.system(
    "python utils/auto_gen_template_attention.py --config ..."
)
if ret != 0:
    raise RuntimeError("auto_gen_template_attention.py failed with code %d" % ret)
```

Context from fastdeploy/envs.py:

```python
# Whether to enable FP8 quantization with pow2scale.
"FD_FP8_QUANT_WITH_POW2SCALE": lambda: bool(int(os.getenv("FD_FP8_QUANT_WITH_POW2SCALE", "0"))),
# enable kv cache manager v1
"ENABLE_V1_KVCACHE_MANAGER": lambda: int(os.getenv("ENABLE_V1_KVCACHE_MANAGER", "0")),
```
❓ Question: ENABLE_V1_KVCACHE_MANAGER is not referenced anywhere in the files changed by this PR. Please confirm:
- Does this variable actually belong to this change, or was it mixed in by mistake?
- If it is used somewhere, was the corresponding file change omitted?
Motivation
Add C16 / static C8 decode attention support. Usage: with flash_attn enabled, export USE_DECODE_ATTENTION=1.
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.