
[Cherry-Pick][Feature] support decode attention for mix(#7688) #7729

Open
lizhenyun01 wants to merge 14 commits into
PaddlePaddle:release/2.6 from
lizhenyun01:dec_attn_2.6

Conversation

@lizhenyun01
Collaborator

Motivation

Support for C16 / static C8 attention. Usage: with flash_attn enabled, run export USE_DECODE_ATTENTION=1.

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 7, 2026

Thanks for your contribution!

@lizhenyun01 lizhenyun01 changed the title [Feature] support decode attention for mix(#7688) [Cherry-Pick][Feature] support decode attention for mix(#7688) May 7, 2026
@PaddlePaddle-bot

PaddlePaddle-bot commented May 7, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-11 21:26:04

CI report generated against the code below (refreshed every 30 minutes):


1 Task Overview

⚠️ 1 Required task has failed and another 4 Required tasks are still running/waiting; the PR cannot be merged yet.

Total runs (reruns)  Total tasks  ✅ Passed  ❌ Failed  ⏳ Running  ⏸️ Waiting  Skipped
28(0)                28           19         4          2           3           0

2 Task Status Summary

2.1 Required tasks: 3/8 passed

Required tasks block merging; failures must be handled first.

Status  Task  Duration  Root Cause  Fix Suggestion  Log  Rerun
❌ Approval  11s  PR issue: the PR modifies protected modules and lacks 4 required approvals  Have the designated RDs review and approve the PR  Job  -
⏳ Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage  -  Running  -  Job  -
⏳ Extracted partial CE model tasks to run in CI. / run_ce_cases  -  Running  -  Job  -
⏸️ Run Four Cards Tests / run_4_cards_tests  -  Waiting  -  -  -
⏸️ Run Stable Tests / stable_tests  -  Waiting  -  -  -
✅ The remaining 3 required tasks passed (Pre Commit, Run Base Tests / base_tests, Run FastDeploy LogProb Tests / run_tests_logprob)  -  -  -  -  -

2.2 Optional tasks — 16/20 passed

Optional tasks do not block merging; their failures are informational only.

Status  Task  Duration  Log  Rerun
❌ Run iluvatar Tests / run_iluvatar_cases  14m13s  Job  -
❌ Check PR Template  15s  Job  -
❌ Trigger Jenkins for PR  20m42s  Job  -
⏸️ CI_HPU  -  -  -
✅ The remaining 16 optional tasks passed  -  -  -

3 Failure Details (required only)

Approval — code standards (approval workflow) (confidence: high)

Approval

  • Status: ❌ failed
  • Error type: code standards (approval workflow)
  • Confidence: high
  • Root cause summary: the PR modifies protected modules and lacks 4 required approvals
  • Analyzer: generic analysis (fallback)

Root cause details:
PR #7729 adds custom ops and modifies protected paths; the check_approval.sh script detected 4 unsatisfied approval requirements: a FastDeploy RD approval for the new custom ops, a PaddlePaddle RD approval for the new custom ops, an approval for the changes to fastdeploy/spec_decode / custom_ops/gpu_ops/speculate_decoding, and an approval for the change to fastdeploy/envs.py.

Key log:

0. You must have one FastDeploy RD (qingqing01, Jiang-Jia-Jun, heavengate) approval for adding custom op.
1. You must have one PaddlePaddle RD (jeff41404, yongqiangma) approval for adding custom op.
2. You must have one FastDeploy RD (freeliuzc, Deleter-D) approval for modifying [fastdeploy/spec_decode,custom_ops/gpu_ops/speculate_decoding].
3. You must have one FastDeploy RD (Jiang-Jia-Jun, yuanlehome, rainyfly, Wanglongzhi2001) approval for modifying [fastdeploy/envs.py].
There are 4 approved errors.
##[error]Process completed with exit code 6.

Fix suggestions:

  1. Have one FastDeploy RD (@qingqing01 / @Jiang-Jia-Jun / @heavengate) approve the custom op code added in this PR
  2. Have one PaddlePaddle RD (@jeff41404 / @yongqiangma) approve the custom op code added in this PR
  3. Have one FastDeploy RD (@freeliuzc / @Deleter-D) approve the changes to fastdeploy/spec_decode / custom_ops/gpu_ops/speculate_decoding
  4. Have one FastDeploy RD (@Jiang-Jia-Jun / @yuanlehome / @rainyfly / @Wanglongzhi2001) approve the change to fastdeploy/envs.py

Fix suggestion summary: have the designated RDs review and approve the PR (FastDeploy RD + PaddlePaddle RD)

Link: view log

PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented May 11, 2026

Codecov Report

❌ Patch coverage is 63.52941% with 31 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@66dea60). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...l_executor/layers/attention/append_attn_backend.py 0.00% 10 Missing and 1 partial ⚠️
...el_executor/layers/attention/flash_attn_backend.py 46.66% 6 Missing and 2 partials ⚠️
fastdeploy/spec_decode/mtp.py 42.85% 6 Missing and 2 partials ⚠️
...cutor/layers/attention/ops/config_for_attention.py 85.71% 0 Missing and 1 partial ⚠️
...or/layers/attention/ops/decode_append_attention.py 88.88% 0 Missing and 1 partial ⚠️
...ers/attention/ops/decoder_write_cache_with_rope.py 88.88% 0 Missing and 1 partial ⚠️
fastdeploy/worker/gpu_model_runner.py 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7729   +/-   ##
==============================================
  Coverage               ?   72.45%           
==============================================
  Files                  ?      381           
  Lines                  ?    54016           
  Branches               ?     8445           
==============================================
  Hits                   ?    39139           
  Misses                 ?    12084           
  Partials               ?     2793           
Flag Coverage Δ
GPU 72.45% <63.52%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-11 21:30:39

📋 Review Summary

PR overview: adds C16 (fp16/bf16 KV cache) and static C8 (int8 KV cache) decode attention CUDA kernels for the decode phase, enabled via the environment variable USE_DECODE_ATTENTION=1; supported only on SM ≥ 90 (Hopper+) with NVCC ≥ 12.0.
Scope of changes: `custom_ops/gpu_ops/append_attention/`, `layers/attention/`, `worker/gpu_model_runner.py`, `spec_decode/mtp.py`
Impact tags: `[OP]` `[Feature]`


📝 PR Convention Check

The title follows the `[Cherry-Pick][Feature] description(#original PR number)` convention ✅. However, the `## Modifications`, `## Usage or Command`, and `## Accuracy Tests` sections are all empty (only comment placeholders remain) and the Checklist items are unchecked; please complete them.

Suggested title (copy-paste ready):

  • [Cherry-Pick][Feature] support decode attention for C16/C8 mix(#7688)

Suggested PR description (copy-paste ready):

## Motivation
Support for C16 / static C8 attention. Usage: with flash_attn enabled, `export USE_DECODE_ATTENTION=1`

## Modifications
- Add C16 (fp16/bf16 KV cache) and static C8 (int8 KV cache) decode attention CUDA implementations (under `custom_ops/gpu_ops/append_attention/`, including core kernel files such as attention_func.cuh, decode_append_attention_c16_impl.cuh, and decode_append_attention_c8_impl.cuh)
- Add `decode_append_attention.cu` and `decoder_write_cache_with_rope.cu` as operator entry points, compiled only when SM ≥ 90 and NVCC ≥ 12.0
- Add a Python wrapper layer: `layers/attention/ops/decode_append_attention.py`, `decoder_write_cache_with_rope.py`, and `config_for_attention.py`
- In the decode phase of `flash_attn_backend.py`, switch to the new decode attention path when `USE_DECODE_ATTENTION=1` (replacing the original `append_attention`)
- Add the `USE_DECODE_ATTENTION` environment variable switch in `fastdeploy/envs.py`
- Update the decode attention buffer allocation and initialization in `spec_decode/mtp.py`, `worker/gpu_model_runner.py`, and `worker/metax_model_runner.py` accordingly
- Add tests: `tests/operators/attention/test_decode_append_attention.py` and `test_decode_append_attention_c16.py`

## Usage or Command
```bash
export USE_DECODE_ATTENTION=1
# Takes effect when flash_attn is enabled; SM ≥ 90 (Hopper+) with NVCC ≥ 12.0 only
```

## Accuracy Tests
N/A (this PR replaces the original append_attention decode path with new decode attention kernels; accuracy comparison data, or a rationale for why accuracy validation is unnecessary, should be added)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
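Since the review asks for accuracy evidence, one way to produce it is to compare the new kernel's output against a plain reference attention. The sketch below is a hypothetical numpy oracle for single-token (decode-step) attention; it is not FastDeploy's actual test, just the kind of reference such a comparison would use (shapes and names are assumptions):

```python
import numpy as np

def decode_attention_reference(q, k_cache, v_cache):
    """Single-query attention over the cached KV, per head.

    q:       (num_heads, head_dim)          query for the one decode token
    k_cache: (num_heads, seq_len, head_dim) cached keys
    v_cache: (num_heads, seq_len, head_dim) cached values
    returns: (num_heads, head_dim)
    """
    head_dim = q.shape[-1]
    # Scaled dot-product scores: (num_heads, seq_len)
    scores = np.einsum("hd,hsd->hs", q, k_cache) / np.sqrt(head_dim)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Weighted sum over the cached values
    return np.einsum("hs,hsd->hd", probs, v_cache)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64)).astype(np.float32)
k = rng.standard_normal((8, 128, 64)).astype(np.float32)
v = rng.standard_normal((8, 128, 64)).astype(np.float32)
out = decode_attention_reference(q, k, v)
print(out.shape)  # (8, 64)
```

An accuracy report would then run the CUDA kernel on the same inputs and assert `np.allclose` within an fp16/int8-appropriate tolerance.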

Issues

Level  File  Summary
📝 PR convention  -  Modifications / Usage / Accuracy Tests are empty and the Checklist is unchecked
🟡 Suggestion  fastdeploy/model_executor/layers/attention/flash_attn_backend.py:281  Leftover debug print statement; remove it or switch to a logger
🟡 Suggestion  custom_ops/setup_ops.py:547  The return value of os.system() is ignored, so the build does not abort when code generation fails
❓ Question  fastdeploy/envs.py:281  ENABLE_V1_KVCACHE_MANAGER is not referenced in the changed files; confirm whether this is an omission or an early registration

Overall Assessment

The implementation is complete overall: the new CUDA kernels + Python wrappers + tests form a well-structured loop, and the SM gating and environment-variable switch are reasonably designed. The main issues are the leftover debug print, the build script not aborting on error, and the unclear purpose of ENABLE_V1_KVCACHE_MANAGER; the PR description also needs a change summary and accuracy validation data.

self.max_tokens_per_batch: int = self.speculate_max_draft_token_num + 1
if FLASH_ATTN_VERSION is None:
    init_flash_attn_version()
print(f"num_heads: {self.num_heads}, kv_num_heads: {self.kv_num_heads}")

🟡 Suggestion: leftover debug print statement; it will print to stdout every time FlashAttnBackend is initialized. Remove it or change it to logger.debug():

# Remove this line, or replace with:
logger.debug("num_heads: %d, kv_num_heads: %d", self.num_heads, self.kv_num_heads)
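To illustrate the suggested fix in context, here is a minimal sketch; `FlashAttnBackendSketch` and the module-level logger setup are assumptions for illustration, not FastDeploy's actual code, which would import its shared logger instead:

```python
import logging

# Illustrative module-level logger; FastDeploy would import its own.
logger = logging.getLogger(__name__)

class FlashAttnBackendSketch:
    """Minimal stand-in for the backend __init__ under review."""

    def __init__(self, num_heads: int, kv_num_heads: int) -> None:
        self.num_heads = num_heads
        self.kv_num_heads = kv_num_heads
        # logger.debug with lazy %-formatting: silent unless DEBUG is
        # enabled, and the message is only formatted when emitted.
        logger.debug("num_heads: %d, kv_num_heads: %d", num_heads, kv_num_heads)

backend = FlashAttnBackendSketch(num_heads=32, kv_num_heads=8)
```

Unlike print(), this keeps stdout clean during normal initialization while remaining available for debugging via the logging configuration.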

Comment thread custom_ops/setup_ops.py

if cc >= 90 and nvcc_version >= 12.0:
    # decode attention
    os.system(

🟡 Suggestion: the return value of os.system() is discarded. If auto_gen_template_attention.py fails (missing dependency, template error, etc.), the subsequent find_end_files("gpu_ops/append_attention", ".cu") may not find the auto-generated .cu files, causing the build to fail silently. Check the exit code:

ret = os.system(
    "python utils/auto_gen_template_attention.py --config ..."
)
if ret != 0:
    raise RuntimeError("auto_gen_template_attention.py failed with code %d" % ret)
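An equivalent fix could use subprocess.run with check=True, which raises on a non-zero exit code instead of requiring a manual check. The command below is a placeholder standing in for the auto_gen_template_attention.py invocation, not the real build command:

```python
import subprocess
import sys

def run_codegen(cmd: list[str]) -> None:
    # check=True raises CalledProcessError on a non-zero exit code,
    # so a failed code-generation step aborts the build immediately.
    subprocess.run(cmd, check=True)

# Placeholder command simulating a codegen script that fails with code 3.
try:
    run_codegen([sys.executable, "-c", "raise SystemExit(3)"])
except subprocess.CalledProcessError as e:
    print(f"codegen failed with code {e.returncode}")
```

This also avoids a shell-string invocation, which is easier to audit than os.system() and returns the child's exit code directly.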

Comment thread fastdeploy/envs.py
# Whether to enable FP8 quantization with pow2scale.
"FD_FP8_QUANT_WITH_POW2SCALE": lambda: bool(int(os.getenv("FD_FP8_QUANT_WITH_POW2SCALE", "0"))),
# enable kv cache manager v1
"ENABLE_V1_KVCACHE_MANAGER": lambda: int(os.getenv("ENABLE_V1_KVCACHE_MANAGER", "0")),

❓ Question: ENABLE_V1_KVCACHE_MANAGER is not referenced anywhere in the files changed by this PR. Please confirm:

  1. Does this variable actually belong in this change, or was it mixed in by mistake?
  2. If there is a usage site, was the corresponding file change omitted?
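For context, the quoted envs.py snippet registers each variable as a lazily evaluated lambda. A minimal sketch of that pattern, with the registry and accessor simplified for illustration (not FastDeploy's exact implementation):

```python
import os

# Lazily evaluated environment registry, mirroring the envs.py pattern:
# each value is a zero-arg lambda read at access time, not at import time.
environment_variables = {
    "USE_DECODE_ATTENTION": lambda: int(os.getenv("USE_DECODE_ATTENTION", "0")),
    "ENABLE_V1_KVCACHE_MANAGER": lambda: int(os.getenv("ENABLE_V1_KVCACHE_MANAGER", "0")),
}

def getenv_registered(name: str) -> int:
    """Resolve a registered variable at call time."""
    return environment_variables[name]()

os.environ["USE_DECODE_ATTENTION"] = "1"
print(getenv_registered("USE_DECODE_ATTENTION"))  # reflects the current env
```

Because resolution happens at call time, registering a variable before its usage site lands is harmless but, as the question above notes, worth confirming as intentional.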

3 participants