[Op][Optimization] Kernel fusion: cast+sigmoid+bias+noauxtc #7777
ShaneGZhu wants to merge 3 commits into
Conversation
Thanks for your contribution!
CI report generated from the code below (updated every 30 minutes):
1 Task overview
2 Task status summary
2.1 Required tasks: 1/2 passed
2.2 Optional tasks: 11/17 passed
3 Failure details (required only): Approval (approval workflow, confidence: high)
Root cause details / key logs / fix suggestions:
Fix suggestion summary: please have a FastDeploy RD and a PaddlePaddle RD each approve once.
Related change: this PR adds Custom Op related code, which triggers the approval check.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-11 20:11:35
📋 Review Summary
PR overview: adds a grouped_topk fused CUDA operator that merges cast + sigmoid + bias + noaux_tc into a single kernel launch, improving MoE routing inference performance (measured TPS +3.95%).
Scope of change: custom_ops/gpu_ops/ (new .cu file), fastdeploy/model_executor/layers/moe/moe.py, tests/operators/
Impact tags: [OP] [Optimization]
📝 PR Convention Check
Three issues: ① the title uses the unofficial tag [Ops] (the official tag is [OP]); ② the Usage or Command section is empty (only the template comment remains); ③ none of the checklist items are ticked, although unit tests and accuracy results have both been provided.
Suggested title (copy-paste ready):
[OP] Kernel fusion: cast+sigmoid+bias+noauxtc
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Adds the fused operator `grouped_topk`, which merges cast + sigmoid + bias + noaux_tc into a single CUDA kernel launch, reducing memory bandwidth overhead and improving inference performance in the MoE routing stage. Currently supported only on CUDA devices.
## Modifications
- `custom_ops/gpu_ops/grouped_topk_kernels.cu`: adds `grouped_topk_fused_kernel`, which performs sigmoid/cast, bias addition, group score computation (top-2 sum), group/expert top-k selection, and sparse score write-back in a single kernel launch
- `custom_ops/gpu_ops/cpp_extensions.cc`: adds the `grouped_topk` function declaration and pybind11 registration (`m.def("grouped_topk", ...)`)
- `custom_ops/setup_ops.py`: adds `grouped_topk_kernels.cu` to both compile source lists (regular and redundant variants)
- `fastdeploy/model_executor/layers/moe/moe.py`: adds a `use_fused=True` branch in `get_moe_scores` that, in non-EP scenarios (`expert_id_to_ep_rank_array is None`), calls `grouped_topk` in place of the previous two-step `fused_cast_sigmoid_bias + noaux_tc` path
- `tests/operators/test_grouped_topk_op.py`: adds operator unit tests covering DeepSeek-V3, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2 model configurations with bfloat16 input
## Usage or Command
N/A
## Accuracy Tests
See the Accuracy Tests table in the PR description (fused_cast+noaux vs fused_cast_grouped_topk): diff ≤ 1.19e-07 for all configurations, and all idx match (✓).
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | tests/operators/test_grouped_topk_op.py | The last line of test_deterministic is `self.ass` (truncated); the determinism assertion on topk_indices is missing, and the test raises AttributeError at runtime |
| ❓ Question | fastdeploy/model_executor/layers/moe/moe.py:45 | The warning message still reads "import noaux_tc Failed!", but grouped_topk is now imported in the same try block, which is misleading during debugging |
| ❓ Question | fastdeploy/model_executor/layers/moe/moe.py:50 | `if current_platform.is_cuda(): pass` is an empty block; consider deleting it to keep the code clean |
Overall Evaluation
The fused-operator design is clear, kernel correctness has been accuracy-verified across multiple configurations, and test coverage is good. The last line of test_deterministic appears truncated (`self.ass`); the determinism assertion on topk_indices must be completed, otherwise the test errors at runtime. There are two minor cleanup omissions on the moe.py side; neither affects functionality.
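The fused routing semantics summarized above (sigmoid/cast, bias add, top-2 group score, group/expert top-k selection, sparse score write-back) can be sketched as a NumPy reference. This is a hypothetical model of the operator, not the PR's kernel: the name `grouped_topk_ref`, the argument layout, the tie-breaking order, and writing back the *unbiased* sigmoid scores are all assumptions.

```python
import numpy as np

def grouped_topk_ref(logits, bias, n_group, topk_group, top_k):
    """Hypothetical NumPy reference for the fused grouped_topk semantics."""
    # cast to float32 and apply sigmoid (the "cast + sigmoid" stage)
    scores = 1.0 / (1.0 + np.exp(-logits.astype(np.float32)))
    # routing correction bias add
    scores_with_bias = scores + bias
    n_expert = scores.shape[-1]
    grouped = scores_with_bias.reshape(-1, n_group, n_expert // n_group)
    # group score: sum of the top-2 biased scores inside each group
    group_scores = np.sort(grouped, axis=-1)[..., -2:].sum(axis=-1)
    # keep the topk_group best groups, mask out the rest
    top_groups = np.argsort(-group_scores, axis=-1)[..., :topk_group]
    keep = np.zeros_like(group_scores, dtype=bool)
    np.put_along_axis(keep, top_groups, True, axis=-1)
    masked = np.where(keep[..., None], grouped, -np.inf).reshape(-1, n_expert)
    # expert top-k over surviving groups; write unbiased scores back sparsely
    topk_idx = np.argsort(-masked, axis=-1)[:, :top_k]
    flat_scores = scores.reshape(-1, n_expert)
    sparse = np.zeros_like(flat_scores)
    np.put_along_axis(
        sparse, topk_idx,
        np.take_along_axis(flat_scores, topk_idx, axis=-1),
        axis=-1,
    )
    return sparse, topk_idx
```

Under these assumptions, a bfloat16 kernel output could be compared against this float32 reference with a small tolerance, which is consistent with the diff ≤ 1.19e-07 bound reported in the Accuracy Tests.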
```python
        noaux_tc_redundant,
    )
except:
    logger.warning("import noaux_tc Failed!")
```
❓ Question: the try block now also imports grouped_topk, but the warning message still reads "import noaux_tc Failed!", which is misleading during debugging.
Suggested change:

```python
logger.warning("import grouped_topk / noaux_tc Failed!")
```

```python
from fastdeploy.model_executor.layers.moe.fused_cast_sigmoid_bias import (
    fused_cast_sigmoid_bias,
)
pass
```
❓ Question: `if current_platform.is_cuda(): pass` is now an empty block; the conditional import of fused_cast_sigmoid_bias has been removed, leaving only a meaningless `pass`.
Suggest deleting the whole if block to keep the code clean.
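The review reports that test_deterministic ends in a truncated `self.ass`. A hypothetical sketch of how the missing determinism assertion might look; the `_grouped_topk` stand-in and all names here are illustrative, not the PR's actual test code:

```python
import unittest

import numpy as np

class TestGroupedTopkDeterministic(unittest.TestCase):
    """Hypothetical completion of the truncated test_deterministic."""

    def _grouped_topk(self, x):
        # stand-in for the real CUDA op; any deterministic routine
        # is enough to illustrate the assertion pattern
        idx = np.argsort(-x, axis=-1)[:, :2]
        scores = np.take_along_axis(x, idx, axis=-1)
        return scores, idx

    def test_deterministic(self):
        x = np.random.RandomState(0).rand(4, 8).astype(np.float32)
        scores1, idx1 = self._grouped_topk(x)
        scores2, idx2 = self._grouped_topk(x)
        np.testing.assert_array_equal(scores1, scores2)
        # the assertion the review reports as truncated (self.ass...):
        self.assertTrue(np.array_equal(idx1, idx2))
```

Running the op twice on identical input and asserting both scores and indices match is the minimal shape such a test needs; the truncated line most plausibly belonged to the index check.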
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##        develop   #7777   +/- ##
==========================================
  Coverage        ?   71.13%
==========================================
  Files           ?      396
  Lines           ?    55831
  Branches        ?     8724
==========================================
  Hits            ?    39716
  Misses          ?    13366
  Partials        ?     2749
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Motivation
Kernel fusion: cast + sigmoid + bias + noauxtc. Currently, this is supported only on CUDA devices.
Modifications
Usage or Command
Accuracy Tests
fused_cast+noaux (A) vs fused_cast_grouped_topk (C) performance comparison
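The acceptance criterion reported for this comparison (score diff within a small bound, identical selected indices) could be checked with a helper like the following sketch; the function name and argument layout are illustrative, not part of the PR:

```python
import numpy as np

def check_accuracy(ref_scores, fused_scores, ref_idx, fused_idx, atol=1.19e-7):
    """Hypothetical checker mirroring the reported criteria:
    max absolute score diff within atol, and exactly matching indices."""
    max_diff = float(np.max(np.abs(ref_scores - fused_scores)))
    idx_match = np.array_equal(ref_idx, fused_idx)
    return max_diff <= atol and idx_match
```

The default `atol` here simply reuses the 1.19e-07 bound quoted in the review summary.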
Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.