[Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc by ShaneGZhu · Pull Request #7777 · PaddlePaddle/FastDeploy

ShaneGZhu · 2026-05-11T11:06:18Z

Motivation

Kernel fusion: cast + sigmoid + bias + noauxtc. Currently, this is supported only on CUDA devices.
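For orientation, the unfused chain that the kernel replaces can be sketched in NumPy. This is a reference sketch under stated assumptions, not the PR's kernel. The top-2-sum group score comes from the review summary later in this thread. Taking the output weights from the un-biased sigmoid scores follows DeepSeek-V3-style routing and is an assumption about this operator. The function name, parameter names, and the float16 input (standing in for bfloat16, which NumPy lacks) are illustrative.

```python
import numpy as np


def grouped_topk_reference(gating, bias, n_group, topk_group, top_k):
    """Unfused NumPy reference: cast -> sigmoid -> bias -> grouped top-k."""
    # 1) cast: do the arithmetic in float32 regardless of the input dtype
    logits = gating.astype(np.float32)
    # 2) sigmoid: per-expert routing scores
    scores = 1.0 / (1.0 + np.exp(-logits))
    # 3) bias: biased scores are used for selection only (assumption)
    biased = scores + bias.astype(np.float32)

    num_tokens, num_experts = biased.shape
    per_group = num_experts // n_group
    grouped = biased.reshape(num_tokens, n_group, per_group)
    # 4a) group score = sum of the two largest biased scores in each group
    group_scores = np.sort(grouped, axis=-1)[..., -2:].sum(axis=-1)
    # 4b) keep the topk_group best groups and mask experts in the others
    keep = np.argsort(-group_scores, axis=-1, kind="stable")[:, :topk_group]
    group_mask = np.zeros((num_tokens, n_group), dtype=bool)
    np.put_along_axis(group_mask, keep, True, axis=-1)
    expert_mask = np.repeat(group_mask, per_group, axis=-1)
    masked = np.where(expert_mask, biased, -np.inf)
    # 4c) pick top_k experts among the surviving groups
    topk_idx = np.argsort(-masked, axis=-1, kind="stable")[:, :top_k]
    # output weights are read from the un-biased scores (assumption)
    topk_weights = np.take_along_axis(scores, topk_idx, axis=-1)
    return topk_weights, topk_idx


# One token, 8 experts in 4 groups of 2; keep 2 groups, then 2 experts.
gating = np.array([[0, 0, 5, 4, -1, -1, 3, 3.5]], dtype=np.float16)
bias = np.zeros(8, dtype=np.float32)
weights, idx = grouped_topk_reference(gating, bias, n_group=4, topk_group=2, top_k=2)
print(idx.tolist())
```

The sketch assumes at least two experts per group and `top_k` no larger than `topk_group * per_group`. Each step is a separate pass over the score matrix here, which is the memory traffic a single fused launch is meant to avoid.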

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Test branch	Parallelism	Model	Comparison	Requests	Engine max BS	Avg input	Avg output	TPS	OTPS	QPS	TTFT (ms)	Decode speed (tok/s)
develop	TP8	GLM4.5-Air	baseline	256	256	159.53	5433.42	3263.46	3170.38	0.583	1257.81	30.38
develop	TP8	GLM4.5-Air	fused_cast	256	256	159.53	5662.12	3330.47 (+2.06%)	3239.20	0.572	1282.50	30.70 (+1%)
develop	TP8	GLM4.5-Air	fused_cast_get_moe_score	256	256	159.53	5604.22	3392.78 (+3.95%)	3298.88	0.589 (+1%)	1458.27	30.54 (+0.5%)

Usage or Command

Accuracy Tests

fused_cast+noaux (A) vs fused_cast_grouped_topk (C) performance comparison

config	T (tokens)	E (experts)	path_a (µs)	path_c (µs)	c/a	diff	idx
deepseek_v3	1	256	22.27	9.04	2.46x	5.96e-08	✓
deepseek_v3	8	256	21.19	9.13	2.32x	5.96e-08	✓
deepseek_v3	32	256	21.20	9.23	2.30x	8.94e-08	✓
deepseek_v3	128	256	21.23	10.57	2.01x	8.94e-08	✓
deepseek_v3	256	256	21.23	10.69	1.99x	8.94e-08	✓
deepseek_v3	512	256	21.61	11.00	1.96x	8.94e-08	✓
deepseek_v3	1024	256	23.60	18.62	1.27x	1.19e-07	✓
deepseek_v3	2048	256	26.77	27.00	0.99x	1.19e-07	✓
deepseek_v3	4096	256	36.17	42.64	0.85x	1.19e-07	✓
deepseek_v3	8192	256	60.25	68.77	0.88x	8.94e-08	✓
glm45_air	1	128	21.40	8.80	2.43x	1.49e-08	✓
glm45_air	8	128	21.38	8.89	2.41x	1.49e-08	✓
glm45_air	32	128	21.16	10.18	2.08x	2.24e-08	✓
glm45_air	128	128	21.32	10.31	2.07x	2.98e-08	✓
glm45_air	256	128	21.29	10.34	2.06x	2.98e-08	✓
glm45_air	512	128	21.41	10.41	2.06x	2.98e-08	✓
glm45_air	1024	128	21.25	10.85	1.96x	2.98e-08	✓
glm45_air	2048	128	21.86	12.10	1.81x	2.98e-08	✓
glm45_air	4096	128	26.12	16.56	1.58x	2.98e-08	✓
glm45_air	8192	128	40.95	28.43	1.44x	4.47e-08	✓
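The `c/a` column appears to be the path_a latency divided by the path_c latency, so values above 1 mean the fused path is faster. A quick check on four rows copied from the table above supports that reading; the helper name is illustrative.

```python
# (config, T, path_a_us, path_c_us, reported c/a) copied from the table above
rows = [
    ("deepseek_v3", 1, 22.27, 9.04, "2.46x"),
    ("deepseek_v3", 2048, 26.77, 27.00, "0.99x"),
    ("deepseek_v3", 4096, 36.17, 42.64, "0.85x"),
    ("glm45_air", 8192, 40.95, 28.43, "1.44x"),
]


def speedup(path_a_us, path_c_us):
    """Latency ratio: how many times faster path C is than path A."""
    return path_a_us / path_c_us


for config, tokens, a_us, c_us, reported in rows:
    computed = f"{speedup(a_us, c_us):.2f}x"
    print(f"{config:12s} T={tokens:<5d} computed={computed} reported={reported}")
```

Read this way, the deepseek_v3 rows show the fused path at or below parity from T = 2048 upward (0.99x, 0.85x, 0.88x), while every glm45_air row stays above 1.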

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

merge develop

paddle-bot · 2026-05-11T11:06:25Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-11T11:27:39Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-11 20:02:02

This CI report was generated from the following code (refreshed every 30 minutes):

PR commit: 49563e4
Merge base: 589a721 (branch: develop)
View full diff
CI details

1 Job overview

⚠️ 1 required job failed and is blocking the merge; handle it first.

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
19(0)	19	12	2	4	1	0

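2 Job status summary

2.1 Required jobs: 1/2 passed

Required jobs block the merge; failures must be handled first.

Status	Job	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Approval`	10s	PR issue: the new custom op is missing 2 RD approvals	One approval each from a FastDeploy RD and a PaddlePaddle RD	Job	-
✅	1 other required job passed	-	-	-	-	-

2.2 Optional jobs: 11/17 passed

Optional jobs do not block the merge; failures are informational only.

Status	Job	Duration	Log	Rerun
❌	`Check PR Template`	13s	Job	-
⏳	`xpu_build_test / xpu-build-test`	-	Job	-
⏳	`FD-Build-Linux / fd-build`	-	Job	-
⏳	`Run iluvatar Tests / run_iluvatar_cases`	-	Job	-
⏳	`Trigger Jenkins for PR`	-	Job	-
⏸️	`CI_HPU`	-	-	-
✅	11 other optional jobs passed	-	-	-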

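3 Failure details (required jobs only)

Approval: approval workflow (confidence: high)

Approval

Status: ❌ Failed
Error type: approval workflow
Confidence: high
Root cause summary: the PR adds a custom op and lacks one approval each from a FastDeploy RD and a PaddlePaddle RD
Analyzer: generic analysis (fallback)

Root cause details:
The CI script scripts/check_approval.sh detected that this PR adds a custom op, which triggers a dual-approval requirement: (1) at least one FastDeploy RD (@qingqing01/@Jiang-Jia-Jun/@heavengate) must approve; (2) at least one PaddlePaddle RD (@jeff41404/@yongqiangma) must approve. Neither approval has been given yet, so the script exits with exit code 6.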

Key log lines:

0. You must have one FastDeploy RD (qingqing01(dangqingqing), Jiang-Jia-Jun(jiangjiajun), heavengate(dengkaipeng)) approval for adding custom op.
1. You must have one PaddlePaddle RD (jeff41404(gaoxiang), yongqiangma(mayongqiang)) approval for adding custom op.

There are 2 approved errors.
##[error]Process completed with exit code 6.

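Suggested fix:

At least one FastDeploy RD among @qingqing01, @Jiang-Jia-Jun, and @heavengate needs to approve the PR
At least one PaddlePaddle RD among @jeff41404 and @yongqiangma needs to approve the PR

Suggested fix summary: one approval each from a FastDeploy RD and a PaddlePaddle RD

Related change: the PR adds custom op code, which triggered the approval check
Link: view log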

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-11 20:11:35

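📋 Review summary

PR overview: adds a fused CUDA operator, grouped_topk, that merges cast + sigmoid + bias + noaux_tc into a single kernel launch to speed up MoE routing at inference (measured TPS +3.95%).

Change scope: custom_ops/gpu_ops/ (new .cu), fastdeploy/model_executor/layers/moe/moe.py, tests/operators/

Affected tags: [OP] [Optimization]

📝 PR convention check

There are three issues: (1) the title uses a non-official tag, [Ops] (the official tag is [OP]); (2) the Usage or Command section is empty (only the template comment remains); (3) no Checklist item is ticked, although unit tests and accuracy results have both been provided.

Suggested title (can be copied as-is):

[OP] Kernel fusion: cast+sigmoid+bias+noauxtc

Suggested PR description (can be copied as-is; it must reproduce the full structure of the checklist §D2 template):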

## Motivation
Adds a fused operator, `grouped_topk`, that merges cast + sigmoid + bias + noaux_tc into a single CUDA kernel launch, reducing memory-bandwidth overhead and speeding up the MoE routing stage at inference. Currently supported on CUDA devices only.

## Modifications
- `custom_ops/gpu_ops/grouped_topk_kernels.cu`: adds `grouped_topk_fused_kernel`, which in one kernel launch performs sigmoid/cast, bias addition, group score computation (top-2 sum), group/expert top-k selection, and the sparse scores write-back
- `custom_ops/gpu_ops/cpp_extensions.cc`: adds the `grouped_topk` function declaration and its pybind11 registration (`m.def("grouped_topk", ...)`)
- `custom_ops/setup_ops.py`: adds `grouped_topk_kernels.cu` to both compile source lists (the regular and the redundant variants)
- `fastdeploy/model_executor/layers/moe/moe.py`: adds a `use_fused=True` branch in `get_moe_scores`; in the non-EP case (`expert_id_to_ep_rank_array is None`) it calls `grouped_topk` instead of the original two-step `fused_cast_sigmoid_bias + noaux_tc` path
- `tests/operators/test_grouped_topk_op.py`: adds operator unit tests covering the DeepSeek-V3, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2 model configurations as well as bfloat16 input

## Usage or Command
N/A

## Accuracy Tests
Accuracy results are in the Accuracy Tests table of the PR description (fused_cast+noaux vs fused_cast_grouped_topk): every configuration has diff ≤ 1.19e-07 and all idx values match (✓).

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

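Issues

Severity	File	Summary
🟡 Suggestion	`tests/operators/test_grouped_topk_op.py`	The last line of `test_deterministic` is `self.ass` (truncated), so the determinism assertion for `topk_indices` is missing and the test raises `AttributeError` at run time
❓ Question	`fastdeploy/model_executor/layers/moe/moe.py:45`	The warning message is still "import noaux_tc Failed!" although `grouped_topk` is imported in the same try block, which can mislead debugging
❓ Question	`fastdeploy/model_executor/layers/moe/moe.py:50`	`if current_platform.is_cuda(): pass` is an empty block; consider deleting it to keep the code tidy

Overall assessment

The fused operator is clearly designed, the kernel's correctness has been checked for accuracy across several configurations, and test coverage is good. The last line of test_deterministic appears truncated (self.ass); the determinism assertion for topk_indices needs to be completed, otherwise the test fails at run time. On the moe.py side there are two small cleanup omissions that do not affect functionality.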
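A completed determinism check could look like the sketch below. The PR's test body is not shown in this thread, so everything here is illustrative: `fake_grouped_topk` is a hypothetical CPU stand-in for the real op, and the shapes and names are assumptions. The point is only the pair of exact-equality assertions at the end, which is what the truncated `self.ass` line presumably intended for `topk_indices`.

```python
import unittest

import numpy as np


def fake_grouped_topk(gating, bias, top_k):
    """Hypothetical CPU stand-in for the real op; returns (weights, indices)."""
    scores = 1.0 / (1.0 + np.exp(-gating.astype(np.float32)))
    biased = scores + bias
    idx = np.argsort(-biased, axis=-1, kind="stable")[:, :top_k]
    return np.take_along_axis(scores, idx, axis=-1), idx


class TestDeterministic(unittest.TestCase):
    def test_deterministic(self):
        rng = np.random.default_rng(0)
        gating = rng.standard_normal((32, 128)).astype(np.float32)
        bias = rng.standard_normal(128).astype(np.float32)
        weights_1, topk_indices_1 = fake_grouped_topk(gating, bias, top_k=8)
        weights_2, topk_indices_2 = fake_grouped_topk(gating, bias, top_k=8)
        # Repeated calls on identical inputs must agree exactly.
        np.testing.assert_array_equal(weights_1, weights_2)
        np.testing.assert_array_equal(topk_indices_1, topk_indices_2)
```

Exact equality rather than a tolerance is the natural choice for a determinism test, since any run-to-run difference is the failure being looked for.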

PaddlePaddle-bot · 2026-05-11T12:15:22Z

+        noaux_tc_redundant,
+    )
 except:
    logger.warning("import noaux_tc Failed!")


❓ Question: the try block now also imports grouped_topk, but the warning message is still "import noaux_tc Failed!", which can mislead debugging.

Suggested change:

logger.warning("import grouped_topk / noaux_tc Failed!")
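One way to keep such a message accurate as more ops are added is to build it from the list of names being imported. The sketch below is a general pattern, not the code in `moe.py`: the helper name is hypothetical, and because the real module path is not shown in the hunk above, the usage lines use the standard-library `math` module as a stand-in.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def try_import_ops(module_name, op_names):
    """Import the named attributes from a module, or warn and return {}."""
    try:
        module = importlib.import_module(module_name)
        return {name: getattr(module, name) for name in op_names}
    except (ImportError, AttributeError) as exc:
        # The warning names every op that was requested, plus the cause.
        logger.warning(
            "import %s from %s failed: %s", " / ".join(op_names), module_name, exc
        )
        return {}


# Stand-in usage: `math` plays the role of the custom-op module.
ops = try_import_ops("math", ["sqrt", "floor"])
missing = try_import_ops("no_such_module_for_demo", ["grouped_topk", "noaux_tc"])
```

Catching `ImportError` and `AttributeError` instead of using a bare `except:` also avoids swallowing unrelated errors such as `KeyboardInterrupt`.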

PaddlePaddle-bot · 2026-05-11T12:15:22Z

-    from fastdeploy.model_executor.layers.moe.fused_cast_sigmoid_bias import (
-        fused_cast_sigmoid_bias,
-    )
+    pass


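❓ Question: `if current_platform.is_cuda(): pass` is an empty block; the conditional import of fused_cast_sigmoid_bias has been removed, leaving only a meaningless pass.

Consider deleting the whole if block to keep the code tidy.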

gongshaotian

LGTM

gongshaotian · 2026-05-11T12:32:33Z

-    from fastdeploy.model_executor.layers.moe.fused_cast_sigmoid_bias import (
-        fused_cast_sigmoid_bias,
-    )
+    pass


codecov-commenter · 2026-05-11T13:22:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@589a721). Learn more about missing BASE report.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7777   +/-   ##
==========================================
  Coverage           ?   71.13%           
==========================================
  Files              ?      396           
  Lines              ?    55831           
  Branches           ?     8724           
==========================================
  Hits               ?    39716           
  Misses             ?    13366           
  Partials           ?     2749

Flag	Coverage Δ
GPU	`71.13% <100.00%> (?)`

ShaneGZhu added 2 commits May 11, 2026 18:20

[Ops][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc

68443e9

Merge remote-tracking branch 'origin/develop' into get_moe_score

ec06609

merge develop

ShaneGZhu had a problem deploying to Metax_ci May 11, 2026 11:06 — with GitHub Actions Failure

Add unit_test file

49563e4

ShaneGZhu had a problem deploying to Metax_ci May 11, 2026 11:52 — with GitHub Actions Failure

ShaneGZhu marked this pull request as ready for review May 11, 2026 11:53

PaddlePaddle-bot reviewed May 11, 2026

gongshaotian approved these changes May 11, 2026

ShaneGZhu changed the title ~~[Ops][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc~~ [Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc May 11, 2026

ShaneGZhu wants to merge 3 commits into PaddlePaddle:develop from ShaneGZhu:get_moe_score

4 participants
