[Speculative Decoding] Refine ngram kernel signature and adapt ngram proposer#7774
NKNaN wants to merge 2 commits.
Conversation
Thanks for your contribution!
CI report generated from the code below (updated every 30 minutes):
1. Task overview: 1 Required task is failing (Approval pending) and must be resolved before merging.
2. Task status summary
   2.1 Required tasks: 1/2 passed
   2.2 Optional tasks: 11/17 passed
3. Failure details (required only): Approval — code approval (confidence: high)
Suggested fix: please have freeliuzc or Deleter-D approve this PR.
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##          develop    #7774   +/-   ##
========================================
  Coverage        ?   71.53%
========================================
  Files           ?      396
  Lines           ?    55822
  Branches        ?     8724
========================================
  Hits            ?    39935
  Misses          ?    13136
  Partials        ?     2751
```
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-11 22:49:23
📋 Review Summary
PR overview: streamlines the ngram match kernel interface (merging input_ids/input_ids_len into token_ids_all), fixes an ngram pointer-offset bug, and completes end-to-end verification.
Scope of changes: custom_ops/gpu_ops/speculate_decoding/, fastdeploy/spec_decode/ngram.py, fastdeploy/config.py, fastdeploy/worker/gpu_model_runner.py, test files
Impact tags: [Speculative Decoding] [OP] [FDConfig]
📝 PR Convention Check
The title carries the official tag [Speculative Decoding] ✓; however, the PR body is missing the two required sections ## Usage or Command and ## Accuracy Tests, so it does not follow the description template.
Suggested PR description (copy-paste ready):
## Motivation
Streamline the ngram match kernel interface by merging the previously separate `input_ids`/`input_ids_len` parameters into `token_ids_all` (with `prompt_lens` delimiting the boundary between prompt and generated tokens), and fix an ngram pointer-offset bug (the semantics of `step_idx` are unified from a 0-based index of the last position to a token count). End-to-end results verified on a single A800, confirming the end-to-end correctness of the ngram speculative decoding method.
## Modifications
1. **`custom_ops/gpu_ops/speculate_decoding/ngram_match.cu` / `cpp_extensions.cc`**: remove the `input_ids`, `input_ids_len`, and `input_ids_stride` parameters; both the GPU kernel and the CPU fallback now read the prompt (the search domain) directly from `token_ids_all[:, :prompt_len]` and pre_ids (the ngram source) from `token_ids_all[:, prompt_len:]`; fix the ngram pointer-offset bug by changing `cur_step_idx + 1 - ngram_size` to `cur_step_idx - ngram_size`.
2. **`fastdeploy/spec_decode/ngram.py`**: remove the `input_ids_len` tensors and the `update()` method; align the `_run_impl` call signature with the new kernel interface.
3. **`fastdeploy/config.py`**: include `SpecMethod.NGRAM` in the `expected_decode_len` computation for CUDAGraph capture.
4. **`fastdeploy/worker/gpu_model_runner.py`**: add a warmup path for the NGRAM method in `capture_model()`, consistent with MTP/SUFFIX.
5. **Tests**: update `tests/operators/test_ngram_match.py`, `tests/spec_decode/test_benchmark_ngram_kernel.py`, and `tests/spec_decode/test_ngram_gpu_kernel.py`; add `tests/spec_decode/test_ngram_proposer.py`.
## Usage or Command
N/A
## Accuracy Tests
End-to-end verification (single A800): printed read/write logs of `token_ids_all` confirm that the prompt is read back exactly as written; all `draft_tokens` hits fall within `token_ids_all[:prompt_len + step_idx]`; cross-step `step_idx` increments (e.g. 25→31) confirm that a single decode step accepted multiple speculative tokens.
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| ❓ Question | fastdeploy/spec_decode/ngram.py:38 | `update()` has been removed; confirm gpu_model_runner.py has no residual calls |
Overall assessment
The interface simplification is clean, the bug fix (ngram pointer offset) is thoroughly validated by extensive e2e logs, and test coverage is complete. The description still needs the ## Usage or Command and ## Accuracy Tests sections filled in.
Review comment on `fastdeploy/spec_decode/ngram.py:38` (context of the deleted lines):

```python
self.input_ids_len[bid] = seq_len
self.input_ids_len_gpu[bid] = seq_len

def _run_impl(self, share_inputs):
```
❓ Question: the `update()` method is deleted in this PR; please confirm that `fastdeploy/worker/gpu_model_runner.py` (e.g. in `_postprocess`) no longer contains residual `proposer.update(bid, seq_len)` calls, otherwise NGRAM mode will raise an AttributeError.
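The failure mode the reviewer is worried about can be shown with a toy stand-in (the class below is a hypothetical sketch, not the real FastDeploy proposer):

```python
class NgramProposer:
    """Toy stand-in: update() was removed in this PR, only the run path remains."""

    def _run_impl(self, share_inputs):
        return share_inputs


proposer = NgramProposer()

# A stale call site like the one the reviewer asks about would now fail
# at runtime in NGRAM mode rather than being caught at import time:
try:
    proposer.update(0, 128)  # hypothetical residual call
except AttributeError:
    print("residual proposer.update() call raises AttributeError")
```

This is why the review asks for a repo-wide check rather than relying on the test suite alone.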
Motivation
End-to-end verification of the speculative decoding ngram method.
Modifications
Test script (runs successfully on a single A800 in AI Studio):
Refined the ngram kernel interface:
Since input_ids and pre_ids are now both folded into token_ids_all, the input_ids parameter is removed from the interface; prompt tokens and predicted tokens are now recorded entirely by token_ids_all.
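As a sketch of the merged layout (shapes and token values below are made up for illustration): each row of `token_ids_all` stores the prompt followed by the generated tokens, and `prompt_lens` marks the boundary that the deleted `input_ids`/`input_ids_len` used to carry.

```python
import numpy as np

# Illustrative only: per-request row = [prompt tokens | generated tokens | padding]
max_len = 12
token_ids_all = np.full((2, max_len), -1, dtype=np.int64)
prompt_lens = np.array([4, 3])   # boundary formerly given by input_ids_len
step_idx = np.array([3, 2])      # generated-token count per request

token_ids_all[0, :4] = [10, 11, 12, 13]   # prompt, request 0
token_ids_all[0, 4:7] = [20, 21, 22]      # generated tokens, request 0
token_ids_all[1, :3] = [30, 31, 32]       # prompt, request 1
token_ids_all[1, 3:5] = [40, 41]          # generated tokens, request 1

for bid in range(2):
    prompt = token_ids_all[bid, :prompt_lens[bid]]  # search domain
    pre_ids = token_ids_all[bid, prompt_lens[bid]:prompt_lens[bid] + step_idx[bid]]  # ngram source
    print(f"req {bid}: prompt={prompt.tolist()} pre_ids={pre_ids.tolist()}")
```

With this layout the kernel needs only `token_ids_all`, `prompt_lens`, and `step_idx` to recover both slices, which is what makes the separate `input_ids` arguments redundant.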
Confirmed that the modified ngram match kernel executes correctly end to end:
Rows where token_ids_all is 5 are dummy batches with seq_lens_decoder=0. Apart from those, the contents written to and read from the prompt portion of token_ids_all are identical.
After fixing the ngram address-computation bug in the kernel, the printed logs show successful matches, including cases where a single decode step accepts multiple tokens after verify, e.g.:
```
[NGRAM-DEBUG] call=22 slt=[6, 6] step_idx=[[25], [24], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[192, 77, 0, 0, 0, 0, 0, 0]
[NGRAM-DEBUG] call=23 slt=[6, 6] step_idx=[[31], [25], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[198, 78, 0, 0, 0, 0, 0, 0]
```
Between the two consecutive proposer.run() calls, step_idx[0] increases from 25 to 31 and seq_len_decoder[0] from 192 to 198.
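The indexing fix can be mirrored in a small Python reference (a sketch, not the CUDA kernel; the function name and the prompt-only, first-match search are my simplifications): with `step_idx` counting generated tokens, the trailing ngram occupies `pre_ids[step_idx - ngram_size : step_idx]`, whereas the old start of `step_idx + 1 - ngram_size` assumed a 0-based last-position index and so read one token too far.

```python
def propose_ngram(prompt, pre_ids, step_idx, ngram_size, draft_token_num):
    """Match the trailing ngram of the generated tokens against the
    prompt and propose the tokens that followed the match."""
    if step_idx < ngram_size:
        return []
    # Fixed offset: step_idx is a token count, so the trailing ngram
    # starts at step_idx - ngram_size (not step_idx + 1 - ngram_size).
    ngram = pre_ids[step_idx - ngram_size : step_idx]
    for i in range(len(prompt) - ngram_size):
        if prompt[i : i + ngram_size] == ngram:
            start = i + ngram_size
            return prompt[start : start + draft_token_num]
    return []


# The ngram [2, 3] at the tail of pre_ids also appears in the prompt,
# so the tokens after that occurrence become draft tokens:
print(propose_ngram([1, 2, 3, 4, 5, 2, 3], [9, 2, 3],
                    step_idx=3, ngram_size=2, draft_token_num=2))  # [4, 5]
```

A multi-token acceptance like the 25→31 jump in the logs corresponds to several of these proposed draft tokens passing verify in one decode step.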
CUDAGraph adaptation
Overlap Schedule adaptation
Usage or Command
Accuracy Tests
Checklist
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.