wip: dsa for v1 cache manager #7770
Conversation
Thanks for your contribution!
Pull request overview
This PR enables FastDeploy's v1 cache controller/transfer manager to support two new KV cache layouts, MLA and DSA, generalizing cache initialization, naming, and swap logic from the original GQA/MHA path into dynamic branching by attention type.
Changes:
- In the v1 TransferManager, add MLA/DSA mode detection, and distinguish MLA (key only), DSA (key + indexer), and GQA (key + value [+ scales]) when building layer indices and in the swap logic.
- In the v1 CacheController, adjust the cache naming rules and add dedicated MLA/DSA device cache initialization functions.
- In the v1 host cache initialization, add a DSA shape branch (though the current implementation still has a critical uncovered allocation and a branch-consistency problem).
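The mode detection behind these changes can be sketched as a standalone helper. The flag names `_is_mla` / `_is_dsa` and the config fields `kv_lora_rank` / `index_head_dim` come from this PR's description; the exact derivation below is an assumption:

```python
# Hypothetical standalone version of the _is_mla/_is_dsa detection this PR
# adds. The PR derives the flags from kv_lora_rank and index_head_dim in
# model_config; the truthiness-based derivation here is an assumption.
def detect_cache_mode(kv_lora_rank, index_head_dim):
    is_dsa = bool(index_head_dim)   # DSA: key pool + uint8 indexer pool
    is_mla = bool(kv_lora_rank)     # MLA: key pool only, no value pool
    if is_dsa:
        return "dsa"
    if is_mla:
        return "mla"
    return "gqa"                    # GQA/MHA: key + value (+ fp8 scales)

print(detect_cache_mode(512, 128))    # dsa
print(detect_cache_mode(512, None))   # mla
print(detect_cache_mode(None, None))  # gqa
```

The branch order mirrors the `if self._is_dsa: ... elif self._is_mla: ... else:` pattern used throughout the diff.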
PR title/description check (needs adjustment)
- The title is currently `wip: Feat/hisparse`, which does not follow the repository's required `[CLASS] Title` format; a suggestion: `[KVCache][Feature] Support MLA/DSA cache layout in v1 cache manager`.
- The Motivation/Modifications/Usage sections of the description template are still mostly empty. Consider adding key information such as why the MLA/DSA host swap is needed, whether the naming change is breaking, and the compatibility strategy, and confirm whether related documentation (e.g. environment-variable/configuration notes or the KVCache usage docs) needs updating.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| fastdeploy/cache_manager/v1/transfer_manager.py | Adds MLA/DSA mode detection and swap branches, but the synchronous all-layer swap has a flaw in its indexer coverage for DSA |
| fastdeploy/cache_manager/v1/cache_controller.py | Adds MLA/DSA initialization and naming-rule changes, but host cache allocation and the MTP/DSA shape unpacking have blocking issues |
```python
            self._device_id,
            mode,
        )
        swap_cache_all_layers(
            self._device_value_caches,
            self._host_value_ptrs,
            self._num_host_blocks,
            device_block_ids,
            host_block_ids,
            self._device_id,
            mode,
        )
        if self._is_fp8_quantization() and self._device_key_scales and self._host_key_scales_ptrs:
        # Value cache is only used in GQA
        if not self._is_mla and self._device_value_caches:
            swap_cache_all_layers(
                self._device_value_caches,
                self._host_value_ptrs,
                self._num_host_blocks,
                device_block_ids,
                host_block_ids,
                self._device_id,
                mode,
            )
        # Scale cache is only used in GQA + fp8 quantization
        if (
            not self._is_mla
            and self._is_fp8_quantization()
            and self._device_key_scales
            and self._host_key_scales_ptrs
        ):
```
```python
names = {
    "key": f"key_cache_{layer_idx}_rank{local_rank}.device{self._device_id}",
}

if self._is_dsa:
    names["indexer"] = f"indexer_caches_{layer_idx}_rank{local_rank}.device{self._device_id}"
elif self._is_mla:
    pass  # MLA: only key, no value, no indexer
else:
    # GQA/MHA: key + value + optional scales
    names["value"] = f"value_caches_{layer_idx}_rank{local_rank}.device{self._device_id}"
    names["key_scale"] = f"key_cache_scales_{layer_idx}_rank{local_rank}.device{self._device_id}"
    names["value_scale"] = f"value_cache_scales_{layer_idx}_rank{local_rank}.device{self._device_id}"
```
```python
# Get kv cache shape (pass num_host_blocks as max_num_blocks for host cache)
key_cache_shape, value_cache_shape = attn_backend.get_kv_cache_shape(
    max_num_blocks=num_host_blocks, kv_cache_quant_type=kv_cache_quant_type
)
if self._is_dsa:
    kv_cache_quant_type = "uint8"
    key_cache_shape, _, indexer_cache_shape = attn_backend.get_kv_cache_shape(
        max_num_blocks=num_host_blocks, kv_cache_quant_type=kv_cache_quant_type
    )
    value_cache_shape = []
else:
```
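Because `get_kv_cache_shape` returns two shapes on the non-DSA path and three on the DSA path, a small normalization helper keeps the unpacking in one place and avoids the arity mismatch flagged later in this review (a sketch; `unpack_kv_cache_shapes` is a hypothetical helper and the shape values are illustrative):

```python
# Sketch: normalize get_kv_cache_shape() results to (key, value, indexer).
# DSA backends return three shapes (key, value, indexer); other backends
# return two (key, value). DSA has no value cache, so it is emptied here.
def unpack_kv_cache_shapes(shapes, is_dsa):
    if is_dsa:
        key_shape, _, indexer_shape = shapes
        return key_shape, [], indexer_shape
    key_shape, value_shape = shapes
    return key_shape, value_shape, []

print(unpack_kv_cache_shapes(([64, 8], [64, 8], [64, 1]), is_dsa=True))
# ([64, 8], [], [64, 1])
```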
```python
paddle.device.cuda.empty_cache()
logger.info("MLA kv cache initialized!")

if self._is_dsa:
    kv_cache_quant_type = "uint8"
    key_cache_shape, value_cache_shape = attn_backend.get_kv_cache_shape(
        max_num_blocks=num_gpu_blocks, kv_cache_quant_type=kv_cache_quant_type
    )
    cache_dtype = "uint8"
else:
    key_cache_shape, value_cache_shape = attn_backend.get_kv_cache_shape(
        max_num_blocks=num_gpu_blocks, kv_cache_quant_type=kv_cache_quant_type
    )
    indexer_cache_shape = []
    cache_dtype = self.model_config.dtype
```
```python
names = {
    "key": f"key_cache_{layer_idx}_rank{local_rank}.device{self._device_id}",
}

if self._is_dsa:
    names["indexer"] = f"indexer_caches_{layer_idx}_rank{local_rank}.device{self._device_id}"
elif self._is_mla:
```
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-11 14:51:09
📋 Review Summary
PR overview: adds MLA and DSA cache layout support to the v1 cache manager, generalizing cache initialization, naming, and transfer logic.
Changed files: fastdeploy/cache_manager/v1/cache_controller.py, fastdeploy/cache_manager/v1/transfer_manager.py
Impact tag: [KVCache]
📝 PR Convention Check
The PR title lacks a valid [Tag], and the `wip:` prefix does not follow the convention. The Motivation, Modifications, and Usage or Command sections of the PR description are all empty.
Suggested title (copy-paste ready):
[KVCache] Add MLA and DSA cache layout support in v1 cache manager
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
The v1 cache manager previously supported only the GQA/MHA path. Supporting the MLA layout (Multi-head Latent Attention; key cache only, no value) used by DeepSeek-style models, and the DSA layout (HiSparse; key + uint8 indexer dual pools), requires generalizing cache initialization, naming, and transfer logic.
## Modifications
- `cache_controller.py`:
  - Add `_is_mla` / `_is_dsa` flags (derived from `kv_lora_rank` and `index_head_dim`)
  - Unify naming in `_get_cache_names`: GQA uses key/value/scales, DSA uses key/indexer, MLA uses key only; the key cache name changes from `key_caches_` to `key_cache_`
  - Add `initialize_mla_kv_cache` (key-only; calls set_data_ipc for pinned memory)
  - Add `initialize_dsa_kv_cache` (key + uint8 indexer; calls set_data_ipc)
  - `initialize_kv_cache` dispatches to the three paths; `initialize_mtp_kv_cache` and `initialize_host_cache` gain matching DSA branches
- `transfer_manager.py`:
  - Add `_is_mla` / `_is_dsa` flags
  - `_build_device_layer_indices` / `_build_host_layer_indices` fill device/host indices per mode
  - Swap paths (all-layers / single-layer / async) skip the value swap for MLA; DSA swaps through the indexer slot
## Usage or Command
N/A
## Accuracy Tests
- DSA centralized verification (gsm8k):
```bash
python3 gsm8k.py
🎯 Evaluation Complete: Accuracy = 94.06% (649/690)
```
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | cache_controller.py:460 | `initialize_mtp_kv_cache` DSA branch: wrong number of unpacked values (2 vs 3), and `indexer_cache_shape` is undefined |
| 🟡 Suggestion | tests/cache_manager/v1/ | The `key_caches_` naming in the tests was not updated to `key_cache_` alongside the implementation; the affected tests will fail |
| 🟡 Suggestion | cache_controller.py:368 | `initialize_mla_kv_cache` does not call `initialize_host_cache`; if swap is enabled it will fail silently |
Overall assessment
The overall design is sound, and the three-way MLA/DSA dispatch architecture is clear. However, the DSA branch of `initialize_mtp_kv_cache` has a P0 bug that causes a runtime crash, and the key cache naming change was not propagated to the test files; both need fixing before merge.
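The per-mode swap participation the review describes (MLA skips the value swap; DSA uses the indexer slot) can be sketched as a standalone helper. This is a sketch only; the pool names are illustrative, not the PR's actual data structures:

```python
# Sketch of which cache pools participate in an all-layers swap, following
# the per-mode branching this PR introduces. Pool names are illustrative.
def swap_plan(is_mla, is_dsa, fp8=False):
    pools = ["key"]                     # key cache is swapped in every mode
    if is_dsa:
        pools.append("indexer")         # DSA: uint8 indexer pool, no value
    elif not is_mla:
        pools.append("value")           # value cache only exists for GQA/MHA
        if fp8:
            pools += ["key_scale", "value_scale"]
    return pools

print(swap_plan(is_mla=True, is_dsa=False))             # ['key']
print(swap_plan(is_mla=False, is_dsa=True))             # ['key', 'indexer']
print(swap_plan(is_mla=False, is_dsa=False, fp8=True))  # ['key', 'value', 'key_scale', 'value_scale']
```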
```python
    )
    if self._is_dsa:
        kv_cache_quant_type = "uint8"
        key_cache_shape, value_cache_shape = attn_backend.get_kv_cache_shape(
```
🔴 Bug: the DSA branch of `initialize_mtp_kv_cache` unpacks the wrong number of values, and `indexer_cache_shape` is undefined
The DSA backend's `get_kv_cache_shape` returns 3 values (as correctly unpacked in `initialize_dsa_kv_cache` and `initialize_host_cache` as `key_cache_shape, _, indexer_cache_shape`), but here only 2 values are unpacked, which will raise `ValueError: too many values to unpack`.
In addition, `indexer_cache_shape` is never assigned in the `_is_dsa=True` branch, so the later `elif indexer_cache_shape:` will raise `NameError`.
Suggested fix:

```python
if self._is_dsa:
    kv_cache_quant_type = "uint8"
    key_cache_shape, _, indexer_cache_shape = attn_backend.get_kv_cache_shape(
        max_num_blocks=num_gpu_blocks, kv_cache_quant_type=kv_cache_quant_type
    )
    value_cache_shape = []  # DSA has no value cache
    cache_dtype = "uint8"
else:
    key_cache_shape, value_cache_shape = attn_backend.get_kv_cache_shape(
        max_num_blocks=num_gpu_blocks, kv_cache_quant_type=kv_cache_quant_type
    )
    indexer_cache_shape = []
    cache_dtype = self.model_config.dtype
```
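The unpack failure described in the comment is ordinary Python tuple-unpacking behavior; a minimal reproduction independent of FastDeploy (the backend function below is a stand-in, not real FastDeploy code):

```python
# A 3-tuple assigned to 2 targets raises ValueError, which is exactly what
# happens when a DSA backend's 3-value get_kv_cache_shape() result is
# unpacked into only (key_cache_shape, value_cache_shape).
def three_value_backend():  # stand-in for a DSA get_kv_cache_shape()
    return [64, 8, 128], [], [64, 1, 128]

try:
    key_shape, value_shape = three_value_backend()
except ValueError as e:
    print(type(e).__name__)  # prints: ValueError
```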
The CI report below is generated from the current code (refreshed every 30 minutes).

1 Task overview: 1 Required task is currently failing and should be handled first; another 4 Required tasks are pending and 1 is running.
2 Task status summary
2.1 Required tasks: 2/8 passed
2.2 Optional tasks: 22/30 passed
3 Failure details (Required only): Approval — code conventions (confidence: high)
Suggested fix: approval on this PR from @xyxinyang or @zyyzghb is sufficient.
Related change: the PR adds DSA kv cache related `logger.info` calls (GQA/MLA/DSA kv cache initialization logs).
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff            @@
##           develop    #7770   +/-   ##
==========================================
  Coverage         ?   71.49%
==========================================
  Files            ?      396
  Lines            ?    55789
  Branches         ?     8730
==========================================
  Hits             ?    39887
  Misses           ?    13157
  Partials         ?     2745
```
This pull request introduces support for the MLA and DSA cache layouts in the FastDeploy cache manager and transfer manager. The changes generalize cache initialization, naming, and transfer logic to handle these new attention mechanisms alongside the existing GQA/MHA (Grouped-Query / Multi-Head Attention) path. The core logic now adapts dynamically to the attention type, ensuring correct cache allocation and data movement for each mode.
Key changes include:
Support for MLA and DSA cache layouts:
- Detection based on `kv_lora_rank` and `index_head_dim` in `model_config`, enabling conditional logic throughout the cache and transfer managers. [1] [2]
- New `initialize_mla_kv_cache` (key-only, no value/indexer) and `initialize_dsa_kv_cache` (key + indexer, uint8), with correct tensor allocation and naming.
- `_get_cache_names` updated to produce the correct set of cache keys for each attention type, ensuring consistency across device and host caches.

Generalization of cache initialization and transfer:

- `initialize_kv_cache`, `initialize_mtp_kv_cache`, and `initialize_host_cache` branch into MLA/DSA/GQA logic, with correct dtype, shape, and quantization handling. [1] [2] [3] [4]

Cache naming and mapping consistency:
These changes collectively enable flexible and robust support for new attention mechanisms, improving the extensibility and correctness of FastDeploy's cache management.
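The three-way initialization dispatch summarized above might be skeletonized as follows. The names `initialize_kv_cache`, `initialize_mla_kv_cache`, and `initialize_dsa_kv_cache` come from this PR; `initialize_gqa_kv_cache` and the placeholder bodies are hypothetical:

```python
# Hypothetical skeleton of the dispatch in initialize_kv_cache. Only the
# dispatch shape is taken from this PR; the bodies are placeholders.
class CacheControllerSketch:
    def __init__(self, is_mla=False, is_dsa=False):
        self._is_mla = is_mla
        self._is_dsa = is_dsa

    def initialize_kv_cache(self):
        if self._is_dsa:
            return self.initialize_dsa_kv_cache()
        if self._is_mla:
            return self.initialize_mla_kv_cache()
        return self.initialize_gqa_kv_cache()

    def initialize_dsa_kv_cache(self):
        return "dsa: key pool + uint8 indexer pool"

    def initialize_mla_kv_cache(self):
        return "mla: key pool only"

    def initialize_gqa_kv_cache(self):
        return "gqa: key + value pools (+ optional fp8 scales)"

print(CacheControllerSketch(is_dsa=True).initialize_kv_cache())
# dsa: key pool + uint8 indexer pool
```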
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Format your code, run `pre-commit` before commit.
- Add unit tests. Please write the reason in this PR if no unit tests.
- Provide accuracy results.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.