
wip: dsa for v1 cache manager #7770

Draft

Moonchild1227 wants to merge 4 commits into PaddlePaddle:develop from Moonchild1227:feat/hisparse

Conversation

@Moonchild1227 (Contributor)

This pull request introduces comprehensive support for MLA and DSA cache layouts in the FastDeploy cache manager and transfer manager. The changes generalize cache initialization, naming, and transfer logic to handle these new attention mechanisms alongside the existing GQA/MHA (Grouped-Query / Multi-Head Attention) path. The core logic now dynamically adapts to the attention type, ensuring correct cache allocation and data movement for each mode.

Key changes include:

Support for MLA and DSA cache layouts:

  • Added detection of MLA and DSA modes via kv_lora_rank and index_head_dim in model_config, enabling conditional logic throughout the cache and transfer managers. [1] [2]
  • Implemented specialized cache initializers: initialize_mla_kv_cache (key-only, no value/indexer) and initialize_dsa_kv_cache (key + indexer, uint8), with correct tensor allocation and naming.
  • Adapted _get_cache_names to produce the correct set of cache keys for each attention type, ensuring consistency across device and host caches.
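The mode-detection step above can be sketched in isolation. The `model_config` attribute names (`kv_lora_rank`, `index_head_dim`) come from the PR description, but the helper name `detect_attention_mode` and the treat-missing-as-disabled convention are assumptions for illustration, not the PR's actual code:

```python
from types import SimpleNamespace

def detect_attention_mode(model_config):
    """Return (is_mla, is_dsa) flags derived from model_config.

    Hypothetical helper: a model is treated as MLA when it defines a
    kv_lora_rank, and as DSA when it additionally defines index_head_dim.
    """
    is_mla = getattr(model_config, "kv_lora_rank", None) is not None
    is_dsa = getattr(model_config, "index_head_dim", None) is not None
    return is_mla, is_dsa

# GQA/MHA config defines neither field.
print(detect_attention_mode(SimpleNamespace()))                   # (False, False)
# MLA-style config defines kv_lora_rank only.
print(detect_attention_mode(SimpleNamespace(kv_lora_rank=512)))   # (True, False)
```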

Generalization of cache initialization and transfer:

  • Updated initialize_kv_cache, initialize_mtp_kv_cache, and initialize_host_cache to branch into MLA/DSA/GQA logic, with correct dtype, shape, and quantization handling. [1] [2] [3] [4]
  • Modified the transfer manager to correctly build device/host cache indices and perform swap operations for MLA (key-only), DSA (key + indexer), and GQA (key + value [+ scales]), including async and per-layer swaps. [1] [2] [3] [4] [5] [6] [7]
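A minimal sketch of the three-way swap dispatch described above. `plan_swap_pools` is a hypothetical helper, not the PR's code; it only lists which cache pools a swap would need to touch in each mode:

```python
def plan_swap_pools(is_mla: bool, is_dsa: bool, fp8_quantized: bool) -> list:
    """List the cache pools a device/host swap must cover.

    Sketch of the dispatch in the PR description: every mode swaps the
    key pool; DSA adds the indexer pool; GQA/MHA adds the value pool,
    plus scale pools under fp8 quantization.
    """
    pools = ["key"]
    if is_dsa:
        pools.append("indexer")
    elif not is_mla:  # GQA/MHA path
        pools.append("value")
        if fp8_quantized:
            pools += ["key_scale", "value_scale"]
    return pools

print(plan_swap_pools(is_mla=True, is_dsa=False, fp8_quantized=False))  # ['key']
print(plan_swap_pools(is_mla=False, is_dsa=True, fp8_quantized=False))  # ['key', 'indexer']
```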

Cache naming and mapping consistency:

  • Standardized cache naming conventions for all attention types, ensuring device/host cache maps and transfer logic remain in sync. [1] [2] [3]

These changes collectively enable flexible and robust support for new attention mechanisms, improving the extensibility and correctness of FastDeploy's cache management.

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

  • DSA verification
    • Centralized deployment

```
python3 gsm8k.py
🎯 Evaluation Complete: Accuracy = 94.06% (649/690)
```

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings May 11, 2026 06:42
@paddle-bot

paddle-bot Bot commented May 11, 2026

Thanks for your contribution!

@paddle-bot paddle-bot Bot added the contributor External developers label May 11, 2026
@Moonchild1227 Moonchild1227 changed the title wip: Feat/hisparse wip: dsa for v1 cache manager May 11, 2026
Copilot AI left a comment

Pull request overview

This PR makes FastDeploy's v1 cache controller/transfer manager support two new KV cache layouts, MLA and DSA, generalizing the cache initialization, naming, and swap logic from the original GQA/MHA path into dynamic per-attention-type branches.

Changes:

  • Added MLA/DSA mode detection to the v1 TransferManager, distinguishing MLA (key only), DSA (key + indexer), and GQA (key + value [+ scales]) when building layer indices and in the swap logic.
  • Adjusted the cache naming rules in the v1 CacheController and added MLA/DSA-specific device cache initializers.
  • Added a DSA shape branch to the v1 host cache initialization (though the current implementation still has critical allocation gaps and branch-consistency issues).

PR title/description check (adjustments needed)

  • The title is currently wip: Feat/hisparse, which does not follow the repository's required [CLASS]Title format; suggestion: [KVCache][Feature] Support MLA/DSA cache layout in v1 cache manager
  • The Motivation/Modifications/Usage sections of the description template are still mostly empty. Consider adding key information such as why MLA/DSA host swap is needed, whether the naming change is breaking, and the compatibility strategy, and confirm whether related docs (environment variable/config reference or KVCache usage docs) need updating.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File | Description
fastdeploy/cache_manager/v1/transfer_manager.py | Adds MLA/DSA mode detection and swap branches, but the synchronous all-layer swap does not cover the DSA indexer correctly
fastdeploy/cache_manager/v1/cache_controller.py | Adds MLA/DSA initializers and naming adjustments, but host cache allocation and the MTP/DSA shape unpacking contain blocking issues

Comment on lines 347 to +367:

```diff
         self._device_id,
         mode,
     )
-    swap_cache_all_layers(
-        self._device_value_caches,
-        self._host_value_ptrs,
-        self._num_host_blocks,
-        device_block_ids,
-        host_block_ids,
-        self._device_id,
-        mode,
-    )
-    if self._is_fp8_quantization() and self._device_key_scales and self._host_key_scales_ptrs:
+    # Value cache is only used in GQA
+    if not self._is_mla and self._device_value_caches:
+        swap_cache_all_layers(
+            self._device_value_caches,
+            self._host_value_ptrs,
+            self._num_host_blocks,
+            device_block_ids,
+            host_block_ids,
+            self._device_id,
+            mode,
+        )
+    # Scale cache is only used in GQA + fp8 quantization
+    if (
+        not self._is_mla
+        and self._is_fp8_quantization()
+        and self._device_key_scales
+        and self._host_key_scales_ptrs
+    ):
```
Comment on lines +236 to +248:

```python
names = {
    "key": f"key_cache_{layer_idx}_rank{local_rank}.device{self._device_id}",
}

if self._is_dsa:
    names["indexer"] = f"indexer_caches_{layer_idx}_rank{local_rank}.device{self._device_id}"
elif self._is_mla:
    pass  # MLA: only key, no value, no indexer
else:
    # GQA/MHA: key + value + optional scales
    names["value"] = f"value_caches_{layer_idx}_rank{local_rank}.device{self._device_id}"
    names["key_scale"] = f"key_cache_scales_{layer_idx}_rank{local_rank}.device{self._device_id}"
    names["value_scale"] = f"value_cache_scales_{layer_idx}_rank{local_rank}.device{self._device_id}"
```
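As a standalone illustration of the naming scheme in this excerpt, the same rules can be reproduced in a plain function (the layer/rank/device values below are made up for the example):

```python
def cache_names(layer_idx, local_rank, device_id, is_mla=False, is_dsa=False):
    """Mirror the _get_cache_names naming rules from the excerpt above."""
    names = {"key": f"key_cache_{layer_idx}_rank{local_rank}.device{device_id}"}
    if is_dsa:
        names["indexer"] = f"indexer_caches_{layer_idx}_rank{local_rank}.device{device_id}"
    elif not is_mla:
        # GQA/MHA: key + value + optional scales
        names["value"] = f"value_caches_{layer_idx}_rank{local_rank}.device{device_id}"
        names["key_scale"] = f"key_cache_scales_{layer_idx}_rank{local_rank}.device{device_id}"
        names["value_scale"] = f"value_cache_scales_{layer_idx}_rank{local_rank}.device{device_id}"
    return names

print(cache_names(0, 0, 0, is_dsa=True))
# {'key': 'key_cache_0_rank0.device0', 'indexer': 'indexer_caches_0_rank0.device0'}
```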
Comment on lines 672 to +679:

```diff
     # Get kv cache shape (pass num_host_blocks as max_num_blocks for host cache)
-    key_cache_shape, value_cache_shape = attn_backend.get_kv_cache_shape(
-        max_num_blocks=num_host_blocks, kv_cache_quant_type=kv_cache_quant_type
-    )
+    if self._is_dsa:
+        kv_cache_quant_type = "uint8"
+        key_cache_shape, _, indexer_cache_shape = attn_backend.get_kv_cache_shape(
+            max_num_blocks=num_host_blocks, kv_cache_quant_type=kv_cache_quant_type
+        )
+        value_cache_shape = []
+    else:

     paddle.device.cuda.empty_cache()
     logger.info("MLA kv cache initialized!")
```
Comment on lines +458 to +469:

```python
if self._is_dsa:
    kv_cache_quant_type = "uint8"
    key_cache_shape, value_cache_shape = attn_backend.get_kv_cache_shape(
        max_num_blocks=num_gpu_blocks, kv_cache_quant_type=kv_cache_quant_type
    )
    cache_dtype = "uint8"
else:
    key_cache_shape, value_cache_shape = attn_backend.get_kv_cache_shape(
        max_num_blocks=num_gpu_blocks, kv_cache_quant_type=kv_cache_quant_type
    )
    indexer_cache_shape = []
    cache_dtype = self.model_config.dtype
```
@PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-11 14:51:09

📋 Review summary

PR overview: adds MLA and DSA cache layout support to the v1 cache manager, generalizing cache initialization, naming, and transfer logic.
Changed files: fastdeploy/cache_manager/v1/cache_controller.py, fastdeploy/cache_manager/v1/transfer_manager.py
Impact tag: [KVCache]


📝 PR convention check

The PR title lacks a valid [Tag] and starts with wip:, which violates the convention. The Motivation, Modifications, and Usage or Command sections of the PR description are empty.

Suggested title (copy-paste ready):

  • [KVCache] Add MLA and DSA cache layout support in v1 cache manager

Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):

## Motivation
The v1 cache manager previously supported only the GQA/MHA path. To support the MLA layout (Multi-head Latent Attention: key cache only, no value) used by models such as DeepSeek, and the DSA layout (HiSparse: key + uint8 indexer dual pool), the cache initialization, naming, and transfer logic need to be generalized.

## Modifications
- `cache_controller.py`:
  - Add `_is_mla` / `_is_dsa` flags (based on `kv_lora_rank` and `index_head_dim`)
  - Unify naming in `_get_cache_names`: GQA uses key/value/scales, DSA uses key/indexer, MLA uses key only; the key cache name changes from `key_caches_` to `key_cache_`
  - Add `initialize_mla_kv_cache` (key-only; calls set_data_ipc to pin memory)
  - Add `initialize_dsa_kv_cache` (key + uint8 indexer; calls set_data_ipc)
  - `initialize_kv_cache` dispatches to the three paths; `initialize_mtp_kv_cache` and `initialize_host_cache` gain matching DSA branches
- `transfer_manager.py`:
  - Add `_is_mla` / `_is_dsa` flags
  - `_build_device_layer_indices` / `_build_host_layer_indices` populate device/host indices per mode
  - Swap paths (all-layers / single-layer / async) skip the value swap for MLA; DSA swaps through the indexer slot

## Usage or Command
N/A

## Accuracy Tests
- DSA centralized verification (gsm8k):
```bash
python3 gsm8k.py
🎯 Evaluation Complete: Accuracy = 94.06% (649/690)
```

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

Level | File | Summary
🔴 Bug | cache_controller.py:460 | DSA branch of initialize_mtp_kv_cache: wrong number of unpacked values (2 vs 3) and indexer_cache_shape is undefined
🟡 Suggestion | tests/cache_manager/v1/ | Tests still use the key_caches_ naming and were not updated to key_cache_ in step with the implementation; the affected tests will fail
🟡 Suggestion | cache_controller.py:368 | initialize_mla_kv_cache does not call initialize_host_cache; if swap is enabled it will fail silently

Overall assessment

The overall design is sound and the MLA/DSA three-way dispatch is clear. However, the DSA branch of initialize_mtp_kv_cache contains a P0 bug that crashes at runtime, and the key cache renaming was not propagated to the test files; these must be fixed before merging.


🔴 Bug: the DSA branch of initialize_mtp_kv_cache unpacks the wrong number of values, and indexer_cache_shape is undefined

The DSA backend's get_kv_cache_shape returns 3 values (as correctly unpacked into key_cache_shape, _, indexer_cache_shape in initialize_dsa_kv_cache and initialize_host_cache), but this branch unpacks only 2, which will raise ValueError: too many values to unpack.

In addition, indexer_cache_shape is never assigned in the _is_dsa=True branch, so the subsequent elif indexer_cache_shape: will raise a NameError.

Suggested fix:

```python
if self._is_dsa:
    kv_cache_quant_type = "uint8"
    key_cache_shape, _, indexer_cache_shape = attn_backend.get_kv_cache_shape(
        max_num_blocks=num_gpu_blocks, kv_cache_quant_type=kv_cache_quant_type
    )
    value_cache_shape = []  # DSA has no value cache
    cache_dtype = "uint8"
else:
    key_cache_shape, value_cache_shape = attn_backend.get_kv_cache_shape(
        max_num_blocks=num_gpu_blocks, kv_cache_quant_type=kv_cache_quant_type
    )
    indexer_cache_shape = []
    cache_dtype = self.model_config.dtype
```
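The unpack mismatch can be reproduced in isolation. `get_kv_cache_shape_dsa` below is a stand-in for the DSA backend's 3-value return described in this comment; the concrete shapes are invented for illustration:

```python
def get_kv_cache_shape_dsa(max_num_blocks):
    """Stand-in for the DSA backend's 3-value get_kv_cache_shape return.

    The shapes are made up; only the arity matters for the demonstration.
    """
    key_shape = [max_num_blocks, 64, 1, 576]
    value_shape = []
    indexer_shape = [max_num_blocks, 64, 1, 128]
    return key_shape, value_shape, indexer_shape

try:
    # Buggy 2-value unpack, as in the DSA branch of initialize_mtp_kv_cache:
    key_cache_shape, value_cache_shape = get_kv_cache_shape_dsa(8)
except ValueError as e:
    print(e)  # too many values to unpack (expected 2)

# Correct 3-value unpack, as in initialize_dsa_kv_cache:
key_cache_shape, _, indexer_cache_shape = get_kv_cache_shape_dsa(8)
```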

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-11 14:53:35

CI report generated against the code below (refreshed every 30 minutes):


1 Task overview

There is currently 1 failed Required task that needs priority handling, with 4 more Required tasks waiting and 1 running.

Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped
38(0) | 38 | 24 | 3 | 4 | 6 | 1

2 Task status summary

2.1 Required tasks: 2/8 passed

Required tasks block merging; failures must be handled first.

Status | Task | Duration | Root cause | Fix suggestion | Log | Rerun
❌ Approval | 6s | PR issue: new logger.info calls require designated RD approval for the logging change | Have @xyxinyang / @zyyzghb approve in the PR | Job | -
⏳ Run Base Tests / base_tests | - | Running | - | Job | -
⏸️ Extracted partial CE model tasks to run in CI. / run_ce_cases | - | Waiting | - | - | -
⏸️ Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Waiting | - | - | -
⏸️ Run Four Cards Tests / run_4_cards_tests | - | Waiting | - | - | -
⏸️ Run Stable Tests / stable_tests | - | Waiting | - | - | -
The remaining 2 required tasks passed | - | - | - | - | -

2.2 Optional tasks: 22/30 passed

Optional tasks do not block merging; their failures are informational only.

Status | Task | Duration | Log | Rerun
❌ Check PR Template | 18s | Job | -
❌ Cleanup artifacts | 7s | Job | -
The remaining 22 optional tasks passed | - | - | -

3 Failure details (Required only)

Approval (code convention, confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: code convention
  • Confidence: high
  • Root cause summary: the PR adds logger.info calls; logging changes require approval from a designated RD
  • Analyzer: generic analysis (fallback)

Root cause details:
This PR adds several logger.info calls in the DSA kv cache code. According to the approval rules in scripts/check_approval.sh, changes to logging behavior (.info/.debug/.error/log_request) require approval from at least one designated FastDeploy RD. The PR has not yet been approved by a designated approver, so the Approval task failed with exit code 6.

Key log:

```
Detected log modification in diff:
+        logger.info(
+        logger.info("GQA kv cache initialized!")
+        logger.info(f"Initializing MLA kv cache: ...")
+        logger.info("MLA kv cache initialized!")
+        logger.info("DSA kv cache initialized!")

0. You must have one FastDeploy RD (xyxinyang(zhouchong), zyyzghb(zhangyongyue)) approval for modifying logging behavior (.info/.debug/.error/log_request).
There are 1 approved errors.
##[error]Process completed with exit code 6.
```

Fix suggestions:

  1. Mention @xyxinyang (zhouchong) or @zyyzghb (zhangyongyue) in a PR comment to request approval; once approved, CI will rerun and pass.
  2. If some of the logger calls are not essential, consider removing the newly added logger.info statements.

Fix summary: approval from @xyxinyang or @zyyzghb in the PR is sufficient.

Related change: the PR adds logger.info calls for DSA kv cache (GQA/MLA/DSA kv cache initialization logs).
Link: view log

@codecov-commenter

Codecov Report

❌ Patch coverage is 18.79699% with 108 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@a2e216a). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/cache_manager/v1/cache_controller.py 4.70% 81 Missing ⚠️
fastdeploy/cache_manager/v1/transfer_manager.py 43.75% 22 Missing and 5 partials ⚠️
Additional details and impacted files
```
@@            Coverage Diff             @@
##             develop    #7770   +/-   ##
==========================================
  Coverage           ?   71.49%
==========================================
  Files              ?      396
  Lines              ?    55789
  Branches           ?     8730
==========================================
  Hits               ?    39887
  Misses             ?    13157
  Partials           ?     2745
```
Flag Coverage Δ
GPU 71.49% <18.79%> (?)

Flags with carried forward coverage won't be shown.



Labels

contributor External developers
