wip: dsa for v1 cache manager #7770
Conversation
Thanks for your contribution!
Pull request overview
This PR enables FastDeploy's v1 cache controller/transfer manager to support two new KV cache layouts, MLA and DSA, generalizing cache initialization, naming, and swap logic from the original GQA/MHA path into dynamic branching by attention type.
Changes:
- In the v1 TransferManager, add MLA/DSA mode detection, and distinguish MLA (key only), DSA (key + indexer), and GQA (key + value [+ scales]) when building layer indices and in the swap logic.
- In the v1 CacheController, adjust the cache naming rules and add dedicated MLA/DSA device cache initialization functions.
- In the v1 host cache initialization, add a DSA shape branch (though the current implementation still has a critical uncovered allocation and a branch-consistency problem).
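The mode detection behind these changes can be sketched as a standalone helper. The flag names `_is_mla` / `_is_dsa` and the config fields `kv_lora_rank` / `index_head_dim` come from this PR's description; the exact derivation below is an assumption:

```python
# Hypothetical standalone version of the _is_mla/_is_dsa detection this PR
# adds. The PR derives the flags from kv_lora_rank and index_head_dim in
# model_config; the truthiness-based derivation here is an assumption.
def detect_cache_mode(kv_lora_rank, index_head_dim):
    is_dsa = bool(index_head_dim)   # DSA: key pool + uint8 indexer pool
    is_mla = bool(kv_lora_rank)     # MLA: key pool only, no value pool
    if is_dsa:
        return "dsa"
    if is_mla:
        return "mla"
    return "gqa"                    # GQA/MHA: key + value (+ fp8 scales)

print(detect_cache_mode(512, 128))    # dsa
print(detect_cache_mode(512, None))   # mla
print(detect_cache_mode(None, None))  # gqa
```

The branch order mirrors the `if self._is_dsa: ... elif self._is_mla: ... else:` pattern used throughout the diff.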
PR title/description check (needs adjustment)
- The title is currently `wip: Feat/hisparse`, which does not follow the repository's required `[CLASS] Title` format; a suggestion: `[KVCache][Feature] Support MLA/DSA cache layout in v1 cache manager`.
- The Motivation/Modifications/Usage sections of the description template are still mostly empty. Consider adding key information such as why the MLA/DSA host swap is needed, whether the naming change is breaking, and the compatibility strategy, and confirm whether related documentation (e.g. environment-variable/configuration notes or the KVCache usage docs) needs updating.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| fastdeploy/cache_manager/v1/transfer_manager.py | Adds MLA/DSA mode detection and swap branches, but the synchronous all-layer swap has a flaw in its indexer coverage for DSA |
| fastdeploy/cache_manager/v1/cache_controller.py | Adds MLA/DSA initialization and naming-rule changes, but host cache allocation and the MTP/DSA shape unpacking have blocking issues |
```python
            self._device_id,
            mode,
        )
        swap_cache_all_layers(
            self._device_value_caches,
            self._host_value_ptrs,
            self._num_host_blocks,
            device_block_ids,
            host_block_ids,
            self._device_id,
            mode,
        )
        if self._is_fp8_quantization() and self._device_key_scales and self._host_key_scales_ptrs:
        # Value cache is only used in GQA
        if not self._is_mla and self._device_value_caches:
            swap_cache_all_layers(
                self._device_value_caches,
                self._host_value_ptrs,
                self._num_host_blocks,
                device_block_ids,
                host_block_ids,
                self._device_id,
                mode,
            )
        # Scale cache is only used in GQA + fp8 quantization
        if (
            not self._is_mla
            and self._is_fp8_quantization()
            and self._device_key_scales
            and self._host_key_scales_ptrs
        ):
```
```python
names = {
    "key": f"key_cache_{layer_idx}_rank{local_rank}.device{self._device_id}",
}

if self._is_dsa:
    names["indexer"] = f"indexer_caches_{layer_idx}_rank{local_rank}.device{self._device_id}"
elif self._is_mla:
    pass  # MLA: only key, no value, no indexer
else:
    # GQA/MHA: key + value + optional scales
    names["value"] = f"value_caches_{layer_idx}_rank{local_rank}.device{self._device_id}"
    names["key_scale"] = f"key_cache_scales_{layer_idx}_rank{local_rank}.device{self._device_id}"
    names["value_scale"] = f"value_cache_scales_{layer_idx}_rank{local_rank}.device{self._device_id}"
```
```python
# Get kv cache shape (pass num_host_blocks as max_num_blocks for host cache)
key_cache_shape, value_cache_shape = attn_backend.get_kv_cache_shape(
    max_num_blocks=num_host_blocks, kv_cache_quant_type=kv_cache_quant_type
)
if self._is_dsa:
    kv_cache_quant_type = "uint8"
    key_cache_shape, _, indexer_cache_shape = attn_backend.get_kv_cache_shape(
        max_num_blocks=num_host_blocks, kv_cache_quant_type=kv_cache_quant_type
    )
    value_cache_shape = []
else:
```
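Because `get_kv_cache_shape` returns two shapes on the non-DSA path and three on the DSA path, a small normalization helper keeps the unpacking in one place and avoids the arity mismatch flagged later in this review (a sketch; `unpack_kv_cache_shapes` is a hypothetical helper and the shape values are illustrative):

```python
# Sketch: normalize get_kv_cache_shape() results to (key, value, indexer).
# DSA backends return three shapes (key, value, indexer); other backends
# return two (key, value). DSA has no value cache, so it is emptied here.
def unpack_kv_cache_shapes(shapes, is_dsa):
    if is_dsa:
        key_shape, _, indexer_shape = shapes
        return key_shape, [], indexer_shape
    key_shape, value_shape = shapes
    return key_shape, value_shape, []

print(unpack_kv_cache_shapes(([64, 8], [64, 8], [64, 1]), is_dsa=True))
# ([64, 8], [], [64, 1])
```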
```python
paddle.device.cuda.empty_cache()
logger.info("MLA kv cache initialized!")

if self._is_dsa:
    kv_cache_quant_type = "uint8"
    key_cache_shape, value_cache_shape = attn_backend.get_kv_cache_shape(
        max_num_blocks=num_gpu_blocks, kv_cache_quant_type=kv_cache_quant_type
    )
    cache_dtype = "uint8"
else:
    key_cache_shape, value_cache_shape = attn_backend.get_kv_cache_shape(
        max_num_blocks=num_gpu_blocks, kv_cache_quant_type=kv_cache_quant_type
    )
    indexer_cache_shape = []
    cache_dtype = self.model_config.dtype
```
```python
names = {
    "key": f"key_cache_{layer_idx}_rank{local_rank}.device{self._device_id}",
}

if self._is_dsa:
    names["indexer"] = f"indexer_caches_{layer_idx}_rank{local_rank}.device{self._device_id}"
elif self._is_mla:
```
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-11 14:51:09
📋 Review Summary
PR overview: adds MLA and DSA cache layout support to the v1 cache manager, generalizing cache initialization, naming, and transfer logic.
Changed files: fastdeploy/cache_manager/v1/cache_controller.py, fastdeploy/cache_manager/v1/transfer_manager.py
Impact tag: [KVCache]
📝 PR Convention Check
The PR title lacks a valid [Tag], and the `wip:` prefix does not follow the convention. The Motivation, Modifications, and Usage or Command sections of the PR description are all empty.
Suggested title (copy-paste ready):
[KVCache] Add MLA and DSA cache layout support in v1 cache manager
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
The v1 cache manager previously supported only the GQA/MHA path. Supporting the MLA layout (Multi-head Latent Attention; key cache only, no value) used by DeepSeek-style models, and the DSA layout (HiSparse; key + uint8 indexer dual pools), requires generalizing cache initialization, naming, and transfer logic.
## Modifications
- `cache_controller.py`:
  - Add `_is_mla` / `_is_dsa` flags (derived from `kv_lora_rank` and `index_head_dim`)
  - Unify naming in `_get_cache_names`: GQA uses key/value/scales, DSA uses key/indexer, MLA uses key only; the key cache name changes from `key_caches_` to `key_cache_`
  - Add `initialize_mla_kv_cache` (key-only; calls set_data_ipc for pinned memory)
  - Add `initialize_dsa_kv_cache` (key + uint8 indexer; calls set_data_ipc)
  - `initialize_kv_cache` dispatches to the three paths; `initialize_mtp_kv_cache` and `initialize_host_cache` gain matching DSA branches
- `transfer_manager.py`:
  - Add `_is_mla` / `_is_dsa` flags
  - `_build_device_layer_indices` / `_build_host_layer_indices` fill device/host indices per mode
  - Swap paths (all-layers / single-layer / async) skip the value swap for MLA; DSA swaps through the indexer slot
## Usage or Command
N/A
## Accuracy Tests
- DSA centralized verification (gsm8k):
```bash
python3 gsm8k.py
🎯 Evaluation Complete: Accuracy = 94.06% (649/690)
```
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | cache_controller.py:460 | `initialize_mtp_kv_cache` DSA branch: wrong number of unpacked values (2 vs 3), and `indexer_cache_shape` is undefined |
| 🟡 Suggestion | tests/cache_manager/v1/ | The `key_caches_` naming in the tests was not updated to `key_cache_` alongside the implementation; the affected tests will fail |
| 🟡 Suggestion | cache_controller.py:368 | `initialize_mla_kv_cache` does not call `initialize_host_cache`; if swap is enabled it will fail silently |
Overall assessment
The overall design is sound, and the three-way MLA/DSA dispatch architecture is clear. However, the DSA branch of `initialize_mtp_kv_cache` has a P0 bug that causes a runtime crash, and the key cache naming change was not propagated to the test files; both need fixing before merge.
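The per-mode swap participation the review describes (MLA skips the value swap; DSA uses the indexer slot) can be sketched as a standalone helper. This is a sketch only; the pool names are illustrative, not the PR's actual data structures:

```python
# Sketch of which cache pools participate in an all-layers swap, following
# the per-mode branching this PR introduces. Pool names are illustrative.
def swap_plan(is_mla, is_dsa, fp8=False):
    pools = ["key"]                     # key cache is swapped in every mode
    if is_dsa:
        pools.append("indexer")         # DSA: uint8 indexer pool, no value
    elif not is_mla:
        pools.append("value")           # value cache only exists for GQA/MHA
        if fp8:
            pools += ["key_scale", "value_scale"]
    return pools

print(swap_plan(is_mla=True, is_dsa=False))             # ['key']
print(swap_plan(is_mla=False, is_dsa=True))             # ['key', 'indexer']
print(swap_plan(is_mla=False, is_dsa=False, fp8=True))  # ['key', 'value', 'key_scale', 'value_scale']
```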
```python
    )
    if self._is_dsa:
        kv_cache_quant_type = "uint8"
        key_cache_shape, value_cache_shape = attn_backend.get_kv_cache_shape(
```
🔴 Bug: the DSA branch of `initialize_mtp_kv_cache` unpacks the wrong number of values, and `indexer_cache_shape` is undefined
The DSA backend's `get_kv_cache_shape` returns 3 values (as correctly unpacked in `initialize_dsa_kv_cache` and `initialize_host_cache` as `key_cache_shape, _, indexer_cache_shape`), but here only 2 values are unpacked, which will raise `ValueError: too many values to unpack`.
In addition, `indexer_cache_shape` is never assigned in the `_is_dsa=True` branch, so the later `elif indexer_cache_shape:` will raise `NameError`.
Suggested fix:

```python
if self._is_dsa:
    kv_cache_quant_type = "uint8"
    key_cache_shape, _, indexer_cache_shape = attn_backend.get_kv_cache_shape(
        max_num_blocks=num_gpu_blocks, kv_cache_quant_type=kv_cache_quant_type
    )
    value_cache_shape = []  # DSA has no value cache
    cache_dtype = "uint8"
else:
    key_cache_shape, value_cache_shape = attn_backend.get_kv_cache_shape(
        max_num_blocks=num_gpu_blocks, kv_cache_quant_type=kv_cache_quant_type
    )
    indexer_cache_shape = []
    cache_dtype = self.model_config.dtype
```
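The unpack failure described in the comment is ordinary Python tuple-unpacking behavior; a minimal reproduction independent of FastDeploy (the backend function below is a stand-in, not real FastDeploy code):

```python
# A 3-tuple assigned to 2 targets raises ValueError, which is exactly what
# happens when a DSA backend's 3-value get_kv_cache_shape() result is
# unpacked into only (key_cache_shape, value_cache_shape).
def three_value_backend():  # stand-in for a DSA get_kv_cache_shape()
    return [64, 8, 128], [], [64, 1, 128]

try:
    key_shape, value_shape = three_value_backend()
except ValueError as e:
    print(type(e).__name__)  # prints: ValueError
```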
The CI report below is generated from the current code (refreshed every 30 minutes).

1 Task overview: 1 Required task is currently failing and should be handled first; another 4 Required tasks are pending and 1 is running.
2 Task status summary
2.1 Required tasks: 2/8 passed
2.2 Optional tasks: 22/30 passed
3 Failure details (Required only): Approval — code conventions (confidence: high)
Suggested fix: approval on this PR from @xyxinyang or @zyyzghb is sufficient.
Related change: the PR adds DSA kv cache related `logger.info` calls (GQA/MLA/DSA kv cache initialization logs).
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff            @@
##           develop    #7770   +/-   ##
==========================================
  Coverage         ?   71.49%
==========================================
  Files            ?      396
  Lines            ?    55789
  Branches         ?     8730
==========================================
  Hits             ?    39887
  Misses           ?    13157
  Partials         ?     2745
```
This pull request introduces support for the MLA and DSA cache layouts in the FastDeploy cache manager and transfer manager. The changes generalize cache initialization, naming, and transfer logic to handle these new attention mechanisms alongside the existing GQA/MHA (Grouped-Query / Multi-Head Attention) path. The core logic now adapts dynamically to the attention type, ensuring correct cache allocation and data movement for each mode.
Key changes include:
Support for MLA and DSA cache layouts:
- Detection based on `kv_lora_rank` and `index_head_dim` in `model_config`, enabling conditional logic throughout the cache and transfer managers. [1] [2]
- New `initialize_mla_kv_cache` (key-only, no value/indexer) and `initialize_dsa_kv_cache` (key + indexer, uint8), with correct tensor allocation and naming.
- `_get_cache_names` updated to produce the correct set of cache keys for each attention type, ensuring consistency across device and host caches.

Generalization of cache initialization and transfer:

- `initialize_kv_cache`, `initialize_mtp_kv_cache`, and `initialize_host_cache` branch into MLA/DSA/GQA logic, with correct dtype, shape, and quantization handling. [1] [2] [3] [4]

Cache naming and mapping consistency:
These changes collectively enable flexible and robust support for new attention mechanisms, improving the extensibility and correctness of FastDeploy's cache management.
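The three-way initialization dispatch summarized above might be skeletonized as follows. The names `initialize_kv_cache`, `initialize_mla_kv_cache`, and `initialize_dsa_kv_cache` come from this PR; `initialize_gqa_kv_cache` and the placeholder bodies are hypothetical:

```python
# Hypothetical skeleton of the dispatch in initialize_kv_cache. Only the
# dispatch shape is taken from this PR; the bodies are placeholders.
class CacheControllerSketch:
    def __init__(self, is_mla=False, is_dsa=False):
        self._is_mla = is_mla
        self._is_dsa = is_dsa

    def initialize_kv_cache(self):
        if self._is_dsa:
            return self.initialize_dsa_kv_cache()
        if self._is_mla:
            return self.initialize_mla_kv_cache()
        return self.initialize_gqa_kv_cache()

    def initialize_dsa_kv_cache(self):
        return "dsa: key pool + uint8 indexer pool"

    def initialize_mla_kv_cache(self):
        return "mla: key pool only"

    def initialize_gqa_kv_cache(self):
        return "gqa: key + value pools (+ optional fp8 scales)"

print(CacheControllerSketch(is_dsa=True).initialize_kv_cache())
# dsa: key pool + uint8 indexer pool
```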
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Format your code, run `pre-commit` before commit.
- Add unit tests. Please write the reason in this PR if no unit tests.
- Provide accuracy results.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.