[KVCache][BugFix] Fix cache_controller_v1 kv_cache_quant_type dtype and null value_cache_shape crash#7757
… dtype when kv_cache_quant_type is set

When `enable_cache_manager_v1=True` and `kv_cache_quant_type` is configured (e.g., int8), cache_controller_v1 was allocating KV cache tensors using the model compute dtype (bfloat16) instead of uint8. This caused a C++ dtype-mismatch crash in `append_attention_gpu`, because the attention kernel accesses int8/fp8 quantized caches as `uint8_t*` internally.

Fix: use `"uint8"` as the cache allocation dtype whenever `kv_cache_quant_type` is not None, consistent with how gpu_model_runner handles this in the non-v1 code path.

Affected: `initialize_kv_cache()` and `initialize_mtp_kv_cache()` in `CacheController`.
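The dtype rule described above can be sketched as a tiny helper. This is an illustrative approximation only: `select_kv_cache_dtype` is a hypothetical name, not FastDeploy's actual code.

```python
# Hedged sketch of the fix's core rule: quantized KV caches are allocated
# as raw uint8 bytes, because the attention kernel reads int8/fp8 caches
# through a uint8_t* pointer; otherwise the model compute dtype is used.
def select_kv_cache_dtype(kv_cache_quant_type, model_dtype):
    if kv_cache_quant_type is not None:  # e.g. "int8", "block_wise_fp8"
        return "uint8"
    return model_dtype  # e.g. "bfloat16"


assert select_kv_cache_dtype("int8", "bfloat16") == "uint8"
assert select_kv_cache_dtype(None, "bfloat16") == "bfloat16"
```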
Thanks for your contribution!
CI report generated from the code below (refreshed every 30 minutes):
1 Task overview: CI tasks are still running: 3 Required tasks in progress, 1 Required task pending. No Required task has failed so far; overall status is good.
2 Task status summary
2.1 Required tasks: 6/10 passed
2.2 Optional tasks: 22/26 passed
3 Failure details (required only): no required tasks failed.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7757 +/- ##
==========================================
Coverage ? 71.68%
==========================================
Files ? 396
Lines ? 55713
Branches ? 8713
==========================================
Hits ? 39939
Misses ? 13030
Partials ? 2744
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
## Motivation
PR PaddlePaddle#7757 modified `initialize_kv_cache` and `initialize_mtp_kv_cache` so that the quantized case (`kv_cache_quant_type is not None`) stores the cache as uint8 and the non-quantized case uses `model_config.dtype`. This change adds the corresponding unit tests.
## Modifications
Added a `TestInitializeKVCacheDtype` test class (6 cases):
- without quantization, `initialize_kv_cache` uses `model_config.dtype` (bfloat16/float16)
- with int8 quantization, `initialize_kv_cache` uses uint8
- with block_wise_fp8 quantization, the key/value tensors from `initialize_kv_cache` use uint8
- without quantization, `initialize_mtp_kv_cache` uses `model_config.dtype`
- with int8 quantization, `initialize_mtp_kv_cache` uses uint8
- under quantization, the tensors stored in `cache_kvs_map` are also uint8
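The six cases above can be sketched roughly as a unittest suite. This is a self-contained approximation, not the real test file: the actual tests drive `CacheController` against Paddle tensors, while this sketch exercises only the dtype-selection rule through a hypothetical helper.

```python
# Illustrative sketch of the dtype test cases (helper name is hypothetical).
import unittest


def select_kv_cache_dtype(kv_cache_quant_type, model_dtype):
    # Quantized caches (int8 / block_wise_fp8) are stored as raw uint8 bytes.
    return "uint8" if kv_cache_quant_type is not None else model_dtype


class TestInitializeKVCacheDtype(unittest.TestCase):
    def test_no_quant_uses_model_dtype(self):
        self.assertEqual(select_kv_cache_dtype(None, "bfloat16"), "bfloat16")
        self.assertEqual(select_kv_cache_dtype(None, "float16"), "float16")

    def test_int8_quant_uses_uint8(self):
        self.assertEqual(select_kv_cache_dtype("int8", "bfloat16"), "uint8")

    def test_block_wise_fp8_uses_uint8(self):
        self.assertEqual(select_kv_cache_dtype("block_wise_fp8", "float16"), "uint8")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestInitializeKVCacheDtype)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```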
…e is None

Motivation: `initialize_kv_cache` and `initialize_mtp_kv_cache` in `CacheControllerV1` unconditionally create a value cache tensor, which causes a crash (None shape) for attention backends that return `value_cache_shape=None` (e.g. MLA variants).

Modifications:
- `initialize_kv_cache`: handle `get_kv_cache_shape` returning None for `value_cache_shape`; only create `val_cache` / `val_cache_scales` when `value_cache_shape` is not None; `cache_kvs_list` order now matches `gpu_model_runner.py`: `[key]` or `[key, val]`.
- `initialize_mtp_kv_cache`: apply the same fix for MTP layers.
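The conditional allocation can be sketched as follows. This is a minimal stand-in, with illustrative names throughout: the real code calls `paddle.full`, which is stubbed here with a plain function so the example is self-contained.

```python
# Hedged sketch of the value_cache_shape=None handling (all names illustrative).
def full(shape, fill_value, dtype):
    # Stand-in for paddle.full: reject a None shape, as paddle would crash.
    if shape is None:
        raise ValueError("shape must not be None")
    return {"shape": tuple(shape), "fill": fill_value, "dtype": dtype}


def init_layer_caches(key_cache_shape, value_cache_shape, cache_dtype):
    """Allocate per-layer caches, skipping the value cache when the
    attention backend (e.g. MLA) reports value_cache_shape=None."""
    cache_kvs_list = [full(key_cache_shape, 0, cache_dtype)]
    if value_cache_shape is not None:
        cache_kvs_list.append(full(value_cache_shape, 0, cache_dtype))
    return cache_kvs_list  # [key] or [key, val], matching gpu_model_runner.py


# MLA-style backend: only the key cache is allocated.
assert len(init_layer_caches((8, 64), None, "uint8")) == 1
assert len(init_layer_caches((8, 64), (8, 64), "uint8")) == 2
```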
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-11 14:31:09
📋 Review summary
PR overview: fixes a bug where cache_manager_v1, with KV cache quantization enabled, allocated cache tensors with the model dtype instead of uint8, and adds safe handling for value_cache_shape being None on MLA architectures.
Scope: fastdeploy/cache_manager/v1/cache_controller.py, tests/cache_manager/v1/
Impact tags: [KVCache] [BugFix]
📝 PR convention check
The title contains two tags ([KVCache][BugFix]); per the convention, each PR title should contain exactly one official tag. The PR description does not use the official template (it is missing the standard section headings such as ## Motivation and ## Modifications).
Suggested title (copy-paste ready):
[BugFix] fix cache_manager_v1 allocating kv cache with wrong dtype when kv_cache_quant_type is set
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
When `enable_cache_manager_v1=True` and a KV cache quantization type is set (e.g. `kv_cache_quant_type=int8`), `CacheController.initialize_kv_cache` and `initialize_mtp_kv_cache` still allocate KV cache tensors with the model compute dtype (e.g. `bfloat16`) instead of the `uint8` required by quantization, causing the C++ operator `append_attention_gpu` to crash with a dtype mismatch: `The type of data we are trying to retrieve (uint8) does not match the type of data (bfloat16)`.
## Modifications
- `fastdeploy/cache_manager/v1/cache_controller.py`: in `initialize_kv_cache` and `initialize_mtp_kv_cache`, determine `cache_dtype` from `kv_cache_quant_type` (`"uint8"` when quantized, `model_config.dtype` otherwise), consistent with the non-v1 path in `gpu_model_runner.py`; also handle `value_cache_shape` being `None` (MLA architectures) by allocating only the key cache.
- `tests/cache_manager/v1/test_cache_controller.py`: add a `TestInitializeKVCacheDtype` test class covering dtype correctness for the non-quantized, int8, block_wise_fp8, and MLA (`value_cache_shape=None`) cases.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 📝 PR convention | PR title | The title contains two tags ([KVCache][BugFix]); the convention requires exactly one |
| 📝 PR convention | PR description | The description is missing the standard section headings (## Motivation, ## Modifications, ## Usage or Command, ## Accuracy Tests, ## Checklist) and failed the structure check |
Overall assessment
The fix logic is correct, consistent with the non-v1 path, and comes with thorough unit tests. It can be merged once the title (single tag) and description (standard template) are brought in line with the conventions.
Problem description

This PR fixes two KV cache initialization bugs in `CacheControllerV1`:

1. Wrong dtype when KV cache quantization is enabled

When `enable_cache_manager_v1=True` and KV cache quantization is configured (e.g. `kv_cache_quant_type=int8`), `CacheController.initialize_kv_cache` and `initialize_mtp_kv_cache` still allocate KV cache tensors with the model compute dtype (bfloat16) instead of the `uint8` required by quantization. This crashes the C++ operator `append_attention_gpu`.

Root cause: the attention layer's `cache_quant_type_str` is correctly set to `"cache_int8"` and the C++ kernel accesses the cache as `uint8_t*`, but the v1 cache controller allocates the cache as bfloat16, producing a dtype mismatch.

2. Crash when value_cache_shape is None (MLA/DeepSeek)

`initialize_kv_cache` and `initialize_mtp_kv_cache` unconditionally create a value cache tensor. For attention backends that return `value_cache_shape=None` (e.g. MLA variants / DeepSeek), `paddle.full(shape=None, ...)` crashes. `gpu_model_runner.py` already handles this case correctly; this PR aligns `cache_controller_v1` with that logic.

Fix

- `cache_controller.py`: determine `cache_dtype` from `kv_cache_quant_type`: quantized → `"uint8"`, non-quantized → `model_config.dtype`
- `cache_controller.py`: check `if value_cache_shape` before creating `val_cache` / `val_cache_scales`; `cache_kvs_list` order matches `gpu_model_runner.py`: `[key]` or `[key, val]`
- `initialize_mtp_kv_cache`: apply the same two fixes

Affected scope

- `CacheController.initialize_kv_cache()`
- `CacheController.initialize_mtp_kv_cache()`
- `initialize_host_cache()` (already correctly uses `cache_config.cache_dtype`)

Unit tests

New test coverage:
- `initialize_kv_cache` / `initialize_mtp_kv_cache` dtype verification under no quantization, int8, and block_wise_fp8
- `initialize_kv_cache` / `initialize_mtp_kv_cache` behavior with `value_cache_shape=None` (MLA / DeepSeek)
- `cache_kvs_map` dtype verification

Checklist