[KVCache][BugFix] Fix cache_controller_v1 kv_cache_quant_type dtype and null value_cache_shape crash#7757
… dtype when kv_cache_quant_type is set

When `enable_cache_manager_v1=True` and `kv_cache_quant_type` is configured (e.g., int8), cache_controller_v1 was allocating KV cache tensors using the model compute dtype (bfloat16) instead of uint8. This caused a C++ dtype-mismatch crash in `append_attention_gpu`, because the attention kernel accesses int8/fp8 quantized caches as `uint8_t*` internally.

Fix: use `"uint8"` as the cache allocation dtype whenever `kv_cache_quant_type` is not None, consistent with how gpu_model_runner handles this in the non-v1 code path.

Affected: `initialize_kv_cache()` and `initialize_mtp_kv_cache()` in `CacheController`.
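The dtype rule described above can be sketched as a tiny helper. This is an illustrative approximation only: `select_kv_cache_dtype` is a hypothetical name, not FastDeploy's actual code.

```python
# Hedged sketch of the fix's core rule: quantized KV caches are allocated
# as raw uint8 bytes, because the attention kernel reads int8/fp8 caches
# through a uint8_t* pointer; otherwise the model compute dtype is used.
def select_kv_cache_dtype(kv_cache_quant_type, model_dtype):
    if kv_cache_quant_type is not None:  # e.g. "int8", "block_wise_fp8"
        return "uint8"
    return model_dtype  # e.g. "bfloat16"


assert select_kv_cache_dtype("int8", "bfloat16") == "uint8"
assert select_kv_cache_dtype(None, "bfloat16") == "bfloat16"
```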
Thanks for your contribution!
CI report generated from the code below (refreshed every 30 minutes):
1 Task overview: CI tasks are still running: 3 Required tasks in progress, 1 Required task pending. No Required task has failed so far; overall status is good.
2 Task status summary
2.1 Required tasks: 6/10 passed
2.2 Optional tasks: 22/26 passed
3 Failure details (required only): no required tasks failed.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7757 +/- ##
==========================================
Coverage ? 71.68%
==========================================
Files ? 396
Lines ? 55713
Branches ? 8713
==========================================
Hits ? 39939
Misses ? 13030
Partials ? 2744
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
## Motivation
PR PaddlePaddle#7757 modified `initialize_kv_cache` and `initialize_mtp_kv_cache` so that the quantized case (`kv_cache_quant_type is not None`) stores the cache as uint8 and the non-quantized case uses `model_config.dtype`. This change adds the corresponding unit tests.
## Modifications
Added a `TestInitializeKVCacheDtype` test class (6 cases):
- without quantization, `initialize_kv_cache` uses `model_config.dtype` (bfloat16/float16)
- with int8 quantization, `initialize_kv_cache` uses uint8
- with block_wise_fp8 quantization, the key/value tensors from `initialize_kv_cache` use uint8
- without quantization, `initialize_mtp_kv_cache` uses `model_config.dtype`
- with int8 quantization, `initialize_mtp_kv_cache` uses uint8
- under quantization, the tensors stored in `cache_kvs_map` are also uint8
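The six cases above can be sketched roughly as a unittest suite. This is a self-contained approximation, not the real test file: the actual tests drive `CacheController` against Paddle tensors, while this sketch exercises only the dtype-selection rule through a hypothetical helper.

```python
# Illustrative sketch of the dtype test cases (helper name is hypothetical).
import unittest


def select_kv_cache_dtype(kv_cache_quant_type, model_dtype):
    # Quantized caches (int8 / block_wise_fp8) are stored as raw uint8 bytes.
    return "uint8" if kv_cache_quant_type is not None else model_dtype


class TestInitializeKVCacheDtype(unittest.TestCase):
    def test_no_quant_uses_model_dtype(self):
        self.assertEqual(select_kv_cache_dtype(None, "bfloat16"), "bfloat16")
        self.assertEqual(select_kv_cache_dtype(None, "float16"), "float16")

    def test_int8_quant_uses_uint8(self):
        self.assertEqual(select_kv_cache_dtype("int8", "bfloat16"), "uint8")

    def test_block_wise_fp8_uses_uint8(self):
        self.assertEqual(select_kv_cache_dtype("block_wise_fp8", "float16"), "uint8")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestInitializeKVCacheDtype)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```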
…e is None

Motivation: `initialize_kv_cache` and `initialize_mtp_kv_cache` in `CacheControllerV1` unconditionally create a value cache tensor, which causes a crash (None shape) for attention backends that return `value_cache_shape=None` (e.g. MLA variants).

Modifications:
- `initialize_kv_cache`: handle `get_kv_cache_shape` returning None for `value_cache_shape`; only create `val_cache` / `val_cache_scales` when `value_cache_shape` is not None; `cache_kvs_list` order now matches `gpu_model_runner.py`: `[key]` or `[key, val]`.
- `initialize_mtp_kv_cache`: apply the same fix for MTP layers.
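The conditional allocation can be sketched as follows. This is a minimal stand-in, with illustrative names throughout: the real code calls `paddle.full`, which is stubbed here with a plain function so the example is self-contained.

```python
# Hedged sketch of the value_cache_shape=None handling (all names illustrative).
def full(shape, fill_value, dtype):
    # Stand-in for paddle.full: reject a None shape, as paddle would crash.
    if shape is None:
        raise ValueError("shape must not be None")
    return {"shape": tuple(shape), "fill": fill_value, "dtype": dtype}


def init_layer_caches(key_cache_shape, value_cache_shape, cache_dtype):
    """Allocate per-layer caches, skipping the value cache when the
    attention backend (e.g. MLA) reports value_cache_shape=None."""
    cache_kvs_list = [full(key_cache_shape, 0, cache_dtype)]
    if value_cache_shape is not None:
        cache_kvs_list.append(full(value_cache_shape, 0, cache_dtype))
    return cache_kvs_list  # [key] or [key, val], matching gpu_model_runner.py


# MLA-style backend: only the key cache is allocated.
assert len(init_layer_caches((8, 64), None, "uint8")) == 1
assert len(init_layer_caches((8, 64), (8, 64), "uint8")) == 2
```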
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-11 14:31:09
📋 Review summary
PR overview: fixes a bug where cache_manager_v1, with KV cache quantization enabled, allocated cache tensors with the model dtype instead of uint8, and adds safe handling for value_cache_shape being None on MLA architectures.
Scope: fastdeploy/cache_manager/v1/cache_controller.py, tests/cache_manager/v1/
Impact tags: [KVCache] [BugFix]
📝 PR convention check
The title contains two tags ([KVCache][BugFix]); per the convention, each PR title should contain exactly one official tag. The PR description does not use the official template (it is missing the standard section headings such as ## Motivation and ## Modifications).
Suggested title (copy-paste ready):
[BugFix] fix cache_manager_v1 allocating kv cache with wrong dtype when kv_cache_quant_type is set
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
When `enable_cache_manager_v1=True` and a KV cache quantization type is set (e.g. `kv_cache_quant_type=int8`), `CacheController.initialize_kv_cache` and `initialize_mtp_kv_cache` still allocate KV cache tensors with the model compute dtype (e.g. `bfloat16`) instead of the `uint8` required by quantization, causing the C++ operator `append_attention_gpu` to crash with a dtype mismatch: `The type of data we are trying to retrieve (uint8) does not match the type of data (bfloat16)`.
## Modifications
- `fastdeploy/cache_manager/v1/cache_controller.py`: in `initialize_kv_cache` and `initialize_mtp_kv_cache`, determine `cache_dtype` from `kv_cache_quant_type` (`"uint8"` when quantized, `model_config.dtype` otherwise), consistent with the non-v1 path in `gpu_model_runner.py`; also handle `value_cache_shape` being `None` (MLA architectures) by allocating only the key cache.
- `tests/cache_manager/v1/test_cache_controller.py`: add a `TestInitializeKVCacheDtype` test class covering dtype correctness for the non-quantized, int8, block_wise_fp8, and MLA (`value_cache_shape=None`) cases.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 📝 PR convention | PR title | The title contains two tags ([KVCache][BugFix]); the convention requires exactly one |
| 📝 PR convention | PR description | The description is missing the standard section headings (## Motivation, ## Modifications, ## Usage or Command, ## Accuracy Tests, ## Checklist) and failed the structure check |
Overall assessment
The fix logic is correct, consistent with the non-v1 path, and comes with thorough unit tests. It can be merged once the title (single tag) and description (standard template) are brought in line with the conventions.
Problem description

This PR fixes two KV cache initialization bugs in `CacheControllerV1`:

1. Wrong dtype when KV cache quantization is enabled

When `enable_cache_manager_v1=True` and KV cache quantization is configured (e.g. `kv_cache_quant_type=int8`), `CacheController.initialize_kv_cache` and `initialize_mtp_kv_cache` still allocate KV cache tensors with the model compute dtype (bfloat16) instead of the `uint8` required by quantization. This crashes the C++ operator `append_attention_gpu`.

Root cause: the attention layer's `cache_quant_type_str` is correctly set to `"cache_int8"` and the C++ kernel accesses the cache as `uint8_t*`, but the v1 cache controller allocates the cache as bfloat16, producing a dtype mismatch.

2. Crash when value_cache_shape is None (MLA/DeepSeek)

`initialize_kv_cache` and `initialize_mtp_kv_cache` unconditionally create a value cache tensor. For attention backends that return `value_cache_shape=None` (e.g. MLA variants / DeepSeek), `paddle.full(shape=None, ...)` crashes. `gpu_model_runner.py` already handles this case correctly; this PR aligns `cache_controller_v1` with that logic.

Fix

- `cache_controller.py`: determine `cache_dtype` from `kv_cache_quant_type`: quantized → `"uint8"`, non-quantized → `model_config.dtype`
- `cache_controller.py`: check `if value_cache_shape` before creating `val_cache` / `val_cache_scales`; `cache_kvs_list` order matches `gpu_model_runner.py`: `[key]` or `[key, val]`
- `initialize_mtp_kv_cache`: apply the same two fixes

Affected scope

- `CacheController.initialize_kv_cache()`
- `CacheController.initialize_mtp_kv_cache()`
- `initialize_host_cache()` (already correctly uses `cache_config.cache_dtype`)

Unit tests

New test coverage:
- `initialize_kv_cache` / `initialize_mtp_kv_cache` dtype verification under no quantization, int8, and block_wise_fp8
- `initialize_kv_cache` / `initialize_mtp_kv_cache` behavior with `value_cache_shape=None` (MLA / DeepSeek)
- `cache_kvs_map` dtype verification

Checklist