
Conversation

@liyonghua0910 (Collaborator) commented Feb 10, 2026

Motivation

Fix the miscalculation of num_cpu_blocks (a sketch of the corrected relation follows the list below):

  • Before this PR, allocating 100G of memory as the cpu cache required setting swap_space: 50;
  • After this PR, allocating 100G of memory as the cpu cache requires setting swap_space: 100.
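
For context, a minimal standalone sketch of the corrected relation (the bytes_per_block formula and the example numbers come from the unit test quoted further below; the helper function itself is illustrative, not the actual CacheConfig code):

# Sketch only: after this PR, swap_space is read directly as the cpu cache size in GiB.
def expected_num_cpu_blocks(swap_space_gb, head_num, head_dim, byte_size,
                            kv_factor, block_size, num_hidden_layers):
    if swap_space_gb is None:
        return 0
    bytes_per_block = head_num * head_dim * byte_size * kv_factor * block_size * num_hidden_layers
    return int(swap_space_gb * 1024 ** 3 / bytes_per_block)

# 1 GiB with the test's bfloat16 defaults: 1024**3 / 25165824 -> 42 blocks
print(expected_num_cpu_blocks(1, head_num=32, head_dim=128, byte_size=2,
                              kv_factor=2, block_size=64, num_hidden_layers=24))  # 42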

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If there are no unit tests, please explain the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets the release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot bot commented Feb 10, 2026

Thanks for your contribution!

Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes the inaccurate calculation of CacheConfig.num_cpu_blocks (derived from swap_space), unifies the KV cache dtype→bytes mapping logic, and moves the MLA cache check down into CacheConfig so that other modules can reuse it.

Changes:

  • Add get_cache_bytes() to CacheConfig and refactor the bytes_per_block / num_cpu_blocks calculation logic (introducing use_mla_cache and kv_factor; a sketch of the dtype→bytes mapping follows this list).
  • Switch the theoretical KV cache estimation in GPUModelRunner to use fd_config.cache_config.use_mla_cache.
  • Reuse CacheConfig.get_cache_bytes() in CacheTransferManager, and add/extend the related unit tests.
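
A minimal sketch of the unified dtype→bytes mapping (only the bfloat16 = 2 and float32 = 4 entries are confirmed by the unit tests quoted below; the remaining entries and the free-function form are assumptions, since the real get_cache_bytes() is a CacheConfig method):

# Hypothetical shape of the unified mapping.
_KV_CACHE_DTYPE_BYTES = {
    "bfloat16": 2,  # confirmed by the bfloat16 test cases below
    "float32": 4,   # confirmed by the fp32 test case below
    "float16": 2,   # assumption: standard 2-byte half precision
    "uint8": 1,     # assumption: 1-byte quantized cache dtypes
}

def get_cache_bytes(cache_dtype: str) -> int:
    return _KV_CACHE_DTYPE_BYTES[cache_dtype]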

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Summary per file:

  • tests/utils/test_config.py: add unit tests for the get_cache_bytes and num_cpu_blocks calculations
  • fastdeploy/worker/gpu_model_runner.py: source the MLA cache check from cache_config.use_mla_cache
  • fastdeploy/config.py: refactor the CPU cache blocks calculation; add get_cache_bytes() and use_mla_cache
  • fastdeploy/cache_manager/prefix_cache_manager.py: adjust the initialization log output fields
  • fastdeploy/cache_manager/cache_transfer_manager.py: reuse CacheConfig.get_cache_bytes() and remove the duplicate implementation

Comment on lines +127 to +129
f"Prefix cache manager is initialized with {self.num_gpu_blocks} gpu blocks "
f"and {self.num_cpu_blocks} cpu blocks, bytes_per_token_per_layer for each rank: "
f"{self.cache_config.bytes_per_token_per_layer / self.config.parallel_config.tensor_parallel_size}"
Copilot AI commented Feb 10, 2026

This log line now reads self.cache_config.bytes_per_token_per_layer, but quite a few test doubles/constructors in the repo use a SimpleNamespace or a CacheConfig without model_cfg set, which do not have that attribute (for example, the fake cache_config in tests/cache_manager/test_prefix_cache_manager.py only provides bytes_per_layer_per_block). This raises an AttributeError during initialization and regresses existing tests/usages.
Suggestion: use getattr in the log with a fallback (prefer bytes_per_token_per_layer, otherwise fall back to bytes_per_layer_per_block or skip the field) to stay compatible with the old test stubs.
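
A minimal sketch of that fallback, reusing the log line quoted above (the logger name and the exact guard are assumptions):

# Hypothetical fallback: prefer the new attribute, fall back to the old one
# so SimpleNamespace-based test stubs keep working.
bytes_per_token = getattr(self.cache_config, "bytes_per_token_per_layer", None)
if bytes_per_token is None:
    bytes_per_token = getattr(self.cache_config, "bytes_per_layer_per_block", 0)
logger.info(
    f"Prefix cache manager is initialized with {self.num_gpu_blocks} gpu blocks "
    f"and {self.num_cpu_blocks} cpu blocks, bytes_per_token_per_layer for each rank: "
    f"{bytes_per_token / self.config.parallel_config.tensor_parallel_size}"
)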

Comment on lines 232 to 265
# Test case 1: swap_space is None -> num_cpu_blocks = 0
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": None,
})
assert cache_config.num_cpu_blocks == 0

# Test case 2: swap_space = 1GB
# bytes_per_block = head_num * head_dim * byte_size * kv_factor * block_size * num_hidden_layers
#                 = 32 * 128 * 2 * 2 * 64 * 24 = 25165824 bytes
# num_cpu_blocks = 1 * 1024^3 / 25165824 = 42
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": 1,
})
expected_blocks = int(1 * 1024 ** 3 / (32 * 128 * 2 * 2 * 64 * 24))
assert cache_config.num_cpu_blocks == expected_blocks
assert cache_config.num_cpu_blocks == 42

# Test case 3: swap_space = 2GB
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": 2,
})
assert cache_config.num_cpu_blocks == 85

# Test case 4: with fp32 dtype (4 bytes)
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "float32",
    "swap_space": 1,

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected num_cpu_blocks values in this test depend on several implicit defaults (e.g. block_size=64, and FD_ATTENTION_BACKEND != MLA_ATTN so that kv_factor=2). If these defaults change in the future, or the CI environment sets a different attention backend, the test becomes flaky and prone to false alarms.
Suggestion: pass block_size explicitly when constructing CacheConfig and set use_mla_cache explicitly (or patch envs.FD_ATTENTION_BACKEND in the test), and derive the expected values from cache_config.bytes_per_block to reduce coupling to magic numbers.

Suggested change (replacing the quoted hunk above):

# Use an explicit block size to avoid relying on CacheConfig defaults
block_size = 64

# Test case 1: swap_space is None -> num_cpu_blocks = 0
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": None,
    "block_size": block_size,
    "use_mla_cache": False,
})
assert cache_config.num_cpu_blocks == 0

# Test case 2: swap_space = 1GB
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": 1,
    "block_size": block_size,
    "use_mla_cache": False,
})
expected_blocks = int(1 * 1024 ** 3 / cache_config.bytes_per_block)
assert cache_config.num_cpu_blocks == expected_blocks

# Test case 3: swap_space = 2GB
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": 2,
    "block_size": block_size,
    "use_mla_cache": False,
})
expected_blocks = int(2 * 1024 ** 3 / cache_config.bytes_per_block)
assert cache_config.num_cpu_blocks == expected_blocks

# Test case 4: with fp32 dtype (4 bytes)
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "float32",
    "swap_space": 1,
    "block_size": block_size,
    "use_mla_cache": False,
@@ -2745,7 +2739,7 @@ def cal_theortical_kvcache(self):

# NOTE:(changwenbin) Determie whether it is Multi-Head Latent Attention,
Copilot AI commented Feb 10, 2026

The comment misspells “Determie”; it should be “Determine”. (Keeping the comment in English is still recommended.)

Suggested change
# NOTE:(changwenbin) Determie whether it is Multi-Head Latent Attention,
# NOTE:(changwenbin) Determine whether it is Multi-Head Latent Attention,

* byte_size
)

self.head_num = getattr(self.model_cfg, "num_key_value_heads") or self.model_cfg.num_attention_heads
Copilot AI commented Feb 10, 2026

There are two correctness issues with self.head_num = getattr(self.model_cfg, "num_key_value_heads") or ...:

  1. getattr is called without a default, so any model_cfg without a num_key_value_heads attribute raises immediately (regressing the previous hasattr guard).
  2. The code does not handle the sentinel value num_key_value_heads <= 0; in this repo ModelConfig defaults to num_key_value_heads = -1, which is truthy, so head_num becomes -1 and the bytes_per_block / num_cpu_blocks calculations turn negative/wrong.

Suggestion: use getattr(self.model_cfg, "num_key_value_heads", None) and explicitly check is not None and int(v) > 0, otherwise fall back to num_attention_heads.
Suggested change (replacing the line quoted above):

# Prefer num_key_value_heads when it is defined and positive; otherwise fall back to num_attention_heads
num_kv_heads = getattr(self.model_cfg, "num_key_value_heads", None)
if num_kv_heads is not None:
    try:
        num_kv_heads = int(num_kv_heads)
    except (TypeError, ValueError):
        num_kv_heads = None
if num_kv_heads is not None and num_kv_heads > 0:
    self.head_num = num_kv_heads
else:
    self.head_num = self.model_cfg.num_attention_heads

@codecov-commenter commented

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@83b4b08). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6438   +/-   ##
==========================================
  Coverage           ?   68.23%           
==========================================
  Files              ?      391           
  Lines              ?    52789           
  Branches           ?     8220           
==========================================
  Hits               ?    36023           
  Misses             ?    14106           
  Partials           ?     2660           
Flag    Coverage Δ
GPU     68.23% <100.00%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

