[BugFix] fix num_cpu_blocks computation #6438
base: develop
Conversation
Thanks for your contribution!
Pull request overview
This PR fixes the inaccurate computation of `CacheConfig.num_cpu_blocks` (which is derived from `swap_space`), unifies the KV cache dtype→byte-size mapping logic, and moves the MLA cache determination down into `CacheConfig` so other modules can reuse it.
Changes:
- Add `get_cache_bytes()` to `CacheConfig` and refactor the `bytes_per_block` / `num_cpu_blocks` computation (introducing `use_mla_cache` and `kv_factor`).
- Switch `GPUModelRunner`'s theoretical KV cache estimation to use `fd_config.cache_config.use_mla_cache`.
- Reuse `CacheConfig.get_cache_bytes()` in `CacheTransferManager`, and add/extend the related unit tests (see the dtype-mapping sketch after this list).
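The unified dtype→byte-size mapping itself is not shown in this review, so here is a minimal illustrative sketch of what such a mapping can look like. The dict contents and the function name are assumptions for illustration, not the PR's actual `get_cache_bytes()`:

```python
# Hypothetical sketch of a unified KV-cache dtype -> per-element byte size mapping.
# Only bfloat16 and float32 appear in this PR's tests; other entries are assumptions.
_CACHE_DTYPE_BYTES = {
    "bfloat16": 2,
    "float16": 2,
    "float32": 4,
}

def cache_dtype_to_bytes(cache_dtype: str) -> int:
    """Return the byte size of one KV-cache element for a given dtype string."""
    try:
        return _CACHE_DTYPE_BYTES[cache_dtype]
    except KeyError:
        raise ValueError(f"Unsupported KV cache dtype: {cache_dtype}")
```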
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/utils/test_config.py | Add unit tests for get_cache_bytes and the num_cpu_blocks computation |
| fastdeploy/worker/gpu_model_runner.py | Source the MLA cache determination from cache_config.use_mla_cache |
| fastdeploy/config.py | Refactor the CPU cache blocks computation; add get_cache_bytes() and use_mla_cache |
| fastdeploy/cache_manager/prefix_cache_manager.py | Adjust the fields in the initialization log output |
| fastdeploy/cache_manager/cache_transfer_manager.py | Reuse CacheConfig.get_cache_bytes() and remove the duplicated implementation |
| f"Prefix cache manager is initialized with {self.num_gpu_blocks} gpu blocks " | ||
| f"and {self.num_cpu_blocks} cpu blocks, bytes_per_token_per_layer for each rank: " | ||
| f"{self.cache_config.bytes_per_token_per_layer / self.config.parallel_config.tensor_parallel_size}" |
Copilot AI · Feb 10, 2026
This log line now reads `self.cache_config.bytes_per_token_per_layer`, but several test stubs/constructors in the repo use a `SimpleNamespace` or a `CacheConfig` built without `model_cfg`, which will not have that attribute (for example, the fake cache_config in tests/cache_manager/test_prefix_cache_manager.py only provides `bytes_per_layer_per_block`). Initialization would then raise an `AttributeError` outright, regressing existing tests/usages.

Suggestion: use `getattr` in the log with a fallback (prefer `bytes_per_token_per_layer`, otherwise fall back to `bytes_per_layer_per_block` or skip the field) to stay compatible with older test stub objects; a sketch follows.
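A minimal sketch of that fallback, assuming the attribute names quoted in the log above and a module-level `logger`; this is illustrative, not the PR's actual code:

```python
# Hypothetical fallback for the init log: prefer bytes_per_token_per_layer,
# fall back to bytes_per_layer_per_block, and skip the field if neither exists.
tp_size = self.config.parallel_config.tensor_parallel_size
per_token = getattr(self.cache_config, "bytes_per_token_per_layer", None)
per_block = getattr(self.cache_config, "bytes_per_layer_per_block", None)

msg = (
    f"Prefix cache manager is initialized with {self.num_gpu_blocks} gpu blocks "
    f"and {self.num_cpu_blocks} cpu blocks"
)
if per_token is not None:
    msg += f", bytes_per_token_per_layer for each rank: {per_token / tp_size}"
elif per_block is not None:
    msg += f", bytes_per_layer_per_block for each rank: {per_block / tp_size}"
logger.info(msg)
```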
tests/utils/test_config.py
Outdated
```python
# Test case 1: swap_space is None -> num_cpu_blocks = 0
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": None,
})
assert cache_config.num_cpu_blocks == 0

# Test case 2: swap_space = 1GB
# bytes_per_block = head_num * head_dim * byte_size * kv_factor * block_size * num_hidden_layers
#                 = 32 * 128 * 2 * 2 * 64 * 24 = 25165824 bytes
# num_cpu_blocks = 1 * 1024^3 / 25165824 = 42
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": 1,
})
expected_blocks = int(1 * 1024 ** 3 / (32 * 128 * 2 * 2 * 64 * 24))
assert cache_config.num_cpu_blocks == expected_blocks
assert cache_config.num_cpu_blocks == 42

# Test case 3: swap_space = 2GB
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": 2,
})
assert cache_config.num_cpu_blocks == 85

# Test case 4: with fp32 dtype (4 bytes)
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "float32",
    "swap_space": 1,
```
Copilot AI · Feb 10, 2026
This test's expected values for `num_cpu_blocks` depend on several implicit defaults (e.g. `block_size=64`, and `FD_ATTENTION_BACKEND != MLA_ATTN` so that `kv_factor=2`). If those defaults change in the future, or the CI environment sets a different attention backend, the test becomes flaky and can produce false failures.

Suggestion: pass `block_size` explicitly when constructing `CacheConfig`, set `use_mla_cache` explicitly (or patch `envs.FD_ATTENTION_BACKEND` in the test), and derive the expected values from `cache_config.bytes_per_block` to reduce the coupling to magic numbers.
Suggested change (replacing the original hunk quoted above):

```python
# Use an explicit block size to avoid relying on CacheConfig defaults
block_size = 64

# Test case 1: swap_space is None -> num_cpu_blocks = 0
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": None,
    "block_size": block_size,
    "use_mla_cache": False,
})
assert cache_config.num_cpu_blocks == 0

# Test case 2: swap_space = 1GB
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": 1,
    "block_size": block_size,
    "use_mla_cache": False,
})
expected_blocks = int(1 * 1024 ** 3 / cache_config.bytes_per_block)
assert cache_config.num_cpu_blocks == expected_blocks

# Test case 3: swap_space = 2GB
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": 2,
    "block_size": block_size,
    "use_mla_cache": False,
})
expected_blocks = int(2 * 1024 ** 3 / cache_config.bytes_per_block)
assert cache_config.num_cpu_blocks == expected_blocks

# Test case 4: with fp32 dtype (4 bytes)
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "float32",
    "swap_space": 1,
    "block_size": block_size,
    "use_mla_cache": False,
```
fastdeploy/worker/gpu_model_runner.py

```diff
@@ -2745,7 +2739,7 @@ def cal_theortical_kvcache(self):
 # NOTE:(changwenbin) Determie whether it is Multi-Head Latent Attention,
```
Copilot AI · Feb 10, 2026
The comment misspells "Determie"; it should be "Determine". (Keeping the comment in English is still recommended.)
Suggested change:

```diff
-# NOTE:(changwenbin) Determie whether it is Multi-Head Latent Attention,
+# NOTE:(changwenbin) Determine whether it is Multi-Head Latent Attention,
```
fastdeploy/config.py

```python
    * byte_size
)

self.head_num = getattr(self.model_cfg, "num_key_value_heads") or self.model_cfg.num_attention_heads
```
Copilot AI · Feb 10, 2026
`self.head_num = getattr(self.model_cfg, "num_key_value_heads") or ...` has two correctness problems:
- `getattr` is called without a default, so any `model_cfg` lacking a `num_key_value_heads` attribute raises an exception outright (regressing the previous `hasattr` guard).
- The code does not handle the `num_key_value_heads <= 0` sentinel; this repo's `ModelConfig` defaults to `num_key_value_heads = -1`, which is truthy, so `head_num` becomes -1 and the `bytes_per_block` / `num_cpu_blocks` computation turns negative/incorrect.

Suggestion: use `getattr(self.model_cfg, "num_key_value_heads", None)` and explicitly check `is not None and int(v) > 0`; otherwise fall back to `num_attention_heads`.
Suggested change:

```python
# Prefer num_key_value_heads when it is defined and positive; otherwise fall back to num_attention_heads
num_kv_heads = getattr(self.model_cfg, "num_key_value_heads", None)
if num_kv_heads is not None:
    try:
        num_kv_heads = int(num_kv_heads)
    except (TypeError, ValueError):
        num_kv_heads = None
if num_kv_heads is not None and num_kv_heads > 0:
    self.head_num = num_kv_heads
else:
    self.head_num = self.model_cfg.num_attention_heads
```
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff            @@
##           develop    #6438   +/-   ##
==========================================
  Coverage         ?   68.23%
==========================================
  Files            ?      391
  Lines            ?    52789
  Branches         ?     8220
==========================================
  Hits             ?    36023
  Misses           ?    14106
  Partials         ?     2660
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Motivation
Fix the incorrect `num_cpu_blocks` computation; a worked example of the intended derivation follows.
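As a worked illustration, using the values from the unit test above (the variable names mirror the test's comment, not necessarily the actual `CacheConfig` internals):

```python
# Reproduce the num_cpu_blocks derivation from swap_space with the test's example values.
head_num = 32            # num_key_value_heads
head_dim = 128
byte_size = 2            # bfloat16 -> 2 bytes per element
kv_factor = 2            # separate K and V caches; would be 1 for an MLA cache
block_size = 64          # tokens per cache block
num_hidden_layers = 24
swap_space_gb = 1

bytes_per_block = head_num * head_dim * byte_size * kv_factor * block_size * num_hidden_layers
num_cpu_blocks = int(swap_space_gb * 1024**3 / bytes_per_block)

assert bytes_per_block == 25165824
assert num_cpu_blocks == 42  # matches test case 2 above
```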
Modifications
Usage or Command
Accuracy Tests
Checklist
- PR title tag (one of): [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure it has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.