[BugFix] fix num_cpu_blocks computation #6438
base: develop
Conversation
Thanks for your contribution!
Pull request overview
This PR fixes the inaccurate computation of `CacheConfig.num_cpu_blocks` (which is derived from `swap_space`), unifies the KV cache dtype→byte-size mapping logic, and moves the MLA cache determination down into `CacheConfig` so other modules can reuse it.
Changes:
- Add `get_cache_bytes()` to `CacheConfig` and refactor the `bytes_per_block` / `num_cpu_blocks` computation (introducing `use_mla_cache` and `kv_factor`).
- Switch `GPUModelRunner`'s theoretical KV cache estimation to use `fd_config.cache_config.use_mla_cache`.
- Reuse `CacheConfig.get_cache_bytes()` in `CacheTransferManager`, and add/extend the related unit tests (see the dtype-mapping sketch after this list).
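The unified dtype→byte-size mapping itself is not shown in this review, so here is a minimal illustrative sketch of what such a mapping can look like. The dict contents and the function name are assumptions for illustration, not the PR's actual `get_cache_bytes()`:

```python
# Hypothetical sketch of a unified KV-cache dtype -> per-element byte size mapping.
# Only bfloat16 and float32 appear in this PR's tests; other entries are assumptions.
_CACHE_DTYPE_BYTES = {
    "bfloat16": 2,
    "float16": 2,
    "float32": 4,
}

def cache_dtype_to_bytes(cache_dtype: str) -> int:
    """Return the byte size of one KV-cache element for a given dtype string."""
    try:
        return _CACHE_DTYPE_BYTES[cache_dtype]
    except KeyError:
        raise ValueError(f"Unsupported KV cache dtype: {cache_dtype}")
```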
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/utils/test_config.py | Add unit tests for get_cache_bytes and the num_cpu_blocks computation |
| fastdeploy/worker/gpu_model_runner.py | Source the MLA cache determination from cache_config.use_mla_cache |
| fastdeploy/config.py | Refactor the CPU cache blocks computation; add get_cache_bytes() and use_mla_cache |
| fastdeploy/cache_manager/prefix_cache_manager.py | Adjust the fields in the initialization log output |
| fastdeploy/cache_manager/cache_transfer_manager.py | Reuse CacheConfig.get_cache_bytes() and remove the duplicated implementation |
| f"Prefix cache manager is initialized with {self.num_gpu_blocks} gpu blocks " | ||
| f"and {self.num_cpu_blocks} cpu blocks, bytes_per_token_per_layer for each rank: " | ||
| f"{self.cache_config.bytes_per_token_per_layer / self.config.parallel_config.tensor_parallel_size}" |
Copilot AI · Feb 10, 2026
This log line now reads `self.cache_config.bytes_per_token_per_layer`, but several test stubs/constructors in the repo use a `SimpleNamespace` or a `CacheConfig` built without `model_cfg`, which will not have that attribute (for example, the fake cache_config in tests/cache_manager/test_prefix_cache_manager.py only provides `bytes_per_layer_per_block`). Initialization would then raise an `AttributeError` outright, regressing existing tests/usages.

Suggestion: use `getattr` in the log with a fallback (prefer `bytes_per_token_per_layer`, otherwise fall back to `bytes_per_layer_per_block` or skip the field) to stay compatible with older test stub objects; a sketch follows.
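A minimal sketch of that fallback, assuming the attribute names quoted in the log above and a module-level `logger`; this is illustrative, not the PR's actual code:

```python
# Hypothetical fallback for the init log: prefer bytes_per_token_per_layer,
# fall back to bytes_per_layer_per_block, and skip the field if neither exists.
tp_size = self.config.parallel_config.tensor_parallel_size
per_token = getattr(self.cache_config, "bytes_per_token_per_layer", None)
per_block = getattr(self.cache_config, "bytes_per_layer_per_block", None)

msg = (
    f"Prefix cache manager is initialized with {self.num_gpu_blocks} gpu blocks "
    f"and {self.num_cpu_blocks} cpu blocks"
)
if per_token is not None:
    msg += f", bytes_per_token_per_layer for each rank: {per_token / tp_size}"
elif per_block is not None:
    msg += f", bytes_per_layer_per_block for each rank: {per_block / tp_size}"
logger.info(msg)
```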
tests/utils/test_config.py
Outdated
```python
# Test case 1: swap_space is None -> num_cpu_blocks = 0
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": None,
})
assert cache_config.num_cpu_blocks == 0

# Test case 2: swap_space = 1GB
# bytes_per_block = head_num * head_dim * byte_size * kv_factor * block_size * num_hidden_layers
#                 = 32 * 128 * 2 * 2 * 64 * 24 = 25165824 bytes
# num_cpu_blocks = 1 * 1024^3 / 25165824 = 42
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": 1,
})
expected_blocks = int(1 * 1024 ** 3 / (32 * 128 * 2 * 2 * 64 * 24))
assert cache_config.num_cpu_blocks == expected_blocks
assert cache_config.num_cpu_blocks == 42

# Test case 3: swap_space = 2GB
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": 2,
})
assert cache_config.num_cpu_blocks == 85

# Test case 4: with fp32 dtype (4 bytes)
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "float32",
    "swap_space": 1,
```
Copilot AI · Feb 10, 2026
This test's expected values for `num_cpu_blocks` depend on several implicit defaults (e.g. `block_size=64`, and `FD_ATTENTION_BACKEND != MLA_ATTN` so that `kv_factor=2`). If those defaults change in the future, or the CI environment sets a different attention backend, the test becomes flaky and can produce false failures.

Suggestion: pass `block_size` explicitly when constructing `CacheConfig`, set `use_mla_cache` explicitly (or patch `envs.FD_ATTENTION_BACKEND` in the test), and derive the expected values from `cache_config.bytes_per_block` to reduce the coupling to magic numbers.
Suggested change (replacing the original hunk quoted above):

```python
# Use an explicit block size to avoid relying on CacheConfig defaults
block_size = 64

# Test case 1: swap_space is None -> num_cpu_blocks = 0
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": None,
    "block_size": block_size,
    "use_mla_cache": False,
})
assert cache_config.num_cpu_blocks == 0

# Test case 2: swap_space = 1GB
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": 1,
    "block_size": block_size,
    "use_mla_cache": False,
})
expected_blocks = int(1 * 1024 ** 3 / cache_config.bytes_per_block)
assert cache_config.num_cpu_blocks == expected_blocks

# Test case 3: swap_space = 2GB
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "bfloat16",
    "swap_space": 2,
    "block_size": block_size,
    "use_mla_cache": False,
})
expected_blocks = int(2 * 1024 ** 3 / cache_config.bytes_per_block)
assert cache_config.num_cpu_blocks == expected_blocks

# Test case 4: with fp32 dtype (4 bytes)
cache_config = CacheConfig({
    "model_cfg": model_config,
    "cache_dtype": "float32",
    "swap_space": 1,
    "block_size": block_size,
    "use_mla_cache": False,
```
fastdeploy/worker/gpu_model_runner.py

```diff
@@ -2745,7 +2739,7 @@ def cal_theortical_kvcache(self):
 # NOTE:(changwenbin) Determie whether it is Multi-Head Latent Attention,
```
Copilot AI · Feb 10, 2026
The comment misspells "Determie"; it should be "Determine". (Keeping the comment in English is still recommended.)
Suggested change:

```diff
-# NOTE:(changwenbin) Determie whether it is Multi-Head Latent Attention,
+# NOTE:(changwenbin) Determine whether it is Multi-Head Latent Attention,
```
fastdeploy/config.py

```python
    * byte_size
)

self.head_num = getattr(self.model_cfg, "num_key_value_heads") or self.model_cfg.num_attention_heads
```
Copilot AI · Feb 10, 2026
`self.head_num = getattr(self.model_cfg, "num_key_value_heads") or ...` has two correctness problems:
- `getattr` is called without a default, so any `model_cfg` lacking a `num_key_value_heads` attribute raises an exception outright (regressing the previous `hasattr` guard).
- The code does not handle the `num_key_value_heads <= 0` sentinel; this repo's `ModelConfig` defaults to `num_key_value_heads = -1`, which is truthy, so `head_num` becomes -1 and the `bytes_per_block` / `num_cpu_blocks` computation turns negative/incorrect.

Suggestion: use `getattr(self.model_cfg, "num_key_value_heads", None)` and explicitly check `is not None and int(v) > 0`; otherwise fall back to `num_attention_heads`.
Suggested change:

```python
# Prefer num_key_value_heads when it is defined and positive; otherwise fall back to num_attention_heads
num_kv_heads = getattr(self.model_cfg, "num_key_value_heads", None)
if num_kv_heads is not None:
    try:
        num_kv_heads = int(num_kv_heads)
    except (TypeError, ValueError):
        num_kv_heads = None
if num_kv_heads is not None and num_kv_heads > 0:
    self.head_num = num_kv_heads
else:
    self.head_num = self.model_cfg.num_attention_heads
```
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff            @@
##           develop    #6438   +/-   ##
==========================================
  Coverage         ?   68.23%
==========================================
  Files            ?      391
  Lines            ?    52789
  Branches         ?     8220
==========================================
  Hits             ?    36023
  Misses           ?    14106
  Partials         ?     2660
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Motivation
Fix the incorrect `num_cpu_blocks` computation; a worked example of the intended derivation follows.
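As a worked illustration, using the values from the unit test above (the variable names mirror the test's comment, not necessarily the actual `CacheConfig` internals):

```python
# Reproduce the num_cpu_blocks derivation from swap_space with the test's example values.
head_num = 32            # num_key_value_heads
head_dim = 128
byte_size = 2            # bfloat16 -> 2 bytes per element
kv_factor = 2            # separate K and V caches; would be 1 for an MLA cache
block_size = 64          # tokens per cache block
num_hidden_layers = 24
swap_space_gb = 1

bytes_per_block = head_num * head_dim * byte_size * kv_factor * block_size * num_hidden_layers
num_cpu_blocks = int(swap_space_gb * 1024**3 / bytes_per_block)

assert bytes_per_block == 25165824
assert num_cpu_blocks == 42  # matches test case 2 above
```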
Modifications
Usage or Command
Accuracy Tests
Checklist
- PR title tag (one of): [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure it has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.