[Refactor] Organize token processor metrics/traces code#7762
[Refactor] Organize token processor metrics/traces code#7762liyonghua0910 wants to merge 2 commits into
Conversation
|
Thanks for your contribution! |
CI报告基于以下代码生成(30分钟更新一次): 1 任务总览
2 任务状态汇总2.1 Required任务 : 5/10 通过
2.2 可选任务 — 22/26 通过
3 失败详情(仅 required)Approval — 代码规范(审批缺失)(置信度: 高)Approval
根因详情: 关键日志: 修复建议:
修复建议摘要: 请指定RD负责人approve envs.py和log行为修改 关联变更: |
PaddlePaddle-bot
left a comment
There was a problem hiding this comment.
🤖 Paddle-CI-Agent | pr_review |
2026-05-11 11:20:55
📋 Review 摘要
PR 概述:重构 token_processor.py 中的 metrics/traces 代码,将内联逻辑提取为命名方法,并新增 FD_ENABLE_OBSERVABILITY 可观测性开关
变更范围:fastdeploy/output/、fastdeploy/entrypoints/openai/、fastdeploy/engine/、fastdeploy/envs.py
影响面 Tag:[DataProcessor] [APIServer] [Engine] [FDConfig]
📝 PR 规范检查
PR 标题使用了非官方 Tag [Refactor](不在官方 Tag 列表中),且描述模板各 Section 均为空(仅含占位注释)。建议修正如下:
标题建议(可直接复制):
[DataProcessor] Organize token processor metrics/traces code
PR 描述建议(可直接复制):
## Motivation
重构 `fastdeploy/output/token_processor.py` 中分散的 metrics/traces 代码,将内联逻辑提取为命名方法,并新增 `FD_ENABLE_OBSERVABILITY` 环境变量开关(默认启用)统一控制 Prometheus metrics、tracing span 和结构化日志的上报行为,提升代码可维护性并支持低开销部署场景。
## Modifications
- `fastdeploy/output/token_processor.py`:重构为以下独立方法,并统一受 `self._observability_enabled` 开关控制:
- `_setup_trace_context`:初始化 trace 传播上下文
- `_record_task_metrics_on_first_token` / `_on_subsequent_token` / `_on_completion`:task.metrics 时间戳更新
- `_record_trace_on_first_token` / `_on_completion`:trace span 上报
- `_record_prometheus_metrics_on_first_token` / `_on_token` / `_on_completion`:Prometheus 指标观测
- `_log_request_on_completion`:LIFECYCLE 结构化日志
- `fastdeploy/engine/request.py`:为 `record_recv_first_token` 和 `record_decode_recv_second_token` 新增可选 `cur_time` 参数,避免重复调用 `time.time()`
- `fastdeploy/envs.py`:新增 `FD_ENABLE_OBSERVABILITY` 环境变量(默认 `"1"` 即启用)
- `fastdeploy/entrypoints/openai/serving_chat.py` / `serving_completion.py`:对 metrics 字段读取加 `or 0` 防御 None 值
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [ ] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.问题
| 级别 | 文件 | 概述 |
|---|---|---|
| 🟡 建议 | fastdeploy/output/token_processor.py:253 |
_record_trace_on_completion(task) 未传 rid,ZMQ 路径新增了 rid=None 的 DECODE span,原代码此路径无此 span,属行为改变 |
总体评价
重构思路清晰,方法拆分和命名语义准确,FD_ENABLE_OBSERVABILITY 开关设计合理且向后兼容。核心问题是 ZMQ 路径(_process_per_token)新增了携带 rid=None 的 DECODE trace span,与原代码行为不一致,需确认是有意为之还是重构遗漏,建议补全 rid 或显式处理。
| preempted_count=getattr(task.metrics, "preempted_count", 0), | ||
| ) | ||
| self._record_task_metrics_on_completion(task, current_time, recovery_stop) | ||
| self._record_trace_on_completion(task) |
There was a problem hiding this comment.
🟡 建议 _record_trace_on_completion(task) 未传入 rid,默认 rid=None
原代码在 _process_per_token(ZMQ 路径)中调用的 _record_completion_metrics 从未上报 DECODE span。重构后 _record_trace_on_completion 内部调用了 tracing.trace_report_span(name=DECODE, rid=rid, ...),相当于在 ZMQ 路径新增了一个 rid=None 的 DECODE span,属于行为改变。
若 tracing.trace_report_span 无法优雅处理 rid=None,可能导致 tracing 异常或孤立 span。
建议修复方式:
# 在 _process_per_token 中计算 rid 并传入
rid = task_id.split("_")[0]
self._record_trace_on_completion(task, rid)或在 _record_trace_on_completion 内部处理 rid is None 的情况:
if rid is None:
return # ZMQ 路径不上报 DECODE span,保持原行为
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #7762 +/- ##
==========================================
Coverage ? 72.17%
==========================================
Files ? 396
Lines ? 55736
Branches ? 8720
==========================================
Hits ? 40226
Misses ? 12724
Partials ? 2786
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| task.metrics.cal_cost_time() | ||
| metrics = copy.copy(task.metrics) | ||
| self._record_first_token_metrics(task, current_time) | ||
| rid = task_id.split("_")[0] |
There was a problem hiding this comment.
request_id现在不直接使用split获取了,可以参考#7564
使用from fastdeploy.utils import get_base_request_id
rid = get_base_request_id(task_id)
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.