[Refactor] Organize token processor metrics/traces code #7762

Draft
liyonghua0910 wants to merge 2 commits into PaddlePaddle:develop from liyonghua0910:develop+20260509_token_processor

@liyonghua0910
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 9, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

PaddlePaddle-bot commented May 9, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-11 11:42:40

This CI report was generated from the following code (refreshed every 30 minutes):


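1 Task overview

⚠️ 1 required task failed and needs attention first; 3 more required tasks are running and 1 is waiting.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 36 (0) | 36 | 27 | 3 | 4 | 2 | 0 |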

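2 Task status summary

2.1 Required tasks: 5/10 passed

Required tasks block merging; failures need attention first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Approval | 7s | PR issue: changes to envs.py and log_request require approval from 2 designated RDs | Ask a designated owner (jiangjiajun, xyxinyang, etc.) to approve | Job | - |
| ⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | Job | - |
| ⏳ | Extracted partial CE model tasks to run in CI. / run_ce_cases | - | Running | - | Job | - |
| ⏳ | xpu_4cards_case_test / run_xpu_4cards_cases | - | Running | - | Job | - |
| ⏸️ | Run Four Cards Tests / run_4_cards_tests | - | Waiting | - | - | - |
| ✅ | The other 5 required tasks passed | - | - | - | - | - |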

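2.2 Optional tasks: 22/26 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| | Check PR Template | 12s | Job | - |
| | Trigger Jenkins for PR | 18m1s | Job | - |
| ✅ | The other 22 optional tasks passed | - | - | - |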

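3 Failure details (required only)

Approval: code convention (missing approval) (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: code convention (missing approval)
  • Confidence: high
  • Root cause summary: the PR modifies envs.py and log_request, which requires approval from 2 designated RDs; currently 0/2 have approved
  • Analyzer: generic analysis (fallback)

Root cause details:
This PR triggered FastDeploy's approval-rule check script, check_approval.sh. The diff contains a new log_request( call and also modifies fastdeploy/envs.py; both are protected changes. Neither approval requirement is currently met, so the script exited with code 6 (2 approval errors).

Key log: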

0. You must have one FastDeploy RD (Jiang-Jia-Jun(jiangjiajun), yuanlehome(liuyuanle), rainyfly(chenjian26), Wanglongzhi2001(wanglongzhi)) approval for modifying [fastdeploy/envs.py].
1. You must have one FastDeploy RD (xyxinyang(zhouchong), zyyzghb(zhangyongyue)) approval for modifying logging behavior (.info/.debug/.error/log_request).
There are 2 approved errors.
##[error]Process completed with exit code 6.

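Suggested fix:

  1. At least one of @jiangjiajun, @liuyuanle, @chenjian26, or @wanglongzhi must review and approve (for the fastdeploy/envs.py change)
  2. At least one of @zhouchong or @zhangyongyue must review and approve (for the log_request logging-behavior change)

Fix summary: ask the designated RD owners to approve the envs.py and logging-behavior changes

Related changes: fastdeploy/envs.py (new env var); log_request( call (new log reporting)
Link: View logs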

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-11 11:20:55

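📋 Review summary

PR overview: refactors the metrics/traces code in token_processor.py by extracting inline logic into named methods, and adds the FD_ENABLE_OBSERVABILITY observability switch
Scope of changes: fastdeploy/output/, fastdeploy/entrypoints/openai/, fastdeploy/engine/, fastdeploy/envs.py
Affected tags: [DataProcessor] [APIServer] [Engine] [FDConfig]

📝 PR convention check

The PR title uses an unofficial tag, [Refactor], which is not in the official tag list, and every section of the description template is empty (placeholder comments only). Suggested corrections:

Suggested title (can be copied as-is):

  • [DataProcessor] Organize token processor metrics/traces code

Suggested PR description (can be copied as-is):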

## Motivation
Refactor the metrics/traces code scattered through `fastdeploy/output/token_processor.py` by extracting inline logic into named methods, and add the `FD_ENABLE_OBSERVABILITY` environment variable switch (enabled by default) to control the reporting of Prometheus metrics, tracing spans, and structured logs in one place. This improves maintainability and supports low-overhead deployments.

## Modifications
- `fastdeploy/output/token_processor.py`: refactored into the following standalone methods, all gated by the `self._observability_enabled` switch:
  - `_setup_trace_context`: initializes the trace propagation context
  - `_record_task_metrics_on_first_token` / `_on_subsequent_token` / `_on_completion`: update the `task.metrics` timestamps
  - `_record_trace_on_first_token` / `_on_completion`: report trace spans
  - `_record_prometheus_metrics_on_first_token` / `_on_token` / `_on_completion`: observe Prometheus metrics
  - `_log_request_on_completion`: LIFECYCLE structured log
- `fastdeploy/engine/request.py`: add an optional `cur_time` parameter to `record_recv_first_token` and `record_decode_recv_second_token` to avoid repeated `time.time()` calls
- `fastdeploy/envs.py`: add the `FD_ENABLE_OBSERVABILITY` environment variable (default `"1"`, i.e. enabled)
- `fastdeploy/entrypoints/openai/serving_chat.py` / `serving_completion.py`: add `or 0` when reading metrics fields to guard against None values

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
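
The `cur_time` and `or 0` items in the suggested Modifications list can be illustrated with a minimal sketch. The method names `record_recv_first_token` and `record_decode_recv_second_token` come from the review; the class `RequestMetricsSketch`, its field names, and the helper `safe_metric` are hypothetical stand-ins and are not FastDeploy's actual code.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestMetricsSketch:
    """Hypothetical stand-in for the metrics object attached to a request."""

    first_token_time: Optional[float] = None   # assumed field name
    second_token_time: Optional[float] = None  # assumed field name

    def record_recv_first_token(self, cur_time: Optional[float] = None) -> None:
        # Reuse the caller's timestamp when one is given; otherwise read the clock.
        self.first_token_time = time.time() if cur_time is None else cur_time

    def record_decode_recv_second_token(self, cur_time: Optional[float] = None) -> None:
        self.second_token_time = time.time() if cur_time is None else cur_time


def safe_metric(value: Optional[float]) -> float:
    # The `or 0` pattern: a missing (None) metric is reported as 0.
    return value or 0


now = 1000.0  # one clock reading shared by every recorder in this step
metrics = RequestMetricsSketch()
metrics.record_recv_first_token(now)
metrics.record_decode_recv_second_token(now)
print(metrics.first_token_time, metrics.second_token_time)
print(safe_metric(None), safe_metric(1.5))
```

Passing one timestamp through keeps related measurements consistent with each other and avoids a clock read per recorder. Note that `or 0` also maps a legitimate `0.0` to `0`, which is harmless for numeric fields.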

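Issues

| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/output/token_processor.py:253 | _record_trace_on_completion(task) is called without rid, so the ZMQ path now emits a DECODE span with rid=None. The original code emitted no such span on this path, so this is a behavior change |

Overall assessment

The refactoring approach is clear, the method split and naming are semantically accurate, and the FD_ENABLE_OBSERVABILITY switch is well designed and backward compatible. The main issue is that the ZMQ path (_process_per_token) now emits a DECODE trace span carrying rid=None, which differs from the original behavior. Please confirm whether this is intentional or a refactoring oversight; the suggestion is to supply rid or handle the None case explicitly.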
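
A minimal sketch of how such an environment-variable switch can gate the recording methods. The variable name `FD_ENABLE_OBSERVABILITY`, its default of `"1"`, and the attribute `_observability_enabled` come from the review; the parsing rule (only `"1"` enables) and the class `ProcessorSketch` are assumptions made for illustration, not the PR's implementation.

```python
import os
from typing import List, Mapping


def observability_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    # Assumption: unset or "1" means enabled; any other value disables.
    return environ.get("FD_ENABLE_OBSERVABILITY", "1") == "1"


class ProcessorSketch:
    """Hypothetical stand-in for the token processor."""

    def __init__(self, environ: Mapping[str, str] = os.environ) -> None:
        # Read the switch once at construction, not on every token.
        self._observability_enabled = observability_enabled(environ)
        self.observed: List[str] = []

    def _record_prometheus_metrics_on_completion(self, task_id: str) -> None:
        if not self._observability_enabled:
            return  # low-overhead deployments skip all reporting work
        self.observed.append(task_id)


on = ProcessorSketch({})  # variable unset: enabled by default
off = ProcessorSketch({"FD_ENABLE_OBSERVABILITY": "0"})
on._record_prometheus_metrics_on_completion("req-1")
off._record_prometheus_metrics_on_completion("req-1")
print(on.observed, off.observed)
```

Defaulting to enabled is what keeps the change backward compatible: deployments that never set the variable keep reporting as before.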

preempted_count=getattr(task.metrics, "preempted_count", 0),
)
self._record_task_metrics_on_completion(task, current_time, recovery_stop)
self._record_trace_on_completion(task)

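🟡 Suggestion: _record_trace_on_completion(task) is called without rid, so it defaults to rid=None

In the original code, _record_completion_metrics, which is called from _process_per_token (the ZMQ path), never reported a DECODE span. After the refactor, _record_trace_on_completion calls tracing.trace_report_span(name=DECODE, rid=rid, ...) internally, which effectively adds a DECODE span with rid=None on the ZMQ path. This is a behavior change.

tracing.trace_report_span does not handle rid=None gracefully, which may cause tracing errors or orphaned spans.

Suggested fix:

# Compute rid in _process_per_token and pass it in
rid = task_id.split("_")[0]
self._record_trace_on_completion(task, rid)

Or handle the rid is None case inside _record_trace_on_completion:

if rid is None:
    return  # The ZMQ path does not report a DECODE span, preserving the original behavior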

@codecov-commenter

Codecov Report

❌ Patch coverage is 77.58621% with 26 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@e67ed1a). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/output/token_processor.py | 75.49% | 12 Missing and 13 partials ⚠️ |
| ...astdeploy/entrypoints/openai/serving_completion.py | 80.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7762   +/-   ##
==========================================
  Coverage           ?   72.17%           
==========================================
  Files              ?      396           
  Lines              ?    55736           
  Branches           ?     8720           
==========================================
  Hits               ?    40226           
  Misses             ?    12724           
  Partials           ?     2786           
| Flag | Coverage Δ |
|---|---|
| GPU | 72.17% <77.58%> (?) |
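
The percentages in this report can be checked with a short calculation. Codecov computes coverage as hits divided by the sum of hits, misses, and partials, so partially covered lines count against the ratio. The project totals below are taken from the coverage diff above; the 116-line patch size is not stated in the report and is inferred here from "77.58621% with 26 lines missing", so treat it as an assumption.

```python
def coverage_percent(hits: int, misses: int, partials: int) -> float:
    """Coverage ratio as a percentage: hits / (hits + misses + partials)."""
    total = hits + misses + partials
    if total == 0:
        return 0.0
    return 100.0 * hits / total


# Project totals copied from the coverage diff in the report.
project = coverage_percent(hits=40226, misses=12724, partials=2786)

# Patch: 13 missing + 13 partial lines = 26 uncovered. The 90 covered lines
# are inferred (assumed patch size of 116 lines), not reported directly.
patch = coverage_percent(hits=90, misses=13, partials=13)

print(f"project: {project:.2f}%")
print(f"patch:   {patch:.5f}%")
```

The 26 uncovered lines match the per-file rows: 12 missing and 13 partial lines in token_processor.py plus 1 missing line in serving_completion.py.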


task.metrics.cal_cost_time()
metrics = copy.copy(task.metrics)
self._record_first_token_metrics(task, current_time)
rid = task_id.split("_")[0]
Collaborator


request_id is no longer obtained by calling split directly; see #7564 for reference.
Use from fastdeploy.utils import get_base_request_id
rid = get_base_request_id(task_id)
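
The implementation of `get_base_request_id` is not shown in this thread. The sketch below is a hypothetical stand-in, named `base_request_id_sketch` and assuming a `<base>_<index>` id layout, that only illustrates why routing id parsing through one helper is safer than calling `split("_")[0]` at each call site: an inline split truncates any base id that itself contains an underscore.

```python
def base_request_id_sketch(task_id: str) -> str:
    """Hypothetical helper: strip one trailing "_<digits>" suffix, if present.

    This is NOT fastdeploy.utils.get_base_request_id; the id layout
    "<base>_<index>" is an assumption made for illustration.
    """
    base, sep, suffix = task_id.rpartition("_")
    if sep and base and suffix.isdigit():
        return base
    return task_id


task_id = "my_req_0"  # a base id that itself contains an underscore
inline = task_id.split("_")[0]
central = base_request_id_sketch(task_id)
print(inline)
print(central)
```

With a single helper, a change in the id format is fixed in one place instead of at every call site that parses ids.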
