Skip to content

[FDConfig] 默认开启 FD_ENABLE_E2W_TENSOR_CONVERT 和 FD_ENGINE_TASK_QUEUE_WITH_SHM#7746

Open
sunlei1024 wants to merge 12 commits into
PaddlePaddle:developfrom
sunlei1024:feat/default-enable-shm
Open

[FDConfig] 默认开启 FD_ENABLE_E2W_TENSOR_CONVERT 和 FD_ENGINE_TASK_QUEUE_WITH_SHM#7746
sunlei1024 wants to merge 12 commits into
PaddlePaddle:developfrom
sunlei1024:feat/default-enable-shm

Conversation

@sunlei1024
Copy link
Copy Markdown
Collaborator

Motivation

默认开启 FD_ENABLE_E2W_TENSOR_CONVERTFD_ENGINE_TASK_QUEUE_WITH_SHM
以提升 Engine-to-Worker 的张量传递效率,以及引擎任务队列基于共享内存(SHM)的通信性能。

该优化在大模型推理场景下可以减少序列化/反序列化开销,提高吞吐和延迟表现。

Modifications

  • fastdeploy/envs.py
    • FD_ENABLE_E2W_TENSOR_CONVERT 默认值由 0 改为 1
    • FD_ENGINE_TASK_QUEUE_WITH_SHM 默认值由 0 改为 1

行为变更说明

  • 未显式设置环境变量时,将默认启用上述优化能力
  • 如需保持旧行为,可手动设置:
    • FD_ENABLE_E2W_TENSOR_CONVERT=0
    • FD_ENGINE_TASK_QUEUE_WITH_SHM=0
  • 在容器环境中需确保 /dev/shm 空间充足(建议 ≥ 1GB,视模型规模而定)

Usage or Command

默认无需额外配置,升级后自动生效。

如需关闭相关功能,可通过环境变量控制:

export FD_ENABLE_E2W_TENSOR_CONVERT=0
export FD_ENGINE_TASK_QUEUE_WITH_SHM=0

Docker 使用示例(配置共享内存):

docker run --shm-size=1g ...

Accuracy Tests

本次修改仅涉及环境变量默认值调整,不涉及模型计算逻辑变更。

验证结果:

  • ✅ 功能验证:服务启动、推理流程正常
  • ✅ 性能验证:E2W tensor convert 与 SHM queue 正常工作
  • ✅ 一致性验证:关闭开关(设为0)后结果与旧版本一致(无精度差异)

Checklist

  • Add at least one tag in PR title (e.g., [FDConfig])
  • Code formatted and pre-commit passed
  • Unit tests added
    • 原因:本次修改仅为默认配置变更,无新增逻辑路径
  • Accuracy results provided
  • Backward compatibility considered(可通过环境变量回退)
  • Not a Cherry-Pick PR / OR follows Cherry-Pick rules if applicable

@paddle-bot
Copy link
Copy Markdown

paddle-bot Bot commented May 7, 2026

Thanks for your contribution!

@sunlei1024 sunlei1024 changed the title [test] Stop server with /dev/shm cleanup [FDConfig] 默认开启 FD_ENABLE_E2W_TENSOR_CONVERT 和 FD_ENGINE_TASK_QUEUE_WITH_SHM May 7, 2026
PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot
Copy link
Copy Markdown

PaddlePaddle-bot commented May 7, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-11 21:31:06

CI报告基于以下代码生成(30分钟更新一次):


1 任务总览

CI 尚未全部完成:1 个 Required 任务失败(需处理),1 个运行中,2 个等待中。

总执行(rerun次数) 总任务 ✅ 通过 ❌ 失败 ⏳ 运行中 ⏸️ 等待中 跳过
36(0) 36 29 2 2 3 0

2 任务状态汇总

2.1 Required任务 : 6/10 通过

必选任务阻塞合并,失败需优先处理。

状态 任务 耗时 根因 修复建议 日志 重跑
Approval 8s PR问题:修改envs.py及日志行为需指定RD审批 请jiangjiajun等/zhouchong等对应RD审批 Job -
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage - 运行中 - Job -
⏸️ Run Four Cards Tests / run_4_cards_tests - 等待中 - - -
⏸️ xpu_8cards_case_test / run_xpu_8cards_cases - 等待中 - - -
其余 6 个必选任务通过 - - - - -

2.2 可选任务 — 23/26 通过

可选任务不阻塞合并,失败仅供参考。

状态 任务 耗时 日志 重跑
Trigger Jenkins for PR 23m49s Job -
Run iluvatar Tests / run_iluvatar_cases - Job -
⏸️ CI_HPU - - -
其余 23 个可选任务通过 - - -

3 失败详情(仅 required)

Approval — 代码规范(置信度: 高)

Approval

  • 状态: ❌ 失败
  • 错误类型: 代码规范
  • 置信度: 高
  • 根因摘要: PR修改envs.py及添加日志调用,需指定RD审批
  • 分析器: 通用分析(fallback)

根因详情:
PR 修改了 fastdeploy/envs.py 且新增了 .error() 日志调用,触发了两项强制审批规则:
① 修改 fastdeploy/envs.py 需要 jiangjiajun/liuyuanle/chenjian26/wanglongzhi 之一 Approve;
② 修改日志行为(.info/.debug/.error/log_request)需要 zhouchong/zhangyongyue 之一 Approve。

关键日志:

0. You must have one FastDeploy RD (Jiang-Jia-Jun(jiangjiajun), yuanlehome(liuyuanle), rainyfly(chenjian26), Wanglongzhi2001(wanglongzhi)) approval for modifying [fastdeploy/envs.py].
1. You must have one FastDeploy RD (xyxinyang(zhouchong), zyyzghb(zhangyongyue)) approval for modifying logging behavior (.info/.debug/.error/log_request).
There are 2 approved errors.
##[error]Process completed with exit code 6.

修复建议:

  1. 请 jiangjiajun(Jiang-Jia-Jun)、liuyuanle(yuanlehome)、chenjian26(rainyfly)或 wanglongzhi(Wanglongzhi2001)之一 Approve 对 fastdeploy/envs.py 的修改
  2. 请 zhouchong(xyxinyang)或 zhangyongyue(zyyzghb)之一 Approve 对日志行为的修改

修复建议摘要: 请相应 RD(jiangjiajun/zhouchong 等)在 GitHub 上 Approve 此 PR

关联变更: PR 新增 .error() 日志调用并修改 fastdeploy/envs.py

链接: 查看日志

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 7, 2026

Codecov Report

❌ Patch coverage is 23.33333% with 23 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@983b1a3). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/utils.py 33.33% 9 Missing and 3 partials ⚠️
...stdeploy/inter_communicator/engine_worker_queue.py 11.11% 8 Missing ⚠️
fastdeploy/engine/common_engine.py 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7746   +/-   ##
==========================================
  Coverage           ?   72.28%           
==========================================
  Files              ?      396           
  Lines              ?    55858           
  Branches           ?     8729           
==========================================
  Hits               ?    40377           
  Misses             ?    12719           
  Partials           ?     2762           
Flag Coverage Δ
GPU 72.28% <23.33%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

Jiang-Jia-Jun
Jiang-Jia-Jun previously approved these changes May 11, 2026
Copy link
Copy Markdown

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Paddle-CI-Agent | pr_review | 2026-05-11 20:59:30

📋 Review 摘要

PR 概述:默认开启 FD_ENABLE_E2W_TENSOR_CONVERTFD_ENGINE_TASK_QUEUE_WITH_SHM 两个环境变量,提升引擎张量传递与任务队列效率,并新增 SHM 模式下的 broken queue 检测、Unix socket 可用性检测及相关测试重构。

变更范围fastdeploy/envs.pyfastdeploy/engine/common_engine.pyfastdeploy/inter_communicator/engine_worker_queue.pyfastdeploy/utils.pytests/

影响面 Tag[FDConfig] [Engine] [CI]

📝 PR 规范检查

PR 标题使用了官方 Tag [FDConfig],格式合规。PR 描述包含全部必填 section(Motivation、Modifications、Usage or Command、Accuracy Tests、Checklist),结构合规,无需修改。

问题

级别 文件 概述
🟡 建议 tests/ci_use/Qwen2-7B-Instruct_offline/test_Qwen2-7B-Instruct_offline.py:87 time.sleep(2) 替换了基于端口检测的初始化等待逻辑,存在测试可靠性隐患
❓ 疑问 fastdeploy/inter_communicator/engine_worker_queue.py:859 is_broken()except Exception: return False 对非连接类异常静默返回"未损坏"
❓ 疑问 tests/ci_use/EB_Lite_with_adapter/test_eblite_serving.py:93 rm -rf /dev/shm/* 在共享 CI 环境中可能误删其他进程的 SHM 文件
❓ 疑问 tests/xpu_ci/conftest.py:106 同上,rm -rf /dev/shm/* 在共享环境存在误删风险

总体评价

本 PR 将两个性能优化开关默认开启,整体思路清晰,配套修改了端口检测逻辑和测试基础设施。主要关注点是测试可靠性(硬编码 2s sleep)和 CI 清理逻辑(rm -rf /dev/shm/* 过于激进),建议作者确认后再合入。

)
time.sleep(1)

time.sleep(2)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 time.sleep(2) 替换了原来基于端口轮询(最长等待 60s)的初始化检测,存在可靠性隐患。

原逻辑轮询 FD_ENGINE_QUEUE_PORT 是否已监听判断引擎就绪,启用 SHM 后 TCP 端口检测不再适用,但硬编码 2 秒在负载较高或模型较大时可能导致偶发测试失败。

建议改为轮询 SHM socket 文件或设合理上限重试(如 30s),以保持与原逻辑等价的健壮性。

except (ConnectionRefusedError, ConnectionResetError, BrokenPipeError, EOFError, OSError):
llm_logger.error("Failed to connect to engine worker queue")
return True
except Exception:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ 疑问 except Exception: return False 对所有非连接类异常(如 AttributeError)静默返回 False("队列正常"),可能掩盖潜在问题。

self.manager 在某些时序下为 Noneself.manager.connect() 会抛 AttributeError,调用方会误判队列健康而跳过重试。建议至少补充 warning 日志:

except Exception as e:
    llm_logger.warning(f"Unexpected error in is_broken check: {e}")
    return False

"""
# 清理/dev/shm中的临时文件
try:
subprocess.run("rm -rf /dev/shm/*", shell=True)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ 疑问 rm -rf /dev/shm/* 会删除 /dev/shm 下的所有文件,在并发 CI 场景下可能误删其他 job 使用的 SHM 段。

建议改为只清理本框架生成的 socket 文件:

rm -f /dev/shm/fd_task_queue_*.sock

Comment thread tests/xpu_ci/conftest.py

try:
# 清理/dev/shm下的所有文件
subprocess.run("rm -rf /dev/shm/*", shell=True, check=True)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ 疑问test_eblite_serving.pyrm -rf /dev/shm/* 在共享 CI 节点存在误删其他进程 SHM 文件的风险。

建议改为:

rm -f /dev/shm/fd_task_queue_*.sock

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants