[BugFix] Seperate prometheus multiproc dir for single-server multi-dp…#8062
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## release/online/20260415 #8062 +/- ##
==========================================================
Coverage ? 71.87%
==========================================================
Files ? 389
Lines ? 54539
Branches ? 8550
==========================================================
Hits ? 39202
Misses ? 12628
Partials ? 2709
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
CI报告基于以下代码生成(30分钟更新一次): 1 Required任务 : 5/7 通过
2 失败详情🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage — PR问题(置信度: 高)错误类型: PR问题 | 置信度: 高
关键日志:
修复建议:
关联变更: 🔴 Pre Commit — PR问题(置信度: 高)错误类型: PR问题 | 置信度: 高
关键日志:
修复建议:
关联变更: |
b604581 to
4e3b6e6
Compare
PaddlePaddle-bot
left a comment
There was a problem hiding this comment.
🤖 Paddle-CI-Agent | pr_review |
2026-06-18 17:38:50
📋 Review 摘要
PR 概述:为单机多 DP 服务拆分 Prometheus multiprocess 目录,并复用 helper 到 engine/common_engine/multi_api_server。
变更范围:fastdeploy/metrics/、fastdeploy/engine/、fastdeploy/entrypoints/openai/
影响面 Tag:[Engine] [APIServer]
问题
| 级别 | 文件 | 概述 |
|---|---|---|
| 🔴 Bug | fastdeploy/metrics/prometheus_multiprocess_setup.py:48 |
用户自定义 Prometheus 目录的 warning 被静默移除,既有测试未同步且 stale .db 风险仍在 |
| 🔴 Bug | fastdeploy/metrics/prometheus_multiprocess_setup.py:65 |
DP 目录命名改为嵌套 dp{i},但 multi_api_server 现有单测仍断言 _dp{i} |
历史 Findings 修复情况
| Finding | 问题 | 状态 |
|---|---|---|
| F1 | 单 API server 多 DP 模式会丢失 DP1+ 的 Prometheus 指标 | ✅ 已修复 |
| F2 | AsyncLLM/EngineService 的 DP 启动路径也会让 DP1+ 指标不可见 |
✅ 已修复 |
📝 PR 规范检查
标题和描述均不完全符合规范:目标分支是 release/online/20260415,标题缺少 [Cherry-Pick] 和原 PR 号;PR 描述保留模板占位内容,Modifications / Usage or Command / Accuracy Tests 未填写。
标题建议(可直接复制):
[Cherry-Pick][BugFix] Separate Prometheus multiprocess dirs for multi-DP services(#<develop PR ID>)
PR 描述建议(点击展开,可直接复制)
## Motivation
Fix metric interference when multiple data-parallel services run on a single server by isolating Prometheus multiprocess directories per DP rank.
## Modifications
- Add `setup_dp_prometheus_dir()` to derive a DP-specific `PROMETHEUS_MULTIPROC_DIR`.
- Use the helper when launching extra DP services from `LLMEngine` and `EngineService`.
- Reuse the helper in `multi_api_server` when starting per-DP API server processes.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.总体评价
核心修复路径已覆盖两个历史 DP 启动入口;但当前提交没有同步 metrics 相关单测,并且用户自定义 multiprocess 目录的 stale metrics warning 被静默移除,建议修复后再合入。
| user_dir = os.environ["PROMETHEUS_MULTIPROC_DIR"] | ||
| _original_prom_dir = user_dir | ||
| os.makedirs(user_dir, exist_ok=True) | ||
| llm_logger.info(f"PROMETHEUS_MULTIPROC_DIR is set to {user_dir}") |
There was a problem hiding this comment.
🔴 Bug 这里把用户自定义 PROMETHEUS_MULTIPROC_DIR 的告警降成了普通 info,但没有清理这个目录。
既有 tests/metrics/test_prometheus_multiprocess_setup.py::test_when_env_var_already_set 仍然 patch llm_logger.warning 并断言原 warning 文案;当前提交没有同步测试,会导致该单测稳定失败。运行时也会静默保留用户目录中的旧 .db 文件,继续存在 stale metrics 风险。
建议修复方式:保留 warning 语义,或明确改成清理/隔离用户目录并同步更新对应测试断言。
| base_dir: Original PROMETHEUS_MULTIPROC_DIR (before any dp suffix). | ||
| env_dict: If provided, write to this dict instead of os.environ. | ||
| """ | ||
| prom_dir_dp = os.path.join(base_dir, f"dp{dp_id}") |
There was a problem hiding this comment.
🔴 Bug 新 helper 现在生成 {base_dir}/dp{i},但现有 multi-api-server 单测仍断言目录名包含 _dp{i}。
tests/entrypoints/openai/test_multi_api_server.py::test_prometheus_multiprocess_dir_per_dp 会捕获传给 subprocess.Popen 的 env,并在当前仓库代码中检查 self.assertIn(f"_dp{i}", prom_dir, ...)。这个 PR 没有同步测试,所以该路径会稳定失败。
建议修复方式:如果新目录约定是嵌套 dp{i},同步修改该测试断言;如果需要保持现有兼容性,则让 helper 继续生成原来的 sibling _dp{i} 路径。
… services
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.