[BugFix] Seperate prometheus multiproc dir for single-server multi-dp services #8059

Open
liyonghua0910 wants to merge 1 commit into PaddlePaddle:develop from liyonghua0910:develop+20260616_fix_dp_metrics
Conversation

@liyonghua0910

Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@codecov-commenter

codecov-commenter commented Jun 16, 2026

Codecov Report

❌ Patch coverage is 35.13514% with 24 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@cbb0811). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...astdeploy/metrics/prometheus_multiprocess_setup.py | 47.82% | 12 Missing ⚠️ |
| fastdeploy/engine/common_engine.py | 16.66% | 5 Missing ⚠️ |
| fastdeploy/engine/engine.py | 16.66% | 5 Missing ⚠️ |
| fastdeploy/entrypoints/openai/multi_api_server.py | 0.00% | 2 Missing ⚠️ |
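
The per-file rows can be cross-checked against the headline patch coverage. The sketch below infers each file's changed-line count from its reported percentage and missing-line count; the inferred totals are an assumption derived from the rounded figures in the report, not numbers Codecov states directly, and `infer_total` is a hypothetical helper.

```python
# (patch coverage %, missing lines) per file, as listed in the report.
reported = {
    "prometheus_multiprocess_setup.py": (47.82, 12),
    "common_engine.py": (16.66, 5),
    "engine.py": (16.66, 5),
    "multi_api_server.py": (0.00, 2),
}


def infer_total(pct: float, missing: int) -> int:
    """Estimate changed lines from the coverage percentage and the missing count."""
    return round(missing / (1 - pct / 100))


totals = {name: infer_total(pct, miss) for name, (pct, miss) in reported.items()}
patch_total = sum(totals.values())
patch_missing = sum(miss for _, miss in reported.values())
patch_pct = 100 * (patch_total - patch_missing) / patch_total

print(totals)
print(f"{patch_total - patch_missing}/{patch_total} lines covered = {patch_pct:.5f}%")
```

Under these inferred totals, the missing-line counts sum to the 24 lines in the headline, and the resulting percentage agrees with the reported patch coverage.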
Additional details and impacted files
@@            Coverage Diff            @@
##             develop   #8059   +/-   ##
=========================================
  Coverage           ?   6.96%           
=========================================
  Files              ?     475           
  Lines              ?   66884           
  Branches           ?   10317           
=========================================
  Hits               ?    4660           
  Misses             ?   62133           
  Partials           ?      91           
| Flag | Coverage Δ |
|---|---|
| XPU | 6.96% <35.13%> (?) |

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

PaddlePaddle-bot commented Jun 18, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-18 18:44:05

The CI report is generated from the following code (refreshed every 30 minutes):
PR commit: 9a08ce4 | Merge base: cbb0811 (branch: develop)


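1 Required jobs: 7/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 41(0) | 41 | 34 | 4 | 2 | 1 | 0 |

| Job | Error type | Confidence | Log |
|---|---|---|---|
| Pre Commit | PR issue: isort import ordering is non-compliant | | Job |
| Approval | Approval required | | Job |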

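2 Failure details

🔴 Pre Commit: PR issue (confidence: high)

Analyzer: generic analysis (fallback)

Failed cases:

| Case | Error summary |
|---|---|
| isort (fastdeploy/engine/common_engine.py, fastdeploy/engine/engine.py) | Import order/format does not conform to isort; the hook failed after automatically modifying the files |

Key log: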

isort....................................................................Failed
- hook id: isort
- files were modified by this hook
Fixing /home/paddle-8/actions-runner/_work/FastDeploy/FastDeploy/fastdeploy/engine/common_engine.py
Fixing /home/paddle-8/actions-runner/_work/FastDeploy/FastDeploy/fastdeploy/engine/engine.py
pre-commit run --files fastdeploy/engine/common_engine.py fastdeploy/engine/engine.py fastdeploy/entrypoints/openai/multi_api_server.py fastdeploy/metrics/prometheus_multiprocess_setup.py
  • Root cause summary: the newly added import is not sorted according to isort

The PR adds from fastdeploy.metrics.prometheus_multiprocess_setup import setup_dp_prometheus_dir, get_original_prom_dir to both engine files, and its ordering and line wrapping violate the isort rules. isort --check-only --diff shows that the import should be split into a parenthesized multi-line import with get_original_prom_dir placed before setup_dp_prometheus_dir, so this code-style failure is caused directly by the PR's changes.

Suggested fix:

  1. In fastdeploy/engine/common_engine.py:68 and fastdeploy/engine/engine.py:53, change this import to the multi-line format that isort outputs: get_original_prom_dir first, then setup_dp_prometheus_dir.
  2. Run pre-commit run --files fastdeploy/engine/common_engine.py fastdeploy/engine/engine.py fastdeploy/entrypoints/openai/multi_api_server.py fastdeploy/metrics/prometheus_multiprocess_setup.py locally to confirm.
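
The import layout described above can be sketched as follows. This is a minimal illustration, assuming the project's isort configuration emits parenthesized imports with one name per line and a trailing comma; `format_from_import` is a hypothetical helper, not part of isort or FastDeploy, and it only reproduces alphabetical ordering of lowercase names rather than isort's full rule set.

```python
def format_from_import(module: str, names: list[str]) -> str:
    """Render a parenthesized from-import with the names in alphabetical order.

    Hypothetical helper that imitates the layout the CI report describes.
    """
    lines = [f"from {module} import ("]
    lines += [f"    {name}," for name in sorted(names)]
    lines.append(")")
    return "\n".join(lines)


print(
    format_from_import(
        "fastdeploy.metrics.prometheus_multiprocess_setup",
        ["setup_dp_prometheus_dir", "get_original_prom_dir"],
    )
)
```

The authoritative result is still whatever `pre-commit` writes back into the two files; the sketch only shows why get_original_prom_dir ends up first.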

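Related changes: fastdeploy/engine/common_engine.py:68, fastdeploy/engine/engine.py:53

🔴 Approval: approval required (confidence: high)

This job requires manual approval; CI continues only after the approval is granted.

  • Root cause summary: the workflow is waiting for manual approval

Suggested fix:

  1. Have the job manually approved.

Related changes: none; this is an approval-workflow state.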

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-18 17:35:56

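📋 Review summary

PR overview: splits the Prometheus multiprocess directory by DP rank for single-API-server, multi-DP services.
Scope of changes: DP sub-service startup in Engine/EngineService, the multi_api_server child-process environment, and Prometheus multiprocess setup.
Affected tags: [Engine] [APIServer]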
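
The per-rank directory split summarized above can be illustrated with a small sketch. It assumes a layout of `dp<rank>` subdirectories under a common base directory and uses a hypothetical context manager, `dp_multiproc_dir`; it is not the PR's `setup_dp_prometheus_dir`, whose signature is not shown in this conversation. Unlike the PR, which is described as switching back to the DP0 directory, this sketch simply restores whatever value was set before.

```python
import os
import tempfile
from contextlib import contextmanager

ENV_KEY = "PROMETHEUS_MULTIPROC_DIR"


@contextmanager
def dp_multiproc_dir(base_dir: str, dp_rank: int):
    """Temporarily point ENV_KEY at base_dir/dp<rank> (hypothetical helper)."""
    previous = os.environ.get(ENV_KEY)
    dp_dir = os.path.join(base_dir, f"dp{dp_rank}")
    os.makedirs(dp_dir, exist_ok=True)
    os.environ[ENV_KEY] = dp_dir
    try:
        yield dp_dir
    finally:
        # Restore the parent's original setting so later code is unaffected.
        if previous is None:
            os.environ.pop(ENV_KEY, None)
        else:
            os.environ[ENV_KEY] = previous


base = tempfile.mkdtemp(prefix="prom_multiproc_")
for rank in range(2):
    with dp_multiproc_dir(base, rank) as path:
        # A real launcher would start the DP child process here,
        # so that the child inherits the per-rank directory.
        print(rank, os.environ[ENV_KEY] == path)
```

Scoping the change to a `with` block keeps the window in which the parent process sees a child's directory as short as possible.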

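Issues

No new blocking issues beyond the historical ones were found in this round; see the table below for the status of historical findings. PR convention issues are reported in the section that follows.

Status of historical findings

| Finding | Issue | Status |
|---|---|---|
| F1 | After the parent process environment is switched to dp0, the public /metrics endpoint still collects only from the current directory. | ⚠️ Still present |
| F2 | The EngineService/AsyncLLM startup path also switches back to dp0, but the metrics HTTP endpoint does not aggregate the multiple DP subdirectories. | ⚠️ Still present |
| F3 | Every /metrics scrape writes the full metrics text to the INFO log. | ✅ Fixed |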

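📝 PR convention check

The PR title tag conforms to the convention, but every section of the description still contains template placeholder content. Consider replacing it with the copy-ready version below.

Suggested title (copy-ready):

  • [BugFix] Separate prometheus multiproc dir for single-server multi-dp services

Suggested PR description (copy-ready):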
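## Motivation
Fixes an issue in single-API-server, multi-DP deployments where a shared Prometheus multiprocess directory makes Counter/Histogram metrics easy to mix up.

## Modifications
- `fastdeploy/metrics/prometheus_multiprocess_setup.py`: adds `setup_dp_prometheus_dir`, which creates a separate Prometheus multiprocess directory per DP id.
- `fastdeploy/engine/engine.py`, `fastdeploy/engine/common_engine.py`: switch to the corresponding DP's `PROMETHEUS_MULTIPROC_DIR` before starting each DP sub-service, then switch back to the DP0 directory once the sub-service has started.

## Usage or Command
N/A

## Accuracy Tests
N/A (only the metrics multiprocess directory is adjusted; model accuracy is unaffected)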

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

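Overall assessment

The current directory split addresses only part of the problem, namely isolating the write paths. The collection side still does not aggregate the dp0/dp1/... subdirectories, so the core issue behind F1/F2 remains open. The recommendation is to make /metrics explicitly iterate over each DP directory and merge the collected results before this PR is merged.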
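
The first half of that recommendation, enumerating the per-DP directories, might look like the sketch below. It assumes the `dp<N>` subdirectory naming mentioned in the review and the `*.db` files that `prometheus_client` writes in multiprocess mode; `list_dp_metric_files` is a hypothetical helper, and how FastDeploy's /metrics handler would merge the files is not shown in this conversation.

```python
import glob
import os
import tempfile


def list_dp_metric_files(base_dir: str) -> dict[str, list[str]]:
    """Map each dp<N> subdirectory name to the sorted *.db files inside it."""
    found = {}
    for entry in sorted(os.listdir(base_dir)):
        path = os.path.join(base_dir, entry)
        if os.path.isdir(path) and entry.startswith("dp") and entry[2:].isdigit():
            found[entry] = sorted(glob.glob(os.path.join(path, "*.db")))
    return found


# Build a throwaway layout with empty placeholder files to exercise the helper.
base = tempfile.mkdtemp(prefix="prom_multiproc_")
for rank in range(2):
    dp_dir = os.path.join(base, f"dp{rank}")
    os.makedirs(dp_dir)
    open(os.path.join(dp_dir, f"counter_{1000 + rank}.db"), "w").close()

for name, files in list_dp_metric_files(base).items():
    print(name, [os.path.basename(f) for f in files])
```

The flattened file list could then be handed to a multiprocess-aware merge step (for example the static `MultiProcessCollector.merge` in `prometheus_client`); whether values from different DP ranks should be summed or kept apart by a label is a design decision the review leaves open.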

3 participants