
[BugFix] Separate prometheus multiproc dir for single-server multi-dp… #8062

Open
liyonghua0910 wants to merge 1 commit into PaddlePaddle:release/online/20260415 from liyonghua0910:release/online/20260415+20260616_fix_dp_metrics

Conversation

@liyonghua0910

Collaborator

… services

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented Jun 17, 2026


Codecov Report

❌ Patch coverage is 75.67568% with 9 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/online/20260415@eb7ea99). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/engine/engine.py | 16.66% | 5 Missing ⚠️ |
| fastdeploy/engine/common_engine.py | 33.33% | 2 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@                    Coverage Diff                     @@
##             release/online/20260415    #8062   +/-   ##
==========================================================
  Coverage                           ?   71.87%           
==========================================================
  Files                              ?      389           
  Lines                              ?    54539           
  Branches                           ?     8550           
==========================================================
  Hits                               ?    39202           
  Misses                             ?    12628           
  Partials                           ?     2709           

| Flag | Coverage Δ |
|---|---|
| GPU | 71.87% <75.67%> (?) |


@PaddlePaddle-bot

PaddlePaddle-bot commented Jun 18, 2026


🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-18 20:40:13 UTC+08:00

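The CI report is generated from the following code (refreshed every 30 minutes):
PR commit: 4e3b6e6 | Merge base: eb7ea99 (branch: release/online/20260415)

1 Required jobs: 5/7 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 20 (0) | 20 | 18 | 2 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
|---|---|---|---|
| Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | PR issue | | Job |
| Pre Commit | PR issue | | Job |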

2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Error type: PR issue | Confidence: high
Analyzer: generic analysis (fallback)
Failed cases:

| Case | Error summary |
|---|---|
| entrypoints/openai/test_multi_api_server.py::TestMultiApiServer::test_prometheus_multiprocess_dir_per_dp | The implementation generates /dp0, but the test still asserts that the path contains _dp0 |
| metrics/test_prometheus_multiprocess_setup.py::TestSetupMultiprocessPrometheus::test_when_env_var_already_set | The implementation now calls llm_logger.info, but the test still expects llm_logger.warning |
| cache_manager/test_cache_transfer_manager.py::TestCacheTransferManager::test_init_storage_buffer_registers_scale_buffers, graph_optimization/test_graph_opt_backend.py::TestGraphOptBackend::test_static_graph | Additional AttributeError/TypeError failures in files this PR does not change; these need separate confirmation |

Key logs:

E AssertionError: '_dp0' not found in '/tmp/prom_main_.../dp0'
E AssertionError: Expected 'warning' to be called once. Called 0 times.
E AttributeError: 'CacheTransferManager' object has no attribute 'cache_scale_shape'
E TypeError: data forward_meta.seq_lens_kv has type annotation Tensor but got type <class 'NoneType'>
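  • Root cause summary: the Prometheus DP directory implementation does not match what the tests expect
    The PR adds os.path.join(base_dir, f"dp{dp_id}") at fastdeploy/metrics/prometheus_multiprocess_setup.py:65, so after the call at fastdeploy/entrypoints/openai/multi_api_server.py:111 the child process receives .../dp0, but tests/entrypoints/openai/test_multi_api_server.py:234 still checks for _dp0. In addition, the branch of setup_multiprocess_prometheus() that handles an already-set environment variable was changed from warning to info, while tests/metrics/test_prometheus_multiprocess_setup.py:54 still mocks warning, so that assertion fails. The log also contains two other failures, a missing CacheTransferManager.cache_scale_shape and forward_meta.seq_lens_kv=None; the PR diff does not touch those modules, so they are not treated as direct root causes of this PR.

Fix suggestions:

  1. Settle on one DP directory convention: if the new convention is a dp{i} subdirectory, update tests/entrypoints/openai/test_multi_api_server.py:234 to check for a basename/suffix of dp{i}; if the old format is still required, change setup_dp_prometheus_dir() to generate _dp{i}.
  2. Update the test for the already-set environment variable branch: change tests/metrics/test_prometheus_multiprocess_setup.py:54 to mock llm_logger.info and the new log message, or restore the warning behavior.
  3. For the two cache/graph failures in files not changed by this PR, get confirmation from the owners or handle them separately so they do not keep blocking the main test job.

Related changes: fastdeploy/metrics/prometheus_multiprocess_setup.py:45, fastdeploy/metrics/prometheus_multiprocess_setup.py:65, fastdeploy/entrypoints/openai/multi_api_server.py:111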
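
To make the mismatch concrete, the sketch below contrasts the two naming conventions. `nested_dp_dir` and `sibling_dp_dir` are hypothetical names used only for illustration, not functions from the PR; the base path is made up, and the exact shape of the old sibling format is an assumption inferred from the `_dp0` assertion.

```python
import os


def nested_dp_dir(base_dir: str, dp_id: int) -> str:
    """New convention reported by CI: a dp{i} subdirectory under the base dir."""
    return os.path.join(base_dir, f"dp{dp_id}")


def sibling_dp_dir(base_dir: str, dp_id: int) -> str:
    """Assumed old convention: a sibling directory with a _dp{i} suffix."""
    return f"{base_dir}_dp{dp_id}"


base = "/tmp/prom_main_example"  # illustrative path only
nested = nested_dp_dir(base, 0)
sibling = sibling_dp_dir(base, 0)

# The existing assertion looks for the substring "_dp0", which only the
# sibling form contains.
print("_dp0" in nested)   # False
print("_dp0" in sibling)  # True

# An assertion that matches the nested convention could compare the basename.
print(os.path.basename(nested) == "dp0")  # True
```

Checking the basename rather than a substring also avoids accidental matches when the base directory name itself happens to contain `dp0`.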

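🔴 Pre Commit: PR issue (confidence: high)

Error type: PR issue | Confidence: high
Analyzer: generic analysis (fallback)
Failed cases:

| Case | Error summary |
|---|---|
| pre-commit/isort | isort modified fastdeploy/engine/common_engine.py and fastdeploy/engine/engine.py |

Key logs: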

isort....................................................................Failed
- hook id: isort
- files were modified by this hook
Fixing .../fastdeploy/engine/common_engine.py
Fixing .../fastdeploy/engine/engine.py
  • Root cause summary: the new import is not in the order isort requires
    The PR adds from fastdeploy.metrics.prometheus_multiprocess_setup import setup_dp_prometheus_dir, get_original_prom_dir to both engine files. isort automatically corrected the order of the names within that import, so pre-commit failed with exit code 1.

Fix suggestions:

  1. Run the command given in the log: pre-commit run --files fastdeploy/engine/common_engine.py fastdeploy/engine/engine.py fastdeploy/entrypoints/openai/multi_api_server.py fastdeploy/metrics/prometheus_multiprocess_setup.py
  2. If fixing by hand, change the import order in both engine files to get_original_prom_dir, setup_dp_prometheus_dir, then re-trigger CI.

Related changes: fastdeploy/engine/common_engine.py, fastdeploy/engine/engine.py
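
As a quick illustration of the ordering fix, the snippet below checks whether the names in a from-import are in alphabetical order. It is a simplified stand-in for what isort enforces, assuming lowercase function names only; isort's actual ordering rules are configurable and are not reproduced here. `names_sorted` is a hypothetical helper.

```python
def names_sorted(names: list[str]) -> bool:
    """Return True if the imported names are already in alphabetical order."""
    return names == sorted(names)


# Order as added by the PR, and the order isort rewrote it to.
original = ["setup_dp_prometheus_dir", "get_original_prom_dir"]
fixed = ["get_original_prom_dir", "setup_dp_prometheus_dir"]

print(names_sorted(original))  # False
print(names_sorted(fixed))     # True
```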

@liyonghua0910 liyonghua0910 force-pushed the release/online/20260415+20260616_fix_dp_metrics branch from b604581 to 4e3b6e6 Compare June 18, 2026 09:24

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-06-18 17:38:50

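📋 Review summary

PR overview: Split the Prometheus multiprocess directory for single-server multi-DP services, and reuse the helper in engine, common_engine, and multi_api_server.
Change scope: fastdeploy/metrics/, fastdeploy/engine/, fastdeploy/entrypoints/openai/
Impact tags: [Engine] [APIServer]

Issues

| Level | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/metrics/prometheus_multiprocess_setup.py:48 | The warning for a user-defined Prometheus directory was silently removed; the existing test was not updated and the stale .db risk remains |
| 🔴 Bug | fastdeploy/metrics/prometheus_multiprocess_setup.py:65 | DP directory naming changed to a nested dp{i}, but the existing multi_api_server unit test still asserts _dp{i} |

Status of previous findings

| Finding | Issue | Status |
|---|---|---|
| F1 | Single API server multi-DP mode loses the Prometheus metrics of DP1+ | ✅ Fixed |
| F2 | The DP launch path of AsyncLLM/EngineService also makes DP1+ metrics invisible | ✅ Fixed |

📝 PR convention check

Neither the title nor the description fully follows the conventions: the target branch is release/online/20260415, but the title lacks [Cherry-Pick] and the original PR number; the PR description still contains template placeholder content, and Modifications / Usage or Command / Accuracy Tests are not filled in.

Suggested title (ready to copy):

  • [Cherry-Pick][BugFix] Separate Prometheus multiprocess dirs for multi-DP services(#<develop PR ID>)
Suggested PR description (ready to copy):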
## Motivation
Fix metric interference when multiple data-parallel services run on a single server by isolating Prometheus multiprocess directories per DP rank.

## Modifications
- Add `setup_dp_prometheus_dir()` to derive a DP-specific `PROMETHEUS_MULTIPROC_DIR`.
- Use the helper when launching extra DP services from `LLMEngine` and `EngineService`.
- Reuse the helper in `multi_api_server` when starting per-DP API server processes.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

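Overall assessment

The core fix covers both previously reported DP launch entry points. However, the current commit does not update the metrics-related unit tests, and the stale-metrics warning for a user-defined multiprocess directory was silently removed. Fixing these before merging is recommended.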

user_dir = os.environ["PROMETHEUS_MULTIPROC_DIR"]
_original_prom_dir = user_dir
os.makedirs(user_dir, exist_ok=True)
llm_logger.info(f"PROMETHEUS_MULTIPROC_DIR is set to {user_dir}")


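🔴 Bug: This downgrades the warning for a user-defined PROMETHEUS_MULTIPROC_DIR to a plain info message, but the directory is still not cleaned.

The existing tests/metrics/test_prometheus_multiprocess_setup.py::test_when_env_var_already_set still patches llm_logger.warning and asserts the original warning text. The current commit does not update the test, so this unit test fails consistently. At runtime, old .db files in the user directory are also silently kept, so the stale metrics risk remains.

Suggested fix: keep the warning semantics, or explicitly switch to cleaning or isolating the user directory, and update the corresponding test assertions.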
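
One way to address the stale-file concern is sketched below, assuming it is acceptable to delete leftover `.db` files from a user-supplied directory before metrics are collected. `clear_stale_metric_files` is a hypothetical helper, not part of the PR, and whether deleting files from a user directory is appropriate is a decision for the maintainers.

```python
import glob
import os
import tempfile


def clear_stale_metric_files(prom_dir: str) -> list[str]:
    """Remove leftover *.db files from prom_dir and return the removed paths."""
    os.makedirs(prom_dir, exist_ok=True)
    removed = []
    for path in sorted(glob.glob(os.path.join(prom_dir, "*.db"))):
        os.remove(path)
        removed.append(path)
    return removed


# Demonstration in a throwaway directory with one fake stale metrics file.
with tempfile.TemporaryDirectory() as tmp:
    stale = os.path.join(tmp, "counter_123.db")
    open(stale, "w").close()
    removed = clear_stale_metric_files(tmp)
    remaining = os.listdir(tmp)
    print([os.path.basename(p) for p in removed])  # ['counter_123.db']
    print(remaining)  # []
```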

base_dir: Original PROMETHEUS_MULTIPROC_DIR (before any dp suffix).
env_dict: If provided, write to this dict instead of os.environ.
"""
prom_dir_dp = os.path.join(base_dir, f"dp{dp_id}")


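🔴 Bug: The new helper now generates {base_dir}/dp{i}, but the existing multi-api-server unit test still asserts that the directory name contains _dp{i}.

tests/entrypoints/openai/test_multi_api_server.py::test_prometheus_multiprocess_dir_per_dp captures the env passed to subprocess.Popen and, in the current repository code, checks self.assertIn(f"_dp{i}", prom_dir, ...). This PR does not update the test, so that path fails consistently.

Suggested fix: if the new directory convention is a nested dp{i}, update the test assertion accordingly; if compatibility with the existing layout is required, have the helper keep generating the original sibling _dp{i} path.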
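
For reference, a minimal sketch of a helper with the behavior described in this thread follows. Only the `os.path.join(base_dir, f"dp{dp_id}")` line and the `base_dir`/`env_dict` docstring entries come from the diff; the function name used here, the parameter order, the eager directory creation, and the return value are assumptions made for illustration.

```python
import os
import tempfile
from typing import Optional


def setup_dp_prometheus_dir_sketch(
    dp_id: int, base_dir: str, env_dict: Optional[dict] = None
) -> str:
    """Derive a per-DP PROMETHEUS_MULTIPROC_DIR under base_dir (illustrative)."""
    prom_dir_dp = os.path.join(base_dir, f"dp{dp_id}")  # line shown in the diff
    os.makedirs(prom_dir_dp, exist_ok=True)  # assumption: create it eagerly
    # Write to the given dict if one is provided, otherwise to os.environ.
    target = os.environ if env_dict is None else env_dict
    target["PROMETHEUS_MULTIPROC_DIR"] = prom_dir_dp
    return prom_dir_dp


# Build a separate environment per child process without touching os.environ.
with tempfile.TemporaryDirectory() as base:
    child_envs = []
    for dp_id in range(2):
        env = dict(os.environ)
        setup_dp_prometheus_dir_sketch(dp_id, base, env_dict=env)
        child_envs.append(env)
    dir_names = [os.path.basename(e["PROMETHEUS_MULTIPROC_DIR"]) for e in child_envs]
    print(dir_names)  # ['dp0', 'dp1']
```

Passing a copied environment through `env_dict` keeps the parent process's own `PROMETHEUS_MULTIPROC_DIR` unchanged while each child gets its own directory.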
