Clarify v0-vs-v1 latency metric semantics and low-traffic percentiles #4735

Open
dustin-temporal wants to merge 1 commit into main from docs/openmetrics-percentile-sample-size

Conversation

dustin-temporal commented Jun 17, 2026

What

Two small additions to the OpenMetrics docs to close gaps that caused a customer-side latency alerting false alarm after migrating from the v0 query endpoint to the v1 OpenMetrics endpoint.

1. Migration guide — Percentile metrics section

Adds a caution that the v0 latency metrics are a histogram, not a percentile:

  • v0_service_latency_sum / v0_service_latency_count is an average (a mean, which sits closer to p50 than to a tail percentile).
  • v0_service_latency_bucket{le="..."} only counts requests at or below a threshold.
  • Neither is comparable to v1_service_latency_p95 / _p99, so v1 will report higher values for identical traffic — a measurement change, not a regression.
  • Includes safe-migration steps (start on p50 to match an average-based alert, move to p95/p99 deliberately) and a pointer to the p99 latency SLO.
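
The difference between the v0-style quantities and a percentile can be illustrated with a small sketch. The latency values below are made up, and the nearest-rank percentile is an assumption chosen for simplicity; the PR does not state which estimator the v1 endpoint uses.

```python
def nearest_rank_percentile(samples, q):
    """Return the q-th percentile (integer q, 1..100) by the nearest-rank method."""
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Hypothetical latencies in ms for 100 requests: 90 fast, 8 moderate, 2 slow.
latencies_ms = [20] * 90 + [150] * 8 + [900] * 2

# What a v0-style sum / count query yields: the mean latency.
mean_ms = sum(latencies_ms) / len(latencies_ms)

# What a v0-style cumulative bucket yields: a count of requests at or below 200 ms.
bucket_le_200 = sum(1 for x in latencies_ms if x <= 200)

# What v1-style percentile metrics describe: points in the sorted distribution.
p50 = nearest_rank_percentile(latencies_ms, 50)
p95 = nearest_rank_percentile(latencies_ms, 95)
p99 = nearest_rank_percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, bucket(le=200)={bucket_le_200} requests")
print(f"p50={p50} ms, p95={p95} ms, p99={p99} ms")
```

For this traffic the mean is 48 ms while p95 is 150 ms and p99 is 900 ms, so an alert threshold tuned against the mean would fire immediately if pointed at p95 or p99, even though nothing about the service changed.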

2. Metrics reference — Metric Conventions section

Adds a note that percentile metrics on low-traffic namespaces are computed from small per-minute samples, so a single slow request dominates p50/p95/p99. Recommends gating latency alerts on a minimum request count (e.g. service_request_count) so sparse windows don't trigger them. Also notes that these pre-calculated percentiles cannot be re-aggregated into an accurate longer-window percentile, so widening the evaluation window does not by itself make a sparse sample meaningful — consistent with the existing per-metric "avoid aggregating this metric" caution.
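
A minimal sketch of both points in the note above: gating a latency alert on a minimum request count, and why averaging pre-calculated percentiles does not recover a longer-window percentile. The function names, the 50-request minimum, and the sample values are hypothetical; the 200 ms threshold is borrowed from the SLO mentioned in this PR, and nearest-rank is an assumed estimator.

```python
def nearest_rank_percentile(samples, q):
    """Return the q-th percentile (integer q, 1..100) by the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def should_alert(latencies_ms, threshold_ms, min_requests, q=95):
    """Fire only if the window has enough requests for the percentile to mean something."""
    if len(latencies_ms) < min_requests:
        return False  # sparse window: suppress instead of trusting the quantile
    return nearest_rank_percentile(latencies_ms, q) > threshold_ms

# Two hypothetical one-minute windows containing the same single slow request.
sparse_minute = [25, 30, 28, 850]
busy_minute = [25] * 97 + [30, 28, 850]

sparse_p95 = nearest_rank_percentile(sparse_minute, 95)  # the slow request itself
busy_p95 = nearest_rank_percentile(busy_minute, 95)      # unaffected by it

ungated = sparse_p95 > 200
gated = should_alert(sparse_minute, threshold_ms=200, min_requests=50)

# Averaging per-minute percentiles vs. computing p95 over the raw samples.
averaged_p95 = (sparse_p95 + busy_p95) / 2
true_p95 = nearest_rank_percentile(sparse_minute + busy_minute, 95)

print(sparse_p95, busy_p95, ungated, gated, averaged_p95, true_p95)
```

In a real alert rule the gate would compare a request-count metric such as service_request_count against the minimum, since the raw samples are not available from pre-calculated percentiles.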

Why

A customer migrated to v1 cloud metrics, set a per-namespace p95 alert against the 200ms p99 SLO, and saw frequent StartWorkflowExecution latency spikes that turned out to be a metrics artifact. Their low-RPS namespaces produced tiny per-minute samples in which one slow request defined the whole quantile, and their v0 baseline had been an average rather than a percentile. There was no actual latency regression; the v0→v1 measurement change just made existing tail latency visible. These docs would have pre-empted the confusion.
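
As a rough worked bound on how small a sample must be for one request to define the quantile: assuming a nearest-rank estimator (an assumption; the PR does not say how the v1 percentiles are computed), the q-quantile of n sorted samples is the sample at rank ⌈qn⌉, and it coincides with the maximum whenever that rank equals n.

```latex
p_q = x_{(\lceil q n \rceil)}, \qquad x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}

\lceil q n \rceil = n \iff q n > n - 1 \iff n < \frac{1}{1 - q}

q = 0.95 \Rightarrow n < 20, \qquad q = 0.99 \Rightarrow n < 100
```

Under that assumption, any minute with fewer than 20 requests reports its single slowest request as p95, and any minute with fewer than 100 requests reports it as p99.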

Scope

Prose-only; no metric behavior changes. Two .mdx files, additive callouts only.

🤖 Generated with Claude Code

vercel Bot commented Jun 17, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| temporal-documentation | Ready | Preview, Comment | Jun 17, 2026 2:25pm |

github-actions Bot commented Jun 17, 2026

📖 Docs PR preview links

Two OpenMetrics doc gaps surfaced by a customer alerting false alarm after
migrating from the v0 query endpoint to v1 OpenMetrics:

- Migration guide: add a caution that v0 service_latency_sum/count is an
  average (~p50) and _bucket is a count, not a percentile. Comparing either
  against v1 _p95/_p99 reports higher values for identical traffic. Includes
  safe-migration steps and a pointer to the p99 latency SLO.
- Metrics reference: add a note that percentile metrics on low-traffic
  namespaces are computed from small per-minute samples, so a single slow
  request dominates p50/p95/p99. Recommends gating latency alerts on a
  minimum request count, and notes that pre-calculated percentiles cannot be
  re-aggregated into an accurate longer-window percentile.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dustin-temporal dustin-temporal force-pushed the docs/openmetrics-percentile-sample-size branch from 0b98052 to 8faa8b1 June 17, 2026 14:24
@dustin-temporal dustin-temporal marked this pull request as ready for review June 17, 2026 14:27
@dustin-temporal dustin-temporal requested a review from a team as a code owner June 17, 2026 14:27