Clarify v0-vs-v1 latency metric semantics and low-traffic percentiles by dustin-temporal · Pull Request #4735 · temporalio/documentation

dustin-temporal · 2026-06-17T14:14:43Z

What

Two small additions to the OpenMetrics docs to close gaps that caused a customer-side latency alerting false alarm after migrating from the v0 query endpoint to the v1 OpenMetrics endpoint.

1. Migration guide — `Percentile metrics` section

Adds a caution that the v0 latency metrics are a histogram, not a percentile:

- `v0_service_latency_sum / v0_service_latency_count` is an average (≈ p50).
- `v0_service_latency_bucket{le="..."}` only counts requests at or under a threshold.
- Neither is comparable to `v1_service_latency_p95` / `_p99`, so v1 will report higher values for identical traffic: a measurement change, not a regression.

Includes safe-migration steps (start on p50 to match an average-based alert, move to p95/p99 deliberately) and a pointer to the p99 latency SLO.
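
To illustrate the caution above, here is a minimal Python sketch of how a sum/count average, a bucket count, and percentiles diverge over identical requests. The latency values, the variable names, and the nearest-rank percentile method are assumptions for illustration only; this PR does not state how the v1 endpoint computes its percentiles.

```python
def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100 (an assumed method)."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [40] * 60 + [50] * 30 + [80] * 5 + [200] * 5

# v0-style average: latency sum divided by request count.
v0_style_average = sum(latencies_ms) / len(latencies_ms)

# v0-style bucket: how many requests finished at or under 100 ms (le="100").
v0_style_bucket_le_100 = sum(1 for value in latencies_ms if value <= 100)

# v1-style percentiles computed over the very same requests.
p50 = nearest_rank_percentile(latencies_ms, 50)
p95 = nearest_rank_percentile(latencies_ms, 95)
p99 = nearest_rank_percentile(latencies_ms, 99)

print(f"average={v0_style_average} bucket_le_100={v0_style_bucket_le_100}")
print(f"p50={p50} p95={p95} p99={p99}")
```

With this sample the average is 53 ms while p95 is 80 ms and p99 is 200 ms, so a threshold tuned against the average would trip on p95 or p99 without any change in traffic.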

2. Metrics reference — `Metric Conventions` section

Adds a note that percentile metrics on low-traffic namespaces are computed from small per-minute samples, so a single slow request dominates p50/p95/p99. Recommends gating latency alerts on a minimum request count (e.g. service_request_count) so sparse windows don't trigger them. Also notes that these pre-calculated percentiles cannot be re-aggregated into an accurate longer-window percentile, so widening the evaluation window does not by itself make a sparse sample meaningful — consistent with the existing per-metric "avoid aggregating this metric" caution.
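
A rough Python sketch of both points in this note: gating a latency alert on a minimum request count, and why averaging pre-calculated percentiles does not recover a longer-window percentile. Everything here is hypothetical: `WindowSample`, `should_alert`, the `min_requests=100` floor, and the nearest-rank percentile method are assumptions, and the 200 ms threshold only echoes the SLO figure mentioned in this PR. A real alert rule would live in the monitoring system's own query language.

```python
from dataclasses import dataclass


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100 (an assumed method)."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


@dataclass
class WindowSample:
    request_count: int  # requests seen in the window (e.g. a request-count metric)
    p95_ms: float       # pre-calculated p95 latency for the same window


def should_alert(sample, threshold_ms=200.0, min_requests=100):
    """Fire only when the window holds enough requests for p95 to be meaningful."""
    if sample.request_count < min_requests:
        return False  # sparse window: one slow request could define the percentile
    return sample.p95_ms > threshold_ms


# Hypothetical raw latencies (ms) for two one-minute windows.
busy_minute = [20] * 99 + [900]  # 100 requests, one of them slow
sparse_minute = [900]            # a single slow request

windows = [
    WindowSample(len(minute), nearest_rank_percentile(minute, 95))
    for minute in (busy_minute, sparse_minute)
]

ungated = [w.p95_ms > 200.0 for w in windows]  # naive threshold check
gated = [should_alert(w) for w in windows]     # same check behind a count gate

# Averaging per-minute p95 values is not the p95 of the combined window.
averaged_p95 = sum(w.p95_ms for w in windows) / len(windows)
pooled_p95 = nearest_rank_percentile(busy_minute + sparse_minute, 95)

print(ungated, gated, averaged_p95, pooled_p95)
```

In this sketch the ungated check fires on the one-request minute, the gated check does not, and the averaged p95 (460 ms) bears little relation to the p95 of the pooled raw requests (20 ms). The pooled value is only computable here because the raw latencies are available, which is not the case for pre-calculated percentile metrics.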

Why

A customer migrated to v1 cloud metrics, set a per-namespace p95 alert against the 200ms p99 SLO, and saw frequent StartWorkflowExecution latency spikes that turned out to be a metrics artifact: their low-RPS namespaces produced tiny per-minute samples where one slow request defined the whole quantile, and their v0 baseline had been an average rather than a percentile. No actual latency regression — the v0→v1 measurement change just made existing tail latency visible. These docs would have pre-empted the confusion.

Scope

Prose-only; no metric behavior changes. Two .mdx files, additive callouts only.

🤖 Generated with Claude Code

vercel · 2026-06-17T14:14:50Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
temporal-documentation	Ready	Preview, Comment	Jun 17, 2026 2:25pm

github-actions · 2026-06-17T14:15:18Z

📖 Docs PR preview links

Cloud
- Metrics
  - Openmetrics
    - Metrics reference
    - Migration guide

Two OpenMetrics doc gaps surfaced by a customer alerting false alarm after migrating from the v0 query endpoint to v1 OpenMetrics:

- Migration guide: add a caution that v0 service_latency_sum/count is an average (~p50) and _bucket is a count, not a percentile. Comparing either against v1 _p95/_p99 reports higher values for identical traffic. Includes safe-migration steps and a pointer to the p99 latency SLO.
- Metrics reference: add a note that percentile metrics on low-traffic namespaces are computed from small per-minute samples, so a single slow request dominates p50/p95/p99. Recommends gating latency alerts on a minimum request count, and notes that pre-calculated percentiles cannot be re-aggregated into an accurate longer-window percentile.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

vercel Bot deployed to Preview June 17, 2026 14:15

dustin-temporal force-pushed the docs/openmetrics-percentile-sample-size branch from 0b98052 to 8faa8b1 June 17, 2026 14:24

vercel Bot deployed to Preview June 17, 2026 14:25

dustin-temporal marked this pull request as ready for review June 17, 2026 14:27

dustin-temporal requested a review from a team as a code owner June 17, 2026 14:27

TimSimmons approved these changes Jun 17, 2026

dustin-temporal wants to merge 1 commit into main from docs/openmetrics-percentile-sample-size

2 participants
