Add Guidance wrt Labelling to Naming and Rules Best Practices #2691

Open
conallob wants to merge 27 commits into prometheus:main from conallob:main

@conallob

Add Guidance wrt Labelling to Naming and Rules Best Practices to docs/practices/naming.md and docs/practices/rules.md, specifically:

  • The primary purposes of job and instance
  • Include WARNINGS about accidentally stripping the job label, especially in multi-tenant systems

This Fixes #2690

conallob added 2 commits July 15, 2025 13:39
Signed-off-by: Conall O'Brien <conall@conall.net>
Signed-off-by: Conall O'Brien <conall@conall.net>
@conallob
Author

Friendly ping @SuperQ @beorn7 ?

@conallob
Author

conallob commented Jul 28, 2025

Obligatory post-it note reminder: https://photos.app.goo.gl/Bkfir4wRiLtNVG4W8

@beorn7 beorn7 requested review from SuperQ and juliusv July 29, 2025 10:48
@beorn7
Contributor

beorn7 commented Jul 29, 2025

With my current patchy availability, there is little chance I get to this anytime soon. Maybe @juliusv has a qualified opinion here?

@conallob
Author

conallob commented Aug 3, 2025

Friendly ping @SuperQ @juliusv

@andrechalella

Hey @conallob, congrats for the nice PR!

Just one thing: maybe you could fix the eaach typo in

  • The job label is a primary key to differentiate metrics from eaach other.

There is a small confusion I would love to see fixed, in a paragraph just below one of your edits. It is:

To keep the operations clean, _sum is omitted if there are other operations,
as sum().

I don't understand the as sum() part. Like, "x is omitted if there are other operations such as x"? It doesn't make sense to me, in a very basic way. I know it's out of the scope of this PR, but maybe you could touch it to clarify.

Signed-off-by: Conall O'Brien <conall@conall.net>
@conallob
Author

> Hey @conallob, congrats for the nice PR!
>
> Just one thing: maybe you could fix the eaach typo in
>
>   • The job label is a primary key to differentiate metrics from eaach other.
>
> There is a small confusion I would love to see fixed, in a paragraph just below one of your edits. It is:

Fixed the typo.

> To keep the operations clean, _sum is omitted if there are other operations,
> as sum().
>
> I don't understand the as sum() part. Like, "x is omitted if there are other operations such as x"? It doesn't make sense to me, in a very basic way. I know it's out of the scope of this PR, but maybe you could touch it to clarify.

I'm afraid that best practice is unrelated.

It also makes sense as written, once you've written enough rules. It's weighing up the trade-off between tracking the chain of operations across a pipeline of rules vs the rule name growing unwieldy. Many of these best practices trace back to specific philosophies from Prometheus' predecessor.

If you still think it needs a polish, please file a separate doc bug.

@conallob
Author

Ping @juliusv, since @SuperQ is currently unavailable for life reasons

Comment thread docs/practices/naming.md Outdated
Co-authored-by: Ben Kochie <superq@gmail.com>
Signed-off-by: Conall O'Brien <conall@conall.net>
Comment thread docs/practices/naming.md Outdated
Co-authored-by: Ben Kochie <superq@gmail.com>
Signed-off-by: Conall O'Brien <conall@conall.net>
@conallob conallob requested a review from SuperQ August 15, 2025 15:54
Comment thread docs/practices/rules.md Outdated
Iterate on the description of the job label, removing "primary key", given its association with SQL

Signed-off-by: Conall O'Brien <conall@conall.net>
@conallob
Author

PTAL

@conallob
Author

Friendly ping?

@conallob conallob requested a review from SuperQ August 25, 2025 08:36
@conallob
Author

conallob commented Oct 6, 2025

Friendly, you're not at SRECon EMEA this week, ping?

Comment thread docs/practices/naming.md Outdated
Comment thread docs/practices/naming.md Outdated
Comment thread docs/practices/naming.md Outdated
Comment thread docs/practices/naming.md Outdated

* `job`
* The `job` label is one of the few ubiquitious labels, set at scrape time, and is used to identify metrics scraped from the same target/exporter.
* If not specified in PromQL expressions, they will match unrelated metrics with the same name. This is especially true in a multi system or multi tenant installation

Member

I'm not sure this is really a useful note here, as this applies to all label matching.

Suggested change
* If not specified in PromQL expressions, they will match unrelated metrics with the same name. This is especially true in a multi system or multi tenant installation

Author

It applies to all labels. But job and instance are two uniform labels found on every metric, including ubiquitous synthetic metrics such as up

Member

Yes, but it's not related to job, but related to "target labels" and discovery. That is a different thing and related to querying, not creating labels.

Comment thread docs/practices/naming.md Outdated
Co-authored-by: Ben Kochie <superq@gmail.com>
Signed-off-by: Conall O'Brien <conall@conall.net>
@conallob
Author

conallob commented Oct 7, 2025

For perspective, one of the motivations behind this PR is the anti-pattern of writing alert expressions intended for a single-tenant system that has since evolved into a multi-tenant system.

e.g. up{} for 5m without defining a job label works for one job.

Once you start adding additional jobs that match on the same labels (e.g. DaemonSets, fleet-wide node_exporter, etc.), teams start getting paged for systems they don't own or care about.
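
A minimal sketch of the difference, assuming a hypothetical job name `checkout-api` (not taken from this PR), with the `for: 5m` clause living in the alerting rule rather than in the expression:

```promql
# Unscoped: matches the up series of every job scraped by this Prometheus,
# so the alert fires for targets owned by any team.
up == 0

# Scoped: matches only targets of one job ("checkout-api" is a made-up name).
up{job="checkout-api"} == 0
```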

@jan--f jan--f added the kind/enhancement Improvements to existing documentation label Dec 5, 2025
@jan--f
Contributor

jan--f commented Mar 11, 2026

Hello from the bug scrub!
@conallob Looks like there is still some feedback to address, are you still working on this?

@conallob
Author

Coming back to this long-neglected PR: as far as I can tell, the main issue here is whether to put label best practices in the variable naming best practices or not. I only did that since there is currently no label best practices section.

Would creating a new section for label best practices help to unstick this?

Signed-off-by: Conall O'Brien <conall@conall.net>
Signed-off-by: Conall O'Brien <conall@conall.net>
Signed-off-by: Conall O'Brien <conall@conall.net>
Signed-off-by: Conall O'Brien <conall@conall.net>
… a primary key

Signed-off-by: Conall O'Brien <conall@conall.net>
Signed-off-by: Conall O'Brien <conall@conall.net>
Author

@conallob conallob left a comment

Reviving this PR, PTAL @SuperQ

  • I've split the label sections of naming.md out into a new labels.md, as discussed in OOB chat with @SuperQ
  • I've also reworded the line mentioning "primary key" to be about defining scope in PromQL expressions, which is the ultimate goal.
  • It looks like I need a maintainer to approve the Netlify CI checks though

Comment thread docs/practices/naming.md Outdated
Comment thread docs/practices/naming.md Outdated
Comment thread docs/practices/rules.md Outdated
Comment thread docs/practices/naming.md Outdated
@conallob
Author

@SuperQ Friendly ping?

Member

@krajorama krajorama left a comment

Thanks for working on this. I feel these additions are currently too specific for your use cases to be included in the generic documentation.

Comment thread docs/practices/labels.md Outdated
Comment thread docs/practices/labels.md Outdated
Comment thread docs/practices/labels.md Outdated

- `job`
- The `job` is a default target label set by the scrape configs and is used to identify metrics scraped from the same target/exporter.
- If not specified in PromQL expressions, they will match unrelated metrics with the same name. This is especially true in a multi system or multi tenant installation

Member

It might be the intention of the user to not distinguish between unrelated metrics, for example when aggregating across jobs. So I'd turn this into some positive statement instead (use "specify" instead of "not specified"). Multi system seems vague and I don't think multi tenancy has anything to do with this.

Suggested change
- If not specified in PromQL expressions, they will match unrelated metrics with the same name. This is especially true in a multi system or multi tenant installation
- If specified in PromQL expressions, they will match metrics scraped by the same job.

Author

> It might be the intention of the user to not distinguish between unrelated metrics, for example when aggregating across jobs. So I'd turn this into some positive statement instead (use "specify" instead of "not specified").

I agree that the user may want to aggregate across all, or a subset of, jobs. But to do so, they should be explicit with the job label values, and I expect unintended label stripping to vastly outnumber cross-job aggregation use cases.

> Multi system seems vague and I don't think multi tenancy has anything to do with this.

"Multi-tenant systems" may not be the best term, but I'm referring to a Prometheus run as a platform for multiple teams (e.g. by a DevEx or Platform Engineering team), so that not every team has to run its own siloed Prometheus stack. In such a setup, all PromQL expressions should be scoped with a job label, to ensure the metrics come from the expected exporters.

Or framed another way, in such a centralised stack, always write up{job="bla"}, never up{}
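
As a rough illustration of being explicit when aggregating across jobs, assuming a hypothetical counter `http_requests_total` and made-up job names:

```promql
# Explicit subset of jobs, with job kept in the result for attribution.
sum by (job) (rate(http_requests_total{job=~"checkout-api|checkout-worker"}[5m]))

# Deliberately all jobs: the matcher documents the intent, and job=~".+"
# only selects series that carry a non-empty job label.
sum by (job) (rate(http_requests_total{job=~".+"}[5m]))
```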

Comment thread docs/practices/labels.md Outdated
Comment thread docs/practices/labels.md Outdated
- The `job` is a default target label set by the scrape configs and is used to identify metrics scraped from the same target/exporter.
- If not specified in PromQL expressions, they will match unrelated metrics with the same name. This is especially true in a multi system or multi tenant installation

WARNING: When using `without`, be careful not to strip out the `job` label accidentally.

Member

It might be intentional, so I think this needs to be conditional.
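
For context, the behaviour the WARNING is about could be sketched like this, assuming a hypothetical counter `http_requests_total` exported by more than one job:

```promql
# Drops only instance: the result still has one series per job
# (and per any other remaining label).
sum without (instance) (rate(http_requests_total[5m]))

# Also drops job: series from different jobs are summed into the same result,
# which is fine when intended and surprising when it is not.
sum without (instance, job) (rate(http_requests_total[5m]))
```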

Comment thread docs/practices/labels.md

- `instance`
- The `instance` label will include the `ip:port` what was scraped

Member

Don't you need a similar warning for "instance", depending on usage?

Author

While instance is one of the standard scrape-time labels, like job, stripping it doesn't have the same blast radius. Stripping instance will make metrics harder to debug, but everything should still work.

For certain use cases that require multiple layers of rules (e.g. in a multi-region, multi-layered tree of Prometheus servers), you may want to strip out instance at the higher aggregation layers to manage label cardinality (e.g. instance labels make sense for per-region aggregation, but can be problematic when aggregated at the global level).

I've added a warning that stripping instance can make it harder to debug scrape-time issues with a metric, though.

Comment thread docs/practices/rules.md Outdated
Comment thread docs/practices/rules.md

NOTE: Omitting a label in a PromQL expression is the functional equivalent of specifying `label=*`

* In both recorded rules and alerting expressions, always specify a `job` label to prevent expression mismatches from occuring.

Member

I think the need to specify job is very circumstantial, so again I think it needs to be conditional on what you want to achieve. Also, "specify job" is very vague in itself.

Author

Can you elaborate on why job is circumstantial?

Afaik, job will always be set on metrics unless it is explicitly stripped away.
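
Note that `label=*` in the NOTE above is shorthand rather than PromQL syntax. A hedged sketch of what it corresponds to, using a regex matcher and a made-up job name:

```promql
# These two selectors return the same series: ".*" also matches the empty
# string, so it accepts series with any job value or with no job label at all.
up
up{job=~".*"}

# Scoped to a single job ("checkout-api" is a hypothetical name).
up{job="checkout-api"}
```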

Comment thread docs/practices/rules.md
NOTE: Omitting a label in a PromQL expression is the functional equivalent of specifying `label=*`

* In both recorded rules and alerting expressions, always specify a `job` label to prevent expression mismatches from occuring.
This is especially important in multi-tenant systems where the same metric names may be exported by different jobs or the

Member

I don't think multi-tenant has anything to do with job and instance labels.

Author

As above, "Multi-tenant systems" may not be the best term, but I'm referring to a Prometheus run as a platform for multiple teams (e.g. by a DevEx or Platform Engineering team), so that not every team has to run its own siloed Prometheus stack. In such a setup, all PromQL expressions should be scoped with a job label, to ensure the metrics come from the expected exporters.

Or framed another way, in such a centralised stack, always write up{job="bla"}, never up{}

Comment thread docs/practices/rules.md
conallob and others added 6 commits June 1, 2026 22:19
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
Signed-off-by: Conall O'Brien <conall@conall.net>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
Signed-off-by: Conall O'Brien <conall@conall.net>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
Signed-off-by: Conall O'Brien <conall@conall.net>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
Signed-off-by: Conall O'Brien <conall@conall.net>
…RNING to stripping instance

Signed-off-by: Conall O'Brien <conall@conall.net>
Signed-off-by: Conall O'Brien <conall@conall.net>
Labels

kind/enhancement Improvements to existing documentation

Development

Successfully merging this pull request may close these issues.

Best Practice Docs Don't Call Out the Importance of Job Label

6 participants