Add Guidance wrt Labelling to Naming and Rules Best Practices #2691
conallob wants to merge 27 commits into
Signed-off-by: Conall O'Brien <conall@conall.net>
Signed-off-by: Conall O'Brien <conall@conall.net>
|
Obligatory post-it note reminder: https://photos.app.goo.gl/Bkfir4wRiLtNVG4W8 |
|
With my current patchy availability, there is little chance I get to this anytime soon. Maybe @juliusv has a qualified opinion here? |
|
Hey @conallob, congrats for the nice PR! Just one thing: maybe you could fix the
There is a small confusion I would love to see fixed, in a paragraph just below one of your edits. It is:
I don't understand the |
Fixed the typo.
I'm afraid that best practice is unrelated. It also makes sense as written, once you've written enough rules. It weighs up the trade-off between tracking the chain of operations across a pipeline of rules and the rule name growing unwieldy. Many of these best practices trace back to specific philosophies from Prometheus' predecessor. If you still think it needs a polish, please file a separate doc bug.
Co-authored-by: Ben Kochie <superq@gmail.com> Signed-off-by: Conall O'Brien <conall@conall.net>
Co-authored-by: Ben Kochie <superq@gmail.com> Signed-off-by: Conall O'Brien <conall@conall.net>
Iterate on the description of the job label, removing "primary key", given its association with SQL Signed-off-by: Conall O'Brien <conall@conall.net>
|
PTAL |
|
Friendly ping? |
|
Friendly, you're not at SRECon EMEA this week, ping? |
|
|
||
| * `job` | ||
| * The `job` label is one of the few ubiquitous labels, set at scrape time, and is used to identify metrics scraped from the same target/exporter. | ||
| * If not specified in PromQL expressions, they will match unrelated metrics with the same name. This is especially true in a multi system or multi tenant installation |
I'm not sure this is really a useful note here, as this applies to all label matching.
| * If not specified in PromQL expressions, they will match unrelated metrics with the same name. This is especially true in a multi system or multi tenant installation |
It applies to all labels. But job and instance are two uniform labels found on every metric, including ubiquitous synthetic metrics such as up
Yes, but it's not related to job, but related to "target labels" and discovery. That is a different thing and related to querying, not creating labels.
Co-authored-by: Ben Kochie <superq@gmail.com> Signed-off-by: Conall O'Brien <conall@conall.net>
|
For perspective, one of the motivations behind this PR is the anti-pattern of writing alert expressions intended for a single-tenant system that has since evolved into a multi-tenant system. E.g. once you start adding additional jobs that match on the same labels (e.g. DaemonSets, fleet-wide node_exporter, etc.), teams start getting paged for systems they don't own or care about.
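To make that failure mode concrete, here is a minimal PromQL sketch. The metric names are the standard node_exporter filesystem metrics; the job name `payments-node` is hypothetical and only stands in for whatever job a team actually owns.

```promql
# Unscoped: fires for every job exporting these metrics, including
# fleet-wide node_exporter targets owned by other teams.
node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10

# Scoped: only considers series scraped by the (hypothetical) job below.
node_filesystem_avail_bytes{job="payments-node"}
  / node_filesystem_size_bytes{job="payments-node"} < 0.10
```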
|
Hello from the bug scrub! |
|
Coming back to this long-neglected PR: as far as I can tell, the main issue here is whether to put label best practices in the variable naming best practices or not. I only did that since there is no label best practices section currently. Would creating a new section for label best practices help to unstick this?
Signed-off-by: Conall O'Brien <conall@conall.net>
Signed-off-by: Conall O'Brien <conall@conall.net>
Signed-off-by: Conall O'Brien <conall@conall.net>
Signed-off-by: Conall O'Brien <conall@conall.net>
… a primary key Signed-off-by: Conall O'Brien <conall@conall.net>
Signed-off-by: Conall O'Brien <conall@conall.net>
conallob left a comment
Reviving this PR, PTAL @SuperQ
- I've split the label sections of naming.md out into a new labels.md, as discussed in OOB chat with @SuperQ
- I've also reworded the line mentioning "primary key" to be about defining scope in PromQL expressions, which is the ultimate goal.
- It looks like I need a maintainer to approve the Netlify CI checks though
|
@SuperQ Friendly ping? |
krajorama left a comment
Thanks for working on this. I feel these additions are currently too specific for your use cases to be included in the generic documentation.
|
|
||
| - `job` | ||
| - The `job` is a default target label set by the scrape configs and is used to identify metrics scraped from the same target/exporter. | ||
| - If not specified in PromQL expressions, they will match unrelated metrics with the same name. This is especially true in a multi system or multi tenant installation |
It might be the intention of the user to not distinguish between unrelated metrics, for example when aggregating across jobs. So I'd turn this into some positive statement instead (use "specify" instead of "not specified"). Multi system seems vague and I don't think multi tenancy has anything to do with this.
| - If not specified in PromQL expressions, they will match unrelated metrics with the same name. This is especially true in a multi system or multi tenant installation | |
| - If specified in PromQL expressions, they will match metrics scraped by the same job. |
> It might be the intention of the user to not distinguish between unrelated metrics, for example when aggregating across jobs. So I'd turn this into some positive statement instead (use "specify" instead of "not specified").

I agree that the user may want to aggregate across all, or a subset of, jobs. But to do so, they should be explicit with the job label values. I expect unintended label stripping to vastly outnumber cross-job aggregation use cases.

> Multi system seems vague and I don't think multi tenancy has anything to do with this.

"Multi-tenant systems" may not be the best term, but I'm referring to a Prometheus run as a platform for multiple teams (e.g. by a DevEx or Platform Engineering team), to prevent every team running their own siloed Prometheus stack. In such a setup, all PromQL expressions should be scoped with a job label, to ensure the metrics are from the expected exporters.
Or framed another way, in such a centralised stack, always write `up{job="bla"}`, never `up{}`.
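A short sketch of the scoping described above; `bla`, `api` and `worker` are placeholder job names, not values taken from this PR.

```promql
# Unscoped: matches the `up` series of every target on the shared Prometheus.
up

# Scoped to a single job.
up{job="bla"}

# Deliberately spanning several jobs, with the values spelled out.
up{job=~"api|worker"}
```

The third form covers the cross-job aggregation case raised in review while still keeping the scope explicit.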
| - The `job` is a default target label set by the scrape configs and is used to identify metrics scraped from the same target/exporter. | ||
| - If not specified in PromQL expressions, they will match unrelated metrics with the same name. This is especially true in a multi system or multi tenant installation | ||
|
|
||
| WARNING: When using `without`, be careful not to strip out the `job` label accidentally. |
It might be intentional, so I think this needs to be conditional.
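For illustration, a sketch of how `without` and `by` treat the `job` label. `http_requests_total` is a hypothetical metric name used only for the example.

```promql
# `without` drops exactly the listed labels, so `job` survives here.
sum without (instance) (rate(http_requests_total[5m]))

# Listing `job` strips it: fine when intended, surprising when not.
sum without (instance, job) (rate(http_requests_total[5m]))

# `by` keeps only the listed labels, making the retained scope explicit.
sum by (job) (rate(http_requests_total[5m]))
```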
|
|
||
| - `instance` | ||
| - The `instance` label will include the `ip:port` that was scraped | ||
|
|
Don't you need a similar warning for "instance", depending on usage?
While instance is one of the standard scrape time labels, like job, stripping it doesn't have the same blast radius. Stripping instance will make metrics hard to debug, but expressions should still work.
For certain use cases that require multiple layers of rules (e.g. in a multi-region, multi-layered tree of Prometheus servers), you may want to strip out instance at the higher aggregation layers to manage label cardinality (e.g. instance labels make sense at the per-region aggregation level, but can be problematic if aggregated at the global level).
I've added a warning that stripping instance can make it harder to debug scrape time issues with a metric though.
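A rough sketch of that layering, assuming a hypothetical `http_requests_total` metric scraped by a hypothetical `api` job:

```promql
# Per-region layer: keep `instance` so a misbehaving target can still be identified.
rate(http_requests_total{job="api"}[5m])

# Global layer: drop `instance` to limit cardinality; `job` and any other labels remain.
sum without (instance) (rate(http_requests_total{job="api"}[5m]))
```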
|
|
||
| NOTE: Omitting a label in a PromQL expression is the functional equivalent of specifying `label=*` | ||
|
|
||
| * In both recording rules and alerting expressions, always specify a `job` label to prevent expression mismatches from occurring. |
I think the need to specify job is very circumstantial, so again I think it needs to be conditional on what you want to achieve. Also, "specify job" is very vague in itself.
Can you elaborate on why job is circumstantial?
Afaik, job will always be set on metrics unless it is explicitly stripped away.
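On the NOTE in the diff: `label=*` reads as shorthand rather than valid PromQL. Assuming the intent is a match-everything regex matcher, the equivalence can be sketched as follows.

```promql
# No `job` matcher: series match regardless of their `job` value.
up

# Selects the same series: the regex also matches an empty (absent) label.
up{job=~".*"}

# Narrower: only series whose `job` label is non-empty.
up{job=~".+"}
```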
| NOTE: Omitting a label in a PromQL expression is the functional equivalent of specifying `label=*` | ||
|
|
||
| * In both recording rules and alerting expressions, always specify a `job` label to prevent expression mismatches from occurring. | ||
| This is especially important in multi-tenant systems where the same metric names may be exported by different jobs or the |
I don't think multi-tenant has anything to do with job and instance labels.
As above, "multi-tenant systems" may not be the best term, but I'm referring to a Prometheus run as a platform for multiple teams (e.g. by a DevEx or Platform Engineering team), to prevent every team running their own siloed Prometheus stack. In such a setup, all PromQL expressions should be scoped with a job label, to ensure the metrics are from the expected exporters.
Or framed another way, in such a centralised stack, always write `up{job="bla"}`, never `up{}`.
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com> Signed-off-by: Conall O'Brien <conall@conall.net>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com> Signed-off-by: Conall O'Brien <conall@conall.net>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com> Signed-off-by: Conall O'Brien <conall@conall.net>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com> Signed-off-by: Conall O'Brien <conall@conall.net>
…RNING to stripping instance Signed-off-by: Conall O'Brien <conall@conall.net>
Signed-off-by: Conall O'Brien <conall@conall.net>
Add Guidance wrt Labelling to Naming and Rules Best Practices to docs/practices/naming.md and docs/practices/rules.md, specifically:
- `job` and `instance`
- `job` label, especially in multi-tenant systems

This Fixes #2690