Skip to content

feat(tracing): enrich OTel resource, compress OTLP, annotate spans#5502

Open
martinconic wants to merge 6 commits into
feat/opentelemetry-migrationfrom
feat/otel-resource-enrichment
Open

feat(tracing): enrich OTel resource, compress OTLP, annotate spans#5502
martinconic wants to merge 6 commits into
feat/opentelemetry-migrationfrom
feat/otel-resource-enrichment

Conversation

@martinconic

@martinconic martinconic commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Follow-up to the OpenTelemetry migration (#5456) that addresses the reviewers' feedback. It targets feat/opentelemetry-migration rather than master, so the migration PR stays focused on the OpenTracing→OTel swap while the review changes are collected and reviewed here. Once merged, these commits become part of #5456.

Without these changes, every node's spans look identical in an OTLP backend and a request cannot be followed end to end. This PR makes spans attributable per node, network and release, trims export bandwidth, surfaces tracing problems, and lets a value such as the postage batch id follow an upload across hops.

Resource attributesservice.version, deployment.environment (derived from the network id), host.name, and service.instance.id (the node's overlay address). Env-based options (WithFromEnv/WithTelemetrySDK/WithHost) are applied before the configured attributes, so configured values win while in NewBee was moved below overlay finalization so the (immutable) resource can carry the final overlay as the instance id.

Exporter & processor — gzip compression on the HTTP and gRPC OTLP clients; the batch span processor keeps SDK defaults and honors the standard OTEL_BSP_* env vars (documented, no new flags).

Observability of tracing itself — a confirmation log line once tracing is wired up (endpoint, protocol, sampling ratio), and OTLP exporter errors (e.g. an unreachable collector) are routed through the OTel global error handler to the node logger instead of being silently dropped.

Span attributes & new spans — chunk address on the netstore get/put spans and peer_address on the pingpong spans; new spans on the previously untraced subsystems: postage-batch-create (with batch_id), kademlia-connect (with peer_address), and salud-round (parenting the per-peer status snapshots).

Baggage propagation — W3C baggage now propagates across HTTP (composite TraceContext + Baggage propagator) and p2p (a separate additive tracing-baggage header; the existing span-context carrier is unchanged and peers that don't understand the header simply ignore it). The API attaches the batch_id as baggage when stamping a chunk, so it follows the chunk into the direct-upload/pushsync path.

Flag rename (breaking) — the tracing flags drop the otlp infix: tracing-otlp-endpoint/-insecure/-ca-file/-protocoltracing-endpoint/-insecure/-ca-file/-protocol (and the nested tracing.otlp-* config keys likewise). tracing-endpoint is reused as the active OTLP collector endpoint, so it is removed from the deprecated no-op keys (tracing-host/tracing-port remain deprecated).

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Addresses both review passes from @darkobas2 on #5456 (resource enrichment / exporter hardening / span coverage / baggage, and the flag rename).

Related Issue (Optional)

Stacked on #5456.

Screenshots (if appropriate):

AI Disclosure

  • This PR contains code that has been generated by an LLM.
  • I have reviewed the AI generated code thoroughly.
  • I possess the technical expertise to responsibly review the code generated in this PR.

@darkobas2

Copy link
Copy Markdown
Contributor

🤖 AI-assisted review (Claude + Gemini second opinion) — +1, looks solid / ready.

Nicely addresses the resource-attribution and flag feedback from #5456. Spans are attributable per node now (service.instance.id=overlay, service.version, deployment.environment, host.name), OTLP export is gzipped, the startup "tracing enabled" log + exporter-error warnings are a good operability add, and the W3C baggage propagation (incl. the additive tracing-baggage p2p header) is a clean way to carry batch_id end-to-end. Everything is additive instrumentation — no control-flow/consensus paths touched, and nil-tracer paths stay noop-safe.

Two small things worth confirming (non-blocking):

  1. postage-batch-create starts its span from context.Background() and discards the returned ctx, so it's a detached root span and batch_id doesn't propagate into the rest of Create — intentional, or worth threading the caller ctx if a parent is available?
  2. Remote-peer baggage is parsed into the local context with parse errors ignored — fine while it only feeds span attributes, but worth a comment that baggage values must never become metric labels / unbounded log fields (cardinality + abuse vector from arbitrary peers).

+1

@martinconic

Copy link
Copy Markdown
Contributor Author

🤖 AI-assisted review (Claude + Gemini second opinion) — +1, looks solid / ready.

Nicely addresses the resource-attribution and flag feedback from #5456. Spans are attributable per node now (service.instance.id=overlay, service.version, deployment.environment, host.name), OTLP export is gzipped, the startup "tracing enabled" log + exporter-error warnings are a good operability add, and the W3C baggage propagation (incl. the additive tracing-baggage p2p header) is a clean way to carry batch_id end-to-end. Everything is additive instrumentation — no control-flow/consensus paths touched, and nil-tracer paths stay noop-safe.

Two small things worth confirming (non-blocking):

  1. postage-batch-create starts its span from context.Background() and discards the returned ctx, so it's a detached root span and batch_id doesn't propagate into the rest of Create — intentional, or worth threading the caller ctx if a parent is available?
  2. Remote-peer baggage is parsed into the local context with parse errors ignored — fine while it only feeds span attributes, but worth a comment that baggage values must never become metric labels / unbounded log fields (cardinality + abuse vector from arbitrary peers).

+1

Addressed with comments (no behavior change):

  1. postage-batch-create root span — intentional. Create is part of the EventUpdater interface and is driven by on-chain postage events, not a request, so there's no caller context with a meaningful parent to thread. Added a comment making the detached-root-span choice explicit.
  2. Untrusted remote baggage — agreed. Added a comment at the p2p extraction point: baggage from arbitrary peers must only ever feed span attributes, never metric labels or unbounded log fields.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants