
feat(0026): enrichment CloudWatch metrics + one-shot drain mode #66

Merged
karczuRF merged 5 commits into develop from feat/0026_enrichment-cloudwatch-and-oneshot
Jun 29, 2026

Conversation

@karczuRF

Collaborator

Summary

Option 2 of the 0026 remaining-work split — worker-code observability + historical backfill drain. Prepare-only, no deploy. Stacked on #65 (merged).

  • Metrics: new metrics.rs publishes the four spec §5 metrics — EnrichmentRowsEnriched, EnrichmentOracleMiss, EnrichmentRowsRemainingAtVolumeZero, EnrichmentBatchDurationMs — via aws_sdk_cloudwatch under the Prices/Enrichment namespace. The stats→metric mapping is a pure, unit-tested function; the put_metric_data publish is lambda-gated and best-effort (a metric failure logs and never fails the pass). ChPassStats gains oracle_misses (remaining after the oracle tier) and duration_ms.
  • One-shot mode: MAX_BATCHES=0 drains the whole backlog in a single invocation (spec §4) — both tier loops use effective_max_batches() (0 ⇒ unbounded); the existing no-progress/drained breaks still bound it. Covered by a new integration test one_shot_drains_full_backlog.
  • Infra: grants cloudwatch:PutMetricData to the enrichment role (scoped to the Prices/Enrichment namespace), and adds the EnrichmentRowsRemainingAtVolumeZero backlog alarm to observability-stack (scaffold threshold; task 0056 tunes + owns the dashboard widgets).
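
The pure stats→metric mapping and the best-effort publish described above could look roughly like the sketch below. It is an illustration only, not the contents of `metrics.rs`: the `rows_enriched` field name, the `MetricPoint` type, the unit strings, and the injected `publish` closure (a stand-in for the real `put_metric_data` call) are all assumptions.

```rust
/// Hypothetical subset of the pass statistics; only `oracle_misses`,
/// `duration_ms` and `rows_remaining_at_volume_zero` are named in the PR,
/// `rows_enriched` is an assumed field name.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ChPassStats {
    pub rows_enriched: u64,
    pub oracle_misses: u64,
    pub rows_remaining_at_volume_zero: u64,
    pub duration_ms: u64,
}

/// Namespace shared by the worker, the IAM condition and the alarm.
pub const NAMESPACE: &str = "Prices/Enrichment";

/// Plain stand-in for a CloudWatch datum (assumed type, not the SDK's).
#[derive(Debug, Clone, PartialEq)]
pub struct MetricPoint {
    pub name: &'static str,
    pub value: f64,
    pub unit: &'static str,
}

/// Pure mapping: no I/O, so it can be unit-tested without AWS.
pub fn stats_to_metrics(stats: &ChPassStats) -> Vec<MetricPoint> {
    vec![
        MetricPoint { name: "EnrichmentRowsEnriched", value: stats.rows_enriched as f64, unit: "Count" },
        MetricPoint { name: "EnrichmentOracleMiss", value: stats.oracle_misses as f64, unit: "Count" },
        MetricPoint {
            name: "EnrichmentRowsRemainingAtVolumeZero",
            value: stats.rows_remaining_at_volume_zero as f64,
            unit: "Count",
        },
        MetricPoint { name: "EnrichmentBatchDurationMs", value: stats.duration_ms as f64, unit: "Milliseconds" },
    ]
}

/// Best-effort publish: a failure is logged and reported as `false`,
/// but never propagated, so it cannot fail the enrichment pass.
pub fn publish_best_effort<F>(stats: &ChPassStats, publish: F) -> bool
where
    F: FnOnce(&str, &[MetricPoint]) -> Result<(), String>,
{
    let points = stats_to_metrics(stats);
    match publish(NAMESPACE, &points) {
        Ok(()) => true,
        Err(e) => {
            eprintln!("metric publish failed (ignored): {}", e);
            false
        }
    }
}
```

Keeping the mapping separate from the publish call is what lets the default (non-`lambda`) build test it without pulling in the AWS SDK.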

Verification (local)

  • 25 unit + 2 e2e + 4 live-CH integration tests pass (incl. the one-shot drain), against prod-pinned CH 26.3.10.60.
  • clippy/fmt clean (default + lambda); cargo lambda build bootstrap builds with the AWS SDK.
  • infra lint/build/typecheck + cdk synth green — the PutMetricData grant and the alarm are confirmed in the synthesized templates, and the Prices/Enrichment namespace is consistent across worker, IAM condition, and alarm.

Scope / deferred

Closes the metric-emission half of the telemetry ACs and lands the one-shot mode. Remaining (Option 3, deploy-gated): actual cdk deploy, live dashboard visibility (task 0056), and the post-backfill credibility check (≥3 XLM-quoted assets vs Horizon). Task 0026 stays active.

karczuRF added 5 commits June 29, 2026 16:32
Option 2 — worker-code observability + backfill drain (prepare-only):

- metrics: emit the four spec §5 metrics (EnrichmentRowsEnriched,
  EnrichmentOracleMiss, EnrichmentRowsRemainingAtVolumeZero,
  EnrichmentBatchDurationMs) via aws_sdk_cloudwatch under the
  `Prices/Enrichment` namespace. New `metrics.rs` keeps the stats→metric mapping
  a pure (unit-tested) function; the `put_metric_data` publish is `lambda`-gated
  and best-effort (a metric failure logs and never fails the pass). Extend
  ChPassStats with `oracle_misses` (remaining after the oracle tier) and
  `duration_ms` to back two of the metrics.
- one-shot: `MAX_BATCHES=0` drains the whole backlog in one invocation (spec §4)
  — both tier loops use `effective_max_batches()` (0 ⇒ unbounded); the existing
  no-progress/drained breaks still bound it. Verified by a new integration test.
- infra: grant `cloudwatch:PutMetricData` to the enrichment role (scoped to the
  `Prices/Enrichment` namespace), and add the `EnrichmentRowsRemainingAtVolumeZero`
  backlog alarm to observability-stack (scaffold threshold; 0056 tunes).

Verified: 25 unit + 2 e2e pass; 4 live-CH integration tests green incl. the
one-shot drain; clippy/fmt clean (default + lambda); cargo-lambda bootstrap
builds; infra lint/build/typecheck + `cdk synth` green (PutMetricData grant +
alarm confirmed in the templates, namespace consistent across worker/IAM/alarm).
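
As a rough illustration of the `MAX_BATCHES=0` sentinel in this commit, the sketch below shows a tier loop whose cap comes from `effective_max_batches()`. The function signature, the `run_tier` helper, and the use of "zero rows updated" as the single no-progress/drained signal are assumptions made for the example; the worker's actual loop may distinguish those two breaks.

```rust
/// `0` is treated as the one-shot sentinel: no cap on batch count.
/// (Assumed signature; the PR only names the function.)
pub fn effective_max_batches(max_batches: usize) -> usize {
    if max_batches == 0 { usize::MAX } else { max_batches }
}

/// Hypothetical tier loop. `run_batch` returns the number of rows it
/// updated. Returns (total rows updated, whether the tier drained).
pub fn run_tier<F>(max_batches: usize, mut run_batch: F) -> (u64, bool)
where
    F: FnMut() -> u64,
{
    let cap = effective_max_batches(max_batches);
    let mut total = 0u64;
    let mut batches = 0usize;
    while batches < cap {
        let updated = run_batch();
        batches += 1;
        if updated == 0 {
            // No progress: the tier has nothing left it can enrich.
            // This break is what bounds the "unbounded" one-shot mode.
            return (total, true);
        }
        total += updated;
    }
    // Batch budget exhausted before the tier drained.
    (total, false)
}
```

With a backlog of 25 rows and batches of 10, a cap of 2 stops after 20 rows without draining, while the sentinel `0` keeps going until a batch updates nothing.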

From the PR #66 review (high-effort):

- #1 EnrichmentOracleMiss was inflated: `oracle_misses` was the whole
  remainder after Tier 1, so a budget-exhausted (un-drained) oracle tier
  reported every un-reached row as a miss. Now `oracle_misses = remaining`
  only when the tier drained (fixed point); 0 otherwise. Regression guard
  added to the budget-exhaustion integration test.
- #2 EnrichmentRowsRemainingAtVolumeZero counted candidates_after, which is
  `volume_quote_usd = 0 OR close_usd = 0` — not the volume-zero population the
  metric/alarm are named for. Add `count_remaining_at_volume_zero` (the
  `volume_quote_usd = 0` count) + a `ChPassStats.rows_remaining_at_volume_zero`
  field and map the metric to it. The alarm prose is now accurate.
- #3 one-shot was a `MAX_BATCHES=0` magic sentinel that (a) meant the opposite
  of the prototype CLI's `max_batches=0` (no-op) and (b) read as "off". Replace
  with an explicit `ChEnrichConfig.one_shot` flag (env `ENRICHMENT_ONE_SHOT`);
  `max_batches` keeps its literal meaning. CDK comments corrected — one-shot is
  a dedicated operator drain, never the hourly function (5-min timeout).
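
The three fixes above can be sketched in a few lines. This is a hedged, in-memory illustration: `batch_cap`, `oracle_misses` as a free function, the `Row` type, and the two counting helpers are hypothetical stand-ins (the real counts presumably come from ClickHouse queries), and only `ChEnrichConfig.one_shot` and `max_batches` are names taken from the commit message.

```rust
/// Config with an explicit one-shot flag instead of a magic `0`.
#[derive(Debug, Clone)]
pub struct ChEnrichConfig {
    pub max_batches: usize,
    pub one_shot: bool,
}

impl ChEnrichConfig {
    /// `None` means unbounded (one-shot drain). Otherwise `max_batches`
    /// keeps its literal meaning, so `0` is a no-op. (Hypothetical helper.)
    pub fn batch_cap(&self) -> Option<usize> {
        if self.one_shot { None } else { Some(self.max_batches) }
    }
}

/// Fix #1: rows left over count as oracle misses only when the oracle
/// tier actually drained; a budget-exhausted tier reports 0.
pub fn oracle_misses(tier_drained: bool, remaining: u64) -> u64 {
    if tier_drained { remaining } else { 0 }
}

/// In-memory stand-in for a price row.
pub struct Row {
    pub volume_quote_usd: f64,
    pub close_usd: f64,
}

/// The broader candidate predicate: either column still zero.
pub fn candidate_count(rows: &[Row]) -> usize {
    rows.iter()
        .filter(|r| r.volume_quote_usd == 0.0 || r.close_usd == 0.0)
        .count()
}

/// Fix #2: the population the metric and alarm are named for.
pub fn volume_zero_count(rows: &[Row]) -> usize {
    rows.iter().filter(|r| r.volume_quote_usd == 0.0).count()
}
```

A row with non-zero volume but `close_usd = 0` is a candidate yet not part of the volume-zero backlog, which is exactly the gap finding #2 describes.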

25 unit + 2 e2e + 4 live-CH integration tests green; clippy/fmt clean (default
+ lambda); infra lint/build/typecheck green.

Record two PR #66 review findings in task 0056's notes, since that task owns
the enrichment dashboard + alarm tuning:

- #5: the EnrichmentRowsRemainingAtVolumeZero backlog alarm latches on the
  permanent exotic-quote floor and storms during a legitimate catch-up.
- #7: EnrichmentBatchDurationMs is whole-pass wall-clock, not per-batch.

0026 left the alarm as an explicit scaffold for this reason.

Two PR #66 review cleanups:

- #8: `ChPassStats` derives `Serialize`; the Lambda handler returns
  `serde_json::to_value(&stats)?` instead of a hand-built `json!` that
  re-listed every field (a third place to forget when a stat is added).
- #9: hoist the optional-with-default env helpers into
  `prices_clickhouse::env` (`env_or` / `env_parse_or`), companion to
  `mtls::require_env`. enrichment-worker (main + example), oracle-worker,
  supply-worker, and asset-discovery now share them instead of hand-rolling
  `std::env::var(..).unwrap_or_else(..)` per crate.

Default + lambda builds clean across all touched crates; 25 unit + 2 e2e + 4
live-CH integration tests green; clippy/fmt clean.
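
For cleanup #9, shared optional-with-default helpers might look like the sketch below. The signatures are guesses: the commit message names `env_or` and `env_parse_or` but does not show them, and falling back to the default on a parse failure (rather than returning an error) is an assumption of this example.

```rust
use std::env;
use std::str::FromStr;

/// Read `key`, falling back to `default` when it is unset or not
/// valid Unicode. (Assumed signature.)
pub fn env_or(key: &str, default: &str) -> String {
    env::var(key).unwrap_or_else(|_| default.to_string())
}

/// Read and parse `key`, falling back to `default` when it is unset
/// or fails to parse. (Assumed signature and fallback behavior.)
pub fn env_parse_or<T: FromStr>(key: &str, default: T) -> T {
    env::var(key)
        .ok()
        .and_then(|raw| raw.trim().parse().ok())
        .unwrap_or(default)
}
```

Under these assumptions a flag such as `ENRICHMENT_ONE_SHOT` could be read with `env_parse_or("ENRICHMENT_ONE_SHOT", false)`, which accepts the literal strings `true` and `false`.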
@karczuRF karczuRF merged commit ccde1a3 into develop Jun 29, 2026
3 checks passed
@karczuRF karczuRF deleted the feat/0026_enrichment-cloudwatch-and-oneshot branch June 29, 2026 15:27
