feat(0026): enrichment CloudWatch metrics + one-shot drain mode #66
Merged
Option 2: worker-code observability + backfill drain (prepare-only):

- metrics: emit the four spec §5 metrics (`EnrichmentRowsEnriched`, `EnrichmentOracleMiss`, `EnrichmentRowsRemainingAtVolumeZero`, `EnrichmentBatchDurationMs`) via `aws_sdk_cloudwatch` under the `Prices/Enrichment` namespace. New `metrics.rs` keeps the stats→metric mapping a pure (unit-tested) function; the `put_metric_data` publish is `lambda`-gated and best-effort (a metric failure logs and never fails the pass). Extend `ChPassStats` with `oracle_misses` (remaining after the oracle tier) and `duration_ms` to back two of the metrics.
- one-shot: `MAX_BATCHES=0` drains the whole backlog in one invocation (spec §4). Both tier loops use `effective_max_batches()` (0 ⇒ unbounded); the existing no-progress/drained breaks still bound it. Verified by a new integration test.
- infra: grant `cloudwatch:PutMetricData` to the enrichment role (scoped to the `Prices/Enrichment` namespace), and add the `EnrichmentRowsRemainingAtVolumeZero` backlog alarm to observability-stack (scaffold threshold; 0056 tunes).

Verified: 25 unit + 2 e2e pass; 4 live-CH integration tests green incl. the one-shot drain; clippy/fmt clean (default + lambda); cargo-lambda bootstrap builds; infra lint/build/typecheck + `cdk synth` green (PutMetricData grant + alarm confirmed in the templates, namespace consistent across worker/IAM/alarm).
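The split between a pure stats→metric mapping and a best-effort publish can be sketched roughly as below. This is an illustration, not the contents of `metrics.rs`: the cut-down `ChPassStats`, the `MetricPoint` type, the `MetricSink` trait (a stand-in for the CloudWatch client) and the chosen units are all assumptions. Only the metric names, the namespace and the stats field names come from this PR.

```rust
/// Namespace named in the PR; everything else in this sketch is illustrative.
pub const NAMESPACE: &str = "Prices/Enrichment";

/// Hypothetical subset of `ChPassStats`: only the fields the metrics read.
#[derive(Debug, Clone, Default)]
pub struct ChPassStats {
    pub rows_enriched: u64,
    pub oracle_misses: u64,
    pub rows_remaining_at_volume_zero: u64,
    pub duration_ms: u64,
}

/// One datum, independent of any SDK type so the mapping stays pure.
#[derive(Debug, Clone, PartialEq)]
pub struct MetricPoint {
    pub name: &'static str,
    pub value: f64,
    pub unit: &'static str, // assumed units; the real worker may choose others
}

/// Pure stats -> metrics mapping: no I/O, so it can be unit-tested directly.
pub fn stats_to_metrics(stats: &ChPassStats) -> Vec<MetricPoint> {
    let count = |name, v: u64| MetricPoint { name, value: v as f64, unit: "Count" };
    vec![
        count("EnrichmentRowsEnriched", stats.rows_enriched),
        count("EnrichmentOracleMiss", stats.oracle_misses),
        count(
            "EnrichmentRowsRemainingAtVolumeZero",
            stats.rows_remaining_at_volume_zero,
        ),
        MetricPoint {
            name: "EnrichmentBatchDurationMs",
            value: stats.duration_ms as f64,
            unit: "Milliseconds",
        },
    ]
}

/// Hypothetical stand-in for the CloudWatch client used behind the `lambda` gate.
pub trait MetricSink {
    fn put(&mut self, namespace: &str, points: &[MetricPoint]) -> Result<(), String>;
}

/// Best-effort publish: a failure is logged and swallowed, never propagated,
/// so the enrichment pass cannot fail because of telemetry.
/// Returns whether the publish succeeded, purely for observability.
pub fn publish_best_effort(sink: &mut dyn MetricSink, stats: &ChPassStats) -> bool {
    match sink.put(NAMESPACE, &stats_to_metrics(stats)) {
        Ok(()) => true,
        Err(err) => {
            eprintln!("metric publish failed (ignored): {err}");
            false
        }
    }
}
```

Keeping the mapping free of SDK types means a test can assert on names and values without any AWS credentials; only the sink implementation needs the real client.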
From the PR #66 review (high-effort):

- #1 `EnrichmentOracleMiss` was inflated: `oracle_misses` was the whole remainder after Tier 1, so a budget-exhausted (un-drained) oracle tier reported every un-reached row as a miss. Now `oracle_misses = remaining` only when the tier drained (fixed point); 0 otherwise. Regression guard added to the budget-exhaustion integration test.
- #2 `EnrichmentRowsRemainingAtVolumeZero` counted `candidates_after`, which is `volume_quote_usd = 0 OR close_usd = 0`, not the volume-zero population the metric/alarm are named for. Add `count_remaining_at_volume_zero` (the `volume_quote_usd = 0` count) + a `ChPassStats.rows_remaining_at_volume_zero` field and map the metric to it. The alarm prose is now accurate.
- #3 one-shot was a `MAX_BATCHES=0` magic sentinel that (a) meant the opposite of the prototype CLI's `max_batches=0` (no-op) and (b) read as "off". Replace with an explicit `ChEnrichConfig.one_shot` flag (env `ENRICHMENT_ONE_SHOT`); `max_batches` keeps its literal meaning. CDK comments corrected: one-shot is a dedicated operator drain, never the hourly function (5-min timeout).

25 unit + 2 e2e + 4 live-CH integration tests green; clippy/fmt clean (default + lambda); infra lint/build/typecheck green.
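Fixes #1 and #3 interact in the tier loop, so a small sketch may help. It assumes a hypothetical `run_tier` driver and `TierOutcome` type, and a cut-down `ChEnrichConfig` holding only the two fields discussed here; the real loop talks to ClickHouse, which is replaced below by a closure returning `(rows enriched, rows remaining)`.

```rust
/// Hypothetical subset of `ChEnrichConfig`: just the two knobs discussed above.
#[derive(Debug, Clone, Copy)]
pub struct ChEnrichConfig {
    /// Literal batch budget; 0 means "run no batches".
    pub max_batches: u32,
    /// Explicit operator drain: ignore the budget, stop only at a fixed point.
    pub one_shot: bool,
}

/// Illustrative result of running one tier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TierOutcome {
    pub enriched: u64,
    pub remaining: u64,
    /// True when the tier reached a fixed point (nothing left, or no progress).
    pub drained: bool,
}

/// Runs batches until the tier drains or the budget is spent.
/// `run_batch` returns (rows enriched by this batch, rows still remaining).
pub fn run_tier<F>(cfg: ChEnrichConfig, initial_remaining: u64, mut run_batch: F) -> TierOutcome
where
    F: FnMut() -> (u64, u64),
{
    let mut enriched = 0u64;
    let mut remaining = initial_remaining;
    let mut batches = 0u32;
    let mut drained = remaining == 0;

    while !drained && (cfg.one_shot || batches < cfg.max_batches) {
        let (done, left) = run_batch();
        batches += 1;
        enriched += done;
        remaining = left;
        // The no-progress / drained breaks are what bound a one-shot run.
        if left == 0 || done == 0 {
            drained = true;
        }
    }

    TierOutcome { enriched, remaining, drained }
}

/// Only a drained tier has actually tried every row, so only then is the
/// remainder a genuine miss; a budget-exhausted tier reports 0.
pub fn oracle_misses(outcome: &TierOutcome) -> u64 {
    if outcome.drained { outcome.remaining } else { 0 }
}
```

With `max_batches: 1` and a backlog larger than one batch, the outcome is not drained and `oracle_misses` is 0; with `one_shot: true` the same backlog runs until a batch makes no progress, and whatever is left is reported as misses.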
Record PR #66 review findings #5 (the `EnrichmentRowsRemainingAtVolumeZero` backlog alarm latches on the permanent exotic-quote floor and storms during a legitimate catch-up) and #7 (`EnrichmentBatchDurationMs` is whole-pass wall-clock, not per-batch) in task 0056's notes, since task 0056 owns the enrichment dashboard + alarm tuning. 0026 left the alarm as an explicit scaffold for this reason.
Two PR #66 review cleanups:

- #8: `ChPassStats` derives `Serialize`; the Lambda handler returns `serde_json::to_value(&stats)?` instead of a hand-built `json!` that re-listed every field (a third place to forget when a stat is added).
- #9: hoist the optional-with-default env helpers into `prices_clickhouse::env` (`env_or` / `env_parse_or`), companion to `mtls::require_env`. enrichment-worker (main + example), oracle-worker, supply-worker, and asset-discovery now share them instead of hand-rolling `std::env::var(..).unwrap_or_else(..)` per crate.

Default + lambda builds clean across all touched crates; 25 unit + 2 e2e + 4 live-CH integration tests green; clippy/fmt clean.
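For readers who have not opened `prices_clickhouse::env`, the shared helpers from #9 might look roughly like this. The signatures are assumptions (the PR only names `env_or` and `env_parse_or`), `parse_or` is a hypothetical helper split out to keep the parsing testable without touching the process environment, and falling back to the default on an unparsable value is one possible policy, not necessarily the crate's.

```rust
use std::env;
use std::str::FromStr;

/// Returns the variable's value, or `default` when it is unset or not valid Unicode.
pub fn env_or(key: &str, default: &str) -> String {
    env::var(key).unwrap_or_else(|_| default.to_string())
}

/// Hypothetical pure core: parse `raw` if present and valid, else use `default`.
/// Falling back on a parse failure is an assumption of this sketch; a stricter
/// helper could return an error instead.
pub fn parse_or<T: FromStr>(raw: Option<&str>, default: T) -> T {
    raw.and_then(|s| s.trim().parse::<T>().ok())
        .unwrap_or(default)
}

/// Reads `key` and parses it into `T`, falling back to `default`.
pub fn env_parse_or<T: FromStr>(key: &str, default: T) -> T {
    parse_or(env::var(key).ok().as_deref(), default)
}
```

A call site then reduces to a single line such as `let one_shot: bool = env_parse_or("ENRICHMENT_ONE_SHOT", false);`, in place of a per-crate `std::env::var(..).unwrap_or_else(..)` chain.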
Summary
Option 2 of the 0026 remaining-work split — worker-code observability + historical backfill drain. Prepare-only, no deploy. Stacked on #65 (merged).
- `metrics.rs` publishes the four spec §5 metrics (`EnrichmentRowsEnriched`, `EnrichmentOracleMiss`, `EnrichmentRowsRemainingAtVolumeZero`, `EnrichmentBatchDurationMs`) via `aws_sdk_cloudwatch` under the `Prices/Enrichment` namespace. The stats→metric mapping is a pure, unit-tested function; the `put_metric_data` publish is `lambda`-gated and best-effort (a metric failure logs and never fails the pass). `ChPassStats` gains `oracle_misses` (remaining after the oracle tier) and `duration_ms`.
- `MAX_BATCHES=0` drains the whole backlog in a single invocation (spec §4). Both tier loops use `effective_max_batches()` (0 ⇒ unbounded); the existing no-progress/drained breaks still bound it. Covered by a new integration test, `one_shot_drains_full_backlog`.
- Grants `cloudwatch:PutMetricData` to the enrichment role (scoped to the `Prices/Enrichment` namespace), and adds the `EnrichmentRowsRemainingAtVolumeZero` backlog alarm to observability-stack (scaffold threshold; task 0056 tunes + owns the dashboard widgets).

Verification (local)

lambda); `cargo lambda build` bootstrap builds with the AWS SDK. `cdk synth` green: the `PutMetricData` grant and the alarm are confirmed in the synthesized templates, and the `Prices/Enrichment` namespace is consistent across worker, IAM condition, and alarm.

Scope / deferred
Closes the metric-emission half of the telemetry ACs and lands the one-shot mode. Remaining (Option 3, deploy-gated): actual `cdk deploy`, live dashboard visibility (task 0056), and the post-backfill credibility check (≥3 XLM-quoted assets vs Horizon). Task 0026 stays `active`.