Skip to content

feat: osv advisories ingestion#4149

Open
joanreyero wants to merge 7 commits into
mainfrom
feat/osv-advisories
Open

feat: osv advisories ingestion#4149
joanreyero wants to merge 7 commits into
mainfrom
feat/osv-advisories

Conversation

@joanreyero
Copy link
Copy Markdown
Contributor

@joanreyero joanreyero commented May 27, 2026

Summary

Adds the osv-sync sub-worker to packages_worker — ingests OSV advisories for npm and Maven, normalizes them into advisories + advisory_packages + advisory_affected_ranges, and derives packages.has_critical_vulnerability from the ecosystem-specific range comparator. Engineer-5 slice of the Tier 2 / Project Osprey sprint.

Verified end-to-end against the live OSV bucket: 226,258 advisories ingested (213,414 MAL-, 10,365 V3-scored, 1,387 qualitative-fallback, 1,092 NULL). Spot-checks land: log4shell at CVSS=10.0, lodash CVE-2021-23337 at 7.2.

What's in this slice

Schema (already in c25111e5d):

  • advisories.cvss_source — provenance for the numeric score (osv_cvss_v3 / osv_cvss_v4 / osv_qualitative_fallback / osv_malicious_package).
  • packages.has_critical_vulnerability — boolean with partial index on TRUE.

Worker (services/apps/packages_worker/src/osv/):

  • fetchEcosystemZip.ts — streamed download + unzipper iteration (memory-bounded for the 196 MB npm zip).
  • cvssScoring.ts — inline CVSS v3.1 base-score from the FIRST spec.
  • extractSeverity.ts — MAL- short-circuit, V3 path, qualitative fallback.
  • parseOsvRecord.ts — Maven groupId:artifactId split, npm @scope/ split, ecosystem allowlist, range flattening.
  • upsertAdvisory.ts — transactional UPSERT batches, dedupe on the range unique index.
  • versionCompare.ts — semver for npm + inline ComparableVersion-style for Maven.
  • deriveCriticalFlag.ts — paged derivation, MAL- override, catch-up package_id resolver.
  • index.ts — per-ecosystem pass, derive, sleep loop, shutdown-aware.

Docker compose service: scripts/services/osv-sync.yaml.

ADRs

  • ADR-0003has_critical_vulnerability semantics: option (b), TRUE iff latest_version is inside a critical advisory's affected range, plus a MAL- override.
  • ADR-0004 — standalone bin vs Temporal for batch sub-workers in packages_worker. OSV uses standalone; the forthcoming npm package sync will use Temporal.
  • ADR-0005 — CVSS scoring: inline v3.1 from the FIRST spec; v4 numeric scoring deferred (V4-only records fall back to qualitative).

Testing

68 vitest tests across 5 files (pnpm test in services/apps/packages_worker):

  • cvssScoring.test.ts (10) — FIRST reference vectors + malformed input.
  • extractSeverity.test.ts (8) — MAL-, V3, V4-only fall through, qualitative, empty.
  • parseOsvRecord.test.ts (13) — name splits, allowlist filter, range flattening, GIT-skip, multi-affected coalescing.
  • versionCompare.test.ts (30) — npm semver + Maven qualifier ranks/aliases/edge cases.
  • deriveCriticalFlag.integration.test.ts (7) — hits the local packages-db (skipped automatically without DB env): lodash boundaries, log4j-core boundaries, MAL- override, 1.0-final regression guard, catch-up resolver.

The Maven version comparator unit tests caught a real bug — compareMaven was using num:0 padding for missing tokens regardless of the other side's kind, which mis-ordered 1.0-final vs 1.0 and 1.0 vs 1.0-sp1. Fixed in the same PR.

Also verified out-of-band:

  • Idempotency: rerunning OSV sync leaves row counts and the md5 hash of (osv_id, cvss, cvss_source) bit-identical.
  • SIGINT mid-pass: handler runs, current batch flushes, derive + sleep skip, process exits 0.

Known gaps (follow-ups, not blockers)

  • CVSS v4 numeric scoring is deferred (ADR-0005). ~1.1% of advisories land with cvss = NULL because they are V4-only with no qualitative tag. The fix is local to cvssScoring.ts + extractSeverity.ts.
  • Docker-compose deploy path (./scripts/cli service osv-sync up) hasn't been smoke-tested locally — the worker was run via tsx directly. Same code, same env vars, same DB.
  • Network failure retry path (withRetry in index.ts) didn't fire in the live runs because no errors hit. Logic is straightforward — three retries with exponential backoff — but unexercised.

Heads-up for the reviewer

  • No JIRA key. Per osv-plan.md §8 item 4, this branch was opened without a CM-XXX ticket. The PR title lint at .github/workflows/pr-title-jira-key-lint.yml may reject; if so, happy to either add a ticket or get the lint relaxed for this branch.
  • Three commits, kept separate intentionally: the schema migration was already on main-bound earlier in the slice; the implementation, the tests, and the ADRs are split for review clarity. Diff is ~2,000 LOC across implementation + tests + docs.

Test plan

  • CI: lint, prettier, tsc all green in services/apps/packages_worker.
  • CI: pnpm test in services/apps/packages_worker passes the 5 unit-test files (integration suite skipped without DB env, as expected).
  • Local: ./scripts/cli service osv-sync up builds the image and reaches OSV sync done for npm.
  • Local: after a sync pass, SELECT COUNT(*) FROM advisories; shows ≈226k.
  • Local: a spot-check query for a known critical CVE (e.g. log4shell) shows it in advisories with is_critical = TRUE.

🤖 Generated with Claude Code


Note

High Risk
Security-sensitive ingestion and denormalized vulnerability flags depend on CVSS scoring, version comparators, and range-index correctness; bugs can mis-rank critical exposure across large package sets.

Overview
Adds an osv-sync sub-worker to packages_worker that periodically pulls OSV bulk zips for npm and Maven, normalizes records, upserts into packages-db, and recomputes packages.has_critical_vulnerability from ecosystem version comparators (semver + Maven-style), including MAL-* handling.

Schema: advisories.cvss_source; denormalized has_critical_vulnerability with a partial index; migration widening advisory_affected_ranges uniqueness to the full (introduced, fixed, last_affected) tuple so cross-distro ranges are not dropped.

Worker: bin/osv-sync.ts + src/osv/* (zip download/stream, inline CVSS v3.1, severity extraction, parse/upsert batches, deriveCriticalFlag); getOsvConfig and env samples; Docker compose osv-sync.yaml; semver, unzipper, vitest in package.json.

Docs: ADRs 0003–0006 (critical-flag semantics, standalone bin vs Temporal, CVSS strategy, range dedupe/index) and index update.

Reviewed by Cursor Bugbot for commit d2e92b5. Bugbot is set up for automated code reviews on this repo. Configure here.

Signed-off-by: Joan Reyero <joan@reyero.io>
Adds the osv-sync sub-worker inside packages_worker. Pulls OSV's daily per-ecosystem
zip for npm and Maven, normalizes each record, and upserts into advisories,
advisory_packages, and advisory_affected_ranges (transactional UPSERT, idempotent
on osv_id + the range unique index). MAL- malicious-package reports are ingested
with cvss=NULL and cvss_source='osv_malicious_package'.

A deriveCriticalFlag step runs at the end of each pass and flips
packages.has_critical_vulnerability TRUE iff a critical advisory
(cvss>=7.0 OR osv_id LIKE 'MAL-%') has an affected range covering
the package's current latest_version, using ecosystem-specific
comparators (semver for npm, ComparableVersion-style for Maven).
See ADR-0003 for the semantics.

CVSS scoring computes v3.1 base scores inline from the FIRST spec;
v4 numeric scoring is deferred (V4-only records fall back to the
qualitative tag from database_specific.severity).

Verified locally against the full OSV dataset (226,258 advisories;
log4shell CVSS=10.0, lodash CVE-2021-23337 CVSS=7.2; 213,414 MAL-
entries ingested).

Signed-off-by: Joan Reyero <joan@reyero.io>
Adds vitest with five test files covering the OSV pipeline:

- cvssScoring: 10 cases pinning the inline v3.1 implementation against
  FIRST-published scores (log4shell 10.0, shellshock 9.8, heartbleed 7.5,
  and others). Catches future regressions in the formula.
- extractSeverity: MAL- short-circuit, V3 vector path, V4-only fall through,
  qualitative fallback, malformed-vector handling.
- parseOsvRecord: Maven groupId:artifactId split, npm @scope/ split,
  ecosystem allowlist filter, range flattening (introduced -> fixed,
  introduced -> last_affected, MAL- always-vulnerable, GIT skipped),
  multi-affected[] coalescing.
- versionCompare: npm semver ordering + coercion; Maven dotted versions,
  qualifier ranks (alpha < beta < milestone < rc < snapshot < ga/final < sp),
  qualifier aliases, numeric > alpha at same depth, cross-ecosystem null.
- deriveCriticalFlag (integration, real packages-db, skipped without DB env):
  lodash 4.17.20 flips TRUE / 5.0.0 clears, log4j-core 2.14.1 flips TRUE /
  2.17.0 clears, MAL- target flips via the osv_id LIKE prefix override,
  catch-up resolver populates advisory_packages.package_id for late-arriving
  packages, and a regression guard around the Maven 1.0-final edge case.

The versionCompare suite caught a real bug: compareMaven used a num:0 pad
for missing tokens in both kinds of comparison. That made 1.0-final < 1.0
(should be equal: 'final' is an alias for the empty 'ga' qualifier) and
1.0 > 1.0-sp1 (should be less: 'sp' outranks 'ga'). Fixed by picking the
pad type based on the other side's kind (num:0 vs str:''), matching the
Maven ComparableVersion algorithm.

Also verified out-of-band (not in suite):
- Idempotency: rerunning OSV sync leaves advisories, advisory_packages,
  advisory_affected_ranges row counts and the md5 hash of
  (osv_id, cvss, cvss_source) bit-identical.
- SIGINT mid-pass: shutdown handler runs, current batch flushes,
  derive + sleep skip, process exits 0.

68 tests / 5 files pass; lint + prettier + tsc clean.

Signed-off-by: Joan Reyero <joan@reyero.io>
ADR-0004 captures the standalone-bin vs Temporal decision for batch
sub-workers in packages_worker (OSV uses standalone; npm package sync
will use Temporal). ADR-0005 captures the CVSS scoring strategy
(inline v3.1 from the FIRST spec, V4 numeric scoring deferred to a
follow-up, qualitative fallback in the meantime).

Both record the alternatives that were considered and rejected so the
next engineer touching these areas has the rationale in one place.

Signed-off-by: Joan Reyero <joan@reyero.io>
Copilot AI review requested due to automatic review settings May 27, 2026 14:51
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 27, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions
Copy link
Copy Markdown
Contributor

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

  • feat: add user authentication (CM-123)
  • feat: add user authentication (IN-123)

Projects:

  • CM: Community Data Platform
  • IN: Insights

Please add a Jira issue key to your PR title.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an osv-sync sub-worker to packages_worker that ingests OSV bulk advisories (npm + Maven), normalizes them into packages-db advisory tables, and derives the denormalized packages.has_critical_vulnerability flag based on ecosystem-specific version comparisons and scored severity.

Changes:

  • Introduces OSV ingestion pipeline (download/parse/score/upsert) and a post-pass derivation step for has_critical_vulnerability.
  • Adds a Maven ComparableVersion-style comparator and npm semver comparator plus unit/integration tests (Vitest).
  • Adds docs (ADRs) and local/docker service wiring for running osv-sync.

Reviewed changes

Copilot reviewed 24 out of 27 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
services/apps/packages_worker/vitest.config.ts Adds Vitest configuration for the packages_worker test suite.
services/apps/packages_worker/src/osv/versionCompare.ts Implements ecosystem-specific version comparison (npm semver + Maven-like comparator).
services/apps/packages_worker/src/osv/upsertAdvisory.ts Writes normalized OSV advisories/packages/ranges to packages-db via transactional upserts.
services/apps/packages_worker/src/osv/types.ts Defines OSV raw/normalized types and a FetchError for ingestion.
services/apps/packages_worker/src/osv/index.ts Orchestrates per-ecosystem sync loop with retries and post-pass critical-flag derivation.
services/apps/packages_worker/src/osv/fetchEcosystemZip.ts Streams OSV zip download to disk and iterates JSON entries for parsing.
services/apps/packages_worker/src/osv/extractSeverity.ts Extracts/seeds severity and CVSS score from OSV records (MAL-/V3/qualitative fallback).
services/apps/packages_worker/src/osv/deriveCriticalFlag.ts Recomputes packages.has_critical_vulnerability by checking latest_version against critical ranges.
services/apps/packages_worker/src/osv/cvssScoring.ts Implements inline CVSS v3.1 base-score calculation.
services/apps/packages_worker/src/osv/tests/versionCompare.test.ts Unit tests for npm and Maven version ordering.
services/apps/packages_worker/src/osv/tests/parseOsvRecord.test.ts Unit tests for OSV record parsing behaviors (name splitting, allowlist, range flattening).
services/apps/packages_worker/src/osv/tests/extractSeverity.test.ts Unit tests for severity extraction paths (MAL-, V3, V4-only qualitative fallback).
services/apps/packages_worker/src/osv/tests/deriveCriticalFlag.integration.test.ts DB-backed integration tests for end-to-end derivation behavior (skipped without env).
services/apps/packages_worker/src/osv/tests/cvssScoring.test.ts Reference-vector tests to pin CVSS scoring implementation.
services/apps/packages_worker/src/config.ts Adds OSV-specific worker config sourced from env vars.
services/apps/packages_worker/src/bin/osv-sync.ts Adds standalone entrypoint binary for the OSV sync worker with shutdown handling.
services/apps/packages_worker/package.json Adds scripts and dependencies/devDependencies for osv-sync + tests.
scripts/services/osv-sync.yaml Adds docker-compose service definition for running osv-sync locally/composed.
pnpm-lock.yaml Locks newly added dependencies (semver/unzipper/vitest and transitive deps).
docs/adr/README.md Registers ADRs 0003–0005 in the ADR index.
docs/adr/0005-cvss-scoring-strategy.md Documents CVSS scoring strategy (inline v3.1, defer v4).
docs/adr/0004-standalone-bin-vs-temporal-for-batch-sub-workers.md Documents rationale for standalone-bin execution shape for batch sub-workers.
docs/adr/0003-has-critical-vulnerability-semantics.md Documents semantics for has_critical_vulnerability and derivation strategy.
backend/src/osspckgs/migrations/V1779871327__add_has_critical_vulnerability_to_packages.sql Adds the has_critical_vulnerability column + partial index to packages-db.
backend/src/osspckgs/migrations/V1779871303__add_cvss_source_to_advisories.sql Adds advisories.cvss_source for score provenance.
backend/.env.dist.local Adds default local env vars for running osv-sync.
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread services/apps/packages_worker/src/osv/extractSeverity.ts Outdated
Comment thread services/apps/packages_worker/src/osv/index.ts
Comment thread services/apps/packages_worker/src/osv/index.ts Outdated
Comment thread services/apps/packages_worker/src/osv/upsertAdvisory.ts
Comment thread services/apps/packages_worker/src/osv/fetchEcosystemZip.ts
- fetchEcosystemZip: move clearTimeout to cover the pipeline body
  stream, not just the headers fetch. Map pipeline rejection to a
  NETWORK FetchError so withRetry handles stalled mid-transfer
  connections instead of hanging past the daily window.
- index.ts: hoist counters and buffer into the withRetry closure so a
  transient retry restarts from zero. UPSERT is idempotent on osv_id,
  so re-flushed batches are safe.
- index.ts: switch error/warn logs from err.message to { err } so the
  structured logger preserves stack and metadata, matching the rest
  of the service.
- extractSeverity.ts: rewrite the lede comment to match ADR-0005
  (V4 numeric scoring deferred; v1 skips V4 entirely and falls
  through to qualitative for V4-only records).
- V1779871303 migration: list all four cvss_source values so the
  schema doc matches the contract in types.ts and ADR-0005.
- deriveCriticalFlag integration test: extend HAVE_DB to require
  CROWD_PACKAGES_DB_DATABASE and CROWD_PACKAGES_DB_PASSWORD too, so
  half-set envs skip cleanly instead of failing in beforeAll.

Signed-off-by: Joan Reyero <joan@reyero.io>
Comment thread services/apps/packages_worker/src/osv/cvssScoring.ts
cvssScoring.ts read `v.S` directly into the `s === 'C' ? ... : ...`
branches but never validated it. A vector missing `S` or carrying an
invalid value like `S:X` would silently take the Scope:Unchanged
branch in every formula and return a wrong numeric score instead of
null. The 10 reference-vector tests didn't catch it because every
test vector had a valid S:U or S:C.

This is the exact failure mode ADR-0005 named as the headline risk
of choosing inline scoring over the cvss npm package — wrong scores
feed advisories.is_critical and packages.has_critical_vulnerability,
i.e. the entire security overlay.

Fix: validate `s` against {U, C} up front and return null otherwise.
Added two regression tests covering the missing-S and invalid-S
paths.

Caught by Cursor's bot review on cbaf41d.

Signed-off-by: Joan Reyero <joan@reyero.io>
Copilot AI review requested due to automatic review settings May 27, 2026 15:39
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 27 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/apps/packages_worker/src/osv/extractSeverity.ts
Comment thread services/apps/packages_worker/src/osv/cvssScoring.ts
Comment thread services/apps/packages_worker/src/osv/upsertAdvisory.ts
The unique index on advisory_affected_ranges shipped in V1779710880
keyed on (advisory_package_id, COALESCE(introduced_version, '')) —
strictly narrower than the natural uniqueness of a range tuple, and
narrower than the principle locked in osv-plan §2 #1 ("one package
has many version ranges; no denormalization").

dedupeRanges in upsertAdvisory.ts was keying on introduced_version
alone to match that index, with the side effect that two ranges
sharing an introduced_version but differing in fixed_version or
last_affected (cross-distro patches, partial fixes) silently
collapsed to the first occurrence. When the surviving range was
the narrower one, isInRange returned FALSE for versions inside the
wider window — a missed critical alert.

Three changes:

- V1779897650__widen_advisory_affected_ranges_unique_index.sql:
  drop the narrow unique index (located via pg_indexes since the
  initial migration didn't name it) and replace with the full-tuple
  unique index over (advisory_package_id,
  COALESCE(introduced_version,''), COALESCE(fixed_version,''),
  COALESCE(last_affected,'')).
- upsertAdvisory.ts dedupeRanges: key on the full tuple so the
  application-side pre-flight matches the database constraint.
  Exported for unit testing.
- upsertAdvisory.test.ts: 5 cases pinning the new semantics
  (same-introduced-different-fixed preserved, same-introduced-
  different-last_affected preserved, identical-tuple collapsed,
  null-introduced disambiguated by other fields, first-wins on
  truly identical tuples).

ADR-0006 captures the decision and the alternatives considered
(coalesce-to-widest at parse time, drop the constraint, dedup at
query time). Cursor's bot review on 1b978ac surfaced the bug.

Signed-off-by: Joan Reyero <joan@reyero.io>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit d2e92b5. Configure here.

if (c !== 0) return c
}
return 0
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maven comparator silently handles unparseable versions unlike npm

Medium Severity

compareMaven never returns null, violating the documented compareVersion contract ("returns null when either operand cannot be parsed — we treat that as 'do not match'"). Unlike compareNpm, which returns null for unparseable input, tokenizeMaven treats any string as a valid version (e.g., an empty string becomes an empty token list, comparing as if it were version 0). In isInRange, a null result means "no match" (safe), but Maven garbage inputs silently produce a numeric comparison result, which can cause false positives — flagging packages as critically vulnerable when latest_version is empty or otherwise non-version-like.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit d2e92b5. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants