feat: osv advisories ingestion by joanreyero · Pull Request #4149 · linuxfoundation/crowd.dev

joanreyero · 2026-05-27T14:51:49Z

Summary

Adds the osv-sync sub-worker to packages_worker — ingests OSV advisories for npm and Maven, normalizes them into advisories + advisory_packages + advisory_affected_ranges, and derives packages.has_critical_vulnerability from the ecosystem-specific range comparator. Engineer-5 slice of the Tier 2 / Project Osprey sprint.

Verified end-to-end against the live OSV bucket: 226,258 advisories ingested (213,414 MAL-, 10,365 V3-scored, 1,387 qualitative-fallback, 1,092 NULL). Spot-checks land: log4shell at CVSS=10.0, lodash CVE-2021-23337 at 7.2.

What's in this slice

Schema (already in c25111e5d):

advisories.cvss_source — provenance for the numeric score (osv_cvss_v3 / osv_cvss_v4 / osv_qualitative_fallback / osv_malicious_package).
packages.has_critical_vulnerability — boolean with partial index on TRUE.

Worker (services/apps/packages_worker/src/osv/):

fetchEcosystemZip.ts — streamed download + unzipper iteration (memory-bounded for the 196 MB npm zip).
cvssScoring.ts — inline CVSS v3.1 base-score from the FIRST spec.
extractSeverity.ts — MAL- short-circuit, V3 path, qualitative fallback.
parseOsvRecord.ts — Maven groupId:artifactId split, npm @scope/ split, ecosystem allowlist, range flattening.
upsertAdvisory.ts — transactional UPSERT batches, dedupe on the range unique index.
versionCompare.ts — semver for npm + inline ComparableVersion-style for Maven.
deriveCriticalFlag.ts — paged derivation, MAL- override, catch-up package_id resolver.
index.ts — per-ecosystem pass, derive, sleep loop, shutdown-aware.

Docker compose service: scripts/services/osv-sync.yaml.

ADRs

ADR-0003 — has_critical_vulnerability semantics: option (b), TRUE iff latest_version is inside a critical advisory's affected range, plus a MAL- override.
ADR-0004 — standalone bin vs Temporal for batch sub-workers in packages_worker. OSV uses standalone; the forthcoming npm package sync will use Temporal.
ADR-0005 — CVSS scoring: inline v3.1 from the FIRST spec; v4 numeric scoring deferred (V4-only records fall back to qualitative).

Testing

68 vitest tests across 5 files (pnpm test in services/apps/packages_worker):

cvssScoring.test.ts (10) — FIRST reference vectors + malformed input.
extractSeverity.test.ts (8) — MAL-, V3, V4-only fall through, qualitative, empty.
parseOsvRecord.test.ts (13) — name splits, allowlist filter, range flattening, GIT-skip, multi-affected coalescing.
versionCompare.test.ts (30) — npm semver + Maven qualifier ranks/aliases/edge cases.
deriveCriticalFlag.integration.test.ts (7) — hits the local packages-db (skipped automatically without DB env): lodash boundaries, log4j-core boundaries, MAL- override, 1.0-final regression guard, catch-up resolver.

The Maven version comparator unit tests caught a real bug — compareMaven was using num:0 padding for missing tokens regardless of the other side's kind, which mis-ordered 1.0-final vs 1.0 and 1.0 vs 1.0-sp1. Fixed in the same PR.

Also verified out-of-band:

Idempotency: rerunning OSV sync leaves row counts and the md5 hash of (osv_id, cvss, cvss_source) bit-identical.
SIGINT mid-pass: handler runs, current batch flushes, derive + sleep skip, process exits 0.

Known gaps (follow-ups, not blockers)

CVSS v4 numeric scoring is deferred (ADR-0005). ~1.1% of advisories land with cvss = NULL because they are V4-only with no qualitative tag. The fix is local to cvssScoring.ts + extractSeverity.ts.
Docker-compose deploy path (./scripts/cli service osv-sync up) hasn't been smoke-tested locally — the worker was run via tsx directly. Same code, same env vars, same DB.
Network failure retry path (withRetry in index.ts) didn't fire in the live runs because no errors hit. Logic is straightforward — three retries with exponential backoff — but unexercised.

Heads-up for the reviewer

No JIRA key. Per osv-plan.md §8 item 4, this branch was opened without a CM-XXX ticket. The PR title lint at .github/workflows/pr-title-jira-key-lint.yml may reject; if so, happy to either add a ticket or get the lint relaxed for this branch.
Three commits, kept separate intentionally: the schema migration was already on main-bound earlier in the slice; the implementation, the tests, and the ADRs are split for review clarity. Diff is ~2,000 LOC across implementation + tests + docs.

Test plan

CI: lint, prettier, tsc all green in services/apps/packages_worker.
CI: pnpm test in services/apps/packages_worker passes the 5 unit-test files (integration suite skipped without DB env, as expected).
Local: ./scripts/cli service osv-sync up builds the image and reaches OSV sync done for npm.
Local: after a sync pass, SELECT COUNT(*) FROM advisories; shows ≈226k.
Local: a spot-check query for a known critical CVE (e.g. log4shell) shows it in advisories with is_critical = TRUE.

🤖 Generated with Claude Code

Note

High Risk
Security-sensitive ingestion and denormalized vulnerability flags depend on CVSS scoring, version comparators, and range-index correctness; bugs can mis-rank critical exposure across large package sets.

Overview
Adds an osv-sync sub-worker to packages_worker that periodically pulls OSV bulk zips for npm and Maven, normalizes records, upserts into packages-db, and recomputes packages.has_critical_vulnerability from ecosystem version comparators (semver + Maven-style), including MAL-* handling.

Schema: advisories.cvss_source; denormalized has_critical_vulnerability with a partial index; migration widening advisory_affected_ranges uniqueness to the full (introduced, fixed, last_affected) tuple so cross-distro ranges are not dropped.

Worker: bin/osv-sync.ts + src/osv/* (zip download/stream, inline CVSS v3.1, severity extraction, parse/upsert batches, deriveCriticalFlag); getOsvConfig and env samples; Docker compose osv-sync.yaml; semver, unzipper, vitest in package.json.

Docs: ADRs 0003–0006 (critical-flag semantics, standalone bin vs Temporal, CVSS strategy, range dedupe/index) and index update.

^{Reviewed by Cursor Bugbot for commit d2e92b5. Bugbot is set up for automated code reviews on this repo. Configure here.}

Signed-off-by: Joan Reyero <joan@reyero.io>

Adds the osv-sync sub-worker inside packages_worker. Pulls OSV's daily per-ecosystem zip for npm and Maven, normalizes each record, and upserts into advisories, advisory_packages, and advisory_affected_ranges (transactional UPSERT, idempotent on osv_id + the range unique index). MAL- malicious-package reports are ingested with cvss=NULL and cvss_source='osv_malicious_package'. A deriveCriticalFlag step runs at the end of each pass and flips packages.has_critical_vulnerability TRUE iff a critical advisory (cvss>=7.0 OR osv_id LIKE 'MAL-%') has an affected range covering the package's current latest_version, using ecosystem-specific comparators (semver for npm, ComparableVersion-style for Maven). See ADR-0003 for the semantics. CVSS scoring computes v3.1 base scores inline from the FIRST spec; v4 numeric scoring is deferred (V4-only records fall back to the qualitative tag from database_specific.severity). Verified locally against the full OSV dataset (226,258 advisories; log4shell CVSS=10.0, lodash CVE-2021-23337 CVSS=7.2; 213,414 MAL- entries ingested). Signed-off-by: Joan Reyero <joan@reyero.io>

Adds vitest with five test files covering the OSV pipeline: - cvssScoring: 10 cases pinning the inline v3.1 implementation against FIRST-published scores (log4shell 10.0, shellshock 9.8, heartbleed 7.5, and others). Catches future regressions in the formula. - extractSeverity: MAL- short-circuit, V3 vector path, V4-only fall through, qualitative fallback, malformed-vector handling. - parseOsvRecord: Maven groupId:artifactId split, npm @scope/ split, ecosystem allowlist filter, range flattening (introduced -> fixed, introduced -> last_affected, MAL- always-vulnerable, GIT skipped), multi-affected[] coalescing. - versionCompare: npm semver ordering + coercion; Maven dotted versions, qualifier ranks (alpha < beta < milestone < rc < snapshot < ga/final < sp), qualifier aliases, numeric > alpha at same depth, cross-ecosystem null. - deriveCriticalFlag (integration, real packages-db, skipped without DB env): lodash 4.17.20 flips TRUE / 5.0.0 clears, log4j-core 2.14.1 flips TRUE / 2.17.0 clears, MAL- target flips via the osv_id LIKE prefix override, catch-up resolver populates advisory_packages.package_id for late-arriving packages, and a regression guard around the Maven 1.0-final edge case. The versionCompare suite caught a real bug: compareMaven used a num:0 pad for missing tokens in both kinds of comparison. That made 1.0-final < 1.0 (should be equal: 'final' is an alias for the empty 'ga' qualifier) and 1.0 > 1.0-sp1 (should be less: 'sp' outranks 'ga'). Fixed by picking the pad type based on the other side's kind (num:0 vs str:''), matching the Maven ComparableVersion algorithm. Also verified out-of-band (not in suite): - Idempotency: rerunning OSV sync leaves advisories, advisory_packages, advisory_affected_ranges row counts and the md5 hash of (osv_id, cvss, cvss_source) bit-identical. - SIGINT mid-pass: shutdown handler runs, current batch flushes, derive + sleep skip, process exits 0. 68 tests / 5 files pass; lint + prettier + tsc clean. Signed-off-by: Joan Reyero <joan@reyero.io>

ADR-0004 captures the standalone-bin vs Temporal decision for batch sub-workers in packages_worker (OSV uses standalone; npm package sync will use Temporal). ADR-0005 captures the CVSS scoring strategy (inline v3.1 from the FIRST spec, V4 numeric scoring deferred to a follow-up, qualitative fallback in the meantime). Both record the alternatives that were considered and rejected so the next engineer touching these areas has the rationale in one place. Signed-off-by: Joan Reyero <joan@reyero.io>

CLAassistant · 2026-05-27T14:51:59Z

All committers have signed the CLA.

github-actions · 2026-05-27T14:52:03Z

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

feat: add user authentication (CM-123)
feat: add user authentication (IN-123)

Projects:

CM: Community Data Platform
IN: Insights

Please add a Jira issue key to your PR title.

Copilot

Pull request overview

Adds an osv-sync sub-worker to packages_worker that ingests OSV bulk advisories (npm + Maven), normalizes them into packages-db advisory tables, and derives the denormalized packages.has_critical_vulnerability flag based on ecosystem-specific version comparisons and scored severity.

Changes:

Introduces OSV ingestion pipeline (download/parse/score/upsert) and a post-pass derivation step for has_critical_vulnerability.
Adds a Maven ComparableVersion-style comparator and npm semver comparator plus unit/integration tests (Vitest).
Adds docs (ADRs) and local/docker service wiring for running osv-sync.

Reviewed changes

Copilot reviewed 24 out of 27 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
services/apps/packages_worker/vitest.config.ts	Adds Vitest configuration for the packages_worker test suite.
services/apps/packages_worker/src/osv/versionCompare.ts	Implements ecosystem-specific version comparison (npm semver + Maven-like comparator).
services/apps/packages_worker/src/osv/upsertAdvisory.ts	Writes normalized OSV advisories/packages/ranges to packages-db via transactional upserts.
services/apps/packages_worker/src/osv/types.ts	Defines OSV raw/normalized types and a FetchError for ingestion.
services/apps/packages_worker/src/osv/index.ts	Orchestrates per-ecosystem sync loop with retries and post-pass critical-flag derivation.
services/apps/packages_worker/src/osv/fetchEcosystemZip.ts	Streams OSV zip download to disk and iterates JSON entries for parsing.
services/apps/packages_worker/src/osv/extractSeverity.ts	Extracts/seeds severity and CVSS score from OSV records (MAL-/V3/qualitative fallback).
services/apps/packages_worker/src/osv/deriveCriticalFlag.ts	Recomputes `packages.has_critical_vulnerability` by checking latest_version against critical ranges.
services/apps/packages_worker/src/osv/cvssScoring.ts	Implements inline CVSS v3.1 base-score calculation.
services/apps/packages_worker/src/osv/tests/versionCompare.test.ts	Unit tests for npm and Maven version ordering.
services/apps/packages_worker/src/osv/tests/parseOsvRecord.test.ts	Unit tests for OSV record parsing behaviors (name splitting, allowlist, range flattening).
services/apps/packages_worker/src/osv/tests/extractSeverity.test.ts	Unit tests for severity extraction paths (MAL-, V3, V4-only qualitative fallback).
services/apps/packages_worker/src/osv/tests/deriveCriticalFlag.integration.test.ts	DB-backed integration tests for end-to-end derivation behavior (skipped without env).
services/apps/packages_worker/src/osv/tests/cvssScoring.test.ts	Reference-vector tests to pin CVSS scoring implementation.
services/apps/packages_worker/src/config.ts	Adds OSV-specific worker config sourced from env vars.
services/apps/packages_worker/src/bin/osv-sync.ts	Adds standalone entrypoint binary for the OSV sync worker with shutdown handling.
services/apps/packages_worker/package.json	Adds scripts and dependencies/devDependencies for osv-sync + tests.
scripts/services/osv-sync.yaml	Adds docker-compose service definition for running osv-sync locally/composed.
pnpm-lock.yaml	Locks newly added dependencies (semver/unzipper/vitest and transitive deps).
docs/adr/README.md	Registers ADRs 0003–0005 in the ADR index.
docs/adr/0005-cvss-scoring-strategy.md	Documents CVSS scoring strategy (inline v3.1, defer v4).
docs/adr/0004-standalone-bin-vs-temporal-for-batch-sub-workers.md	Documents rationale for standalone-bin execution shape for batch sub-workers.
docs/adr/0003-has-critical-vulnerability-semantics.md	Documents semantics for `has_critical_vulnerability` and derivation strategy.
backend/src/osspckgs/migrations/V1779871327__add_has_critical_vulnerability_to_packages.sql	Adds the `has_critical_vulnerability` column + partial index to packages-db.
backend/src/osspckgs/migrations/V1779871303__add_cvss_source_to_advisories.sql	Adds `advisories.cvss_source` for score provenance.
backend/.env.dist.local	Adds default local env vars for running osv-sync.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- fetchEcosystemZip: move clearTimeout to cover the pipeline body stream, not just the headers fetch. Map pipeline rejection to a NETWORK FetchError so withRetry handles stalled mid-transfer connections instead of hanging past the daily window. - index.ts: hoist counters and buffer into the withRetry closure so a transient retry restarts from zero. UPSERT is idempotent on osv_id, so re-flushed batches are safe. - index.ts: switch error/warn logs from err.message to { err } so the structured logger preserves stack and metadata, matching the rest of the service. - extractSeverity.ts: rewrite the lede comment to match ADR-0005 (V4 numeric scoring deferred; v1 skips V4 entirely and falls through to qualitative for V4-only records). - V1779871303 migration: list all four cvss_source values so the schema doc matches the contract in types.ts and ADR-0005. - deriveCriticalFlag integration test: extend HAVE_DB to require CROWD_PACKAGES_DB_DATABASE and CROWD_PACKAGES_DB_PASSWORD too, so half-set envs skip cleanly instead of failing in beforeAll. Signed-off-by: Joan Reyero <joan@reyero.io>

cvssScoring.ts read `v.S` directly into the `s === 'C' ? ... : ...` branches but never validated it. A vector missing `S` or carrying an invalid value like `S:X` would silently take the Scope:Unchanged branch in every formula and return a wrong numeric score instead of null. The 10 reference-vector tests didn't catch it because every test vector had a valid S:U or S:C. This is the exact failure mode ADR-0005 named as the headline risk of choosing inline scoring over the cvss npm package — wrong scores feed advisories.is_critical and packages.has_critical_vulnerability, i.e. the entire security overlay. Fix: validate `s` against {U, C} up front and return null otherwise. Added two regression tests covering the missing-S and invalid-S paths. Caught by Cursor's bot review on cbaf41d. Signed-off-by: Joan Reyero <joan@reyero.io>

Copilot

Pull request overview

Copilot reviewed 24 out of 27 changed files in this pull request and generated 2 comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

The unique index on advisory_affected_ranges shipped in V1779710880 keyed on (advisory_package_id, COALESCE(introduced_version, '')) — strictly narrower than the natural uniqueness of a range tuple, and narrower than the principle locked in osv-plan §2 #1 ("one package has many version ranges; no denormalization"). dedupeRanges in upsertAdvisory.ts was keying on introduced_version alone to match that index, with the side effect that two ranges sharing an introduced_version but differing in fixed_version or last_affected (cross-distro patches, partial fixes) silently collapsed to the first occurrence. When the surviving range was the narrower one, isInRange returned FALSE for versions inside the wider window — a missed critical alert. Three changes: - V1779897650__widen_advisory_affected_ranges_unique_index.sql: drop the narrow unique index (located via pg_indexes since the initial migration didn't name it) and replace with the full-tuple unique index over (advisory_package_id, COALESCE(introduced_version,''), COALESCE(fixed_version,''), COALESCE(last_affected,'')). - upsertAdvisory.ts dedupeRanges: key on the full tuple so the application-side pre-flight matches the database constraint. Exported for unit testing. - upsertAdvisory.test.ts: 5 cases pinning the new semantics (same-introduced-different-fixed preserved, same-introduced- different-last_affected preserved, identical-tuple collapsed, null-introduced disambiguated by other fields, first-wins on truly identical tuples). ADR-0006 captures the decision and the alternatives considered (coalesce-to-widest at parse time, drop the constraint, dedup at query time). Cursor's bot review on 1b978ac surfaced the bug. Signed-off-by: Joan Reyero <joan@reyero.io>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit d2e92b5. Configure here.}

cursor · 2026-05-27T16:16:13Z

+    if (c !== 0) return c
+  }
+  return 0
+}


Maven comparator silently handles unparseable versions unlike npm

Medium Severity

compareMaven never returns null, violating the documented compareVersion contract ("returns null when either operand cannot be parsed — we treat that as 'do not match'"). Unlike compareNpm, which returns null for unparseable input, tokenizeMaven treats any string as a valid version (e.g., an empty string becomes an empty token list, comparing as if it were version 0). In isInRange, a null result means "no match" (safe), but Maven garbage inputs silently produce a numeric comparison result, which can cause false positives — flagging packages as critically vulnerable when latest_version is empty or otherwise non-version-like.

Additional Locations (1)

services/apps/packages_worker/src/osv/deriveCriticalFlag.ts#L27-L49

^{Reviewed by Cursor Bugbot for commit d2e92b5. Configure here.}

joanreyero added 4 commits May 27, 2026 10:41

feat: add cvss_source and has_critical_vulnerability columns

c25111e

Signed-off-by: Joan Reyero <joan@reyero.io>

Copilot AI review requested due to automatic review settings May 27, 2026 14:51

Copilot started reviewing on behalf of joanreyero May 27, 2026 14:52 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

cursor Bot reviewed May 27, 2026

View reviewed changes

Comment thread services/apps/packages_worker/src/osv/fetchEcosystemZip.ts

cursor Bot reviewed May 27, 2026

View reviewed changes

Comment thread services/apps/packages_worker/src/osv/cvssScoring.ts

Copilot AI review requested due to automatic review settings May 27, 2026 15:39

Copilot started reviewing on behalf of joanreyero May 27, 2026 15:39 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

Comment thread services/apps/packages_worker/src/osv/extractSeverity.ts

Comment thread services/apps/packages_worker/src/osv/cvssScoring.ts

cursor Bot reviewed May 27, 2026

View reviewed changes

Comment thread services/apps/packages_worker/src/osv/upsertAdvisory.ts

cursor Bot reviewed May 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: osv advisories ingestion#4149

feat: osv advisories ingestion#4149
joanreyero wants to merge 7 commits into
mainfrom
feat/osv-advisories

joanreyero commented May 27, 2026 •

edited by cursor Bot

Loading

Uh oh!

CLAassistant commented May 27, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

joanreyero commented May 27, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in this slice

ADRs

Testing

Known gaps (follow-ups, not blockers)

Heads-up for the reviewer

Test plan

Uh oh!

CLAassistant commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 27, 2026

Choose a reason for hiding this comment

Maven comparator silently handles unparseable versions unlike npm

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

joanreyero commented May 27, 2026 •

edited by cursor Bot

Loading

CLAassistant commented May 27, 2026 •

edited

Loading