feat: osv advisories ingestion #4149
Signed-off-by: Joan Reyero <joan@reyero.io>
Adds the osv-sync sub-worker inside packages_worker. Pulls OSV's daily per-ecosystem zip for npm and Maven, normalizes each record, and upserts into advisories, advisory_packages, and advisory_affected_ranges (transactional UPSERT, idempotent on osv_id + the range unique index). MAL- malicious-package reports are ingested with cvss=NULL and cvss_source='osv_malicious_package'.

A deriveCriticalFlag step runs at the end of each pass and flips packages.has_critical_vulnerability TRUE iff a critical advisory (cvss>=7.0 OR osv_id LIKE 'MAL-%') has an affected range covering the package's current latest_version, using ecosystem-specific comparators (semver for npm, ComparableVersion-style for Maven). See ADR-0003 for the semantics.

CVSS scoring computes v3.1 base scores inline from the FIRST spec; v4 numeric scoring is deferred (V4-only records fall back to the qualitative tag from database_specific.severity).

Verified locally against the full OSV dataset (226,258 advisories; log4shell CVSS=10.0, lodash CVE-2021-23337 CVSS=7.2; 213,414 MAL- entries ingested).

Signed-off-by: Joan Reyero <joan@reyero.io>
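For reviewers, the criticality rule and range check described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Advisory` and `AffectedRange` shapes, the treatment of a null `introduced` as "affected from the first release", and the toy `compareDotted` comparator (standing in for semver and the Maven comparator) are all assumptions.

```typescript
// Hypothetical shapes; the PR's real types live in src/osv/types.ts and may differ.
type Comparator = (a: string, b: string) => number | null;

interface AffectedRange {
  introduced: string | null;   // null = affected from the first release (assumption)
  fixed: string | null;        // exclusive upper bound
  lastAffected: string | null; // inclusive upper bound
}

interface Advisory {
  osvId: string;
  cvss: number | null;
}

// Critical per the commit message: cvss >= 7.0, or a MAL- malicious-package report.
function isCriticalAdvisory(a: Advisory): boolean {
  return a.osvId.startsWith("MAL-") || (a.cvss !== null && a.cvss >= 7.0);
}

// A null comparison means "could not parse" and is treated as "no match".
function isInRange(version: string, r: AffectedRange, cmp: Comparator): boolean {
  if (r.introduced !== null) {
    const lo = cmp(version, r.introduced);
    if (lo === null || lo < 0) return false;
  }
  if (r.fixed !== null) {
    const hi = cmp(version, r.fixed);
    return hi !== null && hi < 0;
  }
  if (r.lastAffected !== null) {
    const hi = cmp(version, r.lastAffected);
    return hi !== null && hi <= 0;
  }
  return true; // no upper bound: everything from `introduced` onward is affected
}

// Toy comparator for plain dotted numeric versions only (illustration, not semver).
function compareDotted(a: string, b: string): number | null {
  const parse = (v: string): number[] | null =>
    /^\d+(\.\d+)*$/.test(v) ? v.split(".").map(Number) : null;
  const pa = parse(a);
  const pb = parse(b);
  if (pa === null || pb === null) return null;
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const d = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (d !== 0) return d < 0 ? -1 : 1;
  }
  return 0;
}
```

With an illustrative range of `{ introduced: "0", fixed: "4.17.21", lastAffected: null }`, `isInRange("4.17.20", range, compareDotted)` is true and `isInRange("5.0.0", range, compareDotted)` is false, which mirrors the lodash boundary described in the test notes.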
Adds vitest with five test files covering the OSV pipeline:

- cvssScoring: 10 cases pinning the inline v3.1 implementation against FIRST-published scores (log4shell 10.0, shellshock 9.8, heartbleed 7.5, and others). Catches future regressions in the formula.
- extractSeverity: MAL- short-circuit, V3 vector path, V4-only fall through, qualitative fallback, malformed-vector handling.
- parseOsvRecord: Maven groupId:artifactId split, npm @scope/ split, ecosystem allowlist filter, range flattening (introduced -> fixed, introduced -> last_affected, MAL- always-vulnerable, GIT skipped), multi-affected[] coalescing.
- versionCompare: npm semver ordering + coercion; Maven dotted versions, qualifier ranks (alpha < beta < milestone < rc < snapshot < ga/final < sp), qualifier aliases, numeric > alpha at same depth, cross-ecosystem null.
- deriveCriticalFlag (integration, real packages-db, skipped without DB env): lodash 4.17.20 flips TRUE / 5.0.0 clears, log4j-core 2.14.1 flips TRUE / 2.17.0 clears, MAL- target flips via the osv_id LIKE prefix override, catch-up resolver populates advisory_packages.package_id for late-arriving packages, and a regression guard around the Maven 1.0-final edge case.

The versionCompare suite caught a real bug: compareMaven used a num:0 pad for missing tokens in both kinds of comparison. That made 1.0-final < 1.0 (should be equal: 'final' is an alias for the empty 'ga' qualifier) and 1.0 > 1.0-sp1 (should be less: 'sp' outranks 'ga'). Fixed by picking the pad type based on the other side's kind (num:0 vs str:''), matching the Maven ComparableVersion algorithm.

Also verified out-of-band (not in suite):

- Idempotency: rerunning OSV sync leaves advisories, advisory_packages, advisory_affected_ranges row counts and the md5 hash of (osv_id, cvss, cvss_source) bit-identical.
- SIGINT mid-pass: shutdown handler runs, current batch flushes, derive + sleep skip, process exits 0.

68 tests / 5 files pass; lint + prettier + tsc clean.
Signed-off-by: Joan Reyero <joan@reyero.io>
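The padding fix described in this commit can be illustrated with a simplified comparator. This is a sketch under stated assumptions, not the PR's `compareMaven`: the tokenizer splits only on digit/letter runs, the alias table (`ga`, `final`, `cr`) is an assumption, and the nested-list handling of hyphens in Maven's real `ComparableVersion` is left out.

```typescript
type Token = { kind: "num"; value: number } | { kind: "str"; value: string };

// Qualifier ranks as listed in the commit message; "" is the implicit release qualifier.
const QUALIFIER_RANK = new Map<string, number>([
  ["alpha", 0], ["beta", 1], ["milestone", 2], ["rc", 3], ["snapshot", 4], ["", 5], ["sp", 6],
]);
// Aliases assumed for this sketch.
const ALIASES = new Map<string, string>([["ga", ""], ["final", ""], ["cr", "rc"]]);

function tokenize(version: string): Token[] {
  const parts = version.toLowerCase().match(/\d+|[a-z]+/g) ?? [];
  return parts.map((p): Token =>
    /^\d/.test(p)
      ? { kind: "num", value: Number(p) }
      : { kind: "str", value: ALIASES.get(p) ?? p },
  );
}

// Unknown qualifiers sort after "sp" and are ordered lexically among themselves.
function rank(q: string): number {
  return QUALIFIER_RANK.get(q) ?? 7;
}

function compareTokens(a: Token, b: Token): number {
  if (a.kind === "num" && b.kind === "num") return Math.sign(a.value - b.value);
  if (a.kind === "str" && b.kind === "str") {
    const d = rank(a.value) - rank(b.value);
    if (d !== 0) return Math.sign(d);
    return a.value < b.value ? -1 : a.value > b.value ? 1 : 0;
  }
  return a.kind === "num" ? 1 : -1; // a number outranks a qualifier at the same depth
}

// The fix: pad a missing token with the neutral value of the OTHER side's kind.
function pad(other: Token | undefined): Token {
  return other !== undefined && other.kind === "str"
    ? { kind: "str", value: "" }
    : { kind: "num", value: 0 };
}

function compareMavenSketch(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  for (let i = 0; i < Math.max(ta.length, tb.length); i++) {
    const x = ta[i];
    const y = tb[i];
    const c = compareTokens(x ?? pad(y), y ?? pad(x));
    if (c !== 0) return c;
  }
  return 0;
}
```

With a `num:0` pad on both sides, `1.0-final` vs `1.0` would compare the empty qualifier against a number and lose, which is the mis-ordering the suite caught. With the kind-aware pad, `compareMavenSketch("1.0-final", "1.0")` is 0 and `compareMavenSketch("1.0", "1.0-sp1")` is negative.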
ADR-0004 captures the standalone-bin vs Temporal decision for batch sub-workers in packages_worker (OSV uses standalone; npm package sync will use Temporal). ADR-0005 captures the CVSS scoring strategy (inline v3.1 from the FIRST spec, V4 numeric scoring deferred to a follow-up, qualitative fallback in the meantime). Both record the alternatives that were considered and rejected so the next engineer touching these areas has the rationale in one place.

Signed-off-by: Joan Reyero <joan@reyero.io>
Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability. Please add a Jira issue key to your PR title.
Pull request overview
Adds an osv-sync sub-worker to packages_worker that ingests OSV bulk advisories (npm + Maven), normalizes them into packages-db advisory tables, and derives the denormalized packages.has_critical_vulnerability flag based on ecosystem-specific version comparisons and scored severity.
Changes:
- Introduces OSV ingestion pipeline (download/parse/score/upsert) and a post-pass derivation step for `has_critical_vulnerability`.
- Adds a Maven ComparableVersion-style comparator and npm semver comparator plus unit/integration tests (Vitest).
- Adds docs (ADRs) and local/docker service wiring for running `osv-sync`.
Reviewed changes
Copilot reviewed 24 out of 27 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| services/apps/packages_worker/vitest.config.ts | Adds Vitest configuration for the packages_worker test suite. |
| services/apps/packages_worker/src/osv/versionCompare.ts | Implements ecosystem-specific version comparison (npm semver + Maven-like comparator). |
| services/apps/packages_worker/src/osv/upsertAdvisory.ts | Writes normalized OSV advisories/packages/ranges to packages-db via transactional upserts. |
| services/apps/packages_worker/src/osv/types.ts | Defines OSV raw/normalized types and a FetchError for ingestion. |
| services/apps/packages_worker/src/osv/index.ts | Orchestrates per-ecosystem sync loop with retries and post-pass critical-flag derivation. |
| services/apps/packages_worker/src/osv/fetchEcosystemZip.ts | Streams OSV zip download to disk and iterates JSON entries for parsing. |
| services/apps/packages_worker/src/osv/extractSeverity.ts | Extracts/seeds severity and CVSS score from OSV records (MAL-/V3/qualitative fallback). |
| services/apps/packages_worker/src/osv/deriveCriticalFlag.ts | Recomputes packages.has_critical_vulnerability by checking latest_version against critical ranges. |
| services/apps/packages_worker/src/osv/cvssScoring.ts | Implements inline CVSS v3.1 base-score calculation. |
| services/apps/packages_worker/src/osv/tests/versionCompare.test.ts | Unit tests for npm and Maven version ordering. |
| services/apps/packages_worker/src/osv/tests/parseOsvRecord.test.ts | Unit tests for OSV record parsing behaviors (name splitting, allowlist, range flattening). |
| services/apps/packages_worker/src/osv/tests/extractSeverity.test.ts | Unit tests for severity extraction paths (MAL-, V3, V4-only qualitative fallback). |
| services/apps/packages_worker/src/osv/tests/deriveCriticalFlag.integration.test.ts | DB-backed integration tests for end-to-end derivation behavior (skipped without env). |
| services/apps/packages_worker/src/osv/tests/cvssScoring.test.ts | Reference-vector tests to pin CVSS scoring implementation. |
| services/apps/packages_worker/src/config.ts | Adds OSV-specific worker config sourced from env vars. |
| services/apps/packages_worker/src/bin/osv-sync.ts | Adds standalone entrypoint binary for the OSV sync worker with shutdown handling. |
| services/apps/packages_worker/package.json | Adds scripts and dependencies/devDependencies for osv-sync + tests. |
| scripts/services/osv-sync.yaml | Adds docker-compose service definition for running osv-sync locally/composed. |
| pnpm-lock.yaml | Locks newly added dependencies (semver/unzipper/vitest and transitive deps). |
| docs/adr/README.md | Registers ADRs 0003–0005 in the ADR index. |
| docs/adr/0005-cvss-scoring-strategy.md | Documents CVSS scoring strategy (inline v3.1, defer v4). |
| docs/adr/0004-standalone-bin-vs-temporal-for-batch-sub-workers.md | Documents rationale for standalone-bin execution shape for batch sub-workers. |
| docs/adr/0003-has-critical-vulnerability-semantics.md | Documents semantics for has_critical_vulnerability and derivation strategy. |
| backend/src/osspckgs/migrations/V1779871327__add_has_critical_vulnerability_to_packages.sql | Adds the has_critical_vulnerability column + partial index to packages-db. |
| backend/src/osspckgs/migrations/V1779871303__add_cvss_source_to_advisories.sql | Adds advisories.cvss_source for score provenance. |
| backend/.env.dist.local | Adds default local env vars for running osv-sync. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
- fetchEcosystemZip: move clearTimeout to cover the pipeline body
stream, not just the headers fetch. Map pipeline rejection to a
NETWORK FetchError so withRetry handles stalled mid-transfer
connections instead of hanging past the daily window.
- index.ts: hoist counters and buffer into the withRetry closure so a
transient retry restarts from zero. UPSERT is idempotent on osv_id,
so re-flushed batches are safe.
- index.ts: switch error/warn logs from err.message to { err } so the
structured logger preserves stack and metadata, matching the rest
of the service.
- extractSeverity.ts: rewrite the lede comment to match ADR-0005
(V4 numeric scoring deferred; v1 skips V4 entirely and falls
through to qualitative for V4-only records).
- V1779871303 migration: list all four cvss_source values so the
schema doc matches the contract in types.ts and ADR-0005.
- deriveCriticalFlag integration test: extend HAVE_DB to require
CROWD_PACKAGES_DB_DATABASE and CROWD_PACKAGES_DB_PASSWORD too, so
half-set envs skip cleanly instead of failing in beforeAll.
Signed-off-by: Joan Reyero <joan@reyero.io>
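The counter-hoisting change can be sketched as below. The `withRetry` signature, the `RetryOptions` fields, the 1-second base delay, and the `syncEcosystem` helper are hypothetical (the PR's real signatures are not shown here); only the "three retries with exponential backoff" shape and the "state lives inside the retried closure" idea come from the PR text.

```typescript
// Hypothetical options; the real withRetry in index.ts may look different.
interface RetryOptions {
  retries: number;      // retries allowed after the first attempt
  baseDelayMs: number;  // delay before the first retry
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Exponential backoff: baseDelayMs, then 2x, 4x, ...
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** attempt;
}

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= opts.retries) throw err; // out of retries: surface the error
      await sleep(backoffDelayMs(attempt, opts.baseDelayMs));
    }
  }
}

// Counters and the batch buffer are declared INSIDE the retried closure,
// so a retry restarts from zero instead of double-counting earlier records.
async function syncEcosystem(
  records: () => AsyncIterable<string>,
  flush: (batch: string[]) => Promise<void>,
  batchSize: number,
): Promise<number> {
  return withRetry(
    async () => {
      let processed = 0;
      let buffer: string[] = [];
      for await (const rec of records()) {
        buffer.push(rec);
        if (buffer.length >= batchSize) {
          await flush(buffer);
          processed += buffer.length;
          buffer = [];
        }
      }
      if (buffer.length > 0) {
        await flush(buffer);
        processed += buffer.length;
      }
      return processed;
    },
    { retries: 3, baseDelayMs: 1000 }, // assumed values
  );
}
```

Re-flushing a batch after a retry is only safe because the upsert is idempotent on `osv_id`, as the commit message notes; without that property the closure would need to track what had already been written.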
cvssScoring.ts read `v.S` directly into the `s === 'C' ? ... : ...`
branches but never validated it. A vector missing `S` or carrying an
invalid value like `S:X` would silently take the Scope:Unchanged
branch in every formula and return a wrong numeric score instead of
null. The 10 reference-vector tests didn't catch it because every
test vector had a valid S:U or S:C.
This is the exact failure mode ADR-0005 named as the headline risk
of choosing inline scoring over the cvss npm package — wrong scores
feed advisories.is_critical and packages.has_critical_vulnerability,
i.e. the entire security overlay.
Fix: validate `s` against {U, C} up front and return null otherwise.
Added two regression tests covering the missing-S and invalid-S
paths.
Caught by Cursor's bot review on cbaf41d.
Signed-off-by: Joan Reyero <joan@reyero.io>
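For context, a compact version of the v3.1 base-score calculation with the Scope check up front might look like this. It is a sketch written from the published CVSS v3.1 equations and metric weights, not a copy of `cvssScoring.ts`; the function name `cvss31BaseScore` and the choice to accept only the `CVSS:3.1/` prefix are assumptions.

```typescript
const AV = new Map<string, number>([["N", 0.85], ["A", 0.62], ["L", 0.55], ["P", 0.2]]);
const AC = new Map<string, number>([["L", 0.77], ["H", 0.44]]);
const UI = new Map<string, number>([["N", 0.85], ["R", 0.62]]);
const CIA = new Map<string, number>([["H", 0.56], ["L", 0.22], ["N", 0]]);
const PR_UNCHANGED = new Map<string, number>([["N", 0.85], ["L", 0.62], ["H", 0.27]]);
const PR_CHANGED = new Map<string, number>([["N", 0.85], ["L", 0.68], ["H", 0.5]]);

// Roundup as defined in CVSS v3.1 (integer arithmetic avoids float artifacts).
function roundUp(x: number): number {
  const i = Math.round(x * 100000);
  return i % 10000 === 0 ? i / 100000 : (Math.floor(i / 10000) + 1) / 10;
}

// Returns null for anything that cannot be scored, never a guessed number.
function cvss31BaseScore(vector: string): number | null {
  if (!vector.startsWith("CVSS:3.1/")) return null;
  const m = new Map<string, string>();
  for (const part of vector.split("/").slice(1)) {
    const [k, v] = part.split(":");
    if (k === undefined || v === undefined) return null;
    m.set(k, v);
  }
  const s = m.get("S");
  if (s !== "U" && s !== "C") return null; // missing or invalid Scope: refuse to score

  const av = AV.get(m.get("AV") ?? "");
  const ac = AC.get(m.get("AC") ?? "");
  const pr = (s === "C" ? PR_CHANGED : PR_UNCHANGED).get(m.get("PR") ?? "");
  const ui = UI.get(m.get("UI") ?? "");
  const c = CIA.get(m.get("C") ?? "");
  const i = CIA.get(m.get("I") ?? "");
  const a = CIA.get(m.get("A") ?? "");
  if (av === undefined || ac === undefined || pr === undefined || ui === undefined) return null;
  if (c === undefined || i === undefined || a === undefined) return null;

  const iss = 1 - (1 - c) * (1 - i) * (1 - a);
  const impact =
    s === "U" ? 6.42 * iss : 7.52 * (iss - 0.029) - 3.25 * Math.pow(iss - 0.02, 15);
  const exploitability = 8.22 * av * ac * pr * ui;
  if (impact <= 0) return 0;
  const raw = s === "U" ? impact + exploitability : 1.08 * (impact + exploitability);
  return roundUp(Math.min(raw, 10));
}
```

The key point is that the Scope check happens before any formula branches on `s`, so a vector such as `CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:X/C:H/I:H/A:H` yields null rather than silently taking the Scope:Unchanged branch.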
The unique index on advisory_affected_ranges shipped in V1779710880 keyed on (advisory_package_id, COALESCE(introduced_version, '')), which is strictly narrower than the natural uniqueness of a range tuple, and narrower than the principle locked in osv-plan §2 #1 ("one package has many version ranges; no denormalization"). dedupeRanges in upsertAdvisory.ts was keying on introduced_version alone to match that index, with the side effect that two ranges sharing an introduced_version but differing in fixed_version or last_affected (cross-distro patches, partial fixes) silently collapsed to the first occurrence. When the surviving range was the narrower one, isInRange returned FALSE for versions inside the wider window: a missed critical alert.

Three changes:

- V1779897650__widen_advisory_affected_ranges_unique_index.sql: drop the narrow unique index (located via pg_indexes since the initial migration didn't name it) and replace with the full-tuple unique index over (advisory_package_id, COALESCE(introduced_version,''), COALESCE(fixed_version,''), COALESCE(last_affected,'')).
- upsertAdvisory.ts dedupeRanges: key on the full tuple so the application-side pre-flight matches the database constraint. Exported for unit testing.
- upsertAdvisory.test.ts: 5 cases pinning the new semantics (same-introduced-different-fixed preserved, same-introduced-different-last_affected preserved, identical-tuple collapsed, null-introduced disambiguated by other fields, first-wins on truly identical tuples).

ADR-0006 captures the decision and the alternatives considered (coalesce-to-widest at parse time, drop the constraint, dedup at query time).

Cursor's bot review on 1b978ac surfaced the bug.

Signed-off-by: Joan Reyero <joan@reyero.io>
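A full-tuple dedupe along the lines described could look like the following. The `RangeRow` field names and the `rangeKey` helper are hypothetical, and the sketch assumes the input array holds the ranges of a single advisory_package row, so `advisory_package_id` is not part of the key.

```typescript
// Hypothetical row shape; field names are assumptions for this sketch.
interface RangeRow {
  introducedVersion: string | null;
  fixedVersion: string | null;
  lastAffected: string | null;
}

// Mirrors the COALESCE(col, '') expressions of the widened unique index,
// so null and '' collapse to the same key exactly as they would in the database.
function rangeKey(r: RangeRow): string {
  return JSON.stringify([
    r.introducedVersion ?? "",
    r.fixedVersion ?? "",
    r.lastAffected ?? "",
  ]);
}

function dedupeRanges<T extends RangeRow>(ranges: readonly T[]): T[] {
  const seen = new Set<string>();
  const out: T[] = [];
  for (const r of ranges) {
    const key = rangeKey(r);
    if (seen.has(key)) continue; // first occurrence wins on identical tuples
    seen.add(key);
    out.push(r);
  }
  return out;
}
```

Keying on `JSON.stringify` of the tuple rather than a delimiter-joined string avoids collisions when a version string happens to contain the delimiter.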
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit d2e92b5. Configure here.
    if (c !== 0) return c
  }
  return 0
}
Maven comparator silently handles unparseable versions unlike npm
Medium Severity
compareMaven never returns null, violating the documented compareVersion contract ("returns null when either operand cannot be parsed — we treat that as 'do not match'"). Unlike compareNpm, which returns null for unparseable input, tokenizeMaven treats any string as a valid version (e.g., an empty string becomes an empty token list, comparing as if it were version 0). In isInRange, a null result means "no match" (safe), but Maven garbage inputs silently produce a numeric comparison result, which can cause false positives — flagging packages as critically vulnerable when latest_version is empty or otherwise non-version-like.
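One possible shape for restoring the null contract is a guard in front of the Maven comparison. This is only a sketch: the parseability rule (non-empty after trimming and starting with a digit) is an assumption, real Maven versions can be more exotic, and the helper names are hypothetical rather than taken from the PR.

```typescript
// Assumed rule for "looks like a version"; the eventual fix may choose differently.
function isParseableMavenVersion(v: string): boolean {
  return /^\d/.test(v.trim());
}

// Wraps any total comparator so that garbage input yields null instead of a number.
function compareOrNull(
  a: string,
  b: string,
  compare: (x: string, y: string) => number,
): number | null {
  if (!isParseableMavenVersion(a) || !isParseableMavenVersion(b)) return null;
  return compare(a, b);
}

// Caller side: a null comparison is treated as "do not match", so an empty
// latest_version can no longer be flagged as falling below a fixed version.
function isBelowFixed(
  latest: string,
  fixed: string,
  compare: (x: string, y: string) => number,
): boolean {
  const c = compareOrNull(latest, fixed, compare);
  return c !== null && c < 0;
}
```

The design choice here is to keep the comparator itself total and put the validation at the boundary, so the ordering logic and the "is this a version at all" question can be tested separately.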


Summary
Adds the `osv-sync` sub-worker to `packages_worker`: ingests OSV advisories for npm and Maven, normalizes them into `advisories` + `advisory_packages` + `advisory_affected_ranges`, and derives `packages.has_critical_vulnerability` from the ecosystem-specific range comparator. Engineer-5 slice of the Tier 2 / Project Osprey sprint.

Verified end-to-end against the live OSV bucket: 226,258 advisories ingested (213,414 MAL-, 10,365 V3-scored, 1,387 qualitative-fallback, 1,092 NULL). Spot-checks land: log4shell at CVSS=10.0, lodash CVE-2021-23337 at 7.2.
What's in this slice
Schema (already in `c25111e5d`):

- `advisories.cvss_source`: provenance for the numeric score (`osv_cvss_v3` / `osv_cvss_v4` / `osv_qualitative_fallback` / `osv_malicious_package`).
- `packages.has_critical_vulnerability`: boolean with partial index on TRUE.

Worker (`services/apps/packages_worker/src/osv/`):

- `fetchEcosystemZip.ts`: streamed download + `unzipper` iteration (memory-bounded for the 196 MB npm zip).
- `cvssScoring.ts`: inline CVSS v3.1 base-score from the FIRST spec.
- `extractSeverity.ts`: MAL- short-circuit, V3 path, qualitative fallback.
- `parseOsvRecord.ts`: Maven `groupId:artifactId` split, npm `@scope/` split, ecosystem allowlist, range flattening.
- `upsertAdvisory.ts`: transactional UPSERT batches, dedupe on the range unique index.
- `versionCompare.ts`: semver for npm + inline `ComparableVersion`-style for Maven.
- `deriveCriticalFlag.ts`: paged derivation, MAL- override, catch-up `package_id` resolver.
- `index.ts`: per-ecosystem pass, derive, sleep loop, shutdown-aware.

Docker compose service: `scripts/services/osv-sync.yaml`.

ADRs
- `has_critical_vulnerability` semantics: option (b), TRUE iff `latest_version` is inside a critical advisory's affected range, plus a MAL- override.
- [...] `packages_worker`. OSV uses standalone; the forthcoming npm package sync will use Temporal.

Testing
68 vitest tests across 5 files (`pnpm test` in `services/apps/packages_worker`):

- `cvssScoring.test.ts` (10): FIRST reference vectors + malformed input.
- `extractSeverity.test.ts` (8): MAL-, V3, V4-only fall through, qualitative, empty.
- `parseOsvRecord.test.ts` (13): name splits, allowlist filter, range flattening, GIT-skip, multi-affected coalescing.
- `versionCompare.test.ts` (30): npm semver + Maven qualifier ranks/aliases/edge cases.
- `deriveCriticalFlag.integration.test.ts` (7), hits the local packages-db (skipped automatically without DB env): lodash boundaries, log4j-core boundaries, MAL- override, `1.0-final` regression guard, catch-up resolver.

The Maven version comparator unit tests caught a real bug: `compareMaven` was using `num:0` padding for missing tokens regardless of the other side's kind, which mis-ordered `1.0-final` vs `1.0` and `1.0` vs `1.0-sp1`. Fixed in the same PR.

Also verified out-of-band:

- [...] `(osv_id, cvss, cvss_source)` bit-identical.

Known gaps (follow-ups, not blockers)
- [...] `cvss = NULL` because they are V4-only with no qualitative tag. The fix is local to `cvssScoring.ts` + `extractSeverity.ts`.
- [...] (`./scripts/cli service osv-sync up`) hasn't been smoke-tested locally; the worker was run via `tsx` directly. Same code, same env vars, same DB.
- [...] (`withRetry` in `index.ts`) didn't fire in the live runs because no errors hit. Logic is straightforward (three retries with exponential backoff) but unexercised.

Heads-up for the reviewer
- [...] `osv-plan.md` §8 item 4, this branch was opened without a `CM-XXX` ticket. The PR title lint at `.github/workflows/pr-title-jira-key-lint.yml` may reject; if so, happy to either add a ticket or get the lint relaxed for this branch.
- [...] `main`-bound earlier in the slice; the implementation, the tests, and the ADRs are split for review clarity. Diff is ~2,000 LOC across implementation + tests + docs.

Test plan
- [...] `services/apps/packages_worker`.
- `pnpm test` in `services/apps/packages_worker` passes the 5 unit-test files (integration suite skipped without DB env, as expected).
- `./scripts/cli service osv-sync up` builds the image and reaches `OSV sync done for npm`.
- `SELECT COUNT(*) FROM advisories;` shows ≈226k.
- [...] `advisories` with `is_critical = TRUE`.

🤖 Generated with Claude Code
Note
High Risk
Security-sensitive ingestion and denormalized vulnerability flags depend on CVSS scoring, version comparators, and range-index correctness; bugs can mis-rank critical exposure across large package sets.
Overview
Adds an `osv-sync` sub-worker to `packages_worker` that periodically pulls OSV bulk zips for npm and Maven, normalizes records, upserts into packages-db, and recomputes `packages.has_critical_vulnerability` from ecosystem version comparators (semver + Maven-style), including `MAL-*` handling.

Schema: `advisories.cvss_source`; denormalized `has_critical_vulnerability` with a partial index; migration widening `advisory_affected_ranges` uniqueness to the full `(introduced, fixed, last_affected)` tuple so cross-distro ranges are not dropped.

Worker: `bin/osv-sync.ts` + `src/osv/*` (zip download/stream, inline CVSS v3.1, severity extraction, parse/upsert batches, `deriveCriticalFlag`); `getOsvConfig` and env samples; Docker compose `osv-sync.yaml`; `semver`, `unzipper`, vitest in `package.json`.

Docs: ADRs 0003–0006 (critical-flag semantics, standalone bin vs Temporal, CVSS strategy, range dedupe/index) and index update.
Reviewed by Cursor Bugbot for commit d2e92b5. Bugbot is set up for automated code reviews on this repo. Configure here.