Skip to content

Add gtars genomicdist backend#155

Open
sanghoonio wants to merge 11 commits into
modular-backend-rfrom
modular-backend-gtars
Open

Add gtars genomicdist backend#155
sanghoonio wants to merge 11 commits into
modular-backend-rfrom
modular-backend-gtars

Conversation

@sanghoonio

@sanghoonio sanghoonio commented Apr 3, 2026

Copy link
Copy Markdown
Member

Summary

Add the gtars genomicdist CLI backend to bedboss, replacing R for bedstat analysis. Depends on PR #153 (backend abstraction).

gtars backend (GtarsStatBackend)

Invokes gtars genomicdist as a subprocess per file. All computation happens in the Rust CLI binary — partitions, TSS distances, region distributions, signal matrix overlap, GC content, neighbor distances.

Key features:

  • .fab binary FASTA auto-compilation via get_fab_path() — one-time gtars prep --fasta on first use, cached forever. Gives zero-copy mmap sequence access for GC content.
  • Pre-compiled reference binaries: .gda.bin (gene model), .osm.bin (signal matrix) auto-compiled on first use
  • compress_distributions.py: compresses raw genomicdist arrays (histograms, KDEs) for DB storage
  • Reference resolution via refgenie with seqcol API fallback for chrom.sizes

What was removed:

  • GtarsPyStatBackend (all-Python-bindings backend) — benchmarked at 97s vs CLI's 84s. CLI is faster and architecturally simpler.
  • GtarsMixedStatBackend (CLI + Python GC hybrid) — the .fab format eliminated the tradeoff that justified it.
  • BACKEND_GTARS_PY and BACKEND_GTARS_MIXED constants

Two backends remain: "r" (RStatBackend) and "gtars" (GtarsStatBackend).

Performance

Benchmarked on 30 diverse BED files (3 genomes, 1 region to 1M regions):

Backend Total (30 files) Speedup
gtars CLI (.fab) 90.9s (mean, 3 trials) 6.9x faster
R (persistent service) 631.0s baseline

See tournaments REPORT.md for full per-file breakdown and exploration log.

Files changed

File Change
bedstat/backends/gtars_backend.py New — pure CLI subprocess backend
bedstat/backends/__init__.py Register gtars, remove gtars-py/mixed
bedstat/compress_distributions.py New — histogram/KDE compression for DB
bedstat/ref_utils.py New — refgenie reference resolution + .fab auto-compile
bedboss.py Pass backend to reprocess_bedset → run_bedbuncher
bedbuncher/bedbuncher.py Skip R plots when backend is gtars
cli.py Update help text for 2 backends
const.py Remove BACKEND_GTARS_PY

Test plan

  • Black formatted
  • Backend factory routes correctly for "r" and "gtars"
  • 3 benchmark trials with shuffled file order — consistent results
  • Scalar output parity verified across trials
  • CI

🤖 Generated with Claude Code

sanghoonio and others added 10 commits April 3, 2026 16:03
New files:
- gtars_backend.py: GtarsStatBackend wrapping gtars genomicdist CLI
- compress_distributions.py: fixed-size compression (histograms, KDE, dense arrays)
- ref_utils.py: reference file resolution via refgenie + seqcol fallback

Changes:
- Register GtarsStatBackend in factory (replaces NotImplementedError)
- Only create RServiceManager when backend is "r"
- Skip R bedset plots when backend is "gtars"
- Add --backend CLI override for run_all and run_stats
- Add DEFAULT_PRECISION constant

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reorder chrom.sizes resolution: local refgenie cache → seqcol API
(~50KB) → refgenie FASTA pull (~3GB). Avoids downloading a full
genome FASTA just to get chromosome sizes on cold start.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
R stores partition percentages as fractions (0.0615), not percent
(6.082). Remove the * 100 multiplier so gtars output matches the
existing database convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Derives median absolute neighbor distance from the gtars CLI's raw
neighbor_distances list, stores it in the data dict so it gets written
to the new BedStats.median_neighbor_distance column (added in bbconf
PR #113).

Computed BEFORE compress_distributions replaces the flat list with a
KDE. The per-file KDE is still stored in the distributions JSONB blob
for single-file views; only the scalar is used for bedset aggregation
(mean ± sd across files).

Replaces what used to be an aggregated neighbor_distances KDE at the
bedset level — per earlier discussion, per-file KDE variance is low
within bedsets (assay type dominates), and a scalar median gives
enough signal at collection level.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New 'gtars-py' backend that uses gtars Python bindings directly, no
subprocess. Exists alongside the existing 'gtars' CLI-subprocess
backend so we can benchmark both on real data and keep only the
better performer before merge.

Architecture:
- GtarsPyStatBackend in backends/gtars_py_backend.py
- GenomeRefs dataclass caches reference data per genome on the backend
  instance (FASTA, chrom_sizes, GeneModel, PartitionList, TssIndex,
  SignalMatrix). Loaded lazily on first compute() call for a given
  genome, reused for all subsequent files.
- Same compute() signature as GtarsStatBackend
- Same output dict shape (compress_distributions + DB insertion unchanged)
- Same pypiper integration for status tracking (pm.timestamp markers
  between Python steps, no subprocess wrapping)
- median_neighbor_distance scalar derived same way

Factory dispatch: create_backend("gtars-py") returns the new backend.
Extended bbconf's analysis.backend Literal to accept "gtars-py" (bbconf
commit on modular-backend-logic branch).

Output shape normalization: the gtars Python binding returns
calc_partitions as {partition: [names], count: [counts], total: n} but
the CLI JSON schema has {counts: [[name, count], ...], total: n}.
_normalize_partitions() converts between them so downstream code sees
the same shape regardless of backend.

Parity verified on a 126K-region hg38 ENCODE BED file:
- Scalars identical (number_of_regions, mean_region_width,
  median_tss_dist, median_neighbor_distance, gc_content)
- All 14 partition fields (frequency+percentage for 7 categories) match
- widths, tss, neighbor_distances, region_distribution bins byte-match
  after compress_distributions

Performance (5 ENCODE files, hg38):
- gtars (CLI):    total 11.15s, median per-file 2.55s
- gtars-py:       total 5.71s,  median per-file 1.15s
- Speedup:        ~2x batch, ~2.2x per-file median
- Biggest wins: FASTA loaded once (not per file), no subprocess startup,
  no JSON round-trip to disk

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Before: GtarsStatBackend was "mixed" — called gtars CLI for most stats
but used gtars Python bindings for two things inside the backend:
1. RegionSet(bedfile) to compute bed_digest fallback
2. calc_gc_content(rs, assembly, ...) for GC content

This blurred the line between the "CLI" and "Python bindings" backends.

After: GtarsStatBackend is a pure CLI subprocess backend.
- Pass --fasta to gtars CLI (uses PR #249's --fasta flag that outputs
  gc_content in the JSON). Read gc_mean + per_region GC from the CLI
  output instead of calling gtars Python bindings.
- Require bed_digest from the caller. Backends no longer parse BED files.
  Resolution moves up to bedstat() (the orchestration layer), which
  computes the digest via gtars Python bindings once when absent.
- Remove imports of gtars.models.RegionSet and calculate_gc_content.
- Add get_fasta_path() helper to ref_utils.py (pure refgenie, no gtars
  dependency at import time) so the backend can resolve FASTA paths for
  the --fasta flag.

The three backends now have cleanly separated execution domains:
- RStatBackend: R subprocess (+ Python bindings helpers for file loading
  and GC content — kept as-is, R's native GC calc is slow)
- GtarsStatBackend: gtars CLI subprocess ONLY
- GtarsPyStatBackend: gtars Python bindings ONLY (no subprocess)

Parity verified: pure-CLI gtars output matches gtars-py output exactly
on all 19 scalar + partition fields for a 126K-region hg38 file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove GtarsPyStatBackend (gtars CLI with .fab is faster and simpler)
- Remove BACKEND_GTARS_PY constant and all references
- Add get_fab_path() to ref_utils: auto-compiles .fab from FASTA on
  first use via gtars prep --fasta, cached forever
- GtarsStatBackend prefers .fab over plain .fa for GC content
- Fix reprocess_bedset: pass backend from config to run_bedbuncher
- Update CLI help text and bedbuncher docstrings

Two backends remain: "r" (RStatBackend) and "gtars" (GtarsStatBackend).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
compress_region_distribution now takes chrom_sizes to compute
correct per-chromosome array lengths (longest chr = n_bins,
shorter chrs proportionally fewer). Without this, arrays were
sized by max observed rid which varied by region coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant