Add gtars genomicdist backend#155
Open
sanghoonio wants to merge 11 commits into
Open
Conversation
New files: - gtars_backend.py: GtarsStatBackend wrapping gtars genomicdist CLI - compress_distributions.py: fixed-size compression (histograms, KDE, dense arrays) - ref_utils.py: reference file resolution via refgenie + seqcol fallback Changes: - Register GtarsStatBackend in factory (replaces NotImplementedError) - Only create RServiceManager when backend is "r" - Skip R bedset plots when backend is "gtars" - Add --backend CLI override for run_all and run_stats - Add DEFAULT_PRECISION constant Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reorder chrom.sizes resolution: local refgenie cache → seqcol API (~50KB) → refgenie FASTA pull (~3GB). Avoids downloading a full genome FASTA just to get chromosome sizes on cold start. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
R stores partition percentages as fractions (0.0615), not percent (6.082). Remove the * 100 multiplier so gtars output matches the existing database convention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Derives median absolute neighbor distance from the gtars CLI's raw neighbor_distances list, stores it in the data dict so it gets written to the new BedStats.median_neighbor_distance column (added in bbconf PR #113). Computed BEFORE compress_distributions replaces the flat list with a KDE. The per-file KDE is still stored in the distributions JSONB blob for single-file views; only the scalar is used for bedset aggregation (mean ± sd across files). Replaces what used to be an aggregated neighbor_distances KDE at the bedset level — per earlier discussion, per-file KDE variance is low within bedsets (assay type dominates), and a scalar median gives enough signal at collection level. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New 'gtars-py' backend that uses gtars Python bindings directly, no
subprocess. Exists alongside the existing 'gtars' CLI-subprocess
backend so we can benchmark both on real data and keep only the
better performer before merge.
Architecture:
- GtarsPyStatBackend in backends/gtars_py_backend.py
- GenomeRefs dataclass caches reference data per genome on the backend
instance (FASTA, chrom_sizes, GeneModel, PartitionList, TssIndex,
SignalMatrix). Loaded lazily on first compute() call for a given
genome, reused for all subsequent files.
- Same compute() signature as GtarsStatBackend
- Same output dict shape (compress_distributions + DB insertion unchanged)
- Same pypiper integration for status tracking (pm.timestamp markers
between Python steps, no subprocess wrapping)
- median_neighbor_distance scalar derived same way
Factory dispatch: create_backend("gtars-py") returns the new backend.
Extended bbconf's analysis.backend Literal to accept "gtars-py" (bbconf
commit on modular-backend-logic branch).
Output shape normalization: the gtars Python binding returns
calc_partitions as {partition: [names], count: [counts], total: n} but
the CLI JSON schema has {counts: [[name, count], ...], total: n}.
_normalize_partitions() converts between them so downstream code sees
the same shape regardless of backend.
Parity verified on a 126K-region hg38 ENCODE BED file:
- Scalars identical (number_of_regions, mean_region_width,
median_tss_dist, median_neighbor_distance, gc_content)
- All 14 partition fields (frequency+percentage for 7 categories) match
- widths, tss, neighbor_distances, region_distribution bins byte-match
after compress_distributions
Performance (5 ENCODE files, hg38):
- gtars (CLI): total 11.15s, median per-file 2.55s
- gtars-py: total 5.71s, median per-file 1.15s
- Speedup: ~2x batch, ~2.2x per-file median
- Biggest wins: FASTA loaded once (not per file), no subprocess startup,
no JSON round-trip to disk
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Before: GtarsStatBackend was "mixed" — called gtars CLI for most stats but used gtars Python bindings for two things inside the backend: 1. RegionSet(bedfile) to compute bed_digest fallback 2. calc_gc_content(rs, assembly, ...) for GC content This blurred the line between the "CLI" and "Python bindings" backends. After: GtarsStatBackend is a pure CLI subprocess backend. - Pass --fasta to gtars CLI (uses PR #249's --fasta flag that outputs gc_content in the JSON). Read gc_mean + per_region GC from the CLI output instead of calling gtars Python bindings. - Require bed_digest from the caller. Backends no longer parse BED files. Resolution moves up to bedstat() (the orchestration layer), which computes the digest via gtars Python bindings once when absent. - Remove imports of gtars.models.RegionSet and calculate_gc_content. - Add get_fasta_path() helper to ref_utils.py (pure refgenie, no gtars dependency at import time) so the backend can resolve FASTA paths for the --fasta flag. The three backends now have cleanly separated execution domains: - RStatBackend: R subprocess (+ Python bindings helpers for file loading and GC content — kept as-is, R's native GC calc is slow) - GtarsStatBackend: gtars CLI subprocess ONLY - GtarsPyStatBackend: gtars Python bindings ONLY (no subprocess) Parity verified: pure-CLI gtars output matches gtars-py output exactly on all 19 scalar + partition fields for a 126K-region hg38 file. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove GtarsPyStatBackend (gtars CLI with .fab is faster and simpler) - Remove BACKEND_GTARS_PY constant and all references - Add get_fab_path() to ref_utils: auto-compiles .fab from FASTA on first use via gtars prep --fasta, cached forever - GtarsStatBackend prefers .fab over plain .fa for GC content - Fix reprocess_bedset: pass backend from config to run_bedbuncher - Update CLI help text and bedbuncher docstrings Two backends remain: "r" (RStatBackend) and "gtars" (GtarsStatBackend). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 tasks
compress_region_distribution now takes chrom_sizes to compute correct per-chromosome array lengths (longest chr = n_bins, shorter chrs proportionally fewer). Without this, arrays were sized by max observed rid which varied by region coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add the gtars genomicdist CLI backend to bedboss, replacing R for bedstat analysis. Depends on PR #153 (backend abstraction).
gtars backend (
GtarsStatBackend)Invokes
gtars genomicdistas a subprocess per file. All computation happens in the Rust CLI binary — partitions, TSS distances, region distributions, signal matrix overlap, GC content, neighbor distances.Key features:
.fabbinary FASTA auto-compilation viaget_fab_path()— one-timegtars prep --fastaon first use, cached forever. Gives zero-copy mmap sequence access for GC content..gda.bin(gene model),.osm.bin(signal matrix) auto-compiled on first usecompress_distributions.py: compresses raw genomicdist arrays (histograms, KDEs) for DB storageWhat was removed:
GtarsPyStatBackend(all-Python-bindings backend) — benchmarked at 97s vs CLI's 84s. CLI is faster and architecturally simpler.GtarsMixedStatBackend(CLI + Python GC hybrid) — the.fabformat eliminated the tradeoff that justified it.BACKEND_GTARS_PYandBACKEND_GTARS_MIXEDconstantsTwo backends remain:
"r"(RStatBackend) and"gtars"(GtarsStatBackend).Performance
Benchmarked on 30 diverse BED files (3 genomes, 1 region to 1M regions):
See tournaments REPORT.md for full per-file breakdown and exploration log.
Files changed
bedstat/backends/gtars_backend.pybedstat/backends/__init__.pybedstat/compress_distributions.pybedstat/ref_utils.pybedboss.pybedbuncher/bedbuncher.pycli.pyconst.pyTest plan
🤖 Generated with Claude Code