Add gtars genomicdist backend by sanghoonio · Pull Request #155 · databio/bedboss

sanghoonio · 2026-04-03T22:58:51Z

Summary

Add the gtars genomicdist CLI backend to bedboss, replacing R for bedstat analysis. Depends on PR #153 (backend abstraction).

gtars backend (`GtarsStatBackend`)

Invokes gtars genomicdist as a subprocess per file. All computation happens in the Rust CLI binary — partitions, TSS distances, region distributions, signal matrix overlap, GC content, neighbor distances.

Key features:

.fab binary FASTA auto-compilation via get_fab_path() — one-time gtars prep --fasta on first use, cached forever. Gives zero-copy mmap sequence access for GC content.
Pre-compiled reference binaries: .gda.bin (gene model), .osm.bin (signal matrix) auto-compiled on first use
compress_distributions.py: compresses raw genomicdist arrays (histograms, KDEs) for DB storage
Reference resolution via refgenie with seqcol API fallback for chrom.sizes

What was removed:

GtarsPyStatBackend (all-Python-bindings backend) — benchmarked at 97s vs CLI's 84s. CLI is faster and architecturally simpler.
GtarsMixedStatBackend (CLI + Python GC hybrid) — the .fab format eliminated the tradeoff that justified it.
BACKEND_GTARS_PY and BACKEND_GTARS_MIXED constants

Two backends remain: "r" (RStatBackend) and "gtars" (GtarsStatBackend).

Performance

Benchmarked on 30 diverse BED files (3 genomes, 1 region to 1M regions):

Backend	Total (30 files)	Speedup
gtars CLI (.fab)	90.9s (mean, 3 trials)	6.9x faster
R (persistent service)	631.0s	baseline

See tournaments REPORT.md for full per-file breakdown and exploration log.

Files changed

File	Change
`bedstat/backends/gtars_backend.py`	New — pure CLI subprocess backend
`bedstat/backends/__init__.py`	Register gtars, remove gtars-py/mixed
`bedstat/compress_distributions.py`	New — histogram/KDE compression for DB
`bedstat/ref_utils.py`	New — refgenie reference resolution + .fab auto-compile
`bedboss.py`	Pass backend to reprocess_bedset → run_bedbuncher
`bedbuncher/bedbuncher.py`	Skip R plots when backend is gtars
`cli.py`	Update help text for 2 backends
`const.py`	Remove BACKEND_GTARS_PY

Test plan

Black formatted
Backend factory routes correctly for "r" and "gtars"
3 benchmark trials with shuffled file order — consistent results
Scalar output parity verified across trials
CI

🤖 Generated with Claude Code

New files: - gtars_backend.py: GtarsStatBackend wrapping gtars genomicdist CLI - compress_distributions.py: fixed-size compression (histograms, KDE, dense arrays) - ref_utils.py: reference file resolution via refgenie + seqcol fallback Changes: - Register GtarsStatBackend in factory (replaces NotImplementedError) - Only create RServiceManager when backend is "r" - Skip R bedset plots when backend is "gtars" - Add --backend CLI override for run_all and run_stats - Add DEFAULT_PRECISION constant Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Reorder chrom.sizes resolution: local refgenie cache → seqcol API (~50KB) → refgenie FASTA pull (~3GB). Avoids downloading a full genome FASTA just to get chromosome sizes on cold start. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

R stores partition percentages as fractions (0.0615), not percent (6.082). Remove the * 100 multiplier so gtars output matches the existing database convention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Derives median absolute neighbor distance from the gtars CLI's raw neighbor_distances list, stores it in the data dict so it gets written to the new BedStats.median_neighbor_distance column (added in bbconf PR #113). Computed BEFORE compress_distributions replaces the flat list with a KDE. The per-file KDE is still stored in the distributions JSONB blob for single-file views; only the scalar is used for bedset aggregation (mean ± sd across files). Replaces what used to be an aggregated neighbor_distances KDE at the bedset level — per earlier discussion, per-file KDE variance is low within bedsets (assay type dominates), and a scalar median gives enough signal at collection level. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New 'gtars-py' backend that uses gtars Python bindings directly, no subprocess. Exists alongside the existing 'gtars' CLI-subprocess backend so we can benchmark both on real data and keep only the better performer before merge. Architecture: - GtarsPyStatBackend in backends/gtars_py_backend.py - GenomeRefs dataclass caches reference data per genome on the backend instance (FASTA, chrom_sizes, GeneModel, PartitionList, TssIndex, SignalMatrix). Loaded lazily on first compute() call for a given genome, reused for all subsequent files. - Same compute() signature as GtarsStatBackend - Same output dict shape (compress_distributions + DB insertion unchanged) - Same pypiper integration for status tracking (pm.timestamp markers between Python steps, no subprocess wrapping) - median_neighbor_distance scalar derived same way Factory dispatch: create_backend("gtars-py") returns the new backend. Extended bbconf's analysis.backend Literal to accept "gtars-py" (bbconf commit on modular-backend-logic branch). Output shape normalization: the gtars Python binding returns calc_partitions as {partition: [names], count: [counts], total: n} but the CLI JSON schema has {counts: [[name, count], ...], total: n}. _normalize_partitions() converts between them so downstream code sees the same shape regardless of backend. Parity verified on a 126K-region hg38 ENCODE BED file: - Scalars identical (number_of_regions, mean_region_width, median_tss_dist, median_neighbor_distance, gc_content) - All 14 partition fields (frequency+percentage for 7 categories) match - widths, tss, neighbor_distances, region_distribution bins byte-match after compress_distributions Performance (5 ENCODE files, hg38): - gtars (CLI): total 11.15s, median per-file 2.55s - gtars-py: total 5.71s, median per-file 1.15s - Speedup: ~2x batch, ~2.2x per-file median - Biggest wins: FASTA loaded once (not per file), no subprocess startup, no JSON round-trip to disk Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Before: GtarsStatBackend was "mixed" — called gtars CLI for most stats but used gtars Python bindings for two things inside the backend: 1. RegionSet(bedfile) to compute bed_digest fallback 2. calc_gc_content(rs, assembly, ...) for GC content This blurred the line between the "CLI" and "Python bindings" backends. After: GtarsStatBackend is a pure CLI subprocess backend. - Pass --fasta to gtars CLI (uses PR #249's --fasta flag that outputs gc_content in the JSON). Read gc_mean + per_region GC from the CLI output instead of calling gtars Python bindings. - Require bed_digest from the caller. Backends no longer parse BED files. Resolution moves up to bedstat() (the orchestration layer), which computes the digest via gtars Python bindings once when absent. - Remove imports of gtars.models.RegionSet and calculate_gc_content. - Add get_fasta_path() helper to ref_utils.py (pure refgenie, no gtars dependency at import time) so the backend can resolve FASTA paths for the --fasta flag. The three backends now have cleanly separated execution domains: - RStatBackend: R subprocess (+ Python bindings helpers for file loading and GC content — kept as-is, R's native GC calc is slow) - GtarsStatBackend: gtars CLI subprocess ONLY - GtarsPyStatBackend: gtars Python bindings ONLY (no subprocess) Parity verified: pure-CLI gtars output matches gtars-py output exactly on all 19 scalar + partition fields for a 126K-region hg38 file. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Remove GtarsPyStatBackend (gtars CLI with .fab is faster and simpler) - Remove BACKEND_GTARS_PY constant and all references - Add get_fab_path() to ref_utils: auto-compiles .fab from FASTA on first use via gtars prep --fasta, cached forever - GtarsStatBackend prefers .fab over plain .fa for GC content - Fix reprocess_bedset: pass backend from config to run_bedbuncher - Update CLI help text and bedbuncher docstrings Two backends remain: "r" (RStatBackend) and "gtars" (GtarsStatBackend). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

compress_region_distribution now takes chrom_sizes to compute correct per-chromosome array lengths (longest chr = n_bins, shorter chrs proportionally fewer). Without this, arrays were sized by max observed rid which varied by region coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sanghoonio and others added 10 commits April 3, 2026 16:03

Apply black formatting to gtars_backend.py

a1ef7e4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge modular-backend-r: StatBackend instance through orchestration

9d8d9e9

Apply black formatting to backends/__init__.py and ref_utils.py

79ecfba

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sanghoonio mentioned this pull request Apr 6, 2026

Extract R analysis into modular backend abstraction #153

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add gtars genomicdist backend#155

Add gtars genomicdist backend#155
sanghoonio wants to merge 11 commits into
modular-backend-rfrom
modular-backend-gtars

sanghoonio commented Apr 3, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

sanghoonio commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

gtars backend (GtarsStatBackend)

Performance

Files changed

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sanghoonio commented Apr 3, 2026 •

edited

Loading

gtars backend (`GtarsStatBackend`)