Skip to content

Add cached cd hit est reference collapse step to pathoscope#188

Open
ReeceHoffmann wants to merge 6 commits into
virtool:mainfrom
ReeceHoffmann:reecehoffmann/vir-2534-add-cached-cd-hit-est-reference-collapse-step-to-pathoscope
Open

Add cached cd hit est reference collapse step to pathoscope#188
ReeceHoffmann wants to merge 6 commits into
virtool:mainfrom
ReeceHoffmann:reecehoffmann/vir-2534-add-cached-cd-hit-est-reference-collapse-step-to-pathoscope

Conversation

@ReeceHoffmann

@ReeceHoffmann ReeceHoffmann commented Jun 12, 2026

Copy link
Copy Markdown
Member

Add reference collapsing step to pathoscope

- run cd-hit-est per segment with bounded concurrency

- keep each cd-hit-est process single-threaded so proc controls segment-level parallelism
- allow isolates to omit schema segments during reference collapse

- permit empty segment sentinels for unsegmented OTUs
@ReeceHoffmann ReeceHoffmann marked this pull request as ready for review June 12, 2026 21:32
@ReeceHoffmann ReeceHoffmann requested a review from igboyes as a code owner June 12, 2026 21:32

@igboyes igboyes left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utils.py is turning into a pig and should be modularized.

Comment on lines +441 to +445
return {
"isolate_count_before": before_count,
"isolate_count_after": after_count,
"isolate_count_removed": before_count - after_count,
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs type.

Comment thread fixtures.py
Comment on lines +7 to +40
def get_reference_index_path(work_path: Path) -> Path:
return work_path / "reference_index" / "reference"


def get_collapsed_reference_path(work_path: Path) -> Path:
return work_path / "collapsed_reference" / "reference.json"


def get_subtraction_indexes_path(work_path: Path) -> Path:
return work_path / "subtraction_indexes"


def get_isolate_path(work_path: Path) -> Path:
return work_path / "isolates"


def get_isolate_fasta_path(isolate_path: Path) -> Path:
return isolate_path / "isolate_index.fa"


def get_isolate_fastq_path(isolate_path: Path) -> Path:
return isolate_path / "isolate_mapped.fq"


def get_isolate_index_path(isolate_path: Path) -> Path:
return isolate_path / "isolates"


def get_isolate_bam_path(isolate_path: Path) -> Path:
return isolate_path / "to_isolates.bam"


def get_subtracted_bam_path(work_path: Path) -> Path:
return work_path / "subtracted.bam"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be inlined if only used once. No need for separate function declaration.

Comment thread workflow.py
Comment on lines +96 to +97
with open(collapsed_reference_dir / "collapse-manifest.json", "w") as handle:
json.dump(stats, handle)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_thread.

Not to sure what the point of stats and this file is. Seems like debugging that should be a test?

Comment thread workflow.py

log.info("collapsing reference", outcome="miss", source=str(index.json_path))

stats = await collapse_reference_json(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this called stats? Is it not a "collapse-manifest"?

Comment thread workflow.py

log.info("reference collapse complete", **stats)

created = await cache.put(key, collapsed_reference_dir, params=params)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary variable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants