Skip to content

CPTAC download implementation#110

Merged
ypriverol merged 7 commits into
masterfrom
dev
Jun 8, 2026
Merged

CPTAC download implementation#110
ypriverol merged 7 commits into
masterfrom
dev

Conversation

@ypriverol

@ypriverol ypriverol commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

Release Notes

  • New Features

    • Added parallel HTTP range-based downloading for faster single-file transfers via new --threads parameter (1–32) across download commands
    • Introduced download-pdc-files command for downloading PDC/CPTAC cancer data files with signed HTTPS URLs
    • Added progress reporting during FTP directory enumeration
  • Documentation

    • Updated README with download-pdc-files command documentation and download example
  • Tests

    • Added comprehensive test coverage for HTTP parallel downloads and PDC file download workflows

Shen-YuFei and others added 7 commits May 20, 2026 20:21
Large MassIVE timsTOF deposits are spread over thousands of .d
directories, each requiring its own PASV+TLS data connection to list.
_walk_ftp_tree enumerates the entire tree before any download and was
silent throughout, so a multi-minute enumeration looked like a hang
(e.g. ~1719 .d in MSV000098940 ≈ 62 min of no output -> users Ctrl-C).

Add an INFO progress heartbeat every 100 directories plus a final
summary. Pure instrumentation: the returned file list and recursion
are unchanged (covered by a mock-tree unit test); existing tests pass.
fix(massive/ftp): emit progress while walking the FTPS tree
Add multi-segment download feature and restore threads option
@coderabbitai

coderabbitai Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds per-file download thread concurrency (download_threads parameter) throughout the download stack for HTTP byte-range parallelism, introduces HTTP multipart range downloads and FTP progress heartbeats, and adds a complete PDC/CPTAC file download system with GraphQL client, file validation, retries, and CLI command.

Changes

Per-file threading infrastructure

Layer / File(s) Summary
Provider threading parameter
pridepy/download/base.py
download_all_raw, download_category, download_by_filenames, and download_files accept and forward download_threads: int = 1.
URL-based dispatch threading
pridepy/download/by_url.py
_dispatch_url_scheme, _download_single_url, and download_files_by_url accept download_threads and route HTTP URLs to multipart or single-stream; refactored to always use ThreadPoolExecutor.
Client facade threading
pridepy/download/client.py
download_http_urls, download_all_raw_files, download_all_category_files, download_files_by_list, and download_files_by_url accept and forward download_threads to underlying calls.
PRIDE provider threading
pridepy/download/pride.py
_globus_download_one branches on download_threads to use multipart or single-stream; threading flows through download_files_from_globus, _batch_download_by_protocol, _download_with_fallback, download_files, and _download_files_batch.
HTTP multipart byte-range downloads
pridepy/download/transport.py
New helpers (_download_range, _read_http_download_metadata, _prepare_multipart_target, _build_download_ranges, _download_multipart_ranges, _multipart_download) implement parallel byte-range fetching with per-range resume and fallback; integrated into _http_download_one and download_http_urls.
FTP tree enumeration progress
pridepy/download/transport.py
Enhanced _walk_ftp_tree tracks and logs directory/file scan counts periodically during mlsd and fallback LIST traversal.
CLI download commands with --threads
pridepy/pridepy.py, pridepy/tests/test_cli_flatten.py, pridepy/tests/test_download_by_url.py
download-all-public-raw-files, download-all-public-category-files, download-files-by-list, and download-files-by-url CLI commands replace "parallel files" with --threads (1–32 constrained); tests verify threading parameter wiring and multipart behavior.

PDC file download system

Layer / File(s) Summary
PDC GraphQL client and parsing
pridepy/pdc/client.py
New module with dataclasses (PDCFile, PDCDownloadRequest, PDCFileTypeFilter); accession parsing from inline text or CSV; file-type normalization and filtering; GraphQL query execution; and signed-URL refresh under process-wide lock.
PDC file validation and orchestration
pridepy/pdc/downloader.py
File-existence and checksum validation; atomic .part-file downloads; retry with exponential backoff and 403 signed-URL refresh; study-level fetch and per-file download loops; public entry point that parses requests, creates output directory, tracks statistics, writes failure manifest on errors, and raises if any downloads fail.
PDC package exports
pridepy/pdc/__init__.py
Re-exports public PDC symbols for API.
PDC CLI command
pridepy/pridepy.py
New download-pdc-files command with accession, file-type (optional), output folder, checksum (default enabled), threads (1–32), and retry options; routes to download_pdc_files with exception-to-Click error mapping and logs completion summary.
PDC tests
pridepy/tests/test_pdc_client.py, pridepy/tests/test_pdc_downloader.py, pridepy/tests/test_pdc_cli.py
Accession parsing and CSV handling; file-type filtering; GraphQL mapping; URL refresh; file skipping with checksums; multi-study CSV downloads; multipart threading; failure handling with retry; and CLI argument mapping.
README documentation
README.md
Added download-pdc-files command entry and quick example.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PRIDE-Archive/pridepy#61: Both PRs touch pridepy.py's PRIDE download CLI—retrieved PR adds download-all-public-raw-files/download-all-public-category-files, while the main PR modifies those same commands' option wiring to use download_threads instead of parallel_files.
  • PRIDE-Archive/pridepy#93: Both PRs change HTTP/byte-range download concurrency implementation by wiring thread parameters into range-download logic and implementing multipart-style parallel range behavior, aligning the main PR's download_threads work with earlier parallelism changes.

Suggested labels

enhancement, feature, downloads, threading, http, pdc

Suggested reviewers

  • chakrabandla
  • selvaebi

Poem

🐰 A thread for each byte-range in sight,
PDC files downloaded swift and tight,
FTP trees now whisper their progress song,
While checksums verify nothing goes wrong! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.38% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'CPTAC download implementation' is vague and does not accurately reflect the comprehensive scope of changes, which include PDC/CPTAC downloads, multipart HTTP downloading, and CLI enhancements across multiple modules. Consider a more specific title such as 'Add PDC/CPTAC download support with multipart HTTP and threading' to better convey the main changes in the pull request.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review

Copy link
Copy Markdown
Contributor

Review Summary by Qodo

Add multi-segment HTTP downloads, PDC support, and FTP progress reporting

✨ Enhancement 🧪 Tests

Grey Divider

Walkthroughs

Description
• Add multi-segment HTTP Range download support via download_threads parameter
  - Replaces -w/--parallel-files with -t/--threads for per-file parallelism
  - Implements _multipart_download() for concurrent Range requests
• Add PDC/CPTAC download support with GraphQL metadata and signed URLs
  - New pridepy/pdc/ module with client and downloader implementations
  - Supports mzid, psm, raw, mzml file types with CSV accession lists
• Improve FTP tree enumeration progress reporting
  - Emit INFO heartbeats every 100 directories during FTPS tree walk
  - Prevents multi-minute hangs on large deposits like MassIVE timsTOF
• Refactor URL download logic to always use ThreadPoolExecutor
  - Simplifies serial/parallel branching; always uses workers pool
Diagram
flowchart LR
  A["CLI: -t/--threads"] --> B["download_threads param"]
  B --> C["HTTP URLs"]
  B --> D["PRIDE/MassIVE"]
  B --> E["PDC/CPTAC"]
  C --> F["_multipart_download"]
  D --> G["transport layer"]
  E --> H["PDC GraphQL API"]
  F --> I["HTTP Range requests"]
  H --> J["Signed URLs"]
  G --> K["FTP tree walk progress"]

Loading

Grey Divider

File Changes

1. pridepy/download/base.py ✨ Enhancement +8/-0

Add download_threads parameter to all download methods

pridepy/download/base.py


2. pridepy/download/by_url.py ✨ Enhancement +43/-30

Implement multipart download dispatch and refactor URL logic

pridepy/download/by_url.py


3. pridepy/download/client.py ✨ Enhancement +10/-0

Thread download_threads parameter through transport shim

pridepy/download/client.py


View more (12)
4. pridepy/download/pride.py ✨ Enhancement +35/-5

Support multipart downloads in PRIDE globus and batch handlers

pridepy/download/pride.py


5. pridepy/download/transport.py ✨ Enhancement +199/-4

Implement _multipart_download with Range requests and FTP progress

pridepy/download/transport.py


6. pridepy/pdc/__init__.py ✨ Enhancement +20/-0

New PDC module exports for client and downloader

pridepy/pdc/init.py


7. pridepy/pdc/client.py ✨ Enhancement +281/-0

PDC GraphQL client with file filtering and accession parsing

pridepy/pdc/client.py


8. pridepy/pdc/downloader.py ✨ Enhancement +386/-0

PDC download orchestrator with retry and checksum validation

pridepy/pdc/downloader.py


9. pridepy/pridepy.py ✨ Enhancement +114/-26

Replace -w/--parallel-files with -t/--threads and add PDC command

pridepy/pridepy.py


10. pridepy/tests/test_cli_flatten.py 🧪 Tests +40/-0

Add tests for download_threads parameter in CLI

pridepy/tests/test_cli_flatten.py


11. pridepy/tests/test_download_by_url.py 🧪 Tests +50/-1

Add multipart download and failure cleanup tests

pridepy/tests/test_download_by_url.py


12. pridepy/tests/test_pdc_cli.py 🧪 Tests +90/-0

New PDC CLI command tests

pridepy/tests/test_pdc_cli.py


13. pridepy/tests/test_pdc_client.py 🧪 Tests +177/-0

New PDC client and accession parsing tests

pridepy/tests/test_pdc_client.py


14. pridepy/tests/test_pdc_downloader.py 🧪 Tests +260/-0

New PDC downloader integration and retry tests

pridepy/tests/test_pdc_downloader.py


15. README.md 📝 Documentation +4/-0

Document new download-pdc-files command

README.md


Grey Divider

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Qodo Logo

@qodo-code-review

qodo-code-review Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0)

Grey Divider


Action required

1. Multipart missing completion check 🐞 Bug ≡ Correctness
Description
transport._download_range returns success without verifying it wrote the full requested byte
range; because _multipart_download preallocates the file to the expected size, a truncated segment
can leave zero-filled holes while the final file size still looks correct. This can silently corrupt
downloads when servers close connections mid-stream without raising.
Code

pridepy/download/transport.py[R628-656]

+def _download_range(url, file_path, start, end, pbar, max_retries=3):
+    """Download a byte range directly into the target file using seek.
+
+    On transient failures (e.g. ``IncompleteRead``) the segment resumes
+    from the last successfully written byte using a fresh Range request
+    for the remaining bytes, instead of restarting the whole segment.
+    ``pbar`` only accumulates bytes that were actually written, so the
+    progress bar stays in sync with on-disk state across retries.
+    """
+    cursor = start
+    for attempt in range(1, max_retries + 1):
+        try:
+            session = Util.create_session_with_retries()
+            headers = {"Range": f"bytes={cursor}-{end}"}
+            with session.get(url, headers=headers, stream=True, timeout=(15, 15)) as r:
+                r.raise_for_status()
+                if r.status_code != 206:
+                    raise RuntimeError(f"Server did not honor Range request: {r.status_code}")
+                content_range = r.headers.get("Content-Range", "")
+                if not content_range.lower().startswith(f"bytes {cursor}-{end}/"):
+                    raise RuntimeError(f"Unexpected Content-Range header: {content_range}")
+                with open(file_path, "r+b") as f:
+                    f.seek(cursor)
+                    for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
+                        if chunk:
+                            f.write(chunk)
+                            cursor += len(chunk)
+                            pbar.update(len(chunk))
+            return
Evidence
The existing single-connection downloader explicitly guards against silent mid-stream truncation by
checking the final on-disk size against the server’s Content-Length, but _download_range has no
equivalent end-of-range validation and simply returns after the loop.

pridepy/download/transport.py[568-625]
pridepy/download/transport.py[628-656]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`_download_range()` streams bytes for a Range request but never asserts that the downloaded segment reached `end` (i.e., `cursor == end + 1`) before returning. With `_prepare_multipart_target()` preallocating the file, incomplete segments can leave zero-filled gaps that are not detectable via file size.

## Issue Context
The single-stream downloader `_parallel_download()` includes a post-transfer integrity check to catch silent truncation; multipart currently lacks an equivalent check.

## Fix Focus Areas
- pridepy/download/transport.py[628-656]
- pridepy/download/transport.py[613-625]

## Implementation notes
- After the `iter_content` loop, add a completeness assertion like:
 - `expected = end + 1`
 - if `cursor != expected`: raise `RuntimeError` describing the shortfall.
- Consider also validating `Content-Length` for the range response when present, but the cursor check is the critical guard.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Range split can create invalid segments 🐞 Bug ☼ Reliability
Description
_build_download_ranges computes part_size = total_size // threads without guarding against
threads > total_size, which can create zero-length parts and invalid ranges (e.g., end < start).
While default min_size_bytes (10MB) makes this unlikely in typical use, it can occur when callers
lower min_size_bytes or otherwise invoke multipart on small objects.
Code

pridepy/download/transport.py[R695-702]

+def _build_download_ranges(total_size, threads):
+    part_size = total_size // threads
+    ranges = []
+    for index in range(threads):
+        start = index * part_size
+        end = total_size - 1 if index == threads - 1 else (start + part_size - 1)
+        ranges.append((start, end))
+    return ranges
Evidence
_build_download_ranges uses integer division to derive per-part size and directly computes end
as start + part_size - 1, which becomes invalid when part_size is 0. Multipart calls this
function after clamping threads but without relating threads to total_size.

pridepy/download/transport.py[695-702]
pridepy/download/transport.py[728-761]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`_build_download_ranges(total_size, threads)` can produce invalid ranges when `threads` exceeds `total_size` (in bytes), because `part_size` becomes 0 and the computed `end` becomes `start - 1` for non-final parts.

## Issue Context
This is mostly mitigated in normal operation by `_can_use_multipart()` requiring `total_size >= min_size_bytes` (default 10MB) and threads being clamped to <=32, but the function is still unsafe for edge-case inputs (e.g., test/advanced callers setting `min_size_bytes=1`).

## Fix Focus Areas
- pridepy/download/transport.py[695-702]
- pridepy/download/transport.py[728-761]

## Implementation notes
- Defensively clamp the effective thread count before range building, e.g. `threads = min(threads, total_size)` (or `min(threads, max(1, total_size // MIN_PART_SIZE))`).
- Alternatively, ensure `part_size = max(1, total_size // threads)` and adjust the number of ranges accordingly.
- Add a sanity check that every produced range satisfies `0 <= start <= end < total_size`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Zero workers crash 🐞 Bug ☼ Reliability
Description
download_files_by_url now always constructs ThreadPoolExecutor(max_workers=workers) where
workers = min(parallel_files, 3, len(urls)) is not clamped to at least 1, so programmatic callers
passing parallel_files <= 0 will raise at runtime. This regression was introduced by removing the
prior serial-path branching and differs from transport.download_http_urls, which clamps workers
with max(1, ...).
Code

pridepy/download/by_url.py[R194-203]

+    workers = min(parallel_files, 3, len(urls))
    failures: List[Tuple[str, str]] = []

-    if parallel_files < 2:
-        for url in urls:
+    if workers > 1:
+        logging.info(
+            "Downloading %d URL(s) with %d parallel workers",
+            len(urls), workers,
+        )
+    with ThreadPoolExecutor(max_workers=workers) as executor:
+        futures = {
Evidence
download_files_by_url uses workers directly as the executor size without a lower bound, while
transport.download_http_urls explicitly wraps the min(...) with max(1, ...) before creating
its executor.

pridepy/download/by_url.py[189-214]
pridepy/download/transport.py[846-853]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`download_files_by_url()` computes `workers = min(parallel_files, 3, len(urls))` and unconditionally uses it in `ThreadPoolExecutor(max_workers=workers)`. If a programmatic caller passes `parallel_files=0` (or negative), this will crash.

## Issue Context
CLI users are protected (Click IntRange), but `download_files_by_url` is a Python API. Other transports already clamp the worker count to at least 1.

## Fix Focus Areas
- pridepy/download/by_url.py[194-203]
- pridepy/download/transport.py[846-852]

## Implementation notes
- Change to `workers = max(1, min(parallel_files, 3, len(urls)))`.
- Optionally also validate `parallel_files` and raise `ValueError` when it’s < 1, but clamping is consistent with existing transport behavior.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Qodo Logo

Comment on lines +628 to +656
def _download_range(url, file_path, start, end, pbar, max_retries=3):
"""Download a byte range directly into the target file using seek.

On transient failures (e.g. ``IncompleteRead``) the segment resumes
from the last successfully written byte using a fresh Range request
for the remaining bytes, instead of restarting the whole segment.
``pbar`` only accumulates bytes that were actually written, so the
progress bar stays in sync with on-disk state across retries.
"""
cursor = start
for attempt in range(1, max_retries + 1):
try:
session = Util.create_session_with_retries()
headers = {"Range": f"bytes={cursor}-{end}"}
with session.get(url, headers=headers, stream=True, timeout=(15, 15)) as r:
r.raise_for_status()
if r.status_code != 206:
raise RuntimeError(f"Server did not honor Range request: {r.status_code}")
content_range = r.headers.get("Content-Range", "")
if not content_range.lower().startswith(f"bytes {cursor}-{end}/"):
raise RuntimeError(f"Unexpected Content-Range header: {content_range}")
with open(file_path, "r+b") as f:
f.seek(cursor)
for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
if chunk:
f.write(chunk)
cursor += len(chunk)
pbar.update(len(chunk))
return

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Multipart missing completion check 🐞 Bug ≡ Correctness

transport._download_range returns success without verifying it wrote the full requested byte
range; because _multipart_download preallocates the file to the expected size, a truncated segment
can leave zero-filled holes while the final file size still looks correct. This can silently corrupt
downloads when servers close connections mid-stream without raising.
Agent Prompt
## Issue description
`_download_range()` streams bytes for a Range request but never asserts that the downloaded segment reached `end` (i.e., `cursor == end + 1`) before returning. With `_prepare_multipart_target()` preallocating the file, incomplete segments can leave zero-filled gaps that are not detectable via file size.

## Issue Context
The single-stream downloader `_parallel_download()` includes a post-transfer integrity check to catch silent truncation; multipart currently lacks an equivalent check.

## Fix Focus Areas
- pridepy/download/transport.py[628-656]
- pridepy/download/transport.py[613-625]

## Implementation notes
- After the `iter_content` loop, add a completeness assertion like:
  - `expected = end + 1`
  - if `cursor != expected`: raise `RuntimeError` describing the shortfall.
- Consider also validating `Content-Length` for the range response when present, but the cursor check is the critical guard.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pridepy/pridepy.py (1)

678-680: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update protocol help text to reflect threaded multipart behavior.

Line 679 says globus uses a single-connection stream, but this command now accepts --threads and can do multipart range downloads when threads > 1. The help text is now inaccurate.

Suggested diff
-    help="Download strategy. ftp (default): single-connection per URL scheme. "
-         "globus: resume-capable http/https downloads (single-connection stream).",
+    help="Download strategy. ftp (default): single-connection per URL scheme. "
+         "globus: resume-capable http/https downloads (supports multipart ranges with --threads).",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/pridepy.py` around lines 678 - 680, The help string for the download
strategy currently claims "globus: resume-capable http/https downloads
(single-connection stream)" but is outdated because the command supports
multipart range downloads when --threads > 1; update the help text in the
argparse add_argument call that sets the download strategy (the help parameter
that mentions "ftp (default)" and "globus") to indicate that globus is
resume-capable for http/https and supports threaded multipart/range downloads
when --threads is greater than 1, referring to the --threads option by name so
users know multipart behavior is enabled by increasing threads.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pridepy/download/by_url.py`:
- Around line 194-202: The current computation of workers can produce 0 when
parallel_files is 0, causing ThreadPoolExecutor(max_workers=workers) to raise;
change the clamp to ensure at least one worker by computing workers = max(1,
min(parallel_files, 3, len(urls))) (matching download_http_urls()), so the
serial fallback remains when parallel_files <= 0 before creating the
ThreadPoolExecutor and keeping the existing log/conditional behavior that checks
workers > 1.

In `@pridepy/download/transport.py`:
- Around line 685-692: The helper _prepare_multipart_target currently treats a
zero-filled preallocated file as a finished download; change it to not consider
size alone as completion by using a separate completion marker or atomic rename
strategy: create the preallocated temporary file (e.g., file_path + ".part") in
_prepare_multipart_target and only treat the download as complete when a final
marker (e.g., file_path + ".complete") is written or the temp file is atomically
renamed to file_path after _http_download_one finishes and validates ranges;
update _http_download_one to look for and write the same marker/rename on
successful full multipart assembly and to ignore placeholder files that are
merely preallocated.
- Around line 639-656: The code using Util.create_session_with_retries() and
session.get(...) must verify the requested byte range was fully written before
returning: after the iter_content loop in this block (the with session.get(...)
as r: scope that checks r.status_code and Content-Range), assert that cursor ==
end + 1 (or raise a RuntimeError) — if not, treat it as a truncated 206 response
and handle accordingly (raise so caller can retry or fall back to
_parallel_download). Make this check immediately before the `return` and include
the unique symbols `session.get`, `r.iter_content`, `cursor`, `end`, and
`pbar.update` to locate where to add the verification.

In `@pridepy/pdc/client.py`:
- Around line 218-222: The helper _safe_int currently masks malformed file_size
by returning 0 which makes validate_pdc_file treat invalid sizes as “unknown”
and weakens validation; change _safe_int (or add a new strict parser used by
validate_pdc_file) to not coerce invalid values to 0 — instead return None or
raise a ValueError on parse failure so validate_pdc_file can explicitly reject
or skip entries with malformed sizes; update validate_pdc_file to check for
None/exception and fail or skip the entry (rather than treating it as
unknown/<=0), and apply the same stricter parsing logic to other usages in the
same region (lines around the current _safe_int and the blocks referenced at
225-243) so malformed metadata is handled consistently.
- Around line 141-148: The code incorrectly treats any string whose basename
happens to exist in CWD as a local CSV; change the logic to first detect whether
the argument actually looks like a path (contains os.path.sep or endswith
'.csv') and only then use Path(accession) and call _read_accessions_from_csv; if
it looks like a CSV path but the file doesn't exist raise the "Accession CSV not
found" error. Apply this change to the block using source_path and
_read_accessions_from_csv and the analogous block at lines ~171-175 so bare
accessions (e.g., "PDC000109") are not interpreted as local paths.
- Around line 111-115: Normalize CSV headers immediately after creating
csv.DictReader by sanitizing reader.fieldnames: trim surrounding whitespace,
strip any BOM (e.g. leading '\ufeff'), and normalize casing/characters (e.g.
lower-case and convert hyphens to underscores) before calling _study_id_column
and _file_type_column. Use the sanitized list in place of reader.fieldnames (or
pass it to the detection helpers) so functions _study_id_column(columns, path)
and _file_type_column(columns) receive normalized headers and can match names
like "pdc_id", "pdc_id " or "File-Type".

In `@pridepy/pdc/downloader.py`:
- Around line 173-174: The _target_path function constructs a path from
untrusted pdc_file.study_id and pdc_file.file_name allowing path traversal;
sanitize both components (e.g., strip any path separators, use only the
basename/name portion), build the candidate path, resolve it and assert it is
inside output_folder.resolve() (raise an error if not), and ensure directories
are created safely before writing; update references to _target_path to expect
the sanitized/validated Path.

---

Outside diff comments:
In `@pridepy/pridepy.py`:
- Around line 678-680: The help string for the download strategy currently
claims "globus: resume-capable http/https downloads (single-connection stream)"
but is outdated because the command supports multipart range downloads when
--threads > 1; update the help text in the argparse add_argument call that sets
the download strategy (the help parameter that mentions "ftp (default)" and
"globus") to indicate that globus is resume-capable for http/https and supports
threaded multipart/range downloads when --threads is greater than 1, referring
to the --threads option by name so users know multipart behavior is enabled by
increasing threads.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0eb45c7-bcbb-46ee-83fa-503673edf22f

📥 Commits

Reviewing files that changed from the base of the PR and between 59c55bf and 42d965b.

📒 Files selected for processing (15)
  • README.md
  • pridepy/download/base.py
  • pridepy/download/by_url.py
  • pridepy/download/client.py
  • pridepy/download/pride.py
  • pridepy/download/transport.py
  • pridepy/pdc/__init__.py
  • pridepy/pdc/client.py
  • pridepy/pdc/downloader.py
  • pridepy/pridepy.py
  • pridepy/tests/test_cli_flatten.py
  • pridepy/tests/test_download_by_url.py
  • pridepy/tests/test_pdc_cli.py
  • pridepy/tests/test_pdc_client.py
  • pridepy/tests/test_pdc_downloader.py

Comment on lines +194 to +202
workers = min(parallel_files, 3, len(urls))
failures: List[Tuple[str, str]] = []

if parallel_files < 2:
for url in urls:
if workers > 1:
logging.info(
"Downloading %d URL(s) with %d parallel workers",
len(urls), workers,
)
with ThreadPoolExecutor(max_workers=workers) as executor:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clamp parallel_files before constructing the executor.

This path now always creates a ThreadPoolExecutor, so parallel_files=0 becomes max_workers=0 and fails immediately. download_http_urls() in pridepy/download/transport.py already uses max(1, min(...)); matching that here preserves the old serial fallback instead of turning a non-positive input into a runtime error.

Suggested fix
-    workers = min(parallel_files, 3, len(urls))
+    workers = max(1, min(parallel_files, 3, len(urls)))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
workers = min(parallel_files, 3, len(urls))
failures: List[Tuple[str, str]] = []
if parallel_files < 2:
for url in urls:
if workers > 1:
logging.info(
"Downloading %d URL(s) with %d parallel workers",
len(urls), workers,
)
with ThreadPoolExecutor(max_workers=workers) as executor:
workers = max(1, min(parallel_files, 3, len(urls)))
failures: List[Tuple[str, str]] = []
if workers > 1:
logging.info(
"Downloading %d URL(s) with %d parallel workers",
len(urls), workers,
)
with ThreadPoolExecutor(max_workers=workers) as executor:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/download/by_url.py` around lines 194 - 202, The current computation
of workers can produce 0 when parallel_files is 0, causing
ThreadPoolExecutor(max_workers=workers) to raise; change the clamp to ensure at
least one worker by computing workers = max(1, min(parallel_files, 3,
len(urls))) (matching download_http_urls()), so the serial fallback remains when
parallel_files <= 0 before creating the ThreadPoolExecutor and keeping the
existing log/conditional behavior that checks workers > 1.

Comment on lines +639 to +656
try:
session = Util.create_session_with_retries()
headers = {"Range": f"bytes={cursor}-{end}"}
with session.get(url, headers=headers, stream=True, timeout=(15, 15)) as r:
r.raise_for_status()
if r.status_code != 206:
raise RuntimeError(f"Server did not honor Range request: {r.status_code}")
content_range = r.headers.get("Content-Range", "")
if not content_range.lower().startswith(f"bytes {cursor}-{end}/"):
raise RuntimeError(f"Unexpected Content-Range header: {content_range}")
with open(file_path, "r+b") as f:
f.seek(cursor)
for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
if chunk:
f.write(chunk)
cursor += len(chunk)
pbar.update(len(chunk))
return

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Verify that each range actually reached end + 1 before returning.

This worker exits as soon as iter_content() stops, but unlike _parallel_download it never confirms that the requested segment was fully written. If a 206 body ends early without surfacing as an exception, the untouched bytes from preallocation stay in place and the whole file is still reported as complete.

Suggested fix
                 with open(file_path, "r+b") as f:
                     f.seek(cursor)
                     for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
                         if chunk:
                             f.write(chunk)
                             cursor += len(chunk)
                             pbar.update(len(chunk))
+                if cursor != end + 1:
+                    raise RuntimeError(
+                        f"Incomplete range {start}-{end}: wrote through {cursor - 1}"
+                    )
             return
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/download/transport.py` around lines 639 - 656, The code using
Util.create_session_with_retries() and session.get(...) must verify the
requested byte range was fully written before returning: after the iter_content
loop in this block (the with session.get(...) as r: scope that checks
r.status_code and Content-Range), assert that cursor == end + 1 (or raise a
RuntimeError) — if not, treat it as a truncated 206 response and handle
accordingly (raise so caller can retry or fall back to _parallel_download). Make
this check immediately before the `return` and include the unique symbols
`session.get`, `r.iter_content`, `cursor`, `end`, and `pbar.update` to locate
where to add the verification.

Comment on lines +685 to +692
def _prepare_multipart_target(file_path, total_size):
if os.path.exists(file_path) and os.path.getsize(file_path) == total_size:
logging.info("File already complete: %s", file_path)
return False

with open(file_path, "wb") as pre:
pre.truncate(total_size)
return True

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't treat a preallocated multipart target as a finished download.

This helper marks file_path as "already complete" purely from os.path.getsize(file_path) == total_size, but Line 690 preallocates that exact size before any range bytes are fetched. An interrupted multipart run therefore leaves a full-size placeholder behind, and the next call returns early while _http_download_one() logs success without retransferring or validating the contents.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/download/transport.py` around lines 685 - 692, The helper
_prepare_multipart_target currently treats a zero-filled preallocated file as a
finished download; change it to not consider size alone as completion by using a
separate completion marker or atomic rename strategy: create the preallocated
temporary file (e.g., file_path + ".part") in _prepare_multipart_target and only
treat the download as complete when a final marker (e.g., file_path +
".complete") is written or the temp file is atomically renamed to file_path
after _http_download_one finishes and validates ranges; update
_http_download_one to look for and write the same marker/rename on successful
full multipart assembly and to ignore placeholder files that are merely
preallocated.

Comment thread pridepy/pdc/client.py
Comment on lines +111 to +115
with path.open("r", encoding="utf-8", newline="") as handle:
reader = csv.DictReader(handle)
columns = reader.fieldnames or []
study_column = _study_id_column(columns, path)
file_type_column = _file_type_column(columns)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize CSV headers before column detection.

csv.DictReader preserves BOMs and stray spaces, so headers such as pdc_id, pdc_id , or File-Type will currently miss both _study_id_column() and _file_type_column() even though the columns are present. Normalizing fieldnames up front makes the CLI much more tolerant of spreadsheet-exported CSVs.

Suggested change
-    with path.open("r", encoding="utf-8", newline="") as handle:
+    with path.open("r", encoding="utf-8-sig", newline="") as handle:
         reader = csv.DictReader(handle)
-        columns = reader.fieldnames or []
+        reader.fieldnames = [
+            str(column).strip().lower() for column in (reader.fieldnames or [])
+        ]
+        columns = reader.fieldnames
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/pdc/client.py` around lines 111 - 115, Normalize CSV headers
immediately after creating csv.DictReader by sanitizing reader.fieldnames: trim
surrounding whitespace, strip any BOM (e.g. leading '\ufeff'), and normalize
casing/characters (e.g. lower-case and convert hyphens to underscores) before
calling _study_id_column and _file_type_column. Use the sanitized list in place
of reader.fieldnames (or pass it to the detection helpers) so functions
_study_id_column(columns, path) and _file_type_column(columns) receive
normalized headers and can match names like "pdc_id", "pdc_id " or "File-Type".

Comment thread pridepy/pdc/client.py
Comment on lines +141 to +148
source_path = Path(accession).expanduser()
if source_path.exists():
if source_path.suffix.lower() != ".csv":
raise ValueError(f"Only CSV accession files are supported: {source_path}")
return _read_accessions_from_csv(source_path)

if source_path.suffix.lower() == ".csv" or os.path.sep in accession:
raise ValueError(f"Accession CSV not found: {source_path}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid treating bare accessions as local paths.

Both parsers use Path(accession).exists() before deciding whether the input is a CSV path. If the current working directory already contains a file or folder named like a valid study ID (for example PDC000109), the CLI will reject that accession as a non-CSV path instead of downloading it. Only enter the file-input branch when the argument actually looks like a CSV path.

Also applies to: 171-175

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/pdc/client.py` around lines 141 - 148, The code incorrectly treats
any string whose basename happens to exist in CWD as a local CSV; change the
logic to first detect whether the argument actually looks like a path (contains
os.path.sep or endswith '.csv') and only then use Path(accession) and call
_read_accessions_from_csv; if it looks like a CSV path but the file doesn't
exist raise the "Accession CSV not found" error. Apply this change to the block
using source_path and _read_accessions_from_csv and the analogous block at lines
~171-175 so bare accessions (e.g., "PDC000109") are not interpreted as local
paths.

Comment thread pridepy/pdc/client.py
Comment on lines +218 to +222
def _safe_int(value) -> int:
try:
return int(value or 0)
except (TypeError, ValueError):
return 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Malformed file_size values currently disable part of validation.

_safe_int() turns any parse failure into 0, and downstream validate_pdc_file() treats file_size <= 0 as “unknown size”. That means a bad GraphQL file_size value silently weakens integrity checks: if md5sum is also absent, any non-empty partial file will pass validation. Please distinguish “missing” from “invalid” metadata and skip or fail entries with malformed sizes instead of coercing them to zero.

Also applies to: 225-243

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/pdc/client.py` around lines 218 - 222, The helper _safe_int currently
masks malformed file_size by returning 0 which makes validate_pdc_file treat
invalid sizes as “unknown” and weakens validation; change _safe_int (or add a
new strict parser used by validate_pdc_file) to not coerce invalid values to 0 —
instead return None or raise a ValueError on parse failure so validate_pdc_file
can explicitly reject or skip entries with malformed sizes; update
validate_pdc_file to check for None/exception and fail or skip the entry (rather
than treating it as unknown/<=0), and apply the same stricter parsing logic to
other usages in the same region (lines around the current _safe_int and the
blocks referenced at 225-243) so malformed metadata is handled consistently.

Comment thread pridepy/pdc/downloader.py
Comment on lines +173 to +174
def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path:
return output_folder / pdc_file.study_id / pdc_file.file_name

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Path traversal risk in target path construction

Line 174 builds a filesystem path directly from remote study_id/file_name. If metadata contains ../ or absolute paths, downloads can escape output_folder and overwrite arbitrary files.

🔒 Proposed fix
 def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path:
-    return output_folder / pdc_file.study_id / pdc_file.file_name
+    study_id = pdc_file.study_id
+    file_name = pdc_file.file_name
+
+    if os.path.isabs(study_id) or "/" in study_id or "\\" in study_id or study_id in {"", ".", ".."}:
+        raise ValueError(f"Invalid study_id path component: {study_id!r}")
+    if os.path.isabs(file_name) or "/" in file_name or "\\" in file_name or file_name in {"", ".", ".."}:
+        raise ValueError(f"Invalid file_name path component: {file_name!r}")
+
+    return output_folder / study_id / file_name
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path:
return output_folder / pdc_file.study_id / pdc_file.file_name
def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path:
study_id = pdc_file.study_id
file_name = pdc_file.file_name
if os.path.isabs(study_id) or "/" in study_id or "\\" in study_id or study_id in {"", ".", ".."}:
raise ValueError(f"Invalid study_id path component: {study_id!r}")
if os.path.isabs(file_name) or "/" in file_name or "\\" in file_name or file_name in {"", ".", ".."}:
raise ValueError(f"Invalid file_name path component: {file_name!r}")
return output_folder / study_id / file_name
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/pdc/downloader.py` around lines 173 - 174, The _target_path function
constructs a path from untrusted pdc_file.study_id and pdc_file.file_name
allowing path traversal; sanitize both components (e.g., strip any path
separators, use only the basename/name portion), build the candidate path,
resolve it and assert it is inside output_folder.resolve() (raise an error if
not), and ensure directories are created safely before writing; update
references to _target_path to expect the sanitized/validated Path.

@ypriverol ypriverol merged commit 6d9dfc8 into master Jun 8, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants