CPTAC download implementation by ypriverol · Pull Request #110 · PRIDE-Archive/pridepy

ypriverol · 2026-06-06T11:21:26Z

Summary by CodeRabbit

Release Notes

New Features
- Added parallel HTTP range-based downloading for faster single-file transfers via new --threads parameter (1–32) across download commands
- Introduced download-pdc-files command for downloading PDC/CPTAC cancer data files with signed HTTPS URLs
- Added progress reporting during FTP directory enumeration
Documentation
- Updated README with download-pdc-files command documentation and download example
Tests
- Added comprehensive test coverage for HTTP parallel downloads and PDC file download workflows

Large MassIVE timsTOF deposits are spread over thousands of .d directories, each requiring its own PASV+TLS data connection to list. _walk_ftp_tree enumerates the entire tree before any download and was silent throughout, so a multi-minute enumeration looked like a hang (e.g. ~1719 .d in MSV000098940 ≈ 62 min of no output -> users Ctrl-C). Add an INFO progress heartbeat every 100 directories plus a final summary. Pure instrumentation: the returned file list and recursion are unchanged (covered by a mock-tree unit test); existing tests pass.

fix(massive/ftp): emit progress while walking the FTPS tree

Add multi-segment download feature and restore threads option

coderabbitai · 2026-06-06T11:21:39Z

📝 Walkthrough

Walkthrough

This PR adds per-file download thread concurrency (download_threads parameter) throughout the download stack for HTTP byte-range parallelism, introduces HTTP multipart range downloads and FTP progress heartbeats, and adds a complete PDC/CPTAC file download system with GraphQL client, file validation, retries, and CLI command.

Changes

Per-file threading infrastructure

Layer / File(s)	Summary
Provider threading parameter `pridepy/download/base.py`	`download_all_raw`, `download_category`, `download_by_filenames`, and `download_files` accept and forward `download_threads: int = 1`.
URL-based dispatch threading `pridepy/download/by_url.py`	`_dispatch_url_scheme`, `_download_single_url`, and `download_files_by_url` accept `download_threads` and route HTTP URLs to multipart or single-stream; refactored to always use `ThreadPoolExecutor`.
Client facade threading `pridepy/download/client.py`	`download_http_urls`, `download_all_raw_files`, `download_all_category_files`, `download_files_by_list`, and `download_files_by_url` accept and forward `download_threads` to underlying calls.
PRIDE provider threading `pridepy/download/pride.py`	`_globus_download_one` branches on `download_threads` to use multipart or single-stream; threading flows through `download_files_from_globus`, `_batch_download_by_protocol`, `_download_with_fallback`, `download_files`, and `_download_files_batch`.
HTTP multipart byte-range downloads `pridepy/download/transport.py`	New helpers (`_download_range`, `_read_http_download_metadata`, `_prepare_multipart_target`, `_build_download_ranges`, `_download_multipart_ranges`, `_multipart_download`) implement parallel byte-range fetching with per-range resume and fallback; integrated into `_http_download_one` and `download_http_urls`.
FTP tree enumeration progress `pridepy/download/transport.py`	Enhanced `_walk_ftp_tree` tracks and logs directory/file scan counts periodically during `mlsd` and fallback `LIST` traversal.
CLI download commands with --threads `pridepy/pridepy.py`, `pridepy/tests/test_cli_flatten.py`, `pridepy/tests/test_download_by_url.py`	`download-all-public-raw-files`, `download-all-public-category-files`, `download-files-by-list`, and `download-files-by-url` CLI commands replace "parallel files" with `--threads` (1–32 constrained); tests verify threading parameter wiring and multipart behavior.

PDC file download system

Layer / File(s)	Summary
PDC GraphQL client and parsing `pridepy/pdc/client.py`	New module with dataclasses (`PDCFile`, `PDCDownloadRequest`, `PDCFileTypeFilter`); accession parsing from inline text or CSV; file-type normalization and filtering; GraphQL query execution; and signed-URL refresh under process-wide lock.
PDC file validation and orchestration `pridepy/pdc/downloader.py`	File-existence and checksum validation; atomic `.part`-file downloads; retry with exponential backoff and 403 signed-URL refresh; study-level fetch and per-file download loops; public entry point that parses requests, creates output directory, tracks statistics, writes failure manifest on errors, and raises if any downloads fail.
PDC package exports `pridepy/pdc/__init__.py`	Re-exports public PDC symbols for API.
PDC CLI command `pridepy/pridepy.py`	New `download-pdc-files` command with accession, file-type (optional), output folder, checksum (default enabled), threads (1–32), and retry options; routes to `download_pdc_files` with exception-to-Click error mapping and logs completion summary.
PDC tests `pridepy/tests/test_pdc_client.py`, `pridepy/tests/test_pdc_downloader.py`, `pridepy/tests/test_pdc_cli.py`	Accession parsing and CSV handling; file-type filtering; GraphQL mapping; URL refresh; file skipping with checksums; multi-study CSV downloads; multipart threading; failure handling with retry; and CLI argument mapping.
README documentation `README.md`	Added `download-pdc-files` command entry and quick example.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

PRIDE-Archive/pridepy#61: Both PRs touch pridepy.py's PRIDE download CLI—retrieved PR adds download-all-public-raw-files/download-all-public-category-files, while the main PR modifies those same commands' option wiring to use download_threads instead of parallel_files.
PRIDE-Archive/pridepy#93: Both PRs change HTTP/byte-range download concurrency implementation by wiring thread parameters into range-download logic and implementing multipart-style parallel range behavior, aligning the main PR's download_threads work with earlier parallelism changes.

Suggested labels

enhancement, feature, downloads, threading, http, pdc

Suggested reviewers

chakrabandla
selvaebi

Poem

🐰 A thread for each byte-range in sight,
PDC files downloaded swift and tight,
FTP trees now whisper their progress song,
While checksums verify nothing goes wrong! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 5.38% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'CPTAC download implementation' is vague and does not accurately reflect the comprehensive scope of changes, which include PDC/CPTAC downloads, multipart HTTP downloading, and CLI enhancements across multiple modules.	Consider a more specific title such as 'Add PDC/CPTAC download support with multipart HTTP and threading' to better convey the main changes in the pull request.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

qodo-code-review · 2026-06-06T11:21:55Z

Review Summary by Qodo

Add multi-segment HTTP downloads, PDC support, and FTP progress reporting

✨ Enhancement 🧪 Tests

Walkthroughs

Description

• Add multi-segment HTTP Range download support via download_threads parameter
  - Replaces -w/--parallel-files with -t/--threads for per-file parallelism
  - Implements _multipart_download() for concurrent Range requests
• Add PDC/CPTAC download support with GraphQL metadata and signed URLs
  - New pridepy/pdc/ module with client and downloader implementations
  - Supports mzid, psm, raw, mzml file types with CSV accession lists
• Improve FTP tree enumeration progress reporting
  - Emit INFO heartbeats every 100 directories during FTPS tree walk
  - Prevents multi-minute hangs on large deposits like MassIVE timsTOF
• Refactor URL download logic to always use ThreadPoolExecutor
  - Simplifies serial/parallel branching; always uses workers pool

Diagram

flowchart LR
  A["CLI: -t/--threads"] --> B["download_threads param"]
  B --> C["HTTP URLs"]
  B --> D["PRIDE/MassIVE"]
  B --> E["PDC/CPTAC"]
  C --> F["_multipart_download"]
  D --> G["transport layer"]
  E --> H["PDC GraphQL API"]
  F --> I["HTTP Range requests"]
  H --> J["Signed URLs"]
  G --> K["FTP tree walk progress"]

File Changes

1. pridepy/download/base.py ✨ Enhancement +8/-0

Add download_threads parameter to all download methods

pridepy/download/base.py

2. pridepy/download/by_url.py ✨ Enhancement +43/-30

Implement multipart download dispatch and refactor URL logic

pridepy/download/by_url.py

3. pridepy/download/client.py ✨ Enhancement +10/-0

Thread download_threads parameter through transport shim

pridepy/download/client.py

View more (12)

4. pridepy/download/pride.py ✨ Enhancement +35/-5

Support multipart downloads in PRIDE globus and batch handlers

pridepy/download/pride.py

5. pridepy/download/transport.py ✨ Enhancement +199/-4

Implement _multipart_download with Range requests and FTP progress

pridepy/download/transport.py

6. pridepy/pdc/__init__.py ✨ Enhancement +20/-0

New PDC module exports for client and downloader

pridepy/pdc/init.py

7. pridepy/pdc/client.py ✨ Enhancement +281/-0

PDC GraphQL client with file filtering and accession parsing

pridepy/pdc/client.py

8. pridepy/pdc/downloader.py ✨ Enhancement +386/-0

PDC download orchestrator with retry and checksum validation

pridepy/pdc/downloader.py

9. pridepy/pridepy.py ✨ Enhancement +114/-26

Replace -w/--parallel-files with -t/--threads and add PDC command

pridepy/pridepy.py

10. pridepy/tests/test_cli_flatten.py 🧪 Tests +40/-0

Add tests for download_threads parameter in CLI

pridepy/tests/test_cli_flatten.py

11. pridepy/tests/test_download_by_url.py 🧪 Tests +50/-1

Add multipart download and failure cleanup tests

pridepy/tests/test_download_by_url.py

12. pridepy/tests/test_pdc_cli.py 🧪 Tests +90/-0

New PDC CLI command tests

pridepy/tests/test_pdc_cli.py

13. pridepy/tests/test_pdc_client.py 🧪 Tests +177/-0

New PDC client and accession parsing tests

pridepy/tests/test_pdc_client.py

14. pridepy/tests/test_pdc_downloader.py 🧪 Tests +260/-0

New PDC downloader integration and retry tests

pridepy/tests/test_pdc_downloader.py

15. README.md 📝 Documentation +4/-0

Document new download-pdc-files command

README.md

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

qodo-code-review · 2026-06-06T11:21:56Z

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0)

1. Multipart missing completion check 🐞 Bug ≡ Correctness

Description

transport._download_range returns success without verifying it wrote the full requested byte
range; because _multipart_download preallocates the file to the expected size, a truncated segment
can leave zero-filled holes while the final file size still looks correct. This can silently corrupt
downloads when servers close connections mid-stream without raising.

Code

pridepy/download/transport.py[R628-656]

+def _download_range(url, file_path, start, end, pbar, max_retries=3):
+    """Download a byte range directly into the target file using seek.
+
+    On transient failures (e.g. ``IncompleteRead``) the segment resumes
+    from the last successfully written byte using a fresh Range request
+    for the remaining bytes, instead of restarting the whole segment.
+    ``pbar`` only accumulates bytes that were actually written, so the
+    progress bar stays in sync with on-disk state across retries.
+    """
+    cursor = start
+    for attempt in range(1, max_retries + 1):
+        try:
+            session = Util.create_session_with_retries()
+            headers = {"Range": f"bytes={cursor}-{end}"}
+            with session.get(url, headers=headers, stream=True, timeout=(15, 15)) as r:
+                r.raise_for_status()
+                if r.status_code != 206:
+                    raise RuntimeError(f"Server did not honor Range request: {r.status_code}")
+                content_range = r.headers.get("Content-Range", "")
+                if not content_range.lower().startswith(f"bytes {cursor}-{end}/"):
+                    raise RuntimeError(f"Unexpected Content-Range header: {content_range}")
+                with open(file_path, "r+b") as f:
+                    f.seek(cursor)
+                    for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
+                        if chunk:
+                            f.write(chunk)
+                            cursor += len(chunk)
+                            pbar.update(len(chunk))
+            return

Evidence

The existing single-connection downloader explicitly guards against silent mid-stream truncation by
checking the final on-disk size against the server’s Content-Length, but _download_range has no
equivalent end-of-range validation and simply returns after the loop.

pridepy/download/transport.py[568-625]
pridepy/download/transport.py[628-656]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`_download_range()` streams bytes for a Range request but never asserts that the downloaded segment reached `end` (i.e., `cursor == end + 1`) before returning. With `_prepare_multipart_target()` preallocating the file, incomplete segments can leave zero-filled gaps that are not detectable via file size.

## Issue Context
The single-stream downloader `_parallel_download()` includes a post-transfer integrity check to catch silent truncation; multipart currently lacks an equivalent check.

## Fix Focus Areas
- pridepy/download/transport.py[628-656]
- pridepy/download/transport.py[613-625]

## Implementation notes
- After the `iter_content` loop, add a completeness assertion like:
 - `expected = end + 1`
 - if `cursor != expected`: raise `RuntimeError` describing the shortfall.
- Consider also validating `Content-Length` for the range response when present, but the cursor check is the critical guard.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. Range split can create invalid segments 🐞 Bug ☼ Reliability

Description

_build_download_ranges computes part_size = total_size // threads without guarding against
threads > total_size, which can create zero-length parts and invalid ranges (e.g., end < start).
While default min_size_bytes (10MB) makes this unlikely in typical use, it can occur when callers
lower min_size_bytes or otherwise invoke multipart on small objects.

Code

pridepy/download/transport.py[R695-702]

+def _build_download_ranges(total_size, threads):
+    part_size = total_size // threads
+    ranges = []
+    for index in range(threads):
+        start = index * part_size
+        end = total_size - 1 if index == threads - 1 else (start + part_size - 1)
+        ranges.append((start, end))
+    return ranges

Evidence

_build_download_ranges uses integer division to derive per-part size and directly computes end
as start + part_size - 1, which becomes invalid when part_size is 0. Multipart calls this
function after clamping threads but without relating threads to total_size.

pridepy/download/transport.py[695-702]
pridepy/download/transport.py[728-761]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`_build_download_ranges(total_size, threads)` can produce invalid ranges when `threads` exceeds `total_size` (in bytes), because `part_size` becomes 0 and the computed `end` becomes `start - 1` for non-final parts.

## Issue Context
This is mostly mitigated in normal operation by `_can_use_multipart()` requiring `total_size >= min_size_bytes` (default 10MB) and threads being clamped to <=32, but the function is still unsafe for edge-case inputs (e.g., test/advanced callers setting `min_size_bytes=1`).

## Fix Focus Areas
- pridepy/download/transport.py[695-702]
- pridepy/download/transport.py[728-761]

## Implementation notes
- Defensively clamp the effective thread count before range building, e.g. `threads = min(threads, total_size)` (or `min(threads, max(1, total_size // MIN_PART_SIZE))`).
- Alternatively, ensure `part_size = max(1, total_size // threads)` and adjust the number of ranges accordingly.
- Add a sanity check that every produced range satisfies `0 <= start <= end < total_size`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

3. Zero workers crash 🐞 Bug ☼ Reliability

Description

download_files_by_url now always constructs ThreadPoolExecutor(max_workers=workers) where
workers = min(parallel_files, 3, len(urls)) is not clamped to at least 1, so programmatic callers
passing parallel_files <= 0 will raise at runtime. This regression was introduced by removing the
prior serial-path branching and differs from transport.download_http_urls, which clamps workers
with max(1, ...).

Code

pridepy/download/by_url.py[R194-203]

+    workers = min(parallel_files, 3, len(urls))
    failures: List[Tuple[str, str]] = []

-    if parallel_files < 2:
-        for url in urls:
+    if workers > 1:
+        logging.info(
+            "Downloading %d URL(s) with %d parallel workers",
+            len(urls), workers,
+        )
+    with ThreadPoolExecutor(max_workers=workers) as executor:
+        futures = {

Evidence

download_files_by_url uses workers directly as the executor size without a lower bound, while
transport.download_http_urls explicitly wraps the min(...) with max(1, ...) before creating
its executor.

pridepy/download/by_url.py[189-214]
pridepy/download/transport.py[846-853]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`download_files_by_url()` computes `workers = min(parallel_files, 3, len(urls))` and unconditionally uses it in `ThreadPoolExecutor(max_workers=workers)`. If a programmatic caller passes `parallel_files=0` (or negative), this will crash.

## Issue Context
CLI users are protected (Click IntRange), but `download_files_by_url` is a Python API. Other transports already clamp the worker count to at least 1.

## Fix Focus Areas
- pridepy/download/by_url.py[194-203]
- pridepy/download/transport.py[846-852]

## Implementation notes
- Change to `workers = max(1, min(parallel_files, 3, len(urls)))`.
- Optionally also validate `parallel_files` and raise `ValueError` when it’s < 1, but clamping is consistent with existing transport behavior.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

qodo-code-review · 2026-06-06T11:27:17Z

+def _download_range(url, file_path, start, end, pbar, max_retries=3):
+    """Download a byte range directly into the target file using seek.
+
+    On transient failures (e.g. ``IncompleteRead``) the segment resumes
+    from the last successfully written byte using a fresh Range request
+    for the remaining bytes, instead of restarting the whole segment.
+    ``pbar`` only accumulates bytes that were actually written, so the
+    progress bar stays in sync with on-disk state across retries.
+    """
+    cursor = start
+    for attempt in range(1, max_retries + 1):
+        try:
+            session = Util.create_session_with_retries()
+            headers = {"Range": f"bytes={cursor}-{end}"}
+            with session.get(url, headers=headers, stream=True, timeout=(15, 15)) as r:
+                r.raise_for_status()
+                if r.status_code != 206:
+                    raise RuntimeError(f"Server did not honor Range request: {r.status_code}")
+                content_range = r.headers.get("Content-Range", "")
+                if not content_range.lower().startswith(f"bytes {cursor}-{end}/"):
+                    raise RuntimeError(f"Unexpected Content-Range header: {content_range}")
+                with open(file_path, "r+b") as f:
+                    f.seek(cursor)
+                    for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
+                        if chunk:
+                            f.write(chunk)
+                            cursor += len(chunk)
+                            pbar.update(len(chunk))
+            return


1. Multipart missing completion check 🐞 Bug ≡ Correctness

transport._download_range returns success without verifying it wrote the full requested byte range; because _multipart_download preallocates the file to the expected size, a truncated segment can leave zero-filled holes while the final file size still looks correct. This can silently corrupt downloads when servers close connections mid-stream without raising.

Agent Prompt

## Issue description `_download_range()` streams bytes for a Range request but never asserts that the downloaded segment reached `end` (i.e., `cursor == end + 1`) before returning. With `_prepare_multipart_target()` preallocating the file, incomplete segments can leave zero-filled gaps that are not detectable via file size. ## Issue Context The single-stream downloader `_parallel_download()` includes a post-transfer integrity check to catch silent truncation; multipart currently lacks an equivalent check. ## Fix Focus Areas - pridepy/download/transport.py[628-656] - pridepy/download/transport.py[613-625] ## Implementation notes - After the `iter_content` loop, add a completeness assertion like: - `expected = end + 1` - if `cursor != expected`: raise `RuntimeError` describing the shortfall. - Consider also validating `Content-Length` for the range response when present, but the cursor check is the critical guard.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pridepy/pridepy.py (1)

678-680: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update protocol help text to reflect threaded multipart behavior.

Line 679 says globus uses a single-connection stream, but this command now accepts --threads and can do multipart range downloads when threads > 1. The help text is now inaccurate.

Suggested diff

-    help="Download strategy. ftp (default): single-connection per URL scheme. "
-         "globus: resume-capable http/https downloads (single-connection stream).",
+    help="Download strategy. ftp (default): single-connection per URL scheme. "
+         "globus: resume-capable http/https downloads (supports multipart ranges with --threads).",

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pridepy/pridepy.py` around lines 678 - 680, The help string for the download
strategy currently claims "globus: resume-capable http/https downloads
(single-connection stream)" but is outdated because the command supports
multipart range downloads when --threads > 1; update the help text in the
argparse add_argument call that sets the download strategy (the help parameter
that mentions "ftp (default)" and "globus") to indicate that globus is
resume-capable for http/https and supports threaded multipart/range downloads
when --threads is greater than 1, referring to the --threads option by name so
users know multipart behavior is enabled by increasing threads.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pridepy/download/by_url.py`:
- Around line 194-202: The current computation of workers can produce 0 when
parallel_files is 0, causing ThreadPoolExecutor(max_workers=workers) to raise;
change the clamp to ensure at least one worker by computing workers = max(1,
min(parallel_files, 3, len(urls))) (matching download_http_urls()), so the
serial fallback remains when parallel_files <= 0 before creating the
ThreadPoolExecutor and keeping the existing log/conditional behavior that checks
workers > 1.

In `@pridepy/download/transport.py`:
- Around line 685-692: The helper _prepare_multipart_target currently treats a
zero-filled preallocated file as a finished download; change it to not consider
size alone as completion by using a separate completion marker or atomic rename
strategy: create the preallocated temporary file (e.g., file_path + ".part") in
_prepare_multipart_target and only treat the download as complete when a final
marker (e.g., file_path + ".complete") is written or the temp file is atomically
renamed to file_path after _http_download_one finishes and validates ranges;
update _http_download_one to look for and write the same marker/rename on
successful full multipart assembly and to ignore placeholder files that are
merely preallocated.
- Around line 639-656: The code using Util.create_session_with_retries() and
session.get(...) must verify the requested byte range was fully written before
returning: after the iter_content loop in this block (the with session.get(...)
as r: scope that checks r.status_code and Content-Range), assert that cursor ==
end + 1 (or raise a RuntimeError) — if not, treat it as a truncated 206 response
and handle accordingly (raise so caller can retry or fall back to
_parallel_download). Make this check immediately before the `return` and include
the unique symbols `session.get`, `r.iter_content`, `cursor`, `end`, and
`pbar.update` to locate where to add the verification.

In `@pridepy/pdc/client.py`:
- Around line 218-222: The helper _safe_int currently masks malformed file_size
by returning 0 which makes validate_pdc_file treat invalid sizes as “unknown”
and weakens validation; change _safe_int (or add a new strict parser used by
validate_pdc_file) to not coerce invalid values to 0 — instead return None or
raise a ValueError on parse failure so validate_pdc_file can explicitly reject
or skip entries with malformed sizes; update validate_pdc_file to check for
None/exception and fail or skip the entry (rather than treating it as
unknown/<=0), and apply the same stricter parsing logic to other usages in the
same region (lines around the current _safe_int and the blocks referenced at
225-243) so malformed metadata is handled consistently.
- Around line 141-148: The code incorrectly treats any string whose basename
happens to exist in CWD as a local CSV; change the logic to first detect whether
the argument actually looks like a path (contains os.path.sep or endswith
'.csv') and only then use Path(accession) and call _read_accessions_from_csv; if
it looks like a CSV path but the file doesn't exist raise the "Accession CSV not
found" error. Apply this change to the block using source_path and
_read_accessions_from_csv and the analogous block at lines ~171-175 so bare
accessions (e.g., "PDC000109") are not interpreted as local paths.
- Around line 111-115: Normalize CSV headers immediately after creating
csv.DictReader by sanitizing reader.fieldnames: trim surrounding whitespace,
strip any BOM (e.g. leading '\ufeff'), and normalize casing/characters (e.g.
lower-case and convert hyphens to underscores) before calling _study_id_column
and _file_type_column. Use the sanitized list in place of reader.fieldnames (or
pass it to the detection helpers) so functions _study_id_column(columns, path)
and _file_type_column(columns) receive normalized headers and can match names
like "pdc_id", "pdc_id " or "File-Type".

In `@pridepy/pdc/downloader.py`:
- Around line 173-174: The _target_path function constructs a path from
untrusted pdc_file.study_id and pdc_file.file_name allowing path traversal;
sanitize both components (e.g., strip any path separators, use only the
basename/name portion), build the candidate path, resolve it and assert it is
inside output_folder.resolve() (raise an error if not), and ensure directories
are created safely before writing; update references to _target_path to expect
the sanitized/validated Path.

---

Outside diff comments:
In `@pridepy/pridepy.py`:
- Around line 678-680: The help string for the download strategy currently
claims "globus: resume-capable http/https downloads (single-connection stream)"
but is outdated because the command supports multipart range downloads when
--threads > 1; update the help text in the argparse add_argument call that sets
the download strategy (the help parameter that mentions "ftp (default)" and
"globus") to indicate that globus is resume-capable for http/https and supports
threaded multipart/range downloads when --threads is greater than 1, referring
to the --threads option by name so users know multipart behavior is enabled by
increasing threads.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0eb45c7-bcbb-46ee-83fa-503673edf22f

📥 Commits

Reviewing files that changed from the base of the PR and between 59c55bf and 42d965b.

📒 Files selected for processing (15)

README.md
pridepy/download/base.py
pridepy/download/by_url.py
pridepy/download/client.py
pridepy/download/pride.py
pridepy/download/transport.py
pridepy/pdc/__init__.py
pridepy/pdc/client.py
pridepy/pdc/downloader.py
pridepy/pridepy.py
pridepy/tests/test_cli_flatten.py
pridepy/tests/test_download_by_url.py
pridepy/tests/test_pdc_cli.py
pridepy/tests/test_pdc_client.py
pridepy/tests/test_pdc_downloader.py

coderabbitai · 2026-06-06T11:30:51Z

+    workers = min(parallel_files, 3, len(urls))
    failures: List[Tuple[str, str]] = []

-    if parallel_files < 2:
-        for url in urls:
+    if workers > 1:
+        logging.info(
+            "Downloading %d URL(s) with %d parallel workers",
+            len(urls), workers,
+        )
+    with ThreadPoolExecutor(max_workers=workers) as executor:


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clamp parallel_files before constructing the executor.

This path now always creates a ThreadPoolExecutor, so parallel_files=0 becomes max_workers=0 and fails immediately. download_http_urls() in pridepy/download/transport.py already uses max(1, min(...)); matching that here preserves the old serial fallback instead of turning a non-positive input into a runtime error.

Suggested fix

- workers = min(parallel_files, 3, len(urls)) + workers = max(1, min(parallel_files, 3, len(urls)))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

workers = min(parallel_files, 3, len(urls))

failures: List[Tuple[str, str]] = []

if parallel_files < 2:

for url in urls:

if workers > 1:

logging.info(

"Downloading %d URL(s) with %d parallel workers",

len(urls), workers,

)

with ThreadPoolExecutor(max_workers=workers) as executor:

workers = max(1, min(parallel_files, 3, len(urls)))

failures: List[Tuple[str, str]] = []

if workers > 1:

logging.info(

"Downloading %d URL(s) with %d parallel workers",

len(urls), workers,

)

with ThreadPoolExecutor(max_workers=workers) as executor:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/download/by_url.py` around lines 194 - 202, The current computation of workers can produce 0 when parallel_files is 0, causing ThreadPoolExecutor(max_workers=workers) to raise; change the clamp to ensure at least one worker by computing workers = max(1, min(parallel_files, 3, len(urls))) (matching download_http_urls()), so the serial fallback remains when parallel_files <= 0 before creating the ThreadPoolExecutor and keeping the existing log/conditional behavior that checks workers > 1.

coderabbitai · 2026-06-06T11:30:51Z

+        try:
+            session = Util.create_session_with_retries()
+            headers = {"Range": f"bytes={cursor}-{end}"}
+            with session.get(url, headers=headers, stream=True, timeout=(15, 15)) as r:
+                r.raise_for_status()
+                if r.status_code != 206:
+                    raise RuntimeError(f"Server did not honor Range request: {r.status_code}")
+                content_range = r.headers.get("Content-Range", "")
+                if not content_range.lower().startswith(f"bytes {cursor}-{end}/"):
+                    raise RuntimeError(f"Unexpected Content-Range header: {content_range}")
+                with open(file_path, "r+b") as f:
+                    f.seek(cursor)
+                    for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
+                        if chunk:
+                            f.write(chunk)
+                            cursor += len(chunk)
+                            pbar.update(len(chunk))
+            return


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Verify that each range actually reached end + 1 before returning.

This worker exits as soon as iter_content() stops, but unlike _parallel_download it never confirms that the requested segment was fully written. If a 206 body ends early without surfacing as an exception, the untouched bytes from preallocation stay in place and the whole file is still reported as complete.

Suggested fix

with open(file_path, "r+b") as f: f.seek(cursor) for chunk in r.iter_content(chunk_size=8 * 1024 * 1024): if chunk: f.write(chunk) cursor += len(chunk) pbar.update(len(chunk)) + if cursor != end + 1: + raise RuntimeError( + f"Incomplete range {start}-{end}: wrote through {cursor - 1}" + ) return

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/download/transport.py` around lines 639 - 656, The code using Util.create_session_with_retries() and session.get(...) must verify the requested byte range was fully written before returning: after the iter_content loop in this block (the with session.get(...) as r: scope that checks r.status_code and Content-Range), assert that cursor == end + 1 (or raise a RuntimeError) — if not, treat it as a truncated 206 response and handle accordingly (raise so caller can retry or fall back to _parallel_download). Make this check immediately before the `return` and include the unique symbols `session.get`, `r.iter_content`, `cursor`, `end`, and `pbar.update` to locate where to add the verification.

coderabbitai · 2026-06-06T11:30:51Z

+def _prepare_multipart_target(file_path, total_size):
+    if os.path.exists(file_path) and os.path.getsize(file_path) == total_size:
+        logging.info("File already complete: %s", file_path)
+        return False
+
+    with open(file_path, "wb") as pre:
+        pre.truncate(total_size)
+    return True


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't treat a preallocated multipart target as a finished download.

This helper marks file_path as "already complete" purely from os.path.getsize(file_path) == total_size, but Line 690 preallocates that exact size before any range bytes are fetched. An interrupted multipart run therefore leaves a full-size placeholder behind, and the next call returns early while _http_download_one() logs success without retransferring or validating the contents.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/download/transport.py` around lines 685 - 692, The helper _prepare_multipart_target currently treats a zero-filled preallocated file as a finished download; change it to not consider size alone as completion by using a separate completion marker or atomic rename strategy: create the preallocated temporary file (e.g., file_path + ".part") in _prepare_multipart_target and only treat the download as complete when a final marker (e.g., file_path + ".complete") is written or the temp file is atomically renamed to file_path after _http_download_one finishes and validates ranges; update _http_download_one to look for and write the same marker/rename on successful full multipart assembly and to ignore placeholder files that are merely preallocated.

coderabbitai · 2026-06-06T11:30:51Z

+    with path.open("r", encoding="utf-8", newline="") as handle:
+        reader = csv.DictReader(handle)
+        columns = reader.fieldnames or []
+        study_column = _study_id_column(columns, path)
+        file_type_column = _file_type_column(columns)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize CSV headers before column detection.

csv.DictReader preserves BOMs and stray spaces, so headers such as pdc_id, pdc_id , or File-Type will currently miss both _study_id_column() and _file_type_column() even though the columns are present. Normalizing fieldnames up front makes the CLI much more tolerant of spreadsheet-exported CSVs.

Suggested change

- with path.open("r", encoding="utf-8", newline="") as handle: + with path.open("r", encoding="utf-8-sig", newline="") as handle: reader = csv.DictReader(handle) - columns = reader.fieldnames or [] + reader.fieldnames = [ + str(column).strip().lower() for column in (reader.fieldnames or []) + ] + columns = reader.fieldnames

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/pdc/client.py` around lines 111 - 115, Normalize CSV headers immediately after creating csv.DictReader by sanitizing reader.fieldnames: trim surrounding whitespace, strip any BOM (e.g. leading '\ufeff'), and normalize casing/characters (e.g. lower-case and convert hyphens to underscores) before calling _study_id_column and _file_type_column. Use the sanitized list in place of reader.fieldnames (or pass it to the detection helpers) so functions _study_id_column(columns, path) and _file_type_column(columns) receive normalized headers and can match names like "pdc_id", "pdc_id " or "File-Type".

coderabbitai · 2026-06-06T11:30:51Z

+    source_path = Path(accession).expanduser()
+    if source_path.exists():
+        if source_path.suffix.lower() != ".csv":
+            raise ValueError(f"Only CSV accession files are supported: {source_path}")
+        return _read_accessions_from_csv(source_path)
+
+    if source_path.suffix.lower() == ".csv" or os.path.sep in accession:
+        raise ValueError(f"Accession CSV not found: {source_path}")


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid treating bare accessions as local paths.

Both parsers use Path(accession).exists() before deciding whether the input is a CSV path. If the current working directory already contains a file or folder named like a valid study ID (for example PDC000109), the CLI will reject that accession as a non-CSV path instead of downloading it. Only enter the file-input branch when the argument actually looks like a CSV path.

Also applies to: 171-175

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/pdc/client.py` around lines 141 - 148, The code incorrectly treats any string whose basename happens to exist in CWD as a local CSV; change the logic to first detect whether the argument actually looks like a path (contains os.path.sep or endswith '.csv') and only then use Path(accession) and call _read_accessions_from_csv; if it looks like a CSV path but the file doesn't exist raise the "Accession CSV not found" error. Apply this change to the block using source_path and _read_accessions_from_csv and the analogous block at lines ~171-175 so bare accessions (e.g., "PDC000109") are not interpreted as local paths.

coderabbitai · 2026-06-06T11:30:51Z

+def _safe_int(value) -> int:
+    try:
+        return int(value or 0)
+    except (TypeError, ValueError):
+        return 0


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Malformed file_size values currently disable part of validation.

_safe_int() turns any parse failure into 0, and downstream validate_pdc_file() treats file_size <= 0 as “unknown size”. That means a bad GraphQL file_size value silently weakens integrity checks: if md5sum is also absent, any non-empty partial file will pass validation. Please distinguish “missing” from “invalid” metadata and skip or fail entries with malformed sizes instead of coercing them to zero.

Also applies to: 225-243

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/pdc/client.py` around lines 218 - 222, The helper _safe_int currently masks malformed file_size by returning 0 which makes validate_pdc_file treat invalid sizes as “unknown” and weakens validation; change _safe_int (or add a new strict parser used by validate_pdc_file) to not coerce invalid values to 0 — instead return None or raise a ValueError on parse failure so validate_pdc_file can explicitly reject or skip entries with malformed sizes; update validate_pdc_file to check for None/exception and fail or skip the entry (rather than treating it as unknown/<=0), and apply the same stricter parsing logic to other usages in the same region (lines around the current _safe_int and the blocks referenced at 225-243) so malformed metadata is handled consistently.

coderabbitai · 2026-06-06T11:30:51Z

+def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path:
+    return output_folder / pdc_file.study_id / pdc_file.file_name


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Path traversal risk in target path construction

Line 174 builds a filesystem path directly from remote study_id/file_name. If metadata contains ../ or absolute paths, downloads can escape output_folder and overwrite arbitrary files.

🔒 Proposed fix

def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path: - return output_folder / pdc_file.study_id / pdc_file.file_name + study_id = pdc_file.study_id + file_name = pdc_file.file_name + + if os.path.isabs(study_id) or "/" in study_id or "\\" in study_id or study_id in {"", ".", ".."}: + raise ValueError(f"Invalid study_id path component: {study_id!r}") + if os.path.isabs(file_name) or "/" in file_name or "\\" in file_name or file_name in {"", ".", ".."}: + raise ValueError(f"Invalid file_name path component: {file_name!r}") + + return output_folder / study_id / file_name

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path:

return output_folder / pdc_file.study_id / pdc_file.file_name

def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path:

study_id = pdc_file.study_id

file_name = pdc_file.file_name

if os.path.isabs(study_id) or "/" in study_id or "\\" in study_id or study_id in {"", ".", ".."}:

raise ValueError(f"Invalid study_id path component: {study_id!r}")

if os.path.isabs(file_name) or "/" in file_name or "\\" in file_name or file_name in {"", ".", ".."}:

raise ValueError(f"Invalid file_name path component: {file_name!r}")

return output_folder / study_id / file_name

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/pdc/downloader.py` around lines 173 - 174, The _target_path function constructs a path from untrusted pdc_file.study_id and pdc_file.file_name allowing path traversal; sanitize both components (e.g., strip any path separators, use only the basename/name portion), build the candidate path, resolve it and assert it is inside output_folder.resolve() (raise an error if not), and ensure directories are created safely before writing; update references to _target_path to expect the sanitized/validated Path.

Shen-YuFei and others added 7 commits May 20, 2026 20:21

feat(download): add -t/--threads multi-segment download, replace -w

07397fc

Merge branch 'dev' of https://github.com/Shen-YuFei/pridepy into dev

85799af

Merge pull request #108 from PRIDE-Archive/feat/ftps-walk-progress

51340f5

fix(massive/ftp): emit progress while walking the FTPS tree

Merge remote-tracking branch 'origin/dev' into dev

f9cb6e6

feat(download): add PDC downloads and restore threads option

78877f9

Merge pull request #109 from Shen-YuFei/dev

42d965b

Add multi-segment download feature and restore threads option

qodo-code-review Bot reviewed Jun 6, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 6, 2026

View reviewed changes

ypriverol merged commit 6d9dfc8 into master Jun 8, 2026
10 checks passed

coderabbitai Bot mentioned this pull request Jun 8, 2026

Enable to download entire CPTAC dataset #112

Merged

		def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path:
		return output_folder / pdc_file.study_id / pdc_file.file_name

-def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path:
-    return output_folder / pdc_file.study_id / pdc_file.file_name
+def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path:
+    study_id = pdc_file.study_id
+    file_name = pdc_file.file_name
+    if os.path.isabs(study_id) or "/" in study_id or "\\" in study_id or study_id in {"", ".", ".."}:
+        raise ValueError(f"Invalid study_id path component: {study_id!r}")
+    if os.path.isabs(file_name) or "/" in file_name or "\\" in file_name or file_name in {"", ".", ".."}:
+        raise ValueError(f"Invalid file_name path component: {file_name!r}")
+    return output_folder / study_id / file_name

Uh oh!

Conversation

ypriverol commented Jun 6, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

qodo-code-review Bot commented Jun 6, 2026

Review Summary by Qodo

Walkthroughs

File Changes

Uh oh!

qodo-code-review Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Uh oh!

qodo-code-review Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ypriverol commented Jun 6, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading

qodo-code-review Bot commented Jun 6, 2026 •

edited

Loading