Conversation
Large MassIVE timsTOF deposits are spread over thousands of .d directories, each requiring its own PASV+TLS data connection to list. _walk_ftp_tree enumerates the entire tree before any download and was silent throughout, so a multi-minute enumeration looked like a hang (e.g. ~1719 .d in MSV000098940 ≈ 62 min of no output -> users Ctrl-C). Add an INFO progress heartbeat every 100 directories plus a final summary. Pure instrumentation: the returned file list and recursion are unchanged (covered by a mock-tree unit test); existing tests pass.
fix(massive/ftp): emit progress while walking the FTPS tree
Add multi-segment download feature and restore threads option
📝 WalkthroughWalkthroughThis PR adds per-file download thread concurrency ( ChangesPer-file threading infrastructure
PDC file download system
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Review Summary by QodoAdd multi-segment HTTP downloads, PDC support, and FTP progress reporting
WalkthroughsDescription• Add multi-segment HTTP Range download support via download_threads parameter - Replaces -w/--parallel-files with -t/--threads for per-file parallelism - Implements _multipart_download() for concurrent Range requests • Add PDC/CPTAC download support with GraphQL metadata and signed URLs - New pridepy/pdc/ module with client and downloader implementations - Supports mzid, psm, raw, mzml file types with CSV accession lists • Improve FTP tree enumeration progress reporting - Emit INFO heartbeats every 100 directories during FTPS tree walk - Prevents multi-minute hangs on large deposits like MassIVE timsTOF • Refactor URL download logic to always use ThreadPoolExecutor - Simplifies serial/parallel branching; always uses workers pool Diagramflowchart LR
A["CLI: -t/--threads"] --> B["download_threads param"]
B --> C["HTTP URLs"]
B --> D["PRIDE/MassIVE"]
B --> E["PDC/CPTAC"]
C --> F["_multipart_download"]
D --> G["transport layer"]
E --> H["PDC GraphQL API"]
F --> I["HTTP Range requests"]
H --> J["Signed URLs"]
G --> K["FTP tree walk progress"]
File Changes1. pridepy/download/base.py
|
Code Review by Qodo
1. Multipart missing completion check
|
| def _download_range(url, file_path, start, end, pbar, max_retries=3): | ||
| """Download a byte range directly into the target file using seek. | ||
|
|
||
| On transient failures (e.g. ``IncompleteRead``) the segment resumes | ||
| from the last successfully written byte using a fresh Range request | ||
| for the remaining bytes, instead of restarting the whole segment. | ||
| ``pbar`` only accumulates bytes that were actually written, so the | ||
| progress bar stays in sync with on-disk state across retries. | ||
| """ | ||
| cursor = start | ||
| for attempt in range(1, max_retries + 1): | ||
| try: | ||
| session = Util.create_session_with_retries() | ||
| headers = {"Range": f"bytes={cursor}-{end}"} | ||
| with session.get(url, headers=headers, stream=True, timeout=(15, 15)) as r: | ||
| r.raise_for_status() | ||
| if r.status_code != 206: | ||
| raise RuntimeError(f"Server did not honor Range request: {r.status_code}") | ||
| content_range = r.headers.get("Content-Range", "") | ||
| if not content_range.lower().startswith(f"bytes {cursor}-{end}/"): | ||
| raise RuntimeError(f"Unexpected Content-Range header: {content_range}") | ||
| with open(file_path, "r+b") as f: | ||
| f.seek(cursor) | ||
| for chunk in r.iter_content(chunk_size=8 * 1024 * 1024): | ||
| if chunk: | ||
| f.write(chunk) | ||
| cursor += len(chunk) | ||
| pbar.update(len(chunk)) | ||
| return |
There was a problem hiding this comment.
1. Multipart missing completion check 🐞 Bug ≡ Correctness
transport._download_range returns success without verifying it wrote the full requested byte range; because _multipart_download preallocates the file to the expected size, a truncated segment can leave zero-filled holes while the final file size still looks correct. This can silently corrupt downloads when servers close connections mid-stream without raising.
Agent Prompt
## Issue description
`_download_range()` streams bytes for a Range request but never asserts that the downloaded segment reached `end` (i.e., `cursor == end + 1`) before returning. With `_prepare_multipart_target()` preallocating the file, incomplete segments can leave zero-filled gaps that are not detectable via file size.
## Issue Context
The single-stream downloader `_parallel_download()` includes a post-transfer integrity check to catch silent truncation; multipart currently lacks an equivalent check.
## Fix Focus Areas
- pridepy/download/transport.py[628-656]
- pridepy/download/transport.py[613-625]
## Implementation notes
- After the `iter_content` loop, add a completeness assertion like:
- `expected = end + 1`
- if `cursor != expected`: raise `RuntimeError` describing the shortfall.
- Consider also validating `Content-Length` for the range response when present, but the cursor check is the critical guard.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pridepy/pridepy.py (1)
678-680:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate protocol help text to reflect threaded multipart behavior.
Line 679 says
globususes a single-connection stream, but this command now accepts--threadsand can do multipart range downloads when threads > 1. The help text is now inaccurate.Suggested diff
- help="Download strategy. ftp (default): single-connection per URL scheme. " - "globus: resume-capable http/https downloads (single-connection stream).", + help="Download strategy. ftp (default): single-connection per URL scheme. " + "globus: resume-capable http/https downloads (supports multipart ranges with --threads).",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pridepy/pridepy.py` around lines 678 - 680, The help string for the download strategy currently claims "globus: resume-capable http/https downloads (single-connection stream)" but is outdated because the command supports multipart range downloads when --threads > 1; update the help text in the argparse add_argument call that sets the download strategy (the help parameter that mentions "ftp (default)" and "globus") to indicate that globus is resume-capable for http/https and supports threaded multipart/range downloads when --threads is greater than 1, referring to the --threads option by name so users know multipart behavior is enabled by increasing threads.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pridepy/download/by_url.py`:
- Around line 194-202: The current computation of workers can produce 0 when
parallel_files is 0, causing ThreadPoolExecutor(max_workers=workers) to raise;
change the clamp to ensure at least one worker by computing workers = max(1,
min(parallel_files, 3, len(urls))) (matching download_http_urls()), so the
serial fallback remains when parallel_files <= 0 before creating the
ThreadPoolExecutor and keeping the existing log/conditional behavior that checks
workers > 1.
In `@pridepy/download/transport.py`:
- Around line 685-692: The helper _prepare_multipart_target currently treats a
zero-filled preallocated file as a finished download; change it to not consider
size alone as completion by using a separate completion marker or atomic rename
strategy: create the preallocated temporary file (e.g., file_path + ".part") in
_prepare_multipart_target and only treat the download as complete when a final
marker (e.g., file_path + ".complete") is written or the temp file is atomically
renamed to file_path after _http_download_one finishes and validates ranges;
update _http_download_one to look for and write the same marker/rename on
successful full multipart assembly and to ignore placeholder files that are
merely preallocated.
- Around line 639-656: The code using Util.create_session_with_retries() and
session.get(...) must verify the requested byte range was fully written before
returning: after the iter_content loop in this block (the with session.get(...)
as r: scope that checks r.status_code and Content-Range), assert that cursor ==
end + 1 (or raise a RuntimeError) — if not, treat it as a truncated 206 response
and handle accordingly (raise so caller can retry or fall back to
_parallel_download). Make this check immediately before the `return` and include
the unique symbols `session.get`, `r.iter_content`, `cursor`, `end`, and
`pbar.update` to locate where to add the verification.
In `@pridepy/pdc/client.py`:
- Around line 218-222: The helper _safe_int currently masks malformed file_size
by returning 0 which makes validate_pdc_file treat invalid sizes as “unknown”
and weakens validation; change _safe_int (or add a new strict parser used by
validate_pdc_file) to not coerce invalid values to 0 — instead return None or
raise a ValueError on parse failure so validate_pdc_file can explicitly reject
or skip entries with malformed sizes; update validate_pdc_file to check for
None/exception and fail or skip the entry (rather than treating it as
unknown/<=0), and apply the same stricter parsing logic to other usages in the
same region (lines around the current _safe_int and the blocks referenced at
225-243) so malformed metadata is handled consistently.
- Around line 141-148: The code incorrectly treats any string whose basename
happens to exist in CWD as a local CSV; change the logic to first detect whether
the argument actually looks like a path (contains os.path.sep or endswith
'.csv') and only then use Path(accession) and call _read_accessions_from_csv; if
it looks like a CSV path but the file doesn't exist raise the "Accession CSV not
found" error. Apply this change to the block using source_path and
_read_accessions_from_csv and the analogous block at lines ~171-175 so bare
accessions (e.g., "PDC000109") are not interpreted as local paths.
- Around line 111-115: Normalize CSV headers immediately after creating
csv.DictReader by sanitizing reader.fieldnames: trim surrounding whitespace,
strip any BOM (e.g. leading '\ufeff'), and normalize casing/characters (e.g.
lower-case and convert hyphens to underscores) before calling _study_id_column
and _file_type_column. Use the sanitized list in place of reader.fieldnames (or
pass it to the detection helpers) so functions _study_id_column(columns, path)
and _file_type_column(columns) receive normalized headers and can match names
like "pdc_id", "pdc_id " or "File-Type".
In `@pridepy/pdc/downloader.py`:
- Around line 173-174: The _target_path function constructs a path from
untrusted pdc_file.study_id and pdc_file.file_name allowing path traversal;
sanitize both components (e.g., strip any path separators, use only the
basename/name portion), build the candidate path, resolve it and assert it is
inside output_folder.resolve() (raise an error if not), and ensure directories
are created safely before writing; update references to _target_path to expect
the sanitized/validated Path.
---
Outside diff comments:
In `@pridepy/pridepy.py`:
- Around line 678-680: The help string for the download strategy currently
claims "globus: resume-capable http/https downloads (single-connection stream)"
but is outdated because the command supports multipart range downloads when
--threads > 1; update the help text in the argparse add_argument call that sets
the download strategy (the help parameter that mentions "ftp (default)" and
"globus") to indicate that globus is resume-capable for http/https and supports
threaded multipart/range downloads when --threads is greater than 1, referring
to the --threads option by name so users know multipart behavior is enabled by
increasing threads.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f0eb45c7-bcbb-46ee-83fa-503673edf22f
📒 Files selected for processing (15)
README.mdpridepy/download/base.pypridepy/download/by_url.pypridepy/download/client.pypridepy/download/pride.pypridepy/download/transport.pypridepy/pdc/__init__.pypridepy/pdc/client.pypridepy/pdc/downloader.pypridepy/pridepy.pypridepy/tests/test_cli_flatten.pypridepy/tests/test_download_by_url.pypridepy/tests/test_pdc_cli.pypridepy/tests/test_pdc_client.pypridepy/tests/test_pdc_downloader.py
| workers = min(parallel_files, 3, len(urls)) | ||
| failures: List[Tuple[str, str]] = [] | ||
|
|
||
| if parallel_files < 2: | ||
| for url in urls: | ||
| if workers > 1: | ||
| logging.info( | ||
| "Downloading %d URL(s) with %d parallel workers", | ||
| len(urls), workers, | ||
| ) | ||
| with ThreadPoolExecutor(max_workers=workers) as executor: |
There was a problem hiding this comment.
Clamp parallel_files before constructing the executor.
This path now always creates a ThreadPoolExecutor, so parallel_files=0 becomes max_workers=0 and fails immediately. download_http_urls() in pridepy/download/transport.py already uses max(1, min(...)); matching that here preserves the old serial fallback instead of turning a non-positive input into a runtime error.
Suggested fix
- workers = min(parallel_files, 3, len(urls))
+ workers = max(1, min(parallel_files, 3, len(urls)))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| workers = min(parallel_files, 3, len(urls)) | |
| failures: List[Tuple[str, str]] = [] | |
| if parallel_files < 2: | |
| for url in urls: | |
| if workers > 1: | |
| logging.info( | |
| "Downloading %d URL(s) with %d parallel workers", | |
| len(urls), workers, | |
| ) | |
| with ThreadPoolExecutor(max_workers=workers) as executor: | |
| workers = max(1, min(parallel_files, 3, len(urls))) | |
| failures: List[Tuple[str, str]] = [] | |
| if workers > 1: | |
| logging.info( | |
| "Downloading %d URL(s) with %d parallel workers", | |
| len(urls), workers, | |
| ) | |
| with ThreadPoolExecutor(max_workers=workers) as executor: |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pridepy/download/by_url.py` around lines 194 - 202, The current computation
of workers can produce 0 when parallel_files is 0, causing
ThreadPoolExecutor(max_workers=workers) to raise; change the clamp to ensure at
least one worker by computing workers = max(1, min(parallel_files, 3,
len(urls))) (matching download_http_urls()), so the serial fallback remains when
parallel_files <= 0 before creating the ThreadPoolExecutor and keeping the
existing log/conditional behavior that checks workers > 1.
| try: | ||
| session = Util.create_session_with_retries() | ||
| headers = {"Range": f"bytes={cursor}-{end}"} | ||
| with session.get(url, headers=headers, stream=True, timeout=(15, 15)) as r: | ||
| r.raise_for_status() | ||
| if r.status_code != 206: | ||
| raise RuntimeError(f"Server did not honor Range request: {r.status_code}") | ||
| content_range = r.headers.get("Content-Range", "") | ||
| if not content_range.lower().startswith(f"bytes {cursor}-{end}/"): | ||
| raise RuntimeError(f"Unexpected Content-Range header: {content_range}") | ||
| with open(file_path, "r+b") as f: | ||
| f.seek(cursor) | ||
| for chunk in r.iter_content(chunk_size=8 * 1024 * 1024): | ||
| if chunk: | ||
| f.write(chunk) | ||
| cursor += len(chunk) | ||
| pbar.update(len(chunk)) | ||
| return |
There was a problem hiding this comment.
Verify that each range actually reached end + 1 before returning.
This worker exits as soon as iter_content() stops, but unlike _parallel_download it never confirms that the requested segment was fully written. If a 206 body ends early without surfacing as an exception, the untouched bytes from preallocation stay in place and the whole file is still reported as complete.
Suggested fix
with open(file_path, "r+b") as f:
f.seek(cursor)
for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
if chunk:
f.write(chunk)
cursor += len(chunk)
pbar.update(len(chunk))
+ if cursor != end + 1:
+ raise RuntimeError(
+ f"Incomplete range {start}-{end}: wrote through {cursor - 1}"
+ )
return🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pridepy/download/transport.py` around lines 639 - 656, The code using
Util.create_session_with_retries() and session.get(...) must verify the
requested byte range was fully written before returning: after the iter_content
loop in this block (the with session.get(...) as r: scope that checks
r.status_code and Content-Range), assert that cursor == end + 1 (or raise a
RuntimeError) — if not, treat it as a truncated 206 response and handle
accordingly (raise so caller can retry or fall back to _parallel_download). Make
this check immediately before the `return` and include the unique symbols
`session.get`, `r.iter_content`, `cursor`, `end`, and `pbar.update` to locate
where to add the verification.
| def _prepare_multipart_target(file_path, total_size): | ||
| if os.path.exists(file_path) and os.path.getsize(file_path) == total_size: | ||
| logging.info("File already complete: %s", file_path) | ||
| return False | ||
|
|
||
| with open(file_path, "wb") as pre: | ||
| pre.truncate(total_size) | ||
| return True |
There was a problem hiding this comment.
Don't treat a preallocated multipart target as a finished download.
This helper marks file_path as "already complete" purely from os.path.getsize(file_path) == total_size, but Line 690 preallocates that exact size before any range bytes are fetched. An interrupted multipart run therefore leaves a full-size placeholder behind, and the next call returns early while _http_download_one() logs success without retransferring or validating the contents.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pridepy/download/transport.py` around lines 685 - 692, The helper
_prepare_multipart_target currently treats a zero-filled preallocated file as a
finished download; change it to not consider size alone as completion by using a
separate completion marker or atomic rename strategy: create the preallocated
temporary file (e.g., file_path + ".part") in _prepare_multipart_target and only
treat the download as complete when a final marker (e.g., file_path +
".complete") is written or the temp file is atomically renamed to file_path
after _http_download_one finishes and validates ranges; update
_http_download_one to look for and write the same marker/rename on successful
full multipart assembly and to ignore placeholder files that are merely
preallocated.
| with path.open("r", encoding="utf-8", newline="") as handle: | ||
| reader = csv.DictReader(handle) | ||
| columns = reader.fieldnames or [] | ||
| study_column = _study_id_column(columns, path) | ||
| file_type_column = _file_type_column(columns) |
There was a problem hiding this comment.
Normalize CSV headers before column detection.
csv.DictReader preserves BOMs and stray spaces, so headers such as pdc_id, pdc_id , or File-Type will currently miss both _study_id_column() and _file_type_column() even though the columns are present. Normalizing fieldnames up front makes the CLI much more tolerant of spreadsheet-exported CSVs.
Suggested change
- with path.open("r", encoding="utf-8", newline="") as handle:
+ with path.open("r", encoding="utf-8-sig", newline="") as handle:
reader = csv.DictReader(handle)
- columns = reader.fieldnames or []
+ reader.fieldnames = [
+ str(column).strip().lower() for column in (reader.fieldnames or [])
+ ]
+ columns = reader.fieldnames🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pridepy/pdc/client.py` around lines 111 - 115, Normalize CSV headers
immediately after creating csv.DictReader by sanitizing reader.fieldnames: trim
surrounding whitespace, strip any BOM (e.g. leading '\ufeff'), and normalize
casing/characters (e.g. lower-case and convert hyphens to underscores) before
calling _study_id_column and _file_type_column. Use the sanitized list in place
of reader.fieldnames (or pass it to the detection helpers) so functions
_study_id_column(columns, path) and _file_type_column(columns) receive
normalized headers and can match names like "pdc_id", "pdc_id " or "File-Type".
| source_path = Path(accession).expanduser() | ||
| if source_path.exists(): | ||
| if source_path.suffix.lower() != ".csv": | ||
| raise ValueError(f"Only CSV accession files are supported: {source_path}") | ||
| return _read_accessions_from_csv(source_path) | ||
|
|
||
| if source_path.suffix.lower() == ".csv" or os.path.sep in accession: | ||
| raise ValueError(f"Accession CSV not found: {source_path}") |
There was a problem hiding this comment.
Avoid treating bare accessions as local paths.
Both parsers use Path(accession).exists() before deciding whether the input is a CSV path. If the current working directory already contains a file or folder named like a valid study ID (for example PDC000109), the CLI will reject that accession as a non-CSV path instead of downloading it. Only enter the file-input branch when the argument actually looks like a CSV path.
Also applies to: 171-175
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pridepy/pdc/client.py` around lines 141 - 148, The code incorrectly treats
any string whose basename happens to exist in CWD as a local CSV; change the
logic to first detect whether the argument actually looks like a path (contains
os.path.sep or endswith '.csv') and only then use Path(accession) and call
_read_accessions_from_csv; if it looks like a CSV path but the file doesn't
exist raise the "Accession CSV not found" error. Apply this change to the block
using source_path and _read_accessions_from_csv and the analogous block at lines
~171-175 so bare accessions (e.g., "PDC000109") are not interpreted as local
paths.
| def _safe_int(value) -> int: | ||
| try: | ||
| return int(value or 0) | ||
| except (TypeError, ValueError): | ||
| return 0 |
There was a problem hiding this comment.
Malformed file_size values currently disable part of validation.
_safe_int() turns any parse failure into 0, and downstream validate_pdc_file() treats file_size <= 0 as “unknown size”. That means a bad GraphQL file_size value silently weakens integrity checks: if md5sum is also absent, any non-empty partial file will pass validation. Please distinguish “missing” from “invalid” metadata and skip or fail entries with malformed sizes instead of coercing them to zero.
Also applies to: 225-243
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pridepy/pdc/client.py` around lines 218 - 222, The helper _safe_int currently
masks malformed file_size by returning 0 which makes validate_pdc_file treat
invalid sizes as “unknown” and weakens validation; change _safe_int (or add a
new strict parser used by validate_pdc_file) to not coerce invalid values to 0 —
instead return None or raise a ValueError on parse failure so validate_pdc_file
can explicitly reject or skip entries with malformed sizes; update
validate_pdc_file to check for None/exception and fail or skip the entry (rather
than treating it as unknown/<=0), and apply the same stricter parsing logic to
other usages in the same region (lines around the current _safe_int and the
blocks referenced at 225-243) so malformed metadata is handled consistently.
| def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path: | ||
| return output_folder / pdc_file.study_id / pdc_file.file_name |
There was a problem hiding this comment.
Path traversal risk in target path construction
Line 174 builds a filesystem path directly from remote study_id/file_name. If metadata contains ../ or absolute paths, downloads can escape output_folder and overwrite arbitrary files.
🔒 Proposed fix
def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path:
- return output_folder / pdc_file.study_id / pdc_file.file_name
+ study_id = pdc_file.study_id
+ file_name = pdc_file.file_name
+
+ if os.path.isabs(study_id) or "/" in study_id or "\\" in study_id or study_id in {"", ".", ".."}:
+ raise ValueError(f"Invalid study_id path component: {study_id!r}")
+ if os.path.isabs(file_name) or "/" in file_name or "\\" in file_name or file_name in {"", ".", ".."}:
+ raise ValueError(f"Invalid file_name path component: {file_name!r}")
+
+ return output_folder / study_id / file_name📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path: | |
| return output_folder / pdc_file.study_id / pdc_file.file_name | |
| def _target_path(output_folder: Path, pdc_file: PDCFile) -> Path: | |
| study_id = pdc_file.study_id | |
| file_name = pdc_file.file_name | |
| if os.path.isabs(study_id) or "/" in study_id or "\\" in study_id or study_id in {"", ".", ".."}: | |
| raise ValueError(f"Invalid study_id path component: {study_id!r}") | |
| if os.path.isabs(file_name) or "/" in file_name or "\\" in file_name or file_name in {"", ".", ".."}: | |
| raise ValueError(f"Invalid file_name path component: {file_name!r}") | |
| return output_folder / study_id / file_name |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pridepy/pdc/downloader.py` around lines 173 - 174, The _target_path function
constructs a path from untrusted pdc_file.study_id and pdc_file.file_name
allowing path traversal; sanitize both components (e.g., strip any path
separators, use only the basename/name portion), build the candidate path,
resolve it and assert it is inside output_folder.resolve() (raise an error if
not), and ensure directories are created safely before writing; update
references to _target_path to expect the sanitized/validated Path.
Summary by CodeRabbit
Release Notes
New Features
--threadsparameter (1–32) across download commandsdownload-pdc-filescommand for downloading PDC/CPTAC cancer data files with signed HTTPS URLsDocumentation
download-pdc-filescommand documentation and download exampleTests