Skip to content

Transfer triage spike: CORA-owned TransferPort + Globus adapter (not for merge)#362

Draft
xmap wants to merge 8 commits into
mainfrom
worktree-transferport-spike
Draft

Transfer triage spike: CORA-owned TransferPort + Globus adapter (not for merge)#362
xmap wants to merge 8 commits into
mainfrom
worktree-transferport-spike

Conversation

@xmap

@xmap xmap commented Jun 25, 2026

Copy link
Copy Markdown
Owner

What this is

A design-triage spike, not a merge candidate. It answers the open question from the transfer-orchestration design: is a transfer a step the conductor walks, or a long-running edge job with its own lifecycle? It builds the CORA-owned TransferPort, a test double, a 2-BM/DMagic scenario, and a real GlobusTransferPort adapter, then lets the shape settle the question.

The build trigger has not fired and there is no production consumer yet, so this is intentionally parked on a branch.

Commits

  1. feat(operation): CORA-owned TransferPort + in-memory double (triage spike)
  2. feat(operation): GlobusTransferPort adapter over globus-sdk (triage spike)

CORA-owned, not Globus-shaped

The port is shaped from CORA's own needs and the substrate is tested against it, not copied from it:

  • Verbs begin / observe / cancel (a non-blocking observe-loop), not the corpus submit/poll/cancel.
  • TransferState = Pending / Active / Suspended / Succeeded / Failed / Cancelled; a partial move is carried as files_failed > 0 on a terminal Failed (no PartiallyFailed enum, deferred until a substrate has a native one).
  • idempotency_key is on the request because CORA's replay-determinism needs it; actuation_kind is deliberately off the surface (a route-layer concern, as on the control path).
  • GlobusTransferPort is a pure ACL: injected TransferClient, status + error mapping, asyncio.to_thread around the sync client. A TYPE_CHECKING conformance line fails the type-check if the real client ever drifts from the seam.

Triage conclusion

A transfer is a long-running edge job (begin-then-observe-loop, waits through a non-terminal Suspended on credential expiry, folds a partial terminal), driven by the existing EdgeConductor like the Run FSM. It is not a synchronous conductor step like ComputeStep. The step union is untouched.

Verification

ruff, ruff-format, pyright (strict), tach, the full architecture suite, and the operation unit module are all green; naming-r3 OK on both the port and the adapter. Each commit passed the pre-commit gate.

Caveat: the Globus adapter is unit-tested against a fake TransferClient; it has not been run against a live Globus endpoint.

Before any merge

  • Full gate-review panel.
  • A credentialed live-Globus sanity run.
  • The build trigger must fire (a promote/publish-blocking custody invariant, a substrate with a native partial terminal, or a chain-of-custody the Attestation path cannot carry). The lever is the beamline storage-tier answers, not code.

🤖 Generated with Claude Code

xmap and others added 2 commits June 25, 2026 11:46
…pike)

Finalizes the data-transfer design triage with a CORA-shaped seam rather
than a Globus transliteration. The verbs are begin/observe/cancel (a
non-blocking observe-loop), the state set is Pending/Active/Suspended/
Succeeded/Failed/Cancelled, and a partial move is carried as files_failed>0
on a terminal Failed (no PartiallyFailed enum yet). The in-memory double
plus a 2-BM/DMagic scenario show the deciding fact: a transfer is a
long-running edge job (it waits through a non-terminal Suspended and folds a
partial terminal), not a synchronous conductor step like ComputeStep.

Not for merge: no production consumer yet and the build trigger has not
fired. TransferPort takes the bare-verb port-suffix carve-out alongside
ControlPort/ComputePort.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pike)

The first real TransferPort substrate, validating the CORA-owned shape
against an outside adapter rather than a fake alone. It takes an
already-authorized globus_sdk TransferClient by injection (the OAuth2 dance
stays a composition-root concern), builds a TransferData submission, and maps
Globus task status into TransferState (INACTIVE -> Suspended, a FAILED task
with subtasks_failed>0 -> the partial signal). Globus calls run in
asyncio.to_thread since the client is synchronous. Error mapping: NetworkError
-> EndpointUnreachable; GlobusAPIError dispatched on http_status ->
AccessDenied / EndpointUnreachable / Rejected.

Adds globus-sdk>=3,<4 (resolved 3.65) as a hard dep, following the aioca/p4p
substrate-adapter convention. Unit-tested against a fake TransferClient; NOT
run against a live Globus endpoint (needs credentials + two collections).
Still a spike: no production consumer until the build trigger fires.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  apps/api/src/cora/api
  _distribution_materializer.py 170-173
  apps/api/src/cora/infrastructure/ports
  dataset_distribution_lookup.py
  apps/api/src/cora/operation/adapters
  fdt_transfer_port.py 71-72, 86, 89-91
  globus_transfer_port.py
  in_memory_transfer_port.py 125
  apps/api/src/cora/operation/ports
  __init__.py
  transfer_port.py 249-251, 264-266, 278-279
Project Total  

This report was generated by python-coverage-comment-action

xmap and others added 6 commits June 25, 2026 20:26
The acquisition-to-analysis stage-in substrate for stage-then-reconstruct, the
sibling of GlobusTransferPort (which serves the outer user-delivery leg). It
runs an APS Fast Data Transfer (fdt.jar) client as a subprocess via an injected
TransferRunner and maps the exit code into a TransferState. Integrity and sync
are deferred on purpose: checksum-on-arrival is recorded as an Attestation in
the materialize-a-Distribution edge job that consumes this port, and general
sync belongs to richer substrates (Globus). Progress is coarse (a subprocess
exposes no per-file counters).

Grounded in the real 2-BM pipeline (2bm-docs ops/item_018 + item_025): the
per-scan tomdet:/local1 -> /data2 copy via fdt.jar or scp is exactly the leg
reconstruction on the analysis nodes waits on. Unit-tested against a fake
runner; not run against a real fdt.jar.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…onstruct (triage spike)

Sequences the materialize-a-Distribution flow: move bytes over a TransferPort,
then on success register a new analysis-tier Distribution of the same raw
Dataset (register_distribution) and record a checksum Attestation
(record_attestation), whose Match flips the Distribution to Verified in the Data
BC projection. That Verified-at-tier fact is exactly what leg C's start_run gate
reads.

It owns the sequence and the transfer gate (registers only on a Succeeded move,
waits through a non-terminal Suspended); it trusts the caller's
RegisterDistribution (byte-identical-copy fields built from the parent Dataset).
Lives in cora.api, the only module that may reach both operation.ports and the
Data BC handlers; injected collaborators, unit-tested against the in-memory
TransferPort plus fake handlers. Not yet wired into the EdgeConductor or the app
(deferred to the real build, post gate-review of leg C).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… 1 (triage spike)

The cross-BC seam the start_run input gate (leg C of stage-then-reconstruct)
will read: given an input Dataset, return its non-Discarded Distributions with
status, so the decider can gate on Verified AND distinguish Stale from absent.
Mirrors SupplyLookup (one Data-BC adapter, multiple consumers; the port returns
rows, the decider partitions on status); lives in cora.infrastructure.ports
because Run may not import the Data-internal DistributionLookup (the Edition
canonical-pick query, a different need).

Ships the Protocol + result DTO + two test stubs; no consumer yet. The Method
input-role declaration, the Run input-Dataset binding, and the start_run
genesis Verified-gate (plus the Data BC Postgres adapter) are the following
sub-slices; the gate sub-slice is where gate-review-before-merge bites.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…triage spike)

Adds an optional additive frozenset[str] field to the Method aggregate: the
kinds/roles of input Datasets a reconstruction consumes (raw-projections,
flat-field, prior-reconstruction). The recipe declares the input need, the way
it already declares needed_supplies / needed_family_ids; a per-Run binding will
resolve the concrete input Dataset (sub-slice C3), and the start_run gate reads
its Verified Distribution (C4).

Mirrors needed_supplies across state / event / command / decider / evolver (all
seven evolver arms carry it), with one deliberate difference: it participates in
the Method content hash CONDITIONALLY (rendered only when non-empty, like
role_kind) so existing Methods' content_hash bytes stay byte-stable.
Optional-by-default, so a Method declaring no input kinds is byte-identical to
before and no existing define_method caller, scenario, or content-hash golden
changes. New error InvalidMethodNeededInputKindsError registered as a 400.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…spike)

Adds an optional additive frozenset[UUID] field to the Run aggregate: the input
Dataset id(s) a reconstruction consumes (PROV `used`, targeting the Dataset
entity), the per-Run binding that resolves Method.needed_input_kinds (C2).
Mirrors pinned_calibration_ids exactly: id-only atomic refs, NO cross-BC
existence check (eventual-consistency), cardinality-validated (<=64) via
validate_input_dataset_ids + InvalidInputDatasetsError (registered 400), threaded
through the StartRun command, the RunStarted payload (always-rendered), and all
11 reconstructing evolver arms.

Optional-by-default and domain-only: no projection column (so no migration), the
field is empty for existing callers so the start_run route body and MCP tool are
unchanged. The start_run gate that reads each input Dataset's Verified
Distribution is the next sub-slice (C4), which needs gate-review before merge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…b-slice 4 (triage spike)

The spine gate that connects legs A-C: a reconstruction Run cannot start unless
every input Dataset it declares (Run.input_dataset_ids, C3) has a Verified
Distribution. It is a genesis-only precondition in the start_run decider, NOT in
check_safety_envelope (which the RunSupervisor resume re-check shares), so the
gate runs only at first start. The handler pre-loads each input Dataset's
non-Discarded Distributions via the cross-BC DatasetDistributionLookup (C1) into
RunStartContext; the pure decider requires at least one with status Verified,
else RunInputNotVerifiedError (409). Reachability/tier is deferred (present +
Verified only).

Wiring mirrors supply_lookup end-to-end: a dataset_distribution_lookup field on
the Kernel (in-memory default NoDatasetDistributionsLookup at the single
construction site), and a production PostgresDatasetDistributionLookup in the
Data BC reading proj_data_distribution_summary, bound at the composition root.
Dormant by default: empty input_dataset_ids skips the lookup and passes the gate
trivially, so no existing Run, scenario, or contract changes.

Verified locally: unit (10900) + run-unit (651) + architecture (27116, incl. the
single-Kernel-construction-site, routes-completeness, no-em-dash gates) +
integration (38 start_run/distribution) all green. A full gate-review-before-merge
plus the contract suite remain the pre-merge gates for this start_run change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant