fix(rc9): GA blockers + monetize buy-side fixes from the v0.10.0-rc9 report #583

Merged
OisinKyne merged 7 commits into main from fix/rc9-ga-blockers on Jun 3, 2026

bussyjd (Collaborator) commented on Jun 3, 2026

Summary

What changed: Fixes the v0.10.0-rc9 upgrade-report issues. Each was validated against rc9
source (adversarially re-checked) before fixing.

  • #1 (GA blocker): repin serviceoffer-controller from the f5d94fc side-branch build
    (which predated the Secret-create-only reconciler change) to 503016b@sha256:bec62ea0
    (rc9 commit 503016bf, image 0.10.0-rc9). The old pin Updates per-agent Secrets, which
    the tightened RBAC (no secrets:update/patch) rejects with a 403, so per-agent provisioning
    never converges. Added a tripwire test.
  • #3: x402-buyer HoldSign now drops expired pre-signed auths before signing
    (Permit2 deadline / ERC-3009 validBefore), ending the 503 invalid_payment_expired cascade.
  • #4: buy.py status/list count is expiry-aware (valid vs expired auths).
  • #5: reconcileDeletingPurchase finalizes the delete-drain on the "not found in sidecar status"
    signal instead of requeueing forever (stranded Terminating).
  • P1: suppress the per-request verifyOnly=false warning on the in-process settle path
    (HandleProxy / obol sell inference); the Traefik ForwardAuth path still warns.
  • P2: obol agent new --model X validates X against the LiteLLM registry and fails fast.

Issue #2 (master-Hermes PVC ownership on k3d local-path) is already fixed on main
(eb985bd/671c8ac root-chown init container); this branch inherits it. Per-agent Hermes was
confirmed to be unaffected (it seeds via Secret, not host PVC writes).

Why it matters: #1 and #2 are GA blockers from the rc9 report; the rest are buy-side
correctness / UX fixes on the monetize path.

Risk level: low — the controller fix is a pin bump to an already-published rc9 image; the rest
are narrow, regression-tested behaviour changes. No RBAC widened, no security surface added.

Commit under test: 8cfd0ca1 (live-chain evidence captured at b90118fd; 8cfd0ca1
adds only the dual-stack flows retry, no runtime behaviour change)

Base branch: main

Scope

  • Code
  • Charts / manifests
  • Flows / QA scripts
  • Docs / skills
  • Images / dependencies
  • Other:

Validation

CI checks:

| Check | Status | Link |
| --- | --- | --- |
| (added on push) | pending | |

Unit tests:

go test ./...            → ok (33 packages), gofmt + go vet clean
new regression tests (all PASS, #3/P1 fail without the fix):
  internal/x402/buyer        TestPreSignedSigner_DropsExpiredAuths, TestAuthDeadlineUnix (#3)
  internal/serviceoffercontroller  TestIsSidecarUpstreamGone (#5)
  internal/x402              TestForwardAuth_SettlesInProcess_SuppressesWarning (P1)
  cmd/obol                   TestIsModelConfigured (P2)
  internal/embed             TestServiceOfferControllerImage_CarriesSecretCreateOnlyFix (#1)
                             + dev-rewrite/production-pin guards updated for the new pin
commit: b90118fd

Integration tests:

n/a — covered by the release smoke below.

Flow tests (best result per flow across 3 full smoke runs on local darwin/arm64, k3d):

| Flow | Network | Result | Evidence |
| --- | --- | --- | --- |
| flow-01..10 (single stack incl. buy+lifecycle) | base-sepolia (anvil) | PASS | run5 + run6 |
| flow-11 dual-stack (USDC) | base-sepolia (anvil) | PASS | run6 |
| flow-13 dual-stack-obol (anvil fork) | base-sepolia (anvil fork) | PASS | run5 |
| flow-14 live-obol | base-sepolia (LIVE) | PASS | run5 + run6; on-chain receipts below |

Release smoke:

RELEASE_SMOKE_INCLUDE_OBOL=true RELEASE_SMOKE_INCLUDE_OBOL_FORK=true \
  OBOL_LLM_ENDPOINT=http://<spark>:8000/v1 OBOL_LLM_MODEL=qwen36-deep \
  bash flows/release-smoke.sh
LLM routed through vLLM qwen36-deep (DGX Spark over tailscale).

Every flow passed; no single run reached 13/13 because this macOS/Docker-Desktop +
cloudflare-quick-tunnel box hit a DIFFERENT environment-side transient each pass
(none code-related):
  run5: 12/13 — flow-11 Docker Desktop gRPC-FUSE mount race on Alice cluster create
  run6: 12/13 — flow-13 same mount race on Bob cluster create
  run7:        — flow-07/08 cloudflare quick-tunnel failed to establish (local 402 gate PASS)
The mount race now has a retry (commit 8cfd0ca1). The tunnel flake is external (trycloudflare).

#1 pin bump validated separately on a live cluster: deploying 503016b@sha256:bec62ea0 made a
per-agent `obol agent new` reach Ready with no 403; the prior f5d94fc image reproduces the 403.
Dev-mode rebuilds controller/buyer/verifier from this branch's source, so the smoke exercises
the source fixes directly.

Live Chain Evidence

Network: Base Sepolia (eip155:84532)

RPC/provider: paid drpc load-balancer (redacted)

Facilitator: https://x402.gcp.obol.tech (prometheus-overlay)

Contracts and tokens:

| Name | Address | Version / notes |
| --- | --- | --- |
| OBOL token | 0x0a09371a8b011d5110656ceBCc70603e53FD2c78 | Obol Network / OBOL / 18 dp, Permit2 |
| ERC-8004 Identity Registry | 0x8004a818bfb912233c491871b3d84c89a494bd9e | mint = agentId |

Wallet roles:

| Role | Address | Source |
| --- | --- | --- |
| Alice / seller / register | 0xC0De030F6C37f490594F93fB99e2756703c4297E | seller payTo + funded EOA |
| Bob / buyer / payer | 0x57b0eF875DeB5A37301F1640E469a2129Da9490E | deterministic 2nd-derived from REMOTE_SIGNER_PRIVATE_KEY; bobSigner == BOB_WALLET ✓ |

Balances:

| Token | Address | Before | After | Expected delta | Actual delta |
| --- | --- | --- | --- | --- | --- |
| OBOL (Bob) | 0x0a09371a… | 4949000000000000000 wei | 4948000000000000000 wei | -1000000000000000 (0.001 OBOL) | -1000000000000000 ✓ exact |
| OBOL (Alice) | 0x0a09371a… | | | +1000000000000000 (0.001 OBOL) | +1000000000000000 ✓ exact |

Transaction receipts:

| Purpose | Tx hash | From | To | Amount / event | Status |
| --- | --- | --- | --- | --- | --- |
| ERC-8004 registration | 0xff4cdbbdeea75e578728f097eb35ba230c42cc2410eb67fb2ce910782d2c2863 | Alice | Identity Registry 0x8004a818… | mint agentId 6724 | 0x1 |
| Metadata / service offer | 0x19055e9680e6f6072e0783310364e80c72f89c9836e4700ad78f841894a010c4 | Alice | Identity Registry | setMetadata | 0x1 |
| Settlement transfer | 0x81f86c63992089802beba5fad18525f5bcd2509bcdc95913958c310df400f455 | Bob 0x57b0eF… | Alice 0xC0De03… | OBOL 0.001 (Permit2) | 0x1 |

Runtime Evidence

QA environment:

| Item | Value |
| --- | --- |
| OS / arch | macOS (darwin) / arm64 |
| Backend | k3d (rancher/k3s v1.35.1-k3s1) |
| Tool versions | kubectl 1.35.3, helm 3.20.1, helmfile 1.4.3, k3d 5.8.3 |
| QA agent/model | Hermes via LiteLLM → vLLM qwen36-deep (27B-class) |

Images:

| Component | Image | Tag / digest | Source |
| --- | --- | --- | --- |
| serviceoffer-controller (release pin) | ghcr.io/obolnetwork/serviceoffer-controller | 503016b@sha256:bec62ea0…121957 | rc9 (this PR's repin) |
| serviceoffer-controller/buyer/verifier (smoke) | ghcr.io/obolnetwork/… | :latest | built from this branch (dev mode) |

Kubernetes / stack:

| Item | Value |
| --- | --- |
| Stack IDs | per-run default + alice/bob (petnames) |
| Namespaces | hermes-obol-agent, llm, x402, erpc, traefik, agent-* |
| Pod readiness | all core pods Running (per flow checks) |
| Cleanup result | stacks torn down by release-smoke cleanup trap |

Model and routing:

| Item | Value |
| --- | --- |
| Agent/model used | qwen36-deep (vLLM, enable_thinking=false) |
| LiteLLM route | custom endpoint → host-reachable vLLM; paid/* → x402-buyer sidecar |
| Paid endpoint status | paid/qwen3.5:9b Ready (5 auths loaded) |
| Auth token source | obol agent auth (LiteLLM master key for upstream) |

Artifacts and logs:

| Artifact | Location / link | Notes |
| --- | --- | --- |
| Release report | .tmp/release-smoke-20260603-130613/RELEASE_REPORT.md | run6 per-flow table |
| flow-14 receipts | .tmp/release-smoke-20260603-130613/flow-14-receipts | run6 registration + settlement JSON |

Demo readiness:

| Item | Status | Notes |
| --- | --- | --- |
| Seller visible / registered | | ERC-8004 agentId on Base Sepolia |
| Buyer discovery works | | 402 → probe → pre-sign → PurchaseRequest |
| Paid route works | | paid/* → 200 |
| Settlement visible | | on-chain OBOL Transfer, status 0x1 |

Review Notes

Known gaps:

Follow-ups:

Reviewer focus:

  • internal/embed/infrastructure/base/templates/x402.yaml controller pin + embed_crd_test.go
    tripwire (no RBAC widened).
  • internal/x402/buyer/signer.go expiry filter (USDC validBefore=2106 never dropped).
  • internal/serviceoffercontroller/purchase.go not-found drain case (transient errors still requeue).

bussyjd added 7 commits June 3, 2026 10:24
…et create-only

The pinned serviceoffer-controller image (f5d94fc) was a side-branch build that
predated the change making Secret create-only in the reconciler. The tightened
ClusterRole grants no secrets update/patch verb, so the deployed binary 403s
when it Updates the per-agent hermes-api-server / remote-signer-keystore Secrets
on re-reconcile, and per-agent provisioning never converges.

Repin to 503016b@sha256:bec62ea0 (rc9 commit 503016b, image 0.10.0-rc9), whose
reconciler treats Secret as create-only and matches the shipped RBAC. Add a
tripwire test mirroring the x402-verifier one so a future downgrade can't
silently re-ship the bug. The short-SHA tag keeps the dev-mode :latest rewrite
and production pin invariants intact.
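
A tripwire of this kind can be sketched as a plain string check on the rendered manifest. This is an illustration under stated assumptions, not the test added in this PR (TestServiceOfferControllerImage_CarriesSecretCreateOnlyFix, whose body is not shown here): the function name `checkControllerPin` and the sample manifest lines are hypothetical, and the digest is matched only by the prefix quoted in this description because the full digest is not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// Pin prefix taken from the PR description; the full digest is elided there.
	wantPin = "serviceoffer-controller:503016b@sha256:bec62ea0"
	// Side-branch build that predates the Secret create-only change.
	oldTag = "serviceoffer-controller:f5d94fc"
)

// checkControllerPin fails if the manifest references the pre-fix build
// or does not carry the expected rc9 pin.
func checkControllerPin(manifest string) error {
	if strings.Contains(manifest, oldTag) {
		return fmt.Errorf("manifest still references the pre-fix build %q", oldTag)
	}
	if !strings.Contains(manifest, wantPin) {
		return fmt.Errorf("manifest does not pin %q", wantPin)
	}
	return nil
}

func main() {
	// Hypothetical manifest fragments; "..." stands in for the elided digest.
	good := "image: ghcr.io/obolnetwork/serviceoffer-controller:503016b@sha256:bec62ea0..."
	bad := "image: ghcr.io/obolnetwork/serviceoffer-controller:f5d94fc"

	fmt.Println("good:", checkControllerPin(good))
	fmt.Println("bad: ", checkControllerPin(bad))
}
```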

HoldSign popped s.auths[0] with no deadline check. A pre-signed Permit2 (OBOL)
batch shares one ~5-min deadline, so once expired the buyer served the whole
batch auth-by-auth, each returning 503 invalid_payment_expired from the
verifier before reaching a fresh auth. Add authDeadlineUnix (covering the
Permit2 deadline, nested ERC-3009 validBefore, and legacy flat field) and skip
expired auths at pick time. USDC vouchers use a year-2106 validBefore and are
never dropped.
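
The pick-time filter described above can be sketched as follows. This is a simplified illustration, not the code in internal/x402/buyer/signer.go: the `auth` struct, the `pickValid` function, and the strict "deadline must be later than now" comparison are assumptions, and the real authDeadlineUnix extracts the deadline from Permit2, ERC-3009, and legacy payload shapes rather than from a single field.

```go
package main

import "fmt"

// auth is a hypothetical stand-in for one pre-signed payment authorization.
// DeadlineUnix is its expiry in Unix seconds (Permit2 deadline or
// ERC-3009 validBefore).
type auth struct {
	ID           string
	DeadlineUnix int64
}

// pickValid walks the pool from the front, discards every auth whose
// deadline is not strictly after nowUnix, and returns the first usable
// auth plus the auths that remain behind it. ok is false when the whole
// pool has expired.
func pickValid(pool []auth, nowUnix int64) (picked auth, rest []auth, ok bool) {
	for i, a := range pool {
		if a.DeadlineUnix > nowUnix {
			return a, pool[i+1:], true
		}
	}
	return auth{}, nil, false
}

func main() {
	const now int64 = 1_780_000_000 // arbitrary fixed clock for the example
	pool := []auth{
		{ID: "permit2-a", DeadlineUnix: now - 60}, // expired batch member
		{ID: "permit2-b", DeadlineUnix: now - 60}, // same shared deadline
		{ID: "usdc-c", DeadlineUnix: 4_294_967_295}, // year-2106 validBefore
	}

	picked, rest, ok := pickValid(pool, now)
	fmt.Println(picked.ID, len(rest), ok)
}
```

Skipping at pick time means an expired batch costs no verifier round trips, whereas the old behaviour sent each expired auth to the verifier and received a 503 for every one.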

… is gone

reconcileDeletingPurchase routed the 'not found in sidecar status' error into
the Configured&&Remaining>0 branch, which kept Remaining>0 and requeued every
5s forever, stranding the PurchaseRequest in Terminating until its finalizer was
force-removed. That signal means the sidecar has nothing left to drain. Add a
case (via isSidecarUpstreamGone) that collapses Remaining to 0 so cleanup and
finalizer removal proceed, consistent with the terminal not-found check already
present later in the function. Transient errors still requeue.
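
The new drain case can be sketched as a small decision function. The name isSidecarUpstreamGone comes from this commit, but its body here (a substring match on the error text) is an assumption, and `drainStep`, `errGone`, and `errTransient` are hypothetical names used only for illustration; the real reconciler works on PurchaseRequest status rather than bare integers.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Example errors, hypothetical in wording apart from the quoted signal.
var (
	errGone      = errors.New(`upstream "paid/example" not found in sidecar status`)
	errTransient = errors.New("connection refused")
)

// isSidecarUpstreamGone reports whether err carries the terminal
// "not found in sidecar status" signal. Matching on the message text is
// an assumption made for this sketch.
func isSidecarUpstreamGone(err error) bool {
	return err != nil && strings.Contains(err.Error(), "not found in sidecar status")
}

// drainStep decides what a delete-drain pass should do given the last
// known remaining-auth count and the error from reading sidecar status.
func drainStep(remaining int, statusErr error) (nextRemaining int, requeue bool) {
	switch {
	case isSidecarUpstreamGone(statusErr):
		// Nothing left to drain: collapse to zero so the finalizer can go.
		return 0, false
	case statusErr != nil:
		// Transient failure: keep the count and try again later.
		return remaining, true
	default:
		// Healthy read: requeue only while auths are still outstanding.
		return remaining, remaining > 0
	}
}

func main() {
	fmt.Println(drainStep(5, errGone))
	fmt.Println(drainStep(5, errTransient))
	fmt.Println(drainStep(0, nil))
}
```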

buy.py status/list showed the raw sidecar 'remaining' count, so an all-expired
Permit2 auth pool read as ready to spend. Add _auth_deadline / _count_valid_auths
and surface expired auths in both commands so an operator or agent tops up
instead of burning expired vouchers into 503s.

… path

HandleProxy (and the standalone inference gateway) rebuild the ForwardAuth
middleware per request with VerifyOnly=false by design — they proxy to the real
upstream and settle only after a <400 response — so the verifyOnly=false warning
fired on every paid request telling operators to 'fix' correct config. Add a
SettlesInProcess flag that suppresses the warning on those paths while leaving
the genuinely-dangerous Traefik ForwardAuth path loud.
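
The warning gate can be sketched as a single predicate. The field names VerifyOnly and SettlesInProcess come from this commit; the `forwardAuthConfig` struct and the `shouldWarnVerifyOnly` function are hypothetical and only illustrate the intended truth table, not the middleware's actual layout.

```go
package main

import "fmt"

// forwardAuthConfig is a hypothetical, reduced view of the middleware
// options relevant to the warning.
type forwardAuthConfig struct {
	VerifyOnly       bool // true: verify the payment but leave settlement to another component
	SettlesInProcess bool // true: caller proxies upstream and settles itself after a <400 response
}

// shouldWarnVerifyOnly reports whether the verifyOnly=false warning
// should be logged. It stays loud unless the caller has declared that
// settling in process is intentional.
func shouldWarnVerifyOnly(cfg forwardAuthConfig) bool {
	return !cfg.VerifyOnly && !cfg.SettlesInProcess
}

func main() {
	traefik := forwardAuthConfig{VerifyOnly: false}
	proxy := forwardAuthConfig{VerifyOnly: false, SettlesInProcess: true}

	fmt.Println("traefik ForwardAuth warns:", shouldWarnVerifyOnly(traefik))
	fmt.Println("in-process proxy warns:   ", shouldWarnVerifyOnly(proxy))
}
```

An explicit opt-out flag keeps the default behaviour unchanged: any path that does not set it still gets the warning.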

obol agent new --model X provisioned cleanly for an unknown model, then every
chat call failed with 'no healthy deployments for this model'. Add a preflight
in createCRDAgent that checks a non-empty --model against the LiteLLM registry
and fails fast with the available models. A transient list error warns and
continues; an empty model still lets the controller auto-pin.
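
The three outcomes of the preflight (skip, warn and continue, fail fast) can be sketched like this. Everything named here is hypothetical: `modelLister`, `preflightModel`, `staticModels`, and `failingLister` stand in for whatever createCRDAgent uses to query the LiteLLM registry, and the error wording is illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// modelLister is a hypothetical abstraction over "list the models the
// registry currently knows about".
type modelLister func() ([]string, error)

// preflightModel returns a warning when validation could not be done and
// an error when the requested model is definitely not configured.
func preflightModel(model string, list modelLister) (warning string, err error) {
	if model == "" {
		// No explicit model: leave it to the controller to auto-pin.
		return "", nil
	}
	available, listErr := list()
	if listErr != nil {
		// Transient registry failure: warn, do not block provisioning.
		return fmt.Sprintf("could not list models (%v); skipping validation", listErr), nil
	}
	for _, m := range available {
		if m == model {
			return "", nil
		}
	}
	sorted := append([]string(nil), available...)
	sort.Strings(sorted)
	return "", fmt.Errorf("model %q is not configured; available: %s",
		model, strings.Join(sorted, ", "))
}

// staticModels and failingLister are doubles for the example.
func staticModels(names ...string) modelLister {
	return func() ([]string, error) { return names, nil }
}

func failingLister() ([]string, error) {
	return nil, errors.New("registry unreachable")
}

func main() {
	registry := staticModels("qwen36-deep", "paid/qwen3.5:9b")

	_, err := preflightModel("no-such-model", registry)
	fmt.Println("unknown model:", err)

	warning, err := preflightModel("qwen36-deep", failingLister)
	fmt.Println("list failure: ", warning, err)
}
```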

…races

Docker Desktop on macOS intermittently fails to create the gRPC-FUSE mount
source for a k3d node's workspace data dir under sustained cluster-churn
("error while creating mount source path ...: no such file or directory"),
so the k3s node never reports ready and k3d rolls the cluster back. The host
dir exists; it's a daemon-side file-sharing race. The dual-stack stack-up loop
already retries port-bind and image/Helm transients — extend it to retry this
mount race (a fresh cluster on retry clears it) so the release smoke isn't
flaked by an environment-side Docker hiccup.
@OisinKyne OisinKyne merged commit a2742da into main Jun 3, 2026
9 checks passed
@OisinKyne OisinKyne deleted the fix/rc9-ga-blockers branch June 3, 2026 15:01