Queued outbox circuit breaker#778
Conversation
Introduce the outbound delivery circuit breaker state machine and its public configuration types. The new implementation tracks per-host closed, open, and half-open states in the configured key-value store so queued delivery integration can make hold, probe, and drop decisions. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Wire the outbound circuit breaker into queued outbox processing. The worker now opens per-host circuits for network and 5xx failures, holds queued activities without consuming retry attempts, honors Retry-After for 429 responses without counting them as circuit failures, and reports held activity expiry through the drop and permanent-failure callbacks. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Add circuit breaker state-change metrics and span events, then document the queued delivery circuit breaker in the manual and changelog. The OpenTelemetry manual now lists the circuit breaker state-change metric and the bounded span event attributes emitted by queued outbox delivery. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Avoid constructing the outbound delivery circuit breaker when a federation has no outbox queue. The feature only applies to queued outbox delivery, so queue-less federations should not initialize the state tracker or emit CAS warnings for storage they will never use. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Skip outbound circuit breaker accounting when a delivery error is already classified as a permanent failure. This keeps custom permanent 5xx statuses from opening a remote host circuit before the message is abandoned. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Always ask the outbound retry policy whether another retry is allowed, even when the remote inbox returned Retry-After. Retry-After still wins as the delay for an allowed retry, but a policy returning null now abandons the queued activity instead of re-enqueueing forever. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Record when a recovery probe enters the half-open state and allow a later worker to claim a new probe after the release interval has passed. This keeps queued delivery from getting stuck forever if the worker that won the first probe crashes before recording success or failure. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Treat out-of-range Retry-After values as unusable instead of letting Temporal throw while handling a delivery failure. Malformed rate-limit headers now fall back to the configured retry policy, so queue processing can continue normally. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Bound the default circuit breaker failure history to the timestamps still needed by its threshold and window policy. Sporadic failures outside the window no longer make closed-state KV values grow without limit. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Treat configured permanent 4xx delivery responses as reachable probe responses for the outbound circuit breaker. Half-open circuits now close before the permanent-failure path abandons the activity, while configured permanent 5xx responses still do not open the circuit. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Record activitypub.remote.host from URL.host so non-default ports are kept while paths and query strings remain excluded. Circuit breaker keys now separate services on the same hostname but different ports, and the OpenTelemetry manual and changelog describe the updated attribute meaning. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Do not let circuit breaker bookkeeping failures after a successful sendActivity() call flow into the delivery-failure retry path. The activity has already reached the remote inbox, so state update failures are logged without re-enqueueing the outbox message. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Handle circuit breaker drop decisions after a failed half-open probe send, so probes that expire during the send are abandoned through the same drop path as already-expired held activities instead of falling through to retry handling. Also use the kebab-case permanent failure reason "circuit-breaker-ttl" for the newly added TTL drop signal. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Configured permanent delivery failures should not keep a remote host in half-open state. Clear the circuit for every configured permanent SendActivityError, including custom permanent 5xx responses, so later deliveries are not throttled by stale probe state. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Circuit breaker bookkeeping is auxiliary to outbox failure handling. Log errors from failure recording or post-failure circuit decisions and then keep using the existing permanent-failure and retry paths, so transient KV or CAS errors do not stop delivery retries. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Held activity TTL should only apply while a remote host circuit is still open or half-open. Read the stored circuit state before checking the TTL so messages that wake after another probe has already closed the circuit can be delivered normally. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Retry-After is rate-limit retry metadata, not circuit breaker state. Parse 429 Retry-After responses before circuit breaker bookkeeping so applications that disable circuit breaker handling still use the server-provided retry delay. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Half-open circuits now keep later held messages queued while the current recovery probe is still leased. The release interval controls how often those messages retry the decision, while the recovery delay bounds stale half-open recovery after a worker exits before recording success or failure. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Held activities should wake no later than their configured TTL even when the recovery delay or half-open release interval is longer. Cap open and half-open hold delays at the held activity expiry so the next worker can drop expired activities on time. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Held delivery span events should describe the circuit state that actually caused the hold. Carry the state on hold decisions so half-open circuits with an in-flight probe are not reported as open. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Invalid Retry-After dates should not override the configured retry policy. Strictly accept only delay-seconds or HTTP-date values, and carry real circuit state changes on probe decisions so stale half-open probe refreshes do not emit a false open-to-half-open span. fedify-dev#620 Assisted-by: Codex:gpt-5.5
When a send failure opens the remote host circuit, the failed activity is held immediately instead of passing through the pre-send hold path. Record the same held span event there so the first held delivery is visible in outbox telemetry. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Retry-After retries are scheduled manually even when the queue normally handles native retries. Carry the original ordering key into that enqueue call so ordered outbox delivery keeps its per-inbox ordering metadata. fedify-dev#620 Assisted-by: Codex:gpt-5.5
CAS contention while selecting a recovery probe should not leave queued outbox workers spinning forever. After repeated misses, keep the activity held briefly so another worker or later retry can observe the settled state. fedify-dev#620 Assisted-by: Codex:gpt-5.5
Held-delivery span events should use the same bounded circuit state values as state-change telemetry. Normalize half-open holds to half_open so consumers do not need to handle two spellings for the same state. fedify-dev#620 Assisted-by: Codex:gpt-5.5
|
/gemini review |
|
@codex review |
There was a problem hiding this comment.
Code Review
This pull request introduces an outbound delivery circuit breaker for queued outbox delivery in Fedify. It tracks consecutive network and HTTP 5xx delivery failures per remote host, stores the state in the configured KvStore, and temporarily holds or requeues messages instead of repeatedly sending them to unreachable servers. The circuit breaker is enabled by default and supports custom failure policies, recovery delays, and state/drop callbacks. Additionally, OpenTelemetry metrics and span events have been added to expose state changes and held activities. The review comments provide valuable feedback, including a correction for parsing HTTP-date formats as UTC in Date.parse(), a suggestion to wrap circuit.beforeSend in a try-catch block to prevent transient storage errors from blocking deliveries, a code simplification for recording circuit breaker failures, and a more robust check for wrapping transport errors in FetchError.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
packages/fedify/src/federation/middleware.test.ts (1)
7105-7142:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAssert ordering metadata on the 503 requeue path.
This new
503+Retry-Afterbranch checks delay selection and failure counting, but it never provesorderingKeysurvives the requeue the way the analogous breaker-managed retry tests now do.Suggested assertion shape
await t.step("503 respects Retry-After while counting failure", async () => { + const orderingKey = "https://example.com/object/unavailable"; fetchMock.hardReset(); fetchMock.spyGlobal(); fetchMock.post("https://unavailable.example/inbox", { status: 503, headers: { "Retry-After": "120" }, @@ await federation.processQueuedTask( undefined, - createOutboxMessage("https://unavailable.example/inbox"), + createOutboxMessage("https://unavailable.example/inbox", { orderingKey }), ); assertEquals(queued.length, 1); const retry = queued[0].message as OutboxMessage; assertEquals(retry.attempt, 1); + assertEquals(retry.orderingKey, orderingKey); assertEquals(retry.circuitHeld, undefined); + assertEquals(queued[0].options?.orderingKey, orderingKey); assertEquals( queued[0].options?.delay, Temporal.Duration.from({ seconds: 120 }), );As per coding guidelines, "Write unit tests for all new functionality following existing test patterns."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/fedify/src/federation/middleware.test.ts` around lines 7105 - 7142, The test "503 respects Retry-After while counting failure" adds a requeue path but doesn't assert that ordering metadata is preserved; update the test to verify that the requeued OutboxMessage (queued[0].message) still has the original orderingKey (use createOutboxMessage to include an orderingKey value when invoking federation.processQueuedTask) and assert equality against that value, similarly to the breaker-managed retry tests; ensure you reference the queued array, the OutboxMessage type, and the orderingKey property in the assertion so ordering survives the requeue.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/fedify/src/federation/middleware.test.ts`:
- Around line 6871-6895: Add a new subtest that mirrors "calendar retry delays
are enqueued" but sets outboxRetryPolicy to return a negative calendar duration
(e.g., Temporal.Duration.from({ days: -1 })) and call
federation.processQueuedTask with
createOutboxMessage("https://calendar-delay.example/inbox"); assert the call
does not throw, that queued.length === 1, and that queued[0].options?.delay is
the intended fallback non-negative duration (e.g., Temporal.Duration.from({
days: 0 }) or whatever the module's fallback value is) so the negative
calendar-unit regression is covered; use the same helpers referenced in the diff
(setup, federation.processQueuedTask, createOutboxMessage, queued).
---
Duplicate comments:
In `@packages/fedify/src/federation/middleware.test.ts`:
- Around line 7105-7142: The test "503 respects Retry-After while counting
failure" adds a requeue path but doesn't assert that ordering metadata is
preserved; update the test to verify that the requeued OutboxMessage
(queued[0].message) still has the original orderingKey (use createOutboxMessage
to include an orderingKey value when invoking federation.processQueuedTask) and
assert equality against that value, similarly to the breaker-managed retry
tests; ensure you reference the queued array, the OutboxMessage type, and the
orderingKey property in the assertion so ordering survives the requeue.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 27036202-6776-4852-9d0d-b5f061403dc8
📒 Files selected for processing (5)
packages/fedify/src/federation/circuit-breaker.test.tspackages/fedify/src/federation/circuit-breaker.tspackages/fedify/src/federation/middleware.test.tspackages/fedify/src/federation/middleware.tspackages/fedify/src/federation/send.ts
|
Codex Review: Didn't find any major issues. Keep it up! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
HTTP-date values are always UTC, but Date.parse can interpret the asctime form without a timezone as local time. Append GMT before parsing that form so Retry-After delays are stable across runtime timezones. fedify-dev#778 (comment) Assisted-by: Codex:gpt-5.5
If circuit breaker storage is unavailable before a queued send, log the bookkeeping failure and proceed with delivery. The breaker should not block a healthy remote host because auxiliary KV state is transiently broken. fedify-dev#778 (comment) Assisted-by: Codex:gpt-5.5
Compute the circuit state change once for HTTP delivery failures and record the span event in one place. This keeps the reachable and failing host paths behaviorally unchanged while reducing duplicated bookkeeping branches. fedify-dev#778 (comment) Assisted-by: Codex:gpt-5.5
Wrap any non-FetchError thrown from the send transport path so queued delivery can consistently classify it as a transport failure, including signed double-knock requests that surface raw fetch errors. fedify-dev#778 (comment) Assisted-by: Codex:gpt-5.5
Add a regression test for retry policies that return negative calendar durations so the clamp path stays covered without reintroducing Temporal.Duration.compare crashes. fedify-dev#778 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces an outbound delivery circuit breaker for queued outbox delivery in Fedify. It tracks consecutive network and HTTP 5xx delivery failures per remote host, stores the state in the configured KvStore, and requeues messages held by an open circuit instead of repeatedly sending to an unreachable server. Additionally, it updates OpenTelemetry metrics and span events to preserve and record the URL host (including non-default ports) rather than just the hostname. The review comment correctly identifies a potential storage leak in the KV store when a custom failure callback is provided, as the consecutive failure timestamps are never pruned. Capping the history array to a reasonable limit is recommended.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4522794617
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
♻️ Duplicate comments (1)
packages/fedify/src/federation/middleware.test.ts (1)
7151-7187:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAssert
orderingKeyon the 503 retry path.This new
503 + Retry-Aftercase only proves delay selection and failure counting. If the requeue path drops ordering metadata here, the suite still passes even though the adjacent 429 path protects against that regression. Add anorderingKeyto the message and assert it survives on both the requeuedOutboxMessageandenqueue()options.Suggested test expansion
const { federation, queued, kv } = setup( { failureThreshold: 5, failureWindow: { minutes: 10 }, }, { outboxRetryPolicy: () => Temporal.Duration.from({ seconds: 3 }) }, ); + const orderingKey = "https://example.com/object/unavailable"; await federation.processQueuedTask( undefined, - createOutboxMessage("https://unavailable.example/inbox"), + createOutboxMessage("https://unavailable.example/inbox", { orderingKey }), ); assertEquals(queued.length, 1); const retry = queued[0].message as OutboxMessage; assertEquals(retry.attempt, 1); + assertEquals(retry.orderingKey, orderingKey); assertEquals(retry.circuitHeld, undefined); + assertEquals(queued[0].options?.orderingKey, orderingKey); assertEquals( queued[0].options?.delay, Temporal.Duration.from({ seconds: 120 }), );As per coding guidelines, "Write unit tests for all new functionality following existing test patterns."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/fedify/src/federation/middleware.test.ts` around lines 7151 - 7187, The 503+Retry-After test is missing verification that ordering metadata survives requeueing; update the test that calls federation.processQueuedTask with createOutboxMessage("https://unavailable.example/inbox") to include an orderingKey on the original OutboxMessage, then assert that the requeued item (queued[0].message as OutboxMessage) still has the same orderingKey and that the enqueue options (queued[0].options) also preserve the orderingKey (or equivalent enqueue metadata) so the orderingKey survives the retry path handled by processQueuedTask/enqueue.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@packages/fedify/src/federation/middleware.test.ts`:
- Around line 7151-7187: The 503+Retry-After test is missing verification that
ordering metadata survives requeueing; update the test that calls
federation.processQueuedTask with
createOutboxMessage("https://unavailable.example/inbox") to include an
orderingKey on the original OutboxMessage, then assert that the requeued item
(queued[0].message as OutboxMessage) still has the same orderingKey and that the
enqueue options (queued[0].options) also preserve the orderingKey (or equivalent
enqueue metadata) so the orderingKey survives the retry path handled by
processQueuedTask/enqueue.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 886e38a8-145d-4545-9ec4-05e72d9394ee
📒 Files selected for processing (6)
packages/fedify/src/federation/circuit-breaker.test.tspackages/fedify/src/federation/circuit-breaker.tspackages/fedify/src/federation/middleware.test.tspackages/fedify/src/federation/middleware.tspackages/fedify/src/federation/send.test.tspackages/fedify/src/federation/send.ts
Assert that ordered outbox messages keep their ordering metadata when a 503 response with Retry-After is requeued through the retry path. fedify-dev#778 Assisted-by: Codex:gpt-5.5
Cap stored failure timestamps for custom circuit breaker policies so a host that keeps failing cannot grow its KV state without limit. fedify-dev#778 (comment) Assisted-by: Codex:gpt-5.5
When a 503 response opens a host circuit, keep the held activity delayed until both the circuit recovery delay and the server Retry-After delay have elapsed, while still applying the held activity TTL cap. fedify-dev#778 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces an outbound delivery circuit breaker for queued outbox delivery in Fedify. It tracks consecutive network and HTTP 5xx delivery failures per remote host, stores the state in the configured KvStore, and temporarily holds or requeues messages when a circuit is open instead of repeatedly hammering unreachable servers. The circuit breaker is enabled by default and can be customized or disabled. The review comments provide highly actionable feedback to optimize the circuit breaker's state-recording logic by avoiding redundant KV store writes when the circuit is already open, and to improve performance by pre-parsing actorIds instead of using a lazy evaluation function.
|
Codex Review: Didn't find any major issues. Swish! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Return early when recording a failure for a host whose circuit is already open. The state and failure history are unchanged in that case, so avoiding the CAS prevents redundant KV writes. fedify-dev#778 (comment) Assisted-by: Codex:gpt-5.5
Parse actor IDs at most once while processing a queued outbox message so shared failure paths do not repeat URL parsing or duplicate warning logs. fedify-dev#778 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces an outbound delivery circuit breaker for queued outbox delivery in Fedify. It tracks consecutive network and HTTP 5xx delivery failures per remote host (including non-default ports) and stores the state in the configured KvStore to temporarily hold deliveries to unreachable servers. The circuit breaker is enabled by default and can be customized or disabled. The changes also include OpenTelemetry metrics and span events for tracking state changes and held activities. The review comment correctly identifies a type mismatch in FederationImpl.processQueuedTask where _ (representing the queue task/context) is passed to this.#createContext instead of undefined, which is inconsistent with other calls.
|
Codex Review: Didn't find any major issues. Can't wait for the next one! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Closes #620.
Summary
This PR adds a circuit breaker for queued outbox delivery so Fedify can stop repeatedly sending to remote inbox hosts that are already failing. When a host crosses the configured failure threshold, later queued deliveries are held instead of sent immediately. After the recovery delay, one half-open probe is allowed through; success closes the circuit, while another failure reopens it.
The implementation is in packages/fedify/src/federation/circuit-breaker.ts and is wired into queued delivery in packages/fedify/src/federation/middleware.ts. The breaker is only created when an outbox queue exists, and applications can disable or tune it through the federation builder options.
Behavior
The circuit breaker tracks remote hosts using the URL host, including non-default ports, so separate services on the same hostname do not share circuit state. Permanent delivery failures are abandoned without counting as reachability failures, but they still clear half-open circuit state because they show that the remote host responded. Held activities respect
heldActivityTtl, including messages that wake after recovery or expire during a failed probe.The retry path also handles
Retry-Afterindependently from circuit breaker state. A429response can use the server-provided delay even when the circuit breaker is disabled, while still honoring the configured retry policy and preserving ordering metadata for manually scheduled retries.Observability
This adds circuit breaker metrics and span events to the existing ActivityPub telemetry. State-change and held-delivery events use the same bounded state values:
closed,open, andhalf_open. Theactivitypub.remote.hostattribute now includes non-default ports, which is documented as a telemetry behavior change.Documentation
The user-facing docs are updated in docs/manual/circuit-breaker.md and docs/manual/opentelemetry.md. The VitePress navigation is updated in docs/.vitepress/config.mts, and the release notes are updated in CHANGES.md.
Tests
This PR adds focused coverage for circuit state transitions, held delivery, TTL expiry, stale half-open recovery, CAS contention, permanent failure handling,
Retry-After, telemetry events, and host key behavior.Verified with:
mise check && mise test:deno && mise test:node && mise test:bun