Skip to content

Retry GHCR container pushes on transient failures#3445

Open
Mpdreamz wants to merge 1 commit into
mainfrom
fix/container-push-retry
Open

Retry GHCR container pushes on transient failures#3445
Mpdreamz wants to merge 1 commit into
mainfrom
fix/container-push-retry

Conversation

@Mpdreamz
Copy link
Copy Markdown
Member

@Mpdreamz Mpdreamz commented Jun 2, 2026

Why

Pre-release and release workflows publish container images to ghcr.io. Transient registry errors can fail the whole job even when earlier images in the same run already pushed successfully.

Example: Pre-release main branch run 26808848612 — the Publish Containers job uploaded docs-builder and docs-builder-mcp, then failed on the third image with CONTAINER1005: Registry push failed; received status code 'BadGateway' when pushing elastic/docs-builder-api:edge to ghcr.io.

What

When pushing to the registry in CI, each dotnet publish /t:PublishContainer is retried up to three times with no delay between attempts. Local builds that do not push to ghcr.io are unchanged.

Pre-release and release workflows can fail when ghcr.io returns
BadGateway during PublishContainer. Retry up to three times when
pushing to the registry; local builds are unchanged.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@Mpdreamz Mpdreamz requested a review from a team as a code owner June 2, 2026 09:29
@Mpdreamz Mpdreamz added the automation packaging, ci/cd. label Jun 2, 2026
@Mpdreamz Mpdreamz requested a review from cotti June 2, 2026 09:29
@Mpdreamz Mpdreamz temporarily deployed to integration-tests June 2, 2026 09:29 — with GitHub Actions Inactive
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds resilient retry logic for container registry operations in the build pipeline. A new runDotnetWithRegistryPushRetry helper executes dotnet commands with up to 3 attempts, retrying when non-zero exit codes occur. The publishContainers function now routes through this helper when registry-related MSBuild properties are present, providing transient failure handling for registry push operations.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: adding retry logic for GHCR container pushes on transient failures.
Description check ✅ Passed The description is directly related to the changeset, explaining the problem (transient registry failures), the solution (retry logic), and impact scope (CI pushes only).
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch fix/container-push-retry

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@build/Targets.fs`:
- Around line 110-123: The retry loop in runDotnetWithRegistryPushRetry
currently causes one extra run; change the recursion to start with attempt 0
(call attempt 0) and update the guard to "when number < maxAttempts - 1" so that
you only retry while number < maxAttempts-1, and update the printfn to display
(number + 1) as the current attempt; keep the final branch as exec { run
"dotnet" args } so the command is executed at most maxAttempts times.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bdfcd245-bf8a-4828-bef3-0c13ae41b897

📥 Commits

Reviewing files that changed from the base of the PR and between 089578e and 78bbace.

📒 Files selected for processing (1)
  • build/Targets.fs

Comment thread build/Targets.fs
Comment on lines +110 to +123
let private runDotnetWithRegistryPushRetry (args: string list) =
let maxAttempts = 3
let rec attempt number =
match exec {
validExitCode (fun _ -> true)
exit_code_of "dotnet" args
} with
| 0 -> ()
| _ when number < maxAttempts ->
printfn "Container registry push failed (attempt %d/%d); retrying..." number maxAttempts
attempt (number + 1)
| _ ->
exec { run "dotnet" args }
attempt 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Logic error: performs 4 attempts instead of 3.

The retry logic executes the command 4 times when all attempts fail, not 3. When attempt 3 fails, the guard 3 < maxAttempts is false, so it falls through to line 122 and runs the command a 4th time. The PR description states "up to three times" but this implementation runs once more than intended.

🔧 Proposed fix
 let private runDotnetWithRegistryPushRetry (args: string list) =
     let maxAttempts = 3
     let rec attempt number =
-        match exec {
+        let exitCode = exec {
             validExitCode (fun _ -> true)
             exit_code_of "dotnet" args
-        } with
+        }
+        match exitCode with
         | 0 -> ()
         | _ when number < maxAttempts ->
             printfn "Container registry push failed (attempt %d/%d); retrying..." number maxAttempts
             attempt (number + 1)
         | _ ->
-            exec { run "dotnet" args }
+            failwithf "Container registry push failed after %d attempts with exit code %d" maxAttempts exitCode
     attempt 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@build/Targets.fs` around lines 110 - 123, The retry loop in
runDotnetWithRegistryPushRetry currently causes one extra run; change the
recursion to start with attempt 0 (call attempt 0) and update the guard to "when
number < maxAttempts - 1" so that you only retry while number < maxAttempts-1,
and update the printfn to display (number + 1) as the current attempt; keep the
final branch as exec { run "dotnet" args } so the command is executed at most
maxAttempts times.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automation packaging, ci/cd.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant