Retry GHCR container pushes on transient failures#3445
Conversation
Pre-release and release workflows can fail when ghcr.io returns BadGateway during PublishContainer. Retry up to three times when pushing to the registry; local builds are unchanged. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
📝 WalkthroughWalkthroughThis PR adds resilient retry logic for container registry operations in the build pipeline. A new 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@build/Targets.fs`:
- Around line 110-123: The retry loop in runDotnetWithRegistryPushRetry
currently causes one extra run; change the recursion to start with attempt 0
(call attempt 0) and update the guard to "when number < maxAttempts - 1" so that
you only retry while number < maxAttempts-1, and update the printfn to display
(number + 1) as the current attempt; keep the final branch as exec { run
"dotnet" args } so the command is executed at most maxAttempts times.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: bdfcd245-bf8a-4828-bef3-0c13ae41b897
📒 Files selected for processing (1)
build/Targets.fs
| let private runDotnetWithRegistryPushRetry (args: string list) = | ||
| let maxAttempts = 3 | ||
| let rec attempt number = | ||
| match exec { | ||
| validExitCode (fun _ -> true) | ||
| exit_code_of "dotnet" args | ||
| } with | ||
| | 0 -> () | ||
| | _ when number < maxAttempts -> | ||
| printfn "Container registry push failed (attempt %d/%d); retrying..." number maxAttempts | ||
| attempt (number + 1) | ||
| | _ -> | ||
| exec { run "dotnet" args } | ||
| attempt 1 |
There was a problem hiding this comment.
Logic error: performs 4 attempts instead of 3.
The retry logic executes the command 4 times when all attempts fail, not 3. When attempt 3 fails, the guard 3 < maxAttempts is false, so it falls through to line 122 and runs the command a 4th time. The PR description states "up to three times" but this implementation runs once more than intended.
🔧 Proposed fix
let private runDotnetWithRegistryPushRetry (args: string list) =
let maxAttempts = 3
let rec attempt number =
- match exec {
+ let exitCode = exec {
validExitCode (fun _ -> true)
exit_code_of "dotnet" args
- } with
+ }
+ match exitCode with
| 0 -> ()
| _ when number < maxAttempts ->
printfn "Container registry push failed (attempt %d/%d); retrying..." number maxAttempts
attempt (number + 1)
| _ ->
- exec { run "dotnet" args }
+ failwithf "Container registry push failed after %d attempts with exit code %d" maxAttempts exitCode
attempt 1🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@build/Targets.fs` around lines 110 - 123, The retry loop in
runDotnetWithRegistryPushRetry currently causes one extra run; change the
recursion to start with attempt 0 (call attempt 0) and update the guard to "when
number < maxAttempts - 1" so that you only retry while number < maxAttempts-1,
and update the printfn to display (number + 1) as the current attempt; keep the
final branch as exec { run "dotnet" args } so the command is executed at most
maxAttempts times.
Why
Pre-release and release workflows publish container images to ghcr.io. Transient registry errors can fail the whole job even when earlier images in the same run already pushed successfully.
Example: Pre-release main branch run 26808848612 — the Publish Containers job uploaded
docs-builderanddocs-builder-mcp, then failed on the third image withCONTAINER1005: Registry push failed; received status code 'BadGateway'when pushingelastic/docs-builder-api:edgeto ghcr.io.What
When pushing to the registry in CI, each
dotnet publish /t:PublishContaineris retried up to three times with no delay between attempts. Local builds that do not push to ghcr.io are unchanged.