ci: fix flaky integration tests by distributing images via GHCR #3582

Open
amir-deris wants to merge 9 commits into main from amir/plt-476-CI-integration-test-image-fix

@amir-deris

@amir-deris amir-deris commented Jun 12, 2026

Contributor

Problem

The Docker Integration Test workflow packaged the localnode/rpcnode Docker images into a ~1 GB artifact (integration-docker-images.tar.zst) that ~40 matrix jobs each downloaded concurrently via actions/download-artifact@v4. The action streams and extracts the zip without an end-to-end integrity check, so a prematurely closed connection can leave a truncated file without failing the step. The truncation first surfaced one step later, at zstd -d | docker load, which failed with Read error (39): premature end / unexpected EOF and required a manual rerun. With 40 concurrent 1 GB downloads per run, this flaked regularly.

Fix

Distribute the images via GHCR instead of an artifact. Registry pulls are content-addressed — every layer is sha256-verified and retried automatically by the docker client — so truncation cannot slip through silently.
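
To illustrate why content addressing closes the gap described under Problem, here is a minimal Python sketch of digest verification. It is not the Docker client's code; `verify_blob` and the sample bytes are hypothetical names used only for illustration. The idea is that a blob is named by its sha256 digest, so a truncated download cannot hash to the expected name.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded sha256 of a byte string."""
    return hashlib.sha256(data).hexdigest()


def verify_blob(data: bytes, expected_digest: str) -> bool:
    """Return True only if the received bytes hash to the digest that names them.

    expected_digest uses the "sha256:<hex>" form seen in registry references.
    """
    algo, _, hex_digest = expected_digest.partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algo}")
    return sha256_hex(data) == hex_digest


# Hypothetical layer content standing in for a real image layer.
layer = b"layer-bytes " * 1000
digest = "sha256:" + sha256_hex(layer)

# Simulate a connection that closed halfway through the transfer.
truncated = layer[: len(layer) // 2]

full_ok = verify_blob(layer, digest)           # complete download passes
truncated_ok = verify_blob(truncated, digest)  # truncated download is rejected
```

With a check like this on every layer, a short read becomes an explicit verification failure that can be retried, rather than a file that only fails later at decompression.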

  • prepare-cluster pushes both images to ghcr.io/sei-protocol/sei-chain-integration-test-{localnode,rpcnode}:<run_id> using GITHUB_TOKEN (no OIDC or external secrets required). The CI artifact now carries only the small seid tarball.
  • Test jobs log in to GHCR, docker pull the run-tagged images, and retag them to sei-chain/{localnode,rpcnode} — everything downstream (docker-cluster-start-ci etc.) is unchanged.
  • Both builds stamp a sei-chain.ci-run-id label so every run pushes a unique image digest. Labels are config-only: the layer cache is unaffected and a cache-hit run uploads just a new config blob + manifest. This avoids the pitfall of re-tagging a stable digest where in-flight runs could be affected by tag moves.
  • Reruns of failed test jobs keep working: tags are keyed by run_id and persist in GHCR across attempts.
  • Adds ghcr-integration-test-cleanup.yml: a weekly scheduled workflow (Sundays 06:00 UTC) that prunes run-id tags older than 14 days from both GHCR repos, while preserving the :cache tag. Supports workflow_dispatch with a dry-run option.
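
The label-based digest reasoning above can be sketched with a deliberately simplified model of an image: a config blob (which carries the labels) and a manifest that references the config and the layers by digest. This is an assumption-laden illustration, not the exact OCI serialization; `build_image`, the layer digests, and the run IDs 1001 and 1002 are hypothetical.

```python
import hashlib
import json


def digest(obj) -> str:
    """sha256 digest of a canonical JSON encoding (a stand-in for real blob bytes)."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def build_image(layer_digests, run_id):
    """Model an image whose only per-run difference is the ci-run-id label."""
    config = {
        "config": {"Labels": {"sei-chain.ci-run-id": str(run_id)}},
        # Simplification: real configs list uncompressed diff_ids, not layer digests.
        "rootfs": {"type": "layers", "diff_ids": list(layer_digests)},
    }
    manifest = {
        "config": {"digest": digest(config)},
        "layers": [{"digest": d} for d in layer_digests],
    }
    return {
        "config_digest": digest(config),
        "manifest_digest": digest(manifest),
        "layers": list(layer_digests),
    }


# Hypothetical cached layers shared by two consecutive runs.
layers = ["sha256:" + "a" * 64, "sha256:" + "b" * 64]
run_a = build_image(layers, 1001)
run_b = build_image(layers, 1002)

same_layers = run_a["layers"] == run_b["layers"]
distinct_digests = run_a["manifest_digest"] != run_b["manifest_digest"]
```

In this model the two runs share every layer but differ in config and manifest digest, which is why a cache-hit run only needs to upload a small config blob and manifest while still getting its own image digest.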

Advantage over ECR

It avoids the roughly $3,000 monthly egress charge from AWS to GitHub runners. GITHUB_TOKEN is also automatically available to all workflows, including fork PRs, which removes the need for OIDC role assumption or AWS credentials for image distribution. No IAM setup is required.

@cursor

cursor Bot commented Jun 12, 2026


PR Summary

Medium Risk
Changes how integration CI obtains Docker images and who can run full integration tests from forks; failures would block CI rather than affect production chain code.

Overview
Integration test localnode/rpcnode images are no longer shipped as a ~1 GB integration-docker-images artifact that every matrix job downloads. prepare-cluster pushes both images to GHCR under run-id tags, keeps only the seid tarball in the CI artifact, and matrix jobs docker pull and retag to sei-chain/localnode / sei-chain/rpcnode.

Build caching moves from AWS ECR to GHCR :cache tags on the same integration-test packages; builds add a sei-chain.ci-run-id label so each run gets a distinct digest for safe pruning. AWS OIDC/ECR login is removed from this workflow; packages:write/read and GITHUB_TOKEN GHCR login replace it (with an explicit note that fork PRs cannot publish org packages).

A new ghcr-integration-test-cleanup workflow runs weekly (and on manual dispatch with dry-run) to delete numeric run-id package versions older than 14 days while preserving :cache and other non-run-id tags.
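
The retention rule described here can be sketched as a small selection function. This is a hedged illustration of the policy as summarized, not the workflow's actual script: the version records (`id`, `tags`, `created_at`), the function name `select_stale`, and the choice to leave untagged versions alone are all assumptions.

```python
import re
from datetime import datetime, timedelta, timezone

RUN_ID_TAG = re.compile(r"[0-9]+")  # run-id tags are purely numeric


def select_stale(versions, now, max_age_days=14):
    """Return ids of versions whose tags are all numeric run ids and older than the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for v in versions:
        tags = v["tags"]
        # Assumption: untagged versions and any version carrying a
        # non-run-id tag (such as "cache") are always preserved.
        if not tags or not all(RUN_ID_TAG.fullmatch(t) for t in tags):
            continue
        if v["created_at"] < cutoff:
            stale.append(v["id"])
    return stale


now = datetime(2026, 6, 21, 6, 0, tzinfo=timezone.utc)
# Hypothetical package versions for illustration only.
versions = [
    {"id": 1, "tags": ["15000000001"], "created_at": now - timedelta(days=30)},
    {"id": 2, "tags": ["cache"], "created_at": now - timedelta(days=30)},
    {"id": 3, "tags": ["15000000002"], "created_at": now - timedelta(days=3)},
    {"id": 4, "tags": ["15000000003", "cache"], "created_at": now - timedelta(days=30)},
    {"id": 5, "tags": [], "created_at": now - timedelta(days=30)},
]

to_delete = select_stale(versions, now)
```

A dry run would print `to_delete` instead of issuing delete requests, which matches the dry-run option mentioned for manual dispatch.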

Reviewed by Cursor Bugbot for commit de08877.

@amir-deris amir-deris changed the title from "modified integration-test yaml to push pull from ecr" to "ci: distribute integration test images via ECR instead of 1GB artifact" Jun 12, 2026
@github-actions

github-actions Bot commented Jun 12, 2026


The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Jun 16, 2026, 8:11 PM |

@amir-deris amir-deris requested review from bdchatham and masih June 12, 2026 19:11
@codecov

codecov Bot commented Jun 12, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.35%. Comparing base (d76c712) to head (de08877).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3582      +/-   ##
==========================================
- Coverage   58.96%   58.35%   -0.62%     
==========================================
  Files        2208     2140      -68     
  Lines      181733   174609    -7124     
==========================================
- Hits       107157   101888    -5269     
+ Misses      64959    63649    -1310     
+ Partials     9617     9072     -545     
| Flag | Coverage Δ |
| --- | --- |
| sei-db | 70.41% <ø> (ø) |
| sei-db-state-db | ? |

Flags with carried forward coverage won't be shown.
see 119 files with indirect coverage changes


@amir-deris amir-deris changed the title from "ci: distribute integration test images via ECR instead of 1GB artifact" to "ci: distribute integration test images via GHCR instead of 1GB artifact" Jun 12, 2026
@amir-deris amir-deris changed the title from "ci: distribute integration test images via GHCR instead of 1GB artifact" to "ci: fix flaky integration tests by distributing images via GHCR" Jun 12, 2026

      packages: write
    steps:
      - name: Delete stale run-id tags
        uses: dataaxiom/ghcr-cleanup-action@d52806a0dc70b430571a37da1fde39733ffd640f # v1.2.2

Collaborator

I don't trust this action. Could we use the official GitHub one, please: https://github.com/actions/delete-package-versions

Contributor Author

@masih Thanks for the feedback. I have removed the third-party action and replaced it with a custom script. The official GitHub action is not a good fit here because it is count-based, not time-based. Its two main modes are:

  - num-old-versions-to-delete: N (delete the N oldest versions)
  - min-versions-to-keep: N (delete everything except the N newest versions)

Its ignore-versions input takes a regex of version names/tags to exclude from deletion, so :cache could be protected with ignore-versions: ^cache$.

It could be used here with a count-based policy, e.g. "keep the last 20 run images", but that is a weaker fit for this use case:

  - CI frequency varies week to week, so a fixed count doesn't map cleanly to a time window
  - It would require tuning a magic number rather than stating "14 days"

The official action is better suited to policies like "keep the last 5 releases" on a package with predictable, low-frequency publishing. For a high-frequency CI artifact store where time-based retention is the natural policy, it falls short.
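
The point about a fixed count not mapping to a time window can be made concrete with a small sketch. The run frequencies (one run every 12 hours versus one run every hour) and the count of 20 are hypothetical figures chosen for illustration, and `retained_window_days` is an illustrative helper, not part of either tool.

```python
from datetime import datetime, timedelta, timezone


def retained_window_days(created, keep, now):
    """Age in days of the oldest image that a keep-the-newest-N policy retains."""
    newest = sorted(created, reverse=True)[:keep]
    return (now - min(newest)).total_seconds() / 86400


now = datetime(2026, 6, 21, 6, 0, tzinfo=timezone.utc)

# Quiet period: one CI run every 12 hours.
quiet = [now - timedelta(hours=12) * i for i in range(60)]
# Busy period: one CI run every hour.
busy = [now - timedelta(hours=1) * i for i in range(60)]

quiet_days = retained_window_days(quiet, keep=20, now=now)  # 9.5 days of history
busy_days = retained_window_days(busy, keep=20, now=now)    # under one day of history
```

Under these assumptions the same count keeps more than a week of images in a quiet period and less than a day in a busy one, whereas a 14-day cutoff gives the same window regardless of run frequency.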
