Skip to content

Research async building architecture#233

Open
nforro wants to merge 1 commit into
packit:mainfrom
nforro:async-builds
Open

Research async building architecture#233
nforro wants to merge 1 commit into
packit:mainfrom
nforro:async-builds

Conversation

@nforro
Copy link
Copy Markdown
Member

@nforro nforro commented May 20, 2026

No description provided.

Signed-off-by: Nikola Forró <nforro@redhat.com>
Assisted-by: Claude Opus 4.6 via Claude Code
Copy link
Copy Markdown
Member

@majamassarini majamassarini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the proposed solutions, personally I would start with option A (async multiplexing).

However, I am not totally sure we should disregard the idea of relying on the CI builds.
We would like to rely on the CI tests in the future, right? So why not also on the CI builds?
If we mark the MR as incomplete (WIP/draft), I think we can iterate on it multiple times based on the CI feedback. The iteration loop would be almost the same for the builds and for all the other CI checks. Also, we could collect "intermediate feedback from maintainers" in this way, something that other solutions will not give us.
I think it really depends on how many times we usually need to iterate on a build. And, I may be wrong, but as we improve things, I think we are iterating less and less on it.

@nforro
Copy link
Copy Markdown
Member Author

nforro commented May 25, 2026

If we mark the MR as incomplete (WIP/draft)

Can we do that? I think we wanted to avoid opening clearly wrong MRs and spamming maintainers - if we can open draft MRs and those can be configured to be silent until converted to regular MRs, then I think we can indeed move the feedback loop post MR creation, with the advantage of incorporating both CI and human feedback.


**Advantages:**

- Minimal code changes — workflows are already async
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really love that, no need for celery, just a few code changes

since the `build_package` tool call goes through the same async polling loop
- Single pod, single process — simpler operations

**Challenges:**
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about logs? how hard will it be for us to diagnose issues from logs?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, we would probably have to augment the log messages to be able to differentiate between tasks, similarly to packit-service with Celery.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, we can prepend with some session id or ideally even with human readable identification, something like "f{jira}-{branch}-{attempt}"

Comment on lines +216 to +217
- **Error blast radius**: one task's crash could affect others in the same
process.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be great to know if we can prepare the codebase for this change

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Response from Claude:

Yes, several incremental prep steps would make Option A easier without committing to it yet:

  • Add a task/correlation ID to the workflow context and thread it through all log messages (using Python's logging filter or contextvars). This is useful regardless of concurrency — it improves traceability even with single-task pods.
  • Audit shared mutable state — check if any module-level globals, singletons, or shared file paths (e.g. /git-repos/{issue}/) would collide when two tasks run concurrently. The PVC paths already use {issue} as a namespace, which is good.
  • Ensure MCP gateway session isolation — verify that the _response_streams dict keyed by RequestId truly handles concurrent tool calls without cross-talk.
  • Profile actual RSS on a representative set of packages during source processing. This determines whether N=2 is feasible without a quota bump.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: implement it and propose a PR 😅

error message, upstream patches, repo path). This means conversation
continuity across the build wait is **not required**.

## Alternative: Skip Build Verification Entirely
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while this would simplify things a lot, I am a bit skeptical - I remember agent sometimes iterated many times on build failures during pilot, and re-pushing 5-10 times to a draft MR might be quite noisy. On the other hand, this might be statistically happening in minority of cases (and maybe more common in rebases than in backports?) But definitely worth validating against recent runs, @opohorel @abobrov any data from your experience on this?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would advise against skipping the build verification. quite a few backports are not passing through it, forcing the agent to improve the backport. although I saw that the number of consecutive build failures decreasing with newer models, but we are not there yet

Comment on lines +414 to +421
1. **What is the actual RSS of a backport pod during source processing?**
Feedback from the SE deployment shows large packages (Firefox, Thunderbird)
need ~4Gi just for source extraction, making even N=2 concurrency
challenging. Profiling smaller packages would clarify the typical case.
2. **What's the namespace quota expansion path?** Determines feasibility of
Option A at higher concurrency.
3. **How large are typical agent conversations?** Affects state size
estimates and memory profiling for Option A.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nforro would you create a POC issue that we can prioritise and decide which option to go with?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

6 participants