Research async building architecture#233
Conversation
Signed-off-by: Nikola Forró <nforro@redhat.com> Assisted-by: Claude Opus 4.6 via Claude Code
majamassarini
left a comment
There was a problem hiding this comment.
From the proposed solutions, personally I would start with option A (async multiplexing).
However, I am not totally sure we should disregard the idea of relying on the CI builds.
We would like to rely on the CI tests in the future, right? So why not also on the CI builds?
If we mark the MR as incomplete (WIP/draft), I think we can iterate on it multiple times based on the CI feedback. The iteration loop would be almost the same for the builds and for all the other CI checks. Also, we could collect "intermediate feedback from maintainers" in this way, something that other solutions will not give us.
I think it really depends on how many times we usually need to iterate on a build. And, I may be wrong, but as we improve things, I think we are iterating less and less on it.
Can we do that? I think we wanted to avoid opening clearly wrong MRs and spamming maintainers - if we can open draft MRs and those can be configured to be silent until converted to regular MRs, then I think we can indeed move the feedback loop post MR creation, with the advantage of incorporating both CI and human feedback. |
|
|
||
| **Advantages:** | ||
|
|
||
| - Minimal code changes — workflows are already async |
There was a problem hiding this comment.
really love that, no need for celery, just a few code changes
| since the `build_package` tool call goes through the same async polling loop | ||
| - Single pod, single process — simpler operations | ||
|
|
||
| **Challenges:** |
There was a problem hiding this comment.
how about logs? how hard will it be for us to diagnose issues from logs?
There was a problem hiding this comment.
Good point, we would probably have to augment the log messages to be able to differentiate between tasks, similarly to packit-service with Celery.
There was a problem hiding this comment.
makes sense, we can prepend with some session id or ideally even with human readable identification, something like "f{jira}-{branch}-{attempt}"
| - **Error blast radius**: one task's crash could affect others in the same | ||
| process. |
There was a problem hiding this comment.
it would be great to know if we can prepare the codebase for this change
There was a problem hiding this comment.
Response from Claude:
Yes, several incremental prep steps would make Option A easier without committing to it yet:
- Add a task/correlation ID to the workflow context and thread it through all log messages (using Python's logging filter or contextvars). This is useful regardless of concurrency — it improves traceability even with single-task pods.
- Audit shared mutable state — check if any module-level globals, singletons, or shared file paths (e.g. /git-repos/{issue}/) would collide when two tasks run concurrently. The PVC paths already use {issue} as a namespace, which is good.
- Ensure MCP gateway session isolation — verify that the _response_streams dict keyed by RequestId truly handles concurrent tool calls without cross-talk.
- Profile actual RSS on a representative set of packages during source processing. This determines whether N=2 is feasible without a quota bump.
There was a problem hiding this comment.
Claude: implement it and propose a PR 😅
| error message, upstream patches, repo path). This means conversation | ||
| continuity across the build wait is **not required**. | ||
|
|
||
| ## Alternative: Skip Build Verification Entirely |
There was a problem hiding this comment.
while this would simplify things a lot, I am a bit skeptical - I remember agent sometimes iterated many times on build failures during pilot, and re-pushing 5-10 times to a draft MR might be quite noisy. On the other hand, this might be statistically happening in minority of cases (and maybe more common in rebases than in backports?) But definitely worth validating against recent runs, @opohorel @abobrov any data from your experience on this?
There was a problem hiding this comment.
I would advise against skipping the build verification. quite a few backports are not passing through it, forcing the agent to improve the backport. although I saw that the number of consecutive build failures decreasing with newer models, but we are not there yet
| 1. **What is the actual RSS of a backport pod during source processing?** | ||
| Feedback from the SE deployment shows large packages (Firefox, Thunderbird) | ||
| need ~4Gi just for source extraction, making even N=2 concurrency | ||
| challenging. Profiling smaller packages would clarify the typical case. | ||
| 2. **What's the namespace quota expansion path?** Determines feasibility of | ||
| Option A at higher concurrency. | ||
| 3. **How large are typical agent conversations?** Affects state size | ||
| estimates and memory profiling for Option A. |
There was a problem hiding this comment.
@nforro would you create a POC issue that we can prioritise and decide which option to go with?
No description provided.