
ENT-14140: psql_wrapper.sh: retry psql commands on transient failures #3165

Open
larsewi wants to merge 1 commit into cfengine:master from larsewi:fr-race

Conversation

@larsewi
Contributor

@larsewi larsewi commented May 26, 2026

Observed a race condition in CI where bundle agent superhub_schema interacts with postgres shortly after service restart.

03:12:04 systemd: Stopping CFEngine Enterprise PostgreSQL Database...
03:12:04 systemd: Started CFEngine Enterprise PostgreSQL Database.
03:12:04 cf-agent: Executing ... psql_wrapper.sh cfdb select superhub_schema(...)
03:12:05 cf-agent: returned code '2' defined as promise failed

Fixed by gating superhub_schema, ensure_feeders, and imported_data on a persistent class set by the cf-postgres restart.

Ticket: ENT-14140

@larsewi larsewi added the "cherry-pick?" label (Fixes which may need to be cherry-picked to LTS branches) May 26, 2026
Contributor

@craigcomstock craigcomstock left a comment

This doesn't feel quite right to me. It seems we need a sequence of actions rather than a class/gate situation: the restart has to finish, and then superhub_schema() runs. With this solution superhub_schema() would run at the next agent interval, which is OK but maybe not ideal. Instead of gating on a recent restart, could we gate on PostgreSQL being up and ready, in the hope that superhub_schema() could run in the same agent run?
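
To make the "gate on PostgreSQL up and ready" idea concrete, here is a minimal sketch of a readiness poll in POSIX shell. It is not code from this PR: `wait_until_ready` and its parameters are hypothetical names, and the usage note assumes PostgreSQL's `pg_isready` utility is on `PATH` with default connection settings.

```shell
# Sketch only: poll a readiness probe until it succeeds or attempts run out.
# wait_until_ready is a hypothetical helper, not part of psql_wrapper.sh.
#
# Usage: wait_until_ready MAX_ATTEMPTS DELAY_SECONDS PROBE [ARGS...]
# Returns 0 as soon as PROBE exits 0, or 1 if it never does.
wait_until_ready() {
    _wur_max=$1
    _wur_delay=$2
    shift 2
    _wur_attempt=1
    while [ "$_wur_attempt" -le "$_wur_max" ]; do
        # Run the probe; stop polling on the first success.
        if "$@"; then
            return 0
        fi
        # Sleep between attempts, but not after the final failed one.
        if [ "$_wur_attempt" -lt "$_wur_max" ]; then
            sleep "$_wur_delay"
        fi
        _wur_attempt=$((_wur_attempt + 1))
    done
    return 1
}
```

Something like `wait_until_ready 10 1 pg_isready -q` would then wait a bounded time for the server to accept connections before the schema commands run in the same agent run. The attempt count and delay here are illustrative values, not ones taken from the PR.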

# restarted.
{
promise_repaired => { "postgres_recently_restarted" };
persist_time => "1";
Member

I don't think this is a great fix. A 1-minute persistence is a bit strange; 4 or 5 minutes would make more sense on the default periodic scale, but I would still try to think of a different way to handle this.

@larsewi
Contributor Author

larsewi commented May 26, 2026

> 2 if the connection to the server went bad and the session was not interactive

@craigcomstock, @nickanderson what if we have psql_wrapper.sh retry on return code 2?
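
The retry proposal can be sketched as follows. This is an illustration under stated assumptions, not the actual change to psql_wrapper.sh: `retry_on_exit_2`, the attempt limit, and the delay are hypothetical. The only behaviour taken from the discussion is that exit status 2 is treated as the transient, retryable case.

```shell
# Sketch only: rerun a command when it exits with status 2.
# retry_on_exit_2 is a hypothetical helper, not the PR's implementation.
#
# Usage: retry_on_exit_2 MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Any status other than 2 (including 0) is returned immediately.
# After MAX_ATTEMPTS runs that all exit 2, status 2 is returned.
retry_on_exit_2() {
    _r2_max=$1
    _r2_delay=$2
    shift 2
    _r2_attempt=1
    while :; do
        # Capture the exit status without tripping "set -e" in callers.
        _r2_rc=0
        "$@" || _r2_rc=$?
        if [ "$_r2_rc" -ne 2 ] || [ "$_r2_attempt" -ge "$_r2_max" ]; then
            return "$_r2_rc"
        fi
        sleep "$_r2_delay"
        _r2_attempt=$((_r2_attempt + 1))
    done
}
```

Retrying only on status 2 keeps genuine failures fast: a command that fails with any other status is reported on the first attempt instead of being repeated.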

@nickanderson
Member

> > 2 if the connection to the server went bad and the session was not interactive
>
> @craigcomstock, @nickanderson what if we have psql_wrapper.sh retry on return code 2?

yeah I think that would be better.

@larsewi
Contributor Author

larsewi commented May 27, 2026

> > > 2 if the connection to the server went bad and the session was not interactive
> >
> > @craigcomstock, @nickanderson what if we have psql_wrapper.sh retry on return code 2?
>
> yeah I think that would be better.

The only thing, @nickanderson, is that it will cause the agent to hang while it bootstraps. Or should it perhaps run these commands in the background?

Retry psql command on transient failures. E.g., when postgres is being
restarted due to config change.

Ticket: ENT-14140
Changelog: psql commands are now retried on transient errors in federated reporting
Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
@larsewi larsewi changed the title from "ENT-14140: federation.cf: gate postgres interaction on recent service restart" to "ENT-14140: psql_wrapper.sh: retry psql commands on transient failures" May 27, 2026
@nickanderson
Member

nickanderson commented May 27, 2026

> The only thing, @nickanderson, is that it will cause the agent to hang while it bootstraps. Or should it perhaps run these commands in the background?

Hang permanently, or just be slow? A permanent hang needs to be avoided. Just reading the code, it looks like it might hang for up to 30 seconds while retrying. Also, this hang would be limited to the hub bootstrapping to itself, is that right?
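
One way to rule out a permanent hang, independent of how the retry loop is written, is to put a hard wall-clock limit around the command. The sketch below assumes the `timeout` utility from GNU coreutils is available. `run_bounded` is a hypothetical name, and the 30-second figure in the usage note simply echoes the estimate above rather than a value confirmed from the code.

```shell
# Sketch only: cap the total run time of an external command.
# run_bounded is a hypothetical helper; it relies on coreutils `timeout`,
# which exits with status 124 when the limit is reached.
#
# Usage: run_bounded LIMIT_SECONDS COMMAND [ARGS...]
# COMMAND must be an external program; `timeout` cannot run shell functions.
run_bounded() {
    _rb_limit=$1
    shift
    timeout "$_rb_limit" "$@"
}
```

For example, `run_bounded 30 some_command` would return the command's own status if it finishes in time and 124 otherwise, so the agent is slowed by at most the limit rather than blocked indefinitely.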


Labels

cherry-pick? Fixes which may need to be cherry-picked to LTS branches


3 participants