fix(devnet): supervise lumerad lifecycle in start.sh + add restart policy#150

Open
mateeullahmalik wants to merge 1 commit into master from fix/devnet-start-sh-lumerad-lifecycle

Conversation

@mateeullahmalik

Contributor

Summary

Make the devnet validator containers actually die — and auto-recover — when
lumerad dies. Today the container PID 1 outlives lumerad, so
docker ps happily reports "Up" while the chain process is a defunct
zombie. There is no restart policy either, so docker would not recover
the container even if PID 1 did exit.

Incident that motivated this

2026-06-02 13:05 UTC, lumera-devnet-1. During a host cosmovisor
restart cycle, an operator ran:

sudo pkill -9 -f 'lumerad start'

on the host. The pattern was intended to kill the host
lumera-devnet-api.service cosmovisor child, but container processes are
also visible from the host's PID namespace (each container's PID
namespace is a child of the host's), so the pkill also matched the
lumerad start processes inside the 5 validator containers. All 5
in-container lumerads got SIGKILLed.
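
The overlap described above can be checked before sending a signal. The following is a hypothetical helper, not part of this PR: it assumes a Linux host with /proc mounted and compares the PID-namespace link of each matching process with that of the calling shell. The function names pid_ns_of, in_host_pid_ns, and classify_matches are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): classify processes matching a
# pattern by whether they live in the caller's PID namespace.
# Assumes Linux with /proc; reading another user's ns link may need root.

# Print the PID-namespace identifier of a process, e.g. "pid:[4026531836]".
pid_ns_of() {
  readlink "/proc/$1/ns/pid" 2>/dev/null
}

# Succeed only when the given PID is in the same PID namespace as this shell.
in_host_pid_ns() {
  local ns
  ns="$(pid_ns_of "$1")"
  [ -n "$ns" ] && [ "$ns" = "$(pid_ns_of "$$")" ]
}

# List every PID whose command line matches the pattern, tagged by namespace.
classify_matches() {
  local pattern="$1" pid
  for pid in $(pgrep -f "$pattern"); do
    if in_host_pid_ns "$pid"; then
      echo "$pid same-namespace"
    else
      echo "$pid other-namespace-or-unreadable"
    fi
  done
}

classify_matches 'lumerad start'
```

Run as root on the host, PIDs tagged other-namespace-or-unreadable would be the in-container lumerads that a bare pkill -f would also hit.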

Container PID 1 is bash /root/scripts/start.sh, which does:

"${DAEMON}" start --home "${DAEMON_HOME}" ... >"${VALIDATOR_LOG}" 2>&1 &
exec tail -F "${VALIDATOR_LOG}" "${SUPERNODE_LOG}"

So PID 1 became tail -F. Tail doesn't care that lumerad died — it kept
following the log file forever. docker ps reported Up 13 days for
all 5 containers while the chain was actually dead at h=301810 for
~6 days before anyone noticed.
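
The failure mode can be reproduced without Docker. This is a minimal sketch under stated assumptions: sleep stands in for lumerad, and a child bash stands in for the container entrypoint that ends in exec tail -F. The side file holding the daemon PID exists only for the demo.

```shell
#!/usr/bin/env bash
# Sketch: reproduce "entrypoint outlives the daemon" without Docker.
# `sleep` is a stand-in for lumerad; the child bash stands in for start.sh.
log="$(mktemp)"

# Old pattern: background the daemon, then exec tail so that tail takes over
# the entrypoint's PID. The daemon PID is written to a side file for the demo.
bash -c 'sleep 1000 >"$1" 2>&1 & echo "$!" >"$1.pid"; exec tail -F "$1" >/dev/null 2>&1' _ "$log" &
entry_pid=$!

# Wait until the stand-in daemon has been launched.
until [ -s "$log.pid" ]; do sleep 0.1; done
daemon_pid="$(cat "$log.pid")"

kill -9 "$daemon_pid"   # simulate the stray pkill -9
sleep 0.5

# tail never waits on its child, so the entrypoint is still running.
if kill -0 "$entry_pid" 2>/dev/null; then entry_alive=yes; else entry_alive=no; fi
echo "daemon killed; entrypoint alive: $entry_alive"

# Clean up the demo processes and files.
kill "$entry_pid" 2>/dev/null
wait "$entry_pid" 2>/dev/null
rm -f "$log" "$log.pid"
```

The entrypoint PID stays alive after the daemon is gone, which is the same condition that kept docker ps reporting the containers as Up.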

There was also no restart: policy on the validator service, so even
if PID 1 had exited correctly, docker would not have restarted the
container.

Fix

1. devnet/scripts/start.sh

  • Capture LUMERAD_PID=$! after launching lumerad.
  • Demote tail -F from exec to a background &, capturing
    TAIL_PID=$!.
  • Add wait_for_lumera():
    • wait "$LUMERAD_PID", capture rc.
    • kill "$TAIL_PID" to flush logs.
    • exit "$rc".
  • Call wait_for_lumera at the end of both run_auto_flow and the
    run mode.

PID 1 now dies with lumerad. Combined with the restart policy, docker
restarts the container automatically.
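
The supervision steps listed above can be sketched as follows. The names LUMERAD_PID, TAIL_PID, and wait_for_lumera come from the description; the function body and the stand-in daemon are assumptions about their shape, not the committed start.sh.

```shell
#!/usr/bin/env bash
# Sketch of the supervision pattern; not the committed start.sh.

# Block until the daemon exits, stop the log follower, and exit with the
# daemon's status so the calling shell (PID 1 in the container) dies with it.
wait_for_lumera() {
  local rc=0
  wait "$LUMERAD_PID" || rc=$?
  kill "$TAIL_PID" 2>/dev/null || true
  exit "$rc"
}

# Demo in a subshell so that `exit` ends the subshell, not the caller.
rc=0
(
  log="$(mktemp)"
  trap 'rm -f "$log"' EXIT
  # Stand-in for lumerad: a process that SIGKILLs itself.
  sh -c 'kill -9 $$' >"$log" 2>&1 &
  LUMERAD_PID=$!
  tail -F "$log" >/dev/null 2>&1 &   # background follower instead of exec
  TAIL_PID=$!
  wait_for_lumera
) || rc=$?
echo "supervisor exited rc=$rc"
```

Because the shell itself stays in the foreground and waits on the daemon, a SIGKILL of the daemon surfaces as exit status 137 (128 + 9), matching the rc=137 seen in the validation log.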

2. devnet/generators/docker-compose.go

  • Add Restart string field (yaml restart,omitempty) to
    DockerComposeService.
  • Set Restart: "unless-stopped" on the generated validator services.

Validator containers will now restart on lumerad death but NOT on a
deliberate docker compose stop.

End-to-end validation

A standalone docker container reproducing the supervisor logic was
spawned with --restart on-failure:3:

[BOOT] iteration=1 pid1=1
[BOOT] fake lumerad pid=9
/sim_start.sh: line 27:     9 Killed                     sleep 100000
[BOOT] lumerad exited rc=137 — propagating to PID 1
[TEST] killed lumerad pid=9 at 17:40:13
[BOOT] iteration=2 pid1=1
[BOOT] fake lumerad pid=9

The state on the /state/iter mount was preserved across the restart,
which is exactly the behaviour the real validator containers need so
they rejoin the chain on the same data directory.
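
The restart-and-preserve behaviour can also be illustrated without Docker. The sketch below is a pure-shell simulation, not the author's /sim_start.sh: a retry loop loosely imitates a three-retry limit like on-failure:3, a temporary directory stands in for the /state mount, and run_once is a hypothetical name for one container start.

```shell
#!/usr/bin/env bash
# Simulation only: a retry loop stands in for Docker's restart policy and a
# temp directory stands in for the persistent /state mount.
state_dir="$(mktemp -d)"

# One simulated container start: bump the persisted counter, then run a
# stand-in daemon that SIGKILLs itself and return its exit status.
run_once() {
  local iter
  iter=$(( $(cat "$state_dir/iter" 2>/dev/null || echo 0) + 1 ))
  echo "$iter" >"$state_dir/iter"
  echo "[BOOT] iteration=$iter"
  sh -c 'kill -9 $$' &
  wait $!
}

# Restart on failure, at most three times.
restarts=0
until run_once; do
  [ "$restarts" -ge 3 ] && break
  restarts=$((restarts + 1))
done

final_iter="$(cat "$state_dir/iter")"
echo "restarts=$restarts final iteration=$final_iter"
rm -rf "$state_dir"
```

The counter keeps increasing across simulated restarts because it lives outside the process being restarted, which is the property the validators rely on for their data directory.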

Risks / non-goals

  • Devnet-only. Touches devnet/* only — no chain state machine
    impact, no module impact, no protobuf changes, no consensus impact.
  • docker compose stop continues to win over unless-stopped, so
    intentional shutdowns are unchanged.
  • Existing live devnet containers need to be recreated for the new
    restart policy to apply. docker compose restart is NOT enough; ops
    needs docker compose up -d --force-recreate on next deploy.
    Existing containers also pick up the new start.sh only after a
    rebuild of the validator image (the script is baked into the image at
    build time). Both follow naturally from the next devnet rollout.
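
Whether a given container has picked up the new policy can be checked from the host. The following is a hypothetical rollout check, not part of this PR; it relies only on docker inspect and its HostConfig.RestartPolicy.Name field, and the container names are whatever the operator passes as arguments.

```shell
#!/usr/bin/env bash
# Hypothetical rollout check (not part of this PR).

# Print a container's restart policy name as recorded by Docker.
restart_policy_of() {
  docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$1"
}

# Succeed when the recorded policy is anything other than unless-stopped.
needs_recreate() {
  [ "$1" != "unless-stopped" ]
}

# Report each container named on the command line.
for name in "$@"; do
  policy="$(restart_policy_of "$name")" || continue
  if needs_recreate "$policy"; then
    echo "$name: restart policy '$policy', recreate needed"
  else
    echo "$name: restart policy '$policy', ok"
  fi
done
```

Containers created before this change would be expected to report a policy other than unless-stopped until they are recreated.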

Rollback

Revert the commit; existing containers do not depend on the new
behaviour for normal steady-state operation.

Observability follow-ups (separate)

  • lumera-devnet-1 is not currently shipping logs to Datadog (verified
    via Datadog API — only testnet supernode services are indexed). Once
    the next ops cycle wires devnet logs to DD, add a monitor for
    "lumera-devnet-1 height did not advance in 5 min". That's a separate
    ops task, not in scope here.

…licy

When lumerad inside a devnet validator container dies (crash, OOM, or a
host-side `pkill -f 'lumerad start'` that matches container processes,
which are visible from the host's PID namespace), `start.sh` previously
kept running because its final `exec tail -F …` replaced bash as PID 1.
`docker ps` reported "Up" indefinitely while the chain process was a
defunct zombie and the container never restarted.

On 2026-06-02 this exact misfire occurred: an operator's `sudo pkill -9 -f
'lumerad start'`, aimed at the host cosmovisor process, also matched the
in-container lumerads, which are visible from the host's PID namespace.
It silently killed all 5 lumera-devnet-1 validators at h=301810 and left
the chain dead for ~6 days before anyone noticed.

Fix:

1. `devnet/scripts/start.sh`: capture `LUMERAD_PID=$!`, demote `tail -F`
   from `exec` to background, then `wait "$LUMERAD_PID"` and propagate
   the exit code via `exit "$rc"`. PID 1 now dies with lumerad.

2. `devnet/generators/docker-compose.go`: add `Restart string` field to
   `DockerComposeService` and set `Restart: "unless-stopped"` on
   validator services. Combined with (1), a non-zero lumerad exit
   triggers docker to restart the container and rejoin the chain.

End-to-end validated against a real docker container with the new
supervisor logic:
 - iteration=1 fake-lumerad SIGKILLed at +3s
 - start.sh propagated rc=137 to PID 1
 - docker restart policy fired
 - iteration=2 started cleanly with fresh fake-lumerad
 - state preserved through the restart

`run`, `auto`, and bootstrap/wait modes are all updated consistently, so
the behaviour is identical across the modes that actually launch lumerad.

Risks / non-goals:
 - This is a devnet-only change (devnet/* paths). No chain state
   machine impact.
 - `docker compose stop` continues to win over the restart policy
   (unless-stopped semantics), so intentional shutdowns are unchanged.
 - Existing live devnet containers still need to be recreated
   (`docker compose up -d --force-recreate`) for the new restart
   policy to apply; an in-place `compose restart` is not enough.