fix(devnet): supervise lumerad lifecycle in start.sh + add restart policy #150
Open
mateeullahmalik wants to merge 1 commit into
…licy

When lumerad inside a devnet validator container dies (crash, OOM, or a host-side `pkill -f 'lumerad start'` that matches container processes via the shared PID namespace), `start.sh` previously kept running because its final `exec tail -F …` replaced bash as PID 1. `docker ps` reported "Up" indefinitely while the chain process was a defunct zombie and the container never restarted.

On 2026-06-02 this exact misfire (an operator's `sudo pkill -9 -f 'lumerad start'` aimed at the host cosmovisor process matched the in-container lumerads through the shared PID namespace) silently killed all 5 lumera-devnet-1 validators at h=301810 and left the chain dead for ~6 days before anyone noticed.

Fix:

1. `devnet/scripts/start.sh`: capture `LUMERAD_PID=$!`, demote `tail -F` from `exec` to background, then `wait "$LUMERAD_PID"` and propagate the exit code via `exit "$rc"`. PID 1 now dies with lumerad.
2. `devnet/generators/docker-compose.go`: add `Restart string` field to `DockerComposeService` and set `Restart: "unless-stopped"` on validator services. Combined with (1), a non-zero lumerad exit triggers docker to restart the container and rejoin the chain.

End-to-end validated against a real docker container with the new supervisor logic:

- iteration=1 fake-lumerad SIGKILLed at +3s
- start.sh propagated rc=137 to PID 1
- docker restart policy fired
- iteration=2 started cleanly with fresh fake-lumerad
- state preserved through the restart

`run`, `auto`, and bootstrap/wait modes all updated consistently so the behaviour is identical for the modes that actually launch lumerad.

Risks / non-goals:

- This is a devnet-only change (devnet/* paths). No chain state machine impact.
- `docker compose stop` continues to win over the restart policy (unless-stopped semantics), so intentional shutdowns are unchanged.
- Existing live devnet containers still need to be recreated (`docker compose up -d --force-recreate`) for the new restart policy to apply; an in-place `compose restart` is not enough.
Summary
Make the devnet validator containers actually die, and auto-recover, when
`lumerad` dies. Today the container PID 1 outlives `lumerad`, so `docker ps`
happily reports "Up" while the chain process is a defunct zombie. There is no
`restart` policy either, so docker would not recover the container even if
PID 1 did exit.
Incident that motivated this
2026-06-02 13:05 UTC, lumera-devnet-1. During a host cosmovisor
restart cycle, an operator ran `sudo pkill -9 -f 'lumerad start'`
on the host. The regex was intended to kill the host
`lumera-devnet-api.service` cosmovisor child, but Docker's PID namespace
is shared with the host, so the `pkill` also matched the `lumerad start`
processes inside the 5 validator containers. All 5 in-container
lumerads got SIGKILLed.

Container PID 1 is `bash /root/scripts/start.sh`, whose final
`exec tail -F …` replaced bash, so PID 1 became `tail -F`. Tail doesn't
care that lumerad died; it kept following the log file forever.
`docker ps` reported `Up 13 days` for all 5 containers while the chain
was actually dead at h=301810 for ~6 days before anyone noticed.

There was also no `restart:` policy on the validator service, so even
if PID 1 had exited correctly, docker would not have restarted the
container.
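
The PID 1 behaviour described above can be reproduced outside docker. The following is a minimal sketch, not the actual `start.sh`: `sleep 1000` stands in for `lumerad start`, the temp files stand in for the real log path, and the variable names (`PID1`, `old_behaviour`) are made up for illustration. It shows that once the launcher shell has `exec`ed into `tail -F`, killing the child leaves the original PID alive.

```shell
# Hypothetical stand-ins: temp files instead of the real log path.
LOG_FILE="$(mktemp)"
PID_FILE="$(mktemp)"

# Old pattern: start the chain process in the background, then exec tail.
sh -c '
  sleep 1000 >>"$1" 2>&1 &   # stand-in for `lumerad start`
  echo "$!" >"$2"            # record the child PID so it can be killed below
  exec tail -F "$1"          # the shell is replaced; nothing waits on the child
' old-start "$LOG_FILE" "$PID_FILE" >/dev/null 2>&1 &
PID1=$!                      # same PID before and after the exec

sleep 1
kill -9 "$(cat "$PID_FILE")" # simulate the stray pkill hitting the child
sleep 1

# The launcher PID is now tail, which is unaffected by the child's death.
if kill -0 "$PID1" 2>/dev/null; then
  old_behaviour="still up"
else
  old_behaviour="exited"
fi
echo "launcher after child death: $old_behaviour"

# Clean up the sketch's own processes and files.
kill "$PID1" 2>/dev/null || true
wait "$PID1" 2>/dev/null || true
rm -f "$LOG_FILE" "$PID_FILE"
```

In a container the same situation is what keeps `docker ps` reporting "Up": the process docker tracks is the surviving `tail`.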
Fix
1. `devnet/scripts/start.sh`
- Capture `LUMERAD_PID=$!` after launching lumerad.
- Demote `tail -F` from `exec` to a background `&`, capturing `TAIL_PID=$!`.
- Add `wait_for_lumera()`:
  - `wait "$LUMERAD_PID"`, capture rc.
  - `kill "$TAIL_PID"` to flush logs.
  - `exit "$rc"`.
- Call `wait_for_lumera` at the end of both `run_auto_flow` and the `run` mode.

PID 1 now dies with lumerad. Combined with the restart policy, docker
restarts the container automatically.
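
A rough sketch of the supervision pattern listed above, under stated assumptions: `fake_lumerad` is a hypothetical stand-in that exits with code 3, the log path is a temp file, and the whole flow runs in a subshell so that `exit "$rc"` can be observed from outside. The real `start.sh` may differ in detail; only the names `LUMERAD_PID`, `TAIL_PID`, and `wait_for_lumera` come from this PR.

```shell
LOG_FILE="$(mktemp)"   # hypothetical log path for the sketch

# Hypothetical stand-in for lumerad: log one line, then "crash" with rc=3.
fake_lumerad() {
  echo "fake lumerad starting"
  sleep 1
  exit 3
}

# Block on the chain process, stop the log follower, propagate the exit code.
wait_for_lumera() {
  rc=0
  wait "$LUMERAD_PID" || rc=$?
  kill "$TAIL_PID" 2>/dev/null || true
  wait "$TAIL_PID" 2>/dev/null || true
  exit "$rc"
}

supervisor_rc=0
(
  fake_lumerad >>"$LOG_FILE" 2>&1 &
  LUMERAD_PID=$!
  tail -F "$LOG_FILE" &   # background follower instead of `exec tail -F`
  TAIL_PID=$!
  wait_for_lumera         # exits this subshell with the fake lumerad's code
) || supervisor_rc=$?

rm -f "$LOG_FILE"
echo "supervisor exited with rc=$supervisor_rc"
```

Because the shell stays alive as the parent and only exits after `wait` returns, its exit status is the child's status, which is what a restart policy can act on.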
2. `devnet/generators/docker-compose.go`
- Add `Restart string` field (yaml `restart,omitempty`) to `DockerComposeService`.
- Set `Restart: "unless-stopped"` on the generated validator services.

Validator containers will now restart on lumerad death but NOT on a
deliberate `docker compose stop`.

End-to-end validation
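A standalone docker container reproducing the supervisor logic was
spawned with `--restart on-failure:3`. In iteration 1 the fake lumerad
was SIGKILLed at +3s, `start.sh` propagated rc=137 to PID 1, and the
docker restart policy fired; iteration 2 then started cleanly with a
fresh fake lumerad.

State across iterations (the `/state/iter` mount) was preserved across
the restart, which is exactly the behaviour the real validator
containers need so they rejoin the chain on the same data directory.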
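
The shape of that validation can be approximated without docker. This is only a sketch under assumptions: `STATE_DIR` stands in for the `/state` mount, `sleep 1` stands in for the fake lumerad, the SIGKILL is sent immediately rather than at +3s, and two plain function calls stand in for the two container starts that the restart policy would produce. It is not the harness used for the PR.

```shell
# Sketch only: STATE_DIR stands in for the /state mount.
STATE_DIR="$(mktemp -d)"
echo 0 >"$STATE_DIR/iter"

# One simulated container start: bump the persisted counter, launch the fake
# lumerad, SIGKILL it on the first iteration only, and return its wait status.
run_iteration() {
  iter=$(( $(cat "$STATE_DIR/iter") + 1 ))
  echo "$iter" >"$STATE_DIR/iter"
  sleep 1 &
  fake_pid=$!
  if [ "$iter" -eq 1 ]; then
    kill -9 "$fake_pid"
  fi
  wait "$fake_pid"
}

first_rc=0
run_iteration || first_rc=$?    # killed by signal 9, so the status is 128 + 9
second_rc=0
run_iteration || second_rc=$?   # the fake lumerad exits on its own here
final_iter="$(cat "$STATE_DIR/iter")"

rm -rf "$STATE_DIR"
echo "first_rc=$first_rc second_rc=$second_rc final_iter=$final_iter"
```

The non-zero status from the first iteration is what an `on-failure` or `unless-stopped` policy reacts to, and the counter surviving into the second iteration mirrors the preserved `/state/iter` value.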
Risks / non-goals
- `devnet/*` only: no chain state machine impact, no module impact, no protobuf changes, no consensus impact.
- `docker compose stop` continues to win over `unless-stopped`, so intentional shutdowns are unchanged.
- Existing live devnet containers need to be recreated for the new restart policy to apply. `docker compose restart` is NOT enough; ops needs `docker compose up -d --force-recreate` on next deploy.
- Existing containers also pick up the new `start.sh` only after a rebuild of the validator image (the script is baked into the image at build time).

Both follow naturally from the next devnet rollout.
Rollback
Revert the commit; existing containers do not depend on the new
behaviour for normal steady-state operation.
Observability follow-ups (separate)
via Datadog API — only testnet supernode services are indexed). Once
the next ops cycle wires devnet logs to DD, add a monitor for
"lumera-devnet-1 height did not advance in 5 min". That's a separate
ops task, not in scope here.