fix(devnet): supervise lumerad lifecycle in start.sh + add restart policy by mateeullahmalik · Pull Request #150 · LumeraProtocol/lumera

mateeullahmalik · 2026-06-09T18:44:46Z

Summary

Make the devnet validator containers actually die — and auto-recover — when
lumerad dies. Today the container PID 1 outlives lumerad, so
docker ps happily reports "Up" while the chain process is a defunct
zombie. There is no restart policy either, so docker would not recover
the container even if PID 1 did exit.

Incident that motivated this

2026-06-02 13:05 UTC, lumera-devnet-1. During a host cosmovisor
restart cycle, an operator ran:

sudo pkill -9 -f 'lumerad start'

on the host. The pattern was intended to kill only the cosmovisor child
of the host's lumera-devnet-api.service, but processes running inside
containers are also visible in the host's PID namespace, so the pkill
matched the lumerad start processes inside the 5 validator containers
as well. All 5 in-container lumerads were SIGKILLed.
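
One way to avoid this class of misfire is to check which PID namespace each match belongs to before signalling it. The following is a minimal sketch, not part of this PR: `pid_ns_of` and `filter_own_pid_ns` are hypothetical helper names, and it assumes a Linux host where `/proc/<pid>/ns/pid` is readable (root is needed for other users' processes).

```shell
#!/usr/bin/env bash
# Sketch: keep only PIDs that live in the caller's own PID namespace.
# Linux-only; relies on the /proc/<pid>/ns/pid symlink.

# Print the PID-namespace identifier of a process, e.g. "pid:[4026531836]".
pid_ns_of() {
  readlink "/proc/$1/ns/pid" 2>/dev/null
}

# Read PIDs on stdin (one per line) and echo those that share the
# namespace of the calling shell. Containerised processes are dropped.
filter_own_pid_ns() {
  local own pid
  own="$(pid_ns_of "$$")" || return 1
  while read -r pid; do
    [ -n "$pid" ] || continue
    if [ "$(pid_ns_of "$pid")" = "$own" ]; then
      echo "$pid"
    fi
  done
}

# Usage (list first, signal only after reviewing the output):
#   pgrep -f 'lumerad start' | filter_own_pid_ns
```

Run on the host, the usage line would list only host-namespace matches, so the in-container lumerads would be left out of any follow-up `kill`.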

Container PID 1 is bash /root/scripts/start.sh, which does:

"${DAEMON}" start --home "${DAEMON_HOME}" ... >"${VALIDATOR_LOG}" 2>&1 &
…
exec tail -F "${VALIDATOR_LOG}" "${SUPERNODE_LOG}" …

So PID 1 became tail -F. Tail doesn't care that lumerad died — it kept
following the log file forever. docker ps reported Up 13 days for
all 5 containers while the chain was actually dead at h=301810 for
~6 days before anyone noticed.

There was also no restart: policy on the validator service, so even
if PID 1 had exited correctly, docker would not have restarted the
container.

Fix

1. `devnet/scripts/start.sh`

- Capture `LUMERAD_PID=$!` after launching lumerad.
- Demote `tail -F` from `exec` to a background `&`, capturing
  `TAIL_PID=$!`.
- Add `wait_for_lumera()`:
  - `wait "$LUMERAD_PID"`, capture `rc`.
  - `kill "$TAIL_PID"` to flush logs.
  - `exit "$rc"`.
- Call `wait_for_lumera` at the end of both `run_auto_flow` and the
  `run` mode.

PID 1 now dies with lumerad. Combined with the restart policy, docker
restarts the container automatically.
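
The supervision pattern described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual `start.sh`: `launch_daemon` and `follow_logs` are hypothetical helper names, and the default log path is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch of the supervision pattern; not the real start.sh.
# The default log path below is a placeholder for illustration.
VALIDATOR_LOG="${VALIDATOR_LOG:-/tmp/validator.log}"

# Start the daemon in the background and remember its PID.
launch_daemon() {
  "$@" >>"${VALIDATOR_LOG}" 2>&1 &
  LUMERAD_PID=$!
}

# Follow the log in the background instead of exec'ing tail,
# so the shell stays PID 1 and can keep supervising.
follow_logs() {
  tail -F "${VALIDATOR_LOG}" &
  TAIL_PID=$!
}

# Block until the daemon exits, stop the log follower, and
# propagate the daemon's exit code as the shell's own.
wait_for_lumera() {
  local rc=0
  wait "${LUMERAD_PID}" || rc=$?
  kill "${TAIL_PID}" 2>/dev/null || true
  exit "${rc}"
}

# Usage inside an entrypoint (not executed here):
#   launch_daemon "${DAEMON}" start --home "${DAEMON_HOME}"
#   follow_logs
#   wait_for_lumera
```

Because `wait` reports 128 plus the signal number for a signalled child, a SIGKILLed daemon surfaces as exit code 137, matching the `rc=137` seen in the validation run.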

2. `devnet/generators/docker-compose.go`

- Add a `Restart string` field (`yaml:"restart,omitempty"`) to
  `DockerComposeService`.
- Set `Restart: "unless-stopped"` on the generated validator services.

Validator containers will now restart on lumerad death but NOT on a
deliberate docker compose stop.

End-to-end validation

A standalone docker container reproducing the supervisor logic was
spawned with --restart on-failure:3:

[BOOT] iteration=1 pid1=1
[BOOT] fake lumerad pid=9
/sim_start.sh: line 27:     9 Killed                     sleep 100000
[BOOT] lumerad exited rc=137 — propagating to PID 1
[TEST] killed lumerad pid=9 at 17:40:13
[BOOT] iteration=2 pid1=1
[BOOT] fake lumerad pid=9

State on the /state/iter mount was preserved across the restart, which
is exactly the behaviour the real validator containers need so they
rejoin the chain on the same data directory.
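
The iteration counter behind the `[BOOT] iteration=N` lines could look roughly like this. It is a hypothetical reconstruction, since the real `/sim_start.sh` is not shown here: it assumes the counter is a plain file named `iter` in a mounted state directory, and `next_iteration` is an illustrative name.

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of a restart counter kept on a mounted
# state directory; the real sim_start.sh may differ.

# Read the previous iteration from <state_dir>/iter (0 if absent),
# increment it, persist it, and print the new value.
next_iteration() {
  local state_dir="$1" iter=0
  if [ -f "${state_dir}/iter" ]; then
    iter="$(cat "${state_dir}/iter")"
  fi
  iter=$((iter + 1))
  echo "${iter}" >"${state_dir}/iter"
  echo "${iter}"
}

# Usage at container boot (not executed here):
#   echo "[BOOT] iteration=$(next_iteration /state) pid1=$$"
```

Since the file lives on the mount rather than in the container's writable layer, the count survives a policy-driven restart, which is the property the validation run was checking.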

Risks / non-goals

- Devnet-only. Touches `devnet/*` only: no chain state machine
  impact, no module impact, no protobuf changes, no consensus impact.
- `docker compose stop` continues to win over `unless-stopped`, so
  intentional shutdowns are unchanged.
- Existing live devnet containers need to be recreated for the new
  restart policy to apply. `docker compose restart` is NOT enough; ops
  needs `docker compose up -d --force-recreate` on next deploy.
- Existing containers also pick up the new `start.sh` only after a
  rebuild of the validator image (the script is baked into the image at
  build time). Both follow naturally from the next devnet rollout.
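
To confirm whether a given container has actually picked up the policy, the value Docker recorded at creation time can be inspected. A minimal sketch, assuming the `docker` CLI is on the path; `restart_policy_of` and `needs_recreate` are hypothetical helper names, and the container name in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: check whether a container was created with the expected
# restart policy. Helper names are illustrative only.

# Print the restart policy name recorded for a container.
restart_policy_of() {
  docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$1"
}

# Exit 0 when the container lacks the expected policy and therefore
# has to be recreated; exit 1 when it already has it; exit 2 when
# the container could not be inspected.
needs_recreate() {
  local container="$1" expected="${2:-unless-stopped}" actual
  actual="$(restart_policy_of "$container")" || return 2
  [ "$actual" != "$expected" ]
}

# Usage (not executed here; <validator-container> is a placeholder):
#   if needs_recreate <validator-container>; then
#     docker compose up -d --force-recreate
#   fi
```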

Rollback

Revert the commit; existing containers do not depend on the new
behaviour for normal steady-state operation.

Observability follow-ups (separate)

lumera-devnet-1 is not currently shipping logs to Datadog (verified
via Datadog API — only testnet supernode services are indexed). Once
the next ops cycle wires devnet logs to DD, add a monitor for
"lumera-devnet-1 height did not advance in 5 min". That's a separate
ops task, not in scope here.

…licy

When lumerad inside a devnet validator container dies (crash, OOM, or a host-side `pkill -f 'lumerad start'` that matches container processes, which are visible in the host's PID namespace), `start.sh` previously kept running because its final `exec tail -F …` replaced bash as PID 1. `docker ps` reported "Up" indefinitely while the chain process was a defunct zombie and the container never restarted.

On 2026-06-02 this exact misfire silently killed all 5 lumera-devnet-1 validators at h=301810 and left the chain dead for ~6 days before anyone noticed: an operator's `sudo pkill -9 -f 'lumerad start'`, aimed at the host cosmovisor process, also matched the in-container lumerads.

Fix:

1. `devnet/scripts/start.sh`: capture `LUMERAD_PID=$!`, demote `tail -F` from `exec` to background, then `wait "$LUMERAD_PID"` and propagate the exit code via `exit "$rc"`. PID 1 now dies with lumerad.
2. `devnet/generators/docker-compose.go`: add `Restart string` field to `DockerComposeService` and set `Restart: "unless-stopped"` on validator services. Combined with (1), a non-zero lumerad exit triggers docker to restart the container and rejoin the chain.

End-to-end validated against a real docker container with the new supervisor logic:

- iteration=1 fake-lumerad SIGKILLed at +3s
- start.sh propagated rc=137 to PID 1
- docker restart policy fired
- iteration=2 started cleanly with fresh fake-lumerad
- state preserved through the restart

`run`, `auto`, and bootstrap/wait modes all updated consistently so the behaviour is identical for the modes that actually launch lumerad.

Risks / non-goals:

- This is a devnet-only change (devnet/* paths). No chain state machine impact.
- `docker compose stop` continues to win over the restart policy (unless-stopped semantics), so intentional shutdowns are unchanged.
- Existing live devnet containers still need to be recreated (`docker compose up -d --force-recreate`) for the new restart policy to apply; an in-place `compose restart` is not enough.

mateeullahmalik wants to merge 1 commit into master from fix/devnet-start-sh-lumerad-lifecycle
