fix: test-suite stability and drone camera pinhole#2545
Conversation
The test asserted >=8 messages after a fixed 1s sleep, but under heavy parallel load (pytest -n auto) the publisher/LCM threads get starved and deliver fewer, failing with 'assert 5 >= 8'. Poll up to a 15s deadline instead; this still verifies end-to-end delivery without assuming a wall-clock throughput.
Greptile SummaryThis PR delivers three independent stability fixes: a flaky test poll-loop replacement, best-effort fixture teardown to prevent thread leaks under parallel load, and a missing Rerun
Confidence Score: 5/5All three changes are narrowly scoped bug fixes with no shared mutable state, no API surface changes, and direct test coverage for the new teardown helper. The poll-loop replacement is a strictly safer version of the original sleep; the teardown helper is covered by four unit tests and correctly handles every failure permutation; the pinhole constant and matrix are derived correctly from cx/cy yielding 1920×1080. No regressions are introduced. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
subgraph Fix1["Fix 1 — test_basic_deployment"]
A[robot.start / nav.start] --> B{Poll loop\nup to 15 s}
B -- "all counts ≥ 8" --> C[break early]
B -- deadline --> D[timeout — assertions fail]
C --> E[assert mov/odom/lidar ≥ 8]
end
subgraph Fix2["Fix 2 — agent_setup teardown"]
F[fixture yield] --> G[teardown_agent_setup]
G --> H["_safe(coordinator.stop)"]
H -- "raises?" --> H2[log error, continue]
H --> I["_safe(transport.stop) × N"]
I -- "raises?" --> I2[log error, continue]
I --> J["_safe(unsub[i]) × M"]
J -- "raises?" --> J2[log error, continue]
end
subgraph Fix3["Fix 3 — drone camera pinhole"]
K[bridge start] --> L["_log_static()"]
L --> M["rr.log('world/video', rr.Pinhole(...), static=True)"]
M --> N[camera frustum visible in 3D view]
end
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
subgraph Fix1["Fix 1 — test_basic_deployment"]
A[robot.start / nav.start] --> B{Poll loop\nup to 15 s}
B -- "all counts ≥ 8" --> C[break early]
B -- deadline --> D[timeout — assertions fail]
C --> E[assert mov/odom/lidar ≥ 8]
end
subgraph Fix2["Fix 2 — agent_setup teardown"]
F[fixture yield] --> G[teardown_agent_setup]
G --> H["_safe(coordinator.stop)"]
H -- "raises?" --> H2[log error, continue]
H --> I["_safe(transport.stop) × N"]
I -- "raises?" --> I2[log error, continue]
I --> J["_safe(unsub[i]) × M"]
J -- "raises?" --> J2[log error, continue]
end
subgraph Fix3["Fix 3 — drone camera pinhole"]
K[bridge start] --> L["_log_static()"]
L --> M["rr.log('world/video', rr.Pinhole(...), static=True)"]
M --> N[camera frustum visible in 3D view]
end
Reviews (3): Last reviewed commit: "test(tests): cover teardown_agent_setup ..." | Re-trigger Greptile |
| for unsub in unsubs: | ||
| _safe(unsub, "unsubscribe") |
There was a problem hiding this comment.
Generic unsubscribe label hides which subscription failed
Every unsubscribe failure is logged with the same step="unsubscribe" label, so when a cleanup failure is reported you cannot tell whether it was the agent_transport or the finished_transport (or any others) that raised. Since the unsubs list is an ordered collection built from append calls in the fixture, a zero-cost improvement is to include the index or the callable's repr in the step name.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Good catch — applied in the latest push (folded into the teardown commit). Each unsubscribe is now labelled by index: _safe(unsub, f"unsubscribe[{i}]"), so a failing teardown step identifies which subscription raised.
When an agent test fails mid-run (e.g. the LLM is slow under load and finished_event.wait(60) times out), a raising cleanup step skipped the rest, leaking LCM transport threads and turning the failure into a failure + teardown error. Add a shared teardown_agent_setup() that runs every step (coordinator, transports, unsubs) best-effort and logs rather than propagating. Each unsubscribe is labelled by index so a failing one is identifiable.
The camera image is logged at world/video, inside the 3D view's world origin, but no Pinhole was ever logged, so Rerun warned '2D visualizers require a pinhole ancestor to be shown in a 3D view' and the feed only rendered in the 2D pane. Register a static Pinhole at world/video built from the (now shared) camera intrinsics.
028e879 to
df228ca
Compare
|
Hi @leshy @mustafab0 @paul-nechifor @spomichter — the |
Codecov Report❌ Patch coverage is @@ Coverage Diff @@
## main #2545 +/- ##
==========================================
- Coverage 70.81% 70.46% -0.36%
==========================================
Files 862 863 +1
Lines 77475 77491 +16
Branches 6882 6881 -1
==========================================
- Hits 54862 54601 -261
- Misses 20818 21130 +312
+ Partials 1795 1760 -35
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 14 files with indirect coverage changes 🚀 New features to boost your workflow:
|
Unit-test the agent_setup teardown helper: all steps run in order, and a raising step (coordinator.stop, a transport, or an unsub) neither skips the remaining cleanup nor propagates. Covers the previously-untested except branch and the coordinator=None skip.
Note on the red
|
Three independent fixes surfaced while running the test suite and the demo simulation. Each is its own commit.
1.
fix(core)— flakytest_basic_deploymentThe test asserted
>= 8messages after a fixedtime.sleep(1). Under heavy parallel load (pytest -n auto) the publisher and LCM transport threads get starved and deliver fewer within that window, failing withassert 5 >= 8. It now polls up to a 15s deadline — still verifying end-to-end delivery through the deployment, without assuming a wall-clock throughput.2.
fix(tests)— resilientagent_setupteardownWhen an agent test fails mid-run (e.g. the LLM is slow under load and
finished_event.wait(60)times out), a single raising cleanup step skipped the remaining ones, leaking LCM transport threads. The autousemonitor_threadsfixture then flagged the leak, turning one failure into a failure and a teardown error. A sharedteardown_agent_setup()helper now runs every cleanup step (coordinator, transports, unsubs) best-effort and logs failures instead of propagating. Wired into bothagent_setupfixtures.3.
fix(drone)— camera pinhole for the 3D viewThe drone camera feed is logged at
world/video, inside the 3D view'sworldorigin, but noPinholewas ever logged — so Rerun warned "2D visualizers require a pinhole ancestor to be shown in a 3D view" and the feed only rendered in the dedicated 2D pane. A staticPinholeis now logged atworld/video, built from the camera intrinsics (pulled into a shared_CAMERA_INTRINSICSconstant so they can't drift from the camera module).Testing
test_basic_deploymentpasses under deliberate full-CPU starvation (previously failedassert 5 >= 8).drone-basicreplay boots cleanly; the static pinhole logs at bridge start without error.ruff checkandruff format --checkclean on all changed files.