fix(api): discover orchestrators via nomad service#3176

Open
wj-e2b wants to merge 1 commit into main from wj-orchestrator-rollout
@wj-e2b wj-e2b commented Jul 2, 2026

Switch orchestrator discovery to Nomad's service discovery instead of blindly checking every node in the default node pool. This lets us use multiple node pools while cutting the orchestrator over from system jobs to service jobs.

The existing orchestrator-ee system job already registers with the service.

cursor Bot commented Jul 2, 2026

PR Summary

Medium Risk
Discovery now depends on orchestrator jobs registering the expected Nomad service name; a misconfigured name or missing registrations could leave the API with no routable orchestrators. Under the previous approach, bad nodes would have failed gRPC health checks anyway.

Overview
Nomad orchestrator discovery no longer treats every ready node in the default pool as an orchestrator. It now lists Nomad-native service registrations (the service name is configurable via NOMAD_ORCHESTRATOR_SERVICE_NAME and defaults to "orchestrator") and builds dial targets from each registration's address and port, deduplicating per node and skipping registrations with empty addresses. That supports orchestrators on other node pools and during job cutovers without dialing hosts that never registered the service.
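The target-building step described in the overview can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the `registration` struct, the `dialTargets` and `serviceName` helpers, and port 5008 are hypothetical stand-ins, and the actual Nomad API call that lists registrations is left out. Only the environment variable name and its default come from the summary above.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strconv"
)

// registration is a hypothetical stand-in for one Nomad service registration.
// Only the fields mentioned in the overview are modelled.
type registration struct {
	NodeID  string
	Address string
	Port    int
}

// serviceName reads NOMAD_ORCHESTRATOR_SERVICE_NAME and falls back to
// "orchestrator" when the variable is unset or empty.
func serviceName() string {
	if v := os.Getenv("NOMAD_ORCHESTRATOR_SERVICE_NAME"); v != "" {
		return v
	}
	return "orchestrator"
}

// dialTargets returns at most one host:port per node and skips registrations
// with an empty address. Empty addresses are filtered before deduplication so
// that a later valid registration for the same node is still used.
func dialTargets(regs []registration) []string {
	seen := make(map[string]struct{}, len(regs))
	targets := make([]string, 0, len(regs))
	for _, r := range regs {
		if r.Address == "" {
			continue
		}
		if _, dup := seen[r.NodeID]; dup {
			continue
		}
		seen[r.NodeID] = struct{}{}
		targets = append(targets, net.JoinHostPort(r.Address, strconv.Itoa(r.Port)))
	}
	return targets
}

func main() {
	// Hypothetical registrations; the port number is made up for illustration.
	regs := []registration{
		{NodeID: "n1", Address: "10.0.0.1", Port: 5008},
		{NodeID: "n1", Address: "10.0.0.1", Port: 5008}, // duplicate node
		{NodeID: "n2", Address: "", Port: 5008},         // never got an address
		{NodeID: "n3", Address: "10.0.0.3", Port: 5008},
	}
	fmt.Println(serviceName(), dialTargets(regs))
}
```

Keying the deduplication on node ID rather than on address is one reading of "per-node deduplication"; the PR may key on something else.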

Reviewed by Cursor Bugbot for commit 46e04d9.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

The error check !errors.As(err, &sde) in the deferred wait block will always evaluate to false when a service fails because startService always returns a serviceDoneError regardless of whether the service function f() succeeded or failed. This causes all service failures to be silently ignored, and the orchestrator will incorrectly exit with a success status instead of failing.


Comment thread packages/orchestrator/pkg/factories/run.go Outdated

codecov Bot commented Jul 2, 2026

❌ 5 Tests Failed:

Tests completed: 3143 (Failed: 5, Passed: 3138, Skipped: 8)
View the top 3 failed test(s) by shortest run time
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildRUN
Stack Traces | 0s run time
=== RUN   TestTemplateBuildRUN
=== PAUSE TestTemplateBuildRUN
=== CONT  TestTemplateBuildRUN
--- FAIL: TestTemplateBuildRUN (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateTagAssignFromSourceTag
Stack Traces | 164s run time
=== RUN   TestTemplateTagAssignFromSourceTag
=== PAUSE TestTemplateTagAssignFromSourceTag
=== CONT  TestTemplateTagAssignFromSourceTag
    template_tags_test.go:59: Build failed: {<nil> An internal error occurred. Please try again or contact support with the build ID. <nil>}
--- FAIL: TestTemplateTagAssignFromSourceTag (163.50s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestDeleteTemplateFromAnotherTeamAPIKey
Stack Traces | 164s run time
=== RUN   TestDeleteTemplateFromAnotherTeamAPIKey
=== PAUSE TestDeleteTemplateFromAnotherTeamAPIKey
=== CONT  TestDeleteTemplateFromAnotherTeamAPIKey
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Building template 7q8s0umhgwnh9m8urwr7/2745e7e0-ea7a-4427-bd76-4ddc31b0306d
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] [base] FROM ubuntu:22.04 [f9f564014e009a9561a82bf8c84f9314242971e833fb019936654ecba452f184]
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Base Docker image size: 30 MB
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Creating file system and pulling Docker image
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Uncompressing layer sha256:40d16f30db405106ef8074779bdf41f012465c2a785bbeaa2eab9f2081099b47 30 MB
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Uncompressing layer sha256:5a980729adb90fabc752bbb88119b2353157442bbd453d2d3d132e12c8479155 13 MB
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Uncompressing layer sha256:8c4b1b28875140ed3abacaf16ad0d696f6bef912f52d2148f261a23e3349465b 168 B
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Layers extracted
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Root filesystem structure: bin, boot, dev, etc, home, lib, lib32, lib64, libx32, media, mnt, opt, proc, root, run, sbin, srv, sys, tmp, usr, var
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Provisioning sandbox template
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Provisioning was successful, cleaning up
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Sandbox template provisioned
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] [base] DEFAULT USER user [49e586c2171254c6bc4a09e84eedac32dbcf113a158c24248129af2f49cbed74]
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] [builder 1/1] RUN echo 'Hello, World!' [c72b4f813c2a16b0fc1a1c5da7b1365a304cbac516b22dc304a71f70aae48ac0]
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] [builder 1/1] [stdout]: Hello, World!
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] [finalize] Finalizing template build [92c524e30533398ebb41ce04c2596130f0cdecc9aa328e28fdb16a1b11f61d62]
    build_template_test.go:133: test-to-delete-another-team-api-key: [error] Build failed: An internal error occurred. Please try again or contact support with the build ID.
    delete_template_test.go:51: Build failed: {<nil> An internal error occurred. Please try again or contact support with the build ID. <nil>}
--- FAIL: TestDeleteTemplateFromAnotherTeamAPIKey (163.96s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildRUN/Single_RUN_command
Stack Traces | 164s run time
=== RUN   TestTemplateBuildRUN/Single_RUN_command
=== PAUSE TestTemplateBuildRUN/Single_RUN_command
=== CONT  TestTemplateBuildRUN/Single_RUN_command
    build_template_test.go:133: test-ubuntu-run: [info] Building template focp1bl3tvas5oqr2xgx/3114432c-d60e-45f4-9501-731ff61a8f50
    build_template_test.go:133: test-ubuntu-run: [info] [base] FROM ubuntu:22.04 [f9f564014e009a9561a82bf8c84f9314242971e833fb019936654ecba452f184]
    build_template_test.go:133: test-ubuntu-run: [info] Base Docker image size: 30 MB
    build_template_test.go:133: test-ubuntu-run: [info] Creating file system and pulling Docker image
    build_template_test.go:133: test-ubuntu-run: [info] Uncompressing layer sha256:40d16f30db405106ef8074779bdf41f012465c2a785bbeaa2eab9f2081099b47 30 MB
    build_template_test.go:133: test-ubuntu-run: [info] Uncompressing layer sha256:5a980729adb90fabc752bbb88119b2353157442bbd453d2d3d132e12c8479155 13 MB
    build_template_test.go:133: test-ubuntu-run: [info] Uncompressing layer sha256:8c4b1b28875140ed3abacaf16ad0d696f6bef912f52d2148f261a23e3349465b 168 B
    build_template_test.go:133: test-ubuntu-run: [info] Layers extracted
    build_template_test.go:133: test-ubuntu-run: [info] Root filesystem structure: bin, boot, dev, etc, home, lib, lib32, lib64, libx32, media, mnt, opt, proc, root, run, sbin, srv, sys, tmp, usr, var
    build_template_test.go:133: test-ubuntu-run: [info] Provisioning sandbox template
    build_template_test.go:133: test-ubuntu-run: [info] Provisioning was successful, cleaning up
    build_template_test.go:133: test-ubuntu-run: [info] Sandbox template provisioned
    build_template_test.go:133: test-ubuntu-run: [info] [base] DEFAULT USER user [49e586c2171254c6bc4a09e84eedac32dbcf113a158c24248129af2f49cbed74]
    build_template_test.go:133: test-ubuntu-run: [info] [builder 1/1] RUN echo 'Hello, World!' [c72b4f813c2a16b0fc1a1c5da7b1365a304cbac516b22dc304a71f70aae48ac0]
    build_template_test.go:133: test-ubuntu-run: [info] [builder 1/1] [stdout]: Hello, World!
    build_template_test.go:133: test-ubuntu-run: [info] [finalize] Finalizing template build [92c524e30533398ebb41ce04c2596130f0cdecc9aa328e28fdb16a1b11f61d62]
    build_template_test.go:133: test-ubuntu-run: [error] Build failed: An internal error occurred. Please try again or contact support with the build ID.
    build_template_test.go:166: Build failed: {<nil> An internal error occurred. Please try again or contact support with the build ID. <nil>}
--- FAIL: TestTemplateBuildRUN/Single_RUN_command (164.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestCommandKillNextApp
Stack Traces | 263s run time
=== RUN   TestCommandKillNextApp
=== PAUSE TestCommandKillNextApp
=== CONT  TestCommandKillNextApp
    process_test.go:30: Build failed: {<nil> An internal error occurred. Please try again or contact support with the build ID. <nil>}
--- FAIL: TestCommandKillNextApp (263.29s)
Executing command cat in sandbox ivfucjkwit241apq28jr8 (user: root)


@cursor cursor Bot left a comment

Stale comment

Comment thread packages/api/internal/orchestrator/discovery/nomad.go

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c82c8df518

Comment thread packages/orchestrator/pkg/factories/run.go Outdated
wj-e2b force-pushed the wj-orchestrator-rollout branch from c82c8df to 46e04d9 on July 2, 2026 00:25