[draft] High Availability Deployment Models page#4703

Open: lukeknep wants to merge 11 commits into `main` from `ha-worker-deployments`

@lukeknep lukeknep commented Jun 11, 2026

What does this PR do?

When using multi-region High Availability, Temporal Cloud customers often ask us how to decide where to deploy their Workers and other systems.

This page gives recommendations on common patterns for an overall High Availability strategy that a Temporal Cloud user can adopt in their architecture.

Notes to reviewers

Internal context: https://temporaltechnologies.slack.com/archives/C04V0LSU5S6/p1781117451071889?thread_ts=1781008921.964629&cid=C04V0LSU5S6

Attachments: EDU-6522 [draft] High Availability Deployment Models page

@lukeknep lukeknep requested a review from a team as a code owner June 11, 2026 17:46

vercel Bot commented Jun 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| temporal-documentation | Ready | Preview, Comment | Jun 16, 2026 11:47pm |

github-actions Bot commented Jun 11, 2026

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

vercel Bot commented Jun 15, 2026

Deployment failed with the following error:

The `vercel.json` schema validation failed with the following message: should NOT have additional property `public`

Learn More: https://vercel.com/docs/concepts/projects/project-configuration

- **Active / Passive**: Workflows process in one region at a time, the "active" region. The other region is "passive" and ready for failover. This pattern has two variants:
  - **[Active / Passive (Cold)](#active-cold)** (a.k.a. Active / Cold): Workers run in only one region at a time. After a failover, Workers start in the secondary region. The region where Workers run is always the region where Workflows process. To fail over, Workers need a "cold start" in the other region.
  - **[Active / Passive (Hot)](#active-hot)** (a.k.a. Active / Hot): Workers run in **both regions** simultaneously, but Workflows still process in only one region at any given time. The other region's Workers are on "hot" standby.
- **[Active / Active](#active-active)**: Workflows process in both regions at the same time. Necessarily, Workers run in both regions at all times.
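
For readers who prefer code to prose, the three patterns can be summarized as a small model. This is only an illustrative sketch: the `Pattern` enum, the two helper functions, and the region names in the usage note are hypothetical and are not part of any Temporal SDK.

```python
from enum import Enum


class Pattern(Enum):
    """The three deployment patterns described in the list above."""
    ACTIVE_COLD = "active/passive (cold)"
    ACTIVE_HOT = "active/passive (hot)"
    ACTIVE_ACTIVE = "active/active"


def worker_regions(pattern, active, passive):
    """Regions where Workers run during normal operation."""
    if pattern is Pattern.ACTIVE_COLD:
        # Cold: Workers exist only in the active region.
        return {active}
    # Hot and Active / Active: Workers run in both regions.
    return {active, passive}


def processing_regions(pattern, active, passive):
    """Regions where Workflows process during normal operation."""
    if pattern is Pattern.ACTIVE_ACTIVE:
        return {active, passive}
    # Both Active / Passive variants process in one region at a time.
    return {active}
```

For example, `worker_regions(Pattern.ACTIVE_HOT, "region-a", "region-b")` covers both regions, while `processing_regions` for the same pattern covers only `"region-a"`.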

nit: "necessarily" is an odd word to use here. I'd just remove it.

Active / Cold Pattern: **On failover**

- **The Namespace fails over automatically.** Temporal Cloud promotes the secondary region's replica to active. No action is needed to fail over the Namespace itself.
- **You bring the Workers up in the secondary region.** Because no Workers were running there, they start from nothing — a "cold" start. Starting and scaling that fleet is your responsibility, ideally through tested automation. Until the Workers are running, no Workflows make progress.
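
One possible shape for the cold-start automation is a small watcher that polls for the active region and starts Workers when it changes. The sketch below makes several assumptions: `get_active_region` and `start_workers` are hypothetical callables that you would supply (this page does not specify how to query the active region or how your fleet is launched), and the polling interval is arbitrary. Scaling the old region down and failing back are deliberately left out.

```python
import time
from typing import Callable, List, Optional


def watch_for_failover(
    get_active_region: Callable[[], str],   # hypothetical: returns the Namespace's active region
    start_workers: Callable[[str], None],   # hypothetical: launches the Worker fleet in a region
    initial_region: str,
    poll_seconds: float = 30.0,             # arbitrary interval, tune for your recovery target
    max_polls: Optional[int] = None,        # None means poll forever
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll the active region and cold-start Workers wherever it moves.

    Returns the regions where Workers were started, in order.
    """
    current = initial_region
    started: List[str] = []
    polls = 0
    while max_polls is None or polls < max_polls:
        active = get_active_region()
        if active != current:
            # The Namespace failed over: bring up Workers in the new active region.
            start_workers(active)
            started.append(active)
            current = active
        polls += 1
        sleep(poll_seconds)
    return started
```

Injecting `sleep` and `max_polls` keeps the loop testable, which is one way to exercise the automation before an actual failover.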

I feel like the question everyone reading this is going to ask is: how do we detect a failover?

I know we have plans to answer this in H2, but is there something we want to tell them now? For example, should they run some sort of system that constantly queries which region is active to detect a failover? Or do we just want to wait for the question and address it then?

It could just be one of those things where we fix the problem before we expect to be asked about it.

Another thing I thought about is how they would know when to scale down those Workers and do their own failback.

Active / Cold Pattern: **Tradeoffs**

- Highest overall recovery time of the three patterns, due to cold starting the Worker fleet after failover.
- Depends on tested automation to bring up the secondary-region fleet quickly.

"Tested automation": I see this three times, and as a user I'd have no idea what it means.


- **Use the Namespace Endpoint.**
  - Connect Workers through the [Namespace Endpoint](/cloud/namespaces#access-namespaces), which always connects to the Namespace in its active region and automatically fails over to the new region.
  - **Rationale:** If a Temporal Cloud incident requires the Namespace to fail over while the rest of the primary region is healthy, the Workers in the primary region can still connect through the Namespace Endpoint and process Workflows. If the Workers use the Regional Endpoint for the primary region, they will not reliably connect to the Namespace during a Temporal Cloud incident in the primary region.
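
To make the difference between the two endpoint types concrete, here is a minimal sketch that builds each target address. The hostname formats, the default port, and the helper names are assumptions for illustration; confirm the exact endpoint formats against the Temporal Cloud Namespace access documentation before relying on them.

```python
def namespace_endpoint(namespace: str, account: str, port: int = 7233) -> str:
    """Assumed format of the Namespace Endpoint, which follows the active region."""
    return f"{namespace}.{account}.tmprl.cloud:{port}"


def regional_endpoint(region: str, cloud: str, port: int = 7233) -> str:
    """Assumed format of a Regional Endpoint, which is pinned to one region."""
    return f"{region}.{cloud}.api.temporal.io:{port}"


def worker_target(namespace: str, account: str) -> str:
    """Address Workers connect to under the recommendation above."""
    # The Namespace Endpoint is preferred so Workers keep connecting after a failover.
    return namespace_endpoint(namespace, account)
```

The value returned by `worker_target` is what you would pass as the target address when constructing the SDK client that your Workers use.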

> If the Workers use the Regional Endpoint for the primary region, they will not reliably connect to the Namespace during a Temporal Cloud incident in the primary region.

Won't they be forwarded?

Ah, I see the part further down about turning off forwarding. It seems like this would be a really good feature to have in the Worker, with the flag passed up from there. If you know you are connecting to a Regional Endpoint and you don't want forwarding, seeing it all in one spot in the code is much clearer than having to set the Regional Endpoint in the Worker and make a CLI call externally.

Just a thought.

- **Codec Servers and proxies** — run in both regions continuously.
- **Databases and queues** — accessed from both regions; cross-region consistency must be designed for.

### Dual Active (Multi-Active) {/* #dual-active */}

I'm a little confused about this one. Isn't this just the Active / Passive pattern applied to two Namespaces? I'm not sure why it is here when we already have Active / Passive.

Is this pattern really just saying "you can have different Namespaces in different regions"?
