[draft] High Availability Deployment Models page#4703
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| - **Active / Passive** — Workflows process in one region at a time, the "active" region. The other region is "passive" and ready for failover. This pattern has two variants: | ||
| - **[Active / Passive (Cold)](#active-cold)** — a.k.a. Active / Cold — Workers run in only one region at a time. After a failover, Workers start in the secondary region. The region where Workers run == the region where Workflows process. To fail over, Workers need a "cold start" in the other region. | ||
| - **[Active / Passive (Hot)](#active-hot)** — a.k.a. Active / Hot — Workers run in **both regions** simultaneously, but Workflows still process in only one region at any given time. The other region's Workers are on "hot" standby. | ||
| - **[Active / Active](#active-active)** — Workflows process in both regions at the same time. Necessarily, Workers run in both regions at all times. |
nit: "necessarily" is an odd word to use here. I'd just remove it.
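To make the three patterns in the diff above easier to compare, here is a minimal sketch of where Workers run and where Workflows process under each one. The `Pattern` enum and the `placement` function are hypothetical names used only for illustration; they are not part of any Temporal SDK or API.

```python
from enum import Enum
from typing import Iterable, Set, Tuple


class Pattern(Enum):
    # Names are illustrative only; they mirror the pattern names in the draft page.
    ACTIVE_COLD = "Active / Passive (Cold)"
    ACTIVE_HOT = "Active / Passive (Hot)"
    ACTIVE_ACTIVE = "Active / Active"


def placement(pattern: Pattern, active: str, regions: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (regions where Workers run, regions where Workflows process)."""
    all_regions = set(regions)
    if active not in all_regions:
        raise ValueError(f"active region {active!r} is not one of {sorted(all_regions)}")
    if pattern is Pattern.ACTIVE_COLD:
        # Workers run only where Workflows process; the other region is cold.
        return {active}, {active}
    if pattern is Pattern.ACTIVE_HOT:
        # Workers run everywhere, but Workflows process in one region at a time.
        return all_regions, {active}
    # Active / Active: Workers run and Workflows process in every region.
    return all_regions, all_regions
```

For example, `placement(Pattern.ACTIVE_HOT, "region-a", ["region-a", "region-b"])` reports Workers in both regions and Workflow processing only in `region-a`.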
| Active / Cold Pattern: **On failover** | ||
| - **The Namespace fails over automatically.** Temporal Cloud promotes the secondary region's replica to active. No action is needed to fail over the Namespace itself. | ||
| - **You bring the Workers up in the secondary region.** Because no Workers were running there, they start from nothing — a "cold" start. Starting and scaling that fleet is your responsibility, ideally through tested automation. Until the Workers are running, no Workflows make progress. |
I feel like the question everyone reading this is going to ask is: how do we detect a failover?
I know we plan to answer this in H2, but is there something we want to tell them now? For example, should they run some sort of system that constantly queries which region is active to detect a failover? Or do we just want to wait for the question and address it then?
It could just be one of those things where we fix the problem before we expect to be asked about it.
Another thing I thought about is how they would know when to scale those Workers down and do their own failback.
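As a rough illustration of the polling idea raised in this thread, the sketch below watches for a change in the active region and calls back on each transition, covering failback as well as failover. Everything here is an assumption: `FailoverWatcher`, `get_active_region`, and `on_failover` are hypothetical names, and how the active region is actually looked up is left to the caller because the PR does not specify it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FailoverWatcher:
    # Hypothetical hook: the caller supplies however they look up the active region.
    get_active_region: Callable[[], str]
    # Hypothetical hook: called with (previous_region, new_region) on every change.
    on_failover: Callable[[str, str], None]
    last_seen: Optional[str] = None

    def poll_once(self) -> bool:
        """Check the active region once; return True if it changed since the last poll."""
        current = self.get_active_region()
        previous, self.last_seen = self.last_seen, current
        if previous is not None and current != previous:
            self.on_failover(previous, current)
            return True
        return False
```

In this sketch the caller would invoke `poll_once()` on a timer. The same callback could start Workers in the newly active region and later scale them down after a failback, since both transitions surface as a region change.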
| Active / Cold Pattern: **Tradeoffs** | ||
| - Highest overall recovery time of the three patterns, due to cold starting the Worker fleet after failover. | ||
| - Depends on tested automation to bring up the secondary-region fleet quickly. |
"Tested automation": I see this three times, and as a user I'd have no idea what it means.
| - **Use the Namespace Endpoint.** | ||
| - Connect Workers through the [Namespace Endpoint](/cloud/namespaces#access-namespaces), which always connects to the Namespace in its active region and automatically fails over to the new region. | ||
| - **Rationale:** If a Temporal Cloud incident requires the Namespace to fail over while the rest of the primary region is healthy, the Workers in the primary region can still connect through the Namespace Endpoint and process Workflows. If the Workers use the Regional Endpoint for the primary region, they will not reliably connect to the Namespace during a Temporal Cloud incident in the primary region. |
> If the Workers use the Regional Endpoint for the primary region, they will not reliably connect to the Namespace during a Temporal Cloud incident in the primary region.

Won't they be forwarded?
Ah, I see the part lower down about turning off forwarding. This seems like it would be a really good feature to have in the Worker, passed in as a flag. If you know you are connecting to a Regional Endpoint and you don't want forwarding, seeing it all in one spot in the code is much clearer than setting the Regional Endpoint in the Worker and making a CLI call externally.
Just a thought.
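To make the endpoint choice under discussion concrete, here is a small sketch of how a Worker's connection target might be selected in one place. It is only an illustration: `worker_target` and `pin_to_region` are hypothetical, no such Worker-side flag is confirmed by this PR, and the hostnames in the usage note are placeholders rather than real Temporal Cloud endpoints. The sketch does not model forwarding.

```python
from typing import Optional


def worker_target(
    namespace_endpoint: str,
    regional_endpoint: Optional[str] = None,
    pin_to_region: bool = False,  # hypothetical flag, not an existing SDK option
) -> str:
    """Pick the address a Worker should connect to.

    Defaults to the Namespace Endpoint, which per the draft page follows the
    Namespace to its active region. A Regional Endpoint is used only on
    explicit request.
    """
    if pin_to_region:
        if not regional_endpoint:
            raise ValueError("pin_to_region=True requires a regional_endpoint")
        return regional_endpoint
    return namespace_endpoint
```

For example, `worker_target("my-namespace.example.invalid:7233")` returns the Namespace Endpoint unchanged, which matches the draft's recommendation as the default.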
| - **Codec Servers and proxies** — run in both regions continuously. | ||
| - **Databases and queues** — accessed from both regions; cross-region consistency must be designed for. | ||
| ### Dual Active (Multi-Active) {/* #dual-active */} |
I'm a little confused about this one. Isn't this just the Active / Passive pattern applied to two Namespaces? I don't understand why it's here when we already have Active / Passive.
Is this pattern really just saying "you can have different Namespaces in different regions"?
What does this PR do?
When using multi-region High Availability, Temporal Cloud customers often ask us how to decide where to deploy their Workers and other systems.
This page gives recommendations on common patterns for an overall High Availability strategy that a Temporal Cloud user can adopt in their architecture.
Notes to reviewers
Internal context: https://temporaltechnologies.slack.com/archives/C04V0LSU5S6/p1781117451071889?thread_ts=1781008921.964629&cid=C04V0LSU5S6
┆Attachments: EDU-6522 [draft] High Availability Deployment Models page