Rewriting the 'RPO and RTO' page to clear up common confusion #4091
| Temporal Cloud is designed to limit data loss after recovery when the incident triggering the failover is resolved. |
| - "Temporal-initiated failovers:" Also known as "automatic failovers," these failovers are initiated by Temporal's tooling and/or on-call engineers on Namespaces that have High Availability enabled. **Temporal highly recommends keeping Temporal-initiated failovers enabled,** which is the default for all Namespaces with High Availability features. Users can still trigger manual failovers on their Namespaces even if Temporal-initiated failovers are enabled. When Temporal-initiated failovers are disabled on a Namespace, Temporal's RTO for that Namespace is unbounded (it depends on how long the underlying outage lasts). |
| Temporal Cloud strives to maintain a P95 [replication lag](/cloud/high-availability/monitoring#replication-lag-metric) of less than 1 minute. |
I removed this bit because
- I'm not sure p95 is good enough. This could be read as "up to 5% of Namespaces could be above the 1-minute RPO at any given moment."
- We already say we have a 1-minute RPO. I don't think we need additional standards / goals to be publicly stated. They would only add confusion. Let's state our main goal (RPO) and stand by it.
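To illustrate the concern with a toy sample (the numbers below are hypothetical, not measured Temporal data): a P95 target says nothing about the worst 5% of observations, so a fleet can satisfy a sub-minute P95 while a few Namespaces lag far behind.

```python
# Hypothetical replication-lag samples (seconds) for 100 Namespaces:
# 95 healthy ones at 5s, 5 stragglers at 300s (5 minutes).
lags = [5] * 95 + [300] * 5

ranked = sorted(lags)
p95 = ranked[int(len(ranked) * 0.95) - 1]  # 95th of 100 samples

print(p95)        # 5   -> "P95 < 1 minute" is satisfied...
print(max(lags))  # 300 -> ...yet 5% of Namespaces lag by 5 minutes
```
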
| Recovery Point Objective (RPO) and Recovery Time Objective (RTO) define data durability and service restoration timelines, respectively. |
| In Temporal Cloud, these objectives vary depending on your deployment configuration and the scope of any failure. |
| Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are the objectives that Temporal strives to meet for data durability (recovery point) and service restoration (recovery time) in the event of cloud outages. |
RPO and RTO are how we measure, and low values for RPO and RTO are what we strive for. Could tighten this phrasing.
That's not accurate (or at least, if I understand your comment correctly, that's not how the terms are used in the industry):
- Recovery Point/Time "Objective": this is the goal we have for all outages. That's why the term has "Objective" in its name.
- recovery time / recovery point: these are the actual observed values in a given outage. I could say "observed recovery time" or "achieved recovery time," but that gets bloated.
I wanted to make the distinction between the two terms really clear in the doc. If it's not clear, then I need to reword it.
To me, the current wording boils down to "we strive for RPO", which doesn't really have any informational content without the actual numerical objective. "We strive for zero RPO" or "we strive for sub 20 minute RTO" is informative.
Trying this wording with a similar concept: "Uptime is the objective that Temporal strives to meet for availability (service accessibility)". I think it's clearer/more informative to say something like "Temporal Cloud measures availability in terms of service uptime, and has a 99.99% availability SLO and 99.% availability SLA"
All that said: happy to merge as-is.
| Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are the objectives that Temporal strives to meet for data durability (recovery point) and service restoration (recovery time) in the event of cloud outages. |
| These objectives are high priority goals for Temporal Cloud, but are not a contractual commitment. |
This sounds rough. Can we say RPO + RTO aren't part of the availability SLA instead (and link to the SLA page)?
| These objectives are high priority goals for Temporal Cloud, but are not a contractual commitment. |
| Temporal Cloud's RPO and RTO are complementary to but separate from the [availability SLA](/cloud/sla). |
| In case of an outage in the active region or cell, Temporal Cloud will failover to the replica so that existing Workflow Executions will continue to run and new Executions can be started. |
| ## High Availability, Regional Failure |
| The Recovery Point Objective and Recovery Time Objective for Temporal Cloud depend on the type of outage and which [High Availability](/cloud/high-availability) feature your Namespace has enabled: |
This breakdown is great!
| Temporal has put extensive work into tools and processes that minimize the recovery point and achieve its RPO for Temporal-initiated failovers, including: |
| **Recovery Time Objective (RTO) - 20 minutes** |
| - Best-in-class data replication technology that keeps the replica up to date with the active. |
Could link to https://www.youtube.com/watch?v=mULBvv83dYM where Liang gets into more specifics
| **All writes to storage are synchronously replicated across AZs**, including our writes to ElasticSearch. |
| ElasticSearch is eventually consistent, but this does not impact our RPO as there is no data loss. |
| - You can detect outages that Temporal doesn't. In the cloud, regional outages never affect every service the same way. It's possible that Temporal--and the services it depends on--are unaffected by the outage, even while your Workers or other cloud infrastructure are disrupted. If you monitor each service in your critical path and alert on unusual |
Can we suggest or link to specific guidance on how to do this?
Co-authored-by: Ben Echols <[email protected]>
sergeybykov left a comment
A general good practice is to start each sentence on a new line, so that a later reviewer can add a comment to a specific sentence.
| Temporal Cloud delivers different RPO/RTOs based on these scenarios because of the way our platform performs writes to our data provider. |
| As Workflows progress in the active region, history events are asynchronously replicated to the replica. |
| Because replication is asynchronous, High Availability does not impact the latency or throughput of Workflow Executions in the active region. |
| If an outage hits the active region or cell, Temporal Cloud will failover to the replica so that existing Workflow Executions will continue to run and new Workflow Executions can be started. |
"failover" -> "fail over" (verb vs. noun)
| This means there is _no_ logical corruption and restoration is done from a live replicated instance. |
| This applies for both single region Namespaces and multi region Namespaces. |
| - Regular drills where we failover our internal Namespaces to test our tooling. |
"failover" -> "fail over" (verb vs. noun)
| **Recovery Time Objective (RTO) - 0** |
| - You can sequence your failovers in a particular order. Your cloud infrastructure probably contains more pieces than just your Temporal Namespace: Temporal Workers, compute pools, data stores, and other cloud services. If you manually failover, you can choose the order in which these pieces switch to the replica region. You can then test that ordering with failover drills and ensure it executes smoothly without data consistency issues or bottlenecks. |
"If you manually failover" -> "If you manually fail over" (verb vs. noun)
| Temporal is active-active across AZs. |
| The RTO is stated to be zero, meaning there should be no downtime in such scenarios. |
| - You can proactively failover more aggressively than Temporal. While the 20-minute RTO should be sufficient for most use cases, some may strive to hit an even lower RTO. For workloads like high frequency trading, auctions, or popular sporting events, an outage at the wrong time could cause tremendous lost revenue per minute. You can adopt a posture that fails over more eagerly than Temporal does. For example, you could trigger a manual failover at the first sign of a possible disruption, before knowing whether there's a true regional outage. |
"proactively failover" -> "proactively fail over" (verb vs. noun)
| - Even if you have robust tooling to detect an outage and trigger a failover, leaving Temporal-initiated failovers enabled provides a "safety net" in case your automation misses an outage. It also gives Temporal leeway to preemptively failover your Namespace if we detect that it may be disrupted soon, e.g., by a rolling failure that has impacted other Namespaces but not yours, yet. |
"preemptively failover" -> "preemptively fail over" (verb vs. noun)
| - Namespace 1_A was in the region and its cell experienced a partial degradation that caused 10% of requests to fail in the first five minutes, 25% in the second five minutes, and 50% in the third five minutes. Since it was significantly impacted from 10:00:00 to 10:15:00, its Recovery Time was 15 minutes. If it had no other service errors that month, then its service error rate for the month would be: ( (1 - 10%) + (1 - 25%) + (1 - 50%) + 8925 * 100% ) / 8928 = 99.990%. (Note: there are 8928 5-minute periods in a 31-day month.) |
| - Namespace 1_B was in the same cell as Namespace 1_A, so it also experienced a partial degradation that caused 10% of requests to fail. However, its owner detected the outage via their own tooling and decided to manually failover at 10:05:00. This Namespace achieved a recovery time of 5 minutes and a service error rate of ( 1 * (1 - 10%) + 8927 * 100% ) / 8928 = 99.999%. |
"failover" -> "fail over" (verb vs. noun)
| - Namespace 2_A was in the region and its cell was fully network partitioned at the start of the outage, causing 100% of requests to fail. Since it was significantly impacted from 10:00:00 to 10:15:00, its Recovery Time was 15 minutes. If it had no other service errors that month, then its service error rate for the month would be: ( 3 * (1 - 100%) + 8925 * 100% ) / 8928 = 99.97%. |
| - Namespace 2_B was in the region and was fully network partitioned, causing 100% of requests to fail. However, its owner detected the outage via their own tooling and decided to manually failover at 10:05:00. This Namespace achieved a recovery time of 5 minutes and a service error rate of ( 1 * (1 - 100%) + 8927 * 100% ) / 8928 = 99.99%. |
"failover" -> "fail over" (verb vs. noun)
| All of the above Namespaces were in the affected region, but they achieved varying recovery times and service error rates. |
Can we say that in all these cases RPO was zero?
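The per-period arithmetic in the examples above can be sketched as follows (the function name and shape are illustrative, not from the doc; it assumes a 31-day month, i.e. 8,928 five-minute periods, with all non-outage periods fully healthy):

```python
def monthly_service_rate(failure_fractions, periods=8928):
    """Monthly service success rate: each entry in failure_fractions is the
    fraction of requests that failed during one 5-minute period; all other
    periods are assumed 100% healthy. 8928 = 5-minute periods in 31 days."""
    healthy = periods - len(failure_fractions)
    return (sum(1 - f for f in failure_fractions) + healthy) / periods

# Namespace 1_A: 10%, 25%, then 50% of requests failing before failover
print(f"{monthly_service_rate([0.10, 0.25, 0.50]):.3%}")  # 99.990%
# Namespace 2_A: fully partitioned for three periods
print(f"{monthly_service_rate([1.0, 1.0, 1.0]):.2%}")     # 99.97%
# Namespace 2_B: fully partitioned for one period, then failed over
print(f"{monthly_service_rate([1.0]):.2%}")               # 99.99%
```
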
What does this PR do?
Corrects errors and clears up common confusion points around our RPO, RTO, and SLA.
Internal Note on the previously-stated 8-hour RTO / RPO for non-HA Namespaces: