Disaster Recovery Metrics: Understanding RPO & RTO


Time and time again, statistics and real-world cases demonstrate that a sound disaster recovery (DR) plan and the downtime it prevents can make or break a business. For large enterprises, even minutes of downtime can cost hundreds of thousands of dollars. As the Salesforce and Google outages last year demonstrated, downtime for some services amounts to lost productivity across the globe.

Of course, it isn’t just large businesses that are affected. A recent Infrascale study found that 37% of SMBs have lost customers due to downtime. Further, despite the risks, 19% felt they weren’t adequately prepared to address and prevent unexpected downtime. To mitigate the cost of a disaster or service outage, both SMBs and large enterprises need a DR plan. It’s important that your DR plan is both robust and realistic. That means clearly defining your threat model and business objectives, and testing your plans.

Two important metrics to consider as you devise a realistic DR plan are recovery point objective (RPO) and recovery time objective (RTO). Here, we’ll take a look at RPO and RTO in detail and help you understand how to apply them to your business.

What are RTO and RPO?

RTO and RPO are two sides of the same coin. They are defined as follows:

  • RPO (recovery point objective) is the maximum age your most recent backup can have and still be acceptable for recovery. In other words, it is the maximum amount of data, measured in time, that you can afford to lose. For example, if your business RPO is 30 minutes, then you must take a backup at least every 30 minutes.
  • RTO (recovery time objective) is the maximum acceptable amount of time from when a disaster occurs until operations are restored.

As you can see, RPO deals with how old the data you recover can be, while RTO deals with how quickly you can restore the data.
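
To make the RPO definition concrete, here is a minimal Python sketch of a backup freshness check. The function name `backup_is_fresh` and the timestamps are hypothetical and only serve as illustration; the 30-minute RPO comes from the example above.

```python
from datetime import datetime, timedelta


def backup_is_fresh(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """Return True if the newest backup is no older than the RPO."""
    return now - last_backup <= rpo


rpo = timedelta(minutes=30)          # RPO from the example above
now = datetime(2024, 1, 1, 12, 0)    # arbitrary placeholder timestamp

# A backup taken 20 minutes ago is within the RPO.
print(backup_is_fresh(datetime(2024, 1, 1, 11, 40), now, rpo))  # True
# A backup taken 45 minutes ago is too old.
print(backup_is_fresh(datetime(2024, 1, 1, 11, 15), now, rpo))  # False
```

A check like this could run on a schedule and raise an alert whenever it returns False, since that means a disaster at that moment would lose more data than the RPO allows.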

To understand how RPO and RTO work together, let’s walk through an example scenario: suppose you run an app that collects widget data. Given the cost of backups and your budget, you’re okay with losing up to 1 day’s data, but no more. Therefore, you peg your RPO at 24 hours and take backups every night.

You’ve also decided that the maximum amount of downtime you can tolerate when an incident occurs is 2 hours. With that in mind, 2 hours becomes your RTO.

Then this happens:

  1. On Day 1 at 11:00 p.m., you take your first backup
  2. On Day 2 at 11:00 p.m., you take your second backup
  3. On Day 3 at 11:00 a.m., your production data gets corrupted due to a hardware failure
  4. On Day 3 at 12:45 p.m., your IT team restores service using the backup from Day 2

So, you’ve lost 12 hours of data and experienced 1 hour and 45 minutes of downtime. Given your recovery point objective of 24 hours and your recovery time objective of 2 hours, this can be considered a “success”.
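
The arithmetic in this scenario can be sketched in a few lines of Python. The `evaluate_recovery` function and `RecoveryResult` class are hypothetical names, and the calendar dates are arbitrary stand-ins for Day 2 and Day 3.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryResult:
    data_loss: timedelta   # time between the last good backup and the incident
    downtime: timedelta    # time between the incident and restored service
    met_rpo: bool
    met_rto: bool


def evaluate_recovery(last_backup, incident, restored, rpo, rto):
    """Compare actual data loss and downtime against the RPO and RTO."""
    data_loss = incident - last_backup
    downtime = restored - incident
    return RecoveryResult(data_loss, downtime, data_loss <= rpo, downtime <= rto)


result = evaluate_recovery(
    last_backup=datetime(2024, 1, 2, 23, 0),   # Day 2, 11:00 p.m.
    incident=datetime(2024, 1, 3, 11, 0),      # Day 3, 11:00 a.m.
    restored=datetime(2024, 1, 3, 12, 45),     # Day 3, 12:45 p.m.
    rpo=timedelta(hours=24),
    rto=timedelta(hours=2),
)
print(result.data_loss)                   # 12:00:00
print(result.downtime)                    # 1:45:00
print(result.met_rpo and result.met_rto)  # True
```

Had the hardware failure struck at 10:00 p.m. on Day 3 instead, the data loss would have been 23 hours, which is still within the 24-hour RPO but only barely.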

What are good RPO and RTO numbers?

Obviously, the lower your RPO and RTO numbers, the better. However, like anything else in business, it all comes down to risk vs. reward. For enterprise financial data, RPO and RTO numbers will likely be near-zero. For small businesses or non-mission-critical apps and services, RPOs and RTOs of 4 to 24 hours or even longer may be reasonable.

To define a realistic number for your own RPO, ask yourself these questions:

  • What applications and data are you responsible for?
  • How often does your data change?
  • How much storage space for backups do you need?
  • How much data are you willing to lose given the cost of taking and storing backups?

To define a realistic number for your own RTO, ask yourself these questions:

  • What data loss scenarios (e.g. site outage, corrupt data, ransomware, etc.) do you need to account for?
  • From the time an incident begins, how much downtime can you afford?
  • What will it take to restore services and data?

From there, you have the parameters to begin exploring what solutions make sense given how much data you generate and how frequently it changes. It’s important to note that it can make sense to take a varied approach to RTO and RPO. For example, archived data that isn’t used day-to-day may have a significantly higher RTO than an e-commerce website.
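
One way to record a varied approach is a simple table of objectives per workload. The sketch below is only an illustration: the workload names and every number in it are hypothetical placeholders, not recommendations.

```python
from datetime import timedelta

# Hypothetical workloads and objectives; replace with your own answers
# to the questions above.
OBJECTIVES = {
    "ecommerce-site": {"rpo": timedelta(minutes=15), "rto": timedelta(hours=1)},
    "internal-wiki": {"rpo": timedelta(hours=24), "rto": timedelta(hours=8)},
    "archive": {"rpo": timedelta(days=7), "rto": timedelta(days=3)},
}


def max_backup_interval(workload: str) -> timedelta:
    """Backups must run at least as often as the workload's RPO."""
    return OBJECTIVES[workload]["rpo"]


for name, objective in OBJECTIVES.items():
    print(f"{name}: back up at least every {objective['rpo']}, "
          f"restore within {objective['rto']}")
```

Writing the objectives down per workload makes it easier to spot where an expensive low-RTO solution is justified and where a cheaper, slower one will do.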

How to improve RPO and RTO

While it may not be realistic for everyone to have near-zero RTOs and RPOs, there is plenty you can do to optimize yours. Here are a few pro tips to help you with your own disaster recovery planning:

  • Test your backups! While this advice may be common knowledge, it’s still often overlooked. In fact, a Spiceworks survey found that about 1 in 4 companies don’t test their DR plans. If you’ve never tested your own DR plan, you can’t be sure it works. You don’t want to figure it out when you’re under the pressure of an outage.
  • Understand how your backup strategy impacts recovery time. Suppose you have 5TB of data and an RTO of 4 hours. If your only backups are on tapes stored offsite and the offsite location is 2 hours away, you’ve likely set yourself up for failure. Why? Because the tapes will take time to re-catalog onsite, and unexpected traffic could stretch the trip beyond 2 hours on any given day. While this is an oversimplified example, the takeaway is clear: where and how you back up your data impacts how quickly you can recover it.
  • Keep multiple backups. Your backups could fail or become inaccessible, and it’s your job to account for this. Strategies like 3-2-1 backups add a layer of resilience to your DR planning. Even if one backup fails, you should have another to fall back on.
  • Monitor. The sooner you know that a system is going offline or data is being compromised, the sooner you can recover. Proactively monitor for network availability and malware so you can get a jump on hitting those RTO numbers.
  • Automate. Automatic failover is an excellent solution for zero or near-zero RTO and RPO use cases. Additionally, automation of backups should be a requirement for modern businesses. If you’re regularly taking manual backups, chances are something is wrong.
  • Use the cloud. Cloud backups can go a long way in simplifying recovery, automatic syncing, and offsite storage. For example, Azure Site Recovery can replicate on-premises virtual and physical servers to the Azure cloud so that you can fail over to them in the event of a disaster.
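
To see why the 5TB tape example above is risky, it helps to run a rough estimate. The sketch below assumes a sustained restore rate of 300 MB/s, which is a made-up figure for illustration; measure your own hardware during a restore test. The function name `estimated_restore_hours` is also hypothetical.

```python
def estimated_restore_hours(data_tb: float, throughput_mb_per_s: float,
                            fixed_overhead_hours: float = 0.0) -> float:
    """Rough estimate: fixed overhead (e.g. transport) plus transfer time."""
    data_mb = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    transfer_hours = data_mb / throughput_mb_per_s / 3600
    return fixed_overhead_hours + transfer_hours


rto_hours = 4
# 5 TB of data, 2 hours to fetch the tapes, assumed 300 MB/s restore rate.
estimate = estimated_restore_hours(5, 300, fixed_overhead_hours=2)

print(round(estimate, 2))     # 6.63
print(estimate <= rto_hours)  # False
```

Even before accounting for re-cataloging or traffic, the estimate lands well past the 4-hour RTO under these assumptions. That is a signal to change the backup strategy, the RTO, or both.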

Final thoughts: RTO and RPO are key DR metrics

As we have seen, RTO and RPO are important parts of disaster recovery planning. By asking the right questions and setting realistic recovery point objectives and recovery time objectives, you can improve your overall DR strategy. Defining RPO helps you take a strategic approach to the frequency of your backups, while RTO can provide a clear target for maximum downtime. With both in mind, you can develop a plan where your backup strategy complements your recovery strategy to meet your business continuity goals.

We hope you enjoyed the article – now go test your backups.