When we think about disasters, we think about buildings collapsing, roads flooding, and cities losing power for days at a time. Events like earthquakes and storms can be devastating for physical infrastructure, but it doesn’t take a force of nature to bring down your website or application. In IT, a disaster is any event that has a negative impact on your infrastructure’s availability and, by extension, your finances. Just like physical disasters, IT disasters need to be planned for. This is where disaster planning comes in.
Disaster planning is about preparing for one of these events. You can’t predict when a piece of hardware will fail or a power outage will affect your datacenter. However, just like you may plan your escape route to a storm shelter, you can plan what to do when something goes wrong with your system architecture.
While we can’t cover everything you need to know, we’ve put together this overview of disaster planning on AWS to help you create a plan that works for your organization.
Why disaster planning is critical to business
Imagine you’re responsible for a website that unexpectedly goes down for 24 hours due to a physical server failure. If it’s your personal blog, this is certainly annoying, but let’s face it—it’s probably not the end of the world. However, for a Fortune 500 company, the same event could cost millions of dollars.
So how do you create a disaster plan? First, let’s go over a couple of key terms.
Recovery time objective (RTO)
Your organization’s RTO is the maximum acceptable amount of time between a disaster and the return to your normal, expected level of service. Let’s say you’re in charge of an application that needs to support 50,000 users. At 12:00 PM, an earthquake hits your datacenter. If your RTO is 8 hours, your application needs to be back at full capacity (i.e., able to support 50,000 users) by 8:00 PM that same day.
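To make the arithmetic concrete, here is a minimal Python sketch of the example above. The calendar date and the function names are made up for illustration; only the 12:00 PM incident and the 8-hour RTO come from the example.

```python
from datetime import datetime, timedelta


def recovery_deadline(incident_time: datetime, rto: timedelta) -> datetime:
    """Latest moment by which full service must be restored."""
    return incident_time + rto


def meets_rto(incident_time: datetime, restored_time: datetime, rto: timedelta) -> bool:
    """True if service came back at or before the RTO deadline."""
    return restored_time <= recovery_deadline(incident_time, rto)


# Hypothetical date; the earthquake hits at 12:00 PM and the RTO is 8 hours.
incident = datetime(2024, 1, 15, 12, 0)
rto = timedelta(hours=8)

deadline = recovery_deadline(incident, rto)
print(f"Full capacity is needed by {deadline:%I:%M %p}")
```

A recovery that finishes at 7:30 PM satisfies this RTO, while one that finishes at 8:01 PM does not.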
Recovery point objective (RPO)
Your RPO is the maximum acceptable amount of data loss, measured in time. Suppose you’re managing a database for a large application, with hundreds of thousands of records. At this scale, the number of records is likely growing at a very high rate, and this is where RPO becomes important. At 12:00 PM, a mouse chews through the main cable connecting your database server, bringing it offline. If your RPO is 1 hour, you will need to be able to restore all data that was already in the database at 11:00 AM, 1 hour prior to the incident (you’ll also need to call an exterminator, but that’s not exactly an engineering problem).
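One way to sanity-check an RPO is to compare it against your backup schedule. The sketch below assumes backups are the only recovery mechanism and uses made-up timestamps and function names; it simply finds the most recent backup before the incident and measures how much data would be lost.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def latest_restore_point(backups: List[datetime], incident_time: datetime) -> Optional[datetime]:
    """Most recent backup taken at or before the incident, if any."""
    candidates = [b for b in backups if b <= incident_time]
    return max(candidates) if candidates else None


def meets_rpo(backups: List[datetime], incident_time: datetime, rpo: timedelta) -> bool:
    """True if the data lost since the last backup fits within the RPO."""
    point = latest_restore_point(backups, incident_time)
    return point is not None and incident_time - point <= rpo


# Hypothetical schedules: the mouse strikes at 12:00 PM and the RPO is 1 hour.
incident = datetime(2024, 1, 15, 12, 0)
rpo = timedelta(hours=1)

# Backups every 30 minutes from 9:00 AM to 11:30 AM.
half_hourly = [datetime(2024, 1, 15, 9, 0) + timedelta(minutes=30 * i) for i in range(6)]
# Backups every 4 hours.
four_hourly = [datetime(2024, 1, 15, 4, 0), datetime(2024, 1, 15, 8, 0)]

print("Half-hourly backups meet RPO:", meets_rpo(half_hourly, incident, rpo))
print("Four-hourly backups meet RPO:", meets_rpo(four_hourly, incident, rpo))
```

The takeaway is that your backup (or replication) interval has to be no longer than your RPO, because a failure can land just before the next backup runs.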
Why RPO and RTO matter
What makes RTO and RPO important is that you can use them to quantify acceptable levels of service and data loss from a technological standpoint. Your organization likely has similar metrics on the business side. And this is one of the key responsibilities of a Solutions Architect—making sure that people from both sides understand the needs of the other.
To set an effective RTO, you need to define and understand the acceptable level of service for your business. In the example above, it was the application’s ability to support 50,000 users. In the real world, it may be the ability to process a certain number of requests, or something else entirely. Most organizations have an operational level agreement (OLA) that defines this specifically, and it will depend on a number of factors specific to your company.
Your RPO depends on how valuable your data is, as well as your business reputation. For example, new data being added to your system may include purchase orders. If customers who placed an order shortly before an outage are asked to resubmit their request, they may decide to look for another company they feel is more reliable. When that happens, the revenue from those orders is lost. The finance department is usually a good resource for hard statistics that can help determine what amount of data loss is acceptable for the business.
Creating a disaster plan for your organization
Once you define your ideal response to a disaster event, you can start to plan accordingly. When you’re creating your plan, there are two main phases to think about: preparation and reaction.
Identifying your RTO and RPO is just part of the preparation. You also need to create (or migrate to) infrastructure that will allow you to quickly move your site or app in the event of a disaster, and choose redundant storage options to mitigate your risk of data loss.
In this phase, you should start with an audit of your existing infrastructure. Start with cloud computing basics—high availability, fault tolerance, and redundancy. Which parts are most susceptible to failure? If those parts do fail, what will be the impact to the business? Once you answer these questions, you can start to create a tangible list of improvements that need to be made.
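If it helps to turn the audit questions into something you can sort, here is one rough way to do it in Python. The scoring rule (likelihood multiplied by hourly impact) is a common simplification rather than an official method, and every component name and number below is invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    name: str
    failure_likelihood: float  # rough yearly estimate between 0.0 and 1.0
    hourly_impact_usd: float   # estimated business cost per hour of outage
    redundant: bool            # is there a second copy that can take over?


def risk_score(component: Component) -> float:
    """Simple expected-impact score: how likely it fails times how much it hurts."""
    return component.failure_likelihood * component.hourly_impact_usd


def audit(components: List[Component]) -> List[Component]:
    """Components ordered from highest to lowest risk score."""
    return sorted(components, key=risk_score, reverse=True)


def single_points_of_failure(components: List[Component]) -> List[str]:
    """Names of components with no redundancy."""
    return [c.name for c in components if not c.redundant]


# Hypothetical inventory.
inventory = [
    Component("web server", failure_likelihood=0.30, hourly_impact_usd=1000, redundant=True),
    Component("database", failure_likelihood=0.10, hourly_impact_usd=5000, redundant=False),
    Component("load balancer", failure_likelihood=0.05, hourly_impact_usd=5000, redundant=False),
]

for component in audit(inventory):
    print(f"{component.name}: score {risk_score(component):.0f}")
print("No redundancy:", single_points_of_failure(inventory))
```

The output is your tangible list: the highest-scoring items and anything without redundancy are the first candidates for improvement.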
When disaster strikes, what are you going to do? Answering this question is the second part of an effective disaster recovery plan. In this stage, you need to think about how you’ll provision new instances when existing ones go down, and how you’ll handle failover from your existing instances to the new ones you create.
For example, if the datacenter that hosts your site were to suddenly go offline, how would you migrate your infrastructure? How long would it take? When you define answers to these questions, you can begin to understand your organization’s weak points and make calculated improvements to your reaction plan. We’ll be covering a strategy for disaster recovery in a later post.
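The two questions above ("where do we fail over to?" and "how long would it take?") can be sketched in a few lines of Python. This is only an illustration: the site names, the runbook steps, and their time estimates are all hypothetical, and the total assumes the steps run one after another.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass
class Site:
    name: str
    healthy: bool
    priority: int  # lower number = preferred site


@dataclass
class RunbookStep:
    description: str
    estimate: timedelta  # your team's own estimate, ideally measured in a drill


def choose_active_site(sites: List[Site]) -> Optional[Site]:
    """Return the preferred healthy site, or None if every site is down."""
    healthy = [s for s in sites if s.healthy]
    return min(healthy, key=lambda s: s.priority) if healthy else None


def estimated_recovery_time(steps: List[RunbookStep]) -> timedelta:
    """Total time if the steps run sequentially."""
    return sum((s.estimate for s in steps), timedelta())


# Hypothetical scenario: the primary datacenter is offline.
sites = [
    Site("primary-datacenter", healthy=False, priority=1),
    Site("standby-region", healthy=True, priority=2),
]
runbook = [
    RunbookStep("Provision replacement instances", timedelta(minutes=20)),
    RunbookStep("Restore database from latest backup", timedelta(hours=2)),
    RunbookStep("Switch traffic to the standby site", timedelta(minutes=15)),
]
rto = timedelta(hours=8)

target = choose_active_site(sites)
total = estimated_recovery_time(runbook)
print(f"Fail over to: {target.name if target else 'no healthy site'}")
print(f"Estimated recovery: {total}, within RTO: {total <= rto}")
```

If the estimated total comes out longer than your RTO, that gap is exactly the weak point your reaction plan needs to address.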
Planning for disaster on AWS
Using a cloud provider like Amazon Web Services is a major advantage in a disaster. AWS allows you to quickly provision new infrastructure, and it offers plenty of tools and services to help you manage your systems when you’re planning for the worst-case scenario.
AWS allows you to distribute your systems across different Regions and Availability Zones. This ability, along with general cloud computing concepts like high availability and fault tolerance, is essential to reducing your risk.
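To see why spreading out matters, here is a small Python sketch that places instances round-robin across zones and then checks how much capacity survives if one zone goes offline. It makes no AWS calls; the zone names follow the AWS naming pattern but are only examples, and the function names are made up.

```python
from collections import Counter
from itertools import cycle
from typing import Dict, List


def spread_across_zones(instance_count: int, zones: List[str]) -> Dict[str, int]:
    """Assign instances to zones round-robin and return the count per zone."""
    placement: Counter = Counter()
    for _, zone in zip(range(instance_count), cycle(zones)):
        placement[zone] += 1
    return dict(placement)


def capacity_after_zone_loss(placement: Dict[str, int], lost_zone: str) -> int:
    """Number of instances still running if one zone becomes unavailable."""
    return sum(count for zone, count in placement.items() if zone != lost_zone)


# Example zone names; substitute the zones available to your own account.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

placement = spread_across_zones(6, zones)
remaining = capacity_after_zone_loss(placement, "us-east-1a")
print(placement)
print(f"Instances left after losing us-east-1a: {remaining} of 6")
```

With everything in a single zone, losing that zone would leave zero instances; spread across three zones, the same failure leaves two thirds of your capacity running.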
There’s a reason AWS named its storage service S3 (Simple Storage Service): it really is simple to use. But there’s a lot going on behind the scenes. Data you write to S3 is stored redundantly, even within a single Region. In fact, it’s designed for 99.999999999% (11 nines) durability of objects over a given year. There are also features like versioning and Infrequent Access storage classes that can help you ensure that your data is available while minimizing cost.
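It can be hard to picture what eleven nines means, so here is the arithmetic as a short Python sketch. It treats the durability figure as a per-object, per-year design target and uses a hypothetical bucket of 10 million objects; exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# S3's design target of 99.999999999% durability, expressed as an exact
# per-object chance of loss in a given year: 1 in 100,000,000,000.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_objects_lost_per_year(object_count: int) -> Fraction:
    """Average number of objects lost per year at the design durability."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_single_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_objects_lost_per_year(object_count)


# Hypothetical bucket size.
objects = 10_000_000
print(f"Expected losses per year: {float(expected_objects_lost_per_year(objects))}")
print(f"Roughly one object lost every {years_per_single_loss(objects)} years")
```

Keep in mind that this is a design target for hardware-level loss. It does not protect you from accidental deletes or overwrites, which is where features like versioning come in.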
The most popular AWS compute service, EC2, is ideal for creating infrastructure that works with your disaster recovery plan. First, EC2 instances are incredibly fast to spin up. If your application goes down, creating new instances can be done in just a matter of minutes in most cases. You can also preconfigure AMIs (Amazon Machine Images) to create instances with the software and settings you need to run your infrastructure.
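As a rough sketch of what launching replacements from a preconfigured AMI could look like, the Python below builds the parameters for the EC2 `run_instances` call in boto3, the AWS SDK for Python. The AMI ID, instance type, zone, and helper names are placeholders, and the function that actually talks to AWS assumes boto3 is installed and credentials are configured, so treat it as a starting point rather than a finished tool.

```python
from typing import Any, Dict, List


def build_launch_request(ami_id: str, instance_type: str, count: int, zone: str) -> Dict[str, Any]:
    """Parameters for launching `count` identical instances from one AMI."""
    return {
        "ImageId": ami_id,               # the preconfigured machine image
        "InstanceType": instance_type,
        "MinCount": count,               # launch exactly `count` instances
        "MaxCount": count,
        "Placement": {"AvailabilityZone": zone},
    }


def launch_replacements(request: Dict[str, Any], region: str) -> List[str]:
    """Send the request to EC2 and return the new instance IDs.

    Requires the third-party boto3 package and valid AWS credentials.
    """
    import boto3  # imported here so the rest of the sketch loads without it

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**request)
    return [instance["InstanceId"] for instance in response["Instances"]]


# Placeholder values; replace with your own AMI ID, instance type, and zone.
request = build_launch_request(
    ami_id="ami-0123456789abcdef0",
    instance_type="t3.micro",
    count=2,
    zone="us-east-1b",
)
print(request)
# To launch for real: launch_replacements(request, region="us-east-1")
```

Keeping the request separate from the call makes it easy to review (or store in version control) what your recovery plan will launch before disaster actually strikes.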
These are just a few examples, of course. AWS offers similar tools to handle networking and manage your databases in the event of a disaster. In fact, most of their services are designed with this in mind, and it’s a big part of the reason AWS is the most popular cloud provider in the market today.
Disaster planning in the real world
Understanding your weaknesses is just the first step. Planning for a disaster is an ongoing process, and even once you have a plan, it should be evaluated regularly to make sure that your RTO and RPO align with the needs of the business.
The most important phase, however, is keeping your infrastructure current. One of the greatest risks to stability and security is unpatched software, and it’s important to know what your team is responsible for. AWS is a mature platform, and they release changes nearly every single day. Making sure your team has the knowledge and skills to keep up is absolutely vital. What might have been best practice yesterday could be overshadowed by a new service or feature tomorrow, and the only way to stay ahead of risk is through continuous learning and training. Speaking of continuous learning and training, check out the latest AWS hands-on training content we released last month.
Luckily, AWS makes disaster preparation easier than ever before. Its services are constantly being improved, its features are becoming even more enterprise-friendly, and the cost of cloud computing continues to attract large organizations—one study even found that cloud file storage costs up to 74% less than its traditional on-premises alternative.
When we think about disasters in IT, we think of lost profits, angry customers, and in some cases, that dreaded email from our manager asking to come to their office for what’s sure to be an unpleasant chat. But disasters don’t have to be storm clouds hanging over our heads. Platforms like AWS offer ways to minimize the risk through redundancy, speed, and services that are built with large-scale businesses in mind. With the proper infrastructure, plan, and recovery strategy, an IT disaster that would have once registered as an earthquake will barely be felt as a tremor in your organization.
For more information, check out the official AWS whitepaper on disaster recovery.
Where to start
If you’re not sure where to start learning the technical side of disaster planning, an AWS Certification may be just what you need. Download our free eBook on how to study, take the exam, and start down the path to becoming an AWS expert.