Too often, a disaster recovery plan can look a little like it was put together by the underpants gnomes.
Phase 1: Take database backups
Phase 2: ???
Phase 3: Recovered from disaster
Starting with taking backups isn’t the worst idea (can’t say the same for collecting underpants), but this approach leaves a lot of questions unanswered. Let’s try to think this through in a slightly more logical manner than an underpants gnome. Below are my personal top 5 recommendations when planning for recovery:
1 – What are your objectives?
A good starting point for understanding your objectives is to define the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). Think of these as “how long do you want recovery to take?” (RTO) and “what point in time are we able to recover to?” (RPO), which in practice means how much data you can afford to lose. The problem with both measures is that when you ask the business, the probable first response will be “minutes” and “none at all”. The problem with that response? Cost. Implementing near-zero RTO and RPO for all systems will be pricey. There is a difference between the revenue-affecting systems used by your customers and the back-office systems that make your life easier. Set the RPO and RTO accordingly.
What you need to do: Sit down with the business and define these objectives for each system you support. You will need to come prepared with some idea how you could meet the likely objectives.
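One way to come prepared is to write the proposed objectives down per system and check them against what your current backup schedule can actually deliver. The sketch below is only an illustration: the system names, the figures and the `meets_rpo` helper are all made up for the example, and it assumes the worst-case data loss is roughly one full backup interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objectives:
    """Agreed recovery objectives for one system (all values hypothetical)."""
    system: str
    rto_minutes: int  # longest acceptable time to restore service
    rpo_minutes: int  # largest acceptable window of lost data


def meets_rpo(objectives: Objectives, backup_interval_minutes: int) -> bool:
    """Assumes worst-case data loss is about one full backup interval."""
    return backup_interval_minutes <= objectives.rpo_minutes


# Illustrative figures only: the customer-facing system gets tighter
# targets than the back-office one.
storefront = Objectives("storefront-db", rto_minutes=15, rpo_minutes=5)
reporting = Objectives("reporting-db", rto_minutes=480, rpo_minutes=1440)

for obj in (storefront, reporting):
    status = "meets" if meets_rpo(obj, backup_interval_minutes=60) else "misses"
    print(f"{obj.system}: hourly backups {status} the {obj.rpo_minutes}-minute RPO")
```

With these made-up numbers, hourly backups are fine for the reporting system but nowhere near good enough for the storefront, which is exactly the kind of gap (and cost conversation) you want to surface in the meeting rather than during a disaster.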
2 – Plan your disaster
Planning your recovery involves planning your disaster. There are different types and different scales of disaster, requiring different types and different scales of response. While you can’t plan for them all, knowing how to recover your service from several common types of disaster will help build your knowledge in preparation for the disaster you didn’t predict.
What you need to do: Think through the disasters that could affect your database servers, from data loss due to human error to the total loss of a data centre. Plan responses and mitigations for each disaster you identify.
3 – Document it
Even the underpants gnomes documented their plan, incomplete though it was. This can’t be published and forgotten, though. As your environment and the technology change, both the kinds of disaster and your options for responding to them will change, so your plan and documentation need to change as well. Your documentation should include any scripts you’ve used in your testing to automate the process. Having a recovery process that’s scripted, documented and proven will save you from another disaster.
What you need to do: Formally document your plans along with the agreed RPOs and RTOs. Make sure your DR plans are agreed by the people who need to sign them off, and review the plans as things change.
4 – Testing
Testing will help iron out any gaps in your planning (e.g. backing up a database, but not backing up the encryption key) and demonstrate that you can (or cannot…) meet your recovery objectives. This is also a good time to really straighten out any scripts and automation you use for recovery, making sure these are as slick as they can be (and commenting them too). When the time comes to enact your recovery plan, there’ll be a lot of pressure and attention on you. Give yourself a head start by working through your plan when things are calm.
What you need to do: Take the plans you made in step 2. Work through them, engaging any other team members you need to. Revise the plans and the documentation with what you’ve learned.
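To show whether you can meet your RTO, it helps to time each test run rather than rely on a feeling that it "went quickly". Here is a minimal sketch of that idea. `run_drill`, `DrillResult` and `fake_restore` are hypothetical names, and the sleep is a stand-in for whatever your real restore script does.

```python
import time
from typing import Callable, NamedTuple


class DrillResult(NamedTuple):
    """Outcome of one timed recovery test."""
    elapsed_seconds: float
    rto_seconds: float

    @property
    def within_rto(self) -> bool:
        return self.elapsed_seconds <= self.rto_seconds


def run_drill(restore: Callable[[], None], rto_seconds: float) -> DrillResult:
    """Run a restore step and record how long it took against the RTO."""
    start = time.monotonic()
    restore()
    return DrillResult(time.monotonic() - start, rto_seconds)


def fake_restore() -> None:
    # Placeholder: in a real drill this would call your restore script.
    time.sleep(0.05)


result = run_drill(fake_restore, rto_seconds=1.0)
print(f"restore took {result.elapsed_seconds:.2f}s, within RTO: {result.within_rto}")
```

Keeping the timings from each drill alongside your documentation gives you evidence for the next conversation with the business about whether the agreed objectives are realistic.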
5 – Communication
You can’t really consider your work to be done and your applications to be recovered unless the application owner (and the rest of the business) knows about it. You need to plan your communication, and you need to know who needs to be in the loop for a given application. It’s here that a Configuration Management Database (CMDB) is key. As your environment grows in size or complexity, it becomes more and more important that details of the key components in your environment and their interactions are available to the whole team in a rigorous format.
What you need to do: Include in your documentation who the key stakeholders are, who owns the application and plan for how to communicate key parts of recovery to them.
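As a rough illustration of why recording components and their interactions pays off, the sketch below looks up who to notify when a given database server is affected. It is not a real CMDB: the `CMDB` dictionary, the `AppRecord` fields, the application names and the email addresses are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AppRecord:
    """One application entry in a toy, in-memory CMDB (hypothetical schema)."""
    owner: str
    stakeholders: list[str] = field(default_factory=list)  # Python 3.9+ syntax
    depends_on: list[str] = field(default_factory=list)


# Invented entries for illustration only.
CMDB = {
    "storefront": AppRecord(
        "alice@example.com", ["support@example.com"], ["storefront-db"]
    ),
    "reporting": AppRecord(
        "bob@example.com", ["finance@example.com"], ["storefront-db", "reporting-db"]
    ),
}


def who_to_notify(component: str) -> list[str]:
    """Owners and stakeholders of every application that uses the component."""
    contacts: set[str] = set()
    for record in CMDB.values():
        if component in record.depends_on:
            contacts.add(record.owner)
            contacts.update(record.stakeholders)
    return sorted(contacts)


print(who_to_notify("storefront-db"))
```

In this toy data, losing the storefront database touches both applications, so all four contacts come back, while losing the reporting database only involves the reporting owner and finance.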
If you follow the above five steps, your plan should look less vague and have definite actions to take, so that you can recover in a lot more scenarios than the gnomes' plan allows.