Disaster Recovery Runbook

Executive Summary

[Fill in with company information where appropriate]

[COMPANY NAME] currently has [NUMBER OF] virtual servers and [NUMBER OF] physical servers, all hosted at HQ in [LOCATION]. In the event of a disaster and the loss of these systems, the IT disaster recovery plan is to recover these servers to [LOCATION] using “[THIRD PARTY]”.

The key IT staff required for this recovery are: [INTERNAL STAFF] and “[THIRD PARTY]”. The servers will be recovered within [NUMBER OF MINUTES/HOURS] of invocation.

Contents

Executive summary

Invoking Disaster Recovery

Key contacts (internal)

Key contacts (external)

DR call tree – who calls who

Basic IT and telecoms diagram

Space for network diagram (VISIO, Dia etc.)

Space for networking information

Communication with the rest of the workforce/business

Alternative premises / Recovery location

Software license and registration keys (if required)

Recovery types

How to recover

Recovery Templates

Disaster recovery event log

Invoking Disaster Recovery

If a crisis or disaster has occurred, you must decide your course of action.

Begin by recording your cascading decision-making hierarchy. Who is authorised to invoke the DR plan, and from whom must they get approval to do so?

1 / CIO
2 / IT Director
3 / IT Manager
4 / IT Administrator

In a crisis situation, a common mistake is to spend too much time trying to fix the minor issues that caused the IT outage, at the expense of actually invoking DR.

Avoid this by specifying exactly what kinds of outage merit DR invocation, and then follow those standards rigidly.

For instance, a communications supplier may promise that an issue will be fixed “within 1 hour”. After 45 minutes, they may still be saying that the issue will be resolved “within 1 hour”. Unless a pre-defined threshold is followed, rolling estimates like this delay the invocation of the DR plan, and therefore the time until the business is up and running again. In this example, if the issue is not resolved within 1 hour of the original promise, you should invoke your DR plan and begin restoration. The restoration can always be stopped if the original systems become available and start working again.

Include both specific incidents that are likely to occur, such as flooding (if you are in an at-risk area), and general incidents that cover all of the other issues you might encounter.

Incident / What is affected / Action
IT systems down / Individual departments / If not resolved within 30 mins, invoke DR plan
Flooding / Access to offices / Relocate to alternative office
SAN failure, the SAN attempts to repair itself but loses all data / All IT systems / Restore from backups
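
The time-boxed rule described above can be expressed as a small check. The following is a minimal sketch in Python, not a prescribed tool: the function name should_invoke_dr, the "Communications outage" entry and the timestamps are illustrative assumptions, and the thresholds should be replaced with the limits agreed in your own incident table. The point it illustrates is that the decision depends only on when the clock started and on the pre-agreed limit, never on the supplier's latest estimate.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds per incident type (assumptions for illustration);
# replace with the limits agreed in your own incident table.
INVOCATION_THRESHOLDS = {
    "IT systems down": timedelta(minutes=30),
    "Communications outage": timedelta(hours=1),
}


def should_invoke_dr(incident: str, started_at: datetime, now: datetime) -> bool:
    """Return True once the outage has lasted at least the agreed threshold.

    Supplier estimates are deliberately not an input: a rolling promise of
    "within 1 hour" must not move the deadline.
    """
    threshold = INVOCATION_THRESHOLDS[incident]
    return now - started_at >= threshold


start = datetime(2024, 1, 1, 9, 0)
print(should_invoke_dr("IT systems down", start, datetime(2024, 1, 1, 9, 20)))  # False
print(should_invoke_dr("IT systems down", start, datetime(2024, 1, 1, 9, 30)))  # True
```

Keeping the thresholds in one table makes them easy to review in advance, which is when these decisions are best made.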

Key Contacts (internal)

Crisis Management Team
Name
Role / Managing Director
Mobile telephone number
Home telephone number
Personal email address
Name
Role / Financial Director
Mobile telephone number
Home telephone number
Personal email address
Name
Role / Business Continuity Manager
Mobile telephone number
Home telephone number
Personal email address
Department Heads
Name
Role / Sales Manager
Mobile telephone number
Home telephone number
Personal email address
Name
Role / Production Manager
Mobile telephone number
Home telephone number
Personal email address
Name
Role / HR Director
Mobile telephone number
Home telephone number
Personal email address
Name
Role / Backup Manager
Mobile telephone number
Home telephone number
Personal email address
Key Personnel for IT Recovery
In this section, list all staff with the specific skills needed for IT recovery. In larger IT departments, this might be the ‘Infrastructure Team’. For smaller departments, this might include all members of the IT team (as well as any 3rd party support contacts).
Amend the points below to produce an exhaustive list of all the skills and knowledge required by key recovery personnel, including:
  • The skills to recover systems
  • Login details and encryption keys for recovery
  • Knowledge of where encryption keys are stored securely (this may include an offsite 3rd party)
Name
Role / CIO
Mobile telephone number
Home telephone number
Personal email address
Name
Role / IT Director
Mobile telephone number
Home telephone number
Personal email address
Name
Role / Backup Manager
Mobile telephone number
Home telephone number
Personal email address

Key Contacts (external)

Type / Company name / Specific contacts / Contact number / Contact email address
Disaster recovery service provider / General support information
Account manager
Service delivery manager
Power supplier
Telecoms supplier
Telecoms supplier 2
Internet supplier
Internet supplier 2
Hardware/software/support suppliers
Insurance

Disaster Recovery Runbook|1

DR Call Tree – who calls who

Basic IT and Telecoms Diagram

Space for Network Diagram (VISIO, Dia etc.)

Space for networking information

Service / IP address / Resolution
Email
Web Server
Terminal Services
MPLS
VoIP
Monitoring Services

Communication with the rest of the workforce/business

Cards for distribution to staff:

Create emergency cards like the examples below with instructions for staff to follow in the event of a crisis situation. Include an emergency telephone number and clear actions to report an emergency if discovered.

Underneath, record your emergency telephone number, the recorded emergency instructions, and a schedule for how often the instructions should be updated.

Emergency telephone number and recorded message:

How often to update message:

Alternative Premises / Recovery Location

In most cases this will be another office location or a third party ‘recovery suite’.

Address / Contact details / Details

Software License and Registration Keys
(if required)

System / Version / Support / Platform
Features:
Users:
System / Version / Support / Platform
Features:
Users:

Recovery Types

Below are some stock examples of common IT failures that warrant DR invocation. The list is not exhaustive and you should add more examples pertinent to the risk profile of your organisation.

Recovery type / Possible actions / Who is responsible?
Single server loss / Create new VM – recover from disk-based onsite backups / IT Admin
SAN loss / Invoke disaster recovery; recover servers at DR site and send staff to alternate location / IT Admin
Satellite site lost / Recover site at HQ and provide remote access / Crisis Management Team
HQ, major site or primary data centre lost / Invoke disaster recovery; recover servers at DR site and send staff to alternate location / Crisis Management Team
IBM iSeries, AS/400 or Power Systems / Contact service provider / Crisis Management Team and specialist 3rd party
Connectivity/power loss / Contact ISP and relocate to 2nd site/failover to secondary power source / Crisis Management Team

How to Recover

Use this table to outline the order in which you need to recover individual servers, according to the business priority.

Priority order / Server name / IP address / Description
1 / DC01 / 101.10.10.101 / Domain Controller
2 / Ex01 / 101.10.10.102 / Exchange server
3 / CRM01 / 101.10.10.103 / CRM Cluster
3 / CRM02 / 101.10.10.104 / CRM Cluster
3 / CRM03 / 101.10.10.105 / CRM Cluster
4 / Accounts01 / 101.10.10.106 / Accounts server
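
The priority table above can also be kept in a machine-readable form so that the recovery order is derived rather than remembered. The sketch below is illustrative only: it copies the example rows from the table, and it assumes (this is not stated in the template) that servers sharing a priority number, such as the three CRM nodes, can be recovered together as one "wave". The function name recovery_waves is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Rows copied from the example table: (priority, server name, IP, description).
# Deliberately listed out of order to show that the sort does the work.
SERVERS = [
    (3, "CRM02", "101.10.10.104", "CRM Cluster"),
    (1, "DC01", "101.10.10.101", "Domain Controller"),
    (4, "Accounts01", "101.10.10.106", "Accounts server"),
    (3, "CRM01", "101.10.10.103", "CRM Cluster"),
    (2, "Ex01", "101.10.10.102", "Exchange server"),
    (3, "CRM03", "101.10.10.105", "CRM Cluster"),
]


def recovery_waves(servers):
    """Group servers into waves by priority, lowest priority number first."""
    ordered = sorted(servers, key=itemgetter(0, 1))  # by priority, then name
    return [
        (priority, [name for _, name, _, _ in group])
        for priority, group in groupby(ordered, key=itemgetter(0))
    ]


for priority, names in recovery_waves(SERVERS):
    print(priority, ", ".join(names))
```

If servers within a wave depend on each other, list them under separate priority numbers instead.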

Recovery Templates

Below are four templates outlining the steps to recover from example disasters.

Use these templates to create your own recovery plans in accordance with your own recovery processes.

Example 1 – DR invocation by recovering servers using an outsourced DR service

This example outlines the steps to take when working with a 3rd party disaster recovery provider.

Step / Possible actions / Who is responsible?
STEP 1 / Identify issue and report to Crisis Management Team / IT Admin, IT Manager
STEP 2 / Inform DR service provider of need to recover / Crisis Management Team
STEP 3 / Authorise recovery (this may include third party verifying the recovery with key staff, determining the correct backup/snapshot to recover from and input of credentials/encryption keys for recovery) / Crisis Management Team
STEP 4 / Start restoration (if not using a service provider, you may want to include specific step-by-step instructions here, with screenshots or more advanced instructions specific to your recovery software) / Service provider
STEP 5 / Confirm recovery of server(s) / Service provider
STEP 6 / Test recovered servers / Crisis Management Team
STEP 7 / Confirm the recovery was successful with IT team (i.e. do servers start?) / Crisis Management Team
STEP 8 / Functional testing with key users (small samples of users to test systems are working as expected and data is up to date) / Crisis Management Team
STEP 9 / Confirm user testing successful / Crisis Management Team
STEP 10 / Release DR systems to all users / Crisis Management Team

Example 2 – Invoking DR by restoring servers from backups

This example outlines the steps to take when recovering from a disaster using server backups.

Step / Possible actions / Who is responsible?
STEP 1 / Set up backup software and media for restoration / IT Manager, IT Admin
STEP 2 / Confirm correct backup version to recover from / IT Director
STEP 3 / Restore server / IT Admin
STEP 4 / Reboot server / IT Admin
STEP 5 / Log on to domain / IT Admin
STEP 6 / Check IP address / IT Admin
STEP 7 / Reboot VM / IT Admin
STEP 8 / Start database (restore databases separately if required) / IT Admin
STEP 9 / Check domain and AD authentication / IT Admin
STEP 10 / Reboot server / IT Admin
STEP 11 / Check SQL instance is running using SQL Management Studio / IT Admin
STEP 12 / Functional testing with key users (small samples of users to test systems are working as expected and data is up to date) / IT Manager, Operations Manager, Sales Manager
STEP 13 / Confirm user testing / IT Manager
STEP 14 / Release DR systems to all users / IT Director

Example 3 – Operating after loss of primary power

This example outlines the steps to take when maintaining business continuity throughout a major power outage.

Step / Possible actions / Who is responsible?
STEP 1 / Identify scope of affected systems / IT Admin, IT Manager
STEP 2 / Confirm issue and invoke DR procedures / Crisis Management Team
STEP 3 / Notify users and update emergency telephone message / Crisis Management Team
STEP 4 / Begin safely powering down non-essential systems or entire infrastructure until resolution / Crisis Management Team
STEP 5 / Verify interruptions to service delivery & resolve issues / Crisis Management Team
STEP 6 / Restore primary power source / Crisis Management Team and Service Provider

Example 4 – Operating after loss of primary connectivity

This example outlines the steps to take when maintaining business continuity throughout the loss of your primary internet connection.

Step / Possible actions / Who is responsible?
STEP 1 / Receive alert from automated monitoring / IT Admin, IT Manager
STEP 2 / Confirm issue / IT Manager, IT Director
STEP 3 / Contact ISP/supplier / Crisis Management Team
STEP 4 / Notify users and update emergency telephone message / Crisis Management Team
STEP 5 / Send skeleton operational team to DR site / Crisis Management Team
STEP 6 / Authorise home/remote working for non-essential users / Crisis Management Team
STEP 7 / Make changes to bandwidth intensive applications/services / Crisis Management Team

Disaster Recovery Event Log

As you work through the runbook, you should record every action in the event log. Following a successful recovery, you can then use the event log as the basis for the IT DR Report, and use any issues encountered to improve your plan.

DESCRIPTION OF DISASTER:
COMMENCEMENT DATE:
DATE DISASTER RECOVERY TEAM MOBILISED:
Key activities undertaken by disaster recovery team (give brief description) / Date and time undertaken / Outcome / Further action required
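
If the event log is kept electronically rather than on paper, each action can be appended as a timestamped row. The following is a minimal sketch under stated assumptions: the log_event helper, the shortened "Activity" column name and the sample entries are hypothetical, and an in-memory buffer stands in for a real file so that the example is self-contained. In practice the stream might be a file opened with open(path, "a", newline="") on storage that remains reachable during the outage.

```python
import csv
import io
from datetime import datetime

# Column names mirror the event log table above ("Activity" is a shortened label).
FIELDS = ["Activity", "Date and time undertaken", "Outcome", "Further action required"]


def log_event(stream, activity, outcome, further_action="", when=None):
    """Append one row to the event log, writing the header if the log is empty."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    if stream.tell() == 0:
        writer.writeheader()
    writer.writerow({
        "Activity": activity,
        "Date and time undertaken": (when or datetime.now()).isoformat(timespec="minutes"),
        "Outcome": outcome,
        "Further action required": further_action,
    })


# Illustrative entries only; the times and wording are made up.
log = io.StringIO()
log_event(log, "DR invoked by Crisis Management Team", "Provider notified",
          when=datetime(2024, 1, 1, 9, 35))
log_event(log, "DC01 restored", "Server boots", "Functional testing",
          when=datetime(2024, 1, 1, 11, 5))
print(log.getvalue())
```

Recording the time automatically at the moment of entry avoids reconstructing the timeline afterwards, when the IT DR Report is written.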
