Develop Disaster Recovery Plan

Overview

Imagine being in a foreign country with no money because your credit cards and cash have been stolen from your bag. What would you do next? If you had thought about this before leaving, you might have asked your bank what to do in such a situation. They might have given you a phone number to call reverse charges, or a local contact for emergencies.

Having a plan to follow when things go wrong is just as important to business continuity. A plan makes it easier for a business to return to normal operations as soon as possible. Without one, a business is far more likely to fail after a major disaster.

In this topic we look at how to translate an agreed disaster prevention and recovery strategy into a detailed process, procedure and resource plan. The plan is then used to recover from a disaster of any magnitude, be it minor, major or total devastation.

You should already have formulated preventive and recovery strategies. This requires:

developing strategies for dealing with risk
identifying the cost of preventive and recovery options
completing a strategy report
gaining approval from management for strategy implementation.

This topic contains:

reading notes
activities
references
topic quiz.

As you work through the reading notes you will be directed to activities that will help you consolidate what you have learned. The topic ends with references for further learning and a topic quiz to check your understanding.

Reading notes

Implementing a disaster prevention and recovery strategy
In-built system contingencies
Review and update policies and procedures
Additional or changed hardware and/or software required
Identifying cut-over criteria
Documenting the Disaster Recovery Plan

Implementing a disaster prevention and recovery strategy

Once the disaster prevention and recovery (DPR) strategy has been formally accepted by the business and approved by senior management, it's time to implement it. Required actions include:

changing procedures, eg running virus checkers each time a computer is switched on
purchasing equipment to provide fault tolerance and standby
implementing additional controls to identify errors
improving backup procedures
increasing security over data and user access
developing the disaster recovery plan.

These can be categorised as:

building or implementing in-built system contingencies
bringing the current site to the standard required
making changes to policies and procedures
implementing additional or changed hardware and/or software.

In-built system contingencies

Not all prevention or recovery processes cost money to implement. Existing systems often include facilities that have not been fully implemented or turned on. These vary from system to system, and it is important for the team undertaking the risk analysis to be aware of them.

We will examine a few of the built-in facilities of Windows XP Professional and how these may be used to safeguard against different risk events. These are summarised in the following table:

Table 1: Windows XP system contingencies
Facility / Function
User accounts / Restricts access to authorised users only
Encryption / Adds a level of security to ensure that confidential files remain secure
Permissions / Gives some users restricted access (such as read only) to safeguard data from destruction or corruption
Auditing / Tracks events to determine what users have been doing on their computers
Lock computer / Prevents others from accessing a user's computer
Support for smart cards / Restricts access to authorised users only
Automated System Recovery / Allows quick recovery from an operating system problem
Support for hardware RAID and disk mirroring / Allows the system to continue working even if a hard disk fails (software mirrored and RAID 5 volumes require a Windows server edition)
Recycle bin / Allows recovery of recently deleted files
Backup software / Creates backups of files and of the whole system
System restore / Monitors and records system changes, enabling roll-back to a previous point in time
File protection / Protects Windows system files from being corrupted or replaced by rogue software installs
Firewall / Blocks unsolicited incoming traffic, helping to prevent attacks by worms and other malicious software from the network or Internet

Controls such as passwords and access permissions may be referred to as logical controls.

Current site configuration

Here we are primarily concerned with systems in terms of software, data and hardware. However, the security and controls that are implemented at the physical site are also an important consideration in the risk analysis.

While encryption and user access controls can prevent unauthorised access to data, ideally no unauthorised person should be able to physically reach a computer in the first place. Figure 1 gives an idea of some of the levels of physical security that may be applied.

An organisation in a secure building, with locked doors on each floor, security guards and video cameras, can be confident that an intruder would find it difficult to access a PC and the confidential data it contains. However, many frauds are committed, and many errors made, by trusted employees. That is why logical controls and individual passwords for each user are still needed.

Figure 1: Security measures

Activity

To practise identifying systems contingencies go to Activity 1 located in the Activities section of the Topic menu.

Review and update policies and procedures

The normal day-to-day operations of an organisation are described in its policy and procedures manual. This may be stored electronically, for example on the company intranet, or published as a paper-based manual. After designing the recovery requirements, you will often need to update this manual to include the changes required to prevent or recover from a disaster.

As mentioned earlier, many risk events are also security threats, which are often identified during a security audit or review. Similarly, a review of current procedures forms part of the Disaster Recovery Planning process, to ensure that they meet the requirements of the Disaster Recovery Plan (DRP).

The review process has the following stages:

Identify key DRP issues that should have been resolved by the existing processes and procedures
Review and evaluate the operational policies to ensure that they meet the demands imposed by the DRP
Design a series of tests to verify that procedures are in accordance with these policies
Carry out the testing and document the results
Evaluate the findings and make any recommendations for changes or approve the current processes.
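Stages 3 and 4 (designing and carrying out tests) can sometimes be partly automated. The sketch below is a minimal illustration, assuming a hypothetical policy that a successful backup must exist within the last 24 hours and that backup completion times can be read from a log as (finish time, succeeded) records. The function name, record format and 24-hour limit are all assumptions, not part of any particular backup product.

```python
from datetime import datetime, timedelta

# Hypothetical policy: a successful backup must have finished within this window.
MAX_BACKUP_AGE = timedelta(hours=24)

def check_backup_policy(records, now, max_age=MAX_BACKUP_AGE):
    """Test one procedure against policy; returns (passed, finding for the report)."""
    successes = [finished for finished, succeeded in records if succeeded]
    if not successes:
        return False, "No successful backup recorded"
    latest = max(successes)
    age = now - latest
    if age > max_age:
        return False, f"Latest successful backup is {age} old (limit {max_age})"
    return True, f"Latest successful backup finished {latest:%Y-%m-%d %H:%M}"

# Example records: the most recent nightly run failed.
records = [
    (datetime(2024, 3, 4, 23, 0), True),
    (datetime(2024, 3, 5, 23, 0), False),
]
passed, finding = check_backup_policy(records, now=datetime(2024, 3, 6, 9, 0))
print(passed, finding)
```

Each finding is a line of evidence for stage 5: a failure either becomes a recommendation for change or confirms that the current process is adequate.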

The procedural changes required will depend upon what is discovered and the DRP strategy adopted. Here are a few examples:

Table 2: Examples of procedural changes
Strategy adopted / Impact on procedures
Nightly backups to be taken offsite / Backup procedures and the process for getting backups offsite and subsequent retrieval will need to be described.
Software to be fully tested before going into production / Testing procedures (including what ‘fully tested’ means), the documentation and the test results to be maintained will need to be described.
Virus checking / Procedures explaining the danger of viruses, how to check for viruses on disks and in e-mails, and what to do if a virus is discovered will be required.
Only licensed software to be used / Procedures for checking the number of licenses the organisation holds and what to do if more are needed will be required, together with the penalties to be imposed if staff disregard the policy.

A set of procedures for the disaster recovery plan itself will also be required.
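For the first row of Table 2 (nightly backups taken offsite), the backup procedure has to say which tape goes in the drive each night. One widely used approach is a grandfather-father-son (GFS) rotation; the notes do not prescribe it, so treat the sketch below as an assumption. It assumes daily ("son") tapes Monday to Thursday, a weekly ("father") tape each Friday, a monthly ("grandfather") tape on the last Friday of the month, and no weekend backups. The labels are hypothetical.

```python
import calendar
from datetime import date

def gfs_tape_label(day: date) -> str:
    """Return the tape set to use on `day` under an assumed GFS rotation."""
    weekday = day.weekday()  # Monday == 0 ... Sunday == 6
    if weekday >= 5:
        return "none (no weekend backup)"
    if weekday == 4:  # Friday
        last_day = calendar.monthrange(day.year, day.month)[1]
        if day.day + 7 > last_day:
            # No later Friday this month, so this is the monthly (grandfather) tape.
            return f"Monthly-{day:%b}"
        week_of_month = (day.day - 1) // 7 + 1
        return f"Weekly-{week_of_month}"
    return f"Daily-{day:%a}"

# Tapes for the last working week of March 2024.
for d in range(25, 30):
    day = date(2024, 3, d)
    print(day, gfs_tape_label(day))
```

Daily tapes are reused each week and weekly tapes each month, while monthly tapes are kept longest, so the procedure should also state how long each set stays offsite before it is retrieved for reuse.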

Additional or changed hardware and/or software required

A DRP strategy usually requires new or updated hardware and software. Some of these requirements are detailed in the following table:

Table 3: DRP requirements
Strategy / Hardware or software
Regular backups to tapes / Tape backup unit with sufficient capacity. Tapes for the backup. Appropriate backup software.
Mirrored disks or RAID / Additional disks or disk subsystems.
Fault tolerance systems, duplicated systems / Requires hardware similar to that being duplicated. If a file server is to be duplicated, a matching machine will be needed. May also require additional software licenses.
Virus checking / Virus-checking software licenses for all users
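"Sufficient capacity" in the first row of Table 3 can be checked with simple arithmetic before a tape unit is chosen. The sketch below is an illustration only: the data size, growth rate, equipment life and tape capacity are hypothetical figures, and it uses native (uncompressed) capacity so that any compression is a bonus rather than an assumption.

```python
import math

def tapes_per_full_backup(data_gb, annual_growth, years, tape_capacity_gb):
    """Tapes needed for one full backup at the end of the planning period."""
    projected_gb = data_gb * (1 + annual_growth) ** years
    return math.ceil(projected_gb / tape_capacity_gb)

# Hypothetical figures: 300 GB today, 20% growth a year, 3-year equipment life,
# 400 GB native capacity per tape.
print(tapes_per_full_backup(300, 0.20, 3, 400))
```

If the answer is more than one tape, an unattended nightly backup would need a larger drive or an autoloader; otherwise someone must be present to change tapes.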

Think about the hardware and software that a home user would need to implement the disaster prevention and recovery strategies identified earlier, under which:

work is saved every few minutes
files are regularly backed up
external backup devices, such as tape, zip disks or CDs, are used
important files are stored away from the home, possibly in the office
UPS or surge protectors are used, especially in areas that suffer power problems
telephone surge protectors are used with modems
virus-checking software is always used and kept up to date
a repair disk is always created
serial numbers of all components are recorded in case of theft
a fire extinguisher is kept in the vicinity of the computer
only licensed software is used and all licenses are stored safely
passwords and/or encryption are used to protect confidential files
passwords are not stored in dial-up settings
anti-spyware software and firewalls are always used if connected to the Internet
security patches for software (operating systems and applications) are kept up to date.

The following hardware and software would be required:

Backup tape unit (or zip drive or CD writer), tapes (or zip cartridges or CDs), appropriate backup software and hardware drivers
UPS and/or surge protectors for power and telephone
Virus-checking software
Fire extinguisher.
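For the home user's "files are regularly backed up" strategy, the backup software can be as simple as a script that zips a folder into a timestamped archive. The sketch below uses Python's standard `shutil.make_archive`; the function name and example paths are hypothetical, and the archive still has to be copied to external media or stored away from the home to protect against theft or fire.

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_folder(source: Path, backup_root: Path) -> Path:
    """Zip `source` into `backup_root`, with a timestamp in the archive name."""
    backup_root.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    base_name = backup_root / f"{source.name}-{stamp}"
    # make_archive appends ".zip" and returns the full path of the new archive.
    archive = shutil.make_archive(str(base_name), "zip", root_dir=source)
    return Path(archive)

# Hypothetical usage: back up My Documents to an external drive.
# backup_folder(Path.home() / "Documents", Path("E:/Backups"))
```

The timestamp keeps earlier archives rather than overwriting them, so a file corrupted by a virus can be restored from a version taken before the infection. The backup folder should not be inside the folder being backed up, or each archive would include the previous ones.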

Identifying the required hardware and software and developing plans to install them forms part of the DRP project. One aspect of the risk analysis and recovery plan is identifying cost-effective options to meet a variety of threats. These options will already have been approved as part of the plan. However, before implementing them, you will need to select particular products that meet both business requirements and cost constraints.

Sometimes the required resources cost much more than was originally estimated. In this case, you will need to revisit the DRP and submit a new recommendation for approval. In extreme cases (for example, where the original cost of a hot site was estimated at $300,000 but turned out to be nearer $1,000,000), management may decide to accept a higher level of risk and drop the hot site option, opting instead for a cold site costing only $200,000.
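Spotting the items that must go back for approval can be made routine by comparing each quoted cost with its approved estimate. In the sketch below the 10% tolerance is an assumption, the hot site figures come from the example above, and the tape backup unit figures are hypothetical. The code only flags items; choosing an alternative such as the cold site remains a management decision.

```python
from dataclasses import dataclass

@dataclass
class RecoveryOption:
    name: str
    approved_estimate: float
    quoted_cost: float

def review_costs(options, tolerance=0.10):
    """Split options into those within tolerance of their estimate and those to resubmit."""
    within, resubmit = [], []
    for opt in options:
        limit = opt.approved_estimate * (1 + tolerance)
        (within if opt.quoted_cost <= limit else resubmit).append(opt.name)
    return within, resubmit

options = [
    RecoveryOption("Hot site", 300_000, 1_000_000),
    RecoveryOption("Tape backup unit", 5_000, 5_200),  # hypothetical figures
]
within, resubmit = review_costs(options)
print("Proceed:", within)
print("Resubmit for approval:", resubmit)
```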

Precise requirements and costs are documented only when current operational procedures have been reviewed and gaps between the ideal and what is actually agreed upon have been identified.

Once the report has been approved, it is time to put it into action. This may involve changing existing policies and procedures and purchasing new hardware and software.

Large organisations may follow specific risk analysis methods and use specialised tools. Smaller organisations can still carry out an analysis by identifying the asset to be safeguarded, the possible risks to that asset, the cost to the company if a risk event occurs, the likelihood of such an event and the cost of prevention or recovery.

The output from the risk analysis is an action plan for changing the current way of working to minimise or prevent the risk, and a disaster recovery plan so staff know what to do should the risk event occur.
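One common way to combine the cost of an event with its likelihood is the annualised loss expectancy (ALE): the cost of one event multiplied by how often it is expected each year. The notes do not name a specific method, so ALE is a swapped-in technique here, and every figure below is hypothetical. A control is worth considering when its yearly cost is less than the ALE it is expected to remove.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    asset: str
    event: str
    loss_per_event: float       # cost to the business if the event occurs ($)
    events_per_year: float      # likelihood, e.g. 0.25 = once every four years
    annual_control_cost: float  # yearly cost of the prevention or recovery option ($)

    @property
    def annual_loss_expectancy(self) -> float:
        """Expected loss per year: cost of one event times its yearly frequency."""
        return self.loss_per_event * self.events_per_year

    def control_is_justified(self) -> bool:
        """Assumes the control removes the loss entirely; real controls rarely do."""
        return self.annual_control_cost < self.annual_loss_expectancy

# Hypothetical figures for illustration only.
risks = [
    Risk("File server", "Disk failure", 40_000, 0.5, 3_000),
    Risk("Office PCs", "Theft", 12_000, 0.25, 4_000),
]
for r in risks:
    verdict = "justified" if r.control_is_justified() else "review"
    print(f"{r.asset} / {r.event}: ALE ${r.annual_loss_expectancy:,.0f}, control {verdict}")
```

ALE is only a first cut. Rare but catastrophic events such as fire can have a small ALE yet still threaten the survival of the business, so they may warrant protection regardless.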

Activity

To practise the reviewing of policies and procedures go to Activity 2 located in the Activities section of the Topic menu.

Identifying cut-over criteria

How do you know when to activate your disaster recovery plan? If an earthquake destroyed the office building, the answer would be obvious. But what if a computer virus deleted all the data on one or all of the servers? Each possible incident needs to be analysed to determine the impact of the disruption on the business. The first step is to assess the extent of the damage to establish how long it will take to restore the business systems. If this exceeds the maximum allowable downtime, a disaster is declared.
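The cut-over rule above can be written down precisely, which helps when the decision has to be made quickly. In the sketch below the list of systems and their maximum allowable downtimes are hypothetical values that would come from the business impact analysis; a disaster is declared if any affected system's estimated restore time exceeds its limit.

```python
# Hypothetical maximum allowable downtime (hours) agreed for each business system.
MAX_ALLOWABLE_DOWNTIME = {"File server": 24, "Accounting system": 48, "E-mail": 72}

def assess_incident(estimated_restore_hours):
    """Return the systems whose estimated restore time exceeds their agreed limit.

    A KeyError means a system has no agreed limit, which is itself a gap in the plan.
    """
    return sorted(
        system
        for system, hours in estimated_restore_hours.items()
        if hours > MAX_ALLOWABLE_DOWNTIME[system]
    )

# A virus wipes the file server; rebuilding it from tape is estimated at 36 hours.
breaches = assess_incident({"File server": 36, "E-mail": 4})
declare_disaster = bool(breaches)
print("Declare disaster:", declare_disaster, breaches)
```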

The Disaster Recovery Co-ordinator, with input from upper management, is responsible for deciding when to activate the disaster recovery plan. If the co-ordinator is not available, responsibility flows down the chain of command. This is why it is important for roles and responsibilities to be clearly defined in the Disaster Recovery Plan. A contact list containing details of all employees, including after-hours phone numbers, should be created and maintained. If the organisation already has an internal directory listing, it can be modified for this purpose.

Figure 2: Example of a generic structure for disaster recovery
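The rule that responsibility flows down the chain of command when the co-ordinator is unavailable amounts to taking the first reachable person in an ordered list. In the sketch below only the Disaster Recovery Co-ordinator comes from the notes; the other roles are hypothetical placeholders for whatever the plan defines.

```python
from typing import Optional

# Chain of command in order of authority. Roles after the first are hypothetical.
CHAIN_OF_COMMAND = [
    "Disaster Recovery Co-ordinator",
    "Deputy Co-ordinator",
    "IT Manager",
    "Operations Manager",
]

def who_decides(available: set) -> Optional[str]:
    """Return the highest-ranking role that can be reached, or None if nobody can."""
    for role in CHAIN_OF_COMMAND:
        if role in available:
            return role
    return None

# The co-ordinator and deputy cannot be contacted after hours.
print(who_decides({"IT Manager", "Operations Manager"}))
```

A `None` result signals that the contact list or chain is too short, which is worth testing before a real incident.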

Documenting the Disaster Recovery Plan

All that remains is to document the Disaster Recovery Plan. The plan outlines the tasks that need to be completed to recover from the disaster and return the business to normal operations. The plan is dynamic: it will change as the business changes, so it is important to review it at regular intervals to ensure it is up to date.

There are many different possible formats for a DRP.

Here is one suggestion:

Introduction
Purpose
Scope
Authorities (what legal/contractual requirement the DRP complies with)
Record of change
Operations
Systems description and architecture (a general description of all the systems)
Responsibilities (detailed outline of teams responsible for recovery operations)
Activation phase (initial actions to detect and assess damage)
Recovery phase (processes and procedures to complete recovery of each system with nominated staff positions responsible for each task)
Details of the post-recovery review to be performed after the completion of the recovery from any declared disaster.
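When a draft plan is reviewed, it helps to check that no part of the chosen format has been left out. The sketch below encodes the suggested format above as data and lists the headings missing from a draft. It assumes the post-recovery review is a separate top-level section and that draft headings match word for word; both are assumptions, and the draft shown is hypothetical.

```python
# Section headings from the suggested DRP format.
DRP_OUTLINE = {
    "Introduction": ["Purpose", "Scope", "Authorities", "Record of change"],
    "Operations": [
        "Systems description and architecture",
        "Responsibilities",
        "Activation phase",
        "Recovery phase",
    ],
    "Post-recovery review": [],
}

def missing_sections(draft_headings):
    """Return outline headings (in outline order) that do not appear in the draft."""
    present = {h.strip().lower() for h in draft_headings}
    missing = []
    for section, subsections in DRP_OUTLINE.items():
        for heading in [section, *subsections]:
            if heading.lower() not in present:
                missing.append(heading)
    return missing

draft = ["Introduction", "Purpose", "Scope", "Operations", "Responsibilities", "Recovery phase"]
print(missing_sections(draft))
```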

An example DRP

A small firm of accountants consisting of two partners and four assistants operates from a house converted into offices. The building has fire alarms installed but no other fire or security devices. Computers are allocated to each staff member and networked to a file server. The equipment, which is around two years old, has been reliable to date but is now out of warranty and the original supplier is no longer in business.

The server has a tape backup unit but no one knows how to use it.

The office uses the following software:

MYQB accounting system to process the accounts for 150 clients. Average charge per client is $1,500 per account. Work for each client is carried out throughout the year.
PAT tax return system used with 2,000 clients. Average charge to a client is $150 per return. Most work is done from September to December.
FIN to manage the financial affairs of 100 clients. Average charge per client is $2,000. Work is carried out four times a year in March, June, September and December.
TIM for time recording of staff and billing of clients. Weekly time recording and monthly billing.
TRUS to keep trust accounts for 200 clients. Average charge is $1,000 per annum. Work is carried out in March and September. This software was developed by one of the assistants.
Excel for spreadsheet work
Word for word-processing
Windows ME and a NetWare server.

All users have access to the Internet and use e-mail to communicate with clients. Turnover of staff is quite high and all the assistants have been with the firm for less than a year. To minimise the need to administer the network, everyone signs on with the same user ID and password. This has been the situation for over two years.

Issues to be considered at this firm

If you were asked to undertake a risk analysis for this office you would need to take into account the following issues.

There are peak periods when risk events could cause greater damage
Confidential data is kept about clients
There are legal requirements to meet deadlines
Clients could readily change accountants if service is poor.

Major risks

Table 4: Major risks
Risk / Likelihood
PC failure and breakdown / Very likely
Theft / Possible
Fire / Possible
Client data security breaches / Possible
Software support problems / Very likely
Loss of data / Possible

Business requirements of the firm

The firm generates a lot of income from the use of computers. If we consider the worst possible case, in September, when all four peak periods coincide, they could earn an income of $18,750 from accounting work, $75,000 from tax returns, $50,000 from FIN work and $100,000 from trusts. This amounts to nearly $250,000 a month, or roughly $60,000 a week or $12,000 a day! If the server were down and little work could be done, this is the amount of revenue they could lose.

This is an over-simplification, however, since work would most likely just be delayed and staff could work overtime to catch up. On the other hand, if the problem resulted in the late payment of taxes, and the firm were held liable, they could end up paying a lot more in fines. Either way, this exercise demonstrates that the systems are critical.
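The worst-case figures above can be reproduced from the client numbers, average charges and working months listed for each system. The sketch assumes each system's annual fees are spread evenly over its working months (so FIN's $2,000 per client is treated as an annual charge split across four quarters, which is what the $50,000 figure implies), about four weeks per month and about 20 working days per month.

```python
# Case-study figures: (clients, average annual charge $, months in which the work is done).
ALL_YEAR = range(1, 13)
SYSTEMS = {
    "MYQB accounting": (150, 1_500, ALL_YEAR),
    "PAT tax returns": (2_000, 150, [9, 10, 11, 12]),
    "FIN": (100, 2_000, [3, 6, 9, 12]),
    "TRUS trusts": (200, 1_000, [3, 9]),
}

def monthly_revenue(month: int) -> float:
    """Revenue earned in `month`, spreading each system's fees over its working months."""
    total = 0.0
    for clients, charge, months in SYSTEMS.values():
        if month in months:
            total += clients * charge / len(months)
    return total

worst_month = max(ALL_YEAR, key=monthly_revenue)
peak = monthly_revenue(worst_month)
print(worst_month, peak)      # 9 243750.0
print(peak / 4, peak / 20)    # 60937.5 12187.5 (rough weekly and daily figures)
```

Running the same function for other months shows how much lower the exposure is outside September, which can help when scheduling maintenance or accepting longer recovery times.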

Actions to be taken

Table 5: Actions
Risk / Action
PC failure and breakdown / Consider the need for standby PCs and/or server. Negotiate service and support contracts for 24/7 quick response service. Implement backup procedures as a priority.
Theft / Consider office security involving the use of alarm systems and window shutters or bars. Implement backup procedures as a priority.
Fire / Consider use of a fireproof safe. Implement backup procedures as a priority including off-site backup.
Client data security breaches / Assign each user their own user ID and password. Improve security procedures and staff training. Ensure appropriate access is implemented. Consider the use of encryption for sensitive data. Consider implementing auditing of file-level access.
Software support problems / Review the quality of support provided for the TRUS system. Consider the outcomes if the assistant who developed it leaves. Negotiate for the source code or consider an escrow arrangement.
Loss of data / Implement backup procedures as a priority.
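Table 5 recommends backup procedures for four of the six risks, which is why they are marked as a priority. A quick tally of how many risks each action addresses makes this explicit. The action names below are abbreviations of the table entries, and splitting "backup procedures including off-site backup" into two actions is an editorial choice.

```python
from collections import Counter

# Actions from Table 5, abbreviated; each risk maps to the actions proposed for it.
ACTIONS = {
    "PC failure and breakdown": ["Standby PCs/server", "Service contract", "Backup procedures"],
    "Theft": ["Alarms and bars", "Backup procedures"],
    "Fire": ["Fireproof safe", "Backup procedures", "Off-site backup"],
    "Client data security breaches": [
        "Individual user IDs", "Staff training", "Access controls", "Encryption", "Auditing",
    ],
    "Software support problems": ["Review TRUS support", "Source code escrow"],
    "Loss of data": ["Backup procedures"],
}

coverage = Counter(action for actions in ACTIONS.values() for action in actions)
for action, count in coverage.most_common(3):
    print(f"{action}: addresses {count} risk(s)")
```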

Disaster Recovery Plan