ASCA & ALCS Curriculum, Disaster Recovery, V1.0, 06/04/2006

E-Learning Training Package on the AS-level Computer Applications (ASCA) and A-level Computer Studies (ALCS) Curricula


Disaster Recovery



Contents

1 Introduction
2 Disaster Examples
2.1 Physical damage
2.2 Logical damage
3 Creating a Disaster Recovery Plan
3.1 Data Loss Prevention
3.2 Data Recovery

Author’s remarks

  • Part of the material in this set of handouts is adapted from Wikipedia.
  • This set of materials was developed mainly by Chung, C.F. Jeffrey.

1 Introduction

"He didn't back up the data, so our team lost last week's work."

Have you ever heard of, or experienced, something like that? Although disasters rarely happen, once one does the price is always higher than you expected. Imagine that your personal computer suddenly broke down and you lost all of your teaching materials, PowerPoint slides and worksheets, which you had spent weeks or even months preparing. The trouble is severe. Now imagine the seriousness of the trouble if a whole computer network breaks down.

For you and your organization to protect your resources effectively from potential disasters, you should invest in a disaster recovery plan that minimizes data loss and allows data to be recovered even after a disaster has happened.

Disaster recovery is the ability of an infrastructure or system to restart operations after a disaster. Disaster recovery is used both in the context of data loss prevention and data recovery.

There are two primary metrics to demonstrate recoverability following failure:

Recovery Point Objective (RPO) is the point in time that the restarted infrastructure will reflect. Essentially, this is the roll-back that will be experienced as a result of the recovery. Reducing RPO requires increasing synchronicity of data replication.

Recovery Time Objective (RTO) is the amount of time that will pass before an infrastructure is available. Reducing RTO requires data to be online and available at a failover site.
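The relationship between a replication schedule and the roll-back a failure causes can be illustrated with a small sketch. The dates, times and function names below are hypothetical, chosen only to show how the worst-case RPO follows from the replication interval:

```python
from datetime import datetime, timedelta

def actual_rollback(last_replication: datetime, failure: datetime) -> timedelta:
    """The roll-back actually experienced after a failure: all work
    done since the last replication point is lost."""
    return failure - last_replication

def worst_case_rpo(replication_interval: timedelta) -> timedelta:
    """With periodic replication, the worst-case RPO is the full
    interval between successive replication points."""
    return replication_interval

# Hypothetical schedule: nightly backup at 02:00, failure at 14:30.
last = datetime(2006, 4, 5, 2, 0)
fail = datetime(2006, 4, 5, 14, 30)
print(actual_rollback(last, fail))          # 12:30:00 of work lost
print(worst_case_rpo(timedelta(hours=24)))  # 1 day, 0:00:00
```

Increasing the synchronicity of replication (a shorter interval) directly shrinks the worst-case RPO, as the text above notes.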

2 Disaster Examples

Before a disaster recovery plan is created, we must understand what kinds of disasters we may face and how to solve these problems. Here are some examples.

2.1 Physical damage

A wide variety of failures can cause physical damage to storage media. CD-ROMs can have their metallic substrate or dye layer scratched off; hard disks can suffer any of several mechanical failures, such as head crashes and failed motors; and tapes can simply break. Physical damage often causes some data loss, and in many cases damages the logical structures of the file system. This causes logical damage that must be dealt with before any files can be used again.

Most physical damage cannot be repaired by end users. For example, opening a hard disk in a normal environment allows dust to settle on the surface, causing further damage to the platters. End users generally do not have the right tools or technical expertise to make these sorts of repairs.

2.2 Logical damage

Far more common than physical damage is logical damage to a file system. Logical damage is primarily caused by power outages that prevent file system structures from being completely written to the storage medium, but problems with hardware and drivers, as well as system crashes, can have the same effect. The result is that the file system is left in an inconsistent state. This can cause a variety of problems, such as drives reporting negative amounts of free space, system crashes, or an actual loss of data. Various programs exist to correct these inconsistencies, and most operating systems come with at least one rudimentary repair tool for their native file systems. Linux, for instance, comes with the fsck utility, and Microsoft Windows provides chkdsk. Third-party utilities are also available, and some can produce superior results by rescuing data even when the disk cannot be recognized by the operating system’s repair utility.

Two main techniques are used by these repair programs. The first, consistency checking, involves scanning the logical structure of the disk and checking to make sure that it is consistent with its specification.
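A consistency check of this kind can be sketched with a toy file system. The layout below (a dictionary mapping file names to block numbers, plus a free list) is an illustrative assumption, not how any real file system stores its metadata; the check itself mirrors what fsck-style tools do, verifying that every block is either free or owned by exactly one file:

```python
def check_consistency(num_blocks, files, free_list):
    """Toy fsck-style check: every block must be either free or owned
    by exactly one file, and every block number must be valid."""
    errors = []
    seen = {}  # block number -> owning file
    for name, blocks in files.items():
        for b in blocks:
            if not 0 <= b < num_blocks:
                errors.append(f"{name}: invalid block {b}")
            elif b in seen:
                errors.append(f"block {b} claimed by both {seen[b]} and {name}")
            else:
                seen[b] = name
    for b in free_list:
        if b in seen:
            errors.append(f"block {b} is both free and owned by {seen[b]}")
    # Blocks that are neither free nor allocated have been "lost".
    lost = set(range(num_blocks)) - set(seen) - set(free_list)
    for b in sorted(lost):
        errors.append(f"block {b} is neither free nor allocated (lost block)")
    return errors

# Block 1 is cross-linked: two files claim it -- a typical inconsistency
# left behind by an interrupted write.
files = {"a.txt": [0, 1], "b.txt": [1, 2]}
print(check_consistency(4, files, free_list=[3]))
```

Real repair tools then resolve such conflicts, for example by duplicating the shared block or truncating one of the files.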

The second technique for file system repair is to assume very little about the state of the file system to be analyzed and to rebuild the file system from scratch using any hints that any undamaged file system structures might provide. This strategy involves scanning the entire drive and making note of all file system structures and possible file boundaries, then trying to match what was located to the specifications of a working file system. Some third-party programs use this technique, which is notably slower than consistency checking. It can, however, rescue data even when the logical structures are almost completely destroyed. This technique generally does not repair the underlying file system, but merely allows for data to be extracted from it to another storage device.
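One common form of this scan-and-rebuild approach is "file carving": searching the raw bytes of a drive for known file signatures and extracting whatever lies between them. The sketch below carves JPEG images using their real start and end markers; it is a simplification (it ignores fragmentation and false matches), and the sample "raw image" is fabricated for illustration:

```python
JPEG_SOI = b"\xff\xd8\xff"  # JPEG start-of-image signature
JPEG_EOI = b"\xff\xd9"      # JPEG end-of-image marker

def carve_jpegs(image: bytes):
    """Scan a raw disk image for JPEG signatures and extract the
    byte range between each start and end marker."""
    found, pos = [], 0
    while (start := image.find(JPEG_SOI, pos)) != -1:
        end = image.find(JPEG_EOI, start)
        if end == -1:
            break  # start marker with no end marker: truncated file
        found.append(image[start:end + 2])
        pos = end + 2
    return found

# Hypothetical raw image: junk bytes with one JPEG embedded.
raw = b"\x00" * 10 + JPEG_SOI + b"...photo data..." + JPEG_EOI + b"\x00" * 5
print(len(carve_jpegs(raw)))  # 1
```

Because carving never consults the (possibly destroyed) file system structures, it can recover files even from a disk the operating system no longer recognizes, as the paragraph above describes.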

3 Creating a Disaster Recovery Plan

As mentioned, disaster recovery should include data loss prevention and data recovery. In the following paragraphs, some important techniques for data loss prevention and data recovery will be highlighted.

3.1 Data Loss Prevention

Technique 1 – Using Uninterruptible Power Supply (UPS)

During a power failure, a disk controller may report that file system structures have been saved to the disk when this has not actually occurred. This can happen if the drive holds data in its write cache at the point of failure, leaving the file system in an inconsistent state in which the journal itself is damaged or incomplete. One solution to this problem is to use disk controllers equipped with a battery backup so that the waiting data can be written when power is restored. Alternatively, the entire system can be equipped with a battery backup that makes it possible to keep the system running in such situations, or at least gives enough time to shut it down properly. Such a battery backup is called an “Uninterruptible Power Supply” (UPS).

A UPS is a device or system that maintains a continuous supply of electric power to essential equipment that must not be shut down unexpectedly. It is inserted between the primary power source and the equipment to be protected, in order to eliminate the effects of temporary power outages and transient anomalies. UPSs are generally associated with telecommunications equipment, computer systems, and facilities such as airport landing systems and air traffic control systems, where even brief commercial power interruptions could cause injuries or fatalities, serious business disruption, or data loss.

To prevent a complete blackout, a power utility may use a process called load shedding, which reduces the amount of power being sent to consumers but does not eliminate it entirely. The resulting drop in voltage is sometimes called a voltage sag or a brownout. A UPS protects equipment during a brownout by using its internal batteries to correct the drop in voltage. The single biggest event that brought attention to the need for UPS power backup units was the big power blackout of 2003 in the north-eastern US and eastern Canada.

There are nine standard power problems that a UPS may encounter. They are as follows:

  1. Power failure.
  2. Power sag (under voltage for up to a few seconds).
  3. Power surge (over voltage for up to a few seconds).
  4. Brownout (long term under voltage for minutes or days).
  5. Long term over voltage for minutes or days.
  6. Line noise superimposed on the power waveform.
  7. Frequency variation of the power waveform.
  8. Switching transient (under voltage or over voltage for up to a few nanoseconds).
  9. Harmonic multiples of power frequency superimposed on the power waveform.

Technique 2 – Using Redundant Array of Independent Disks (RAID)

RAID is a system of using multiple hard drives for sharing or replicating data among the drives. Depending on the version chosen, the benefit of RAID is one or more of increased data integrity, fault tolerance, throughput or capacity compared with single drives. Its key advantage is the ability to combine multiple low-cost devices into an array that offers greater capacity, reliability, or speed, or a combination of these, than is affordably available in a single device.

The RAID specification suggested a number of prototype “RAID levels”, or combinations of disks, each with theoretical advantages and disadvantages. The most common levels that also provide fault tolerance to prevent data loss are RAID 1 and RAID 5. A simple animation that illustrates how a RAID system delivers its fault-tolerant behavior can be found at

RAID 1

Figure 1. RAID 1 configuration.

RAID 1 creates an exact copy (or mirror) of a set of data on two or more disks. This is useful when write performance is more important than minimizing the storage capacity used for redundancy. The array can only be as big as the smallest member disk, however. A classic RAID 1 mirrored pair contains two identical disks, which increases reliability over a single disk, but it is possible to have more than two disks. Since each member can be addressed independently if the other fails, reliability increases with the number of disks in the array. To truly get the full redundancy benefits of RAID 1, independent disk controllers are recommended, one for each disk. Some refer to this practice as splitting or duplexing.

When reading, both disks can be accessed independently. Thus the average seek time is reduced. The transfer rate would be almost doubled as data can be accessed in parallel. In general, the more disks are used in the array, the better the performance of the disk array. The only limit is how many disks can be connected to the controller and its maximum transfer speed.

RAID 1 has many administrative advantages. For instance, in some 365*24 environments, it is possible to “Split the Mirror”: make one disk inactive, do a backup of that disk, and then “rebuild” the mirror. This requires that the application supports recovery from the image of data on the disk at the point of the mirror split. This procedure is less critical in the presence of the “snapshot” feature of some file systems, in which some space is reserved for changes, presenting a static point-in-time view of the file system. Alternatively, a set of disks can be kept in much the same way as traditional backup tapes are.
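The mirroring behavior described above can be captured in a few lines. The class below is a toy model, not a real driver: it stores blocks in plain Python lists, and the `failed` set is an assumption used to simulate a disk going offline:

```python
class MirrorArray:
    """Toy RAID 1: every write goes to all working disks; a read can
    be served by any disk that has not failed."""

    def __init__(self, num_disks=2, num_blocks=8):
        self.disks = [[None] * num_blocks for _ in range(num_disks)]
        self.failed = set()  # indices of disks simulated as dead

    def write(self, block, data):
        # Mirror the block onto every surviving disk.
        for i, disk in enumerate(self.disks):
            if i not in self.failed:
                disk[block] = data

    def read(self, block):
        # Any surviving mirror can serve the read.
        for i, disk in enumerate(self.disks):
            if i not in self.failed:
                return disk[block]
        raise IOError("all mirrors have failed")

array = MirrorArray()
array.write(0, b"payroll")
array.failed.add(0)   # simulate disk 0 failing
print(array.read(0))  # b'payroll' -- served by the surviving mirror
```

Losing one disk loses no data, which is exactly the redundancy the section describes; with independent controllers (duplexing), even a controller failure leaves the other path intact.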

RAID 5

Figure 2. RAID 5 configuration.

RAID 5 uses block-level striping with parity data distributed across all member disks. RAID 5 has achieved popularity due to its low cost of redundancy. Generally RAID-5 is implemented with hardware support for parity calculations.

Every time a data block is written on a disk in an array, a parity block is generated within the same stripe. A block is often composed of many consecutive sectors on a disk. A series of blocks from each of the disks in an array is collectively called a “stripe”. If another block, or some portion of a block, is written on that same stripe, the parity block is recalculated and rewritten. The disk used for the parity block is staggered from one stripe to the next, hence the term “distributed parity blocks”. RAID-5 writes are expensive in terms of disk operations and traffic between the disks and the controller.

The parity blocks are not read on data reads, since this would be unnecessary overhead and would diminish performance. The parity blocks are read, however, when a read of a data sector results in a cyclic redundancy check (CRC) error. In this case, the sectors in the same relative position within the remaining data blocks of the stripe, and within the stripe's parity block, are used to reconstruct the errant sector. The CRC error is logged, but it does not hinder the operation of the computer system. Likewise, should a disk fail in the array, the parity blocks from the surviving disks are combined mathematically with the data blocks from the surviving disks to reconstruct the data on the failed drive “on the fly”. This is sometimes called Interim Data Recovery Mode. The operating system knows that a disk drive has failed and notifies the administrator that a drive needs replacement; applications running on the computer are unaware of the failure. Reading and writing to the drive array continue seamlessly, though with some performance degradation. In RAID 5, where there is a single parity block per stripe, the failure of a second drive results in total data loss.
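The parity used by RAID 5 is a byte-wise XOR of the data blocks in a stripe, and reconstruction works because XOR-ing the survivors with the parity yields the missing block. A minimal sketch (with made-up four-byte blocks standing in for real disk blocks):

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# One stripe on a four-disk RAID 5: three data blocks plus one parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# The disk holding d1 fails: rebuild its block from the survivors and parity.
rebuilt = xor_blocks(d0, d2, parity)
print(rebuilt == d1)  # True
```

This is the “on the fly” reconstruction of Interim Data Recovery Mode: every read of a block on the failed disk costs one read from each surviving disk plus the XOR, which is why performance degrades until the drive is replaced and rebuilt.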

The maximum number of drives in a RAID-5 redundancy group is theoretically unlimited, but it is common practice to limit the number of drives. The tradeoffs of larger redundancy groups are greater probability of a simultaneous double disk failure, the increased time to rebuild a redundancy group, and the greater probability of encountering an unrecoverable sector during RAID reconstruction. In a RAID 5 group, the mean time between failures (MTBF) can become lower than that of a single disk. This happens when the likelihood of a second disk failing out of (N-1) dependent disks, within the time it takes to detect, replace and recreate a first failed disk, becomes larger than the likelihood of a single disk failing.

RAID-5 implementations suffer from poor performance when faced with a workload which includes many writes that are smaller than the capacity of a single stripe because parity must be updated on each write, requiring read-modify-write sequences for both the data block and the parity block. More complex implementations often include non-volatile write back cache to reduce the performance impact of incremental parity updates.
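The small-write penalty comes from the read-modify-write sequence: to update one block, the controller must read the old data and the old parity, compute the new parity, and write both back. The identity it relies on is new_parity = old_parity XOR old_data XOR new_data. A sketch with fabricated two-byte blocks:

```python
def small_write(old_data: bytes, old_parity: bytes, new_data: bytes):
    """RAID 5 read-modify-write for one block: two reads (old data,
    old parity) and two writes (new data, new parity) per logical write.
    new_parity = old_parity XOR old_data XOR new_data."""
    new_parity = bytes(p ^ od ^ nd
                       for p, od, nd in zip(old_parity, old_data, new_data))
    return new_data, new_parity

# Stripe of three data blocks; parity is the XOR of all of them.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0f\x0f"
parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))

new_d1 = b"\x33\x44"
_, new_parity = small_write(d1, parity, new_d1)

# The incrementally updated parity matches a full recomputation.
print(new_parity == bytes(a ^ b ^ c for a, b, c in zip(d0, new_d1, d2)))  # True
```

Caching the old data and parity in non-volatile write-back cache, as the paragraph above mentions, lets the controller defer or batch these extra operations.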

In the event of a system failure while there are active writes, the parity of a stripe may become inconsistent with the data. If this is not detected and repaired before a disk or block fails, data loss may ensue as incorrect parity will be used to reconstruct the missing block in that stripe. This potential vulnerability is sometimes known as the “write hole”. Battery-backed cache and other techniques are commonly used to reduce the window of vulnerability of this occurring.

3.2 Data Recovery

Technique – Using Backup

Backup means copying data for the purpose of having an additional copy of an original source. If the original data is damaged or lost, the data may be copied back from that source, a process known as data recovery or data restore. The “data” in question can be either data as such or stored program code, both of which are treated the same by the backup software. A backup differs from an archive in that the data is duplicated, rather than simply moved.

Backups are most often made from hard-disk-based production systems to large-capacity magnetic tape, hard disk storage, or optical write-once read-many (WORM) disc media such as CD-R, DVD-R and similar formats. As broadband access becomes more widespread, network and remote backups are gaining in popularity, and there are quite a few companies (found by a Google keyword search) offering Internet-based backup. During the period 1975–95, most personal/home computer users associated backup mostly with copying floppy disks. However, the recent drop in hard disk prices, together with the hard disk's position as the most reliable rewritable medium, has made it one of the most practical backup media.

To plan for backup, several strategies should be considered:

  • A backup should be easy to do.
  • A backup should be automated and rely on as little human interaction as possible.
  • Backups should be made regularly.
  • A backup should rely on standard, well-established formats.
  • A backup should not use compression: uncompressed data is easier to recover if the backup media is damaged or corrupted.
  • A backup should be able to run without interrupting normal work.
  • If a backup spans multiple volumes, recovery should not rely on all volumes being readable or present.
  • If you back up to a certain medium, you also need to keep a drive available that can read it.
  • Backup media need to be read from time to time to make sure the data is still readable, and the data needs to be copied to a new medium before the old one deteriorates. Will you be able to read a CD-R in 10 years' time?
  • Each medium has benefits and drawbacks; also consider the cost per gigabyte when comparing different solutions.
  • Proper backup relies on at least two copies, stored on different media, kept at different locations.
  • In the case of a disaster, no one will be able to think clearly and act accordingly. For this reason, checklists that outline what to do need to exist.
  • Staff need proper training in what to do in case a disaster occurs.

To perform a backup, you should use backup software such as ntbackup in the Windows XP environment (see Figure 3).

Figure 3. ntbackup – a data backup and recovery tool in Windows XP.

To perform a network backup, which means backing up data to a network backup server over the network, you should configure your client computers and the network backup server with proper IP addresses and subnet masks.
