Dear colleagues,

This is the report on the incident that occurred at CNAF, requested by the BaBar management. I am presenting it in my capacity as INFN representative in the CSC.

1) INCIDENT DESCRIPTION

On June 15, 2006, at CNAF, a double disk blade failure occurred on “volume group” 1 (out of 4) of the STK BladeStore 0855 disk system (class 1250), with both failed blades served by the same RAID5.

The second failure happened on June 15, during the RAID reconstruction of the data triggered by the first disk failure (which happened on the 14th).

The net result was the loss of around 7 TB of skimmed data (mainly R14/16) and of one of the two Charm AWG areas (about 1.8 TB), out of the four AWG areas existing at CNAF.

The user data on the AWG area were lost because, although both Charm AWG areas (and the “skimming” area) were supposed to be backed up on tape, the backup procedure had been failing for more than one month and no full “snapshot” was available.

No one on the BaBar side (Alexis and Armando) was notified of this failure before June 14, when the CNAF backup admin discovered the failure and restarted the backup procedure. It was too late, however, because of the unlucky coincidence of the double disk failure.

2) ADOPTED ACTIONS

The skimmed data were re-imported from SLAC within a few days.

After checking that the failure was not due to problems in the electrical interface of the blades on the backplane, at the request of the BaBar side and following an agreement reached between STK and CNAF (meeting of June 23), the two broken disks were handed over on June 27 to Microwell, a company in Milan specialized in data recovery.

The recovery of one disk, the one in the spare blade, has failed. The other disk has had mechanical parts replaced and is currently undergoing mirroring recovery. If the recovery succeeds, the mirror disk will be put in the blade and the whole system will undergo a RAID5 reconstruction, after which the AWG data should reasonably be recovered. The company has asked for one more week to finish the work.

In order to prevent similar incidents in the future or, at least, to mitigate their effects, the following actions have been decided:

1) The AWG areas will be moved to a newer and slightly more reliable storage system (STK FlexLine 680). Currently, one AWG area has already been moved.

The other AWG areas will be moved as soon as two new dedicated disk servers with the appropriate Fibre Channel HBAs have been installed and configured (they are expected to be shipped to CNAF within the next week). A reasonable timescale for this, because of the summer vacations of CNAF staff, is the end of August or the first week of September.

In the meantime, BaBar is going to provide a disk backup of the remaining Charm AWG area (1.8 TB maximum size, but only 85 GB right now).

2) The Charm AWG area is now being backed up (as well as the home directories).

BaBar and CNAF are going to re-discuss the backup policy soon. It’s likely that the backup responsibility will be moved to another CNAF person or shared between two CNAF people. Due to the large size of the area to be backed up, CNAF has proposed a looser backup policy, which would provide an incremental backup every 2 days and a monthly full backup.
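
For illustration only, the proposed policy could be sketched as below. This is not the CNAF backup configuration: the choice of the 1st of the month for the full backup, the counting of the 2-day interval from that day, and the function name backup_kind are all assumptions made for the example.

```python
from datetime import date

INCREMENTAL_INTERVAL_DAYS = 2  # from the proposed policy: incremental every 2 days
FULL_BACKUP_DAY = 1            # assumption: the monthly full backup runs on the 1st


def backup_kind(day: date) -> str:
    """Return "full", "incremental" or "none" for the given calendar day.

    Hypothetical helper: the interval is counted from the day of the
    monthly full backup and restarts every month.
    """
    if day.day == FULL_BACKUP_DAY:
        return "full"
    days_since_full = day.day - FULL_BACKUP_DAY
    if days_since_full % INCREMENTAL_INTERVAL_DAYS == 0:
        return "incremental"
    return "none"


if __name__ == "__main__":
    # Print the schedule for the first week of an example month.
    for d in range(1, 8):
        current = date(2006, 7, d)
        print(current.isoformat(), backup_kind(current))
```

Under these assumptions, at most two days of changes would be unprotected between incremental backups, and a restore would need the last full backup plus the incrementals taken since then.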

3) FUTURE OPERATIONS

CNAF is going to test a new storage system (in which each disk will have its own Fibre Channel interface). It has been agreed that a few TB of the first limited-size test batch will be given to BaBar.

Moreover, CNAF is also going to test a new large tape storage system in the autumn.

CNAF staff is confident that all the measures presented and a renewed storage system monitoring will reduce the probability of data loss and minimize the disruption to users.

In addition, the major upgrade of the cooling and power distribution system scheduled for 2007 will improve the stability of the hardware installed at CNAF by providing a better environment.

Best Regards,

Fabrizio Bianchi