KRDS2 Benefits Case Study: National Crystallographic Service, The University of Southampton (Supplementary Material)

This supplementary material prepared by Simon Coles is made available on the KRDS2 project web site. A summary version of this material is published in the KRDS2 final report.

1.1.  Background

The National Crystallographic Service (NCS) has been in operation since 1981, housed within the respective Schools or Departments of Chemistry at a number of institutions and funded by a succession of research grants under both the rolling grant and responsive mode funding schemes of the EPSRC. The NCS provides an analytical service for UK chemists, based on state-of-the-art experimental data collection facilities. This service includes the provision of raw data for those ‘skilled in the art’ who wish to work up a crystal structure themselves but do not have experimental facilities available to them, and the provision of fully analysed crystal structures for chemists who do not have the necessary training or facilities to conduct these experiments.

The Service has operated a range of different instruments and has been operational for long enough to have seen, and be party to, a number of different techniques, processes, software programs, file formats and standards. Over the years this has given rise to very useful longitudinal data on the acquisition of essential analytical chemistry data, which makes the NCS an excellent case to inform this study. Whilst the technique of crystallography has always been reliant on computational power to develop its models, it is only since the mid-to-late 1980s that the entire process, from data collection to publication, has been digitally underpinned. The increase in the power of the equipment and data analysis, deriving from improved instrumentation, increased computer power and much improved algorithms, has led to an explosion in the use of X-ray crystallography. Initially a technique that was used only when absolutely necessary, it has become the routine structural technique of choice when a suitable crystalline sample is available.

Initially policies on the archiving and storage of this digital data were scant due to a lack of knowledge or understanding of working with this medium. More recently it has become clear that a service operating on behalf of others must have a policy for the archiving of the data it generates so that the data can be provided on request, some time after the original experiment(s) has been performed.

As with many instrument-based scientific techniques, crystallographers make a definite distinction between raw and analysed data. In this case, modern ‘area detector’ raw data consists of about a gigabyte of ‘image’ files that are recorded in a proprietary binary format, which depends on the manufacturer of the instrument being operated. This method of data collection has been routine and the approach of choice for about 15 years, and data acquisition typically takes between 1 and 12 hours. Prior to the arrival of this technology, raw data were collected on ‘point detectors’, which also produced proprietary format files, from either manufacturer-supplied or personally developed software that generally ran on VAX mainframe type computers. These data collections would typically have taken days or weeks to conduct. Very little point detector raw data remains, or is readable, today. Older instruments tended to be driven by software written in Fortran, whilst current machines can be controlled by software written in modern programming languages and running on PC (Windows) or Linux operating systems. Raw data is then processed into a condensed derived form (of the order of megabytes, in ASCII text format) that can be read and worked up by a range of open source software that easily runs on an average desktop computer. Historically the software for working up a dataset would have run on mainframe or university central computing facilities, and the output of each cycle would be printed and taken back to the laboratory for analysis and improvement of the model, ready for input to the next cycle of workup.

It is only in the last decade that the storage of raw data in digital form has become a lesser issue and ceased to be a significant component of the time taken to perform a study. However, little is understood by the average researcher about the curation and preservation of this data over time. This study assigns cost data to the preservation of raw data over this time period, as the NCS has shifted from storage on magnetic tape to CD to DVD to removable solid-state disks and also to online mass storage. Due to the size of raw datasets and the cost of storage on magnetic tape, very little was stored for any length of time from the point detector era: as soon as a dataset had been worked up into a satisfactory result the raw data was deleted to make room for the next dataset.

It is only since the early 1990s that even the relatively small results files (CIFs) have all been routinely stored in their born-digital form: originally on 5¼ inch floppy disks, then 3½ inch floppies, followed by CDs and more recently on removable USB drives, computer hard disks or online. Prior to the early 1990s the medium for long-term storage of results data was paper. During the last year, the NCS has been systematically organising all available results data from its entire history in an effort to make them openly available in electronic form; this has involved migration from all the media mentioned above and in many cases complete construction of the results in CIF format. We have compared manual typing up of coordinates with digital scanning and subsequent optical character recognition, and have costs for these processes. Additionally this study reports costs for transformation of the paper-based results into CIF format and also the cost of migration of more recent data from spinning disks to on-line storage in a structured and managed repository (vide infra: the eCrystals repository).

It is also important to note from these migration studies that a very significant amount of time is also spent discovering, acquiring and generating metadata associated with these datasets. The value of metadata has only recently been recognised and therefore this has had to be ‘clawed back’ from paper media, such as notebooks and diaries.

1.2.  Comparison Data

This discussion is considered in two parts: the cost of preservation now in contrast to the cost of the equivalent activity in a previous era, and the cost of rescuing data from a particular point in time versus the cost of ‘doing it properly’ today. Over the historic timescale of the crystallographic technique, one can identify a number of distinct ‘generations’, each with its own characteristic approach to the handling and storage of data. For raw data, storage was only a possible practice from the mid-1980s and these generations can be considered as:

1)  1985 – 1995: Magnetic Tape

2)  1995 – 2005: CD / DVD

3)  2005 – present: On-line storage

For results data the generations are slightly different, due to the physical size of the digital information to be stored:

1)  1970 – 1990: Paper Records

2)  1990 – 2000: Floppy Disks

3)  2000 – present: Hard Drives

The key costs for data preservation are going to be the development of infrastructure, advocacy and administration. These are generally very labour-intensive exercises and therefore the real costs (direct and indirect) will be perceived as high. The School of Chemistry does not currently have any plans for data preservation services, mainly because chemistry research data is very diverse and some of it will not easily lend itself to preservation under a generic model. However, the Repository for the Laboratory (R4L: http://r4l.eprints.org), which was run in the School, developed a generic repository for data deposition. Running preservation services in this way would require an initial outlay for development, deployment and advocacy, but thereafter the costs would be similar to those incurred in running an institutional repository, i.e. the primary cost would be for an FTE to perform administrative duties. However, it is not clear that these costs can be covered by simple routes, i.e. they will not come directly from a research grant but will have to be incorporated into a complex financial model.

The basis for the following discussion is the cost data, for which a full breakdown of each component is provided in section 1.4. It is useful here to note the value of a crystal structure (in order that we may contrast this with its preservation costs over time) and this can be measured as the unit cost for its generation. This is calculated as £328.60 at current day costs (it has not been possible to find accurate historic costs for this).

The relative costs for raw data preservation can be summarised as follows:

1985-1995 Magnetic tapes £21.95

1995-2005 Compact Discs £6.00

2003-Present Outsourcing £1.48

This clearly demonstrates that the cost of archiving has dropped to roughly a quarter of its previous level each time a new storage medium (and hence archival approach) has become widely available. It is important to note that this process is one of byte storage and very little, or no, preservation activity is performed: CDs were not periodically checked to ensure they were still readable and the outsourcing option merely ensures the retrieval of the bytes deposited.

Results data preservation is quite different to raw data in that its size/volume is considerably more manageable, all files are ASCII text format and a considerable amount of metadata is required. The summary of costs for preservation during three separate generations is:

1970-1990 Paper records £30.00

1990-2000 Electronic copies on 3.5” floppy disks £7.25

2000-present Electronic copies on computer disks £2.15

The real cost of preserving results data likewise drops to roughly a quarter of its previous level as new methods and media become available.
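The generation-on-generation change in both summaries can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the per-dataset figures quoted above; the variable and function names are illustrative and not part of the study.

```python
# Per-dataset preservation costs in GBP, copied from the two summaries above.
raw_costs = [("magnetic tape", 21.95), ("compact disc", 6.00), ("outsourcing", 1.48)]
results_costs = [("paper", 30.00), ("floppy disk", 7.25), ("hard disk", 2.15)]


def successive_ratios(costs):
    """Return each generation's cost divided by the previous generation's cost."""
    return [
        round(later / earlier, 2)
        for (_, earlier), (_, later) in zip(costs, costs[1:])
    ]


raw_ratios = successive_ratios(raw_costs)
results_ratios = successive_ratios(results_costs)

print("raw data:", raw_ratios)
print("results data:", results_ratios)
```

Each ratio falls between 0.24 and 0.30, so every new generation costs roughly a quarter of the one before it.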

Results can be regenerated if the raw data is preserved. However at modern day fEC, this would amount to between £50 and £400 (1-8 hours PDRA time) per structure. If raw data has not been preserved and results are lost then the cost of not preserving this data is enormous, as the compound generally cannot be resynthesised and therefore the amount that might be attributed here would be the cost of generating a molecule from a fresh research project (£20K – this figure is taken from the KRDS1 Southampton case study).

The time required for metadata recovery was a significant component of the rescue, and was more or less the same irrespective of the medium on which the data had been stored.

1.3.  Cost – Benefit Analysis

From a costing of the process involving migration of material from a particular generation into the infrastructure of today and a knowledge of the current cost of preserving data within this modern infrastructure, it is possible to derive a benefit analysis based on costs.

Raw Data

Migration between media is often problematic where raw data is concerned, as data formats and archival hardware are generally closely tied to the instrumentation: new instruments involve new software, formats and archival methods. For example, it was not possible or sensible to migrate any data from magnetic tapes to CDs, because a new instrument had been introduced, but the format was maintained for the CD to outsourcing migration and that migration was therefore deemed worthwhile. The cost of the latter migration was approximately £1.75 per dataset, which was essentially attributable to labour. There was a 7% loss of data in this process due to CDs being corrupt or unreadable; this factor was dependent on the original method of production (pressed versus burned) and the speed of writing. As a rule of thumb it is now generally accepted that data will be lost after 5 years if the CD was written at the highest speeds. It is worth noting that this loss corresponds to 140 raw datasets, which equates to roughly £2.8M when costed at the reproduction value of £20K per dataset given above.
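The loss figures for the CD migration can be reproduced as follows. This is a sketch only: the size of the CD archive and the total labour cost are not stated in the text and are inferred here from the 7% and 140-dataset figures, so they should be read as assumptions.

```python
loss_fraction = 0.07        # share of raw data lost to corrupt or unreadable CDs
lost_datasets = 140         # lost raw datasets, as reported in the text
reproduction_cost = 20_000  # GBP per dataset (KRDS1 Southampton case study figure)
migration_cost = 1.75       # GBP per dataset, CD to outsourced storage

# Value of the lost datasets at reproduction cost.
value_lost = lost_datasets * reproduction_cost

# Inferred, not stated in the source: archive size implied by 140 datasets being 7%.
implied_archive_size = round(lost_datasets / loss_fraction)

# Inferred labour cost of attempting to migrate the whole archive.
implied_migration_total = implied_archive_size * migration_cost

print(f"value lost: GBP {value_lost:,}")
print(f"implied archive size: {implied_archive_size} datasets")
print(f"implied migration cost: GBP {implied_migration_total:,.2f}")
```

On these assumptions the archive held about 2,000 datasets and the whole migration cost in the region of £3,500, which is small compared with the £2.8M reproduction value of the datasets that could not be read.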

It should be noted that this was purely an exercise of migrating archived datafiles – there was no metadata associated with each collection of files, which was merely compressed and zipped with a name prefix that included the original identifier.

The cost to migrate a dataset from CD to removable disk or online storage is £1.75, which contrasts with a cost of £1.48 to deposit and store indefinitely.

Results Data

In a relative sense the cost of migration is high for results data, with paper to electronic migration costing £48.20 per structure for recovery of both data and results. It did not make economic sense to employ people to input data manually, as in the small set trialled this took four times longer than Optical Character Recognition (OCR) with subsequent tidying. There was a 5% loss of data stored on floppy disk, which in real terms equates to a loss of roughly £3286.00.

However, in cases where OCR did not work it cost £12.10 to type the result and deposit in the repository. The cost to migrate from floppy disk was £7.10, whilst the equivalent process for data stored on hard drives was £4.77.
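The results-data figures can be related to the unit value of a crystal structure (£328.60) with the sketch below. The number of lost structures and the size of the floppy disk holdings are not given in the text; they are inferred from the £3286.00 and 5% figures, on the assumption that the loss was costed at the unit value, and should be treated as such.

```python
unit_value = 328.60          # GBP, unit cost of generating one crystal structure
floppy_loss_value = 3286.00  # GBP, reported value of results lost from floppy disks
floppy_loss_fraction = 0.05  # share of floppy disk results that were lost

# Inferred, not stated in the source: structures lost and total floppy holdings.
lost_structures = round(floppy_loss_value / unit_value)
implied_floppy_holdings = round(lost_structures / floppy_loss_fraction)

# Per-structure migration costs in GBP, as quoted in the text.
migration_costs = {
    "paper (OCR route)": 48.20,
    "paper (typed when OCR failed)": 12.10,
    "floppy disk": 7.10,
    "hard drive": 4.77,
}

# Each migration cost as a percentage of the value of one structure.
share_of_value = {
    route: round(100 * cost / unit_value, 1)
    for route, cost in migration_costs.items()
}

print(lost_structures, implied_floppy_holdings)
for route, share in share_of_value.items():
    print(f"{route}: {share}% of unit value")
```

Even the most expensive route, paper recovery at £48.20, comes to under 15% of the cost of generating the structure afresh.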

Medium / Preservation cost (raw data) / Preservation cost (results data) / Migration cost (raw data) / Migration cost (results data) / Migration as % of preservation (raw data) / Migration as % of preservation (results data)
Paper / Not possible / £30.00 / Not possible / £48.20 / N/A / 160.66%
Magnetic tape (raw data) or floppy disk (results data) / £21.95 / £7.25 / Not possible / £7.10 / N/A / 97.9%
CD (raw data) or hard disk (results data) / £6.00 / £2.15 / £1.75 / £4.77 / 29.2% / 222.8%

Table 1. Summary of the costs (per dataset) for migration and preservation processes.
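The percentage columns of Table 1 are the migration cost divided by the preservation cost for the same medium. A minimal sketch of that calculation, using only the figures in the table (the dictionary keys are illustrative labels):

```python
# (preservation cost, migration cost) in GBP per dataset, taken from Table 1.
table_rows = {
    "paper, results data": (30.00, 48.20),
    "floppy disk, results data": (7.25, 7.10),
    "CD, raw data": (6.00, 1.75),
    "hard disk, results data": (2.15, 4.77),
}

# Migration cost expressed as a percentage of preservation cost.
migration_pct = {
    label: round(100 * migration / preservation, 1)
    for label, (preservation, migration) in table_rows.items()
}

for label, pct in migration_pct.items():
    print(f"{label}: {pct}%")
```

Recomputed this way, the paper, floppy disk and CD rows agree with the table (the 160.66% shown for paper appears to be truncated rather than rounded). The hard disk row comes out at about 221.9% rather than the 222.8% listed; the source does not indicate which of the underlying figures accounts for the difference.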