A specialised metadata approach to discovery and use of data in the NERC DataGrid

Kevin O'Neill1, Ray Cramer3, Marta Gutierrez2, Kerstin Kleese van Dam1, Siva Kondapalli3, Susan Latham2, Bryan Lawrence2, Roy Lowry3, Andrew Woolf1
1 CCLRC e-Science Centre
2 British Atmospheric Data Centre
3 British Oceanographic Data Centre

1. Introduction

The Natural Environment Research Council (NERC) has a wide range of data holdings, stored in technologies ranging from flat files to relational databases. These holdings are relevant to a wide range of scientific disciplines, despite often having been collected on behalf of quite narrow specialisms. They are spread across a wide range of archives, from those run by specialist professional data curators and archivists, such as the British Atmospheric Data Centre (BADC) and the British Oceanographic Data Centre (BODC), to files held on the hard disc of an individual scientist's PC.

The vision of the NERC DataGrid (NDG) is for the user to see these data resources as a single entity, making it easier for scientists to find data, and to provide a framework for integrating data manipulation and visualisation services that improve the usability of the data. As a by-product, it is hoped that it will then be easier for scientists to contribute to, and help maintain, managed data holdings.

Key requirements are that the NDG should:

  • allow discovery and access of data without needing a priori knowledge of storage characteristics, values or parameters;
  • be discipline specific, but provide functionality for users beyond that community;
  • allow discovery and access of relevant data by scientists from outside the discipline for which it was collected;
  • hide the heterogeneity of the data sources being queried, and combine the results into a single, consistent, result set;
  • allow the specification of pre-presentation processing, such as sub-querying, transformation, and consolidation, particularly where the data may be spread across several data sources;
  • deliver data to the desired place in the desired format, hiding the original format of the data without losing data values or their semantic content;
  • allow (limited) server-side processing of the data.

Because the NDG will be built on pre-existing data holdings, it needs to provide mechanisms to query metadata about the datasets and collate the results, along with the means to declare metadata models onto which a data holding can map its local schema, enabling cross-holding queries and data processing.
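The idea of mapping local schemas onto a declared common model, and then collating results across holdings, can be sketched as follows. This is a minimal illustration only: the two holdings, their field names, the mapping dictionaries and the `query` function are all hypothetical and do not represent the NDG schema or interfaces.

```python
# Hypothetical local records from two holdings that use different field names.
holding_a = [{"name": "Cruise 12 CTD", "param": "temperature"}]
holding_b = [
    {"dataset_title": "Radiosonde ascents", "variable": "temperature"},
    {"dataset_title": "Rain gauge network", "variable": "precipitation"},
]

# Each holding declares how its local schema maps onto the shared model
# (assumed here to have just "title" and "parameter").
mapping_a = {"name": "title", "param": "parameter"}
mapping_b = {"dataset_title": "title", "variable": "parameter"}


def to_common(record, mapping, source):
    """Translate one local record into the shared model."""
    common = {mapping[key]: value for key, value in record.items() if key in mapping}
    common["source"] = source
    return common


def query(parameter):
    """Query every holding and collate the hits into one consistent result set."""
    results = []
    holdings = (("A", holding_a, mapping_a), ("B", holding_b, mapping_b))
    for source, records, mapping in holdings:
        for record in records:
            common = to_common(record, mapping, source)
            if common.get("parameter") == parameter:
                results.append(common)
    return results


hits = query("temperature")
```

Every hit carries the same keys regardless of which holding it came from, which is the sense in which the heterogeneity of the sources is hidden from the user.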

It is intended to do this by providing a decoupled data and metadata infrastructure that will bring together developed versions of tools that either already exist or are under development within e-Science or the worldwide earth science community. Initially, atmospheric and oceanographic data held at the BADC and BODC will be made available, with data from other disciplines funded by NERC being added in due course.

2. Overview of NDG Metadata

Usually, metadata models have tried to cover discovery and use within a single structure. In trying to capture the entire metadata chain from discovery to use for the NDG, it was found that either this structure would be far too large to be easily managed or understood, or we would have to make a pragmatic decision regarding the perspective to be emphasised. Also, metadata values were found to have multiple semantics that may not sit together easily in the longer run, especially where each viewpoint required attributes to be maintained that were not of interest to the other viewpoint.

The above problems led to the development of a metadata taxonomy that identified metadata specialisations and related them. In brief, the key elements of the metadata include (but are not limited to):

A [Archive] format and usage metadata.

B [Browse] superset of discovery and contextual metadata.

C [Comment] annotations, documentation and other supporting material.

D [Discovery] metadata used primarily to locate datasets.

The key types are the “Type A” metadata, which is directly concerned with the use of the data; the “Type D” metadata, which is used directly by discovery services; and the “Type B” core metadata.

Type B is a superset of the Discovery metadata and is generally referred to as the NDG Metadata Model. It will be used to generate different “D Type” discovery formats from a single corpus of metadata, e.g. GCMD DIF, the FGDC Z39.50 “GEO” profile, or Dublin Core.
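Generating several discovery formats from one corpus can be pictured as a set of projections applied to the same record. The sketch below is an assumption-laden illustration: `BrowseRecord` and its fields are hypothetical stand-ins for a Type B record, and only a small, simplified subset of Dublin Core elements and DIF field names is used.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BrowseRecord:
    """Hypothetical, heavily simplified stand-in for a Type B record."""
    dataset_id: str
    title: str
    summary: str
    parameters: List[str] = field(default_factory=list)
    data_centre: str = ""


def to_dublin_core(rec: BrowseRecord) -> Dict[str, object]:
    """Project the record onto a few Dublin Core elements."""
    return {
        "identifier": rec.dataset_id,
        "title": rec.title,
        "description": rec.summary,
        "subject": list(rec.parameters),
        "publisher": rec.data_centre,
    }


def to_dif(rec: BrowseRecord) -> Dict[str, object]:
    """Project the same record onto a few DIF-style fields."""
    return {
        "Entry_ID": rec.dataset_id,
        "Entry_Title": rec.title,
        "Summary": rec.summary,
        "Parameters": list(rec.parameters),
        "Data_Center": rec.data_centre,
    }


# Illustrative record only; the identifier and text are invented.
record = BrowseRecord(
    dataset_id="example-0001",
    title="Example sea temperature section",
    summary="Illustrative record only.",
    parameters=["sea_water_temperature"],
    data_centre="BODC",
)
dc_view = to_dublin_core(record)
dif_view = to_dif(record)
```

Because both views are derived from the same record, a correction made in the Type B corpus propagates to every discovery format the next time it is generated.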

“Type A” is more directly concerned with the use of the data, and is the basis of work on the semantics contained inside the data itself and how this can help in the realisation of the semantic grid [1]. This is referred to as the NDG Data Model, emphasising its inclination towards the data itself.

This categorisation has brought benefits by giving a clear split between discovery and use. Many disciplines have widely used, almost standard, data formats encapsulating discipline semantics to one degree or another. Separation allows the discovery metadata model to be plugged into different data models so that the underlying data model is transparent to the user; the reverse is also true, as a single data model may be used in different disciplines. It also means that each model can limit the detail it keeps to what is necessary to perform its task. For example, the data model must keep track of the actual data values and sufficient information to deliver the data to the user, if necessary transforming it from the original format to another, whereas the metadata model needs only a summary of the data values, but must hold detail of how and why the data was gathered. Thus, some data values are kept in both the data and metadata models, but their intended usage and the level of detail required are very different.

Figure 2 - Examples of metadata elements needed by both discovery and data use

It is vital that the “A” and “B”, and hence the “D”, metadata be able to cross-reference each other. An identifier generated by the data model links the data and metadata models. Once the data of interest has been identified by searching the “D type” metadata, the IDs of the data granules are passed to data browsing software that interacts with the “Type A” metadata, allowing the user to identify and process the actual portion(s) of data of interest. This processing could include subsetting and aggregation of the data. In some cases it will produce new data granules that are registered in the NDG; in others the result will be a temporary data set that is discarded after use.
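The flow from discovery to use via shared identifiers can be sketched as below. This is a toy model under stated assumptions: the in-memory dictionaries, the granule IDs, and the `discover` and `subset` functions are hypothetical and are not NDG interfaces.

```python
import itertools

# Hypothetical "D type" index: search term -> granule IDs.
discovery_index = {
    "temperature": ["granule-1", "granule-2"],
    "salinity": ["granule-2"],
}

# Hypothetical "Type A" store: granule ID -> data values.
data_store = {
    "granule-1": [11.2, 11.4, 11.9],
    "granule-2": [12.0, 12.3, 12.1, 11.8],
}

_derived_ids = itertools.count(1)


def discover(term):
    """Search the discovery metadata and return matching granule IDs."""
    return list(discovery_index.get(term, []))


def subset(granule_id, start, stop, register=False):
    """Extract part of a granule, optionally registering it as a new granule.

    Returns (new_id, values); new_id is None for a temporary result.
    """
    values = data_store[granule_id][start:stop]
    if not register:
        return None, values
    new_id = f"derived-{next(_derived_ids)}"
    data_store[new_id] = values
    return new_id, values


ids = discover("temperature")
kept_id, kept_values = subset(ids[1], 1, 3, register=True)
temp_id, temp_values = subset(ids[0], 0, 2)
```

The only thing the discovery side hands to the data side is the granule ID, so either model can change internally without affecting the other.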

Also, a summary of the data contents is passed to the “B” metadata. This contains details that are of use to the discovery service, such as parameters represented in the data, but that are dealt with in more detail in the “A” data use metadata.
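One way to picture the summary passed from the data model to the “B” metadata is a reduction of each parameter to a few descriptive statistics. The sketch below is illustrative only; the granule layout and the choice of minimum, maximum and count as summary fields are assumptions, not the NDG design.

```python
def summarise(granule):
    """Reduce a granule (parameter name -> values) to a discovery summary."""
    return {
        name: {"min": min(values), "max": max(values), "count": len(values)}
        for name, values in granule.items()
    }


# Hypothetical granule holding two parameters.
granule = {
    "temperature": [11.2, 11.4, 11.9],
    "salinity": [35.1, 35.0, 35.2],
}
summary = summarise(granule)
```

The summary names the parameters and their ranges, which is enough for a discovery service, while the full values remain with the data model.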

Figure 3 - Relating the Metadata and Data Models

3. Future directions

To date, the NDG has produced a prototype in which the “B” metadata was used to generate a “D type” discovery format that then allowed the user to access data in raw or virtualised format. This has highlighted the need for more work in a number of areas. These include:

  • engagement with existing earth science disciplines’ taxonomies and controlled vocabularies to extend and clarify them;
  • extension to discovery terms to map into, and cross-index with, “foreign” discipline terminologies;
  • a need to provide semantic descriptions of data structures, such as a “marine section”, that can be recognised by characteristics like their dimensionality and the value range of some dimensions, and that can encapsulate representations in DFDL or similar;
  • convergence with emerging standards, such as those from the Open GIS Consortium (OGC) and ISO’s TC211 activity.

In the meantime, a series of systems providing real functionality to earth scientists is being produced or planned. These use the evolving NDG framework, coupled with a service-oriented implementation strategy, to allow new developments to be incorporated without excessive disruption or delay. In addition, the EcoGRID project has been started to bring the technologies and concepts of the NDG to the field of ecology.

[1] See abstract submitted to GGF10 by Andrew Woolf regarding “Data Modelling and Metadata in the Semantic Grid”