MEDIN / GeoData Institute Proposals August 2008

MEDIN IFP 002: Review approaches for archiving data across a DAC network

1  Background

The issue of multi-domain records generated by a single survey campaign held by a single DAC has been exercising COWRIE in developing its stewardship plan. It also affects other data repositories as well as records currently held within DACs where project level data combines records of sediments, biota, water quality and bathymetry. One might logically see these datasets being held by separate DACs, but the association with a single survey campaign recommends keeping that data together. The key issue is one of interoperability and the ability, however this is achieved, to re-assemble records.

The added value of domain-specific data being handled by the domain-specific DAC is very strong: that DAC is able to QA the records, to generate effective user-related metadata, and to ensure that where archive data are post-processed this is undertaken within the context of that discipline, applying best practice.

The disadvantage of disaggregating these records is the risk that the project level information is not drawn back together if a specific search is made for records on a particular topic across multiple archives – to the detriment of the user.
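The re-assembly problem can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that every record carries a shared survey campaign identifier in its metadata; the `Record` fields, the DAC names and the `reassemble_by_campaign` helper are hypothetical and do not describe any existing DAC system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One domain-specific record held by a single DAC (all fields hypothetical)."""
    campaign_id: str   # identifier of the originating survey campaign
    domain: str        # e.g. "sediment", "biota", "bathymetry"
    dac: str           # archive that holds the record
    title: str


def reassemble_by_campaign(*holdings):
    """Group records from several DAC holdings by their shared campaign identifier."""
    campaigns = defaultdict(list)
    for holding in holdings:
        for record in holding:
            campaigns[record.campaign_id].append(record)
    return dict(campaigns)


# Illustrative holdings: one survey campaign split across two archives.
geology_dac = [Record("SURVEY-001", "sediment", "DAC-A", "Grab sample grain size")]
biology_dac = [
    Record("SURVEY-001", "biota", "DAC-B", "Benthic species counts"),
    Record("SURVEY-002", "biota", "DAC-B", "Fish trawl records"),
]

campaigns = reassemble_by_campaign(geology_dac, biology_dac)
domains = sorted(r.domain for r in campaigns["SURVEY-001"])
print(domains)
```

If the campaign identifier is missing or inconsistent between archives, the grouping fails silently, which is the risk to the user described above.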

The options for overcoming these constraints are what is being evaluated here, and it may well be that there is no one solution. The options range from enhanced metadata-driven approaches, through duplication of data resources and disaggregation of data, to maintaining a single archive store and single portal. The organisational and managerial issues surrounding these requirements are as challenging as any purely technical solution, and the two areas need to be evaluated together to develop options for MEDIN.

COWRIE has decided that, at least at present, the DAC network is not mature enough to handle a distributed stewardship role, both because there are gaps in the network and because there is no common model for the management and metadata of records. This is the very thing that MEDIN seeks to address, which is why this is such a timely and significant programme for the GeoData Institute and its relationship with the DACs.

GeoData maintains many of the records that IFP 002 refers to: the data resources specifically named within the Brief (COWRIE and mALSF) are both currently maintained by GeoData, and a number of others (REC, ECA etc.) are supported by us. However, this is not the full picture. For example, many other marine ALSF records and datasets are maintained elsewhere (e.g. archaeology), reflecting the way the funding was distributed by ALSF rather than the way that users would want to source the datasets. This introduces data policy and data management issues for future data collection and survey, which are being considered under the recently proposed ‘data clause’ within MEDIN. Such advice is likely to need updating with the outcomes of the cross-DAC review, so that the records needed to facilitate whichever option or options are chosen are collected at source.

There is clearly an urgent need to seek resolution of these issues if the complexity of Marine Spatial Planning is to be assisted by the existing data resources. Currently, the largely academic and research holdings within the DACs do not actively support the marine and coastal management activities, but with the Marine Bill proposals and MEDIN initiative this is likely to need to change, as not all records of relevance are maintained within the SeaZone base-mapping or thematic layers.

2  Approach

This project is close in scope to one that was submitted to the mALSF programme, ‘Data resource interoperability evaluation’, which was a joint evaluation between existing DACs, other data holders, GeoData and ABPmer. This evaluation (Hill et al 2005) was not taken forward as a programme within mALSF, which was reviewing its own position on data management. It focused on just one of the resources identified within the current brief (the marine ALSF GIS) and on evaluating how interoperability between DACs could provide for the longer-term stewardship of, and sustainable access to, these resources where source funding was of limited duration and the records were relevant to more than one DAC.

We intend to use some of the approaches and thinking behind this previous evaluation to inform this project. Assessing how to handle such multi-domain datasets necessarily also means considering interoperability, if interoperable approaches are seen as important in reducing duplication of records and in allowing growth of domain-specific records.

a) Identify and review current and historic datasets

This identification step will draw on the RouteMap to marine data recently created for COWRIE and the earlier evaluation of interoperability within mALSF projects. This stage will identify sources where this issue arises, both within and outside the DACs, and will involve meetings and discussions with programme managers.

Examples of this issue include BGS, which holds a number of programme records that contain sediment and biological sampling data, but where the vibrocore data were the main reason for the original decision to place the records with that particular DAC. Equally, a number of separate data management, data delivery and metadata programmes for the UK are being managed within other non-DAC organisations. This includes the many programmes that are run by GeoData itself, but is not limited to these. For example, the Maritime Data initiative is holding data which is widely used within the industry, but currently of no relevance to the remit of the existing DACs. CEFAS and JNCC data holdings are also relevant here. There are many similar resources where existing DACs have no delivery or data management role, where the data are essentially non-archival, but where the longer-term management of socio-economic or administrative/licensing data is relevant to the marine operational environment. The scope for evaluating the extent of records relevant to this review may be very wide, and we would focus on the main resources (in terms of usage and volume).

In the case of the COWRIE and mALSF programmes, interoperability is a founding principle upon which their specifications are built, and these conform to relevant ISO/TC211[1] and Open Geospatial Consortium (OGC)[2] specifications. It is therefore logical to conceive of these as contributing to a framework of information resources for the marine sector. The extent to which other marine data resources match these standards needs to be assessed in helping to develop options for dealing with these multi-disciplinary datasets.

The issue of trans-boundary datasets is also relevant: for example, Northern Ireland and Republic of Ireland records and EU Member States’ marine records are relevant when one is considering the regional seas.

The licensing and distribution policy issues may be a particular barrier to a number of the approaches, and the arrangements made between the survey programme funders and the DAC holding the records may also be a constraint to some of the options. These are concerns that may be very record-specific, but will need to be considered within the SWOT analysis if the options for managing such data are to be understood.

This review will consider the management approaches that have been used within these other resources, and where these do not exist will seek discussion with the groups and programme managers to establish whether a policy or approach is in formulation. Meetings will wherever possible be based on existing engagements with these groups to make this more effective in terms of programme resources.

b) Analyse the various technical and organisational solutions – including metadata solutions

Technical and organisational solutions depend largely on the adoption of standards and compliance with these standards to provide ‘a robust and future-proofed data portal providing a heterogeneous but complementary and integrated set of geo-resources’[3]. This evaluation needs to consider the interoperability options and the ability to execute high-resolution, spatially and temporally sensitive, metadata-driven queries across a network of peer resources, based on the relevant standards specified by the OGC. It also needs to consider the constraints to interoperability that would need to be addressed in establishing an effective network. Constraints on pan-organisation interoperability will include issues of technical infrastructure, semantic infrastructure, and security and authorisation.
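As a rough illustration of a spatially and temporally sensitive, metadata-driven query across peer resources, the Python sketch below applies one bounding-box and date-range filter to several catalogues. It is a sketch under stated assumptions: in a real network each catalogue would be a remote, standards-based service, whereas here each is an in-memory list, and the `MetadataEntry` fields, the DAC names and the `search_network` function are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MetadataEntry:
    """Minimal discovery metadata for one dataset (field names are hypothetical)."""
    title: str
    bbox: tuple        # (west, south, east, north) in decimal degrees
    start: date        # start of temporal extent
    end: date          # end of temporal extent


def bbox_intersects(a, b):
    """True when two (west, south, east, north) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search_network(catalogues, bbox, start, end):
    """Run the same spatial and temporal filter against every peer catalogue."""
    hits = []
    for dac_name, entries in catalogues.items():
        for entry in entries:
            overlaps_time = entry.start <= end and start <= entry.end
            if overlaps_time and bbox_intersects(entry.bbox, bbox):
                hits.append((dac_name, entry.title))
    return hits


# Illustrative catalogues standing in for remote DAC metadata services.
catalogues = {
    "DAC-A": [MetadataEntry("Vibrocore logs", (-2.0, 50.0, -1.0, 51.0),
                            date(2005, 5, 1), date(2005, 6, 30))],
    "DAC-B": [
        MetadataEntry("Benthic grabs", (-1.5, 50.5, -0.5, 51.5),
                      date(2005, 6, 1), date(2005, 7, 15)),
        MetadataEntry("Seabird counts", (2.0, 53.0, 3.0, 54.0),
                      date(2005, 6, 1), date(2005, 7, 15)),
    ],
}

results = search_network(catalogues, (-1.8, 50.2, -1.2, 50.8),
                         date(2005, 6, 1), date(2005, 6, 30))
print(results)
```

The sketch only works because every catalogue exposes the same fields with the same meaning; where peers differ in schema, vocabulary or access control, the technical, semantic and security constraints listed above come into play.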

The technical and organisational options for interoperability will be considered, both in terms of the technical standards and in terms of organisational and data policy standardisation. This will include both the metadata-driven approach and web services based solutions. The scope of these solutions depends on what the DACs / MEDIN see being delivered (metadata, visualisation and/or data). Again, the solutions may depend on partner DAC capacity and on the issue of legacy systems running current archives that may not be compliant with emerging standards.

A SWOT analysis of the approaches will be presented, although it is recognised that a one-size-fits-all model may not be appropriate, given the differing characteristics, scope and capacity of the varied DACs.

Interoperability, the ability to communicate, execute and return information, is central to the DAC network, and to the options to concurrently search multiple remote servers. However, the options for archiving data across the DAC network also relate to the ability or desire to disaggregate datasets (to separate biological data from physical data; to separate geophysical data from the archaeological interpretation; or to disaggregate species-level records from community-level habitat mapping). All of these issues are evident in the current DACs and need addressing directly with the DACs to develop appropriate options.

c) Produce a report and recommendations to MEDIN

A report of the findings will be made that summarises the approaches and provides recommendations.

3  Outputs

The output will be a report of the analysis and recommendations for MEDIN.

4  Staffing

The programme would be undertaken by Jason Sadler, Oles Kit and Chris Hill. Short CVs are provided below.

Jason Sadler is IT Manager at the GeoData Institute. He has been responsible for the development of standards-based, metadata-driven systems for UK and international data archives, including Dubai coast, the Channel Coastal Observatory and COWRIE. He manages a team of developers principally implementing open source architectures for management of marine and coastal metadata and data.

Oles Kit is a webGIS developer within the GeoData Institute. He has previously worked as an Information Technology engineer at VISICOM and on coastal data management within the Nansen Environmental and Remote Sensing Centre in Norway. He undertook a Masters in Water and Coastal Management at Universitetet i Bergen, 2006 to 2007. Oles is currently engaged in development of the Channel Coastal Observatory and other OGC-compliant webGIS / metadata solutions.

Chris Hill is manager of the GeoData Institute, a university-based research and consultancy group focused on environmental data management. Chris has been instrumental in developing the GeoData activities in the coastal and marine sectors, from working directly with environmental data resources to developing the approaches to data submission, discovery and delivery within the COWRIE and marine aggregate sectors.

5  Work plan

The proposed work plan is set out below. It is envisaged that the interviews and analysis will take place over a 3 week period within October.

Any proposed seminar would need a suitable date to be found, although this might be a shared activity between the four proposed MEDIN projects, bringing the groups together effectively in a longer workshop of c. 2 days. Looking at the other programmes, there is a need to review with sponsor and DAC organisations in a number of stages, and therefore this may be better centrally organised. Attendance at this would also be treated as a contribution in kind.

- Identify and review current and historic datasets: Oct 2008

- Analyse technical and organisational options: Oct 2008

- SWOT analysis: Nov 2008

- Reporting: Dec 2008

6  Costs

It is recognised that this programme is a significant contribution to the MEDIN initiative, in which GeoData has already played a part through attendance at meetings and some development of the MDIP Reference Document.

Staff costs (assuming 9 days JDS, 14 days OYK, 6 days CTH): £11,800

Travel and subsistence (visits to DAC offices / programme managers' offices): £500

Reporting (3 days JDS, 3 days OYK, 1 day CTH): £4,500

We would offer this at the stated sum of £12,500, providing a £4,300 investment in kind (approximately 25%).
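The arithmetic behind these figures can be set out as follows, using only the three cost lines and the stated sum above (all amounts in pounds sterling, exclusive of VAT):

```latex
\begin{align*}
\text{Full cost} &= 11800 + 500 + 4500 = 16800 \\
\text{In-kind contribution} &= 16800 - 12500 = 4300 \\
\text{In-kind share} &= \frac{4300}{16800} \approx 0.256 \quad (\text{about } 25\%)
\end{align*}
```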

Exclusive of VAT to be charged at the prevailing rates.

GeoData is completely self-funded without any Higher Education Funding Council support, or core DAC funding. As a contribution in kind we would reduce our rates for this work to acknowledge the desire to contribute to the programme. We would also seek to subsidise the meetings within the context of other programmes where liaison with the DACs is mutually supportive of the MEDIN programme.

7  GeoData Profile

GeoData Institute is an applied research agency within the University of Southampton. GeoData has been involved in data management and spatial data management for environmental applications for 25 years.