Title: Data Distribution Study
Synopsis: This document reports on an investigation into data distribution systems and the requirements of data creators for a system within Go-Geo!
Author: Julie Missen, UK Data Archive
Date: 05 August July 2004
Version: 1.c.a
Status: Final
Authorised: Dr David Medyckyj-Scott, EDINA
Contents
1.0 Background
2.0 Introduction
3.0 Conducting the Data Distribution Study
4.0 Requirements Study
4.1 Results
4.2 Summary
5.0 Literature Review
5.1 Self Archiving
5.1.1 Copyright/IPR
5.1.2 Preservation
5.1.3 Peer review
5.1.4 Cost
5.1.5 Software
5.1.6 Case Studies
5.2 Peer to Peer Archiving (P2P)
5.2.1 How Does P2P Work?
5.2.2 The Centralised Model of P2P File-Sharing
5.2.3 The Decentralised Model of P2P File-Sharing
5.2.4 Advantages and Limitations of using the Peer-to-Peer Network
5.2.5 Copyright/IPR
5.2.6 Preservation
5.2.7 Cost
5.2.8 Security
5.2.9 Peer-to-Peer Systems
5.2.10 Case Studies
5.3 Distribution from a Traditional (Centralised) Archive
5.3.1 Copyright/IPR
5.3.2 Preservation
5.3.3 Cost
5.3.4 Case Studies
5.4 Data Distribution Systems Summary
6.0 Conclusions
7.0 Recommendations for Data Distribution for the Go-Geo! Portal
Bibliography
Glossary
Appendix A – Data Distribution Survey
1.0 Background
The creation and submission of geo-spatial metadata records to the Go-Geo! portal is not the end of the data collection process. Datasets are at risk of becoming forgotten once a project ends and staff move on. Unless preserved for further use, data which have been collected with significant expense, expertise and effort may later exist in only a small number of reports which analyse only a fraction of the research potential of the data. Within a very short space of time, the data files can become lost or obsolete as the technology of the collecting institution changes. The metadata records in the Go-Geo! portal could rapidly become useless as they will describe datasets to which users have no access. There is a need to ensure that data are preserved against technological obsolescence and physical damage and that means are provided to supply them in an appropriate form to users.
Individuals, research centres and departments are generally not organised in such a way as to be able to administer distribution of data to people who approach them. While a researcher might be willing to burn data onto a CD as a one-off request, they would be less willing to do this on a regular basis. One of the hopes of the Go-Geo! project is that researchers and research centres will provide metadata to Go-Geo! about smaller, more specialist datasets they have created. One of the benefits to individuals and centres of doing this is that demonstrating continued usage of data after the original research is completed can influence funders to provide further research money. However, feedback has indicated that provision of metadata from these sorts of data providers might be limited because of concerns about how they would distribute and store their data.
For larger datasets hosted and made available through national data centres and other large service providers, the technology already exists through which the Go-Geo! portal could support online data mining, visualisation, exploitation and analysis of geospatial data. However, issues of licence to use data in this way and funding to establish the technical infrastructure need to be resolved before further developments can take place. More investigation is required into the best means by which individuals and smaller organisations such as research groups/centres can share data with others.
The combination of compulsion and reward proposed to encourage metadata creation could also be applied to data archiving. The Research Assessment Exercise could reward institutions for depositing high-quality geo-spatial data sets with a suitable archival body, and funding councils and research councils could follow the example of the NERC, the ESRC and the AHRB who make it a condition of funding that all geo-spatial datasets should be archived.
It seems clear that what is required is a cost effective and easy way for those holding geo-spatial data to share their data with others within UK tertiary education. A data sharing mechanism is seen as critical to the project. Without it, the amount of geo-spatial data available for re-use may be limited.
The need for long-term preservation, along with cataloguing the existence of data, was identified during the phase one feasibility study and a recommendation was made to the JISC to consider establishing a repository for geospatial datasets which fall outside the collecting scope of the UK Data Archive and Arts and Humanities Data Service. This may be something one or more existing data centres could take on or it could become an activity within the operation of the Go-Geo! portal.
Before this investigation was undertaken, three scenarios of potential data distribution systems for the Go-Geo! project were identified:
§ self-archiving service. It was envisaged that one or more self-archiving services could be established where data producers/holders could publish data for use by others. The service would need to provide mechanisms for users to submit data, metadata and accompanying documentation (PDF, Word files etc.). Metadata would also be published, possibly using OAI, and therefore harvested and stored in the Go-Geo! catalogue;
§ peer to peer (P2P) application. Data holders/custodians would set up a P2P server on institutional machines and store data in them (probably at an institutional or department level). Metadata would be published announcing the existence of servers and geospatial data. Metadata could also be published to the Go-Geo! catalogue. Users would use a P2P client or the Go-Geo! portal to search for data and then, having located a copy of the data, download it to their machine;
§ depositing copies of data with archive organisations, such as the UK Data Archive. The archive would maintain controls over the data on behalf of the owner and ensure the long-term safekeeping of the data. The archive would take over the administrative tasks associated with external users and their queries. Potential users of the data would typically find data through an online catalogue provided by the archive. Popular or large datasets may be available online; otherwise, online ordering systems are provided to order copies of datasets. If researchers and research centres did deposit their data with an archive, it would be important that the metadata records displayed by Go-Geo! recorded this fact and how to contact the archive.
2.0 Introduction
The aim of this study was to investigate a cost-effective way of distributing geo-spatial data held by individuals, research teams and departments. Three approaches were considered in this investigation, which were felt to reflect the resources available to the data creator/custodian for data distribution: P2P, self-archiving and traditional (centralised) archiving.
The first part of this report concentrates on a survey which looked at what data creators' and depositors' requirements are for a data distribution mechanism. Key researchers and faculty within the geographic information community in UK academia were contacted to assess their requirements and their constraints for data distribution. The survey was also posted on mailing lists and on both the portal and project web sites.
Technical options for both P2P and self-archiving were investigated through a literature review and by contacting experts in their respective fields. Existing software solutions were identified and evaluated. Of particular importance was determining how well existing software solutions could either meet, or be modified to meet, the particular requirements for geo-spatial data distribution. The two approaches were compared against traditional (centralised) data archiving services.
3.0 Conducting the Data Distribution Study
The data distribution study began later than anticipated as there was an overrun from another work package and a further delay due to communication difficulties with City University. Towards the end of the project this relationship was terminated and their allocated work was undertaken by staff at the UK Data Archive.
At the start of the study, a meeting was held at the UK Data Archive to initiate ideas on the subject of data distribution and to draft a list of stakeholders. The list was drawn up by UKDA and EDINA and included key researchers from the geographic information (GI) community.
A requirements survey was then developed and distributed to stakeholders to assess their needs regarding data distribution. The survey (see Appendix A) attempted to discover how organisations and individuals would like to see data distributed in the future.
The survey looked at:
§ access conditions;
§ technical issues;
§ copyright/IPR;
§ funding;
§ licences.
The questionnaire was posted on the Go-Geo! web site and the project web site, and was distributed at GISRUK and at two workshops undertaken by the Go-Geo! metadata project.
An investigation was then undertaken to compare self-archiving and peer-to-peer with traditional (centralised) archiving. The use of OAI (Open Archives Initiative) was also considered within this study.
The peer-to-peer study should have been undertaken by City University but, as this relationship was later terminated, their allocated work was undertaken by the UK Data Archive, with some consultation with an expert in the field. As this change in workload occurred at a very late stage of the project, it left less time than anticipated to complete the study. A slightly scaled-down version of the literature review was therefore decided upon.
Areas of investigation included:
§ IPR/copyright. Copyright is an intellectual property right (IPR), that is, a right in an output of human intellect. Copyright protects the labour, skill and judgement that someone has expended in the creation of original work. Usually copyright is retained by the author, who can sometimes be defined as including the individual, organisation or institution. If a piece of work is completed as part of employment, the employer will retain copyright in that work. If an author is commissioned to create a piece of work on behalf of someone else, then the author will retain copyright in that work;
§ preservation. The saving and storing of data (in digital and/or paper format) for future use, either in a short-term repository or through long-term preservation of material in a format which is transferable and will not become obsolete. The cost and effort of longer-term preservation may outweigh the benefits;
§ cost. This should be considered in terms of finances, expertise and resources;
§ software. This should include software for both storage and distribution;
§ data format/standards. The data formats supported by software;
§ service level definition. The definition of what a service will provide, for example, user support;
§ access control/security. Access and security control to the data, metadata and repository/storage facility;
§ user support/training. Support and training should be considered for data creators, depositors and users;
§ Open Archives Initiative. The OAI, based at Cornell University, provides the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which runs between web servers and clients to connect data providers to service providers. The data provider is somebody who archives information on their site, while the service provider uses the OAI protocol to harvest the metadata. The OAI provides web pages for data providers and service providers to register so that they can know of each other's existence, and thereby bring about interoperable access. The OAI protocol was originally designed for e-prints, although the OAI acknowledges that it needs to be extended to cover other forms of digital information. The OAI protocol requires that archives, at a minimum, use the Dublin Core metadata format, although parallel sets of metadata in other formats are not prohibited. The main commercial and public domain alternative to the OAI protocol is ANSI/NISO Z39.50, which is widely used by large archives. The contrast between the two is that Z39.50 supports greater functionality than OAI and is therefore more complex to implement in both the server and the client. OAI ensures access to a freely available repository of research information, but does not address long-term archiving issues.
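The harvesting relationship between a data provider and a service provider described above can be illustrated with a minimal sketch. It assumes OAI-PMH version 2.0 and its standard "ListRecords" verb and "oai_dc" (Dublin Core) metadata prefix. The repository address, the function names and the sample response are hypothetical and invented purely for illustration; they do not describe any Go-Geo! component or real archive.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical data-provider endpoint (an assumption, not a real repository).
BASE_URL = "https://repository.example.org/oai"

# XML namespaces used by OAI-PMH 2.0 responses and by Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL a service provider would request to harvest records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def extract_titles(response_xml):
    """Return the Dublin Core titles found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [element.text for element in root.iterfind(".//dc:title", NS)]


# Invented, heavily trimmed response used only to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example land use survey</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    print(extract_titles(SAMPLE_RESPONSE))
```

The sketch makes no network request: the service provider's side is reduced to building the request URL and reading the Dublin Core titles from a response. A real harvester would also need to handle resumption tokens, error responses and any richer geo-spatial metadata formats offered alongside Dublin Core.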
A report was produced, setting out user requirements, a comparison of technical options and recommendations including costs. The report was then peer reviewed by subject experts and by a sample of those individuals who were asked to provide input during the requirements analysis stage.
4.0 Requirements Study
A requirements study was undertaken to investigate how services and organisations currently store and distribute data and what they might consider using in the future. The investigation was survey-based and focused on access, licensing, funding and policies.
The requirements survey was carried out electronically due to time constraints for completing this work package. Ideally, if time had allowed, face-to-face interviews or focus group discussion sessions would have been conducted, as these could have produced a greater response rate.
The survey was conducted via the web, workshops, mailing lists and by emailing UK GIS lecturers and other key academics.
The questionnaire was sent to:
§ gogeogeoxwalk mailing list (http://www.jiscmail.ac.uk./lists/gogeogeoxwalk.html);
§ ESDS site reps and website (http://www.esds.ac.uk);
§ IBG Quantitative Methods Research Group (http://www.ncl.ac.uk/geps/research/geography/sarg/qmrg.htm);
§ Geographical Information Science Research Group (GIScRG) (http://www.giscience.info/);
§ GIS-UK (http://www.jiscmail.ac.uk/lists/GIS-UK.html).
The questionnaire was posted on both the project web pages and on the Go-Geo! portal. The first 100 respondents who fully completed the survey received a book token, and were entered into the evaluation prize draw to win an Amazon voucher.
A copy of the survey questionnaire can be found in Appendix A.
4.1 Results
Fewer responses were received than initially hoped (ten in total). This was thought to be due to the time of year (exam pressure, marking etc.) as well as survey fatigue amongst respondents, particularly as many of the contacts had also been invited to evaluate the Go-Geo! portal earlier in the year. The responses that were received, however, came from experts and were therefore regarded as being of high quality. Percentages have predominantly been used within the following section because some questions were not answered by all respondents and, in some cases, multiple answers were given, so the number of responses was not always ten.