NEW TOOLS FOR CREATING AND SEARCHING METADATA DESCRIBING URBAN SCIENTIFIC DATA SETS

C. ISABELLA TINDALLa, Roger V. Moorea, John D. Bosleya, Ruth D. Swetnamb, Rod Bowiec, Anne De Rudderd

aThe Centre for Ecology and Hydrology, Crowmarsh Gifford, Wallingford, Oxfordshire, OX10 8BB. UK, bThe Centre for Ecology and Hydrology, Monks Wood, Abbots Ripton, Huntingdon, PE28 2LS. UK, cThe British Geological Survey, Kingsley Dunham Centre, Keyworth, Nottingham, NG12 5GG. UK, dThe British Atmospheric Data Centre, Rutherford Appleton Laboratory, Chilton, Didcot, OX11 0QX. UK.

ABSTRACT

The Urban Regeneration and the Environment research programme (URGENT) is concerned with the problems of regenerating urban conurbations in the UK. Its aim is to better understand the pathways of pollutants between the soil, the air and water and the effects of those pollutants on urban ecosystems. The outputs of the programme are models, data and information. To manage the data collected and to ensure that they reach as wide an audience as possible, four Data Centres have been established, one for each of the air, ecology, soil and water components of the Programme.

A major practical problem for the Data Centres is finding the resources to catalogue holdings, a problem that is compounded by the trend towards multidisciplinary science. This trend is increasing the range of data types held by Data Centres and, hence, the breadth of knowledge that the data scientist must possess to undertake the cataloguing task. Many scientists now need datasets from disciplines outside their own. The problem in searching for these datasets is that the searcher is not always familiar with the terminology of the dataset he or she is seeking.

This paper reports on the tools developed by the URGENT Data Centres to collect the metadata for the data catalogue, build a thesaurus and search the dataset descriptions. The paper will conclude with the current plans for automating the generation of large proportions of metadata, which so far have had to be assembled by hand.

INTRODUCTION

Background

The Urban Regeneration and the Environment research programme (URGENT) is concerned with the problems of restoring and regenerating urban areas in the UK. Its aim is to better understand the pathways of pollutants between the soil, air and water and the effects of those pollutants on urban ecosystems. This £9.7M Programme is funded by the UK Natural Environment Research Council (NERC) and, with the active help of city authorities, industry and government, has supported 41 projects.

The major outputs of these projects are models, data and information. NERC has long recognised the value of the data that its programmes collect and generate. It therefore requires each programme to set up Data Centres to manage the data collected by the programme and to be proactive in ensuring that they reach as wide an audience as possible. Accordingly, URGENT commissioned a Scoping Study, identified the need, allocated funds and set up four Data Centres, one for each of the air, ecology, soil and water components of the Programme. Their main tasks are to acquire datasets required from third parties along with those generated by the Programme, assemble them into a database and make them available to the scientific community. The Data Centres are also responsible for standards, Quality Assurance, Data Management Plans and the dissemination, exploitation and long-term security of the data (Swetnam et al.). This paper is about how URGENT has met its obligations to inform the potential users of its results and disseminate the new knowledge and information that is now available, particularly to the scientific community.

As URGENT progressed and a clearer picture emerged as to who the end users were and what they needed, a specific dissemination task was established. Three groups of users were identified: planners, scientists and the public (Figure 1). The needs of the planners and the public were to be met by the URGENT Environmental Information Systems project (Alker et al.) and the work of the Urban Wildlife Partnership respectively. For the scientific community, a new URGENT Data Dissemination project was established and assigned to the Data Centres.

Figure 1. Dissemination routes for URGENT science

The objective of this project was to propose and implement the means by which present and future researchers could find and obtain the data generated by the URGENT programme. A desirable outcome was that the solution should be capable of being reused by future NERC Research Programmes, in particular, the Lowland Catchment Research (LOCAR) and Catchment Hydrology And Sustainable Management (CHASM) projects.

Overview of the approach

Maintaining up-to-date catalogues of their holdings and enabling outside users to find datasets relevant to their projects is a permanent challenge to any Data Centre. There are several aspects to this problem, but two key elements are resources and terminology. The trend towards integrated management and integrated science requires Data Centres to hold an ever-widening range of data types. Cataloguing data has always been important, but is even more so now with the greater volumes and range of data. However, when resources are scarce, cataloguing is one of the first tasks to be abandoned. A consequence of multidisciplinary science is that many scientists need datasets from disciplines outside their own. The problem in searching for these datasets is that the searcher is not always familiar with the terminology of the dataset he or she is seeking.

The ideal solution to the problem would be to automate the entire process by deriving the catalogue from the data. However, this is not yet feasible because few current or legacy datasets contain all the required information within themselves. It is still necessary to collect or compute the summary information about a dataset by hand. Often the Data Centre does not have this information and must obtain it from the supplier, who understandably regards the task as a distraction from work of higher priority. Minimising the demand on the supplier and giving something in return are therefore important. The solution adopted has been to create a Metadata Capture Tool and make it available, at no charge, to all suppliers, in this case the URGENT scientists. The tool simplifies the task of describing their datasets and produces an export file conforming to the current metadata standards. This file can be emailed to the Data Centre, which adds it to the central database. The information will then be made available in the form of a set of searchable web pages, linked to the main URGENT web site. The suppliers are free to retain the application and use it to catalogue and publish their own dataset descriptions from URGENT or any other project.

To aid the process of finding datasets, a Web-based Search Tool has been developed. It provides help at three levels. Users who are familiar with the URGENT Programme are provided with a fast direct route to data descriptions by science project. For general users outside the Programme or less at ease with searching, an assisted search is offered. Finally, to meet the needs of scientists searching for data outside their discipline, a sophisticated thesaurus-based search tool has been added. Building a thesaurus is a non-trivial task, requiring expert knowledge from many people who are inevitably geographically dispersed. To facilitate the work of compiling and assembling the Thesaurus, a further tool has been developed, the Term Entry Tool, in which keywords are recorded and defined and the relationships between terms are established.

The remainder of the paper will first expand on what metadata and metadata standards are and then describe the tools developed to collect the metadata, build the thesaurus and search the dataset descriptions. The paper will conclude with the current plans for automating the generation of large proportions of metadata, which currently have to be assembled by hand.

Metadata and metadata standards

What are metadata?

‘Metadata’ is the term currently used for the information found in a data catalogue. Metadata comprise summary information describing 'data sets' and, as a minimum, usually comprise a title, a brief description, keywords, geographic extent, contact details and availability. Figure 2 gives a more extensive list of examples.

Figure 2. The NGDF Metadata Model as adapted for URGENT
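
As a rough illustration of the minimum discovery-level record described above, the following Python sketch holds the items listed in the text (title, brief description, keywords, geographic extent, contact details and availability). The class name, field names, bounding-box convention and sample values are assumptions made for illustration; they are not taken from the NGDF model or from the URGENT tools.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DiscoveryRecord:
    """Hypothetical discovery-level metadata record (all names are assumptions)."""
    title: str
    description: str
    keywords: List[str]
    # Bounding box as (west, south, east, north) in decimal degrees (assumed convention).
    geographic_extent: Tuple[float, float, float, float]
    contact: str
    availability: str

    def missing_items(self) -> List[str]:
        """Return the names of the minimum items that have been left empty."""
        missing = []
        for name in ("title", "description", "contact", "availability"):
            if not getattr(self, name).strip():
                missing.append(name)
        if not self.keywords:
            missing.append("keywords")
        return missing


# Invented example values, used only to exercise the check.
record = DiscoveryRecord(
    title="Example river flow series",
    description="",
    keywords=["river flow"],
    geographic_extent=(-1.2, 51.5, -1.0, 51.7),
    contact="data.centre@example.org",
    availability="On request",
)
print(record.missing_items())
```

A completeness check of this kind is one way a capture tool could prompt a supplier before a record is exported.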

Metadata standards

The exponential increase in the volume and variety of data has led to a host of data cataloguing projects. Inevitably, the first attempts were uncoordinated and the response was a number of national and international initiatives to develop metadata standards, most originating from the geographic information (GIS) community. Two British examples are the United Kingdom Environmental Data Index (UKEDI) and the National Geospatial Data Framework (NGDF). The American equivalent is the Federal Geographic Data Committee's (FGDC) standard, upon which the International Organization for Standardization (ISO) standard (ISO 19115) is based. There are, of course, many others but they are too numerous to mention here.

To check that they were appropriate and to decide which standard to follow (IGGI 2002), URGENT tabulated and matched the items of metadata information for each of the five standards above. Most allowed for information to be recorded at two levels, a summary or ‘discovery’ level and a detailed level. At the summary level, there is a high degree of commonality between the standards. At the detailed level, the FGDC and ISO standards are far more comprehensive.

Although the more detailed information was relevant to the URGENT requirement, it was unavailable, and resource constraints dictated that URGENT would only be able to compile metadata at the summary level. At this level, although all the information is appropriate, there is a problem: all the standards are heavily biased towards recording the spatial aspects of a 2-D dataset, whereas much environmental data comprises long time series observed at either fixed or mobile sampling points located in a 3-D world. Here, time, height or depth and the values of the attribute itself are just as important as where they were measured. Examples are river flow data and the results of quality monitoring in the air, in lakes or down a borehole.

On this basis, none of the standards were suited to the need. The ISO standard, however, allows users to define their own additional items of metadata and could therefore have been extended while staying within the rules. Unfortunately, at the time decisions had to be made, the ISO standard was still in draft form, whereas its UK counterpart, NGDF, was operational and had already been adopted by key players in the URGENT programme. URGENT therefore adopted and adapted NGDF. Figure 2 shows the NGDF metadata model after it was altered to accommodate datasets varying in time and 3-D space.
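
To make the idea of an extent that varies in time and 3-D space more concrete, the sketch below adds a time range and an optional vertical range to a 2-D bounding box. It is an assumption-laden illustration only: the class `Extent4D`, its field names and the convention that negative values denote depth are hypothetical and do not reproduce the adapted NGDF model shown in Figure 2.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Extent4D:
    """Hypothetical extent covering 2-D space, height or depth, and time."""
    west: float
    south: float
    east: float
    north: float
    start: date
    end: date
    # Vertical range in metres relative to ground level; negative values are depths
    # (an assumed convention, not one defined in the paper).
    z_min: Optional[float] = None
    z_max: Optional[float] = None

    def covers(self, when: date, z: Optional[float] = None) -> bool:
        """True if the date, and the height or depth when given, lie inside the extent."""
        if not (self.start <= when <= self.end):
            return False
        if z is None or self.z_min is None or self.z_max is None:
            return True
        return self.z_min <= z <= self.z_max


# Invented borehole monitoring example: a single point sampled down to 50 m.
borehole = Extent4D(-1.1, 51.6, -1.1, 51.6,
                    date(1998, 1, 1), date(2001, 12, 31),
                    z_min=-50.0, z_max=0.0)
print(borehole.covers(date(2000, 6, 1), z=-20.0))
print(borehole.covers(date(2000, 6, 1), z=10.0))
```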

Tool Development

Having adopted and adapted a standard, the next task was to persuade the dataset suppliers to supply metadata that conformed. For them this was a low priority task, so the problem was how to invert the situation and create one in which they benefited as well as the Data Centre. The solution adopted was to provide them with a Metadata Capture Tool which could afterwards be used for their own purposes.

The Metadata Capture Tool

The design requirements for the Metadata Capture Tool, which were largely identified in a design study by Laxton et al., were that it should be able to:

  • capture and edit all the information specified in the modified NGDF standard
  • be used by anyone familiar with MS Office
  • be given away at no charge
  • be free standing
  • allow the user to create his/her own metadata database
  • export the metadata database to the URGENT Data Centre
  • import an exported database, resolving or reporting any conflicts that arise
  • create reports
  • create a web page for each dataset
  • be reused in future programmes

To minimise the cost of development, ensure it could be given away without charge and allow the user to create their own local database for their own purposes, the system was developed as an MS-Access application. As can be seen from Figure 2, the data can be grouped, each group containing a set of related items describing a particular aspect of the dataset. This provided a natural basis for the design of the user interface illustrated in Figure 3, where each tab provides data entry and editing for a particular group.

Figure 3. The Metadata Capture Tool

The functions of most of the facilities shown are self evident, except, perhaps, the export, import and template options. The role of the import and export functions is to allow individual suppliers to send their metadata to the Data Centre and for the Data Centre to build the URGENT metadata database. The import facility checks the incoming data against the existing database and can handle both new data and updates.
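
The import behaviour described above (accepting new records, applying updates and reporting conflicts) can be sketched as follows. The actual tool is an MS-Access application, so this Python version is only an analogy. The dictionary layout, the `revision` field used to decide whether an incoming record is newer, and the dataset identifiers are all assumptions made for the example.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def import_records(database: Dict[str, Record],
                   incoming: Dict[str, Record]) -> Tuple[List[str], List[str], List[str]]:
    """Merge incoming records into database; return (added, updated, conflicts)."""
    added: List[str] = []
    updated: List[str] = []
    conflicts: List[str] = []
    for dataset_id, new in incoming.items():
        old = database.get(dataset_id)
        if old is None:
            # Unknown identifier: treat as a new dataset description.
            database[dataset_id] = new
            added.append(dataset_id)
        elif new == old:
            # Identical record: nothing to do.
            continue
        elif new["revision"] > old["revision"]:
            # Newer revision (assumed rule): accept as an update.
            database[dataset_id] = new
            updated.append(dataset_id)
        else:
            # Different content without a newer revision: report, do not overwrite.
            conflicts.append(dataset_id)
    return added, updated, conflicts


# Invented identifiers and titles, used only to exercise the three cases.
central = {
    "W01": {"title": "River flow", "revision": 1},
    "E07": {"title": "Habitat survey", "revision": 2},
}
export_file = {
    "W01": {"title": "River flow, daily means", "revision": 2},
    "E07": {"title": "Habitat map", "revision": 2},
    "A03": {"title": "Air quality", "revision": 1},
}
print(import_records(central, export_file))
```

Reporting a conflict rather than silently overwriting leaves the decision with the Data Centre, which matches the design requirement of "resolving or reporting any conflicts that arise".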

The templates facility gives the Data Centre the ability to customise the appearance of the web pages. In URGENT, this has been used to format and colour code the pages according to whether the data originate from the air, ecology, soils or water component of the programme. However, it could be used to make more significant changes if the system were to be used for a future programme.
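
A template mechanism of this kind might look like the following sketch, which uses Python's standard `string.Template` to generate one colour-coded page per dataset. This is not the tool's own template format. Only the green banner for ecology is mentioned in the paper; the other colours, the page layout and the function name `render_page` are placeholders.

```python
from html import escape
from string import Template

# Only the green ecology banner is mentioned in the text; the other colours are placeholders.
THEME_COLOURS = {
    "air": "skyblue",
    "ecology": "green",
    "soils": "sienna",
    "water": "navy",
}

# Hypothetical page layout with a coloured banner followed by the dataset summary.
PAGE = Template(
    "<html><body>"
    "<div style=\"background:$colour\">$theme</div>"
    "<h1>$title</h1><p>$description</p>"
    "</body></html>"
)


def render_page(theme: str, title: str, description: str) -> str:
    """Fill the template for one dataset, escaping text destined for HTML."""
    return PAGE.substitute(
        colour=THEME_COLOURS[theme],
        theme=escape(theme.capitalize()),
        title=escape(title),
        description=escape(description),
    )


print(render_page("ecology", "Habitats & plants", "Invented example description."))
```

Keeping the colours and layout outside the page-generation logic is what would let a future programme change the appearance without altering the tool itself.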

The Metadata Search Tool

Having devised the means of collecting metadata, the next problem was to make them accessible to the scientific community. The solution chosen was to build a web-based search tool, accessible from the URGENT web site, providing four levels of searching based on:

  1. single keywords
  2. project name
  3. a guided tour
  4. multiple keywords found via a thesaurus

Figure 4 shows the Search Tool’s home page and the route to the searching options. Options 1 and 2 are primarily for scientists who have been involved in the URGENT Programme and are familiar with the Programme’s structure, data and terminology. The first option provides a simple keyword search but depends on the scientist being familiar with the keywords used. Selecting the ‘Project Title’ icon on the metadata home page causes a list of all the projects within the Programme to be displayed. On selecting a project, the searcher is presented with a list of datasets collected or generated by that project. If one of these datasets is selected, the metadata for that dataset are displayed in the format shown in Figure 5.

Figure 4. The home page of the URGENT metadata search tool

Figure 5. An example web page displaying part of the metadata for a water project.

A problem for many users when searching is knowing where to begin. The ‘Not sure where to start?’ icon takes the user to a diagrammatic display showing the four URGENT themes of air, ecology, soils and water. Choosing ‘Ecology’, for example, provides the user with a list of the topics studied within ecology that collected or generated datasets, in this case, animals, habitats, insects, land use and plants (Figure 6). Selecting, for example, habitats then produces a more refined list of topic areas and selection from that list produces a list of relevant datasets. As with the ‘Project Title’ option, once a dataset has been selected, the metadata for that dataset are displayed, in this case with a green banner down the left-hand side to indicate an ecological dataset. All the URGENT datasets are attached to the end of at least one chain of buttons and some of them to several.

Figure 6. The ‘Not sure where to start?’ Option.
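
The chain of buttons in the guided tour behaves like a walk down a tree whose leaves are lists of datasets. The sketch below models that with nested dictionaries. The ecology topics are those named in the text; the refined topic and the dataset names are invented placeholders, and the function `options` is hypothetical.

```python
from typing import Any, Dict, List

# Theme -> topic -> refined topic -> list of datasets.
# The five ecology topics come from the text; everything below them is a placeholder.
TOUR: Dict[str, Any] = {
    "Ecology": {
        "animals": {},
        "habitats": {
            "placeholder refined topic": ["placeholder dataset A", "placeholder dataset B"],
        },
        "insects": {},
        "land use": {},
        "plants": {},
    },
}


def options(tree: Dict[str, Any], path: List[str]) -> List[str]:
    """Return the choices offered after following the given chain of selections."""
    node: Any = tree
    for choice in path:
        node = node[choice]
    # A dictionary offers further buttons; a list is the final set of datasets.
    return sorted(node) if isinstance(node, dict) else list(node)


print(options(TOUR, ["Ecology"]))
print(options(TOUR, ["Ecology", "habitats", "placeholder refined topic"]))
```

Because a dataset name can appear in more than one leaf list, a structure like this also allows a dataset to sit at the end of several chains, as the text describes.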

For scientists unfamiliar with the terminology used for URGENT data, the fourth option, the ‘Detailed Search’ will, with the help of a thesaurus, enable them to start with everyday words, find the equivalent terms used in URGENT and hence reach the dataset descriptions they require. It also allows them to build much more detailed search criteria than the other options. Figure 7 shows the interface. The page is divided in two parts. The upper part allows the user to ‘walk’ round the Thesaurus and the lower part builds the query.

Figure 7. The interface to the ‘Detailed Search’ option

To walk round the Thesaurus and find terms the search tool will recognise, the user enters a word in the ‘Search Terms’ box. The tool will then respond by listing all the synonymous, broader, narrower and associated terms that it knows. Choosing a term makes it the current term which may either be used as the start of a new search in the Thesaurus or to build the query. Terms added to the query can be linked by the operators ‘and include’, ‘or include’ and ‘do not include’. When the search criteria are complete, choosing ‘Search on this Query’ executes the query and takes the user to the results page showing relevant datasets.
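
One plausible reading of the three query operators is as set operations over the datasets indexed under each term, applied from left to right. The sketch below follows that reading. The evaluation order, the keyword index and the dataset names are assumptions; the paper does not describe how the Search Tool executes a query internally.

```python
from typing import Dict, List, Set, Tuple

# Invented keyword index: term -> datasets described with that term.
INDEX: Dict[str, Set[str]] = {
    "stage": {"dataset 1", "dataset 2"},
    "river flow": {"dataset 2", "dataset 3"},
    "borehole": {"dataset 3"},
}


def run_query(index: Dict[str, Set[str]],
              first_term: str,
              clauses: List[Tuple[str, str]]) -> Set[str]:
    """Evaluate a query left to right (assumed semantics) and return matching datasets."""
    result = set(index.get(first_term, set()))
    for operator, term in clauses:
        matches = index.get(term, set())
        if operator == "and include":
            result &= matches   # keep only datasets that also match this term
        elif operator == "or include":
            result |= matches   # add datasets that match this term
        elif operator == "do not include":
            result -= matches   # drop datasets that match this term
        else:
            raise ValueError(f"unknown operator: {operator!r}")
    return result


query = [("or include", "river flow"), ("do not include", "borehole")]
print(sorted(run_query(INDEX, "stage", query)))
```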

The Thesaurus and Term Entry tool

The idea of the thesaurus developed from the need to provide something better than a simple keyword search. URGENT science is multidisciplinary and covers a very wide range of topics within the air, ecology, soils and water themes of the Programme. In this context, few scientists are likely to be familiar with the keywords used outside their own discipline – who, for example, other than a hydrologist, uses ‘stage’ to mean ‘water level’? So, in order to facilitate searching, it was decided to build a thesaurus that would allow users to start with familiar words but lead them to the preferred terms associated with datasets.
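
The step from a familiar word to a preferred term can be illustrated with a small thesaurus structure in which each entry records a definition, its relationships and, for non-preferred terms, the term to use instead. The ‘water level’ and ‘stage’ pair comes from the text. The class names, field names, definition wording and broader term are hypothetical and do not describe the Term Entry Tool's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """Hypothetical thesaurus entry (field names are assumptions)."""
    name: str
    definition: str = ""
    use_instead: str = ""  # preferred term, if this entry is a non-preferred synonym
    broader: List[str] = field(default_factory=list)
    narrower: List[str] = field(default_factory=list)
    related: List[str] = field(default_factory=list)


class Thesaurus:
    def __init__(self) -> None:
        self.terms: Dict[str, Term] = {}

    def add(self, term: Term) -> None:
        # Store under a lower-case key so that look-ups ignore case.
        self.terms[term.name.lower()] = term

    def preferred(self, word: str) -> Optional[str]:
        """Follow 'use instead' links to a preferred term; None if the word is unknown."""
        seen = set()
        key = word.lower()
        while key in self.terms and key not in seen:
            seen.add(key)  # guards against circular 'use instead' links
            target = self.terms[key].use_instead
            if not target:
                return self.terms[key].name
            key = target.lower()
        return None


thesaurus = Thesaurus()
thesaurus.add(Term("stage", definition="Water level (illustrative wording)",
                   broader=["placeholder broader term"]))
thesaurus.add(Term("water level", use_instead="stage"))
print(thesaurus.preferred("Water Level"))
```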