ABBIF Proposal: Architecture

(Draft, March 14, 2006)

Index

Introduction

DarwinCore

ABCD – Access to Biological Collection Data

Protocols for Data Exchange

DiGIR

BioCASe

TAPIR

GBIF Architecture

Network Infrastructure

Latin America

Brazil

Analysis

Peru

Questionnaires

Information System

Venezuela

Bolivia

Colombia

Questionnaire

Information System

French Guyana

Ecuador

Brazil

Collections

Information Systems

Strategic Plan

Strategy: Proposed Network

Elements of the Architecture

ABBIF coordination

Data Providers

Portal

Resource Registry & Discovery

Tools

Data archive

Proposal

Participants (to be confirmed)

Workshop Program

Annex 1: Answers from Collections of Colombia

Introduction

There are several possible ways to design an information system whose data is produced and shared by different parties. Broadly, a system can be centralized, distributed, or combined (mixed), with a number of variations.

A centralized system (figure 1) is recommended when data providers do not have the necessary infrastructure (hardware, software, connectivity) or expertise, or when data will only be produced for that particular system.

Figure 1. Diagram of a Centralized Information System

By adopting this architecture, data providers do not need to store any local data; they usually interact with an administrative interface to manage everything remotely. They also have to agree on a common format and content to be implemented in the central database. The great advantages are the low demand this places on the informatics capacity of data providers and the fact that developers have a very controlled system to work on. The challenge is to keep data providers actively validating and updating their data.

A distributed architecture is a system where the data is distributed but the query is centralized (figure 2), or where both data and query are distributed (figure 3).

Figure 2. Distributed data: centralized query / Figure 3. Distributed data and query

Advantages include “real-time” updating, clarity as to who the data provider is, and the possibility of closer interaction between data providers and users. Disadvantages include the greater demand on the infrastructure and expertise of each data provider and the complexity of developing and maintaining a distributed system.

The proposal is that ABBIF focus on species and specimen data. Data will include specimen records from biological collections, observation data from field surveys, and taxonomic names. A strategy for each data component must be established.

The choice of the best architecture depends on the existing infrastructure and expertise of each data provider and custodian. Besides that, biological collections hold their data using different software on different operating systems, in different formats, and recording different data elements (figure 4).

Figure 4. Diagram showing the complexity of integrating data from biological collections

In order to integrate these systems it is necessary that data providers agree to use a common data exchange model.

To determine the best architecture to be proposed for the ABBIF network, it is important to study:

  • What standards and protocols are available;
  • What standards and protocols the existing networks of direct interest to ABBIF are adopting; and
  • What the situation of local data providers and custodians is concerning infrastructure and expertise.

Standards

The adoption of standards and protocols for the exchange of data and information about biodiversity is fundamental for the development of interoperable systems. In general, one can define a standard as “something established by authority, custom, or general consent as a model or example”[1]. A communication protocol can be defined as a formal description of the rules and message formats that two systems must adopt in order to communicate and interact. Perhaps the most important and best-known protocols are TCP/IP (Transmission Control Protocol / Internet Protocol), SMTP (Simple Mail Transfer Protocol), POP (Post Office Protocol), and IMAP (Internet Message Access Protocol). This group represents the basis for all data transmission over the Internet. Standard languages such as HTML (Hyper Text Markup Language) and XML (eXtensible Markup Language) are also important, as they define rules for formatting the vast majority of documents on the Internet.

An important group that is discussing and developing standards and protocols for data on species and specimens is TDWG (International Working Group on Taxonomic Databases)[2].

TDWG’s mission is to:

  • Provide an international forum for biological data projects;
  • Develop and promote the use of standards; and
  • Facilitate data exchange.

A number of working groups have been established within TDWG to develop and promote the use of standards and protocols. Of immediate interest to ABBIF are: DarwinCore; ABCD – Access to Biological Collection Data; DiGIR; BioCASe; and TAPIR.

DarwinCore[3]

DarwinCore (DwC) is a standard that began to be developed within the scope of the Species Analyst network, based at the University of Kansas Natural History Museum and Biodiversity Research Center. The idea was to define data fields common to all taxonomic groups and in this way standardize the integration of primary data from biological collections. The standard uses XML (defined by an XML Schema) and is being used by most networks, such as GBIF[4], MaNIS (Mammal Networked Information System)[5], OBIS (Ocean Biogeographic Information System)[6], and speciesLink[7] in Brazil, among others.

It is based on a non-hierarchical (flat) set of data elements, which include: InstitutionCode, CollectionCode, CatalogNumber, ScientificName, BasisOfRecord, Kingdom, Phylum, Class, Order, Family, Genus, Species, Subspecies, ScientificNameAuthor, IdentifiedBy, YearIdentified, MonthIdentified, DayIdentified, TypeStatus, CollectorNumber, FieldNumber, Collector, YearCollected, MonthCollected, DayCollected, JulianDay, TimeOfDay, ContinentOcean, Country, StateProvince, County, Locality, Longitude, Latitude, CoordinatePrecision, BoundingBox, MinimumElevation, MaximumElevation, MinimumDepth, MaximumDepth, Sex, PreparationType, IndividualCount, PreviousCatalogNumber, RelatedCatalogNumber, RelatedCatalogItem, RelationshipType, Notes, and DateLastModified. The standard accepts extensions that have been proposed for geospatial, curatorial, paleontological, microbial, and observation data[8].
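To make the flat structure concrete, the following sketch (in Python, with invented example values) shows how a single specimen record might be expressed using a handful of the elements listed above; the element names come from the standard, everything else is illustrative.

```python
# A minimal sketch of one specimen record expressed as a flat set of
# DarwinCore elements. Element names follow the list above; the values
# are invented purely for illustration.
darwin_core_record = {
    "InstitutionCode": "XYZ",        # hypothetical institution acronym
    "CollectionCode": "HERB",        # hypothetical collection within it
    "CatalogNumber": "000123",
    "BasisOfRecord": "PreservedSpecimen",
    "ScientificName": "Cattleya labiata",
    "Kingdom": "Plantae",
    "Family": "Orchidaceae",
    "Genus": "Cattleya",
    "Species": "labiata",
    "Country": "Brazil",
    "StateProvince": "Pernambuco",
    "Locality": "Example locality text",
    "Latitude": -8.05,
    "Longitude": -34.95,
    "YearCollected": 1998,
    "MonthCollected": 11,
    "DayCollected": 3,
    "Collector": "A. Collector",
    "DateLastModified": "2006-03-14T00:00:00Z",
}

# Because every element sits at the same level, a provider only needs to
# map its local database columns onto these names to join a DwC network.
```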

ABCD – Access to Biological Collection Data[9]

ABCD is a highly structured standard for data about objects in biological collections. Its objective is the same as that of DarwinCore, but with much more detail: it has around 500 elements against roughly 50 in DarwinCore. There are specific elements for observational data sets and for the following types of collections:

  • Herbaria and Botanical Gardens
  • Zoological Collections
  • Culture Collections
  • Mycological Collections
  • Plant Genetic Resources
  • Paleontological Collections

This data model is being used by the Biological Collection Access Service for Europe, BioCASE[10]. Like DarwinCore, it uses XML (defined through an XML Schema). ABCD version 2.06[11] was recommended at the TDWG meeting in St. Petersburg as the adopted version of the standard and has since been ratified by TDWG members.
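The contrast with DarwinCore is mainly structural: where DwC is a flat list of elements, ABCD groups related elements into nested blocks such as the unit, its identifications, and the gathering event. The sketch below suggests that nesting in Python; the keys are simplified labels chosen for readability and are not the literal ABCD 2.06 element names or paths.

```python
# Rough illustration of how the same specimen could be organized in a
# hierarchical, ABCD-like structure. The keys below are simplified labels,
# NOT the exact ABCD 2.06 element names.
abcd_like_unit = {
    "SourceInstitution": "XYZ",
    "SourceCollection": "HERB",
    "UnitID": "000123",
    "Identifications": [
        {
            "ScientificName": "Cattleya labiata",
            "IdentifiedBy": "A. Taxonomist",
            "Date": "1999-05-20",
        }
    ],
    "Gathering": {
        "Agents": ["A. Collector"],
        "DateTime": "1998-11-03",
        "Locality": {
            "Country": "Brazil",
            "StateProvince": "Pernambuco",
            "Text": "Example locality text",
            "Coordinates": {"Latitude": -8.05, "Longitude": -34.95},
        },
    },
}
```

The nesting is what allows ABCD to carry several identifications, gathering agents, and measurements for one unit, at the cost of a more demanding mapping exercise for each provider.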

Protocols for Data Exchange

Networks that serve data from biological collections, besides using a standard data model (such as DarwinCore or ABCD), also require a protocol for transferring data.

DiGIR[12]

One of the first networks of biological collections to be developed as a distributed system was The Species Analyst (TSA), at the end of the 1990s. TSA used the ANSI/NISO Z39.50 protocol, which was first adopted in 1988 and was used to interconnect libraries. It defines a communication standard between computers for retrieving information. An important characteristic is that it supports a client-server environment, which allows the separation of the user interface from the data server. Z39.50 has also been implemented on a range of platforms. Whilst Z39.50 was an effective solution, there were some issues with the protocol that convinced Species Analyst network developers to study another solution. At the time, the protocol was found to have a complicated specification, which meant a very steep learning curve for developers. Conceptual schemas were not defined with a formal language such as XML Schema, and at the time there was limited support for XML and Unicode.

In order to address these issues, developers of the Species Analyst network and a number of people involved with TDWG[13] held a small workshop in Santa Barbara to start discussing a solution to replace Z39.50 for the biodiversity informatics community. The goal was to develop a protocol based entirely on the use of XML documents for messaging between clients and data providers, with a data transport mechanism predominantly based on HTTP. DiGIR was designed to offer the same capabilities as Z39.50 but using simpler technologies and a more formal specification for the description of information resources. The result is a distributed information retrieval solution that provides an easy entry point for participation in distributed information networks.
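In operational terms, a DiGIR exchange amounts to posting an XML request document to a provider's HTTP endpoint and parsing the XML document that comes back. The Python sketch below illustrates that round trip; the endpoint URL is a placeholder, and the request body is heavily abbreviated rather than a complete, schema-valid DiGIR message.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder endpoint: a real DiGIR provider exposes a similar HTTP URL.
PROVIDER_URL = "http://collections.example.org/digir/DiGIR.php"

# Heavily abbreviated request document. A real DiGIR <request> is namespace-
# qualified and carries a fuller header (source, destination, send time) plus
# a search section whose filter and record structure reference the DarwinCore
# schema; only the overall shape is suggested here.
request_xml = """<?xml version="1.0" encoding="UTF-8"?>
<request>
  <header><type>search</type></header>
  <search>
    <filter><equals><Genus>Cattleya</Genus></equals></filter>
  </search>
</request>"""

def search_provider(url: str = PROVIDER_URL) -> list[str]:
    """POST the XML request and return the raw XML of each returned record."""
    req = urllib.request.Request(
        url,
        data=request_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
    )
    with urllib.request.urlopen(req) as response:
        tree = ET.parse(response)
    # Response elements are also namespace-qualified in practice; matching on
    # the tag suffix keeps this sketch independent of the exact names.
    return [ET.tostring(rec, encoding="unicode")
            for rec in tree.iter() if rec.tag.endswith("record")]

# records = search_provider()  # would contact the (placeholder) provider
```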

DiGIR became operational in 2003 and was adopted by a number of networks such as The Mammal Networked Information System (MaNIS), the Ocean Biogeographic Information System (OBIS), the Global Biodiversity Information Facility (GBIF), and the speciesLink Network in Brazil.

BioCASe[14]

The Biological Collection Access Service for Europe (BioCASE), a network of biological collections, adopted ABCD as its conceptual schema and for this purpose modified the DiGIR protocol to meet its needs. This modified protocol is known as the BioCASe data transmission protocol, or simply BioCASe. The protocol is based on the DiGIR protocol but incorporates some BioCASe-specific changes that unfortunately make the two incompatible.

TAPIR[15]

In 2004 GBIF promoted a study to develop a new, merged protocol that would meet the needs of both the DiGIR and BioCASe networks (Döring & Giovanni, 2004). This protocol was named TAPIR (TDWG Access Protocol for Information Retrieval) and is to be tested in 2006. It is expected that both communities, BioCASe and the networks that have adopted DiGIR, will migrate to the new protocol. The new protocol is being tested by implementing it in two data provider software packages, one for each of the existing network communities: BioCASe (the BioCASe PyWrapper software) and DiGIR (a new Java provider package currently named DiGIR2). A detailed TAPIR specification document is also being developed.

GBIF Architecture

We have discussed possible architectures (centralized, distributed, and combined or mixed) and the standards and protocols that are being adopted internationally. Another important part of this analysis is to observe what GBIF, which is openly serving species and specimen data on the Internet, is using. GBIF plays a fundamental role as the global initiative integrating species and specimen data worldwide. Whatever architecture and strategy ABBIF adopts must be compatible with this initiative.

In 2003 GBIF established its “architecture fundamentals”, which are important and relevant when designing an information facility (see GBIF Biodiversity Data Architecture, 2003[16]). The basic principle was not to impose any specific software or technology, but to keep access to biodiversity data as the key goal.

The document presents the following basic principles:

  • Free access to data: any restrictions must be implemented at the data provider level; the system itself would not control user access to data;
  • Support for global users: the idea is to enable the implementation of different human languages in presentation services;
  • Consider human and machine users: the system would be implemented to be accessed both by web browsers and by web services;
  • Consider structured and unstructured data: the document acknowledges the importance of defining both the structure and the content of data (fundamental for interoperability and machine analysis), but also notes that it is important to make unstructured data available;
  • Reusable, replaceable, and redundant components: the idea is to develop a framework where new data providers can be rapidly added; to promote the maintenance of persistent data sources, as opposed to databases whose lifetimes are tied to a project; to plan for redundancy, replicating working components to different locations across the globe; and to adopt an open technology framework, where operating systems, database management systems, web servers, programming languages, and other tools are a choice to be made by each participant according to existing needs and skills.

GBIF has developed a network based on nodes (figure 5).

Figure 5. GBIF Network: major classes of nodes

GBIF is responsible for running the network, establishing standards, and developing tools. The portal is the hub for any service that must be centralized, such as the metadata registry, and for serving data from the Biodiversity Data Index to the end user. GBIF participant nodes are established to share biodiversity data. They may be gateways to data nodes or data nodes themselves. They may also provide services such as mapping, analysis, and hosting of orphaned datasets. Data nodes are the primary providers of data.

When GBIF was first designed, key elements of the Portal were the Biodiversity Data Index and the Taxonomic Name service (figure 6).

Figure 6. Diagram of the GBIF portal

The Biodiversity Data Index holds a subset of the data held by the data nodes, including specimen identifiers associated with identification, geospatial, and temporal information. Centralizing these subsets of data supports a much more rapid response to user queries and minimizes network traffic. Although taxonomic names provide the primary organizational structure for biodiversity data, no complete catalogue of names is available today; building one is an ever-evolving task which requires international collaboration. GBIF is also involved in a number of initiatives to create web services such as mapping, georeferencing, and data cleaning.
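One way to picture the Biodiversity Data Index described above is as a projection of each full provider record onto the identifier, identification, geospatial, and temporal elements that queries typically filter on. The sketch below illustrates this using DarwinCore element names; the particular choice of fields is illustrative, not GBIF's actual index definition.

```python
# Illustrative subset of DarwinCore elements an index might keep per record:
# enough to identify the specimen and to answer taxonomic, geographic, and
# temporal queries without going back to the provider.
INDEXED_ELEMENTS = (
    "InstitutionCode", "CollectionCode", "CatalogNumber",   # identifier
    "ScientificName", "Kingdom", "Family", "Genus",         # identification
    "Country", "StateProvince", "Latitude", "Longitude",    # geospatial
    "YearCollected", "MonthCollected", "DayCollected",      # temporal
)

def index_entry(full_record: dict) -> dict:
    """Project a full DarwinCore record onto the indexed subset."""
    return {name: full_record.get(name) for name in INDEXED_ELEMENTS}

# e.g. index_entry(darwin_core_record), using the record sketched earlier
```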

The portal has since become much more complex, and figure 7 presents a diagram of how the future portal is expected to operate.

Figure 7. GBIF’s data portal deployment model

The central column represents functions which should be executed centrally (marked as GBIF Secretariat). The components involved in the delivery of services to end users and portals are shown as replicated to a number of mirror sites. The Master Data Store needs to be implemented in a single location and should at least be associated with a "Master" instance of the Despatcher component, but the Crawler and Validation Chain components could also be mirrored for efficiency.

The existing GBIF UDDI registry would need significant enhancement before it could properly support the process illustrated here.

The Schema Repository should be developed in close conjunction with the TDWG Technical Architecture Group and can initially be represented by a small stub implementation that offers equivalent function to the rest of the DataPortal.

The Crawler corresponds largely to the Indexer component of the existing prototype DataPortal. It includes a scheduler which identifies data resources that should be indexed or checked for updates and develops an appropriate strategy in each case for accessing modified data. It should maintain a "map" monitoring the progress made in indexing any resource, so that the process can be interrupted and restarted, and also so that data providers can be notified of any records from their resource which could not be accessed for any reason. The data offered by the Service Registry will provide the basis for the Crawler's activity (including endpoint URLs, protocols and data standards supported, acceptable times and days for crawling each provider's data, any agreements made with providers as to how much data the DataPortal should cache in the Master Data Store, etc.). The Crawler should process the data retrieved by placing an object into the Validation Chain for each record found (new and modified records, as well as objects indicating the completion of an indexing operation for a given provider, to allow for clean-up of obsolete records, etc.).
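A bare-bones reading of this description is a loop that asks the Service Registry which resources are due for indexing, keeps a per-resource progress map so that a crawl can be interrupted and resumed, and pushes each harvested record into the Validation Chain. The Python sketch below follows that reading; every class, method, and attribute name is hypothetical.

```python
import datetime

class Crawler:
    """Hypothetical sketch of the Crawler/scheduler behaviour described above."""

    def __init__(self, service_registry, validation_chain):
        self.registry = service_registry   # endpoint URLs, protocols, crawl windows
        self.chain = validation_chain      # receives one object per record
        self.progress = {}                 # resource id -> last position harvested

    def due_resources(self, now=None):
        """Ask the registry which resources should be indexed or re-checked now."""
        now = now or datetime.datetime.utcnow()
        return [r for r in self.registry.list_resources()
                if r.allows_crawling_at(now) and r.needs_update()]

    def crawl(self, resource):
        """Harvest new and modified records, resuming from the saved position."""
        position = self.progress.get(resource.id, 0)
        for position, record in resource.fetch_changes(since=position):
            self.chain.submit(record)                # validation happens downstream
            self.progress[resource.id] = position    # survives interruption/restart
        # Signal completion so obsolete records for this provider can be cleaned up.
        self.chain.submit({"event": "crawl-finished", "resource": resource.id})
```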

The Validation Chain corresponds largely to the DataValidationServices described in the GBIF DataPortal Strategy, but also includes some functions from the Indexer component of the current prototype DataPortal. This is a configurable workflow component that allows a range of processing steps to be applied to each object placed into the chain. The exact steps will vary according to the nature of the record concerned. They will include the generation of a series of annotations to the object, based on routines to validate or interpret the data in the record. The aim is to reach the end of the Validation Chain with as detailed an understanding as possible of what the record represents, including an evaluation of whether there are ambiguities or problems with any of the data elements. By the end of the chain, all objects should be in a form that can readily be stored in the Master Data Store.
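Read this way, the Validation Chain is essentially a configurable pipeline: an ordered list of steps, each of which inspects an object and attaches annotations, so that whatever reaches the end of the chain is ready for the Master Data Store. A minimal sketch, with entirely hypothetical step names:

```python
class ValidationChain:
    """Hypothetical configurable pipeline of per-record processing steps."""

    def __init__(self, steps):
        self.steps = list(steps)           # order matters; configuration decides it

    def submit(self, obj):
        annotations = []
        for step in self.steps:
            annotations.extend(step(obj))  # each step returns zero or more notes
        obj["annotations"] = annotations   # ambiguities and problems travel with the record
        return obj                         # now in a form the Master Data Store can accept

# Example steps (illustrative only): each takes a record and returns annotations.
def check_coordinates(obj):
    lat, lon = obj.get("Latitude"), obj.get("Longitude")
    if lat is None or lon is None:
        return ["no coordinates supplied"]
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return ["coordinates out of range"]
    return []

def check_scientific_name(obj):
    return [] if obj.get("ScientificName") else ["scientific name missing"]

chain = ValidationChain([check_coordinates, check_scientific_name])
```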

The Despatcher is a new addition to the model, intended to ensure the greatest possible flexibility in how the DataPortal may operate. The key role of this component is to forward the objects from the Validation Chain into the Master Data Store. It will, however, also be the natural point at which to process information that should be included in a report to each data provider at the end of each visit to index their data. Upon further review and discussion with GBIF stakeholders (including data providers), a range of other notification services could be implemented at this point (e.g. forwarding objects or notifications to thematic and regional portals whenever records appear which are of interest to those portals, or managing notifications to users about the addition of data relating to their taxa of interest). Such extensions would be a future option, but the development of a generic Despatcher will make them easy to add.
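The Despatcher can then be pictured as the single component sitting between the end of the Validation Chain and the Master Data Store, with optional notification hooks hanging off it. Another hypothetical sketch, continuing the same assumptions:

```python
class Despatcher:
    """Hypothetical forwarder from the Validation Chain to the Master Data Store."""

    def __init__(self, master_store, listeners=()):
        self.store = master_store
        self.listeners = list(listeners)   # e.g. thematic/regional portals, user alerts
        self.provider_reports = {}         # provider id -> counts for end-of-visit report

    def despatch(self, obj):
        self.store.save(obj)                                     # the core responsibility
        report = self.provider_reports.setdefault(obj["provider"], {"stored": 0})
        report["stored"] += 1                                    # material for the provider report
        for listener in self.listeners:                          # optional future extensions
            listener.notify(obj)

    def close_visit(self, provider_id):
        """Return (and reset) the report sent to a provider after indexing."""
        return self.provider_reports.pop(provider_id, {"stored": 0})
```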

The Master Data Store (DataIndex) is implemented as a database used solely for managing the best possible overview of the data in the GBIF network and does not itself support requests from users or remote portals. All such requests will be made against Slave Data Stores maintained by MySQL replication.
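In practice this separation means that writes coming from the indexing pipeline go to the single master database, while every user-facing or portal-facing query is answered by one of the replicated slaves. A minimal sketch of that routing, using the MySQLdb driver and placeholder connection details:

```python
import MySQLdb  # any DB-API driver would do; connection details are placeholders

# Single master: written to only by the indexing pipeline (Despatcher).
master = MySQLdb.connect(host="master.gbif.example", user="indexer",
                         passwd="secret", db="data_index")

# One of possibly many replicated slaves: answers portal and web-service queries.
slave = MySQLdb.connect(host="slave1.gbif.example", user="portal",
                        passwd="secret", db="data_index")

def store_record(insert_sql, params):
    """Writes always go to the master; replication copies them to the slaves."""
    cur = master.cursor()
    cur.execute(insert_sql, params)
    master.commit()

def query_portal(select_sql, params=()):
    """Reads never touch the master, keeping indexing and user load separate."""
    cur = slave.cursor()
    cur.execute(select_sql, params)
    return cur.fetchall()
```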

The Access Portal is a layered application that uses Hibernate to access data from a Slave Data Store and includes a Service Layer implementing all the logic associated with the DataPortal's processing of data for display. Axis will be used to provide an XML access interface to the methods offered by the Service Layer. These methods will be those required to develop an HTMLUserPortal based on the GBIF data. Axis will allow these same methods to be exposed easily as SOAP web services for use by other portals. This interface will represent a "GBIF Native PortalInterface", which will not always map directly to TDWG standards (since frequently only a small number of data elements are needed, and these should be combined in ways that differ from the standards). Additional access interfaces (TAPIR, WFS, etc.) can also be implemented and exposed from the Service Layer. The DataPortal's own HTMLUserPortal and UserServices will be implemented by a JSP layer based on the XMLDataServices (the "GBIF Native PortalInterface").