Project Document Cover Sheet

Project Information
Project Acronym / WebTracks
Project Title / Infrastructure for Integration in Structural Sciences
Start Date / 1st Aug 2010 / End Date / 30th Nov 2011
Lead Institution / University of Southampton
Project Director / Simon Coles
Project Manager & contact details / Brian Matthews

Partner Institutions / STFC
Project Web URL /
Programme Name (and number) / Managing Research Data (Citing,
Linking and Integrating Research Data)
Programme Manager / Simon Hodson
Document Name
Document Title / Intercom: A protocol for link notification
Deliverable / 1.1
Author(s) / Shirley Crompton, Brian Matthews, John Casson, Arif Shaon, Mark Borkum
Date / 13/04/2012 / Filename
URL / if document is posted on project web site
Access / X Project and JISC internal /  General dissemination
Document History
Version / Date / Comments
0.1 / 29/09/2010 / Initial version – John Casson
0.2 / 20/07/2011 / Heavily revised – Shirley Crompton
0.3 / 12/08/2011 / Revision and expansion, after discussion between Crompton, Matthews and Shaon
0.4 / 15/08/2011 / Added introduction – Brian Matthews
0.5 / 06/09/2011 / Minor editorial changes, added Section 2.1.1 and updated examples in Section 3 – Shirley Crompton
0.6 / 14/02/2012 / Revisions, history and use cases
0.7 / 13/04/2012 / Final revisions, history and use cases

Table of Contents

1 Introduction

1.1 History and related work

1.2 Conventions in this document

2 Use cases

2.1 Use Case 1: Citation Tracking

2.2 Use Case 2: Data provenance tracking

2.3 Use Case 3: Linking via Annotation

3 Link Notification

3.1 General Principles

3.2 Architecture for the Link Notification Service

4 Terminology Applicable to this Document

5 Protocol Description

5.1 Technical Details

5.1.1 Intercom Namespace

5.1.2 InteRCom Ping URL

5.1.3 Metadata Format

5.1.4 GET Metadata Requests

5.1.5 POST Requests

5.1.6 Logging Requirements

5.1.7 REST Interface

5.2 Conformance Requirements

6 Example

6.1 Linking Resources in Managed Archives

6.2 Linking Resources in Managed Archives – Initiated by a Third Party

7 Security Considerations

Acknowledgements

References

InteRCom Specification 1.0 (Draft)

Editors:

John Casson <>

Shirley Crompton <>

Brian Matthews <>

Arif Shaon <>

Mark Borkum <>

Date: 13/04/2012

Abstract

InteRCom is a method for managed systems to establish semantically annotated links between digital artefacts published on the web. A typical use case would be that, in the course of scientific research, a researcher writes articles on results obtained from analysing primary data from experiments and refers to other prior work as well as creating derived data. The holding entities would need to be notified to provide a “link-back” corresponding to the citation.

Aggregating links between digital research resources provides an RDF graph of citation and provenance that captures the research process in context. These graphs can be traversed on the web and interrogated to support value added services like impact evaluation. Reverse linking is supported as link assertions are intended to be stored on both the Source and Target Resources.

1 Introduction

The Inter-Repository Communication protocol (InteRCom) is a general purpose application layer protocol for linking digital data resources of any type across the web. It provides an HTTP REST-based mechanism for managed resource archives or data management tools to create link requests and to exchange metadata on web-based representations of heterogeneous research objects. InteRCom is a peer-to-peer protocol with no requirement for centralised services.

1.1 History and related work

The origin of this work dates back to the CLADDIER project [Claddier 2007], which discussed the problem of linking citations between published resources. In that project, a use case of relating publications to associated raw data was developed, and the problem of tracing “forward” and “backward” citations, and of tracking these between a number of different participating repositories, was identified. The project produced a discussion of the problem [Matthews et. al. 2007] and considered a number of protocols available to provide notification of citations. Such protocols included harvesting (e.g. OAI-PMH) and pull protocols (e.g. SWORD or those based on RSS or ATOM).

A “push” protocol, where the notification of the citation is actively directed at the participating repository, was suggested as being suitable. CLADDIER then considered a number of “linkback”[1] protocols which have been proposed for this purpose, and proposed to use the well-known TrackBack [Trackback 2008] protocol as the basis for a notification protocol which uses the REST web service model [Jacobs 2001]. This is a simple and established protocol, based on HTTP, and thus a straightforward extension to existing practice. An initial prototype was produced within the STFC ePubs and BADC repositories [Matthews et. al. 2007b] [Matthews et. al. 2008].

The work was subsequently extended in the StoreLink project [Matthews et. al. 2009], which added whitelists to the protocol and provided an ePrints implementation in the National Crystallography Service. The StoreLink approach has advantages over harvesting methods: it is peer-to-peer, which increases the chance of identification of the source and target node; it supplies the context of the link (link semantics); and it is simple and does not rely on an aggregator service. There are also advantages over “pull” approaches (e.g. Atom), as a link is propagated directly and therefore there is no reliance on discovery by subscriber services.

Two observations on the protocol that arose in StoreLink were that:

a) the step of “discovery” of the location of the notification receiver could be separated from the transmission of the link, and

b) the protocol should be made ‘general purpose’ in order to propagate links in context between any digital object.

A similar approach is being taken by the Semantic Pingback project, which uses a Remote Procedure Call [SPB 2010]. While this project has already recognised the value of a general purpose notification protocol, it uses RPC, and thus requires a different communications protocol rather than building on widely used HTTP and REST based services.

Another approach is taken by the Salmon project [Salmon 2011]. This does use an HTTP protocol, but does not use general purpose RDF based ontologies as the basis for representing the information.

Thus we propose the InteRCom protocol as a two-stage inter-repository communication protocol. It is more flexible than StoreLink as it does not specify a fixed format for the metadata ontology and it allows the metadata properties to be defined per link. StoreLink, in contrast, specified a static list of fields to be sent. In InteRCom, a link is represented as an RDF triple. The source and target resources form the Subject and Object URIs of the assertion, and the link type is the Predicate (Figure 1). Using this approach, InteRCom can represent a wide range of links between different types of data resources.

Figure 1: An Example Link Assertion
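
As a minimal sketch of the link assertion described above, the following Python fragment models a link as a subject/predicate/object triple and serialises it as a single N-Triples statement. The class name, the example.org URIs and the choice of a CiTO predicate are illustrative assumptions, not part of the specification.

```python
from typing import NamedTuple


class LinkAssertion(NamedTuple):
    """A link between two resources, modelled as one RDF triple (illustrative)."""
    source: str      # Subject: URI of the Source Resource
    link_type: str   # Predicate: URI identifying the type of link
    target: str      # Object: URI of the Target Resource

    def to_ntriples(self) -> str:
        # Serialise the triple as a single N-Triples statement.
        return f"<{self.source}> <{self.link_type}> <{self.target}> ."


# Hypothetical URIs, for illustration only.
link = LinkAssertion(
    source="http://repo-a.example.org/paper/42",
    link_type="http://purl.org/spar/cito/cites",
    target="http://repo-b.example.org/dataset/7",
)
print(link.to_ntriples())
```

Because the predicate is just a URI, the same structure can carry citation, provenance or annotation links without changing the representation.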

1.2 Conventions in this document

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in [RFC2119].

2 Use cases

We give a number of use cases where the InteRCom protocol would be appropriate.

2.1 Use Case 1: Citation Tracking

In this scenario we wish to trace the citation graph between research papers.

Traditional publishing uses one-directional citation entered by the author: an author annotates the current publication with prior publications in order to reference work carried out and published previously by other authors. Thus credit for ideas and work can be properly attributed. However, this method is intrinsically one-directional. In a traditional publication system it is difficult to track citations forward so that readers can discover further work which builds upon the current publication.

Such forward citation tracking has become of greater importance due to the requirement for citation and impact metrics within research assessment. Not only the number of papers generated by research is required, but also an evaluation of their impact, which can be estimated by the number of citations of the original work. Traditional citation indexes are generated via aggregation services such as Web of Knowledge, which harvest citation information from a pool of recognised journals and generate a citation index. However, a linked web of repositories (held in institutions or by publishers) could perform the same function. Repositories could harvest citation information and record the cross-citation information between papers recorded within their databases. However, each repository would have access only to the papers entered by its own user community, so while the repositories taken as a whole would have access to the information, no individual repository would. A mechanism is required for repositories to propagate citation information in a targeted manner, so that the right paper in the right repository can be identified, given a URI of the paper identifying where it is being held.

In this case, an effective method of propagating citation information is a peer-to-peer link notification method as illustrated in Figure 2. When a paper is ingested into a repository:

  1. Its citations are identified.
  2. The location of each cited paper within a repository (held within an institution or by a publisher) is identified. This could be via the URL of the paper or via its DOI.
  3. The citation of the paper is transmitted to a citation notification service, which records the citation.
  4. The citation is recorded against the cited paper as ‘Cited By’ the citing paper.

Figure 2: Generating the Citation Graph

In this case, the CiTO ontology is used to form the appropriate links [Shotton & Peroni 2011].
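
The ingest steps above can be sketched as follows. This is an illustrative Python fragment, not the project's implementation: it assumes each cited paper is identified by an HTTP URL (DOI resolution is not handled), groups the resulting ‘Cited By’ assertions by the host that holds the cited paper, and uses the CiTO property isCitedBy as the link type. The function name and the example.org URIs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

CITO = "http://purl.org/spar/cito/"  # CiTO namespace


def plan_notifications(citing_uri, cited_uris):
    """Group 'isCitedBy' assertions by the host holding each cited paper."""
    by_host = defaultdict(list)
    for cited in cited_uris:
        host = urlsplit(cited).netloc  # step 2: locate the holder from the URL
        # steps 3 and 4: the assertion to be sent to, and recorded by, that holder
        by_host[host].append((cited, CITO + "isCitedBy", citing_uri))
    return dict(by_host)


# Hypothetical example: one ingested paper citing three others.
plan = plan_notifications(
    "http://repo-a.example.org/paper/1",
    [
        "http://repo-b.example.org/paper/9",
        "http://repo-c.example.org/paper/3",
        "http://repo-b.example.org/paper/10",
    ],
)
for host, assertions in plan.items():
    print(host, len(assertions))
```

Grouping by host reflects the targeted nature of the mechanism: each repository is told only about citations of the papers it holds.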

2.2 Use Case 2: Data provenance tracking

In this scenario, we wish to link a research object with another research object. For example, we wish to associate a paper with the dataset which was used to derive the result it reports. Further, we may wish to associate other information with the object, such as the raw data collected, the software packages used to undertake analysis, the research project which funded the research, and the people and organisations involved in the scientific process. Thus we would create a graph of provenance to trace the derivation of the research results, so that the quality of the research process can be made transparent in future assessment and earlier components of the process can be reused. Such provenance trails are supported by notations such as the Open Provenance Model [Moreau et. al. 2010] and the emerging W3C Provenance Data Model and Ontology [W3C 2012].

Typically citations will only reference publications. Data archives wish to track who has been using data resources and thus want to keep track of forward links (“cited-by” links) – they may be informed of a citation from a communication, or from a usage report for example. Once a data archive has recorded a paper as arising from a particular dataset, then the citation from the paper to the data set can be added, using the data citation form discussed above; this is not necessarily added by the author, but rather by the repository managers.

We assume that the publication P is held in a library’s publication repository A, that the data set D is held in a research department’s data repository B, and that the creation of the link is initiated within repository A. Thus:

- Repository A can add the link P uses data from D to its knowledge base.

- Repository A can notify B of the link P uses data from D.

- Repository B can add the link P uses data from D to its knowledge base.
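
The three steps above can be illustrated with a small in-memory sketch. The Repository class, its method names, the predicate URI and the example.org URIs are all hypothetical; a Python set stands in for each repository's knowledge base, and a direct method call stands in for the network notification.

```python
class Repository:
    """Hypothetical repository; a set stands in for its knowledge base."""

    def __init__(self, name):
        self.name = name
        self.links = set()

    def add_link(self, link):
        self.links.add(link)

    def receive_notification(self, link):
        # What a receiver does with a notification is its own decision;
        # this sketch simply records the link.
        self.add_link(link)

    def assert_link(self, link, holder_of_target):
        self.add_link(link)                          # A records the link
        holder_of_target.receive_notification(link)  # A notifies B, which records it


repo_a = Repository("A")
repo_b = Repository("B")
link = (
    "http://repo-a.example.org/publication/P",
    "http://example.org/vocab#usesDataFrom",  # hypothetical predicate
    "http://repo-b.example.org/dataset/D",
)
repo_a.assert_link(link, repo_b)
print(link in repo_a.links, link in repo_b.links)
```

After the exchange both repositories hold the same assertion, which is what allows the link to be followed from either end.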

This process can be taken further to relate other entities within the provenance graph. Thus the data set can propagate the relationship that it is derived from raw data R generated and held at Facility C, and has used a software package S held at software repository D. In this way, a provenance graph can be generated and propagated around the interested parties. The relationships are illustrated in Figure 3; note that in this diagram a blank node representing the activity of using the software package to generate the derived data is included.

Figure 3: Provenance Graph showing derivation of published data

2.3 Use Case 3: Linking via Annotation

Third parties which do not hold the entities which are being related can also create links between entities. For example, to continue the theme of annotating research artefacts with their provenance, an electronic laboratory notebook may add the annotation that a data set has been derived by an analysis process on a raw data set. In this case, the link has been recorded by the notebook, and both the repositories holding the raw and derived data would need to be notified that such a link exists in order to have a complete record of the provenance.

Assume that an electronic lab notebook is used to create the link that data set A is derived from data set B, where A is held in repository X and B in repository Y. In this case, the link is transmitted to both repositories, so that each can add it to its triple store. This is illustrated in Figure 4.

Figure 4: Notification of a link via a third party
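
A third party needs to work out which holders to notify for a given link. The sketch below assumes, purely for illustration, that each repository can be recognised by a base URI prefix; the function name, the holdings table and the example.org URIs are hypothetical, and the W3C PROV property wasDerivedFrom is used only as an example link type.

```python
def holders_to_notify(link, holdings):
    """Return the names of repositories holding the subject or object of a link."""
    subject, _predicate, obj = link
    targets = []
    for name, base_uri in holdings.items():
        if subject.startswith(base_uri) or obj.startswith(base_uri):
            targets.append(name)
    return targets


# Hypothetical holdings table: repository name -> base URI of its resources.
holdings = {
    "X": "http://repo-x.example.org/",
    "Y": "http://repo-y.example.org/",
    "Z": "http://repo-z.example.org/",
}
# Link created in the notebook: data set A (in X) is derived from data set B (in Y).
link = (
    "http://repo-x.example.org/dataset/A",
    "http://www.w3.org/ns/prov#wasDerivedFrom",
    "http://repo-y.example.org/dataset/B",
)
print(holders_to_notify(link, holdings))
```

Repository Z is not notified because it holds neither end of the link.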

3 Link Notification

In order to complete the desired link graph we need to populate the repositories with links. In particular, we need to inform repositories that their entries have been linked so that they can add the annotated link entries to their triple stores. We propose that this would be undertaken by a Link Notification Service.

3.1 General Principles

A number of principles are adopted in the design of the link notification service.

  1. The notification should be on a peer-to-peer basis.
  2. The service should be generic across different types of repositories and repository software.
  3. The service should be generic across different digital object types and metadata formats (i.e. RDF Vocabularies).
  4. The notification should exchange appropriate metadata on the link from link holders to other parties with an interest in recording the link.
  5. The notification system should not determine what the target repository does with the notification of the link.
  6. The mechanism should fail gracefully in the event that a target does not exist or does not recognise the notification.
  7. The mechanism should identify the sender of the notification and defend against bogus notifications of links.

Further, it was seen as desirable if existing off-the-shelf tools and mechanisms could be adapted to build on existing established practice and save on development effort.
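
Principles 5 to 7 above concern the receiving side. The following Python sketch shows one way a receiver might apply them, assuming a simple whitelist of trusted sender hosts (in the spirit of the whitelists added in StoreLink). The function name, status strings and parameters are hypothetical, and the policy shown is only one of many a target repository could choose.

```python
from urllib.parse import urlsplit


def handle_notification(link, sender_url, local_prefix, trusted_hosts, store):
    """Return a status string describing what was done with an incoming link."""
    _subject, _predicate, target = link
    # Principle 7: identify the sender and reject notifications from unknown hosts.
    if urlsplit(sender_url).hostname not in trusted_hosts:
        return "rejected: untrusted sender"
    # Principle 6: a target that is not held here is ignored rather than treated as an error.
    if not target.startswith(local_prefix):
        return "ignored: target not held here"
    # Principle 5: what happens next is the receiver's choice; here the link is stored.
    store.add(link)
    return "accepted"


store = set()
trusted = {"repo-a.example.org"}
link = (
    "http://repo-a.example.org/paper/1",
    "http://purl.org/spar/cito/cites",
    "http://repo-b.example.org/dataset/7",
)
print(handle_notification(link, "http://repo-a.example.org/agent",
                          "http://repo-b.example.org/", trusted, store))
```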

3.2 Architecture for the Link Notification Service

We base the link notification service on Linkback, a peer-to-peer push protocol. The Linkback model establishes a direct notification between repositories, as in Figure 5, and operates as follows:

  1. Link holding repositories identify the resources involved in a link and the likely holder repository of those resources to identify appropriate target repositories.
  2. Link holding repositories notify the target repository directly of the link.
  3. Target repositories accept the notification of the link.

This architecture is similar to a Linkback protocol, such as Trackback [Trackback 2008] or Pingback [Langridge & Hickson 2002]. Linkback protocols were developed largely within the blogging community to allow notification of cross-references between weblogs, so that authors can keep track of who is linking to, or referring to, their articles.

Figure 5: Linkback Model for Notification Service

This architecture needs the following components and functionality of those components.

  1. Publishers, which:
     a. Identify likely target holders of linked resources.
     b. Send link data in an appropriate format to target holders within the appropriate Linkback protocol.
  2. Subscribers, which:
     a. Receive link data in an appropriate format from source holders within the appropriate Linkback protocol.
     b. Digest link data appropriately.

Note that in this model there is no “registration of interests” with a broker; a source repository decides not “who” to notify, but merely “where” to notify – an appropriate end-point for the notification based on its URL.

Advantages

  1. No centralized broker service.
  2. No negotiation of registration with broker.
  3. No definition of interests or harvesting of catalogue required.
  4. If notification is not acknowledged, then there is no need to continue.

Disadvantages

  1. Identification of target repositories is dependent on the URL, which may be missing.
  2. Less flexibility in who can receive what (a repository can only get linkbacks for those resources it hosts).
  3. Linkback protocols are well-known for being vulnerable to “spamming” by bogus notifications. As a consequence of this, there may be additions to the protocol, such as registration of trusted repositories, or signatures. While we recommend such safeguards, we regard them as out of scope of this protocol.

A number of Linkback specifications exist[2], including Trackback, Pingback and Refback[3].

  • Refback uses the information sent when a user clicks on a link to register the back link to the HTTP Referer (i.e. the page on which the link was made), which can then be harvested. Refback is thus dependent on users clicking on a link, which is not guaranteed, and the back link could be made to any reference to the digital object, not necessarily citations.
  • Pingback uses an XML-RPC call rather than a plain HTTP request. This reduces spam, and potentially richer metadata can be sent across this protocol. However, the protocol is not widely supported.
  • Trackback is a simple “framework for peer-to-peer communication”. Essentially, TrackBack involves sending a “ping” as an HTTP POST request, saying “resource A has a link to (cites) resource B”. TrackBack is supported by blogging software such as Movable Type[4]. In its simplest form it transmits relatively simple metadata, but because it uses the POST mechanism it has a straightforward means of extending that metadata. Problems with spamming are well-known and mechanisms can be added to mitigate this problem.

Consequently, Trackback was chosen as the basis of the Link Notification Service.
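
To make the TrackBack “ping” described above concrete, the sketch below builds (but does not send) a form-encoded HTTP POST request using the Python standard library. The field names url, title, excerpt and blog_name are those of the basic TrackBack ping; the function name and the example.org URLs are hypothetical, and the InteRCom metadata format itself is defined later in this specification rather than here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_trackback_ping(ping_url, source_url, title=None, excerpt=None, blog_name=None):
    """Build a basic TrackBack-style ping as a form-encoded POST request."""
    fields = {"url": source_url}  # the URL of the linking resource
    optional = (("title", title), ("excerpt", excerpt), ("blog_name", blog_name))
    for key, value in optional:
        if value is not None:
            fields[key] = value
    body = urlencode(fields).encode("utf-8")
    return Request(
        ping_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded; charset=utf-8"},
        method="POST",
    )


# Hypothetical endpoint and source resource.
ping = build_trackback_ping(
    "http://repo-b.example.org/trackback/7",
    "http://repo-a.example.org/paper/42",
    title="Paper A",
)
print(ping.get_method(), ping.full_url)
print(ping.data.decode("utf-8"))
```

Because the payload is an ordinary set of POST fields, further metadata can be carried simply by adding fields, which is the extension mechanism referred to above.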

4 Terminology Applicable to this Document

We give some basic definitions as used in this protocol specification.

  • InteRCom-enabled Resource – A digital research object accessible on the web by an HTTP-based URI and which also supports the InteRCom GET and POST methods.
  • InteRCom User Agent – An entity that enacts the InteRCom protocol for a given link assertion.
  • Ping – An HTTP POST request sent from an InteRCom agent to a server for the purpose of establishing an explicit relationship between Web resources.
  • Receiving/Target Resource – A Web resource to which a Ping is directed for the purpose of establishing a link between it and a Source Resource.
  • Sender/Source Resource – A Web resource containing a link to the Target Resource.
  • Security Guard – A generic entity that handles authentication and/or authorisation for the Receiving Resource.
  • TrackBack Ping URL – The HTTP URI to which TrackBack Ping requests are posted.
  • URI – An HTTP-based Uniform Resource Identifier that can be de-referenced to a digital representation of a Resource.
  • URL – An HTTP-based Uniform Resource Locator that points to a digital representation of a Resource.

5 Protocol Description

5.1 Technical Details

The InteRCom mechanism uses REST GET and POST requests to exchange metadata and establish a link between web-based resources. For simplicity, the protocol is designed to be “fire and forget” for the invoking application: should it fail at any point in the interactions, it fails silently without interrupting the processing of the invoking application. It is strongly recommended that the InteRCom User Agent log an error message to facilitate error resolution (see Section 5.1.6).
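
The fail-silently behaviour described above might be sketched as follows. This is an assumption-laden illustration rather than the specified behaviour: the function name, the logger name and the injectable send parameter are hypothetical, and the example uses a stub that always fails so that no network access is needed.

```python
import logging
from urllib.request import Request, urlopen

logger = logging.getLogger("intercom.agent")  # hypothetical logger name


def fire_and_forget(request, send=urlopen, timeout=5):
    """Send a ping; on any failure, log the error and return False instead of raising."""
    try:
        with send(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except Exception as exc:  # deliberately broad: the caller must never be interrupted
        logger.error("InteRCom ping to %s failed: %s", request.full_url, exc)
        return False


def unreachable(request, timeout):
    # Stub transport standing in for a target that cannot be contacted.
    raise OSError("connection refused")


ok = fire_and_forget(
    Request("http://repo-b.example.org/intercom/ping", data=b"link=..."),
    send=unreachable,
)
print(ok)
```

The invoking application can ignore the return value entirely; the logged message is what supports later error resolution.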