Research Data Collections WG April 5, 14:00-15:30 Room MR8

https://www.rd-alliance.org/wg-research-data-collections-rda-9th-plenary-meeting

Introduction

The Research Data Collections WG started from the need to create shared tools and services for building and managing collections of research data objects, regardless of their nature, user community or discipline, and application scenario. It is a mature RDA WG: it has already been recognized and endorsed, and it is presently at M12, so draft deliverables are now available. The main deliverable is an open API specification and implementation; the API must support Create, Read, Update, Delete and List (CRUD/L) operations. So far, the WG has established definitions, defined an abstract Data Model for data collections, and completed the draft API specification.

The API specification

How is the collections API designed? The data model is composed of three sub-models: Service, Collection and Member. A collection can have any number of members, including other collections; a collection can therefore also be a member of other collections.
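The three sub-models and the nesting of collections can be illustrated with a minimal Python sketch. The class fields, the `as_member` helper, and the `hdl:example/...` and `collection:...` identifiers are assumptions made for illustration only; they are not taken from the WG's Data Model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Member:
    """An item in a collection; `location` is a hypothetical PID or URL."""
    id: str
    location: str


@dataclass
class Collection:
    """A collection and its members, kept in insertion order."""
    id: str
    members: List[Member] = field(default_factory=list)

    def add(self, member: Member) -> None:
        self.members.append(member)

    def as_member(self) -> Member:
        """Wrap this collection so it can be added to another collection."""
        return Member(id=self.id, location=f"collection:{self.id}")


@dataclass
class Service:
    """Entry point that holds the collections it serves."""
    collections: Dict[str, Collection] = field(default_factory=dict)

    def register(self, collection: Collection) -> None:
        self.collections[collection.id] = collection


service = Service()
samples = Collection("samples")
samples.add(Member("rock-1", "hdl:example/rock-1"))

project = Collection("project")
project.add(samples.as_member())  # a collection nested as a member

service.register(samples)
service.register(project)
print([m.id for m in project.members])
```

Here the `project` collection has a single member, which is itself the `samples` collection.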

After contacting the service, the operations possible on a Collection are LIST, CREATE, READ, UPDATE, and DELETE. The same operations apply to a Collection Member. The Data Model (printed and distributed to all participants in the meeting, and shown in the slides) includes the properties and the operations of a Collection and a Member.
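As a rough illustration of the five operations applied to both Collections and Members, the following Python sketch uses an in-memory store. The class name `CollectionStore`, its method names, and the property keys are hypothetical; the actual API specification defines its own interface, which is not reproduced here.

```python
from typing import Dict, List


class CollectionStore:
    """In-memory stand-in for a collections service (hypothetical)."""

    def __init__(self) -> None:
        self._collections: Dict[str, dict] = {}
        self._members: Dict[str, Dict[str, dict]] = {}

    # --- Collection operations ---
    def list_collections(self) -> List[str]:
        return sorted(self._collections)

    def create_collection(self, cid: str, properties: dict) -> None:
        if cid in self._collections:
            raise KeyError(f"collection {cid!r} already exists")
        self._collections[cid] = dict(properties)
        self._members[cid] = {}

    def read_collection(self, cid: str) -> dict:
        return dict(self._collections[cid])

    def update_collection(self, cid: str, properties: dict) -> None:
        self._collections[cid].update(properties)

    def delete_collection(self, cid: str) -> None:
        del self._collections[cid]
        del self._members[cid]

    # --- Member operations (same five verbs, scoped to one collection) ---
    def list_members(self, cid: str) -> List[str]:
        return sorted(self._members[cid])

    def create_member(self, cid: str, mid: str, properties: dict) -> None:
        if mid in self._members[cid]:
            raise KeyError(f"member {mid!r} already exists in {cid!r}")
        self._members[cid][mid] = dict(properties)

    def read_member(self, cid: str, mid: str) -> dict:
        return dict(self._members[cid][mid])

    def update_member(self, cid: str, mid: str, properties: dict) -> None:
        self._members[cid][mid].update(properties)

    def delete_member(self, cid: str, mid: str) -> None:
        del self._members[cid][mid]


store = CollectionStore()
store.create_collection("c1", {"description": "demo collection"})
store.create_member("c1", "m1", {"location": "hdl:example/m1"})
store.update_member("c1", "m1", {"location": "hdl:example/m1-v2"})
print(store.list_collections(), store.list_members("c1"))
print(store.read_member("c1", "m1"))
store.delete_member("c1", "m1")
print(store.list_members("c1"))
```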

Suggestions from the audience

- It might be interesting to think about how to combine these recommendations with the outputs of the Dynamic Data Citation Working Group.

- A discussion starts on how to extend the specification to cover versioning and version control: perhaps a new “version” property would be useful.

- Implementation possibilities for versioning include:

1. snapshots

2. via the PID at the record level

3. as an ordered collection (where the model specifies that this collection refers to the version history of a single object)

4. via rule based generation (a dynamic collection which requests the latest version of each member object)

5. via delete/create

- We may need a property which indicates who added an item to a collection (does this go in the member mapping metadata?)

- Need a description field on member items?

- What support is there for dealing with multiple agents operating on a collection at one time?

- A discussion starts about where/how the information about the license of the object should be stored at the API level, i.e. at the level of Collection or at the level of Member.

- Have we aligned with the PROV ontology? This could be a reason to combine with the PID Kernel efforts.

- A question is asked: why did this not happen 10-15 years ago rather than now, given that it is applicable to anything? One of the answers given: before the “digital revolution” there was less need to make collections accessible externally.
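Three of the versioning options raised in the suggestions above (snapshots, an ordered collection holding a version history, and rule-based generation of a dynamic collection) can be contrasted in a short Python sketch. The identifiers and function names are made-up placeholders, and this is only one possible reading of those options, not a design agreed by the WG.

```python
from typing import Dict, List, Tuple

# Ordered collection option: each list is the version history of a
# single object, oldest first. The identifiers are placeholders.
history: Dict[str, List[str]] = {
    "dataset-a": ["pid:a/v1", "pid:a/v2", "pid:a/v3"],
    "dataset-b": ["pid:b/v1"],
}


def latest(object_id: str) -> str:
    """Return the most recent version recorded for one object."""
    return history[object_id][-1]


def resolve_dynamic(object_ids: List[str]) -> List[str]:
    """Rule-based option: membership is computed when requested,
    picking the latest version of each member object."""
    return [latest(object_id) for object_id in object_ids]


def snapshot(object_ids: List[str]) -> Tuple[str, ...]:
    """Snapshot option: freeze the currently resolved membership."""
    return tuple(resolve_dynamic(object_ids))


wanted = ["dataset-a", "dataset-b"]
frozen = snapshot(wanted)

# A new version of dataset-b is published after the snapshot was taken.
history["dataset-b"].append("pid:b/v2")

print(frozen)                   # still refers to pid:b/v1
print(resolve_dynamic(wanted))  # now refers to pid:b/v2
```

The snapshot keeps pointing at the versions that existed when it was taken, while the dynamic collection follows new versions as they appear.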

Individual API implementations (use cases)

There are some API implementations already in place, each of which is separately presented:

- REPTOR: a data repository which turns a standard web server into a data repository; an example instance is available at http://dft-rda.esc.rzg.mpg.de/reptor;

- Tufts (Python + LDP), developed at Tufts University for the Perseids project, which is fully open source (based also on MongoDB); a demo is available at http://collections.perseids.org;

- GEOFON: a seismological data archive that must serve a large number of user requests every day (6 million requests in a year) in order to share data. Among the purposes: define collections (datasets) which point to files in their archive (via PIDs or URLs); and create large datasets providing all the data related to a project, or different (overlapping) subsets of it.

- DKRZ / Climate data management, which includes some climate data processing services and also Copernicus Data Services.

- CAU Kiel/IGSN: they have about 10k geological samples, which come from different events and are stored in different boxes; some rocks go to one lab and some to others, and new data are later derived from these rock samples. Hence, the way collections are created is of particular importance in this case, as it defines the way the API should be used.

Next steps

- finalize the Data Model and the API specification, revisited in light of the feedback received during/after this WG meeting and the Plenary meeting of April 6 where the API will be presented;

- examine implications for existing collections, which means convincing people to implement this model;

- write the final Specification Document, which includes best practices for workflows and for coordination with the RDA DTR and PID Types APIs, as well as best practices for using URL-unfriendly URIs;

- look for potential adopters of the API.