APPLICATION FOR RENEWAL OF CODATA TASK GROUP FOR PRESENTATION TO THE 28th CODATA GENERAL ASSEMBLY,

Taipei, 1-2 November 2012

Task Groups should endeavour to support CODATA’s strategic objectives and activities where appropriate, as articulated in the CODATA Strategic Plan, 2006-12.

The Strategic Plan is currently being updated for the 2013-18 period.

Task Groups may also wish to explore synergies with ICSU strategic priorities

1. Name of Task Group

CODATA-ICSTI Task Group on Data Citation Standards and Practices

2. Objective(s) of Task Group

The following were the objectives presented in the first Task Group proposal in 2010:

The need for robust data citation capabilities

The growth of electronic publishing of literature has created new challenges, such as the need for mechanisms for citing online references in ways that can assure discoverability and retrieval for many years into the future. The growth in online datasets presents related, yet more complex challenges. Data citation standards and good practices can also form the basis for increased incentives, recognition, and rewards for scientific data activities that in many cases are currently lacking in all fields of research. The rapidly-expanding universe of online digital data holds the promise of allowing peer-examination and review of conclusions or analysis based on experimental or observational data, the integration of data into new forms of scholarly publishing, as well as the ability for subsequent users to make new and unforeseen uses and analyses of the same data – either in isolation, or in combination with other datasets.

This promise, however, depends upon the ability to reliably identify, locate, access, interpret and verify the version, integrity, and provenance of digital datasets. The problem of citing online data is complicated by the lack of established practices for referring to portions or subsets of data. As funding sources for scientific research have begun to require data management plans as part of their selection and approval processes, it is important that the necessary standards, incentives, and conventions to support data citation, preservation, and accessibility be put into place. There are, in fact, a number of initiatives in different organizations, countries, and disciplines already underway. An important set of technical and policy approaches has already been launched by the Internet Engineering Task Force (IETF), the U.S. National Information Standards Organization (NISO), and other standards bodies regarding persistent identifiers and online linking, including the Open Archives Initiative Object Reuse and Exchange (OAI-ORE) and InfoURI. Another important group is DataCite. The World Data System is also focusing on these issues. Other initiatives remain ad hoc and uncoordinated.

The proposed CODATA Task Group, being organized jointly by several CODATA committees and the International Council for Scientific and Technical Information (ICSTI), together with representatives from several other organizations, would examine a number of key issues related to data identification, attribution, citation, and linking, help coordinate activities in this area internationally, and promote common practices and standards in the scientific community.

Issue Areas in the Development and Implementation of Scientific Data Citation

There are many issues that would need to be addressed in establishing data citation standards and good practices. Below is a description of some of the topics that the Task Group would address.

Technical

1. Interoperability and Facilitation of Re-use. There is already considerable diversity in database formats, such as various flat-file, hierarchical, relational, object-oriented, and XML-based databases. There is every reason to expect that new modalities and formats for storing and manipulating digital data will continue to emerge.

2. Citation Formats. What data citation conventions have been developed already? How are they similar and how do they differ? Can they be standardized and if so, how? It should be noted, however, that citation formats are not major considerations compared to the difficulty of determining the unit of data or the identity of that which is to be linked (Cole 2008).

3. Metadata. How do metadata conventions or standards affect attribution and citation of data?

4. Database Versioning. Datasets are more dynamic than documents, and this creates additional challenges for citation practice. When should the dataset as a whole be cited? How can a specific, time-fixed version be cited? What changes to the data constitute a new contribution or added value? How should this be acknowledged? How are database versions controlled and labelled? How to cite and give credit to data compiled from a network of integrated databases?

A crucial dimension in this regard is provenance and how it is related to the need for attribution and citation. What attributions are needed given the complex provenance that is common for many types of data? How does one cite data that has been through many stages of transformation, some of them adding significant value and some trivial? How to enable citation and acknowledge data sources without hampering interoperable data systems?

Scientific

The creators and users of online scientific datasets may have diverse needs that should be considered in the development, management, and use of scientific data in different disciplinary or research contexts. They also may have different needs regarding persistent identifier standards and models. For example, different disciplines may have disparate needs for the granularity at which digital “objects” are identified. Some need geospatial metadata while others do not. What are the differences among disciplines that need to be addressed distinctly?

Institutional Roles

Successfully developing and implementing data citation practices and standards requires participation by all major groups within the research community. What are the roles in this regard of the respective stakeholders in the system: the data managers, researcher umbrella groups, universities, libraries, publishers, and research funders? What are the implications for these stakeholders? Does this vary by major field of science or type of research?

Intellectual Property Rights and Licensing

Any registry system must accommodate traditional intellectual property rights (IPRs), such as those established through copyright, as well as emerging mechanisms of “some rights reserved”, such as Creative Commons and Science Commons licensing.

Various important issues arise from data ownership, control, and IPRs. These are key drivers behind the different attitudes and practices toward data attribution and citation in different fields and countries, but relatively little work has focused on sorting out these issues. Principles and practices that have been tried in different contexts need to be identified, and approaches that are more appropriate in the digital age should be explored. A recent OECD study (OECD 2008) addressed publicly funded data in this way, but both public and private data need to be considered. This is important because the willingness of individuals and institutions to accept and use different attribution, citation, and reuse frameworks will depend to a large extent on the real and perceived ownership, control, and IPRs associated with various databases.

Socio-cultural and Community Norms

A major reason for promoting the adoption of standard data citation practices is to develop a common basis and community of practice for recognizing and rewarding data work and incentivizing disclosure of data in interoperable and quality controlled ways. What are the factors that need to be considered in this area? Of particular interest is how such data management activities might impact the personal performance evaluations of scientists and the reward and promotion structures in science. Another potential area of inquiry would be how citations of databases could be used as Science Indicators.

Attribution is not quite the same as citation, although citation is one of the ways of giving attribution. Licences akin to Creative Commons (CC) may require attribution, but this can result in “attribution stacking”, where the work of hundreds or even thousands may need to be acknowledged. The route through this may be to establish community norms for what are acceptable levels of attribution for datasets. Creative Commons and Science Commons recently added cut-and-paste citation support to their new version of the CC0 deed and to our norms documents (see click through to the ones with metadata to see examples).

Persistent Digital Identifiers and Institutional Sustainability

In a field that requires a lot of granularity in data use, even nominal registration fees per object can quickly become cost-prohibitive. In order for a data citation system to be useful, it must be accessible and its costs affordable by all necessary user communities.

It is important to consider data citations in the context of the semantic web. Online, the reference becomes “actionable”—the user wants to link directly to the item being cited. Distributed, linked technologies actually take us back to the original intention of citations, which is to enable the reader to discover, retrieve, and verify the identity of the referenced item. Bibliographic references presume that the desired object exists in multiple printed copies, and that any copy will do. In a digital world, only one “copy” exists. It is that copy that must be discoverable and retrievable. That sought item needs a persistent identifier. Normally, that persistent identifier is a URI.

The semantic web is predicated on linking of persistent identifiers. That model would be more inclusive and forward looking than the present framing. Within the semantic web framework, progress is being made on modelling relationships between scholarly objects. The technical standards now in place are the Open Archives Initiative – Object Reuse and Exchange protocol (OAI-ORE) (Pepe, Mayernik, Borgman & Van de Sompel, 2010).

As noted above, there is a need for registration and persistent identification for online digital datasets. Some registry and resolution models for this function have already emerged, but the various models (for-profit vs. not-for-profit, public vs. private, etc.) must be examined to assure that they are sustainable in the long term. Moreover, just as the persistence of the connection from print citations to the correct physical copies depends on libraries or publishers keeping those copies, the persistence of the connection between a data citation and the actual data ultimately must also depend on some form of commitment by durable institutions to preserving data that is cited. Although a top-down, centralized archive that keeps and organizes all data is an obviously attractive concept and works in some fields, creating such a trustworthy structure is probably not feasible universally, especially given the huge increases in the amount and types of data being generated or used by the scientific community. Distributed approaches to preservation such as institutional repositories, the Data Preservation Alliance for Social Science, and LOCKSS are emerging examples of alternatives to the centralized archiving model.

Other Issues

There are certain to be other important elements to the proper development and implementation of data citation standards and good practices, especially discipline-specific ones that may be identified by the Task Group as it undertakes its activities.

3. Current Membership (Please give institution, area of expertise, telephone, and e-mail of each member)

Co-Chairs

Co-Chair, Bonnie Carroll (U.S. CODATA and CENDI)

President, Information International Associates

104 Union Valley Road

P.O. Box 4219

Oak Ridge, TN 37831-4219

USA

Tel.: +1 865 298-1220
e-mail:

Co-Chair, Jan Brase (Director, DataCite, and ICSTI representative)

Technische Informations Bibliothek (TIB)
German National Library of Science and Technology
Welfengarten 1b
30167 Hannover

GERMANY
Tel.: +49 511 762 19869

e-mail:

Co-Chair, Sarah Callaghan (U.K. CODATA)

The NCAS British Atmospheric Data Centre

STFC Rutherford Appleton Laboratory

RAL Space

R25 - Room 2.05

Harwell Oxford

Didcot

OX11 0QX

England, UK.

Tel.: +44 1235 44 57 70

Membership List

(in alphabetical order)

Micah Altman

Senior Research Scientist

Institute for Quantitative Social Science

Harvard University

1727 Cambridge St, K325

Cambridge, MA 02138

USA

Tel. +1

e-mail:

Elizabeth Arnaud

Project Coordinator

Understanding and Managing Biodiversity Programme

Bioversity International

Via dei Tre Dinari, 472/a

00057 Maccarese

Rome

ITALY

Tel. + 39 066118323

Email:

Christine Borgman, Professor and Presidential Chair,

Department of Information Studies, University of California, Los Angeles

Box 951520, UCLA

Los Angeles, CA 90095-1520

USA

Tel.: +1 310-825-6164

Email:

Todd Carpenter

Managing Director

National Information Standards Organization

One North Charles Street
Suite 1905
Baltimore, MD 21201

USA

Tel: +1 301-654-2512

Fax: +1 410-685-5278

Email:

Dora Ann Lange Canhos

Director, CRIA

Av. Romeu Tórtima 388, Barão Geraldo

13084-791 Campinas, SP

BRAZIL

Tel.: +55 19 3288 0466

Email:

Vishwas Chavan

Senior Program Officer for DIGIT

Global Biodiversity Information Facility

Universitetsparken 15

DK 2100, Copenhagen

DENMARK

Tel. +45 35 32 14 75

Email:

Nathan Cunningham

British Antarctic Survey

Madingley Road, High Cross

Cambridge

Cambridgeshire CB3 0ET

UNITED KINGDOM

Tel.: + 44 1223 221400

Email:

Michael Diepenbroek

WDC-MARE / PANGAEA -

MARUM - Center for Marine Environmental Sciences

University Bremen

Leobener Strasse

POP 330 440

28359 Bremen

GERMANY

Tel.: +49 421 218-65590

Email:

John Helly

Senior Staff Scientist

San Diego Supercomputer Center

Scripps Institution of Oceanography, Climate, Atmospheric Science, and Physical Oceanography

University of California, San Diego

USA

Tel.: +1 760 840 8660 or +1 858 534 5060

Email:

Jianhui LI

Director, Scientific Data Center

Computer Network Information Center

Chinese Academy of Sciences

4th South Street, Zhong Guan Cun, Haidian District

Beijing, 100190

CHINA

Email:

Brian McMahon

Research and Development Officer

International Union of Crystallography

5 Abbey Square, Chester CH1 2HU

UNITED KINGDOM

Tel: +44 1244 342878

Fax: +44 1244 314888

Email:

Karen Morgenroth

National Research Council Canada
Canada Institute for Scientific and Technical Information
1200 Montreal Road, M-55
Ottawa, ON K1A 0R6
CANADA

Tel.: +1 613 998 8396

Email:

Yasuhiro Murayama

Director, Integrated Science Data System Research Laboratory

National Institute of Information and Communications Technology

4-2-1 Nukui-kita, Koganei

Tokyo 184-8795

JAPAN
Tel: +81-423-27-6685

Fax: +81-423-27-6678

Email:

Soren Roug

EEA Coordinator GMES Bureau

European Environment Agency

Avenue D'Auderghem 45 - BREY 9/211

B - 1040 Brussels

BELGIUM

Tel.:

Email:

Helge Sagen

Head of Norwegian Marine Datacentre

Institute of Marine Research

PO Box 1870, Nordnes

5817 Bergen

NORWAY

Att: Helge Sagen

Tel. +47 55 23 84 47

Email:

Eefke Smit

International Association of STM Publishers,

Director, Standards and Technology

Prins Willem-Alexanderhof 5

2595 BE The Hague

THE NETHERLANDS

Tel. +31 654 321 371

Email:

Martie J. van Deventer

Portfolio Manager

CSIR South Africa

Information Services

PO Box 395

Pretoria 0001

SOUTH AFRICA

Tel: +27 12 841-3278

Email:

John Wilbanks

Vice President, Creative Commons
Director, Science Commons

171 Second Street
Suite 300
San Francisco, CA 94105

USA

Tel: +1 617 838 6333

Email:

Koji Zettsu, Director, Information Services Platform Laboratory, National Institute of Information and Communications Technology, JAPAN

3-5, Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan

Tel: +81-774-98-6921

Fax: +81-774-98-6960

Email:

Consultants

Daniel Cohen

Program Officer, NRC Board on Research Data and Information, and

U.S. Committee for CODATA

[on detail from the Library of Congress]

National Academy of Sciences, Keck-

500 Fifth Street NW

Washington, DC 20001

USA

Tel.: +1 202 334 1253

Email:

Franciel Linares, Technical Director, Information Management & Technology Program,

Information International Associates, USA

104 Union Valley Road

Oak Ridge, TN 37831-4219

Tel.: (865) 298-1226 (Office) (865) 363-8632 (Cell)

Email:

Yvonne Socha, MLIS candidate

University of Tennessee

PO Box 16635

Knoxville, TN 37996

USA

Tel.: 865-742-3478 (cell)

Email:

Paul Uhlir

Director, NRC Board on Research Data and Information, and

U.S. National Committee for CODATA

National Academy of Sciences, Keck-511

500 Fifth Street NW

Washington, DC 20001

USA

Tel.: +1 202 334 1531

Fax: +1 202 334 2231

e-mail:

4. Planned Changes in Membership (Please give institution, area of expertise, telephone, and e-mail of each member; indicate if the individual has been invited to participate and agreed to serve, or if the individual has not yet been contacted. Note also that participation by scientists from around the world as members of the Task Group and in Task Group activities is strongly encouraged. Particular attention should be given to gender balance, and representation from developing countries.)

We have already added several members, consultants, and young scientists and do not plan to add any more.

5. Please indicate whether young scientists are going to be involved in this Task Group. If so please provide details.

Participants in Task Group activities include young scientists: Sarah Callaghan; Franciel Azpurua Linares, of Information International Associates; Matthew Mayernik, National Center for Atmospheric Research; Laura Wynholds, UCLA; Jillian Wallis, UCLA; and Yvonne M. Socha, University of Tennessee. Sarah and Franciel are under 35 years old and the others are under 30.

Sarah Callaghan is one of the three co-chairs of the Task Group.

Franciel Azpurua Linares provides ongoing support to the Task Group through coordination of activities and research.

Matthew Mayernik, National Center for Atmospheric Research, Laura Wynholds, UCLA, and Jillian Wallis, UCLA, served as rapporteurs for breakout sessions of the workshop “For Attribution: Developing Data Attribution and Citation Practices and Standards”, held in Berkeley, CA in August 2011.

Yvonne M. Socha is analyzing the Task Group’s bibliography of the literature relating to data citation and attribution practices.

6. Summary of activities since the 2010 General Assembly.

Commencing in March of 2011, we have held approximately monthly teleconferences of the Task Group co-chairs and quarterly teleconferences of the full Task Group.

We have formed working groups for specific activities (Bibliography, Stakeholder Survey, Website/Intranet, Standards and Best Practices). The identified leads and members of these working groups have conferred regularly between teleconferences of the full Task Group and reported on progress in these activities. They have shared drafts of the survey (interview) questions and outlines of white paper topics, and have developed lists of survey respondents.