Supplementary information for

Finding useful data across multiple biomedical data repositories using DataMed

Lucila Ohno-Machado1,2*, Susanna Sansone3*, George Alter4*, Ian Fore5*, Jeffrey Grethe1*, Hua Xu6*, Alejandra Gonzalez-Beltran3, Philippe Rocca-Serra3, Ergin Soysal6, Nansu Zong1, Hyeon-eui Kim1

1University of California San Diego. 2Veterans Administration San Diego Healthcare System. 3University of Oxford. 4University of Michigan Ann Arbor. 5National Institutes of Health. 6The University of Texas Health Science Center at Houston. *These authors contributed equally. Correspondence should be addressed to L. Ohno-Machado.

Index

1. Supplementary Table 1: Number of results from different systems

2. BioCADDIE working groups

3. Recommendations for repository inclusion criteria

4. Metadata models reviewed to define the core metadata incorporated into DATS

4. Definition of Authorization

5. Identifier Resolution

6. Message brokering architecture of the indexing pipeline

7. Name list for bioCADDIE contributors

References

1. Supplementary Table 1: Number of results from different systems

KQ = keyword query; NLQ = natural language query. "Total" is the total number of returned results; all other columns are counts within the top 50 results.

Search engine / KQ total / KQ relevant / KQ partially relevant / KQ not relevant / KQ dataset / NLQ total / NLQ relevant / NLQ partially relevant / NLQ not relevant / NLQ dataset
DataMed / 94 / 46 / 4 / 0 / 50 / 81 / 46 / 2 / 2 / 50
OmicsDI / 192 / 18 / 14 / 18 / 50 / 0 / 0 / 0 / 0 / 0
Google / 115000 / 18 / 21 / 11 / 0 / 7 / 3 / 3 / 1 / 3
Bing / 1430000 / 11 / 24 / 15 / 0 / 149000 / 22 / 3 / 25 / 0
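
As a reading aid, the top-50 counts in Supplementary Table 1 can be converted into the fraction of judged results that were rated relevant. The sketch below is illustrative only: the function name and the choice to divide by the number of judged results (relevant + partially relevant + not relevant) rather than by 50 are assumptions, not the evaluation procedure used by the authors.

```python
# Top-50 counts copied from Supplementary Table 1:
# (relevant, partially relevant, not relevant) per search engine and query type.
TOP50_COUNTS = {
    ("DataMed", "keyword"): (46, 4, 0),
    ("DataMed", "natural language"): (46, 2, 2),
    ("OmicsDI", "keyword"): (18, 14, 18),
    ("OmicsDI", "natural language"): (0, 0, 0),
    ("Google", "keyword"): (18, 21, 11),
    ("Google", "natural language"): (3, 3, 1),
    ("Bing", "keyword"): (11, 24, 15),
    ("Bing", "natural language"): (22, 3, 25),
}


def relevant_fraction(relevant, partial, not_relevant):
    """Fraction of judged results rated fully relevant (hypothetical helper).

    Returns None when no results were judged, e.g. when a query returned nothing.
    """
    judged = relevant + partial + not_relevant
    if judged == 0:
        return None
    return relevant / judged


if __name__ == "__main__":
    for (engine, query_type), counts in TOP50_COUNTS.items():
        print(engine, query_type, relevant_fraction(*counts))
```

Dividing by the number of judged results matters for rows where fewer than 50 results were returned; dividing by 50 instead would be an equally defensible convention.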

2. BioCADDIE working groups

  • bioCADDIE Working Group 2. Data Identifiers Recommendation
  • bioCADDIE Working Group 3 (1). Descriptive Metadata for Datasets
  • bioCADDIE Working Group 3 (2). The DataMed's DATS model annotated with schema.org
  • bioCADDIE Working Group 4. Use Cases and Testing Benchmarks
  • bioCADDIE Working Group 6. Criteria for Repository Inclusion (Standards, Interoperability, Sustainability, etc.)
  • bioCADDIE Working Group 7. Accessibility Metadata for Datasets
  • bioCADDIE Working Group 8. Ranking Search Results
  • bioCADDIE Working Group 9. End User Evaluation Criteria

3. Recommendations for repository inclusion criteria

  • Quality: Data sets should be accompanied by sufficient metadata to make them findable, usable, and interoperable. Data formats should follow community standards.
  • Sustainability: Repositories should actively maintain their holdings and have sustainable funding.
  • Scope: Additional repositories should be included in bioCADDIE as necessary to ensure diversity and new types of data used in research.
  • Prominence: Priority should be given to types of data that testers of the prototype will expect to find. Less prominent repositories should be considered if they introduce diversity.
  • Access: Data should be accessible to the scientific community. Metadata must be in a format that the bioCADDIE team can ingest into the DataMed index. Objects must be uniquely identified and web-resolvable through persistent identifiers.

4. Metadata models reviewed to define the core metadata incorporated into DATS

  • schema.org
  • DataCite [1]
  • RIF-CS
  • Project Open Data Metadata Schema v1.1
  • W3C HCLS dataset descriptions
  • BioSample [2]
  • GEO MINiML [3]
  • PRIDE-ml
  • ISA [4]
  • MAGE-TAB [5]
  • GA4GH metadata schema
  • SRA XML
  • BioProject [2]
  • CDISC SDM / element of BRIDG model [6]

4. Definition of Authorization

Authorization refers to the process of obtaining approval from the data provider for use of the data. Many datasets can be downloaded anonymously without prior authorization, but others require registration, a signed data use agreement, or other documentation. When data are restricted to authorized users, they are protected by an authentication process that requires users to provide proof of their identity. Data protections may extend to the way in which data are accessed. In some cases, researchers must conduct analyses of sensitive data (e.g., protected health information) via remote access to protected computing environments.

5. Identifier Resolution

For DataMed to resolve an identifier, the identifier must have (a) a documented prefix and (b) a resolving namespace. DataMed must maintain a landing page for each indexed dataset in addition to the landing page at the repository. These DataMed landing pages must be both human- and machine-readable (i.e., they return different content through the use of appropriate “inflections”). This allows DataMed to provide landing pages for datasets that may no longer exist or be accessible (i.e., tombstone pages). Comparable services already exist for the scientific literature, for example Portico [17]. However, no equivalent service exists for datasets.
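
The resolution requirements above can be sketched as follows. This is a minimal illustration under stated assumptions, not DataMed's implementation: the `prefix:accession` identifier syntax, the `NAMESPACES` registry, the `example` prefix and its URL, the `resolve` and `render` functions, and the use of an HTTP-style `Accept` value to stand in for “inflections” are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical registry: documented prefix -> resolving namespace (URL template).
NAMESPACES = {
    "example": "https://repository.example.org/datasets/{accession}",
}


@dataclass
class LandingPage:
    identifier: str
    repository_url: Optional[str]  # None when the dataset is no longer available
    tombstone: bool


def resolve(identifier, withdrawn=frozenset()):
    """Resolve 'prefix:accession' to a landing page record."""
    prefix, sep, accession = identifier.partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"expected 'prefix:accession', got {identifier!r}")
    if prefix not in NAMESPACES:
        raise KeyError(f"no resolving namespace documented for prefix {prefix!r}")
    if identifier in withdrawn:
        # The dataset is gone, but the landing page itself persists as a tombstone.
        return LandingPage(identifier, None, True)
    return LandingPage(identifier, NAMESPACES[prefix].format(accession=accession), False)


def render(page, accept="text/html"):
    """Return machine-readable JSON or human-readable HTML for the same page."""
    if "application/json" in accept:
        return json.dumps(asdict(page))
    if page.tombstone:
        return f"<p>Dataset {page.identifier} is no longer available.</p>"
    return f'<p>Dataset {page.identifier}: <a href="{page.repository_url}">repository page</a></p>'
```

For example, `render(resolve("example:ABC123"))` yields an HTML fragment that links to the repository, while passing `accept="application/json"` returns the same record as JSON.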

6. Message brokering architecture of the indexing pipeline

The pipeline adopts a message brokering architecture consisting of a dispatcher, one or more consumer heads running on separate machines, and a management interface for configuring and managing the ingestion and processing of metadata from a data source. The dispatcher is configured with one or more pipeline definitions that specify how to harvest or ingest documents from a source and how to process, enhance, and transform them for indexing. Each consumer head coordinates a set of consumers that listen to preconfigured message queues and apply a predefined operation to each document. During processing, each metadata record to be ingested is represented by a JSON document. This JSON document contains the original document, the processing results, provenance (represented in PROV-DM), and processing information used by the pipeline.
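
The flow described above can be illustrated with a small in-process sketch. It is not the bioCADDIE pipeline: Python's standard `queue.Queue` stands in for a real message broker, everything runs in one process rather than on separate machines, provenance is a plain list rather than PROV-DM, and all names (`new_record`, `Consumer`, `run_pipeline`, the JSON keys) are hypothetical.

```python
import json
import queue


def new_record(source, original):
    """Wrap a harvested metadata document in a JSON-serializable envelope."""
    return {
        "original": original,    # the document as harvested
        "results": {},           # output of each processing step
        "provenance": [],        # simplified stand-in for PROV-DM
        "processing": {"source": source, "last_step": None},
    }


class Consumer:
    """Listens on one queue, applies one operation, forwards to the next queue."""

    def __init__(self, name, operation, inbox, outbox):
        self.name = name
        self.operation = operation
        self.inbox = inbox
        self.outbox = outbox

    def drain(self):
        while True:
            try:
                message = self.inbox.get_nowait()
            except queue.Empty:
                return
            record = json.loads(message)
            record["results"][self.name] = self.operation(record)
            record["provenance"].append({"activity": self.name})
            record["processing"]["last_step"] = self.name
            self.outbox.put(json.dumps(record))


def run_pipeline(source, documents, steps):
    """Dispatcher: one queue per stage boundary, consumers run in pipeline order."""
    queues = [queue.Queue() for _ in range(len(steps) + 1)]
    consumers = [
        Consumer(name, operation, queues[i], queues[i + 1])
        for i, (name, operation) in enumerate(steps)
    ]
    for document in documents:
        queues[0].put(json.dumps(new_record(source, document)))
    for consumer in consumers:
        consumer.drain()
    finished = []
    while not queues[-1].empty():
        finished.append(json.loads(queues[-1].get_nowait()))
    return finished
```

A pipeline definition is then just an ordered list of named operations, for example `run_pipeline("example-source", docs, [("normalize_title", lambda r: r["original"]["title"].strip().lower())])`. Passing records between stages as JSON strings mirrors the description of each record as a JSON document and keeps the consumers independent of one another.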

7. Name list for bioCADDIE contributors

Steering Committee

Alison Yao – NIAID, NIH

Dawei Lin – NIAID, NIH

Dianne Babski - National Library of Medicine, NIH

George Komatsoulis - National Library of Medicine, NIH

Heidi Sofia – NHGRI, NIH

Jennie Larkin – NIDDK, NIH

Ron Margolis – NIDDK, NIH

Metadata Working Group (WG)

Allen Dearry – NIEHS, NIH

Carole Goble - The University of Manchester

Helen Berman - Rutgers

Jared Lyle - ICPSR, University of Michigan

John Westbrook - Rutgers

Kevin Read - NYU School of Medicine

Marc Twagirumukiza - W3C ScheMed WG

Marcelline Harris - University of Michigan

Mary Vardigan - ICPSR, University of Michigan

Matthew Brush - Oregon Health and Science University

Melissa Haendel - Monarch Initiative

Michael Braxenthaler - Roche

Michael Huerta - National Library of Medicine, NIH

Morris Swertz - CORBEL, BBMRI, ELIXIR-NL and University Medical Center Groningen

Rai Winslow - Johns Hopkins University

Ram Gouripeddi - University of Utah

Shraddha Thakkar - NCTR FDA

Weida Tong - NCTR FDA

Accessibility Metadata WG

Alex Kanous - University of Michigan

Anne-Marie Tassé - McGill University

Damon Davis - HealthData.gov

Frank Manion - University of Michigan

Jessica Scott - GlaxoSmithKline

Kendall Roark - Purdue University

Mark Phillips - McGill University

Reagan Moore - University of North Carolina

Identifiers WG

Jo McEntyre - EMBL-EBI, ELIXIR EBI Node, PubMed Central

Joan Starr - California Digital Library, DataCite

John Kunze - California Digital Library

Julie McMurry - Monarch Initiative

Merce Crosas - Data Science

Michel Dumontier - Center for Expanded Data Annotation and Retrieval, W3C HCLSIG

Stian Soiland-Reyes - University of Manchester

Tim Clark - Harvard Medical School, FORCE11 Data Citation Implementation Group

Use Cases and Testing Benchmarks WG

Dave Kaufman - Arizona State University

Dina Demner-Fushman - National Library of Medicine, NIH

Ratna Rajesh Thangudu - BD2K Standards Coordinating Center

Steven Kleinstein - Yale School of Medicine

Thomas Radman - NIH

Todd Johnson - UTHealth

Trevor Cohen - UTHealth

Zhiyong Lu - National Library of Medicine, NIH

March 2015 workshop use case contributors

Alison Yao, NIH

Anders Garlid, UCLA

Anita de Waard, Elsevier Research Data Services

Anupama Gururaj, UTHealth

Brian Bleakely, UCLA

Carol Bean, Stanford University

Christina Kendziorski, University of Wisconsin Madison

Cleo Maehara, UCSD

Dave Eichmann, University of Iowa

Dawei Lin, NIAID, NIH

George Alter, ICPSR, University of Michigan

Heidi Sofia, NIH

Helen Berman, Rutgers

Howard Choi, UCLA

Hua Xu, UTHealth

Ian Fore, US NCI, NIH

Ida Sim, UCSF

Jeremy Espino, University of Pittsburgh

Julia Puzak, NIH

Karmela Krleza-Jeric, University of Split

Jennie Larkin, NIH

Kevin Read, New York University

Lucila Ohno-Machado, UCSD

Maryann Martone, UCSD

Melissa Haendel, Oregon Health & Science University

Peipei Ping, UCLA

Richard Gonzalez, UTHealth

Ron Margolis, NIH

Stephanie Hagstrom, UCSD

Susanna Sansone, University of Oxford

Vincent Kyi, UCLA

Ranking Search Results WG

Chelsea Ju - UCLA

Christina Kendziorski - University of Wisconsin Madison

Chun-Nan Hsu - UCSD

Elmer Bernstam - UTHealth

Gregory Gundersen - Icahn School of Medicine at Mount Sinai

Griffin Weber - Harvard Medical School

Hongfang Liu - Mayo Clinic

Jim Zheng - UTHealth

Sanda Harabagiu - University of Texas at Dallas

Vincent Kyi - UCLA

Criteria for Repository Inclusion WG

Amy Pienta - University of Michigan, ICPSR

Elizabeth Bell – UCSD

John Marcotte - ICPSR / University of Michigan

John Yates - The Scripps Research Institute

Kei Cheung - Yale University

Larry Clarke - National Cancer Institute (NCI)

Marek Grabowski - University of Virginia

Matthew McAuliffe - Center for Information Technology NIH

Neil McKenna - Baylor College of Medicine

Tanya Barrett - NCBI (GEO, BioSample, BioProject), GA4GH

Tim Clark - Harvard Medical School, FORCE11 Data Citation Implementation Group

Dataset Citation Metrics WG

Daniel Mietchen - National Library of Medicine, NIH

Jennifer Lin - PLOS

Kristi Holmes - Northwestern University

Martin Fenner - DataCite

Michael Taylor - Elsevier

Rebecca Lawrence - F1000

Sunje Dallmeier-Tiessen - CERN

Trisha Cruse - DataONE, CDL

Core Technology Development Team (CDT)

Anupama Gururaj - UTHealth

Burak Ozyurt - UCSD

Claudiu Farcas - UCSD

Cui Tao - UTHealth

Deevakar Rogith - UTHealth

Ergin Soysal - UTHealth

Larry Lui - UCSD

Mandana “Nina” Salimi - UTHealth

Min Jiang - UTHealth

Muhammad Amith - UTHealth

Nansu Zong - UCSD

Ngan T Nguyen-Le - UTHealth

Pratik Kumar Chaudhary – UTHealth

Ruiling Liu - UTHealth

Saeid Pournejati - UTHealth

Vidya Narayana - UTHealth

Xiao Dong - UTHealth

Xiaoling Chen - UTHealth

Yaoyun Zhang - UTHealth

Yueling Li - UCSD

Pilot project PIs and collaborators

Aditya Menon – National ICT Australia

Cathy Wu – University of Delaware

Cecilia Arighi – University of Delaware

Chris Mungall – Lawrence Berkeley National Laboratory

Dmitriy Dligach – Boston Children’s Hospital

Guergana Savova – Harvard University

Guoqian Jiang – Mayo Clinic

Harold Solbrig – Mayo Clinic

Hwanjo Yu – POSTECH, South Korea

Jaideep Vaidya – Rutgers

Jeeyae Choi – University of Wisconsin Milwaukee

Jina Huh – UCSD

Julio Facelli – University of Utah

Peter Rose – UCSD

Ricky Taira – UCLA

Xiaoqian Jiang – UCSD

Zhaohui Qin – Emory University

Supplements to bioCADDIE

OmicsDI

Eric Deutsch, ISB

Henning Hermjakob, EMBL-EBI

Peipei Ping, UCLA

CountEverything

David Haussler, UCSC

Ida Sim, UCSF

Isaac Kohane, Harvard Medical School

FORCE11

Tim Clark, Harvard Medical School

Maryann Martone, UCSD

Administrative Supplements

Bert O’Malley, Baylor College of Medicine

Carolyn Mattingly, North Carolina State University Raleigh

George Hripcsak, Columbia University

Hongfang Liu, Mayo Clinic

Raymond Winslow, Johns Hopkins

Steven Kleinstein, Yale

Tor Wagner, University of Colorado

Trevor Cohen, UTHealth

References

1. Brase, J. in Cooperation and Promotion of Information Resources in Science and Technology, 2009. COINFO'09. Fourth International Conference on. 257-261 (IEEE).

2. Barrett, T. et al. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Research 40, D57-D63 (2012).

3. Barrett, T. et al. NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Research 35, D760-D765, doi:10.1093/nar/gkl887 (2007).

4. Sansone, S. A. et al. Toward interoperable bioscience data. Nature Genetics 44, 121-126, doi:10.1038/ng.1054 (2012).

5. Rayner, T. F. et al. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics 7, 1 (2006).

6. Fridsma, D. B., Evans, J., Hastak, S. & Mead, C. N. The BRIDG project: a technical report. Journal of the American Medical Informatics Association 15, 130-137 (2008).