Supplementary information for
Finding useful data across multiple biomedical data repositories using DataMed
Lucila Ohno-Machado1,2*, Susanna Sansone3*, George Alter4*, Ian Fore5*, Jeffrey Grethe1*, Hua Xu6*, Alejandra Gonzalez-Beltran3, Philippe Rocca-Serra3, Ergin Soysal6, Nansu Zong1, Hyeon-eui Kim1
1University of California San Diego. 2Veterans Administration San Diego Healthcare System. 3University of Oxford. 4University of Michigan Ann Arbor. 5National Institutes of Health. 6The University of Texas Health Science Center at Houston. *These authors contributed equally. Correspondence should be addressed to L. Ohno-Machado ()
Index
1. Supplementary Table 1: Number of results from different systems
2. BioCADDIE working groups
3. Recommendations for repository inclusion criteria
4. Metadata models reviewed to define the core metadata incorporated into DATS
4. Definition of Authorization
5. Identifier Resolution
6. Message brokering architecture of the indexing pipeline
7. Name list for bioCADDIE contributors
Reference
1. Supplementary Table 1: Number of results from different systems
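Search Engine / Keyword: Total returned results / Keyword Top 50: Relevant / Keyword Top 50: Partially relevant / Keyword Top 50: Not relevant / Keyword Top 50: Dataset / Natural language: Total returned results / Natural language Top 50: Relevant / Natural language Top 50: Partially relevant / Natural language Top 50: Not relevant / Natural language Top 50: Dataset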
DataMed / 94 / 46 / 4 / 0 / 50 / 81 / 46 / 2 / 2 / 50
OmicsDI / 192 / 18 / 14 / 18 / 50 / 0 / 0 / 0 / 0 / 0
Google / 115000 / 18 / 21 / 11 / 0 / 7 / 3 / 3 / 1 / 3
Bing / 1430000 / 11 / 24 / 15 / 0 / 149000 / 22 / 3 / 25 / 0
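As a reading aid, the relevance counts in the table can be turned into the share of judged results rated relevant. The sketch below is illustrative only: the counts are copied from the table, while the function name `relevant_share` and the choice of this particular ratio are assumptions, not a metric reported by the authors.

```python
# Counts copied from Supplementary Table 1, top-50 columns:
# (relevant, partially relevant, not relevant)
TOP50 = {
    "keyword": {
        "DataMed": (46, 4, 0),
        "OmicsDI": (18, 14, 18),
        "Google": (18, 21, 11),
        "Bing": (11, 24, 15),
    },
    "natural_language": {
        "DataMed": (46, 2, 2),
        "OmicsDI": (0, 0, 0),
        "Google": (3, 3, 1),
        "Bing": (22, 3, 25),
    },
}


def relevant_share(counts):
    """Fraction of judged results rated relevant, or None if nothing was returned."""
    judged = sum(counts)
    return counts[0] / judged if judged else None


for query_type, engines in TOP50.items():
    for engine, counts in engines.items():
        share = relevant_share(counts)
        label = "n/a (no results)" if share is None else f"{share:.0%}"
        print(f"{query_type:17s} {engine:8s} {label}")
```

Note that Google returned only 7 results for the natural language query, so its share is computed over 7 judged results rather than 50.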
2. BioCADDIE working groups
- bioCADDIE Working Group 2. Data Identifiers Recommendation
- bioCADDIE Working Group 3 (1). Descriptive Metadata for Datasets
- bioCADDIE Working Group 3 (2). The DataMed's DATS model annotated with schema.org
- bioCADDIE Working Group 4. Use Cases and Testing Benchmarks
- bioCADDIE Working Group 6. Criteria for Repository Inclusion (Standards, Interoperability, Sustainability, etc.)
- bioCADDIE Working Group 7. Accessibility Metadata for Datasets
- bioCADDIE Working Group 8. Ranking Search Results
- bioCADDIE Working Group 9. End User Evaluation Criteria
3. Recommendations for repository inclusion criteria
- Quality: Data sets should be accompanied by sufficient metadata to make them findable, usable, and interoperable. Data formats should follow community standards.
- Sustainability: Repositories should actively maintain their holdings and have sustainable funding.
- Scope: Additional repositories should be included in bioCADDIE as necessary to ensure diversity and new types of data used in research.
- Prominence: Priority should be given to types of data that testers of the prototype will expect to find. Less prominent repositories should be considered if they introduce diversity.
- Access: Data should be accessible to the scientific community. Metadata must be in a format that the bioCADDIE team can ingest into the DataMed index. Objects must be uniquely identified and web‐resolvable by persistent identifiers.
4. Metadata models reviewed to define the core metadata incorporated into DATS
- schema.org
- DataCite [1]
- RIF-CS
- Project Open Data Metadata Schema v1.1
- W3C HCLS dataset descriptions
- BioSample [2]
- GEO MINiML [3]
- PRIDE-ml
- ISA [4]
- MAGE-TAB [5]
- GA4GH metadata schema
- SRA XML
- BioProject [2]
- CDISC SDM / element of BRIDG model [6]
4. Definition of Authorization
Authorization refers to the process of obtaining approval from the data provider for use of the data. Many datasets can be downloaded anonymously without prior authorization, but others require registration, a signed data use agreement, or other documentation. When data are restricted to authorized users, they are protected by an authentication process that requires users to provide proof of their identity. Data protections may extend to the way in which data are accessed. In some cases, researchers must conduct analyses of sensitive data (e.g., protected health information) via remote access to protected computing environments.
5. Identifier Resolution
In order for DataMed to resolve an identifier, the identifier must have (a) a documented prefix and (b) a resolving namespace. DataMed must maintain a landing page for each indexed dataset in addition to the landing page at the repository. These DataMed landing pages must be both human- and machine-readable (i.e., they return different content through the use of appropriate “inflections”). This allows DataMed to provide landing pages for datasets that may no longer exist or be accessible (i.e., tombstone pages). Comparable services already exist for the scientific literature, for example Portico [17]. However, no equivalent service exists for datasets.
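The resolution requirements above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DataMed's implementation: the `RESOLVERS` registry, the `exampledb` prefix, the example URL, the `LandingPage` class, and the use of an HTTP-style `Accept` value to stand in for "inflections" are all hypothetical.

```python
import json
from dataclasses import dataclass

# Hypothetical registry: documented prefix -> URL template of the resolving namespace.
# The prefix and URL are placeholders, not actual DataMed configuration.
RESOLVERS = {
    "exampledb": "https://repository.example.org/datasets/{accession}",
}


@dataclass
class LandingPage:
    identifier: str
    title: str
    available: bool = True  # False marks a tombstone page


def resolve(identifier: str) -> str:
    """Turn 'prefix:accession' into the repository URL for that record."""
    prefix, sep, accession = identifier.partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"expected 'prefix:accession', got {identifier!r}")
    template = RESOLVERS.get(prefix.lower())
    if template is None:
        raise LookupError(f"no resolving namespace registered for prefix {prefix!r}")
    return template.format(accession=accession)


def render(page: LandingPage, accept: str = "text/html") -> tuple:
    """Return (content type, body) for a human or a machine reader."""
    status = "available" if page.available else "tombstone"
    if "application/json" in accept:
        body = json.dumps({
            "identifier": page.identifier,
            "title": page.title,
            "status": status,
            # A tombstone page has no live repository record to point to.
            "repository_url": resolve(page.identifier) if page.available else None,
        })
        return "application/json", body
    note = "" if page.available else "<p>This dataset is no longer available.</p>"
    return "text/html", f"<h1>{page.title}</h1><p>{page.identifier}</p>{note}"


page = LandingPage("exampledb:DS0001", "Example dataset", available=False)
print(resolve("exampledb:DS0001"))
print(render(page, accept="application/json"))
```

Keeping the landing page separate from the repository URL is what lets the tombstone case work: the page can still be served after the repository record disappears.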
6. Message brokering architecture of the indexing pipeline
The pipeline adopts a message brokering architecture consisting of a dispatcher, one or more consumer heads running on separate machines, and a management interface for configuring and managing the ingestion and processing of metadata from a data source. The dispatcher is configured with one or more pipeline definitions that specify how to harvest or ingest documents from a source and then process, enhance, and transform them for indexing. Each consumer head coordinates a set of consumers that listen to preconfigured message queues and apply predefined operations to the documents. During processing, each metadata record to be ingested is represented by a JSON document. This JSON document contains the original document, the processing results, provenance (represented in PROV-DM), and the processing information used by the pipeline.
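The roles described above can be illustrated with a small in-process sketch. It is a simplification under stated assumptions: the class names, queue names, envelope field names, and the two toy operations are hypothetical; Python's `queue.Queue` stands in for a real message broker; everything runs in one process rather than on separate machines; and the provenance list is a simplified placeholder, not a valid PROV-DM record.

```python
import json
import queue
from typing import Callable, Dict, List, Optional


def wrap_record(source: str, original: dict) -> dict:
    """Wrap a harvested record in a JSON-style envelope for processing."""
    return {
        "original": original,   # the document as harvested
        "results": {},          # output of each processing step
        "provenance": [],       # simplified stand-in for PROV-DM
        "processing": {"source": source, "status": "new"},
    }


class Consumer:
    """Listens on one queue and applies one predefined operation."""

    def __init__(self, queue_name: str, operation: Callable[[dict], object]):
        self.queue_name = queue_name
        self.operation = operation

    def handle(self, doc: dict) -> dict:
        doc["results"][self.queue_name] = self.operation(doc)
        doc["provenance"].append({"activity": self.queue_name})
        doc["processing"]["status"] = self.queue_name
        return doc


class Dispatcher:
    """Holds a pipeline definition (ordered queue names) and routes documents."""

    def __init__(self, pipeline: List[str]):
        self.pipeline = pipeline
        self.queues: Dict[str, queue.Queue] = {n: queue.Queue() for n in pipeline}

    def submit(self, doc: dict) -> None:
        self.queues[self.pipeline[0]].put(doc)

    def next_queue(self, name: str) -> Optional[str]:
        i = self.pipeline.index(name) + 1
        return self.pipeline[i] if i < len(self.pipeline) else None


class ConsumerHead:
    """Coordinates the consumers and drains their queues in pipeline order."""

    def __init__(self, dispatcher: Dispatcher, consumers: List[Consumer]):
        self.dispatcher = dispatcher
        self.consumers = {c.queue_name: c for c in consumers}

    def run(self) -> List[dict]:
        finished = []
        for name in self.dispatcher.pipeline:
            q = self.dispatcher.queues[name]
            while not q.empty():
                doc = self.consumers[name].handle(q.get())
                nxt = self.dispatcher.next_queue(name)
                if nxt is None:
                    finished.append(doc)  # ready for indexing
                else:
                    self.dispatcher.queues[nxt].put(doc)
        return finished


dispatcher = Dispatcher(["transform", "enhance"])
head = ConsumerHead(dispatcher, [
    Consumer("transform", lambda d: {"title": d["original"]["title"].strip()}),
    Consumer("enhance", lambda d: {"n_words": len(d["results"]["transform"]["title"].split())}),
])
dispatcher.submit(wrap_record("example-source", {"title": "  Example dataset title "}))
indexed = head.run()
print(json.dumps(indexed[0], indent=2))
```

Because each step only reads from and writes to the shared envelope, steps can be added or reordered by changing the pipeline definition rather than the consumers themselves.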
7. Name list for bioCADDIE contributors
Steering Committee
Alison Yao – NIAID, NIH
Dawei Lin – NIAID, NIH
Dianne Babski - National Library of Medicine, NIH
George Komatsoulis - National Library of Medicine, NIH
Heidi Sofia – NHGRI, NIH
Jennie Larkin – NIDDK, NIH
Ron Margolis – NIDDK, NIH
Metadata Working Group (WG)
Allen Dearry – NIEHS, NIH
Carole Goble - The University of Manchester
Helen Berman - Rutgers
Jared Lyle - ICPSR, University of Michigan
John Westbrook - Rutgers
Kevin Read - NYU School of Medicine
Marc Twagirumukiza - W3C ScheMed WG
Marcelline Harris - University of Michigan
Mary Vardigan - ICPSR, University of Michigan
Matthew Brush - Oregon Health and Science University
Melissa Haendel - Monarch Initiative
Michael Braxenthaler - Roche
Michael Huerta - National Library of Medicine, NIH
Morris Swertz - CORBEL, BBMRI, ELIXIR-NL and University Medical Center Groningen
Rai Winslow - Johns Hopkins University
Ram Gouripeddi - University of Utah
Shraddha Thakkar - NCTR FDA
Weida Tong - NCTR FDA
Accessibility Metadata WG
Alex Kanous - University of Michigan
Anne-Marie Tassé - McGill University
Damon Davis - HealthData.gov
Frank Manion - University of Michigan
Jessica Scott - GlaxoSmithKline
Kendall Roark - Purdue University
Mark Phillips - McGill University
Reagan Moore - University of North Carolina
Identifiers WG
Jo McEntyre - EMBL-EBI, ELIXIR EBI Node, PubMed Central
Joan Starr - California Digital Library, DataCite
John Kunze - California Digital Library
Julie McMurry - Monarch Initiative
Merce Crosas - Data Science
Michel Dumontier - Center for Expanded Data Annotation and Retrieval, W3C HCLSIG
Stian Soiland-Reyes - University of Manchester
Tim Clark - Harvard Medical School, FORCE11 Data Citation Implementation Group
Use Cases and Testing Benchmarks WG
Dave Kaufman - Arizona State University
Dina Demner-Fushman - National Library of Medicine, NIH
Ratna Rajesh Thangudu - BD2K Standards Coordinating Center
Steven Kleinstein - Yale School of Medicine
Thomas Radman - NIH
Todd Johnson - UTHealth
Trevor Cohen - UTHealth
Zhiyong Lu - National Library of Medicine, NIH
March 2015 workshop use case contributors
Alison Yao, NIH
Anders Garlid, UCLA
Anita de Waard, Elsevier Research Data Services
Anupama Gururaj, UTHealth
Brian Bleakely, UCLA
Carol Bean, Stanford University
Christina Kendziorski, University of Wisconsin Madison
Cleo Maehara, UCSD
Dave Eichmann, University of Iowa
Dawei Lin, NIAID, NIH
George Alter, ICPSR, University of Michigan
Heidi Sofia, NIH
Helen Berman, Rutgers
Howard Choi, UCLA
Hua Xu, UTHealth
Ian Fore, US NCI, NIH
Ida Sim, UCSF
Jeremy Espino, University of Pittsburgh
Julia Puzak, NIH
Karmela Krleza-Jeric, University of Split
Jennie Larkin, NIH
Kevin Read, New York University
Lucila Ohno-Machado, UCSD
Maryann Martone, UCSD
Melissa Haendel, Oregon Health & Science University
Peipei Ping, UCLA
Richard Gonzalez, UTHealth
Ron Margolis, NIH
Stephanie Hagstrom, UCSD
Susanna Sansone, University of Oxford
Vincent Kyi, UCLA
Ranking Search Results WG
Chelsea Ju - UCLA
Christina Kendziorski - University of Wisconsin Madison
Chun-Nan Hsu - UCSD
Elmer Bernstam - UTHealth
Gregory Gundersen - Icahn School of Medicine at Mount Sinai
Griffin Weber - Harvard Medical School
Hongfang Liu - Mayo Clinic
Jim Zheng - UTHealth
Sanda Harabagiu - University of Texas at Dallas
Vincent Kyi - UCLA
Criteria for Repository Inclusion WG
Amy Pienta - University of Michigan, ICPSR
Elizabeth Bell – UCSD
John Marcotte - ICPSR / University of Michigan
John Yates - The Scripps Research Institute
Kei Cheung - Yale University
Larry Clarke - National Cancer Institute (NCI)
Marek Grabowski - University of Virginia
Matthew McAuliffe - Center for Information Technology NIH
Neil McKenna - Baylor College of Medicine
Tanya Barrett - NCBI (GEO, BioSample, BioProject), GA4GH
Tim Clark - Harvard Medical School, FORCE11 Data Citation Implementation Group
Dataset Citation Metrics WG
Daniel Mietchen - National Library of Medicine, NIH
Jennifer Lin - PLOS
Kristi Holmes - Northwestern University
Martin Fenner - DataCite
Michael Taylor - Elsevier
Rebecca Lawrence - F1000
Sunje Dallmeier-Tiessen - CERN
Trisha Cruse - DataONE, CDL
Core technology Development Team (CDT)
Anupama Gururaj - UTHealth
Burak Ozyurt - UCSD
Claudiu Farcas - UCSD
Cui Tao - UTHealth
Deevakar Rogith - UTHealth
Ergin Soysal - UTHealth
Larry Lui - UCSD
Mandana “Nina” Salimi - UTHealth
Min Jiang - UTHealth
Muhammad Amith - UTHealth
Nansu Zong - UCSD
Ngan T Nguyen-Le - UTHealth
Pratik Kumar Chaudhary – UTHealth
Ruiling Liu - UTHealth
Saeid Pournejati - UTHealth
Vidya Narayana - UTHealth
Xiao Dong - UTHealth
Xiaoling Chen - UTHealth
Yaoyun Zhang - UTHealth
Yueling Li - UCSD
Pilot project PIs and collaborators
Aditya Menon – National ICT Australia
Cathy Wu – University of Delaware
Cecilia Arighi – University of Delaware
Chris Mungall – Lawrence Berkeley National Laboratory
Dmitriy Dligach – Boston Children’s Hospital
Guergana Savova – Harvard University
Guoqian Jiang – Mayo Clinic
Harold Solbrig – Mayo Clinic
Hwanjo Yu – POSTECH, South Korea
Jaideep Vaidya – Rutgers
Jeeyae Choi – University of Wisconsin Milwaukee
Jina Huh – UCSD
Julio Facelli – University of Utah
Peter Rose – UCSD
Ricky Taira – UCLA
Xiaoqian Jiang – UCSD
Zhaohui Qin – Emory University
Supplements to bioCADDIE
OmicsDI
Eric Deutsch, ISB
Henning Hermjakob, EMBL-EBI
Peipei Ping, UCLA
CountEverything
David Haussler, UCSC
Ida Sim, UCSF
Isaac Kohane, Harvard Medical School
FORCE11
Tim Clark, Harvard Medical School
Maryann Martone, UCSD
Administrative Supplements
Bert O’Malley, Baylor College of Medicine
Carolyn Mattingly, North Carolina State University Raleigh
George Hripcsak, Columbia University
Hongfang Liu, Mayo Clinic
Raymond Winslow, Johns Hopkins
Steven Kleinstein, Yale
Tor Wagner, University of Colorado
Trevor Cohen, UTHealth
References
1. Brase, J. in Cooperation and Promotion of Information Resources in Science and Technology, 2009. COINFO'09. Fourth International Conference on. 257-261 (IEEE).
2. Barrett, T. et al. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Research 40, D57-D63 (2012).
3. Barrett, T. et al. NCBI GEO: mining tens of millions of expression profiles—database and tools update. Nucleic Acids Research 35, D760-D765, doi:10.1093/nar/gkl887 (2007).
4. Sansone, S. A. et al. Toward interoperable bioscience data. Nature Genetics 44, 121-126, doi:10.1038/ng.1054 (2012).
5. Rayner, T. F. et al. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics 7, 1 (2006).
6. Fridsma, D. B., Evans, J., Hastak, S. & Mead, C. N. The BRIDG project: a technical report. Journal of the American Medical Informatics Association 15, 130-137 (2008).