Brad Hemminger

This is a restatement of my research proposal for the CTSA biomedical informatics project, updated 2008-07-15 based on EDW progress. The original proposal was submitted for the initial grant on 2006-12-06.

Proposal for Graduate Research Assistant Support

Annual cost: $20,000 (half-time during the academic year, full-time during the summer)

Goal: Unification of infrastructure and vocabularies, and acceptance of best-practice data element representations by labs and projects. To achieve truly revolutionary breakthroughs in biomedical informatics, two barriers must be broken. The first is infrastructural: informatics groups must change from isolated silos into an integrated network of peers. Connections must be made both in technical terms and, more importantly, in personal terms. Unless organizational support helps bring together the many and varied researchers, clinicians, and educators on these fronts, little true integration will occur. The second barrier is the lack of common vocabularies and terminologies. Many groups from different backgrounds and disciplines are working in related areas, and they use different terms for the same or similar concepts. Achieving the semantic integration needed for useful interoperability requires integrating terminologies and ontologies across these many groups of participants.

Hemminger research goals under CTSA:

1.  Unification of terminologies. As part of the RENCI P20 we surveyed about a dozen major biomedical informatics labs on campus and recorded their complete database schemas and terminologies (summary spreadsheet). We made significant progress developing a common data model that unifies their “common” concepts and defines controlled vocabularies for them. We wish to extend this work by aligning the “common model” with national and international standards (primarily CDISC and caBIG vocabulary terms, large commonly used ontologies such as GO and UMLS, and the national CTSA efforts). These results would then be exchanged with the other CTSA sites, and we would work on national “unification”. As part of this we would also like to extend our surveying to additional labs on campus (a representative sampling).

2.  Overcoming non-technical barriers to infrastructure unification. In the pilot work under our RENCI P20 grant we found that almost all individual research PIs and large project groups would not choose to participate in large infrastructure programs such as the one proposed by CTSA, nor would they use common resources such as disk space or processing time even when these were freely available on standard machines. We have identified some of the policy, personal, and social issues involved. We wish to examine them in a larger context (more labs, within the specific CTSA framework) and to identify motivations that might lead PIs to change their behavior and choose to participate.

Critical initial work for the CTSA Biomedical Informatics Project will be to help the EDW/Carolina Data Warehouse identify proper standardized representations for its data elements. In this project we will focus on data elements involved in the initial clinical projects being ingested into the EDW, including:

·  Diabetes

·  Carpe Diem (breast cancer)

·  ERIC

·  nephrology fellows (cancer)

The EDW staff is focused on bringing up the EDW and providing initial service support. However, this work will not be fruitful if we cannot relate data between different projects at UNC, across North Carolina, or nationally with other CTSA projects. For this to occur, we will have to define common standardized data representations for all data elements. This requires surveying the data elements in use and identifying the standardized representations that should be used when storing them in the EDW clinical database and associated research datamarts (Carolina Data Warehouse). Standardized representations would come from CDISC, caBIG, etc.
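As a rough illustration of the survey-and-map step described above, the following Python sketch groups locally named data elements under a shared element name and lists those that still need a decision. Everything in it is an assumption made for illustration: the field names, the crosswalk, and the "COMMON." target names are hypothetical placeholders, not actual EDW fields or real CDISC or caBIG identifiers.

```python
from collections import defaultdict

# Hypothetical survey results: (project, local field name) pairs.
# These field names are illustrative only, not taken from any real schema.
SURVEYED_ELEMENTS = [
    ("Diabetes", "pt_sex"),
    ("Carpe Diem", "gender"),
    ("Diabetes", "hba1c_pct"),
    ("ERIC", "dob"),
    ("Carpe Diem", "birth_date"),
]

# Hypothetical crosswalk from local field names to a common element name.
# In practice the targets would be chosen from a standard such as CDISC or
# caBIG; the names below are placeholders, not real standard identifiers.
CROSSWALK = {
    "pt_sex": "COMMON.SEX",
    "gender": "COMMON.SEX",
    "dob": "COMMON.BIRTH_DATE",
    "birth_date": "COMMON.BIRTH_DATE",
}


def unify(surveyed, crosswalk):
    """Group local fields by common element; collect fields with no mapping."""
    mapped = defaultdict(list)
    unmapped = []
    for project, field in surveyed:
        target = crosswalk.get(field.lower())
        if target is None:
            # No standardized representation chosen yet for this field.
            unmapped.append((project, field))
        else:
            mapped[target].append((project, field))
    return dict(mapped), unmapped


mapped, unmapped = unify(SURVEYED_ELEMENTS, CROSSWALK)
for element, sources in sorted(mapped.items()):
    print(element, "<-", sources)
print("needs review:", unmapped)
```

The list of unmapped fields is the part that would need human review: each entry is a data element for which a standardized representation has not yet been selected.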

This work will require one of the GRA slots assigned to our CTSA group. The GRA will work under the supervision of the joint working group on Data Representation (formed from the CTSA Data Models/Standards/Ontologies and EDW committees) to address standardized data elements, in conjunction with the EDW staff assigned to this project. Work on overcoming non-technical barriers will be done with Dr. Hemminger of SILS. The GRA will need to begin in August 2008 and continue for the course of the grant. The position will be recruited and filled by the CTSA Data Models/Standards/Ontologies chair, Brad Hemminger.

Deliverables

Data Representation:

·  Identification of the data elements used in the EDW clinical database and associated research datamarts.

·  Selection of a best-practice representation for each identified data element.

Non-technical barriers:

·  Survey of scientists (principal investigators and investigators for laboratories on campus with major biomedical informatics datasets) to detail their reasons for and against participating in data centers and adhering to standard data models.

Example Standard Data Representations and Controlled Vocabularies/Ontologies:

·  GenBank: gene sequences

·  Swiss-Prot: proteins

·  MIAME/MGED: microarrays

·  CDISC

·  caBIG

·  Gene Ontology (GO)

·  MeSH/UMLS