Principal Investigator/Program Director (Last, first, middle): Nicholas, Hugh B. Jr.

______

1. SPECIFIC AIMS

The primary aim of this project remains unchanged: to increase minority participation in biomedical research through a broadly based program of assisting minority institutions in developing all aspects of bioinformatics programs on their campuses. The most intensive effort is to assist these institutions in developing the training they offer by providing training and well-developed, tested training materials for bioinformatics courses to minority institutions, their faculty members, and their graduate students. The program will directly assist bioinformatics programs at two minority institutions each year and provide significant bioinformatics training at a number of other minority institutions. The program concentrates on the sequence-based aspects of bioinformatics that allow scientists to make effective use of the vast amount of information being produced by various genome sequencing, gene expression, and macromolecular sequence determination projects around the world. This basic area will be supplemented with information on the newer techniques designed to facilitate large-scale genomics and proteomics investigations as well as structural modeling and analysis techniques.

The program has been significantly expanded beyond the initial program, which focused on an introductory bioinformatics course, by adding course design and development components for one introductory and two advanced bioinformatics courses in the three component areas of bioinformatics: the biological sciences, the mathematical sciences, and computer science. This expanded program provides both immediate and long-term increases in the research opportunities available to scientists at minority institutions. The program will assist in developing bioinformatics as a strong part of the curriculum in multiple departments at selected institutions and will integrate bioinformatics procedures into the repertoire of research tools used by the research groups at those institutions. The initial three components, the continuing core of the program, are:

  1. An intense two-week summer workshop in bioinformatics for multidisciplinary faculty teams from selected minority institutions;
  2. Strengthening the bioinformatics curriculum at two minority campuses every year by teaching a bioinformatics course in collaboration with members of the above faculty teams; and
  3. A five-week research internship at the Pittsburgh Supercomputing Center (PSC) for graduate students who have completed the bioinformatics course on their campus.

This yearlong program covers major aspects of developing a strong, multidepartmental bioinformatics program at minority institutions. It recruits and trains a multidisciplinary team to staff the new program. It embeds the new program institutionally by integrating it into the teaching curriculum and by providing two years of part-time support for an on-campus liaison. Importantly, it incorporates bioinformatics into local research efforts. The new course development components, including introductory and advanced courses for biologists, mathematicians, and computer scientists, will assist minority institutions in training students from mathematical and computational backgrounds for bioinformatics careers. They will assist minority institutions in providing training that leads to specialization in bioinformatics for students from these diverse academic backgrounds and will allow those students to apply their expertise to important biological and medical research problems. Scientists recruited especially from minority institutions will do the course development. This coordinated program will solidify institutional support for bioinformatics by yielding tangible results in both teaching and research within the first year and by creating a solid basis for continuing results. A second year of PSC consulting, support, and a student internship is also provided. We will couple this program to a strong evaluation protocol both to identify successes and to identify and remedy shortcomings.

2. BACKGROUND AND SIGNIFICANCE

The biomedical sciences have progressively become more molecular and quantitative, with a large fraction of current research aimed at describing both normal cellular processes and disease states at the molecular level. Macromolecular sequence data are such a powerful aid to this process that a programmatic effort is underway to acquire a systematic collection of such data. The collection will include the complete genomic sequences for a large, diverse set of organisms, including humans and human pathogens. Sequence data are the successful outcomes of a set of random mutagenesis experiments conducted by nature and are a rich source of hypotheses about relationships between organism phenotypes and genotypes that can be tested experimentally. This explosion of quantitative data continues to expand rapidly with the development of new techniques for sequencing whole genomes quickly at reasonable cost and for carrying out large-scale investigations of gene and protein expression with microarray and proteomics techniques.

This has led mathematically oriented biologists, as well as mathematicians, statisticians, and computer scientists, to devise a growing array of analytical techniques, known as bioinformatics analyses, to mine these data for hypotheses that can efficiently drive further experimental work in productive directions, and to statistically evaluate and test hypotheses. The ability to gather and exploit these data has revolutionized molecular biology in such a short time that few biologists have received the mathematical and computer training needed to perform these analyses and exploit the resulting bioinformatics information in their experimental investigations. This has created a major demand for researchers trained in these techniques in both industry and academia.

These and related opportunities drove the discussions at the NIGMS meeting “Visions of the Future” in September of 2002. Two major themes that resulted from this meeting are given in the following two paragraphs.

"Mathematization" of Biology. A fusion of biology with the physical and engineering sciences is needed. Developing mechanisms that encourage and reward mutual, cooperative interactions among mathematicians, physicists, computer scientists, engineers, and biologists will be important for achieving highly significant future advances. … A further hindrance to progress is the observation that the quantitative components of undergraduate and graduate training in biology are currently inadequate.

Interdisciplinary Science. Ongoing, daily collaboration between scientists from different disciplines is vital to the success of future biomedical research. Assembling "critical masses" of personnel representing sufficient breadth and depth of varied scientific expertise will be an important solution to tackling complicated problems in biology and facilitating the fusion of information from disparate sources. … While investigators should be encouraged to work together, they should concentrate on retaining their specialist foci.

Because the above-mentioned expansion in demand is both large and sudden, many well-regarded universities are only beginning to train their students in this important area, as suggested by “Visions of the Future,” because they lack trained faculty and established programs.

This lack of trained faculty and established programs afflicts minority institutions' efforts to establish bioinformatics curricula not only in biological sciences departments but in mathematics, statistics, and computer science departments as well. Scientists from all of these areas are needed in a complete bioinformatics program that will be broad enough to attract students from the even wider range of academic disciplines whose expertise needs to be effectively used in biological research.

Even before the biomedical community has absorbed the first round of the bioinformatics revolution at the sequence level, we are undergoing a parallel expansion of macromolecular structure data as well as a second round of the bioinformatics revolution in the form of high-throughput genomics and proteomics techniques. This is seen in the recent NIH initiatives to systematically acquire these types of data.

From the above discussion it can be seen that bringing a bioinformatics program into any university differs drastically from, say, adding another enzymologist to teach metabolic pathways in an already established biochemistry program. Establishing a bioinformatics program is a major undertaking that crosses traditional academic field boundaries. Skills in various aspects of biology, such as molecular biology, cell biology, structural biology, and biochemistry, are obvious requirements. Other natural sciences such as chemistry, physics, and engineering have much to offer, both in terms of scientific content or understanding and in terms of their more mathematical model-building and computational approach to scientific investigation. Mathematics, statistics, and computer science are essential contributors to a bioinformatics program as the primary source of mathematical analysis and modeling and of computational processing and organizing of data. Ideally, each person will have overlapping competencies that will form a coherent whole within the program. Thus establishing a bioinformatics program requires bringing in a wider, more diverse set of skills than adding a new specialty within a single academic field or department, since it requires expertise found in multiple academic fields.

Given the highly competitive market for bioinformatics expertise and its limited availability, it is unlikely that institutions with limited expansion budgets can import an entire bioinformatics program by hiring new faculty. Instead, the multidisciplinary program we propose here seeks to assist in developing bioinformatics programs at minority institutions through a different mechanism. We will recruit multimember faculty teams whose members already possess strong academic backgrounds in one of the foundation fields of bioinformatics and assist these teams in implementing bioinformatics programs on their campuses, from the perspective that bioinformatics is a set of core problems whose solution involves combining expertise from the biological sciences, the mathematical sciences, and computer science. The problem set we will focus on in our initial training, and on which the course development aspects of this program will build, is the problem of how to gain the maximum amount of information from the vast amount of data being generated by the large-scale projects in genome sequencing, expression, and macromolecular structure determination.

The skills to solve these problems will be a valuable research ability and will open the door to many research opportunities. This is true whether the scientist comes from a mathematical or computer science background and is developing new methods to analyze these data, or is an experimental biologist applying bioinformatics techniques to a specific research agenda. Indeed, we believe that an increasing amount of biological research will be carried out by teams in which biological scientists work with mathematical scientists and computer scientists, using integrated approaches to complicated problems involving large and complicated datasets.

3. PRELIMINARY STUDIES

This section is divided into two parts: an overall description of the Biomedical Initiative at the Pittsburgh Supercomputing Center, followed by a description of the first three years of the current project, “Developing Bioinformatics Effort at MARC Schools.”

3.A.1 The Biomedical Initiative at the Pittsburgh Supercomputing Center has been a leading program in high performance computing for over 15 years. The Initiative's mission is to develop and apply new computing and scientific solutions in important biomedical areas such as structural biology and bioinformatics, cellular microphysiology, neural modeling, the Visible Human Project, and pathology. The group has an active research, training, and service program. The Biomedical Initiative receives its core funding as a Research Resource (P41 RR06009), with additional support from other institutes at the NIH, NSF, DARPA and others.

The Bioinformatics projects include investigating new algorithms for sensitive database searches, evaluating different representations of the information contained within a multiple sequence alignment, and identifying residues that differentiate between different sequence subfamilies. We recently completed a detailed analysis of enzyme families (aldehyde dehydrogenase and glutathione S-transferase) that identified conserved motifs and key residues for specificity and catalytic activity, and provided predictions used in laboratory research. PSC maintains a large service facility, which includes a large suite of programs for database searching, multiple sequence alignment, pattern identification and searching, and phylogenetic analysis; all major sequence databases and a large number of the completed genomes are maintained online. At least one workshop each year is focused primarily on sequence-based bioinformatics. We also are working directly with a number of universities to help establish competitive bioinformatics efforts.

The Structural Biology effort uses a variety of computational chemistry approaches to obtain insights into structure, function and specificity. All projects employ bioinformatics to help define questions, direct projects, and interpret results. The group works directly with other experimental groups to test the predictions derived from computations. The current projects include quantum chemical models for divalent metal ion binding sites, QM and QM/MM computations to investigate enzyme mechanisms, and MD simulations to study the dynamical behavior around active and binding sites. The service effort supports all major quantum mechanics programs (e.g., GAMESS, Gaussian, NWChem), molecular mechanics programs (e.g., CHARMM, AMBER, NAMD) and some QM/MM programs (Dynamo). Usually there is a structural biology workshop each year, alternating between molecular dynamics simulation of biopolymers and structure determination (e.g., X-Ray, NMR and cryo-EM).

3.A.2 The Training Program includes three types of workshops. A Research Workshop brings in top scientists within a research domain to discuss the current state of the art and ways that this compute-bound field might be able to successfully use facilities such as the PSC. An Application Workshop uses leading scientists within a research domain to teach the use of the current state-of-the-art programs within a specific computational biology area. A Technology Workshop teaches researchers the best ways to utilize technology in their research. These workshops have been well received by the community. We have offered 9 Research Workshops, 28 Application Workshops (not including Roadshows), and 9 Technology Workshops, and have had over 3000 total participants. Table I lists by type the various workshops that have been offered by the Biomedical Initiative at the PSC. These workshops have been funded as a part of PSC’s Research Resource unless otherwise noted.

The workshop most directly relevant to this application is the “Nucleic Acid and Protein Sequence Analysis” workshop, which has been offered continuously at the PSC since 1989. This week-long workshop was initially offered under the auspices of PSC’s Research Resource, but has since been funded by a two-year NHGRI grant that is currently up for its sixth competitive renewal. We have trained 344 researchers in the fifteen offerings. This week-long workshop has been continually over-subscribed, which led to the development of a two-day version that we have taken on the road to 35 universities and presented to almost 1500 participants, often with these presentations included as requirements in a number of courses. Included among the “Roadshow Stops” are: Howard University, Clark-Atlanta University, New Mexico State University, the University of Puerto Rico Medical Sciences and Cayey campuses, and the University of Texas at San Antonio and El Paso. Due to the demand and interest, we have also offered an advanced version of this workshop in Pittsburgh.

Table I. Workshops offered by PSC’s Biomedical Initiative since 1987.

  • Research Workshop (2-day workshops)

Epidemiological Modeling (12/91)

Fast Processes in Protein Folding (Fund: Biology at NSF)

High Performance Software for Computational Neuroscience (12/92)

3rd International on Human Chromosome 16 Mapping 1994 (Fund: DoE, 5/94)

Application of High Performance Computing in Bioengineering (10/94)

Gene Map Integration Research Workshop (12/95)

Biomedical Image Analysis and Visualization (7/98)

Statistical Analysis of Neuronal Data (5/02)

From Structure to Function: Frontiers of Biological Ion Channels (5/03)

  • Application Specific Workshop (3.5- to 5.5-day workshops)

Nucleic Acid and Protein Sequence Analysis (Fund: NHGRI and NCRR)

Academic course: Nucleic Acid and Protein Sequence Analysis (UC-Davis and Pitt)

Advanced Nucleic Acid and Protein Sequence Analysis

Molecular Mechanics and Dynamics (AMBER, CHARMM, NAMD)

Structure Determination with either cryo-EM (1x), NMR (2x) or X-Ray (3x)

Computational Neural Model (Mcell, Neosim)

Image Restoration (Microscopy, Pathology)

Biofluid Dynamics with Flexible Boundary Conditions

  • Technology Workshop (3.5-day workshops)

Supercomputing Techniques for Biomedical Researchers

Building and Using PC Clusters in Biomedical Research

The University of California at Davis engaged the group to help organize and present a three-credit course (MCB/NPB 298) in bioinformatics during the fall quarter of 1999. Drs. Deerfield and Nicholas each made three trips to Davis during the quarter to present lectures, conduct hands-on computer laboratory sessions, and consult with faculty and students on the term projects that are the basis for a grade in the course. A number of other internationally recognized bioinformatics experts (who were selected in consultation with the PSC staff) made a single trip to the University of California at Davis to present lectures and consult with students. This course is one of the first steps by the University of California at Davis to establish an extensive, multidisciplinary program in bioinformatics on that campus. A non-credit graduate-level course was taught at the University of Pittsburgh and Carnegie Mellon University under the auspices of a Keck grant. A grant to the University of Pittsburgh from the Howard Hughes foundation has funded the PSC to present this material in an undergraduate course at the University of Pittsburgh each spring term since 2000, and the PSC continues to present the course.

This extensive background in training in a broad range of computational biology techniques including bioinformatics, both in Pittsburgh and on university campuses around the country, gives the PSC staff abundant experience in all of the components of this program.

3.A.3 Bioinformatics Software Developed by PSC’s Biomedical Initiative has two major software development paths: (1) development or implementation of new algorithms as original code development, or (2) parallelization, optimization or “hardening” of existing codes.