Principal Investigator/Program Director (Last, First, Middle): Lu, Xinghua

INTRODUCTION TO AMENDMENT

We would like to thank the reviewers for their insightful and constructive critiques. Based on the reviewers’ comments, we have substantially modified the application, with modifications indicated by the bars in the right margin. A few critiques were raised by all three reviewers, while others are distinct, reviewer-specific concerns. In this introduction, we first summarize and address the concerns common to all three reviewers and then address the individual critiques.

Combined Comments

1. Significance, innovativeness, investigators and environment.

We are highly encouraged by the reviewers’ unanimous positive comments on the significance, innovativeness, investigators and environment of the proposed studies. The reviewers all pointed out that the proposed work is significant for its “obvious potential impact” on the field and that it “is unmistakably significant.” The reviewers also agreed that “all aspects of the proposed work are innovative” and that the work reflects an “exciting approach that clearly deserves to be tested.” We especially appreciate the reviewers’ expressed trust that the investigators are “an excellent team” with concrete collaborative productivity and are “well qualified” for the project.

2. Evaluation of annotation assistant system.

All the reviewers shared a common concern regarding the evaluation of the annotation assistant system proposed as Specific Aim 3 in the previous submission. The major concern was the lack of a usability evaluation of the system, an area in which the investigators do not have strong experience. The second reviewer also pointed out that devoting effort to the development of such a system was premature given its uncertain impact. We agree with the reviewers that, at the current stage, it is premature to propose the implementation of an automatic annotation assistant system. Thus, we have removed the original Specific Aim 3 and reduced the duration of the study to 3 years accordingly. We have kept the research components of the original Specific Aim 3 and folded them into the current Specific Aim 2. We believe that this restructuring of the specific aims will allow us to concentrate on statistical semantic analysis, developing annotation and information retrieval algorithms as the essential building blocks for a future implementation of the system. We will relegate the implementation and evaluation of such a system to the competitive renewal period of this project.

Specific Comments from Individual Reviewers

In the following paragraphs, we address the specific concerns raised by individual reviewers. Where space allows, we copy the reviewer’s comments verbatim (shown in italics); otherwise, we show the beginning and end of the comment paragraph to indicate which comment we are addressing.

Critique 1:

1. One minor point was that they seem to have some confusion in their related work where they implied that a binary classifier equates to each document getting only one classification, but that is not true of binary classifiers – it just means each document is evaluated for one class at a time.

We would like to thank the reviewer for clarifying this point. We understand that multiple binary classifiers can be trained and then applied to a document to perform multiple one-vs-rest classifications; thus a document can be labeled (annotated) with multiple classes. Indeed, the naïve Bayes text classifier was proposed as the baseline annotation algorithm for comparison with the proposed methods in Sections D.1.4 and D.2.1 of the original proposal. However, judging from the concerns raised by this reviewer and the second reviewer, we believe that the design was not well presented. In this amendment, we devote a new subsection (Section D.1.6) to describing this experiment in detail.
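To make the one-vs-rest setup concrete, the following is a minimal, self-contained sketch (not the proposal’s actual evaluation code): several independently trained binary naïve Bayes classifiers are applied to one document, so the document can receive multiple labels. The toy corpus, class names and labels are invented for illustration.

```python
import math
from collections import Counter

def train_binary_nb(docs, labels):
    """Train one binary (in-class vs. rest) multinomial naive Bayes classifier.
    docs: list of token lists; labels: list of booleans."""
    pos = [d for d, y in zip(docs, labels) if y]
    neg = [d for d, y in zip(docs, labels) if not y]
    vocab = {w for d in docs for w in d}

    def word_logprobs(subset):
        counts = Counter(w for d in subset for w in d)
        total = sum(counts.values())
        # Laplace smoothing over the shared vocabulary
        return {w: math.log((counts[w] + 1) / (total + len(vocab)))
                for w in vocab}

    prior = len(pos) / len(docs)  # assumes both classes are non-empty
    return {"log_prior": (math.log(prior), math.log(1 - prior)),
            "pos": word_logprobs(pos), "neg": word_logprobs(neg)}

def predict(model, doc):
    """Return True if the document is judged to belong to the class."""
    lp_pos, lp_neg = model["log_prior"]
    for w in doc:
        if w in model["pos"]:  # ignore out-of-vocabulary words
            lp_pos += model["pos"][w]
            lp_neg += model["neg"][w]
    return lp_pos > lp_neg

# Toy corpus: a document may carry several annotations at once.
docs = ["kinase phosphorylation signal".split(),
        "ribosome translation rna".split(),
        "kinase rna binding signal".split()]
annotations = {"signaling": [True, False, True],      # hypothetical labels
               "translation": [False, True, True]}

# One independent binary classifier per class.
models = {c: train_binary_nb(docs, y) for c, y in annotations.items()}
labels = [c for c, m in models.items()
          if predict(m, "kinase rna signal".split())]
```

Because each classifier makes its own in-class/rest decision, a document may receive zero, one, or several annotations, which is exactly the multi-label behavior the reviewer describes.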

2. Another minor point was that they propose to use full text only on articles in PubMed Central, but they likely could obtain many more full-text articles at their institution through MEDLINE, although those articles would not necessarily be publicly available.

We would like to thank the reviewer for pointing out this interesting direction, which we have adopted in the amended proposal.

Critique 2:

1. Although impressive claims are made for this technology compared to other approaches, the preliminary evidence provided for these claims is quite modest. … For example, the top GO terms associated with the strongest topic (table 2) had mutual information measures in the range of 10^-3 and below, not particularly impressive.

We would like to point out that MI is not a normalized quantity, and its absolute value usually depends on the sample size because of the empirical estimation of the probability mass P(X, Y). In most cases, the larger the sample size, the smaller the absolute MI value of a given joint event (X, Y). Thus, MI alone cannot be used as the criterion for judging goodness.
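The point that a small absolute MI value need not indicate a weak association can be illustrated with a short, hypothetical calculation (not drawn from our data): two perfectly dependent binary indicators whose joint event is rare still yield an MI of only about 0.01 bits, even though a normalized measure shows the dependence is maximal.

```python
import math

def mutual_information(joint):
    """MI (in bits) from a joint distribution given as {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Perfectly dependent indicators (topic present <=> GO term assigned),
# but the joint event is rare: it occurs in 0.1% of documents.
p = 0.001
joint = {(1, 1): p, (0, 0): 1 - p}
mi = mutual_information(joint)  # ~0.0114 bits, despite X == Y always
h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # marginal entropy
normalized = mi / h  # ~1.0: the dependence is in fact maximal
```

Here the raw MI equals the tiny marginal entropy of the rare indicator, so magnitudes in the 10^-3 range can coexist with very strong associations.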

2. The use of the term "concept" and "semantic content" as a synonym for the probabilistic topic definition is idiosyncratic and sometimes deceptive.

We have “overloaded” the term “semantic” following the convention of latent semantic indexing (LSI) (Deerwester et al., 1990), which is widely used in the information retrieval field and is closely related to our approach. In this context, “semantic content” refers to the overall topicality of a document, a somewhat higher-level abstraction than “semantics” in the conventional natural language processing context, which concentrates more on the word and phrase level (Jurafsky and Martin, 2000).

3. A rethinking of precisely what hypotheses need to be tested to demonstrate the value of this approach, and a reorganization of the research plan to focus on clear tests of these hypotheses would strengthen the proposal.

We appreciate and have adopted this insightful suggestion. We have restructured the specific aims of the proposal by clearly specifying the goals of the experiments and connecting them with the overall specific aims.

4. Another general problem is the failure to provide reasonable points of comparison to the proposed algorithms. Many generic statements are made about "improvements," but appropriate baselines are not provided.

The reorganization of the specific aims (see above) leads to more clearly defined subtasks. In the amended proposal, we have specified a baseline for most of the specific experiments.

5. Another general problem is the consistent failure to acknowledge possible problems in the research plan, and to provide suitable alternative approaches.

In the original proposal, alternative approaches for achieving the overall goal of each Specific Aim were implicitly embodied in the variety of experiments. For example, for identifying descriptive semantic topics, we proposed a fine-tuned LDA, a mixture of LDA analyzers and the multivariate information bottleneck. These experiments complement one another and thus serve as alternative approaches to one another. To further strengthen the study, we now also explicitly discuss the potential difficulties associated with each experiment and provide alternative approaches to address them.

6. The paragraph beginning with: The training corpora described in section D.1.1 are potentially problematic …. Neither of the latter corpora contains this information.

We have restated the goals of using the large MEDLINE and full-text journal article corpora and specified the hypotheses to be tested in the experiments utilizing these corpora; see Section D.1.1 for details. Briefly, the goals of utilizing the large MEDLINE corpus are: (1) developing and evaluating probabilistic topic models capable of identifying semantic topics across various domains of biomedical knowledge; (2) identifying more specific semantic topics as the basis for protein annotation; and (3) training and evaluating novel information retrieval algorithms. Development of these tools has broader impact beyond protein annotation; for example, they can be applied to biomedical literature indexing with MeSH. The reviewer’s concerns regarding the difficulty of evaluating the biological relevance of topics and the lack of GO annotations are well warranted. However, these potential problems can be addressed by utilizing the MeSH terms associated with every MEDLINE document. With the recent increase in studies on the lexicon of the GO system (McCray et al., 2002; Ogren et al., 2005) and on the mapping between GO and the Unified Medical Language System (UMLS) (Lomax and McCray, 2004), it is possible to map a topic (a distribution over the lexicon) to GO terms either manually or automatically. As for training the correspondence LDA (Corr-LDA) and multivariate IB models for automatic annotation, we will continue to utilize the GOA corpus as proposed in the original proposal.

7. The mixture of LDAs idea in D.1.3 is not well developed. …The "partial manual evaluation of their quality" is not an adequate evaluation plan.

We have rewritten the subsection to make it clearer and to explain the claim. The claim that a mixture of LDAs may outperform a flat LDA is based on a theoretical observation. The Dirichlet prior of a flat LDA model assigns a non-zero prior probability to every topic in every document. Thus, a flat LDA entertains the possibility that all topics exist in a document and assigns words to them, even to topics entirely unrelated to the main topics of the document. This leads to deteriorating performance as the number of topics used to model the corpus becomes large. The mixture of LDA analyzers model alleviates this problem by grouping documents with similar topic content into clusters and modeling the documents within a cluster with a relatively small number of topics. The overall diversity of topics is captured by the different LDA components of the mixture model.
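The prior-mass argument can be sketched numerically. Assuming a symmetric Dirichlet prior, and sampling from it with stdlib Gamma draws, every topic in a large flat topic set receives strictly positive prior weight in each document, whereas a cluster-specific LDA component works with a much smaller topic set. The topic counts and concentration parameter below are illustrative, not the proposed settings.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics
    by normalizing independent Gamma(alpha, 1) draws."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(42)

# Flat LDA over the whole corpus: one large topic set per document.
theta_flat = sample_dirichlet(0.5, 200, rng)
# Every one of the 200 topics gets non-zero prior mass in this document,
# so the model may assign words to topics unrelated to its main content.
assert all(w > 0 for w in theta_flat)

# Mixture of LDA analyzers: the document is first assigned to a cluster,
# then modeled with that cluster's much smaller topic set.
theta_cluster = sample_dirichlet(0.5, 20, rng)
```

The same positivity holds for the 20-topic component, but with far fewer irrelevant topics competing for a given document’s words.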

8. Although the idea of changing the task so that texts are mapped to groups of GO terms (section D.1.4) is interesting in many ways, … e.g. the category associated with nucleolus has as its most frequent term the stem "ribosom", an entirely separate cell component!

We agree that it is highly desirable to annotate proteins with GO terms that are as specific as possible, as human curators do; however, such annotation is not readily achievable by a contemporary computational agent, owing both to data sparseness and to the agent’s lack of full semantic and syntactic understanding of natural language (Hunter and Cohen, 2006). Thus, rather than giving up, or futilely attempting to learn classifiers for specific annotations from very few training cases, it is sensible to reach a middle ground by relaxing the goal to annotating with representative concepts and their corresponding GO terms. The concepts extracted by the proposed methods should represent the recurring concepts of the studied corpora, with information content as distinct or specific as the data support. Achieving this goal would lay the basis for a leap toward more specific annotation by combining probabilistic topic models, natural language processing and information extraction.

The reviewer’s concerns regarding at what level, and with which GO term, a concept should be represented are the crux of the problem. To address the first question, we believe that the proposed Bayesian model selection and information bottleneck methods provide principled approaches for determining the level of the concepts based on the observed data and the information content of the concepts. As for the second question, we propose to apply both manual and automatic approaches. The latter uses the GO categorizer (Joslyn et al., 2004), a principled approach for selecting the most representative GO term from a cluster of GO terms by studying their relationships on the GO graph and producing an ordered list of candidate GO terms.
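As an illustration of selecting a representative term from a cluster on an ontology graph (a simplified stand-in for the GO categorizer of Joslyn et al., not their actual algorithm), one can rank the terms of a hypothetical mini-ontology by their total graph distance to the cluster’s members:

```python
from collections import deque

# Hypothetical mini-ontology: child -> parents (GO is a DAG of is-a links).
parents = {
    "organelle": [],
    "nucleus": ["organelle"],
    "nucleolus": ["nucleus"],
    "ribosome": ["organelle"],
    "nuclear_envelope": ["nucleus"],
}

def distance(a, b):
    """Shortest-path length, treating is-a edges as undirected (BFS)."""
    graph = {n: set(ps) for n, ps in parents.items()}
    for child, ps in parents.items():
        for p in ps:
            graph[p].add(child)  # add the reverse (parent -> child) edge
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")

def representative(cluster):
    """Order all ontology terms by total distance to the cluster members."""
    return sorted(parents, key=lambda t: sum(distance(t, c) for c in cluster))

ranking = representative({"nucleolus", "nuclear_envelope"})
```

On this toy DAG the ranking places “nucleus,” the closest shared neighbor of the two cluster members, first; the actual GO categorizer uses more refined pseudo-distances over the real GO graph to produce its ordered candidate list.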

As for the specific example mentioned by the reviewer, we believe that the term “ribosom” is the stemmed form of “ribosomal” rather than “ribosome,” which correctly fits the context of the nucleolus, where ribosomal RNA is processed. This example further supports our notion that, without a human-like understanding of free text, probabilistically detecting topical context is more robust than deterministically attempting to assign highly specific annotations based on a few specific words.

9. The comparison described in section D.2.3 to "an algorithm similar to the best-performing... A fairer approach might look at multiple test corpora (e.g. using the GENIA corpus), or use the system in future competitions (such as future TREC Genomics tasks).

We agree with the reviewer regarding the difficulty of this experiment and the uncertainty of its value; we have therefore removed it from the proposal.

Critique 3:

We would like to thank the reviewer for the overall positive comments on the proposal. The reviewer’s specific concern regarding the evaluation of the annotation system is addressed at the beginning of this introduction.


A. SPECIFIC AIMS

The long-term goal of this project is to develop methods to facilitate automatic protein annotation based on the biomedical literature. Understanding the function of proteins has been, and remains, a major task of biomedical research. Knowledge of proteins is fundamental to understanding biological systems, the mechanisms of disease and human health. A perpetual task of biomedical informatics is to acquire and represent the current and future knowledge resulting from biomedical research. This knowledge should be represented in languages that are understandable to computational agents, so that it can be stored, retrieved, and used for reasoning and for discovering new knowledge. Currently, all literature-based protein annotation is performed manually, which, unfortunately, is extremely labor-intensive and cannot keep up with the pace of the growth of information. Indeed, with the completion of the genome sequences of several organisms, manual annotation of proteins has already become a rate-limiting step that hinders the acquisition of critical functional information for the large number of proteins described in the exploding amount of biomedical literature. In this study, we propose to develop and apply novel approaches based on probabilistic semantic analysis and information theory to enhance current efforts in automatic annotation.

Specific Aim 1: Identify/extract descriptive biological concepts from the biomedical literature. We will develop and extend algorithms based on advanced statistical semantic analysis and information theory to identify a set of descriptive biological concepts from the biomedical literature. To achieve this aim, we propose the following complementary approaches:

1. Identify descriptive and specific biological concepts by enhancing the latent Dirichlet allocation (LDA) model through fine-tuning the model parameters, incorporating more flexible parameter settings and training with large text corpora.

2. Develop a mixture of LDA analyzers to model the literature from various biomedical domains.

3. Identify informative biological topics using information bottleneck (IB) approaches.

4. Associate the identified biological topics with the most representative Gene Ontology (GO) terms through manual and automatic annotation.