Medical Word Net and Medical Fact Net: New Information Resources for Consumer Health

1.10.2019

Abstract
If a medical information system is to mediate between experts and non-experts, then it must be able to comprehend both expert and non-expert medical vocabulary and to map between the two. Much effort has been devoted to the study of expert medical vocabulary. As computers become increasingly important to the delivery of medical information, however, it becomes more urgent to understand also the language used by non-experts.

The English-language lexical database WordNet plays an important role in current natural language processing (NLP) applications and research, and it has inspired counterparts in some 40 languages. WordNet has wide coverage of medical terms, but its treatment of these terms is in many ways inadequate. The primary goal of this R21 proposal is to create Medical Word Net (MWN), a systematic revision and validation of WordNet’s coverage in the medical domain.

We shall draw both on our own work in medical ontology and in the construction of WordNet, and also on recent developments in information retrieval technology directed towards the construction of what are called ‘proposition banks’ or ‘fact databases.’ This means that we shall focus not just on single words but also on the sentences in which such words occur. We will assemble a large corpus of natural-language sentences providing medically meaningful contexts for MWN terms. This will enable us to isolate errors in WordNet’s existing medical coverage and to uncover new medically relevant terms and senses in a systematic way. It will also add to the power of MWN for NLP applications.

This corpus will derive primarily from online health information sources targeted to consumers. By using expert and non-expert human validators we will construct two sub-corpora, called Medical Fact Net (MFN) and Medical Belief Net (MBN). MFN will consist of statements accredited as true on the basis of a rigorous process of validation by medical experts, MBN of statements which non-experts believe to be true.

We shall test the methodology by building an initial corpus of some 40,000 sentences and evaluating its benefits for information retrieval. If this methodology is successful, then it can be scaled to much larger corpora, embracing terms and sentences in other languages and in principle also terms and sentences used by medical experts. MFN and MBN will also support new types of research on consumer health from the perspectives of both psychology and linguistics, for example in exploring individual and group divergences in medical knowledge and vocabulary and in understanding non-expert medical reasoning and decision-making.

Draft Consortium Agreement

The contractual agreement is between SUNY Buffalo and Princeton University as sub-contractor. Barry Smith will direct the work in Buffalo, Christiane Fellbaum in Princeton.

Smith’s primary focus is the use of formal tools in the construction, integration and alignment of ontologies and terminologies in the domain of biomedical research. He has been involved in human subjects experiments designed to establish non-expert ontologies in the domain of spatial knowledge.

Fellbaum is one of the two Principal Investigators of the lexical database WordNet and is a member of the Cognitive Science Laboratory, where WordNet was created and is maintained. Dr. Fellbaum also directs a project "Collocations in the German Language of the 20th Century" at the Berlin-Brandenburg Academy of Sciences, which uses computational methods to study word sequences.

Smith and Fellbaum have been involved for some time in research at the interface between WordNet and formal ontologies.

Non-expert subjects will be recruited from the population of undergraduate students in non-medical disciplines in Princeton, using the standard methods for payment and recruitment employed by the Department of Psychology.

Expert subjects will be recruited from the population of students registered in the Primary Care Externship Program of the School of Medicine and Biomedical Sciences of the University at Buffalo.

The division of labor and responsibility reflects the expertise in Buffalo in the field of medical ontology and terminology and the long experience in Princeton in the construction of digital lexical databases.

The responsibilities will be divided as follows:

Shared between the two sites:

Compilation of MBN / MFN databases

Buffalo:

Expert validation of MFN database

Computer processing of MBN / MFN / MWN

Testing in query-answering systems and biomedical information retrieval

Mappings to standard terminology systems

Princeton:

Compilation of MWN (lexical database for the medical/consumer health domain)

Non-expert validation of MBN / MFN databases

A. SPECIFIC AIMS

A1. WordNet (Miller, 1995, Fellbaum, 1998) is the principal lexical database used in NLP research and applications. While WordNet’s current version (2.0) has very broad medical coverage, it manifests a number of defects, which reflect the lack of domain expertise on the part of the lexicographers. The present proposal responds to calls from the research community to rectify these defects (Magnini and Strapparava 2001, Bodenreider and Burgun, 2002, Burgun and Bodenreider, 2002).

We will create Medical Word Net (MWN), an open source lexical database which will revise and extend WordNet’s existing medical coverage in light of recent advances in medical terminology research. We will focus initially on the English-language single word expressions used and understood by non-experts, systematically validating WordNet’s medical coverage in a way which will lead to four types of modifications:

  1. the resources of a systematically assembled and validated large corpus of sentential contexts for word usage will extend existing glosses;
  2. the definitions of existing terms and the relations linking such terms (such as is_a and part_of) will be validated for correctness by medical experts;
  3. technical terms not used by non-expert speakers of English, and obsolete terms (including some derived from medieval medicine), will be eliminated;
  4. we will map the results to existing medical terminology resources such as MeSH.

We anticipate that MWN will stabilize in a lexicon of the order of 1,500 single word expressions with some 4,000 distinguished word senses. Nearly all of the relevant word forms are present already in WordNet, but their specifically medical senses are often either unrecorded or are treated inadequately. MWN will be constructed on the basis of a scientific methodology designed (1) to document natural language sentential contexts for each relevant word sense in such a way that the expressed information can be (2) validated by medical experts and (3) accessed automatically by NLP applications used for purposes such as information retrieval, machine translation, question-answer systems, text summarization, and language generation.

A major stumbling block for existing NLP applications is that of automatic sense disambiguation. A machine can detect automatically and with high reliability that a given occurrence of the word feel is a verb. But it cannot determine which of a variety of alternative meanings it might have. WordNet 2.0 distinguishes 13 such meanings in all, of which at least three have an obvious medical significance:

1. experience – (undergo an emotional sensation: She felt resentful; He felt regret)

2. find – (come to believe on the basis of emotion, intuitions, or indefinite grounds: I feel that he doesn't like me; I find him to be obnoxious; I found the movie rather entertaining)

3. sense – (perceive by a physical sensation, e.g., coming from the skin or muscles: He felt the wind; She felt an object brushing her arm; He felt his flesh crawl; She felt the heat when she got out of the car; He feels pain when he puts pressure on his knee.)

4. feel – (seem with respect to a given sensation given: My cold is gone – I feel fine today; She felt tired after the long hike)

5. feel – (have a feeling or perception about oneself in reaction to someone’s behavior or attitude: She felt small and insignificant; You make me feel naked; I made the students feel different about themselves)

6. feel – (undergo passive experience of: We felt the effects of inflation; her fingers felt their way through the string quartet; she felt his contempt of her)

7. feel – (be felt or perceived in a certain way: The ground feels shaky; The sheets feel soft)

8. feel – (grope or feel in search of something: He felt for his wallet)

9. feel, finger – (examine by touch: Feel this soft cloth!; The customer fingered the sweater)

10. palpate, feel – (examine (a body part) by palpation: The nurse palpated the patient's stomach; The runner felt her pulse)

11. feel – (find by testing or cautious exploration: He felt his way around the dark room)

12. feel – (produce a certain impression: It feels nice to be home again)

13. feel – (pass one's hands over the sexual organs of: He felt the girl in the movie theater)
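
To make the disambiguation problem concrete, the following minimal sketch scores a sentence against three of the senses listed above by counting the words it shares with each gloss and its example sentences. This is a simplified gloss-overlap (Lesk-style) heuristic chosen purely for illustration; it is not the method proposed here. The sense labels, the stopword list and the function names are assumptions introduced for the example.

```python
import re

# Three of the WordNet 2.0 senses of "feel" quoted above, each given as its
# gloss plus some of its example sentences. The dictionary keys are
# illustrative labels, not WordNet identifiers.
SENSES = {
    "sense": "perceive by a physical sensation coming from the skin or muscles "
             "he feels pain when he puts pressure on his knee",
    "palpate": "examine a body part by palpation the nurse palpated the "
               "patient's stomach the runner felt her pulse",
    "grope": "grope or feel in search of something he felt for his wallet",
}

# A small hand-picked stopword list (an assumption of this sketch).
STOPWORDS = {"a", "the", "of", "by", "or", "in", "on", "he", "she",
             "his", "her", "for", "when", "from", "to"}


def tokens(text):
    """Lower-case the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def disambiguate(sentence, senses=SENSES):
    """Return the sense whose gloss shares most words with the sentence,
    together with the overlap score for every candidate sense."""
    context = tokens(sentence)
    scores = {label: len(context & tokens(gloss)) for label, gloss in senses.items()}
    return max(scores, key=scores.get), scores


best, scores = disambiguate("The doctor felt the patient's pulse")
print(best, scores)
```

The heuristic succeeds here only because the test sentence happens to reuse words from the example sentences attached to the palpation sense. This dependence on recorded contexts is one reason why a larger stock of validated sentential contexts for each sense is useful to such applications.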

WordNet’s architecture for distinguishing word senses has made an important contribution to advancing solutions to the problem of automatic sense disambiguation, and we will build upon this contribution here. To this end we shall assemble a pilot version of a large, heterogeneous, open-source corpus of sentences about medical phenomena in the English language. The corpus will be restricted to natural-language sentences which are grammatically complete, generic and syntactically simple, and which have been rated as understandable by non-expert human subjects in controlled questionnaire-based experiments. The assembly of this pilot corpus, which we estimate will contain some 40,000 validated sentences, will also be used for purposes of validation of MWN. It will itself support the creation of MWN by yielding new families of words and word senses for inclusion.

Our use of human validations means that we can further extend the usefulness of the corpus for purposes of testing new applications for consumer health information retrieval and also allowing new sorts of research in linguistics and psychology. To this end we shall exploit our validation data to create two sentence subcorpora, called Medical Fact Net (MFN) and Medical Belief Net (MBN).

MFN will consist of those sentences in the pilot corpus which receive high marks for correctness on being assessed by medical experts. MFN is thus designed to constitute a representative fraction of the true beliefs about medical phenomena which are intelligible to non-expert English-speakers.

MBN will consist of those sentences in the pilot corpus which receive high marks for understandability. MBN is thus designed to constitute a representative fraction of the (true and false) beliefs about medical phenomena distributed through the population of English speakers.
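
The selection of the two subcorpora can be sketched as a pair of filters over the rated pilot corpus. The sketch below is a minimal illustration under stated assumptions: the rating scale (0 to 1), the cut-off values for "high marks", the example sentences and their ratings, and all names in the code are hypothetical and are not taken from the validation protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatedSentence:
    text: str
    expert_correctness: float     # mean expert rating; a 0..1 scale is assumed
    lay_understandability: float  # mean non-expert rating; a 0..1 scale is assumed


def split_subcorpora(corpus, correct_min=0.8, understand_min=0.8):
    """Return (MFN, MBN): sentences with high marks for correctness, and
    sentences with high marks for understandability. Both thresholds are
    placeholder values."""
    mfn = [s for s in corpus if s.expert_correctness >= correct_min]
    mbn = [s for s in corpus if s.lay_understandability >= understand_min]
    return mfn, mbn


# Invented sentences with invented ratings, for illustration only.
pilot = [
    RatedSentence("Chickenpox is contagious.", 0.95, 0.97),
    RatedSentence("Cold weather causes colds.", 0.10, 0.92),
]

mfn, mbn = split_subcorpora(pilot)

# Sentences that are understandable to non-experts but not accredited by experts.
disparity = [s for s in mbn if s not in mfn]
print([s.text for s in disparity])
```

Comparing the two lists in this way gives a first, crude handle on sentences that non-experts understand but that experts do not accredit as true.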

Both MFN and MBN will inherit from MWN the (WordNet-based) formal architecture. (Fellbaum 1998) However, we will enhance this architecture to maximize its usefulness in information retrieval and other applications. The validation process that is involved in the construction of MFN will be used to detect errors in the existing WordNet, and also to ensure that the coverage of the natural language medical lexicon in MWN is of a scientific level sufficient to allow MWN technology to work with terminology and ontology systems designed for use by experts.

Compiling MFN and MBN in tandem will allow systematic assessment of the disparity between lay beliefs and vocabulary as concerns medical phenomena and the corresponding expert medical knowledge. The ultimate goal of our work on MFN is to document the entirety of the medical knowledge that is capable of being understood by average adult consumers of healthcare services in the United States today who have no special knowledge of medical phenomena. If the methodology for the creation and validation of the pilot corpus here described proves successful, then we believe that the preconditions for the realization of this much larger goal will have been established. Already the response from NLP researchers and from online information providers to our initial work on MFN/MBN convinces us that this realization would have considerable significance for the management of consumer health information in the future.

A2.Evaluation

The creation of MWN, MBN and MFN involves the use of well-tested methods for the creation of lexical databases and sentence corpora and for the handling of questionnaires in the validation of sentences for understandability and correctness. We will demonstrate the scientific contributions involved in the creation of this set of resources by showing that these resources themselves bring tangible benefits to the task of consumer health information retrieval.

We will evaluate MWN and MFN by measuring the benefits they bring when incorporated into an existing on-line consumer health portal based on term-search technology. (We already have expressions of interest in this regard.)

We will also test whether exploiting the resources of MWN and MFN can lead to improved results in retrieval of expert information by assessing the added value generated by their use for purposes of document retrieval as measured against standard benchmarks.

The community of research groups using WordNet for applications is already very large and the results of our work will be made freely available at every stage to other researchers, who will be invited to propose further strategies for evaluation.

B. Background and Significance

B1. The Importance of Studying Non-Expert Medical Language

This proposal draws on studies of computer-based tools for consumer health information retrieval. (Slaughter 2002, Smith et al., 2002, Tse, 2002, Tse and Soergel 2003, Tse 2003, Zeng et al., in press) Such studies point to a mismatch between existing tools and the non-expert language used by most consumers – the language used not only by patients but also by family members, advisors, administrators, lawyers and so forth, and to some degree also by nurses and physicians.

Where the usage of medical terms by professionals is at least in principle subject to control by standardization efforts, the highly contextually dependent usage of medical terms on the part of lay persons is much more difficult to capture in applications. The state-of-the-art of how to use a lay term is a matter of convention established more or less ephemerally by everyday talk, not only between experts and non-experts but also among the non-experts themselves.

The taxonomies reflecting popular lexicalizations in all domains are much less elaborate at both the upper and lower levels than in the corresponding technical lexicons. There are no popular terms linking infectious disease and mumps, so that in the popular medical taxonomy of diseases the former immediately subsumes the latter. The popular medical vocabulary naturally covers only a small segment of the encyclopedic vocabulary of medical professionals. It lexicalizes mainly at the level of taxonomic orders. Popular medical terms (flu) are often fuzzier than technical medical terms. Many popular terms also cover a larger range of referent types than do technical terms; others may cover only part of the extension of their technical counterparts. We hypothesize, however, that with few exceptions the focal meanings (Berlin & Kay 1969) will be identical. (Constructing MFN and MBN will allow us to test this and related hypotheses in a systematic way.)
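
The difference in elaboration between the two kinds of taxonomy can be illustrated by comparing is_a paths for the mumps example. In the sketch below the popular taxonomy follows the text: infectious disease immediately subsumes mumps. The intermediate nodes in the technical taxonomy are illustrative assumptions and are not quoted from any particular terminology system; the function name is likewise hypothetical.

```python
# is_a links stored as child -> parent.
POPULAR = {
    "mumps": "infectious disease",
}

# Intermediate nodes below are illustrative assumptions only.
TECHNICAL = {
    "mumps": "paramyxovirus infection",
    "paramyxovirus infection": "viral disease",
    "viral disease": "infectious disease",
}


def path_to(term, ancestor, is_a):
    """Follow is_a links upward from term until ancestor is reached."""
    path = [term]
    while path[-1] != ancestor:
        parent = is_a.get(path[-1])
        if parent is None:
            raise ValueError(f"{ancestor!r} does not subsume {term!r}")
        path.append(parent)
    return path


print(path_to("mumps", "infectious disease", POPULAR))
print(path_to("mumps", "infectious disease", TECHNICAL))
```

A mapping between MWN and an expert terminology has to reconcile paths of different lengths of this kind, with a single popular link corresponding to a chain of several technical links.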

The lower degree of differentiation in popular language leads to intersections with families of technical terms such that the popular terms fall short of exact coverage. Many single terms used by both experts and non-experts – for example bacteria, colon, cyst, dermatitis, embryo, glucose, hepatitis, melanoma, septic, spasm – belong to much larger families of cognate terms whose remaining members (for example acystia, baeocystin, blastocyst, cysteamine, cysteine, polycystic) are used only by experts.

B2. Mismatches in Doctor-Patient Communication

The skills of a physician in general practice comprise the ability to acquire relevant and reliable information through communication with patients, and then it is non-expert language that serves as the medium for knowledge exchange across the linguistic divide. The physician must also have the practical knowledge which enables him to convey diagnostic and therapeutic information in ways tailored to the individual patient.

Since the physician, too, is a member of the wider community of non-experts and continues to use the non-expert language for everyday purposes, one might assume that there are no difficulties in principle keeping him from being able to formulate medical knowledge in a vocabulary that the patient can understand. As (Slaughter 2002, Smith et al., 2002) have shown, however, there are limits to this competence. (Slaughter 2002) examines dialogue between physicians and patients in the form of question-answer pairs, focusing especially on the relations documented in the UMLS Semantic Network. Only some 30% of the relations used by professionals in their answers directly match the relations consumers used in their questions. An example of one such question-answer pair is taken from (Slaughter, p. 224):

Question Text: My seven-year-old son developed a rash today that I believe to be chickenpox. My concern is that a friend of mine had her 10-day-old baby at my home last evening before we were aware of the illness. My son had no contact with the infant, as he was in bed during the visit, but I have read that chickenpox is contagious up to two days prior to the actual rash. Is there cause for concern at this point? [...]
Answer Text: (a) Chickenpox is the common name for varicella infection. [...] (b) You are correct in that a person with chickenpox can be contagious for 48 hours before the first vesicle is seen. [...] (c) The fact that your son did not come in close contact with the infant means he most likely did not transmit the virus. (d) Of concern, though, is the fact that newborns are at higher risk of complications of varicella, including pneumonia. [...] (e) There is a very effective means to prevent infection after exposure. A form of antibody to varicella called varicella-zoster immune globulin (VZIG) can be given up to 48 hours after exposure and still prevent disease. [...]

Such examples illustrate also that there are lexically rooted mismatches in communication (which may in part reflect legal and ethical considerations) between experts and non-experts. Professionals often do not re-use the concepts and relations made explicit in the questions put to them by consumers. In our example, the questioner requests a yes/no-judgment on the possibility of contagion in a 10-day-old baby. In fact, however, only section (c) of the answer responds to this question, and this in a way which involves multiple departures from the type of non-expert language which the questioner can be presumed to understand. Rather, physicians expand the range of concepts and relations addressed (for example through discussion of issues of prevention, etc.).