Polysemy in a Broad-Coverage Natural Language Processing System

William Dolan, Lucy Vanderwende, Stephen Richardson

Microsoft Research

1.0 Introduction

MS-NLP is a broad-coverage natural language understanding system that has been under development at Microsoft Research since 1991. Perhaps the most notable characteristic of this effort has been its emphasis on arbitrarily broad coverage of natural language phenomena. The system’s goal is to produce a useful linguistic analysis of any piece of text passed to it, regardless of whether that text is formal business prose, casual email, or technical writing from an obscure scientific domain. This emphasis on handling any sort of input has had interesting implications for the design of morphological and syntactic processing. Equally interesting, though, are its implications for semantic processing. The issue of polysemy and the attendant practical task of word sense disambiguation (WSD) take on entirely new dimensions in the context of a system like this, where a word might have innumerable possible meanings. A starting assumption, for example, is that MS-NLP will routinely have to interpret words and technical word senses that are not described in standard reference dictionaries.

This chapter describes our approach to the processing of lexical semantics in MS-NLP (see Heidorn 1999 for a comprehensive description of the system). This approach centers on MindNet, an automatically-constructed resource that blurs the distinction between a computational lexicon and a highly-structured lexical example base. MindNet represents the convergence of two previously distinct strains of research: largely symbolic work on parsing machine-readable dictionaries (MRDs) to extract structured knowledge of lexical semantics, and largely statistical efforts aimed at discriminating word senses by identifying similar usages of lexical items in corpora. We argue in this chapter that MindNet’s unique structure offers solutions to many otherwise troubling problems in computational semantics, including the arbitrary nature of word sense divisions and the problems posed by unknown words and word senses.

Is Word Sense Disambiguation Feasible?

The idea that words in a sentence can be objectively labeled with a discrete sense is both intuitively obvious and demonstrably wrong. Humans turn out to be unreliable word sense taggers, frequently disagreeing with one another and even with themselves on different days. (Computationally-oriented work on the arbitrariness of dictionary sense assignments includes Kilgarriff 1993 and Atkins 1987, 1991.) Faced with the set of choices in a desktop dictionary, where a highly polysemous word like line can have scores of senses, intersubjective agreement on optimal sense assignments – even for skilled human taggers working on the closed corpus of the dictionary itself – can be as low as 60% to 70%. Most worrisome is the fact that this sort of performance certainly cannot represent a lower bound on the difficulty of this task, since desktop dictionaries are hardly comprehensive in their list of word meanings. A truly broad-coverage lexicon would have to represent far more senses, and it is likely that a larger set of sense choices will lead to more disagreements among taggers.

The sense divisions in any lexicon are ultimately arbitrary, and fail to adequately describe actual lexical usage. Kilgarriff (1993), surveying this issue, concludes that word sense distinctions will never succumb to a neat classification scheme that would allow straightforward assignments of lexicographic senses to corpus occurrences of words. Given the importance of automating WSD for various computational tasks like information retrieval (Voorhees, 1994) and machine translation, this is a troubling finding. If the nature of this task cannot even be adequately formulated, attempts to automate it are bound to fail.

Consider the pair of sentences I waxed the skis and I waxed the cars. The verb wax in each sentence can be readily disambiguated by MS-NLP on syntactic grounds alone. At the core of the system’s lexicon are the Longman Dictionary of Contemporary English (LDOCE) and the American Heritage 3rd Edition (AHD3) dictionaries, and though together the two dictionaries provide 21 distinct senses of this word, only two – one from each dictionary – are transitive verb senses:

LDOCE wax v, 1: to put wax on, esp. as a polish

AHD wax v, 1: to coat, treat, or polish with wax

Either or both of these senses could be assigned to wax in the sentences I waxed the skis/I waxed the cars, yet neither is quite right. The first suggests that the motivation for waxing skis might be to polish them. This is not exactly wrong, of course, but it fails to reflect the intuition that any polishing that occurs during the process of waxing skis is incidental to the primary functional goal. This is in sharp contrast to the primarily aesthetic goal of polishing associated with waxing cars. The AHD sense, meanwhile, is ambiguous: is the intent to coat, treat, or polish? Or is it some combination of these? (See Ide & Veronis 1993 for a discussion of problematic MRD-derived hypernymy chains.)
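For concreteness, the kind of syntactic narrowing described above can be sketched as a simple filter over a sense inventory; the inventory below and its attribute names are hypothetical, not MS-NLP data structures:

```python
# Minimal sketch of syntax-based sense pruning: keep only the senses of
# "wax" whose subcategorization frame matches the observed transitive use
# in "I waxed the skis".  The inventory and glosses here are illustrative.

WAX_SENSES = [
    {"source": "LDOCE", "num": 1, "pos": "verb", "frame": "transitive",
     "gloss": "to put wax on, esp. as a polish"},
    {"source": "AHD", "num": 1, "pos": "verb", "frame": "transitive",
     "gloss": "to coat, treat, or polish with wax"},
    {"source": "LDOCE", "num": 2, "pos": "verb", "frame": "intransitive",
     "gloss": "(of the moon) to grow gradually larger"},
    # ... the remaining senses of the 21 are elided ...
]

def senses_matching(pos: str, frame: str):
    """Return only the senses compatible with the observed syntax."""
    return [s for s in WAX_SENSES if s["pos"] == pos and s["frame"] == frame]

# "I waxed the skis": wax heads a clause with a direct object -> transitive.
for sense in senses_matching("verb", "transitive"):
    print(sense["source"], sense["num"], "-", sense["gloss"])
```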

Does it matter whether a computational system can distinguish between such fine shadings of a word’s meaning? It has certainly been argued that, for the practical tasks facing NLP, the sense divisions provided by a dictionary are already too fine-grained (Slator & Wilks 1987; Krovetz & Croft 1992; Dolan 1994), and much of the literature on WSD assumes very coarse-grained sense distinctions.

The suggestion that NLP systems do not need to make fine sense discriminations, however, seems more an artifact of the state of the art in the field than an inherent fact about the granularity of lexical knowledge required for useful applications. Performance on tasks like information retrieval and machine translation is currently poor enough that even accurate identification of homograph-level distinctions is useful. Distinguishing between the musical and fish senses of bass, for instance, can mean the difference between a poor result and one that is at least useful. In this research milieu, making an effort to distinguish between waxing as coating and waxing as polishing may seem misguided.

In our view, though, collecting and exploiting extremely fine-grained detail about word meanings is crucial if broad-coverage NLP is ever to become practical reality. For instance, the distinction between waxing as coating with wax vs. polishing with wax has important implications for translation: languages like Greek and French lexically distinguish these two possibilities. French, in fact, distinguishes among at least four classes of objects that can be waxed:

skis			farter
cars			passer le cire, passer le polish
furniture, floors	cirer, encaustiquer
shoes			cirer

Merely identifying an instance of wax with one of LDOCE’s or AHD3’s dictionary senses is not useful in trying to translate this word. Such problems are rife in machine translation (see Ten Hacken, 1990 for other examples), and given enough language pairs, every sense in the English lexicon will prove problematic in the same way as wax. Furthermore, though machine translation is often cited as the extreme example of an application that might require extremely fine-grained sense assignments, it is not the only one. As information retrieval moves beyond the current model of returning a lump of possibly (but probably not) relevant documents, precision and recall gains will surely follow from improved NLP capabilities in making delicate judgements about lexical relationships in documents and queries.

Our conclusion is that a broad-coverage NLP system ultimately intended to support high quality applications simply cannot be built around the traditional view of WSD as involving the assignment of one or more discrete senses to each word in the input string. Like humans, machines cannot be expected to perform reliably on a task that is incorrectly formulated. The discrete word senses found in a dictionary are useful abstractions for lexicographers and readers alike, but they are fundamentally inadequate for our purposes.

In an effort to address some of these issues, we have settled on an approach that is very much consistent with the view of polysemy described in Cruse (1986). In Cruse’s model, related meanings of a word blend fluidly into one another, and different aspects of a word’s meaning may be emphasized or de-emphasized depending on the context in which it occurs. The next section describes MindNet, and shows how our processing of the discrete senses in MRDs yields a representation of lexical semantics with the continuous properties of Cruse’s model. In addition, we explore how this representation can be arbitrarily extended without human intervention – an important ability, since we cannot a priori predict or restrict the degree of polysemy that might need to be encoded for any individual word.

2.0 MindNet

MS-NLP encompasses a set of methodologies for storing, weighting, and navigating through linguistic representations produced during the analysis of a corpus. These methodologies, along with the database that they yield, are collectively referred to as MindNet. The first MindNet database was built in 1992 by George Heidorn. For full details and background on the creation and use of MindNet, readers are referred to Richardson et al. (1998), Richardson (1997), Vanderwende (1996), and Dolan et al. (1993).

Each version of the MindNet database is produced by a fully automatic process that exploits the same broad-coverage NL parser at the heart of the grammar checker incorporated into Microsoft Word 97®. For each sentence or fragment that it processes, this parser produces syntactic parse trees and deeper logical forms (LFs), each of which is stored in its entirety in the database. These LFs are directed, labeled graphs that abstract away from surface word order and hierarchical syntactic structure to describe semantic dependencies among content words. LFs capture long-distance dependencies, resolve intrasentential anaphora, and normalize many syntactic and morphological alternations.
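As a rough illustration only (not the system’s internal representation), such an LF can be pictured as a small graph object whose edges carry relation labels:

```python
# Illustrative sketch of a logical form (LF): a directed, labeled graph over
# content words, abstracting away from surface order and syntactic structure.
from dataclasses import dataclass, field

@dataclass
class LFNode:
    lemma: str
    # outgoing labeled edges: relation name -> list of target LFNodes
    relations: dict = field(default_factory=dict)

    def add(self, relation: str, target: "LFNode") -> "LFNode":
        self.relations.setdefault(relation, []).append(target)
        return target

# A simplified LF for "the car was driven by the motorist": the passive
# surface syntax is normalized to the same Logical_Subject/Logical_Object
# dependencies as the active form.
drive = LFNode("drive")
drive.add("Logical_Subject", LFNode("motorist"))
drive.add("Logical_Object", LFNode("car"))
```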

About 25 semantic relation types are currently identified during parsing and LF construction, including Hypernym, Logical_Subject, Logical_Object, Synonym, Goal, Source, Attribute, Part, Subclass and Purpose. This rich (and slowly expanding) set of relation types may be contrasted with simple co-occurrence statistics used to create network structures from dictionaries by researchers including Veronis and Ide (1990), Kozima and Furugori (1993), and Wilks et al. (1996). Labeled relations, while more difficult to obtain, provide crucially rich input to the similarity function that is used extensively in our work.

After LFs are created, they are fully inverted and propagated throughout the entire MindNet database, being linked to every word that they contain. Because whole LF structures are inverted, rather than just relational triples, MindNet stores a rich linguistic context for each instance of every content word in a corpus. This representation simultaneously encodes paradigmatic relations (e.g. Hypernym, Synonym) as well as syntagmatic relations (e.g., Location, Goal, Logical_Object).
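A minimal sketch of this inversion step, representing each LF simply as the complete set of its labeled edges (a simplifying assumption for the example, not the actual storage format):

```python
# Rough sketch of full inversion: every content word in an LF is indexed back
# to the entire structure it occurs in, not merely to its own relational
# triple, so the whole linguistic context is retrievable from any of its words.
from collections import defaultdict

# One whole LF, e.g. from a definition along the lines of "motorist: a person
# who drives a car", flattened here to a tuple of (word, relation, word) edges.
lf_motorist = (
    ("motorist", "Hypernym", "person"),
    ("drive", "Logical_Subject", "motorist"),
    ("drive", "Logical_Object", "car"),
)

# word -> list of entire LF structures containing that word
inverted_db = defaultdict(list)

def invert(lf):
    words = {w for (src, _rel, tgt) in lf for w in (src, tgt)}
    for word in words:
        inverted_db[word].append(lf)

invert(lf_motorist)
# Looking up "car" retrieves the full motorist LF, with its Hypernym and
# Logical_Subject context, not just the edge ("drive", "Logical_Object", "car").
print(inverted_db["car"])
```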

Researchers who produced spreading activation networks from MRDs, including Veronis & Ide (1990) and Kozima and Furugori (1993), typically only implemented forward links (from headwords to their definition words) in those networks. Words were not related backward to any of the headwords whose definitions mentioned them, and words co-occurring in the same definition were not related directly.

There have been many other attempts to process dictionary definitions using heuristic pattern matching (e.g. Chodorow et al. 1985), specially constructed definition parsers (e.g., Wilks et al. 1996, Vossen 1995), and even general-coverage syntactic parsers (e.g. Briscoe and Carroll 1993). However, none of these has succeeded in producing the breadth of semantic relations across entire dictionaries exhibited by MindNet. Most of this earlier work, in fact, focused exclusively on the extraction of paradigmatic relations, in particular Hypernym relations (e.g., car -Hypernym-> vehicle). These relations, as well as any syntagmatic ones that might be identified, have generally taken the form of relational triples, with the larger context from which they were extracted being discarded (see Wilks et al. 1996). As for labeled relations, only a few researchers (recently, Barrière and Popowich 1996) appear to have been interested in entire semantic structures extracted from dictionary definitions, though they have not reported extracting a significant number of them.

As noted above, the core of MindNet has been extracted from two MRDs, LDOCE and AHD3. (This MRD-derived MindNet serves as the source of all the examples in the remainder of this chapter.) Despite our initial focus on MRDs, however, MS-NLP’s parser has not been specifically tuned to process dictionary definitions. Instead, all enhancements to the parser are geared to handle the immense variety of general text, regardless of domain or style. Fresh versions of MindNet are built regularly as part of a normal regression process. Problems introduced by daily changes to the underlying system or parsing grammar are quickly identified and fixed. Recently, MindNet was augmented by processing the full text of Microsoft Encarta®. The Encarta version of MindNet encompasses more than 5 million inverted LF structures produced from 497,000 sentences; building this MindNet took 34 hours on a P2/266 (see Richardson et al. 1998 for details).

Weighted Paths

Inverted LF structures facilitate access to direct and indirect relationships between the root word of each structure, which for dictionary entries is the headword, and every other word contained in the structure. These relationships, consisting of one or more semantic relations connected together, constitute paths between two words. For instance, one path linking car and person is:

car <-Logical_Object- drive -Logical_Subject-> motorist -Hypernym-> person

An extended path is a path created from subpaths in two different inverted LF structures. For example, car and truck are not related directly by a semantic relation or by an LF path from any single LF structure. However, if the two paths car -Hypernym-> vehicle and vehicle <-Hypernym- truck, each from a different LF structure, are joined on the shared word vehicle, the resulting path is car -Hypernym-> vehicle <-Hypernym- truck. Adequately constrained, extended paths have proven invaluable in determining the relationship between words in MindNet that would not otherwise be connected.
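A minimal sketch of this joining operation, using the car/truck example above (the path representation is our own assumption, made for the sake of the example):

```python
# Minimal sketch of extended-path construction: two subpaths from different
# inverted LF structures are joined on a shared word.  A path step is
# (source_word, relation, direction, target_word).

path_a = [("car", "Hypernym", "->", "vehicle")]    # from one LF structure
path_b = [("truck", "Hypernym", "->", "vehicle")]  # from another LF structure

def extend(left, right):
    """Join left with the reversed right path when both end on the same word."""
    if left[-1][-1] != right[-1][-1]:
        return None   # no shared word to join on
    reversed_right = [
        (tgt, rel, "<-" if d == "->" else "->", src)
        for (src, rel, d, tgt) in reversed(right)
    ]
    return left + reversed_right

print(extend(path_a, path_b))
# [('car', 'Hypernym', '->', 'vehicle'), ('vehicle', 'Hypernym', '<-', 'truck')]
# i.e. car -Hypernym-> vehicle <-Hypernym- truck
```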

Paths are automatically assigned weights which reflect their salience. The weights in MindNet are based on the computation of averaged vertex probability, which gives preference to semantic relations occurring with middle frequency: a path like ride -Location-> car will thus be favored over a low-frequency path like equip -Logical_Object-> low_rider or a high-frequency one like person -Logical_Subject-> go. This weighting scheme is described in detail in Richardson (1997).
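The exact computation is given in Richardson (1997); purely to illustrate the preference for middle-frequency relations, a weight with roughly the right shape might be sketched as follows (the functional form and the counts are assumptions made for the example, not the published formula):

```python
# Illustrative only: a per-relation weight that peaks at middle frequencies,
# averaged over the steps of a path.  The real averaged vertex probability
# computation is described in Richardson (1997).

def vertex_weight(count: int, total: int) -> float:
    """Penalize both very rare and very common relation instances."""
    p = count / total        # raw relative frequency of the relation instance
    return p * (1.0 - p)     # largest for mid-range probabilities

def path_weight(step_counts, total: int) -> float:
    """Average the per-step weights along a path."""
    return sum(vertex_weight(c, total) for c in step_counts) / len(step_counts)

# Hypothetical counts out of a notional 1000 relation instances:
print(path_weight([300], 1000))  # ride -Location-> car: middle frequency, favored
print(path_weight([2], 1000))    # equip -Logical_Object-> low_rider: rare, low weight
print(path_weight([950], 1000))  # person -Logical_Subject-> go: very common, low weight
```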

MindNet’s Coverage

A frequent criticism of efforts aimed at constructing lexical knowledge bases from MRDs is that while dictionaries contain detailed information about the meanings of individual words, their coverage is spotty, and in particular, they contain little pragmatic information (Yarowsky 1992; Ide & Veronis 1993, 1998; Barrière & Popowich 1996):

For example, the link between ash and tobacco, cigarette or tray in a network like Quillian’s is very indirect, whereas in the Brown corpus, the word ash co-occurs frequently with one of these words. (Veronis & Ide 1998)

Since pragmatic information is often a valuable cue for WSD, this is a serious concern. Yet the idea that dictionaries somehow isolate lexical from pragmatic knowledge, failing utterly to represent world knowledge, is incorrect. Standard desktop dictionaries contain voluminous amounts of “pragmatic” knowledge (see also Hobbs 1987 and Guthrie et al. 1996) – it is impossible, in fact, to separate this in a principled way from purely “lexical” knowledge – but much of this information only becomes accessible when the dictionary has been fully processed and inverted. The combined LDOCE/AHD MindNet, for instance, reveals tight connections between ash and the other words cited by Ide and Veronis:

ash <-Part- cigarette <-Part- tobacco

ashtray: a small dish for the ashes of cigarettes

cigarette: a small roll of finely cut tobacco for smoking, enclosed in a wrapper of thin paper

ash <-Purpose- ashtray -Hypernym-> receptacle <-Hypernym- tray

ashtray: a receptacle for tobacco ashes and cigarette butts

tray: a shallow, flat receptacle with its contents

Note, however, that these connections do not come directly from the definitions of tobacco, cigarette, or ash, but rather from joining information from the definitions of words like ashtray. Network searches that rely solely on forward-chaining methods for identifying links (e.g. Veronis & Ide, 1990) are unable to discover many of the interesting links among words.

The availability of these links surrounding ash in MindNet could be explained away as serendipitous. Our experience with MRDs, though, suggests that such serendipity is the norm rather than the exception: it is in general a poor idea to bet against lexicographers by asserting that some common-sense fact or other could not possibly be found in dictionaries. Often the facts are indeed there, waiting to be teased out by a sufficiently powerful discovery process.

That said, MRDs are finite resources written with specific goals, and it was never imagined that they would prove sufficiently broad in coverage for a system like MS-NLP. It is not difficult, in examining the LDOCE/AHD MindNet, to find significant gaps in coverage, or cases of paths that are much longer and lower-weighted than one would like for a particular connection. If our original goal had been to produce a directed, labeled graph from one or two dictionaries, the simplest strategies might have involved automated string-matching techniques (tuned to the sublanguage encountered in dictionaries), manual work, or some combination of these. Parsing dictionary text is arguably unnecessary or even undesirable for this task (Ahlswede & Evens 1988; cf. Montemagni and Vanderwende 1994).

From our standpoint, though, such criticisms reflect an undesirable focus on MRDs to the exclusion of other types of corpora. Dictionaries are a peculiar sort of corpus, one that is an especially interesting starting point for automatically building a database of information about word meanings, but they are just that – a starting point. String-matching or dictionary-specific parsing strategies may not even scale to another dictionary (much less to other text genres that MS-NLP will be required to mine for semantic information). Because of our emphasis on acquiring data from text sources beyond dictionaries, we rely on an industrial-strength parser – one that has been designed to cope with arbitrarily long sentences, ill-formed inputs, and rare syntactic constructions.