Extraction Pattern Discovery through Corpus Analysis
Roman Yangarber and Ralph Grishman
Computer Science Department
New York University
New York, NY 10003
{yangarber, grishman}@cs.nyu.edu
Abstract
If we examine texts concerning different topics or subject domains, one way in which they can be characterized is by the classes of entities, and the relations between these entities, which are expressed in the texts. Such characterizations provide a valuable complement to the syntactic and lexical analyses which are primarily used in corpus linguistics. These characterizations are also crucial to the development of information extraction systems for new topics and domains. To develop a new extraction system, we must determine what relations need to be captured, and — as comprehensively as possible — how these relations are expressed. We describe in this paper a method for acquiring this information automatically from unannotated text, through a bootstrapping procedure which identifies both a set of texts about a topic and a set of relations which occur primarily in these texts. We present several examples of topics and retrieved relations, as well as an evaluation in terms of text filtering.
The call for participation for this workshop, “Information Extraction meets Corpus Linguistics,” notes that “The goals of information extraction and corpus linguistics have thus far had little in common.” It goes on to observe that corpus linguistics aims to understand the differences among texts in heterogeneous collections, largely through statistical analysis of lexical,syntactic, and discourse structures (Biber et al., 1998). Information extraction, on the other hand, has generally been applied to more homogenous collections, and focused on the “engineering” problems of developing finite-state patterns to extract the desired information.
There has been, however, a thread of research in computational linguistics in which these two areas have been closely interwoven — a thread that goes back to the 1950's, almost the beginnings of computational linguistics. In this paper, we shall briefly look at this research thread, and see how it connects to the new interest in machine learning for information extraction. In particular, we shall describe a linguistic pattern discovery procedure which can serve a role both in information extraction and in corpus linguistics.
Sublanguages and Relations
Information extraction covers a range of tasks, including name identification and classification, entity tracking, and the capture of relations and events. In this paper we shall focus primarily on the extraction of relations and events.
Corpus linguistics aims to identify differences in language characteristics between texts — differences which can be attributed to differences in authorship, differences in purpose, and differences in the domain of discourse. When differences in domain or purpose lead to clear cut differences in language, the different forms of language are sometimes referred to as sublanguages (Grishman & Kittredge, 1986). We speak, for example, of the sublanguage of weather reports, the sublanguage of a class of medical reports, or a science sublanguage. One characteristic of such domain-based sublanguages is that they talk about different, sublanguage-specific sets of concepts, relations, and events. Weather reports will discuss temperature and precipitation; medical reports will discuss symptoms, diagnoses, and treatments.
Zellig Harris (1968) provided a formal description of such sublanguages: they are characterized by a set of sublanguage word classes, and constraints on the way in which these classes can be combined in sentences to express relations and events. Thus, for example, medical reports may have a class of symptom nouns, Nsymptom, and a class of disease nouns, Ndisease, which can be combined in the form Nsymptom indicates Ndisease, but not Ndisease indicates Nsymptom. By cataloging the relations of a sublanguage, Harris suggested, it was possible to create a formal structure for the information content of the texts.
Information extraction can be seen as the task of pulling a single, externally specified relationship from some texts. In a more expansive view, however, we can envision analyzing a sublanguage and then extracting the principal relationships expressed in this sublanguage. This may be a small set for weather reports, but potentially a larger and quite complex set for a science sublanguage. Harris (1958) indicated how such extraction procedures, if they could be implemented, could have a major effect on access to the scientific literature. The information processing and language processing technology of the time was not ready for such applications, but the ideas advanced then are still appropriate today.
Starting with these initial proposals, efforts were made to apply these techniques to several types of text, both scientific articles and later medical reports (Sager et al., 1987). This was initially done by hand, but later automated in a procedure called “information formatting.”
This approach raised the issue of how the relations can be identified in a systematic way from a sample of text in the corpus. Again, the work was initially done by hand, identifying word classes and then aligning different sentence structures involving the same classes of words. Some of these steps were then automated, using the limited sublanguage corpora then available. Distributional analysis, applied to parsed text, was used to identify the major sublanguage word classes (Hirschman et al., 1975). Once the word classes had been identified, a statistical analysis of clause and noun phrase patterns was used to identify the dominant sublanguage relationships (Grishman et al., 1986).
If, following Harris, we define a sublanguage as a set of texts obeying certain constraints on co-occurrence — only certain classes of words appear in particular sentential relations — then we face some circularity in the discovery procedure: we need to identify the texts to find the relations, and need to identify the relations to find the texts. In a later section, we consider how a discovery procedure can operate by discovering texts and relations in parallel.
Extraction and Pattern Discovery
Over the past decade there has been a rapid growth in interest in information extraction. This has been prompted in part by opportunity — the explosive growth in on-line text data available through the web. It has also been driven by the recognition that simple, robust language analyzers — including, in particular, finite state transducers— can be quite effective for extraction purposes. Extraction system development has therefore become, in large part, a task of developing the (sometimes large) set of linguistic patterns required to recognize the events of interest. Limitations of extraction performance and concerns about the cost of porting systems to new domains have focused attention on the problem of pattern discovery.
The challenge here is to find all the possible linguistic expressions of information relevant to a particular specified relation or event. For a given event type, we can quickly think of the most common forms, but finding the rarer forms — the long "tail" in the distribution of expressions — is much harder, and is a primary source of the limitations in performance of extraction systems.
The “traditional” approach to IE system development has involved a manual analysis of a modest-size corpus, looking for relevant expressions and reducing them to linguistic patterns. The difficulty of developing large pattern sets by hand has led, over the past few years, to the development of supervised learning methods(Califf, 1998; Sasaki, 1999; Soderland, 1999). These methods rely on corpora which have been annotated, by hand, with the information to be extracted. Using such a corpus, a system can develop a pattern corresponding to each annotated example, and can then attempt generalizations, combining individual (specific) patterns into more general ones. While annotation of individual examples (marking the information to be extracted) may require less skill than linguistic analysis, there remains the burden of annotating hundreds or even thousands of documents in order to collect a range of the rarer linguistic expressions.
One alternative to such extensive annotation is the use of statistical methods to identify combinations of words or word classes which are likely to be relevant to the task. This can be seen as an analog of the sublanguage studies described earlier, although focussed more narrowly on a specific extraction task. One such experiment was conducted by Riloff (1996) using a corpus of (translated) foreign broadcast news reports. The goal was to extract information about terrorism, and each article was classified (by hand) to indicate whether or not it was relevant to the extraction task. Each article was decomposed into a set of patterns (such as subject-verb pairs, verb-object pairs, etc.), and then the patterns were ranked based on their frequency in the relevant articles, and their relative frequency in the relevant articles when compared to their frequency in the entire corpus. By selecting high-ranking patterns, it was possible to create a pattern set which performed almost as well at extracting selected information from the texts as a manually-constructed pattern set.
Bootstrapping
The approach just described still requires manual relevance judgements for a substantial corpus (for the terrorist corpus, Riloff used 1500 texts, of which about 50% were relevant). While manual relevance judgements are less labor-intensive than full event annotation, they are still a significant burden when dealing with a large collection.
We have recently conducted a series of experiments which acquire such relations without any pre-annotation of the corpus, and thus can take advantage of much larger collections ... tens or hundreds of thousands of documents. This approach requires only a small set of seed patterns to start, and builds the set of patterns and the corpus of relevant articles in tandem. Our method is based on the observation that the patterns of interest tend to co-occur at the document level: a document containing one pattern about a topic is more likely to contain other relevant patterns than a document taken at random from the collection. This is the analog at the level of patterns of techniques of word co-occurrence analysis, which have been used for many years to build word clusters and thesauri for information retrieval.
To begin our discovery procedure, we select a small set of “seed” patterns typical of the class of events of interest. For example, to extract information about the hiring and firing of personnel (the MUC-6 “management succession” task), one of the seed patterns was “company appointed person as president”. The system labels as relevant all the documents containing one of these patterns. Then, for each clause pattern appearing in the relevant documents, it determines its frequency in the relevant documents and in the rest of the collection, and ranks the patterns based both on the frequency in the relevant documents and on the ratio of frequencies. The system selects the highest ranking pattern, adds it to the set of patterns, and repeats the process (retrieving a new set of relevant documents). In the remainder of this section we describe the steps of this process, and an evaluation for one extraction task; further details are available in (Yangarber et al., 2000).
Method
We begin by preprocessing the corpus. The text is first processed by a ‘named entity’ tagger which identifies names of people, organizations, and locations. Then a parser is used to extract all the clauses from each document;[1] for each clause we build a tuple consisting of the head of the subject, the verb, the head of the object, locative and temporal modifiers, and certain other verb arguments. Different clause structures, including passives and relative clauses, are normalized to produce tuples with a uniform structure. These tuples will be the basis for our pattern extraction process.[2]
Next, we create a small set of ‘seed’ patterns relevant to the extraction task. These patterns will be used to initiate the discovery process.
We retrieve all documents from the corpus containing at least one seed pattern. These form our initial set of relevant documents R. Each tuple which appears in some document in R is then ranked using the following metric:
where H(p) is the set of documents containing tuple p. (This is similar to the metric used by Riloff (1996).) Very frequent tuples, occurring in over 10% of the documents, are rejected as uninformative, as are rare patterns which appear only once among the relevant documents. The top-ranked tuple based on this metric is added to the set of patterns and the process is then repeated.
The discovery procedure takes into account our confidence about the relevance of discovered patterns and retrieved documents. Each pattern is assigned a ‘precision’ measure between 0 and 1, and each document is assigned a relevancescore between 0 and 1. Seed patterns have a precision of 1, and documents matched by these seed patterns have a relevance of 1. For discovered patterns, the precision is the average relevance measure of the documents which contain that pattern. For documents which are added to the set of relevant documents in subsequent iterations, the relevance score is determined by the precision of the patterns which appear in that document. The details are presented in (Yangarber et al., 2000).
The pattern selection process is somewhat more complicated than just described. In selecting patterns, we only consider for candidacy generalized patterns, in which one or more of the slots are filled with ‘wild cards.’ In computing the score of the generalized pattern, we do not take into consideration all possible values of the wild card. We instead constrain the wild card to those values which themselves in turn produce patterns with high scores. The effect of this process is evident in some of the discovered patterns listed below, where some of the slot positions are filled with sets of values.
This process is a very selective form of distributional analysis (where we only use task-relevant patterns as context), and the sets of values obtained are natural candidates for domain-specific noun and verb classes, akin to those obtained in (Riloff & Jones, 1999). We do not presently make use of these classes in subsequent iterations of the discovery procedure, but we intend to do so in the near future.
Evaluation
We have tried this method initially to acquire patterns for the “management succession” task, which was used for MUC-6 (1995). We began with a seed containing thepatterns
Subject / Verb / Direct ObjectC-company / C-appoint / C-person
C-person / C-resign
where C-company and C-person are the classes containing the named entities for companies and people. C-appoint is the set of verbs {appoint, elect, promote, name}; C-resign is the set {resign, depart, quit, step-down}. We used a collection of 6000 news articles from the Wall Street Journal, including those from the MUC-6 training corpus.
The discovered patterns cannot be used directly for information extraction; we need to add information about which predicate to associate with each pattern (that appoint involves starting a job, while remove involves leaving a job), and which position in the pattern to associate with each predicate argument.However, it is possible to evaluate the discovered patterns by themselves, automatically, by looking at their effectiveness at text filtering — at differentiating relevant from irrelevant documents.
To do so, we used a test corpus consisting of 100 MUC-6 formal training documents (which were included in the development corpus), plus another 150 documents picked at random from the main corpus and judged for relevance by hand. (Note that these relevance judgements were only used for evaluation, and not in the discovery procedure.)
The seed patterns by themselves retrieved 184 documents. Measured against the test corpus, they produced an initial document recall of 0.11 and a precision of 0.93. After 80 iterations of the discovery procedure, the set of patterns yielded a non-zero relevance for 982 documents. On the test corpus, it produced a text filtering recall of 0.78 with a precision of 0.80.[3] This is a high level of retrieval performance, considering that only automatic methods were used to expand the initial pattern set.
It is also interesting to examine the patterns that were discovered. NYU produced an extraction system for MUC-6, developing the patterns by hand, and has since continued to extend its pattern set through the review of additional documents, so one might expect the pattern set to be relatively complete. Nonetheless, the discovery procedure yielded a number of patterns which were not present in the hand-coded system, including[4]
person came [to company] [as officer]
person rejoined [company] [as officer]
person returned [as officer]
company brought person [in] [as officer]
person tapped person [to be officer]
This is evidence of the difficulty of preparing a pattern set with good coverage by hand, and of the value of this discovery procedure for information extraction.
Clusters of Relations
As we noted above, most extraction tasks define in advance a particular relation to be extracted. However, if we are given the goal of extracting all the information ‘about’ a given class of events, we may not be sure in advance of all the relations involved. The bootstrapping approach we have just described allows us to automatically build up a set of interconnected relations — the set of relations that co-occur in the articles about a given topic.
We have used the procedure just described to explore various news topics. In each case we created a small seed, and then collected the patterns discovered by our procedure.
The first topic was ‘natural disasters’. We defined a set of disasters,
disaster = {earthquake tornado flood
hurricane landslide snowstorm
avalanche}
and some seed patterns involving these disasters:
disaster caused ...
disaster {hit | destroyed | ravaged | damaged}
{street | bridge | house | home | - }
The set of patterns discovered[5] included:
3. quake measured (number)
4. quake was felt
5. earthquake shook area
7. earthquake was centered
9. quake struck
11. quake was considered
12. quake occurred
14. quake knocked [out] power
22. storm knocked [out] power
23. damage was reported
28. aftershock injured –
29. aftershock killed people
30. the quake registered
38. it caused damage