A Context-Based Approach Towards
Content Processing of Electronic Documents

Karin Haenelt

Fraunhofer Gesellschaft e.V. – FhG
Dolivostraße 15
D 64293 Darmstadt, Germany

Abstract

This paper introduces a text-theoretically founded view on content processing of electronic documents. A central aspect is the representation of the contextual embedding of texts. It provides a basis for modelling mechanisms of the dynamic development of information and access perspectives during the process of information communication, and for the management of vague and incomplete information. The paper first introduces a basic concept of text production and understanding (section 2). On this basis it develops a text model with a four-layered text representation and text-external context bindings (section 3). It then describes the components of a text analysis process from robust parsing to deep semantic analysis, and explains the establishment of conceptual and thematic access perspectives (sections 4 and 5). An outlook sketches an application scenario which uses the representation described for text and information retrieval and machine translation (section 6).

1 Introduction

Most of our information sources and publications contain essential parts in the form of natural language texts. During the process of publication this information is used by authors and transformed into new documents (e.g., new texts, abstracts, translations). Basically it is the content of texts which is accessed, not just the surface structure. In order to electronically support applications which are essentially devoted to textual content (e.g., information retrieval, machine translation, hypertext links), natural language components have to provide immediate access to the contents of the various information objects.

Natural language texts are very flexible means of information handling. They allow for the constitution of information as well as for its communication, and for the handling of heterogeneous and incomplete information as well as for the development of information in the progress of time. Successful future information systems will above all have to offer this flexibility of information handling which natural language provides.

The current state of processing of natural language texts is characterized on the one hand by different procedures and methods for individual applications, and on the other hand by results which still do not satisfy users, and which, as expectations increase, will do so less and less. This has been shown by practical experience and several evaluations. Two examples may serve as illustrations:

In the area of full-text retrieval the figures quoted again and again for some years now read as follows: “No more than 40% precision for 20% recall” (Sparck Jones, 1987). In other words: 60% of the results are wrong, and 80% of the information available in the system is not found. More recent figures are: “60% precision for 40% recall or 55% precision for 45% recall” (Will, 1993) (similarly (Harman, 1996), (Voorhees and Harman, 1997)). Although the meaning of such figures is debatable with respect to their application relevance and their methodological basis (cf. (Kowalski, 1997)), the general tendency has been confirmed by users and developers repeatedly. Croft wrote: “We are still doing pretty badly even with the best technique that we have found” (1988), and: “The most interesting thing about text, and the central problem for designers of information retrieval systems, is that the semantics of text is not well represented by surface features such as individual words.” and “The number of retrieval errors could be reduced if information retrieval systems used better representations of the content of text.” (Croft, 1993).
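For readers unfamiliar with the two measures, the relationship between such figures and the underlying document sets can be made concrete with a small sketch. The document identifiers and counts below are invented for illustration and are not data from the cited evaluations.

```python
# Precision and recall for a hypothetical retrieval run.
# Document sets and figures are illustrative, not from the cited studies.

def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Precision = fraction of retrieved documents that are relevant;
    recall = fraction of relevant documents that were retrieved."""
    hits = retrieved & relevant
    return len(hits) / len(retrieved), len(hits) / len(relevant)

# 10 documents retrieved, 25 relevant documents in the collection,
# 4 of which appear among the retrieved ones:
retrieved = {f"d{i}" for i in range(10)}       # d0 .. d9
relevant = {f"d{i}" for i in range(6, 31)}     # d6 .. d30
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.4 precision, 0.16 recall
```

A run with 40% precision at 16% recall thus means that most retrieved documents are noise, while most relevant material remains unseen, which is the situation the quoted figures describe.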

In the area of machine translation the situation is similar. The Japanese JEIDA-report (Nagao, 1989: 14) describes the result of an evaluation of machine translations as follows: ”Some translations were done well. Others, however, were not translated or were translated incorrectly. In some cases, only fragments of sentences were translated and they are directly put into a sequence disregarding linguistic relationships among them.”

One major impediment to more sophisticated textual information and document handling is common to many kinds of electronic processing: the objects that really should be handled are interpreted natural language texts, that is, both the text and the knowledge communicated by those texts, rather than uninterpreted character strings. The mechanisms of text constitution, or textual communication of knowledge, are however still poorly understood. Current approaches towards content handling employ statistical methods or pre-coded knowledge bases. Lexical-statistical approaches assume that the choice of vocabulary in a text is a function of subject matter. The results quoted above, however, suggest that this assumption needs refinement. Knowledge bases are utilized for two tasks: for concept identification, i.e. determining the concepts corresponding to explicitly introduced information, and for bridging inferences, i.e. closing gaps between explicitly introduced concepts in order to construct a cohesive representation. The problems with these approaches have been recognized as twofold: the descriptions provided in a knowledge base are prepared intellectually, and they are modelled under those aspects which are foreseen on the basis of a particular state of the art and for a particular task (even if a generalization is aimed at). Firstly, this procedure is very costly; secondly, experience shows that matching texts against these schemata works satisfactorily for small texts in restricted domains, but is less successful if texts are to be processed which communicate new or newly organized knowledge. In this case either the concepts available, the granularity of their description or the contexts they appear in do not provide the information which is actually needed. The situation becomes even worse if concept descriptions are accessed and used without consideration of any context (which is typically the case with the application of thesauri).
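The lexical-statistical assumption criticized above, that vocabulary choice indicates subject matter, can be illustrated with a minimal word-overlap measure. The example texts are invented; the second document shows why the assumption needs refinement, since it treats the same subject matter with entirely disjoint vocabulary.

```python
# Minimal illustration of the lexical-statistical assumption: subject
# matter is approximated by shared vocabulary. Texts are invented examples.
from collections import Counter

def bag_of_words(text: str) -> Counter:
    return Counter(text.lower().split())

def overlap(a: Counter, b: Counter) -> int:
    # number of shared word tokens (multiset intersection)
    return sum((a & b).values())

query = bag_of_words("opera house roof construction")
doc1 = bag_of_words("the opera house has a concrete roof construction")
doc2 = bag_of_words("Utzon designed the shells of the Sydney landmark")

print(overlap(query, doc1))  # 4: opera, house, roof, construction shared
print(overlap(query, doc2))  # 0: same subject matter, disjoint vocabulary
```

A purely surface-based system would rank the second document as irrelevant, although for a human reader it clearly concerns the same topic; this is the gap that interpreted representations are meant to close.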

A prerequisite for managing mass data with improved application results is a better understanding of the natural language mechanisms of information constitution and development. The conception of the KONTEXT model presented in this article has been motivated by the goal to explore the means natural language provides for constituting, organizing and flexibly communicating information. The model views texts in their context with other texts rather than as isolated units, because this approach provides a basis for explaining mechanisms of the development of perspectives on information. The article focusses on the representation and its use for information processing. A corresponding text analysis prototype is currently under development. Although it is not yet possible to provide a detailed specification of completed research on this process, some of the design considerations and insights gained from prototype development and application are included in this article.

2 Basic Assumption: Text Production and Text Understanding are Intentional Processes

In many approaches assumptions about the understanding process have not been made explicit, and it has more or less been taken for granted that the task of a computer is to generate a “correct” and “objective” text representation. Much research work has been devoted to identifying the input resources needed (rule systems, dictionaries, knowledge bases, inferences) for constructing such a representation. Although observations have been reported which do not agree with this assumption, no serious consequences have been drawn with respect to system design, at least as far as conceptual systems are concerned (in statistical approaches changes in a corpus do have effects on the processing result). Kintsch and van Dijk, for example, state: “It is not necessary that the text be conventionally structured. Indeed the special purpose [the reader’s goals, K.H.] overrides whatever text structure there is.” (Kintsch and van Dijk, 1978: 373). Hellwig (1984) writes that, as a consequence of the hermeneutic character of text descriptions, a certain freedom in text interpretation must be taken into account. Grosz and Sidner (1986: 182) report on differing text segmentations by different readers, and Passoneau and Litman (1997: 108) write: “we do not assume that there are ”correct” segmentations.” Similarly, Corriveau (1991 and 1995), in the description of his text analysis system IDIoT, states: “there is no correct interpretation, but rather an interpretation that is reached given a certain private knowledge base and a set of time-related memory parameters that characterize the ”frame of mind” (Gardner, 1983) or ”horizon” (Gadamer, 1976) of a particular individual.” (Corriveau, 1991: 9).
His consequence is a system design in which “all memory processes are taken to be strictly quantitative i.e., mechanical and deprived of any linguistic and semantic knowledge” and “all ’knowledge’, that is, all qualitative information, manipulated by the proposed comprehension tool is assumed to be strictly user-specifiable” (Corriveau, 1991: 8). Whilst this approach leads to a rigorous distinction between data and algorithms, it still uses hierarchically structured domain knowledge bases.

The problem with assuming an “objective” result of a text analysis process and relying on well-structured background knowledge bases is twofold: To begin with, these assumptions set a goal which for theoretical reasons obviously cannot be reached. Moreover, they block the way towards exploring the mechanisms of the dynamic development of information and access perspectives during the process of information communication, and towards the management of vague and incomplete information. It is the search for the reasons why texts can be interpreted in different ways - depending on background information and communication goals - that leads to the basic premises of these mechanisms.

A basic assumption of the KONTEXT model is that text production and text understanding are intentional processes with varying results depending on background information and communicative goals. Further assumptions are:

(1) A distinction is made between knowledge and information: Knowledge is understood as unintentional, i.e. as independent of integration into particular tasks and contexts (for a similar definition cf. (Searle, 1980), (Thom, 1990), (Rich and Knight, 1991)). Knowledge which has been manifested (e.g., in natural language texts) for a particular purpose is called information (following a definition by Franck (1990)).

(2) It is assumed that informative texts are manifestations of access to knowledge. They do not, however, present knowledge as a whole. They rather access and fix knowledge in a particular way which serves a particular purpose in a particular communication situation. The information presented in a text is the information which is supposed to be relevant with respect to the communicative goal of the text. It would not serve a communication purpose to communicate all knowledge equivalently and in an equally detailed manner (similarly (Lang, 1977: 81/82)).

(3) Each text organizes knowledge in its own way, and besides the communication of knowledge which is supposed to be new to the communication partners, it may be a particular organization of already known facts which creates relations which suit a further communication situation better and which shed a new light on previous knowledge.

(4) Information provides a particular view on knowledge and is contextually bound in two ways: Firstly, the information presented in a text highlights pieces of knowledge rather than provides a clearly cut segment of it. The information selected for textual presentation is not necessarily self-contained. It may rather be contextually bound to further knowledge outside the actual fixing. Secondly, the knowledge fixed for a text is text-internally bound into the organization of the actual fixing.

Based on these observations, textual communication of knowledge can be explained as follows: texts are construction instructions for information (similarly (Kallmeyer, Klein, Meyer-Hermann, Netzer and Siebert, 1986: 44)). Information is not just delivered as a whole to a partner. Instead, understanding is an active process. The reader has to construct information in accordance with the same principles which the author has used to fix knowledge. The author of a text has found a pragmatic solution that leads to a specific goal by a chain of operations on the author's own knowledge, and it is this chain of operations that is imparted to the reader. The author guides the process of understanding by drawing attention to those details which are suitable for the construction of new views and relations. The guidance includes instructions as to which parts of knowledge or previously constituted information are to be accessed, how these parts are to be connected, how parts of the constructions are to be changed, from which perspective the constructions are to be viewed, where the construction shall be continued, etc. In this process the individual expressions have different functions. They are used to refer to areas of knowledge or information, or to constitute contexts and structures which determine access and construction operations. Nouns, for instance, are used for accessing or introducing objects (“Opera House”), verbs are used for accessing or constituting states of affairs (“build”) and for establishing relations between objects (’build (Utzon, Opera House, in(Sydney))’), anaphoric pronouns (“their shell roofs”, “his personal style”) or definite articles (“the interiors”) are used for redirecting the reader to previously established information structures, active and passive voice are used for establishing a perspective, etc.
The sequentially arranged expressions of a text function as operators which establish constructs like concepts, references to instances, contexts and thematic structures. These constructs in turn determine the access to knowledge and the composition (including changes) of text-specific information.
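The operator view sketched above can be made concrete as a toy interpreter in which each expression type performs an operation on a growing discourse representation. All class and method names below are illustrative inventions and do not describe the KONTEXT implementation.

```python
# Toy sketch of "expressions as operators": each expression updates a
# discourse representation. Names and structure are illustrative only.

class Discourse:
    def __init__(self):
        self.entities = {}    # referents established so far (with mention counts)
        self.relations = []   # states of affairs constituted so far

    def introduce(self, noun):
        # noun: access or introduce an object
        self.entities[noun] = self.entities.get(noun, 0) + 1
        return noun

    def relate(self, verb, *args):
        # verb: constitute a state of affairs between objects
        self.relations.append((verb, args))

    def resolve(self, referent):
        # anaphoric pronoun / definite article: redirect the reader to a
        # previously established information structure
        if referent not in self.entities:
            raise KeyError(f"no antecedent established for {referent!r}")
        return referent

d = Discourse()
d.introduce("Utzon")
d.introduce("Opera House")
d.relate("build", "Utzon", "Opera House", "in(Sydney)")
d.resolve("Opera House")   # e.g. "their shell roofs" points back here
print(d.relations)         # [('build', ('Utzon', 'Opera House', 'in(Sydney)'))]
```

The point of the sketch is the division of labour: expressions do not carry the constructed information themselves, they are the operations by which each reader (re)builds it over an individual knowledge state.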

As can be observed, a text understanding process can have results as different as no understanding at all, partial understanding, misunderstanding, good understanding, and new perspectives on previous knowledge, depending on the reader. These differences can be explained by the assumption that each reader tries to interpret the newly communicated information on the basis of their own background knowledge, in such a way that it is internally connected and contextually bound to that background knowledge. The connectedness of a view is not necessarily completely provided by the text itself. As has been mentioned, a text focusses on the information which is supposed to be relevant with respect to its communicative goal, and presents this information to the extent to which it is supposed to be new. Further knowledge is not fixed. Contextual binding of the view presented may, however, be required for connecting the information units of the view. These connections must be provided by each reader's own background knowledge or further accessible information (e.g., reference books). Usually, neither the knowledge area to be involved nor its extent is described (exceptions are explicit references to background information sources in scientific publications, reference books, legal texts, and others). Obviously, communication succeeds on the basis of a certain breadth and depth of variation and vagueness.