Relational Data Model
in Document Hierarchical Indexing

Authors: Intentionally left blank

Affiliation: Intentionally left blank
Affiliation: Intentionally left blank
Affiliation: Intentionally left blank
Affiliation: Intentionally left blank
Affiliation: Intentionally left blank

e-mail: Intentionally left blank

Abstract. One of the problems of the development of document indexing and retrieval applications is the usage of hierarchies. In this paper we describe a method of automatic hierarchical indexing using the traditional relational data model. The main idea is to assign continuous numbers to the words (grammatical forms of the words) that characterize the nodes in the hierarchy (concept tree). One of the advantages of the proposed scheme is its simplicity. The system that implements such indexing scheme is described.

1 Introduction

Traditional document retrieval systems [1] lack the possibility to perform generalized searches (say, for topic or theme). On the other hand, the systems that permit navigation in a hierarchy, for instance, Yahoo, have their own shortcuts, for example:

  • Document indexing is usually manual, and, thus, imprecise and subjective,
  • Searches with various conditions are not allowed, and
  • The documents normally are stored only in one place in the hierarchy though they can deal with several topics in the hierarchy.

These problems can be resolved if we apply automatic indexing of documents using a concept dictionary that usually has a form of a hierarchy of concepts (concept tree), see, for example, [4]. Still, in the paper [4] the task of indexing is ignored and the data is obtained from re-processing of the texts.

There are several linguistic techniques that allow for more exact and effective document processing (for further document retrieval), see, for example, [2], [3], like morphological analysis or synonymy, etc. Morphological analysis permits to reduce several grammatical forms to only one lexeme (e.g., for lexeme take both take and took should be indexed). This may be not of great importance for the languages like English with very few grammatical forms, but it is of great utility for the languages with greater number of grammatical forms for each lexeme as, say, Spanish or Russian. For example, a Spanish verb has more than hundred grammatical forms (counting with clitics).

In this paper we propose the method of the concept tree indexing using the relational data model. The main idea is to give to all grammatical forms that correspond to a lexeme the consecutive numbers, and since the nodes in the tree are characterized by a set of lexemes, they also get this numbering. Thus, each grammatical form is represented by the number, while lexemes and nodes are represented by a numeric interval. Usage of the consecutive numbers allows for using of the relational data model that ensures simplicity of the whole method. Namely, it is possible to use any standard database manager available. The corresponding system was developed that implements rather sophisticated query language both for the words and for the nodes of the concept tree.

In the rest of the paper we describe in a greater detail the procedure of hierarchical indexing and then briefly describe the system and its query language.

2 Hierarchical Indexing

The documents about biology contain words like plants, animals, cellules, etc. Still, if one is interested more specifically in zoology, he/she wants to find the document with the words crocodiles, or mammals, etc. This specification process finishes with individual words. Still, if the user wants to find documents about biology, this also means that the documents about zoology are acceptable, but not vice versa. The structure of the concept tree resembles this specification process: the terminal nodes have the lists of words that correspond to them; the non-terminal nodes unite several terminal or non-terminal nodes (see example below). Each node has only one parent node except the uppermost node that does not have parent.

The suggested method of indexing establishes correspondence between each node (not necessarily the terminal one) and the words that are below the node in such a way that the words are indexed with continuous numbers and each node in the tree is characterized by a numeric interval. This interval is obtained from the words that are below the node.

We have the concept tree in English, because this tree is the language-independent resource. Still, the lists of words that correspond to terminal nodes are language-dependent. In our case, we used Spanish lists, though the list of words in any language can be used.

Technically, the indexation process is as follows: pass all the terminal nodes in the concept tree starting from the leftmost terminal node and enumerate all the words that correspond to them in an order starting from 0. The words in the lists for the terminal nodes are lexemes, so we generate all their grammatical forms and enumerate these forms (not the lexeme itself). At the same time, the terminal nodes are assigned the corresponding numbers, taking the lower bound of the leftmost lexeme and the upper bound of the rightmost lexeme. In case of terminal nodes their children are lexemes.

The next step is to index all non-terminal nodes. The lower value is taken from the leftmost child; the upper value is taken from the corresponding right-most child. In case of non-terminal nodes their children are terminal or non-terminal nodes.

After this processing, the tree looks like the following:

- ANY TOPIC 0 4624977

-- NATURE 0 1282748

--- nature 0 4

--- HUMAN BODY 5 368954

---- the human body 5 80

---- ANATOMY 81 31261

----- anatomy 81 162

----- reproductive system 163 221

----- respiratory system 222 252

----- skeleton 253 15288

......

Fig. 1. Constructing a query using the concept hierarchy.

The number of hyphens at the beginning of each line corresponds to the level in the concept tree. This fragment has four levels indicated with capital letters and several terminal nodes indicated with letters in lower case. The numbers are the lower and the upper bounds of the continuous enumeration. As can be seen, total number of grammatical forms that are represented in the hierarchy is 4,624,978 — the higher value of ANY TOPIC that is the uppermost node in the tree and unites all the nodes.

Also, let us have a look at the list of lexemes corresponding to terminal nodes. Say, the node the human body contains the following lexemes: anatomía (anatomy), aptitud (ability), pelo (hair), perfeccionamiento (perfection), salud (health), etc. All grammatical forms of these lexemes are indexed (also, see the list of words on the right side of the Fig. 1).

In case that the word belongs to two or more different terminal nodes it is indexed several times with different numbers (with all its grammatical forms).

Having the concept tree indexed, it is possible to start indexing the documents. If the algorithm finds in the documents the grammatical forms that are not in the tree, they are indexed in the following manner: first, the lemma is generated, then all grammatical forms for this lemma are generated, and then all these forms are enumerated.

Fig. 2. Constructing query using grammatical forms and lemmas.

After the document indexing we have a database that contains numbers of all grammatical forms that were found in the documents. Now it is possible to search rapidly constructing the query. In the query one can use individual grammatical forms, as well as lexemes, and as well as tree nodes. All of them are substituted by the corresponding intervals of numbers (see Fig. 1 and Fig. 2).

3 Description of the system

We developed the system that implements the method of indexing described above. The system has sophisticated query language with Boolean operations and proximity criteria. The main advantage of the system is the possibility to search combining themes, lemmas, grammatical forms and their logical combinations. The system was tested using the investigation projects developed in our center. So, additionally, the database has several traditional relational fields that can be used in retrieval, like, start date of the project, final date of the project, chief researcher, participants, etc.

The query is translated automatically into SQL.

Let us have a look at the example query. Say, the word (lexeme) cobre (brass) is substituted by several numerical intervals (there are several intervals because the word belongs to several terminal nodes)

(( a1.id >= 1127719 AND a1.id <= 1127720 ) or

( a1.id >= 1143963 AND a1.id <= 1143964 ) or

( a1.id >= 1481622 AND a1.id <= 1481623 ) or

( a1.id >= 4268598 AND a1.id <= 4268599 ) )

The first interval belongs to the terminal node “pure metals” and the second to the terminal node “stones, bricks, tiles, glass and metals”. Their immediate upper node is “MINERALS, METALS & ROCKS”. The third interval belongs to the terminal node “elements” with the upper node “CHEMISTRY”. The fourth interval is the terminal node “money, currency and denominations” with the immediate upper node “THE ECONOMY” (the word cobre (brass) is used in Spanish to denote a small change).

The buttons that are represented on the left side of the window of the system permits to use Boolean operations (and, or, no), brackets (one bracket that can be inserted (left or right) or two brackets for the whole expression), and the proximity button that permits to search the elements that are near (we use the distance of 10 words).

Fig. 1 shows the usage of the tree: the node “social order” is inserted, which means that any word from the corresponding word list match the query.

Fig. 2 contains besides the tree node the grammatical forms of the verb aplicar. It means that in this case only these specific forms match the query. But if we add lemma aplicar, which is chosen in the list of lemmas, then all the grammatical forms of this lemma are found.

4 Conclusions

In the paper we presented the simple method of hierarchical indexing of documents that is based on the relational data model. We enumerate words (more exactly, their grammatical forms) that correspond to the terminal nodes of the concept tree using consequent numbers, i.e, all grammatical forms that belong to the same lexeme (word) have the continuous numbering, and the words (and corresponding grammatical forms) that belong to the same node in the hierarchy also have the continuous numbering. The documents are indexed using these numbers corresponding to the grammatical forms that exist in the document. The system uses this information in document retrieval. This gives the possibility to search by topics (themes). Since there can be several words belonging to this or that theme, it is possible to calculate weights of each theme. The implementation has sophisticated query language with Boolean operations and proximity criteria.

References

1. Baeza-Yates, Ricardo, and Berthier Ribeiro-Neto. Modern information retrieval. Addison-Wesley Longman. 1999.

2. Gelbukh, Alexander. Lazy Query Enrichment: A Simple Method of Indexing Large Specialized Document Bases. Proc. DEXA-2000, 11th International Conference and Workshop on Database and Expert Systems Applications, Greenwich, England, September 4-8, 2000. Lecture Notes in Computer Science N 1873, Springer-Verlag, pp.526–535.

3. Gelbukh, Alexander and Grigori Sidorov. Intelligent system for document retrieval of the Mexican Senate. Proc. CIC-2000, Congreso Internacional de Computación, November 15 - 17, 2000, CIC, IPN, Mexico D.F., pp.315-321.

4. Gelbukh, A., G. Sidorov, and A. Guzman-Arenas. Use of a weighted topic hierarchy for text retrieval and classification. In Václav Matoušek et al. (Eds.). Text, Speech and Dialogue. Proc. 2nd International Workshop TSD-99, Plzen, Czech Republic, September 13-17, 1999. Lecture Notes in Artificial Intelligence, No. 1692, Springer ( events/ tsd99/ abstract.html), pp. 130–135.