Performance Evaluation of Relational
Implementations of Inverted Text Index
Qi Su Yu-Shan Fung
Stanford University Stanford University
ABSTRACT
Information retrieval (IR) systems are adept at processing keyword queries over unstructured text. In contrast, relational database management systems (RDBMS) are designed for queries over structured data. Recent work has demonstrated the benefits of implementing the traditional IR system of inverted index in RDBMS, such as portability, parallelism, and scalability. We perform an in-depth comparison of alternative relational implementations of inverted text index versus a traditional IR system.
1. INTRODUCTION
Database and information retrieval (IR) are two rich fields of research that have produced ubiquitous tools such as the relational database management system (RDBMS) and the web search engine. However, historically, these two fields have largely developed independently even though they share one overriding objective, management of data.
We know that traditional IR systems do not take advantage of structure of data, or metadata, very well. Conversely, relational database systems tend to have limited support for handling unstructured text. Major database vendors do offer sophisticated IR tools that are closely integrated with their database engines, for example, Oracle Text, IBM DB2 Text Information Extender, and Microsoft SQL Server Full-Text Search. These tools offer a full range of options, from Boolean, to ranked, to fuzzy search. However, each text index is defined over a single relational column. Hence, significant storage overhead is incurred, first by storing the plain text in a relational column, and again by the inverted index built by the text search tool. These tools offer powerful extensions to the traditional relational database, but do not address the full range of IR requirements. Their vendor-specific nature also means they are not portable solutions.
There has been research in the past decade investigating the use of relational databases to build inverted index-based information retrieval systems. There are several key advantages to such an approach. A pure relational implementation using standard SQL offers portability across multiple hardware platforms, OS, and database vendors. Such a system does not require software modification in order to scale on a parallel machine, as the DBMS takes care of data partitioning and parallel query processing. Use of a relational system enables searching over structured metadata in conjunction with traditional IR queries. The DBMS also provides features such as transactions, concurrent queries, and failure recovery.
Most of the previous works have picked one relational implementation and compared it with a special-purpose IR system. Some of them have focused on a particular advantage, such as scalability on a parallel cluster.
We propose a comprehensive evaluation of the alternative relational implementations of inverted text index that have been discussed in literature, with the special-purpose IR system Lucene being the baseline for comparison. We will evaluate the systems on Boolean queries, phrase queries, and relevance ranked queries, and benchmark their relative performance in terms of query response times.
In section 2, we discuss the related work in literature concerning implementing inverted index as relations and integrating IR and DBMS. In section 3, we present the baseline IR and alternative relational implementations of inverted index systems. In section 4, we review evaluation of these systems, our test dataset, and queries to be executed. Section 5 presents the relative performance results collected and our observations. Finally, in section 6, we present concluding remarks and future work.
2. RELATED WORK
Several works have picked a single relational implementation and compared its performance with a baseline special purpose IR system. Kaufmann et al [KS95] compared an IR system, BASISPlus, an early version of Oracle’s text search tool, SQL*TR, and a relational implementation of the inverted list with two relations, <term, docid> and <term, docfreq>. The evaluation dataset is a small 850,000 tuples in the <term, docid> inverted list. The queries are strictly conjunctive Boolean queries.
More recent works have shown that Boolean, proximity, and vector space ranked model searching can be effectively implemented as standard relations while offering satisfactory performance when compared to a baseline traditional IR system. Grossman et al [GFH97] demonstrates that relational implementations are effective for Boolean, proximity, and ranked queries. The relational model implemented using Microsoft SQL Server consists of doc_term table <docid, term, term freq>, doc_term_prox table <docid, term, position>, and idf table <term, idf>. The baseline IR system is Lotus Notes, which is a heavy weight system that is not built specifically for IR tasks. The authors also studied parallel performance on an AT&T 4-processor database machine.
Some works have focused on a single advantage of relational implementations over traditional IR inverted index. Grabs et al [GBS01] evaluated the performance and scalability of a database IR system on a parallel cluster. The system is implemented with BEA middleware over Oracle database, with significant emphasis on the transaction semantics to ensure high levels of search and insert parallelism. The basic data model is <term, docid>. Only Boolean queries performances are measured.
Brown et al [BCC94, BR95] demonstrated efficient inverted index implementation and fast incremental index update using a database system. However, their implementation used a persistent object store manager, which is beyond our scope of using the traditional relational model and off-the-shelf RDBMS.
A recent issue of IEEE Data Engineering Bulletin covered the work by major database vendors to integrate full text search functionality into the RDBMS. [MS01], [HN01], [DIX01] presented how IBM DB2, Microsoft SQL Server, and Oracle introduce text extensions that are tightly coupled with the database engine. However, as we discussed earlier, such an approach is limited in that each text index must be defined over a single column, and storing both the full text of the document in the database, as well as storing the inverted index on the side, incurs significant storage overhead.
3. SYSTEM IMPLEMENTATIONS
We evaluate four systems on information retrieval tasks. The baseline system is Lucene, a special-purpose IR search engine. The three relational designs are implemented using IBM DB2 Universal Database. The first relational approach uses the DB2 Text Information Extender to take care of all the indexing and query processing. The two remaining relational approaches implement the inverted index as relations, and transform keyword queries to standard SQL queries.
3.1. Lucene
Lucene is an open-source text search engine system under the Apache project. It is written entirely in Java. We chose this as our baseline system as it offers ease of deployment and a full feature set representative of a traditional IR system.
Lucene includes three key APIs, IndexWriter, IndexReader, and IndexSearcher. IndexWriter enables the user to construct an inverted text index over a corpus of documents. Indexing may be customized with parameters such as case folding, stemming, etc. The IndexReader allows the user to probe the contents of the inverted index, for example, enumerating all tokens in the index. This important aspect will be discussed in section 4.1. The IndexSearcher provides a rich set of search options, including Boolean, phrase, ranked, and fuzzy search queries.
With our corpus, we will pass one document at a time to our IndexWriter instance to be tokenized and indexed. The keys associated with document are the document ID and URL, which is just the document file name. At execution time, we use the appropriate method of the IndexSearcher instance to retrieve the relevant document URLs from the Lucene index.
By default, all Lucene search hits are returned with a ranked score. For our case, we only care about the ranking when we measure the ranked query performance. Lucene keyword queries are structured as a single search string. And queries prefix each keyword with a plus sign. Or queries are a space delimited list of keywords. Phrase queries are a space delimited list of keywords enclosed in quotes.
3.2. IBM DB2 Text Information Extender
IBM DB2 Text Information Extender (TIE) is a full-text search engine tightly coupled with the IBM DB2 Universal Database. It supports the creation of full-text indexes on textual DB2 table columns. TIE uses the table primary key to relate inverted index token entries to their original source tuple in the table. TIE is invoked as special function calls over columns, much like an user-defined function.
We create a three column relation named fulltext <docid, url, text>. The text column is of the type Binary Large Object (BLOB) and contains the full text of the document. Our simple parser creates a single large load file. DB2’s load utility batch loads the data into the relation. Then we invoke TIE to index the text column.
Boolean and phrase queries use the contains function. Ranked queries also use the score function.
And query:
SELECT url
FROM full
WHERE contains (text, ‘”keyword1” & “keyword2” & “keyword3”’)=1
Or query:
SELECT url
FROM full
WHERE contains (text, ‘”keyword1” | “keyword2” | “keyword3”’)=1
Phrase query:
SELECT url
FROM full
WHERE contains (text, ‘”keyword1 keyword2 keyword3”’)=1
Ranked query:
WITH temptable (url, score) AS
(SELECT url, score(text, ‘“keyword1” & “keyword2” & “keyword3”’) FROM fulltext)
SELECT url
FROM temptable
WHERE score>0
ORDER BY score DESC
3.3. Term - Document
In this representation, each tuple corresponds to a term/document pair. The relations are:
tf <term, docid, termfreq, positionlist>
idf<term, idf>
url<url, docid>
The bulk load files are created by invoking the Lucene IndexReader to probe the Lucene text index over our corpus. We will discuss this aspect further in sec. 4.
The positionlist attribute of the tf relation is an offset encoded list of all occurrence positions of the given term in the given document. For example, term “hello” appearing in document 2 at positions 10, 100, 102 would be encoded as the tuple
<”hello”, 2, “10,90,12”>
This representation is more compact compared to the Term-Document-Position approach and is well suited for Boolean and ranked queries. In the case of phrase or positional queries, we implement application logic to merge position lists.
There are two alternatives to implement an AND query in SQL. The natural choice is to translate an N-word query into an N-way equi-join on the document ID, where each join relation has been filtered to select documents containing one of the words.
SELECT u.url
FROM url u, tf t1, tf t2
WHERE u.docid=t1.docid AND t1.docid=t2.docid AND t1.term=’keyword1’ AND t2.term=’keyword2’
Figure 1. Query Plan for equi-join And query
An alternative due to Grossman [GFH97] treats the query keywords as an artificial relation, and joins it with the term-document relation. The result is subject to group by aggregation by document and only documents having the correct number of keyword matches are preserved.
WITH query(term)
AS (values ('keyword1'), ('keyword2'))
SELECT u.url FROM url u, tf d, query q
WHERE u.docid=d.docid AND d.term=q.term
GROUP BY u.url
HAVING count(d.term)=2
However, upon further evaluation (results not shown), we discovered that the second implementation never outperforms the first, and in some cases performs over an order of magnitude worse. Hence, from now on, all reference to And type queries on both the Term-Doc and Term-Doc-Position approaches refer to the first implementation.
Figure 2. Query Plan for alternative And query
The OR query is a simple selection with multiple Or filters.
SELECT DISTINCT(u.url)
FROM url u, tf t
WHERE u.docid=t.docid AND
( t.term='keyword1' OR t.term='keyword2' )
For phrase queries, we retrieve all candidate documents, and the position list for all the keywords in the query. Candidate documents are the results of the AND query with the same keywords. The result relation looks like <docid, position list for first keyword, position list for second keyword, … >.
SELECT u.url, t1.positionlist, t2.positionlist
FROM url u, tf t1, tf t2
WHERE u.docid=t1.docid AND t1.docid=t2.docid AND t1.term=’keyword1’ and t2.termid=’keyword2’
The application logic then traverses the multiple position lists together to find an instance where the positions in each list are one apart, in order.
Our ranked query implementation is also due to Grossman [GFH97]. First, we precompute term IDF values as log ( number of documents / document frequency of the term ). The relevance of a given document d is the summation over all terms t occurring in both the query and the document:
∑ (query.termfreq for t * t’s IDF * d.termfreq for t * t’s IDF).
WITH query(term, tf)
AS (values ('keyword1',1),('keyword2',1))
SELECT u.url, SUM (q.tf * i.idf * d.freq * i.idf) as score
FROM url u, query q, tf d, idf i
WHERE u.docid=d.docid AND q.term = i.term AND d.term = i.term
GROUP BY u.url
ORDER BY score DESC
3.4. Term - Document - Position
The term-document-position approach stores a single tuple for every single occurrence of a term in a document. Hence the term t appears 5 times in document d corresponds to five distinct tuples. The relations are:
posting <term, docid, position>
idf<term, idf>
url<url, docid>
For Boolean and ranked queries, this representation is redundant, compared to the term-document approach. At query time, we must insert the distinct operator in our query plan to eliminate duplicates in our join results. However, this representation leads to straightforward SQL translation of phrase and proximity queries. There is no need for application logic or custom user-defined functions to post-process position lists for positional matches. Positional matches are specified as SQL arithmetic predicates relating the position attributes.
The Boolean queries are very similar to the Term-Document approach, except for the addition of distinct operators.
Equi-join And:
SELECT DISTINCT(u.url)
FROM url u, posting t1, posting t2
WHERE u.docid=t1.docid AND t1.docid=t2.docid AND t1.term='keyword1' AND t2.term='keyword2'
Or:
SELECT DISTINCT(u.url)
FROM url u, posting d
WHERE u.docid=d.docid AND
( d.term='keyword1' OR d.term='keyword2' )
Phrase queries:
SELECT distinct(u.url)
FROM url u, posting t1, posting t2
WHERE t1.docid=u.docid AND t1.docid=t2.docid AND t1.term='keyword1' AND t2.term='keyword2' AND t2.position=t1.position+1
Ranked queries:
WITH query(term, tf)
AS (values ('keyword1',1),('keyword2',1)) SELECT u.url, SUM (q.tf * i.idf * i.idf) as score
FROM url u, query q, posting d, idf i
WHERE u.docid=d.docid AND q.term = i.term AND d.term = i.term
GROUP BY u.url
ORDER BY score DESC
4. SYSTEM EVALUATION
The systems are implemented on a Pentium III 800MHz workstation with 1GB of RAM running Windows 2000. The baseline IR system is Lucene version 1.2 running on JDK 1.3. The relational database is IBM DB2 UDB Enterprise Edition version 7.2. Our first relational approach uses the IBM DB2 Text Information Extender version 7.2.