Improving Intranet Search with Database-backed Technology
Omar Alonso, Oracle
Introduction
The combination of web information retrieval techniques with database technology can dramatically improve intranet searching. Techniques like classification, clustering, and the ability to acquire more knowledge about users, queries, and the collections can enhance search quality. Other methods, such as web link analysis and information visualization, can also improve quality as well as the overall user search experience.
Current Problems with Intranet Search
The enterprise intranet is a very different information space from typical internet websites. It not only differs in terms of content but also in the type of users, tasks, and quality of information.
- Users are different. Intranet users are generally all employees of a single company doing their daily work (as opposed to users who may not be doing work-related tasks).
- Tasks are different. Intranet users are looking for information to do their jobs. Usually they need the information for a concrete reason (to solve a customer issue, close a deal, etc.).
- Amount and quality of information are different. There is a large volume of content in different formats and from different data sources. The intranet often features information of very good quality; sometimes that information is a work in progress.
Searching also differs in a number of ways. There is a mix of structured, unstructured, and semi-structured data scattered in different places and systems.
We can identify three main problems with intranet search:
- Multiple repositories: there are different data sources (websites, files, email, etc.). Users expect a single search engine, not one per repository.
- Performance: users expect sub-second query response time no matter how many repositories are available.
- Quality of results: users expect good-quality results, not thousands of hits that they have to filter.
Also, given the current popularity of some web search engines, users expect enterprise intranet search to behave like their favorite web search engine. Users expect a simple user interface that requires minimal training.
Personalization is also a major issue. For example, a search application should distinguish between an employee who is part of the sales force and a consultant who needs different information.
The Oracle solution
Oracle offers a complete technology stack for content search, organization, and presentation. There are two aspects to this solution. The first is a comprehensive information retrieval API called Oracle Text that allows developers to build any kind of search application. The second is an out-of-the-box application for enterprise intranet search.
Oracle Text
Oracle Text uses standard SQL to index, search, and analyze text and documents stored in the Oracle database, in files, and on the web. Oracle Text can perform linguistic analysis on documents, as well as search text using a variety of strategies including keyword searching, context queries, Boolean operations, pattern matching, mixed thematic queries, HTML/XML section searching, and so on. It can render search results in various formats including unformatted text, HTML with term highlighting, and original document format. Oracle Text supports multiple languages and uses advanced relevance-ranking technology to improve search quality. Oracle Text also offers advanced features like classification, clustering, and support for information visualization metaphors.
Oracle Ultra Search
Ultra Search can be used to search across Collaboration Suite components, corporate web servers, databases, email servers, file servers, and Oracle Application Server 10g Portal instances. Ultra Search is based on Oracle Text technology and is an out-of-the-box solution that requires no SQL coding. It uses a crawler to index documents; the documents stay in their own repositories, and the crawled information is used to build an index that stays within your firewall in an Oracle database.
Let's see how we can solve these problems using Oracle.
Quality
There are four techniques that we can use to improve the quality of search results:
- Spelling correction
- Link awareness
- Duplicate elimination
- KWIC
The spell-checker component is straightforward. The user types a query and, before issuing the search, the system spell-checks the entire phrase. The component has a large dictionary that is also extensible.
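To make the idea concrete, here is a minimal sketch of query spell-checking against an extensible dictionary. It is not the Oracle component: the word list, the `add_terms` helper, and the similarity cutoff are all assumptions chosen for illustration, and the matching uses Python's standard `difflib` module.

```python
import difflib

# Hypothetical dictionary; a real deployment would load a much larger word list.
DICTIONARY = {"oracle", "database", "download", "search", "intranet"}


def add_terms(dictionary, terms):
    """Extend the dictionary, for example with company-specific product names."""
    dictionary.update(term.lower() for term in terms)


def spellcheck_query(query, dictionary, cutoff=0.8):
    """Replace each unknown word in the query with its closest dictionary word."""
    corrected = []
    for word in query.lower().split():
        if word in dictionary:
            corrected.append(word)
            continue
        # sorted() keeps the candidate order deterministic.
        matches = difflib.get_close_matches(word, sorted(dictionary), n=1, cutoff=cutoff)
        # Leave the word untouched when nothing in the dictionary is close enough.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


add_terms(DICTIONARY, ["JDeveloper"])
print(spellcheck_query("jdevelper downlod", DICTIONARY))  # jdeveloper download
```

Checking the whole phrase word by word before the search is issued mirrors the description above; extending the dictionary is what lets product names survive the check.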
The link awareness technique is very useful for reflecting specific characteristics of documents such as links, anchor text, and title information. There are two kinds of link analysis: traditional static link-based analysis and query hit-list link analysis. Oracle uses a combination of both strategies that suits the intranet search topology.
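The paper does not describe how the two strategies are combined, so the following is only a sketch under stated assumptions: the static score is the number of inbound links over the whole graph, the hit-list score counts only links between pages in the current result list, and the two are mixed with hypothetical weights. The link graph and page names are invented.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["hr", "sales", "it"],
    "hr": ["benefits", "home"],
    "sales": ["pricing", "home"],
    "it": ["home", "benefits"],
    "benefits": [],
    "pricing": [],
}


def inbound_counts(links, restrict_to=None):
    """Count inbound links per page; optionally count only links inside restrict_to."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        if restrict_to is not None and source not in restrict_to:
            continue
        for target in targets:
            if restrict_to is None or target in restrict_to:
                counts[target] = counts.get(target, 0) + 1
    return counts


def rerank(hits, links, static_weight=0.5, hitlist_weight=0.5):
    """Order hits by a weighted mix of static and hit-list inbound link counts."""
    static = inbound_counts(links)
    local = inbound_counts(links, restrict_to=set(hits))

    def score(page):
        return static_weight * static.get(page, 0) + hitlist_weight * local.get(page, 0)

    # Ties are broken alphabetically so the order is reproducible.
    return sorted(hits, key=lambda page: (-score(page), page))


print(rerank(["pricing", "benefits", "hr", "it"], LINKS))
```

In this toy graph the "benefits" page rises to the top because it is linked both from the whole site and from other pages in the hit list.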
Duplicate content is a very common problem in intranets. Often, there are several copies of the same document or web page in different places. The idea of duplicate elimination is to remove URLs with duplicate and near-duplicate content.
The keyword in context (KWIC) feature has become an easy way to get an idea of what a document or web page is about without clicking on the link. Before KWIC, it was common to see the first eight characters of a page, but sometimes that information was misleading. Figure 1 shows a screenshot of the KWIC component.
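One common way to detect near-duplicates, shown here purely as an illustration and not as the method Oracle uses, is to compare sets of word shingles with Jaccard similarity. The URLs, the shingle size, and the 0.8 threshold are assumptions.

```python
def shingles(text, size=3):
    """Return the set of word n-grams (shingles) in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def remove_near_duplicates(pages, threshold=0.8):
    """Keep the first URL of each group of pages whose shingle sets overlap enough."""
    kept = []  # list of (url, shingle set) for pages kept so far
    for url, text in pages:
        signature = shingles(text)
        if all(jaccard(signature, other) < threshold for _, other in kept):
            kept.append((url, signature))
    return [url for url, _ in kept]


# Hypothetical crawl results.
PAGES = [
    ("http://a.example/policy", "travel expenses must be approved by your manager before booking"),
    ("http://b.example/policy-copy", "Travel expenses must be approved by your manager before booking flights"),
    ("http://c.example/menu", "the cafeteria menu changes every monday"),
]
print(remove_near_duplicates(PAGES))
```

The second page differs from the first by one trailing word, so it falls above the threshold and is dropped; the third page shares no shingles and is kept. Comparing every page against every kept page is quadratic, which is fine for a sketch but not for a large crawl.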
Figure 1. KWIC in action
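A KWIC snippet can be sketched in a few lines. The function below is an illustrative assumption, not the Ultra Search implementation: it finds the first query term in the text and returns a fixed window of words around it.

```python
import re


def kwic(text, query, window=5):
    """Return up to `window` words on each side of the first query term found."""
    words = text.split()
    terms = {term.lower() for term in query.split()}
    for i, word in enumerate(words):
        # Strip punctuation before comparing, so "download." matches "download".
        if re.sub(r"\W+", "", word).lower() in terms:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # No query term in the text: fall back to the opening words.
    return " ".join(words[:2 * window + 1])


TEXT = (
    "Oracle JDeveloper is a free integrated development environment. "
    "You can download it from the Oracle Technology Network site."
)
print(kwic(TEXT, "download", window=3))
# ... environment. You can download it from the ...
```

Showing the words around the match, rather than the opening of the page, is what makes the snippet a reliable hint of relevance.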
Classification and clustering can also help the organization and presentation of search results. We discuss both techniques in the advanced features section.
Performance
As part of Oracle Database 10g, Oracle Text transparently integrates with and benefits from a number of key enterprise features such as:
- Data partitioning (for higher throughput and availability)
- Real Application Clusters (for the highest server scalability)
- Query optimization (to ensure the best response time, not only for pure text queries, but also “mixed” queries that combine text search with structured database search)
These aspects of integration are also greatly beneficial to system and database administrators, who do not have to undergo a paradigm shift to learn to manage an organization’s text assets.
Common and Rare Queries
Typically we can say that 80% of queries are common queries and 20% are less frequent or rare queries. Given this breakdown, it makes sense to have two separate indexes. In the first one we index only the URL and the title of the web page. The idea here is to exploit the structure of the web page. For the second index we use a more traditional approach: indexing the full content of the web page.
There are a number of advantages to this approach. Overall system performance improves because the first index is smaller and answers most of the queries, while the second index is bigger and answers the rare queries. Both indexes can be managed separately.
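The two-index idea can be sketched with plain in-memory inverted indexes. This is an illustration only; in practice the indexes would be Oracle Text indexes, and the documents, field names, and fallback rule (consult the full index only when the small one returns nothing) are assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, fields):
    """Build an inverted index (term -> set of URLs) over the chosen fields."""
    index = defaultdict(set)
    for doc in docs:
        for field in fields:
            for term in tokenize(doc[field]):
                index[term].add(doc["url"])
    return index


def lookup(index, query):
    """Return the URLs that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def search(query, small_index, full_index):
    """Try the small URL/title index first; fall back to the full-content index."""
    hits = lookup(small_index, query)
    if hits:
        return hits, "title"
    return lookup(full_index, query), "full"


# Hypothetical crawled pages.
DOCS = [
    {"url": "http://intranet.example/hr/benefits", "title": "Employee benefits overview",
     "body": "Dental and vision plans are described here."},
    {"url": "http://intranet.example/it/vpn", "title": "VPN setup guide",
     "body": "Install the client and connect."},
]
SMALL = build_index(DOCS, ["url", "title"])
FULL = build_index(DOCS, ["url", "title", "body"])
print(search("benefits", SMALL, FULL))
print(search("dental plans", SMALL, FULL))
```

The common query "benefits" is answered from the small index alone, while the rarer "dental plans" falls through to the full-content index.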
Query Relaxation
Query relaxation enables your application to execute the most restrictive version of a query first, progressively relaxing the query until the required number of hits is obtained. For example, we first search for the phrase ‘JDeveloper download’ and, if that does not return enough hits, relax the query to ‘JDeveloper NEAR download’ to obtain more.
Query relaxation is most effective when the application needs the top N hits to a query. Using this technique is more efficient than re-executing a query.
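The control flow of query relaxation can be sketched as an ordered list of matchers, strictest first. This is not how Oracle Text implements the feature; the three matchers (phrase, proximity, all terms), the proximity window, and the sample documents are assumptions chosen to mirror the ‘JDeveloper download’ example.

```python
def tokens(text):
    return text.lower().split()


def phrase_match(doc, terms):
    """True if the terms occur consecutively, in order."""
    words = tokens(doc)
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))


def near_match(doc, terms, max_span=10):
    """True if all terms occur within some window of max_span words."""
    words = tokens(doc)
    return any(
        all(term in words[i:i + max_span] for term in terms)
        for i in range(len(words))
    )


def all_terms_match(doc, terms):
    """True if every term occurs anywhere in the document."""
    words = set(tokens(doc))
    return all(term in words for term in terms)


# Strictest matcher first; each later step relaxes the previous one.
RELAXATION_STEPS = [phrase_match, near_match, all_terms_match]


def relaxed_search(docs, query, wanted):
    """Collect hits step by step and stop as soon as `wanted` hits are found."""
    terms = tokens(query)
    hits = []
    if not terms:
        return hits
    for matcher in RELAXATION_STEPS:
        for doc_id, text in docs.items():
            if doc_id not in hits and matcher(text, terms):
                hits.append(doc_id)
                if len(hits) == wanted:
                    return hits
    return hits


# Hypothetical documents.
DOCS = {
    "d1": "jdeveloper download page for all platforms",
    "d2": "download the latest jdeveloper release here",
    "d3": "jdeveloper tutorials and samples",
}
print(relaxed_search(DOCS, "JDeveloper download", wanted=2))  # ['d1', 'd2']
```

Because hits from the stricter step come first and the loop stops once enough are collected, the looser matchers run only when they are needed, which is the property that makes relaxation attractive for top-N queries.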
Ease of use
Users want a simple, easy-to-use search interface that is very similar to existing Internet search engines. Our approach is to hide all the complexity of the search engine and expose only a typical web search interface, letting the user discover the power of the engine.
Ultra Search is an out-of-the-box search solution that provides search capabilities across multiple repositories: Oracle databases, IMAP email servers, websites, files on disk, and much more. It uses a crawler to index documents; the documents stay in their own repositories, while the crawled information is used to build an index that stays within your firewall in an Oracle Database 10g.
The interface has two modes: basic search and advanced search. In the basic search a simple input box is presented. The search results are presented sorted by relevance. The advanced search mode offers more control over the collection. Figure 2 shows Ultra Search in the context of the Oracle Collaboration Suite.
Figure 2. Ultra Search results from different data sources.
Personalization
It is very useful to know the users of your search engine. It is also very important to know what your users' search patterns are. With all that information stored in the database, you can mine for specific nuggets of information that can improve the overall quality of the search results.
Query log analysis
This feature enables you to create a log of queries and to analyze the queries it contains. With query analysis, you can find out interesting information like:
- Which queries were made?
- Which queries were successful?
- Which queries were unsuccessful?
- How many times was each query made?
Not only can we learn a lot about search patterns, but we can also feed the acquired information back into other components. For example, the spell checker can “learn” new terminology from certain query logs.
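The questions above can be answered with a simple aggregation over the log. The sketch below is illustrative: the log format (query text plus hit count) and the definition of a successful query as one that returned at least one hit are assumptions, not the format of the Oracle query log.

```python
from collections import Counter

# Hypothetical log entries: (query text, number of hits returned).
QUERY_LOG = [
    ("jdeveloper download", 12),
    ("expense report", 0),
    ("jdeveloper download", 12),
    ("vacation policy", 3),
    ("expense report", 0),
    ("expence report", 0),
]


def analyze(log):
    """Summarize which queries were made, how often, and which returned no hits."""
    counts = Counter(query for query, _ in log)
    successful = {query for query, hits in log if hits > 0}
    unsuccessful = set(counts) - successful
    return {"counts": counts, "successful": successful, "unsuccessful": unsuccessful}


report = analyze(QUERY_LOG)
print(report["counts"].most_common(1))  # [('jdeveloper download', 2)]
print(sorted(report["unsuccessful"]))   # ['expence report', 'expense report']
```

The unsuccessful set is the interesting output: a frequent query with no hits points either to missing content or, as with "expence report", to a misspelling that the spell checker could learn to correct.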
Advanced features
In this section we present some of the more advanced features of the latest release.
Classification
Oracle offers two ways of classifying content:
- Rule-based classification: you group the documents in the collection together, decide on categories, and formulate the rules that define each category. The content is then classified according to those rules.
- Supervised classification: instead of writing the rules by hand, you use a training set to automate the rule-writing process.
We can group a number of categories into a new entity called a taxonomy. An enterprise taxonomy should include categorizations of different information assets across multiple organizations. A taxonomy also defines a common vocabulary and can serve as a navigational aid when browsing for content.
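As a rough illustration of the rule-based variant, the sketch below assigns a document to every category that has at least one rule whose terms all occur in the text. The categories, the rules, and the "all terms present" rule format are hypothetical; real rules would be written in the search engine's own query language.

```python
# Hypothetical rules: category -> alternative sets of terms that must all appear.
RULES = {
    "database": [{"sql", "index"}, {"tablespace"}],
    "networking": [{"router"}, {"firewall", "port"}],
}


def classify(text, rules):
    """Return the sorted categories with at least one rule fully matched by the text."""
    words = set(text.lower().split())
    return sorted(
        category
        for category, alternatives in rules.items()
        if any(required <= words for required in alternatives)
    )


print(classify("Create an index with SQL on the firewall port table", RULES))
# ['database', 'networking']
```

A document can fall into several categories at once, which is usually what a taxonomy-driven browse interface needs. Supervised classification would replace the hand-written `RULES` with rules derived from a training set.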
Figure 3. Categories for the topic “database”.
Clustering
It is also possible to use an unsupervised classification technique that will automatically group all the documents by categories. This approach is very useful when no taxonomy is in place or when you are trying to discover certain groups within the collection. You can use some of the clustering output to define categories. We believe this is an iterative process.
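To show what unsupervised grouping looks like, here is a deliberately simple single-pass clustering sketch. It is not the algorithm used by Oracle Text: the word-overlap similarity, the 0.3 threshold, and the sample documents are assumptions.

```python
def word_set(text):
    return set(text.lower().split())


def similarity(a, b):
    """Jaccard similarity between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.3):
    """Single pass: join the first cluster whose seed is similar enough, else start a new one."""
    clusters = []  # list of (seed word set, [document ids])
    for doc_id, text in docs.items():
        words = word_set(text)
        for seed, members in clusters:
            if similarity(words, seed) >= threshold:
                members.append(doc_id)
                break
        else:
            # No existing cluster was close enough, so this document seeds a new one.
            clusters.append((words, [doc_id]))
    return [members for _, members in clusters]


# Hypothetical document titles.
DOCS = {
    "a": "oracle database backup guide",
    "b": "oracle database recovery guide",
    "c": "sales forecast spreadsheet",
    "d": "quarterly sales forecast",
}
print(cluster(DOCS))  # [['a', 'b'], ['c', 'd']]
```

The resulting groups have no names; inspecting them and naming the useful ones is one way to turn clustering output into categories, which is why the process tends to be iterative.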
Information Visualization
Information visualization is defined as “visual representation of abstract data to amplify cognition.” In the context of vast amounts of information, visualization techniques can help users navigate through large data sets of documents as well as aid them in selecting appropriate assets.
A number of visualization metaphors are available out of the box, such as StretchViewer (Figure 5) and those based on the Oracle Interactive Viewer (Figure 4). The Interactive Viewer is a high-performance Java visualization package designed to visualize and navigate complex data relationships in hierarchical, network, and relational data.
Figure 4. Visualization of document themes with Interactive Viewer.
Figure 5. StretchViewer visualization of MeSH categories.
Conclusions
Oracle Database 10g provides a complete solution for enterprise intranet search. The solution is divided into two approaches: the Oracle Text search API and the easy-to-use Ultra Search application.
Oracle Text enables application developers to transparently include powerful text searching capabilities in their applications using any programming language. It makes all the normal benefits of an industrial-strength database available, without the cost of learning and supporting extra APIs and duplicated data.
Ultra Search is an easy-to-use application that requires no coding and only minimal administration to manage an intranet search engine.
The ability to find documents based on their textual content, metadata, or attributes makes the Oracle Database the single point of integration for all data management.
Paper 40185