Provides a Comprehensive Suite of Text Analysis and Text Search Tools

A REPORT.

CHAITHANYA KADIYALA.

MAHESH BOMMIREDDY.

ABSTRACT

Highlights

Provides a comprehensive suite of text analysis and text search tools

Turns unstructured information extracted from workgroup applications and large corporate solutions into business knowledge

Includes components for building scalable knowledge management, text mining and text search applications

------

IBM® Intelligent Miner™ for Text offers system integrators, solution providers and corporate application developers a set of powerful tools to enrich business intelligence solutions. These include:

Text analysis tools

Full-text search engine

Web crawler tools

A Web search solution

With these tools you can build a wide variety of applications. For example, you can categorize information from news feeds; analyze patent portfolios, customer complaint letters and competitors' Web pages; or turn your corporate intranet or your Web site into a warehouse of knowledge.

The text analysis tools

Language identification

Automatically identifies the language of a document for expedited processing.

Clustering

Groups related documents based on their content, without requiring predefined classes.

Categorization

Assigns documents to one or more user-defined categories.

Summarization

Extracts sentences from a document to create a document summary.

Feature extraction

Recognizes significant items in text, such as names, technical terms and abbreviations.

Together, these tools can extract knowledge from unstructured information, classify incoming e-mail according to predefined topics in log files, or automatically detect deadlines.

The full-text search engine

This scalable search engine allows for advanced query enhancement and result preparation to enable high-quality information retrieval. It can perform in-depth document analysis during indexing. The engine features client/server handling, linguistic support for different languages and document analysis tools.

In addition, the search engine features an online update mechanism that lets you search while the index is being updated.

You can integrate the search engine with any document management system and use third-party tools to support arbitrary document input formats.

Linguistic analysis functionality is provided for documents in 21 single-byte character set and bi-directional languages. The search engine supports a wide variety of advanced search paradigms, including Boolean queries, free-text queries, fuzzy searches and synonym searches. For Chinese, Japanese and Korean documents, it supports Boolean queries, precise term searches and fuzzy searches.

For English documents, the engine also clusters search results, refines queries based on user-assigned relevance and creates a feature index that helps you find names, locations or terms.

Other features include a sample Java™ graphical user interface (GUI), which enables easy access to the search engine, and two customizable English-language JavaBeans GUIs, which expedite development of search and administration functions.

The Web crawler tools

The Intelligent Miner for Text Web crawler and Web crawler toolkit allow you to leverage the Internet and intranets to gain access to relevant information and support your e-business. The Web crawler monitors a user-defined set of URLs. A ready-to-run generic Web crawler is provided as an example. The Web crawler toolkit lets you develop your own Web crawler, tailor it to your own requirements and bind it to your own applications.
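To make the idea of crawling within a user-defined set of URLs concrete, the following Python sketch extracts the links from one page and keeps only those inside an allowed scope. It is an illustration built on the Python standard library, not the Web crawler toolkit's interface; the names LinkCollector, links_in_scope and allowed_prefixes are hypothetical, and the page is passed in as a string so that no network access is needed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute target of every <a href="..."> on a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def links_in_scope(html: str, base_url: str, allowed_prefixes: list[str]) -> list[str]:
    """Return the page's links that fall inside the user-defined URL set."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [url for url in collector.links
            if any(url.startswith(prefix) for prefix in allowed_prefixes)]
```

A real crawler would add fetching, revisit scheduling and duplicate handling on top of this step; the sketch covers only the scoping decision.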

The Web search solution

This powerful Internet/intranet text search solution combines the functions of the search engine and the Web crawler. Simply specify the area of the Web that you want to search. The Web crawler gathers the pages, and the search engine indexes them, enabling subsequent searches on those pages. A search form and an associated Common Gateway Interface script allow you to define your own queries through a Web browser and determine the presentation of the results. A ready-to-use implementation of the Web search solution lets you get started immediately.

Prerequisites

Intelligent Miner for Text requires AIX® V4.3 or higher, Microsoft Windows NT V4.0 and Service Pack 3, Sun Solaris V2.5.1, or OS/390 V2.4-2.6.

The OS/390® version requires the Text Search component of the OS/390 operating system, which is available at no cost for downloading. The OS/390 version also requires a Web server and IBM DB2® V5.1.

The Web search solution requires an Internet connection server, such as Apache, or Web servers from IBM, Lotus, Microsoft or Netscape.

NEED FOR INTELLIGENT MINER FOR TEXT :

Most organisations have a large number of online documents, such as electronic mail, intranet documents, technical reports and news wires, that contain information of great potential value.

Often this information is only implied in the documents, and finding it requires a high degree of computer intelligence. Hence the mining metaphor, which suggests extracting buried information.

With Intelligent Miner for Text, you have an intelligent partner at your side and a means to automate information-gathering tasks that could previously be done only manually.

------

You're a giant utility, Electricité de France, and a big booster of electric cars - so how do you find out what your compatriots are saying about you?

If you're into counting bits, some 80 percent of the world's electronic data is in text form. Call it rich text, freeform text or just plain "words," text has been something of a second-class citizen of the information age. Sure, computers have processed text, published it, searched it and stored it. But until recently, they couldn't query and analyse texts in large, structured databases, as they routinely do with the alphanumeric strings of, say, millions of customer records. Text was the essay question in a world that preferred multiple choice exams - because that was the only kind of information a computer could conveniently process.

Now all that is beginning to change, thanks to text-mining solutions like IBM's Discursus, which incorporates 20 years of research in machine reading and linguistic analysis.

Consider the challenge facing Gérald Piat, a research engineer at Electricité de France (EDF): finding out how the press views the electric car as an emerging means of transportation - something the giant utility is promoting to save energy and curb pollution. He could hire a small army of analysts to read the millions of words that have been written on the subject - or he could call in Charles Huot, IBM's segment manager for text mining at the European Center for Applied Mathematics in Paris.

Huot and an IBM team of linguists and developers are working with Piat to perfect Discursus, which will let EDF sample public opinion without ever conducting interviews or poring through thousands of documents. The solution goes out over the Internet and downloads tens of thousands of press clips, which are then linguistically tagged, parsed for meaning, grouped in thematic clusters and analyzed for their interrelationships - all without human intervention. People skilled in the very French discipline of semiotics - inferring hidden meanings in the way words, concepts and symbols are used - then interpret the results and devise an appropriate media plan.

"Some customers couldn't believe that we could analyze 10,000 customer letters, newspaper articles or patent filings - in fact, any machine-readable text - in a matter of hours," says Huot. "They thought we had hundreds of analysts hidden in a back room somewhere." EDF's Piat knows better. The collaboration with IBM has identified a subtle shift in public perceptions of the electric car - from a pricey, experimental oddity, suited mainly for corporate fleets, to an affordable, environmentally friendly option as a second family car.

Banks and pharmaceutical companies, among others, are using the technology to scan thousands of customer responses - often stored and forgotten in the past - even routing the most important ones automatically for handling according to their content. And IBM is using text mining internally on the thousands of customer comments that flow in as part of our Customer Relationship Management processes - and to analyze write-in comments to the Employee Opinion Survey.

Mining on the Net

As computing moves increasingly to the network, the future of IBM's business intelligence and network computing initiatives will become closely intertwined. The Net is already opening up vast new sources of content for analysis - everything from press archives for text-mining applications to census and other public databases against which private results can be compared (e.g., how do our sales stack up against the national average?).

Business intelligence solutions are also being hosted on the web, making them available by subscription to smaller firms that may want to tap the power of data warehousing and data mining without investing in their own infrastructures. Elements of IBM's Intelligent Miner offering have been web-enabled in this way - letting customers upload their data for analysis one day and download the results the next.

And increasing numbers of users are being given web access to business intelligence solutions, letting them access results anywhere using a standard browser.

"Companies are finding they want tens of thousands of people to benefit from this information, tapping into systems that were originally designed for 30 to 40," says Ben Barnes, general manager, IBM Global Business Intelligence Solutions. As technology's price/performance continues to improve, we'll see a democratization of this knowledge tool beyond data analysts and top executives to store managers, marketing reps and customer service representatives. META Group estimates that the roughly 400,000 people accessing data warehouses today will mushroom to 10 million by 2000 - most of them connected via the web.

WHAT IS INTELLIGENT MINER FOR TEXT :

Intelligent Miner for Text is one of two products in the Intelligent Miner family. The other product in the family is Intelligent Miner for Data, whose data mining ability is focused on structured data in tables and in conventional databases, rather than on unstructured text in documents.

Scenarios of Intelligent Miner for Text at work

You can use the Intelligent Miner for Text tools individually or in many different combinations to create your own tailored text mining solutions. This chapter presents scenarios for using the tools individually. Each scenario makes use of one of the Intelligent Miner for Text tools. As you read through these scenarios, you will not only recognize areas in which you expected Intelligent Miner for Text could help you, but perhaps also discover tasks automated by the tools that you thought could only be done manually.

Assigning documents to predefined categories

Daily you receive many e-mail messages from customers regarding a variety of products. You want to categorize the messages by product and send them to the appropriate product representative. Because the messages are unstructured, the only way you could determine the relevant product in the past was to open the message and read it.

The tool to use

The Topic Categorization tool can help you to automatically forward each e-mail to the appropriate product representative. It analyzes documents, in this case the e-mails, and automatically determines the category or categories a document belongs to according to a predefined category system. The result of the categorization is a list of category names and confidence levels for each document.

A training tool helps you to define categories and build reliable categorizers for many applications.
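The shape of a categorization result, a list of category names with confidence levels, can be illustrated with a small Python sketch. It assumes a hand-written keyword list per category and a simple share-of-hits score; the category names, the CATEGORY_KEYWORDS table and the categorize function are hypothetical, and the Topic Categorization tool builds its categorizers from training documents rather than from keyword lists like these.

```python
import re
from collections import Counter

# Hypothetical category system: each product category is described by a
# few cue words. A trained categorizer would learn these from examples.
CATEGORY_KEYWORDS = {
    "printers": {"printer", "toner", "paper", "print"},
    "laptops": {"laptop", "battery", "screen", "keyboard"},
    "networking": {"router", "network", "wifi", "cable"},
}


def categorize(message: str, threshold: float = 0.2) -> list[tuple[str, float]]:
    """Return (category, confidence) pairs, highest confidence first.

    Confidence here is the category's share of all keyword hits in the
    message, which is only a stand-in for a real confidence level.
    """
    words = Counter(re.findall(r"[a-z]+", message.lower()))
    hits = {cat: sum(words[w] for w in cues)
            for cat, cues in CATEGORY_KEYWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        return []  # no category applies
    scored = [(cat, n / total) for cat, n in hits.items()
              if n / total >= threshold]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A message can belong to more than one category, so a routing step might forward it to every representative whose category clears the threshold.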

Other applications

Documents on an intranet might be divided into categories, such as Personnel Policy, Lotus Notes Information, or Computer Information. It would be impractical to catalog the million or so documents on a large intranet manually. By using automatic categorization, documents can be assigned to an organization scheme. This makes documents easier to find by browsing, or by restricting the scope of a text search.

Dividing documents into groups

Whenever a search query returns a large set of qualifying documents, it is important to provide an overview of the result. Usually, documents are returned in a ranked list, that is, they are ordered with respect to their relevance to the query. However, if the query itself was rather vague, this does not help a lot.

The tool to use

A better way to present the result is to group the results into sets of related documents. This makes the understanding of the search results easier, because examination of a single document of a group might reveal whether a certain set is worth further examination. Clustering can be used to achieve this goal.

The Clustering tools ease the process of browsing to find similar or related information. They find the key concepts in a set of documents and automatically group together, or "cluster" documents that contain similar concepts. The clusters are created dynamically without requiring predefined classes. The Clustering tools generate cluster titles that are short lists of relevant phrases, characteristic for the documents contained in the cluster.

There are two different clustering techniques you can use to group your documents. Together, they can give you different insights into your document collection.
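As an illustration of grouping documents without predefined classes, the Python sketch below uses a greedy single-pass scheme: each document joins the first cluster whose seed document shares enough terms with it, and otherwise starts a new cluster. This is an assumed, simplified technique and not necessarily either of the two the product offers; the stopword list, the threshold and all function names are hypothetical.

```python
import re
from collections import Counter

STOPWORDS = {"a", "and", "for", "in", "of", "on", "the", "to"}


def terms(doc: str) -> set[str]:
    """Lowercased content words of a document."""
    return {w for w in re.findall(r"[a-z]+", doc.lower()) if w not in STOPWORDS}


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of terms the two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs: list[str], threshold: float = 0.25) -> list[list[int]]:
    """Group document indexes; a document joins the first similar cluster."""
    clusters: list[list[int]] = []
    seeds: list[set[str]] = []
    for i, doc in enumerate(docs):
        doc_terms = terms(doc)
        for members, seed in zip(clusters, seeds):
            if jaccard(doc_terms, seed) >= threshold:
                members.append(i)
                break
        else:  # no existing cluster was similar enough
            clusters.append([i])
            seeds.append(doc_terms)
    return clusters


def cluster_title(docs: list[str], members: list[int], size: int = 3) -> list[str]:
    """Short list of the terms that occur in the most member documents."""
    counts = Counter(t for i in members for t in terms(docs[i]))
    return [term for term, _ in counts.most_common(size)]
```

A cluster that ends up with a single member is a candidate for a document that is out of place in the collection.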

Other applications

Provide an overview of a large document collection.
Identify hidden similarities in documents.
Discover very similar or duplicate documents so that they can be removed from a document collection.
Find documents that are out of place. If a cluster contains only one document, it probably has little in common with the set of documents with which it is stored.

Recognizing a document's features

You receive many electronic documents daily. You open each one, read it, print it, and then use a highlighting pen to mark the significant items, or "features" in the text. Intelligent Miner for Text offers a way to dramatically speed up this process by automatically recognizing the main features in the text.

The tool to use

The Text Analysis tools include a Feature Extraction tool. This tool can recognize significant vocabulary items in text. The process is fully automatic - the vocabulary is not predefined.

The names extraction function can provide valuable clues to the subject of a text. It can locate names in text and determine whether the name is of a person, a place, or an organization - even distinguishing when "England", for example, is the name of the country or of a person. It can recognize names even when they occur in different forms, such as "Robert Jordan" versus "Mr. Jordan" or "American Bar Association" versus "ABA."

A frequency count of all such variants enables you to find the most significant terms in a document - those that characterize it. This can be a quick way to discover what a document is about without actually reading it. You can also use such frequency statistics to find terms to use for searching for similar documents.

The terminology extraction function automatically finds many multiword terms that have a meaning of their own, for example, "laser printer." It can also recognize different forms of the same term, such as "expense account" and "expense accounts."

The abbreviation extraction function finds and links abbreviations and acronyms together with their full forms.
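One common way to link an acronym to its full form can be sketched in Python as follows. The sketch assumes the acronym appears in parentheses directly after the full form and that its letters are the initials of the preceding words; this heuristic and the function name extract_abbreviations are illustrative assumptions, not a description of how the Feature Extraction tool works.

```python
import re


def extract_abbreviations(text: str) -> dict[str, str]:
    """Map each parenthesized acronym to the words whose initials spell it."""
    found: dict[str, str] = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        # Take as many preceding words as the acronym has letters.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(acronym):]
        if len(candidate) == len(acronym) and all(
            word[0].upper() == letter for word, letter in zip(candidate, acronym)
        ):
            found[acronym] = " ".join(candidate)
    return found
```

The heuristic misses full forms that contain function words, such as "and" or "of", which a production extractor would need to handle.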

The relation extraction function finds information of the form "Company_X make furniture", "R. Jordan chief_executive_officer Company_X", "Atlanta location Company_X", and "52 age R. Jordan".

Other extraction functions detect other kinds of significant items, such as dates, numbers, and money amounts.
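For items with a regular surface form, such as dates and money amounts, pattern matching gives a feel for what such extraction produces. The Python sketch below assumes ISO-style dates (YYYY-MM-DD) and amounts prefixed by a currency symbol; both patterns and the function name extract_items are hypothetical simplifications, and plain numbers are left out to keep the example short.

```python
import re

# Assumed formats: a currency symbol followed by digits with optional
# thousands separators and decimals, and dates written as YYYY-MM-DD.
MONEY = re.compile(r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d+)?")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_items(text: str) -> dict[str, list[str]]:
    """Return the money amounts and dates found in the text, in order."""
    return {"money": MONEY.findall(text), "dates": DATE.findall(text)}
```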

Recognizing the language of a document

You receive documents in different languages and send them for translation. You manually open the document using an electronic mail system, scan the text to determine which language the document is written in, and then send the document to a translator. Till now, the process has been slow and expensive.

The tool to use

Another of the Text Analysis tools is the Language Identification tool. It can automatically discover the language in which a document is written. Its accuracy is usually close to 100%, even for short text.

The Language Identification tool uses clues such as high-frequency words and statistics about the distributions of certain character sequences to determine the language.
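The use of character-sequence statistics can be illustrated with a short Python sketch that compares the character trigrams of a text with those of one sample per language. It is a minimal sketch under stated assumptions: the three sample sentences stand in for real training data, the cosine comparison is one possible similarity measure, and the names trigram_profile, SAMPLES and identify_language are hypothetical.

```python
from collections import Counter


def trigram_profile(text: str) -> Counter:
    """Count the overlapping three-character sequences in lowercased text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0


# Tiny illustrative samples; a usable identifier needs far more text.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "is in the house with the people",
    "french": "le renard brun saute par-dessus le chien paresseux et le chat "
              "est dans la maison avec les gens",
    "german": "der schnelle braune fuchs springt über den faulen hund und "
              "die katze ist im haus mit den leuten",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def identify_language(text: str) -> str:
    """Return the sample language whose trigram profile is most similar."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

Frequent short words such as "the", "le" and "der" dominate the trigram counts, which is why the same statistics also capture the high-frequency-word clue.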

Other applications

Automatically organizing collections of indexable data by language.
Restricting search results to documents in a particular language.
You can also train the Language Identification tool for other tasks, such as recognizing DBCS documents, for example, so that they are not indexed.

Searching for text

You are a multinational law office, building a digital library of all the case histories with which your company has been involved. The library is stored on a powerful server. Many of the documents, such as those concerning environmental law, are in different languages.

The documents contain vocabulary specific to the legal profession. You want your employees to be able to search the library using complex and specific queries, in several different languages. You also want to offer them a fuzzy-search capability so that they can locate abbreviated terms.

The tool to use

The Text Search Engine is an advanced search engine, the work of IBM researchers from all over the world. It can index and search documents in many languages, use supporting thesauri, and process both natural-language (free-text) and Boolean queries. With section support, you can define sections in your documents that you can index and search. For example, for your case histories you might have a section called Referenced Cases that can then be searched explicitly.
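Boolean and fuzzy queries can both be pictured as lookups in an inverted index, which maps each term to the documents that contain it. The Python sketch below is an assumed, in-memory illustration and not the Text Search Engine's API; the function names are hypothetical, and fuzzy matching is approximated with the standard library's difflib.get_close_matches.

```python
import re
from collections import defaultdict
from difflib import get_close_matches


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Inverted index: each lowercased term maps to the ids of its documents."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index


def search_and(index: dict[str, set[str]], terms: list[str]) -> set[str]:
    """Boolean AND: documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index: dict[str, set[str]], terms: list[str]) -> set[str]:
    """Boolean OR: documents that contain at least one query term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


def search_fuzzy(index: dict[str, set[str]], term: str, cutoff: float = 0.8) -> set[str]:
    """Documents containing an indexed term that is spelled like the query."""
    matches = get_close_matches(term.lower(), list(index), n=3, cutoff=cutoff)
    return set().union(*(index[m] for m in matches))
```

Section support could be sketched in the same way by keeping one such index per section name.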

The Text Search Engine's real power is its in-depth linguistic analysis of a document text before it is indexed, and of terms in a query before a search. This gives you a search result of high precision and recall, helping you in your goal to find everything, but not too much.
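Precision and recall have standard definitions that a short Python sketch makes explicit: precision is the share of retrieved documents that are relevant ("not too much"), and recall is the share of relevant documents that were retrieved ("find everything"). The function name and the document identifiers used with it are illustrative.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query's result set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if a query returns four documents of which two are relevant, and three relevant documents exist in total, precision is 2/4 and recall is 2/3.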