Table of Contents

1 INTRODUCTION
2 HISTORY
2.1 Early systems
3 NATURAL LANGUAGE PARSING
3.1 Rule-Based Syntactic Parsing
3.2 Terminal Symbols
3.3 Non-terminal symbols
3.4 Production Rules
3.4.1 Grammar
3.4.2 Parse tree
3.4.2.1 Top down
3.4.2.2 Bottom up
3.5 Probabilistic Parsing
3.5.1 Disambiguation
3.5.2 Training
3.5.2.1 Treebank
3.5.2.2 Incremental learning
3.6 Semantic Parsing
3.6.1 Semantic Data Models
3.6.2 Case Based Reasoning
3.6.3 Semantic Representation
3.6.4 Actions of the Parser
4 NLIDB ARCHITECTURE
4.1 Pattern-matching systems
4.2 Parsing based systems
4.2.1 Semantic grammar based parsing
4.2.2 Translation
5 MARKET TEST
5.1 Goals
5.2 Tests
5.3 Results
5.3.1 Impressions
5.3.1.1 Microsoft English Query
5.3.1.2 Elfsoft
5.3.2 Query results
6 FUTURE
6.1 Language challenges
6.2 Portability challenges
6.3 Competing systems
6.4 Possible avenues
6.4.1 Adaptation techniques
6.4.2 Speech-based techniques
6.4.3 Learning algorithms
6.4.3.1 User Dialogue
6.4.3.2 Neural Networks
6.4.3.3 Genetic Algorithms
7 CONCLUSIONS
8 BIBLIOGRAPHY
9 CONTRIBUTIONS

1  INTRODUCTION

The ability to use language to convey different thoughts and feelings differentiates human beings from animals. Natural Language Processing is the capability of a machine to understand human language about a particular topic in its full context, so that implicit assumptions and general knowledge can be inferred. “Thus if the machine is able to achieve this, it has come close to the notion of artificial intelligence itself”[1].

One may find interacting with a person who speaks no English difficult and frustrating; a translator has to come into the picture before the two can communicate. Companies have related this problem to extracting data from a database management system (DBMS) such as MS Access or Oracle. A person with no knowledge of Structured Query Language (SQL) is similarly handicapped in communicating with a database. Companies like Microsoft and Elfsoft (English Language Frontend Software) have therefore applied Natural Language Processing to develop products that let people interact with a database in plain English: the user simply enters queries in English at the natural language database interface. This kind of application is known as a Natural Language Interface to a DataBase (NLIDB).

The system works by combining syntactic knowledge with the knowledge it has been given about the relevant database.[2] It can thereby map the natural language input onto the structure, scope and contents of the database. The program translates the whole query into the standard query language and extracts the relevant information from the database. These products have changed the way information is extracted from databases: the fuss of learning SQL is avoided, and the time that would otherwise be spent learning a query language is saved.

This report examines the performance of each database interface when connected to a standard database; the Northwind database has been chosen as the default database to work on. Several companies offer such products in the market. Our group found several of them, including English Query, Elfsoft, EasyAsk and NLBean, created by Mr Mark Watson. We requested permission from these companies to test their products for our research. We received positive responses from Elfsoft and NLBean, but had to settle for tests on Microsoft English Query and Elfsoft only. We also contacted EasyAsk via email, but the company provided minimal assistance in our research.

In order to draw accurate conclusions about how each piece of software interprets its input, we have drawn up over thirty questions to test the products. Each product is asked the same questions in the same order. The questions have been carefully planned to probe the strengths and weaknesses of each product.

These questions include:

·  Listing specific columns and rows

·  Counting

·  Calculations

·  Cross-referencing across more than one table

·  Ordinal positions

·  Follow-ups

·  Conclusions

·  Semantics

·  Grammar mistakes

·  Spelling mistakes

·  Out-of-context questions

There are three components in a natural language dialog system: analysis, evaluation and generation.[3] The analysis component translates the query entered by the user into a semantic representation expressed in the knowledge representation language. Several exchanges may take place between the natural language access system, the user interface and the user before the required action can be carried out and the result derived. The evaluation component allows information to be absorbed by the dialog system when queries have to be satisfied or when the system needs to alert the user to any major state changes. The generation component gathers the information the user has asked to see in the query, and produces text, graphs, queries or other responses according to the situational context of the query.[4]

The knowledge-based database assistant (KDA) is a practical development of an intelligent database front-end that assists novice users in retrieving the information they want from an unfamiliar database system.[5] This component exists in both Microsoft English Query and Elfsoft. It directs the novice user towards the relevant results by helping him or her enter an accurate query, or by prompting the user when the information entered is insufficient to produce an appropriate answer. Its behaviour in both programs can be seen later in this report.

In addition, “the KDA's responding functionality, which could change the user's knowledge state, is called query guidance”.[6] It can gauge a user’s knowledge of the relevant database by studying the query the user enters. If it senses that the user has limited awareness of the database and cannot retrieve the desired answer, query guidance springs into action: it suggests similar queries that allow the user to gather the appropriate facts from the database, or presents the most relevant query based on the user’s perceived intention. Such a component lets a novice become familiar with the database quickly and learn its scope from the prompt messages and the queries generated by the KDA, without the expense of studying the massive databases stored in most organizations.

2  HISTORY

As the use of databases for data storage spread during the 1970s, the user interface to these systems was a burden for designers worldwide. At the time, neither the relational database model nor the SQL query language was in widespread use, which meant that the task of inserting and querying data was tedious and difficult.

It was therefore a logical step for programmers to attempt to develop more user-friendly and “human” interfaces to the databases. One of these approaches was the use of natural language processing, where the user interactively would be allowed to interrogate the stored information.

2.1  Early systems

The most well-known historical natural language database interface systems are:

·  LUNAR, interfacing a database with information on rocks collected during the American moon expeditions. It was originally published in 1972. When evaluated in 1977, it answered 78% of questions correctly. Based on syntactic parsing, it tended to build several parse trees for the same query, and was deemed inefficient[7] as well as too domain-specific and inflexible.

·  LADDER, the first semantic grammar-based system, interfacing a database with information on US Navy ships.

·  CHAT-80, probably the most famous example. It interfaced a database of world geography facts. The entire application (both the database and the user interface) was developed in Prolog. As the source code was freely distributed, it is still used and cited. An online version is available[8].

3  NATURAL LANGUAGE PARSING

3.1  Rule-Based Syntactic Parsing

Syntax refers to the ways in which words fit together to form higher-level units such as phrases, clauses and sentences. Syntactically driven parsing therefore builds the interpretation of larger groups of words out of the interpretations of their syntactic constituents, i.e. words and phrases. In a way this is the opposite of pattern matching, where the interpretation of the input is done as a whole.

Syntactic analyses are obtained by applying a grammar that determines which sentences are legal in the language being parsed.

Syntactic parsing operates by translating the natural language query into a parse tree, which is then converted into an SQL query. A number of fundamental concepts underpin the theory of syntactic parsing.

3.2  Terminal Symbols

A terminal symbol is the basic building block of the language, i.e. words and delimiters. Together, the set of terminal symbols form the “dictionary of words”[9] recognised by the system, i.e. the range of the vocabulary that it can read and interpret.

3.3  Non-terminal symbols

Non-terminal symbols are higher-level language terms describing concepts and connections in the syntax of the language. Examples of non-terminal symbols include sentence, noun phrase, verb phrase, noun and verb.

3.4  Production Rules

As the query is analysed, a number of production rules fire to identify and classify the context of each word read. By analogy with a production system (such as the one used in PROLOG), a production rule in a context-free grammar[10] rewrites a left-hand non-terminal symbol into a sequence of symbols, each of which may be terminal or non-terminal. Examples of production rules:

·  Sentence := Noun phrase Verb phrase

·  Verb phrase := Verb

These rules are also commonly referred to as rewrite rules.

3.4.1  Grammar

The set of terminal symbols, the set of non-terminal symbols, the production rules and an assigned start symbol (the highest-level construct in the system, usually sentence) together form the grammar of the language. The role of the grammar is to define:

·  What category each word belongs to;

·  What expressions are legal and syntactically correct;

·  How sentences are generated.
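These ingredients can be written down directly in code. The sketch below defines a toy grammar; the symbol names and word lists are invented for illustration and do not come from any particular NLIDB:

```python
# A toy context-free grammar. Non-terminal symbols map to their possible
# right-hand sides; terminal symbols are the words, grouped by lexical category.
GRAMMAR = {
    "S":  [["NP", "VP"]],          # Sentence := Noun phrase Verb phrase
    "NP": [["Det", "N"]],          # Noun phrase := Determiner Noun
    "VP": [["V", "NP"], ["V"]],    # Verb phrase := Verb Noun phrase | Verb
}

LEXICON = {                        # the "dictionary of words" the system recognises
    "Det": {"the", "a"},
    "N":   {"girl", "boy"},
    "V":   {"forgot", "saw"},
}

START_SYMBOL = "S"                 # the assigned start symbol

def is_terminal_category(symbol):
    """Categories that label words directly, as opposed to other symbols."""
    return symbol in LEXICON
```

Strictly speaking the terminal symbols are the words themselves; the lexical categories (Det, N, V) sit between the words and the higher-level non-terminals.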

3.4.2  Parse tree

The system analyses the sentence by reading the terminal symbols in order and identifying which production rule to fire. As it does so, it gradually builds a representation of the sentence referred to as a parse tree. The term has been coined from the tree-like graph that is produced, where the root is the top-level symbol (e.g. sentence), the children of each node are the symbols on the right-hand side of the rule fired at that node, and the leaves are the terminal symbols (the words). The parse tree can be built in two fundamentally different ways.

3.4.2.1  Top down

A top down parser starts at the root and gradually builds the tree downwards by matching the read terminal symbols with symbols on the right-hand side of possible production rules. Terminal or non-terminal symbols on the right hand side are added at the level below the current symbol.

This is similar to the goal-driven approach of a production system. The basic architecture of a top down parser is illustrated in figure 1.

Figure 1 Top down parsing of the sentence "the girl forgot the boy"[11]

In many situations, the first token alone does not provide enough information to decide which production rule should be fired. There are two basic methods for overcoming this.

3.4.2.1.1  Recursive Descent

The system starts by firing the first of the candidate production rules into which the given terminal symbol could fit, and builds the initial sub-tree from this information. If this leads to an inconsistency or syntactic error further down the tree, the parser reverts to the point where the decision was made, removes all the nodes added since, and selects another of the possible productions. The procedure is very similar to depth-first search with backtracking in production systems.
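As a sketch, recursive descent with backtracking fits in a few lines of Python; the toy grammar, lexicon and sentence are invented for illustration:

```python
# Backtracking recursive-descent parser (top-down, depth-first).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {"Det": {"the"}, "N": {"girl", "boy"}, "V": {"forgot"}}

def parse(symbol, tokens, pos):
    """Try to derive `symbol` from tokens[pos:]; return (tree, new_pos) or None."""
    if symbol in LEXICON:                      # lexical category: match one word
        if pos < len(tokens) and tokens[pos] in LEXICON[symbol]:
            return (symbol, tokens[pos]), pos + 1
        return None
    for production in GRAMMAR[symbol]:         # try each candidate rule in order
        children, p = [], pos
        for rhs_symbol in production:
            result = parse(rhs_symbol, tokens, p)
            if result is None:                 # dead end: backtrack, try next rule
                break
            subtree, p = result
            children.append(subtree)
        else:                                  # the whole right-hand side matched
            return (symbol, children), p
    return None                                # no production fits

result = parse("S", "the girl forgot the boy".split(), 0)
```

With the shorter input "the girl forgot", the parser first commits to VP := V NP, fails to find a noun phrase after the verb, backtracks (discarding the partial sub-tree) and succeeds with VP := V.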

3.4.2.1.2  Look Ahead

A look-ahead parser is not content with reading just one token. Rather, it reads as many tokens as are necessary to identify the intended right-hand side beyond any ambiguity before firing a production rule.

Grammars are characterised by the maximum number of terminal symbols that must be read before all possible conflicts in the choice of production rule can be resolved. If this number is k, the grammar is referred to as an LL(k) grammar[12]. The look-ahead procedure is more analogous to a breadth-first search technique.
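A minimal illustration of the idea, assuming a hypothetical grammar fragment in which the conflict between VP := V NP and VP := V can be resolved with a single token of look-ahead (i.e. an LL(1) situation):

```python
# Peek one token past the verb to choose a production without backtracking.
# The word list is invented for illustration.
DETERMINERS = {"the", "a"}

def choose_vp_rule(tokens, verb_pos):
    """Decide between VP := V NP (a determiner follows the verb) and
    VP := V (nothing usable follows) by looking one token ahead."""
    next_token = tokens[verb_pos + 1] if verb_pos + 1 < len(tokens) else None
    return "VP := V NP" if next_token in DETERMINERS else "VP := V"
```

A grammar needing two tokens of look-ahead to settle such a conflict would, by the definition above, be LL(2).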

3.4.2.2  Bottom up

A bottom up parser, on the other hand, works from the leaves upward by “tagging” the tokens, i.e. starting from the right-hand sides of the production rules and associating each word read with its category. When a full right-hand side has been identified, the production rule fires and the left-hand side non-terminal symbol is added as a branch in the level above. This methodology corresponds to the data-driven technique of production systems. The bottom up parsing technique is illustrated in figure 2.

Figure 2 Bottom up parsing of the sentence "the girl forgot the boy"[13]
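The tag-shift-reduce cycle just described can be sketched as follows; the rules and lexicon are invented for illustration, and conflicts are resolved naively by always reducing as soon as a right-hand side appears on the stack:

```python
# Bottom-up (shift-reduce) parsing: tag each word, push it onto a stack,
# and reduce whenever the top of the stack matches a rule's right-hand side.
RULES = [
    ("NP", ["Det", "N"]),
    ("VP", ["V", "NP"]),
    ("S",  ["NP", "VP"]),
]
LEXICON = {"the": "Det", "girl": "N", "boy": "N", "forgot": "V"}

def shift_reduce(tokens):
    stack = []
    for word in tokens:
        stack.append(LEXICON[word])            # shift: tag the word, push it
        reduced = True
        while reduced:                         # reduce for as long as a rule fits
            reduced = False
            for lhs, rhs in RULES:
                if stack[-len(rhs):] == rhs:
                    del stack[-len(rhs):]      # replace the right-hand side...
                    stack.append(lhs)          # ...with the left-hand side symbol
                    reduced = True
                    break
    return stack                               # ["S"] on a successful parse
```

The eager reduction strategy works here only because the toy grammar has no conflicts; a real parser must decide when to shift and when to reduce, which is exactly the kind of choice discussed next.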

In some cases the sentence is inherently ambiguous and multiple production rules match it, in which case the parser has to choose between the potential interpretations. One strategy for dealing with these situations is referred to as probabilistic parsing.

3.5  Probabilistic Parsing

Probabilistic parsing takes an empirical approach to the difficult task of disambiguation, i.e. identifying which of several mutually exclusive alternative syntactic parse trees should be generated.

For example, consider the sentence “One morning I shot an elephant in my pyjamas”[14]. There are two possible syntactic parses for this sentence[15]. One implies that the person was wearing the pyjamas, while the other claims that the elephant was in the pyjamas (hence the joke). Although the choice between these two interpretations is obvious to a human, how can this knowledge be automated in a computer?
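In probabilistic parsing, each production carries a probability and a parse tree's score is the product of the probabilities of the rules it uses; the parser keeps the highest-scoring tree. The sketch below uses invented numbers and deliberately simplified rule lists for the two competing attachments:

```python
from math import prod

# Invented rule probabilities; a real system estimates them from a treebank
# by counting how often each production is used.
RULE_PROB = {
    ("S",  ("NP", "VP")):      1.0,
    ("VP", ("V", "NP", "PP")): 0.3,   # prepositional phrase modifies the verb
    ("VP", ("V", "NP")):       0.7,
    ("NP", ("NP", "PP")):      0.2,   # prepositional phrase modifies the noun
    ("NP", ("Det", "N")):      0.8,
}

def tree_probability(rules_used):
    """A parse tree's probability: the product over the productions it uses."""
    return prod(RULE_PROB[rule] for rule in rules_used)

# Simplified rule lists for the two parses of "I shot an elephant in my pyjamas":
verb_attach = [("S", ("NP", "VP")), ("VP", ("V", "NP", "PP")), ("NP", ("Det", "N"))]
noun_attach = [("S", ("NP", "VP")), ("VP", ("V", "NP")),
               ("NP", ("NP", "PP")), ("NP", ("Det", "N"))]
best = max([verb_attach, noun_attach], key=tree_probability)
```

With these numbers the verb attachment scores 1.0 × 0.3 × 0.8 = 0.24 against 1.0 × 0.7 × 0.2 × 0.8 = 0.112 for the noun attachment, so the parser picks the reading in which the speaker wears the pyjamas.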

One option, used in so-called attribute grammars, is to encode information about each verb as parameters to the production rules. However, as the dictionary grows, this approach scales poorly, since every distinct case must be specifically added to the production rules.